客户在比较繁忙的时刻执行了索引的REBUILD操作,导致大量会话被锁,最终对ONLINE REBUILD操作执行了取消操作,引发了一系列更严重的ORA-600错误。
登录数据库后,发现两个节点上ORACLE_BASE所在目录都已经100%占用。数据库无法正常通过/ AS SYSDBA方式登录。
查询告警日志发现大量的ORA-600和ORA-7445错误:
Tue May 08 21:20:26 EAT 2012
Errors in file /oracle/app/admin/orcl/udump/orcl2_ora_1555.trc:
ORA-07445: exception encountered: core dump [0000000000000000] [SIGSEGV] [Invalid permissions for mapped object] [0x000000000] [] []
Tue May 08 21:20:27 EAT 2012
Errors in file /oracle/app/admin/orcl/udump/orcl2_ora_3891.trc:
ORA-00600: internal error code, arguments: [kghstack_underflow_internal_3], [0x60000000003002F0], [keybuf], [], [], [], [], []
Tue May 08 21:20:27 EAT 2012
Errors in file /oracle/app/admin/orcl/udump/orcl2_ora_26190.trc:
ORA-07445: exception encountered: core dump [0000000000000000] [SIGSEGV] [Invalid permissions for mapped object] [0x000000000] [] []
Tue May 08 21:20:27 EAT 2012
Errors in file /oracle/app/admin/orcl/udump/orcl2_ora_2873.trc:
ORA-07445: exception encountered: core dump [0000000000000000] [SIGSEGV] [Invalid permissions for mapped object] [0x000000000] [] []
Tue May 08 21:20:27 EAT 2012
Errors in file /oracle/app/admin/orcl/udump/orcl2_ora_4518.trc:
ORA-00600: internal error code, arguments: [kghstack_underflow_internal_3], [0x60000000003002F0], [keybuf], [], [], [], [], []
Tue May 08 21:20:27 EAT 2012
Errors in file /oracle/app/admin/orcl/udump/orcl2_ora_22469.trc:
ORA-00600: internal error code, arguments: [kghstack_underflow_internal_3], [0x60000000003002F0], [keybuf], [], [], [], [], []
Tue May 08 21:20:28 EAT 2012
Errors in file /oracle/app/admin/orcl/udump/orcl2_ora_26440.trc:
ORA-07445: exception encountered: core dump [0000000000000000] [SIGSEGV] [Invalid permissions for mapped object] [0x000000000] [] []
Tue May 08 21:20:29 EAT 2012
Errors in file /oracle/app/admin/orcl/udump/orcl2_ora_762.trc:
ORA-07445: exception encountered: core dump [0000000000000000] [SIGSEGV] [Invalid permissions for mapped object] [0x000000000] [] []
Tue May 08 21:20:29 EAT 2012
Errors in file /oracle/app/admin/orcl/udump/orcl2_ora_26106.trc:
ORA-07445: exception encountered: core dump [0000000000000000] [SIGSEGV] [Invalid permissions for mapped object] [0x000000000] [] []
Tue May 08 21:20:30 EAT 2012
Errors in file /oracle/app/admin/orcl/udump/orcl2_ora_1597.trc:
ORA-07445:出现异常错误:核心转储[0000000000000000] [SIGSEGV] [Invalid permissions for mapped object] [0x000000000] [] []
Tue May 08 21:20:30 EAT 2012
Errors in file /oracle/app/admin/orcl/udump/orcl2_ora_856.trc:
ORA-07445: exception encountered: core dump [0000000000000000] [SIGSEGV] [Invalid permissions for mapped object] [0x000000000] [] []
.
.
.
检查对应的TRACE文件,发现导致错误语句执行的是TABLE_A表的插入:
ksedmp: internal or fatal error tb
ORA-00600:内部错误代码,参数: [kghstack_underflow_internal_3], [0x60000000003002F0], [keybuf], [], [], [], [], []
Current SQL statement for this session:
INSERT INTO TABLE_A (O_ID, P_ID, S_ID, F_ITEM, F_NAME, T_MON, D_MON, I_MON, P_STATE, P_TIME, R1, R2) VALUES (:B10 , SUBSTR(:B10 , LENGTH(:B10 ) - 1, 2), :B9 , :B8 , :B7 , :B6 , :B5 , :B4 , :B3 , NULL, :B2 , :B1 )
----- PL/SQL Call Stack -----
object line object
handle number name
c00000203c2cc550 119 package body U1.P_O_I
c00000203b788200 288 procedure U1.U_B_O_I
c00000203719b8d0 1 anonymous block
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst()+64 call ksedst1() 000000000 ? 000000001 ?
ksedmp()+2176 call ksedst() 000000000 ?
C000000000000D20 ?
4000000004037940 ?
000000000 ? 000000000 ?
000000000 ?
ksfdmp()+112 call ksedmp() 000000003 ?
9FFFFFFFFFFF1230 ?
60000000000BA290 ?
9FFFFFFFFFFF1800 ?
C000000000000999 ?
400000000407F9B0 ?
kgerinv()+304 call ksfdmp() 9FFFFFFFFFFF1D90 ?
000000003 ?
9FFFFFFFFFFF1810 ?
60000000000BA290 ?
C000000000000612 ?
40000000098C38B0 ?
kgeasnmierr()+144 call kgerinv() 60000000000318D0 ?
4000000001AD98A0 ?
6000000000032988 ?
4000000001AD98A0 ?
9FFFFFFFFFFF1DD0 ?
$cold_kghstack_unde call kgeasnmierr() 60000000000318D0 ?
rflow_internal()+36 9FFFFFFFBF3B1168 ?
8 9FFFFFFFBF3B1178 ?
6000000000032D00 ?
000000002 ?
60000000003002F0 ?
000000001 ? 000000006 ?
kghstack_free()+208 call $cold_kghstack_unde 60000000000318D0 ?
rflow_internal() 60000000003002F0 ?
60000000000BA290 ?
C000000000000B1D ?
4000000002F7A510 ?
00002C87B ?
6000000000031A10 ?
ksmfrs()+48 call kghstack_free() 60000000000318D0 ?
60000000003002F0 ?
rpiswu2()+1312 call ksmfrs() 60000000003002F0 ?
C000000000001026 ?
4000000002F78960 ?
000000000 ? 000000000 ?
000000000 ?
rpidrv()+2352 call rpiswu2() 9FFFFFFFFFFF2AF0 ?
4000000002F7AE60 ?
00002F833 ?
60000000000A7D20 ?
9FFFFFFFFFFF1E20 ?
C000000000001ABD ?
4000000000F14558 ?
60000000000C6CF0 ?
psddr0()+864 call rpidrv() 000000018 ? 000000066 ?
9FFFFFFFFFFF3700 ?
000000038 ?
9FFFFFFFFFFF2B20 ?
60000000000BA290 ?
psdnal()+736 call psddr0() 000000018 ? 000000066 ?
9FFFFFFFFFFF3700 ?
000000030 ?
pevm_EXECC()+832 call psdnal() 9FFFFFFFFFFF54D0 ?
C00000203489A9F8 ?
C0000000000011AA ?
9FFFFFFFBE832840 ?
C00000203C2CC550 ?
400000000313C770 ?
000026035 ?
pfrinstr_EXECC()+16 call pevm_EXECC() 9FFFFFFFBE83D1D0 ?
0 9FFFFFFFBE8328B0 ?
000000020 ?
pfrrun_no_tool()+19 call pfrinstr_EXECC() 9FFFFFFFBE83D1D0 ?
2 C000001DA198D61C ?
9FFFFFFFBE83D238 ?
pfrrun()+1376 call pfrrun_no_tool() 9FFFFFFFBE83D1D0 ?
000002001 ?
9FFFFFFFBE83D238 ?
60000000000BA290 ?
C000000000000A1C ?
4000000003163040 ?
9FFFFFFFBE83D620 ?
9FFFFFFFBE83D298 ?
plsql_run()+1328 call pfrrun() 9FFFFFFFFFFF3788 ?
9FFFFFFFFFFF3770 ?
60000000000BA290 ?
9FFFFFFFFFFF4370 ?
9FFFFFFFFFFF4370 ?
C000000000000E23 ?
4000000002C34D70 ?
peicnt()+560 call plsql_run() 9FFFFFFFFFFF4380 ?
000000001 ? 000000000 ?
9FFFFFFFFFFF4380 ?
60000000000BA290 ?
9FFFFFFFFFFF4E90 ?
kkxexe()+1008 call peicnt() 9FFFFFFFFFFF54D0 ?
9FFFFFFFBE83D1D0 ?
9FFFFFFFFFFF4EA0 ?
60000000000BA290 ?
9FFFFFFFFFFF5420 ?
C000000000000A1C ?
000000020 ?
9FFFFFFFFFFF4F00 ?
opiexe()+8016 call kkxexe() 0000051F0 ?
.
.
.
这个SQL语句是正常的业务操作,而这种简单的INSERT都会导致ORA-600和ORA-7445错误,说明插入的表或索引本身存在问题。
检查表和索引的状态,发现索引部分分区状态异常:
CREATE INDEX "U1"."IDX_F_2"
ON "U1"."TABLE_A" ("S_ID")
PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING
STORAGE(
BUFFER_POOL DEFAULT) LOCAL
(PARTITION "P1"
PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "TBS7" ,
PARTITION "P2"
PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "TBS8" ,
PARTITION "P3"
PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "TBS9" ,
PARTITION "P4"
PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "TBS10" ,
PARTITION "P5"
PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "TBS1" ,
PARTITION "P6"
PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "TBS2" ,
PARTITION "P7"
PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "TBS3" ,
PARTITION "P8"
PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "TBS4" ,
PARTITION "P9"
PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "TBS5" ,
PARTITION "P10"
PCTFREE 10 INITRANS 2 MAXTRANS 255 NOLOGGING
STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT)
TABLESPACE "TBS6" )
ALTER INDEX "U1"."IDX_F_2" MODIFY PARTITION "P6" UNUSABLE
ALTER INDEX "U1"."IDX_F_2" MODIFY PARTITION "P7" UNUSABLE
和客户沟通后,确认今天问题发生之前有人对索引进行维护操作,操作语句包括:
alter index u1.IDX_F_2 rebuild partition p6 parallel
alter index U1.IDX_T_2 rebuild partition P9 nologging parallel online
alter index U1.IDX_T_1 rebuild partition P3 nologging parallel online
alter index U1.IDX_F_2 rebuild partition P1 nologging parallel online
alter index U1.IDX_T_4 rebuild partition P1 nologging parallel online
alter index U1.IDX_F_4 rebuild partition P5 nologging parallel online
在对索引进行维护之后,数据库中出现了大量的enq: TM – contention等待,TM锁出现的原和是索引IDX_F_2的PARTITION P6重建语句没有添加ONLINE参数有直接关系。随后大量会话被阻塞,而整个索引重建操作被人为中止,其中包括正在运行的ONLINE REBUILD操作,而ONLINE REBUILD操作被中止是十分危险的,很容易导致bug的产生,而当前就导致了ORA-600和ORA-7445的产生。
由于客户着急解决问题,因此对数据库进行了重启。重启后,ORA-600错误信息不再出现。但是前台应用报错,索引失效。
SELECT 'ALTER INDEX U1.' || INDEX_NAME || ' REBUILD PARTITION ' || PARTITION_NAME || ';'
FROM DBA_IND_PARTITIONS
WHERE INDEX_OWNER = 'U1'
AND STATUS != 'USABLE';
获取所有失效的索引重建语句,将索引重新编译后,数据库恢复正常。
在Oracle的MOS上没有找到与之最相关的bug信息,反而是找到了一个相关性很高的补丁信息,而对应的bug信息并没有公开:补丁程序13737888: ONLINE DDL:ORA-600[KGHSTACK_UNDERFLOW_INTERNAL_3], [0X2B7F4E1E7678], [KEYBUF]。