tbwshc

ORA-600 (kfnsBackground03) Error

A customer's database hit an ORA-600 (kfnsBackground03) error.

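The database is 10.2.0.3 RAC on HP-UX 11.23. This error can appear in both the ASM instance and the database instance. When it occurs in the ASM instance, the ASM instance does not crash, but when it occurs in the database instance, the database instance is forcibly shut down: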

Tue May 15 10:28:05 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:32:50 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:33:05 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:34:44 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:43:05 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:46:13 2012
Errors in file /u01/app/oracle/admin/+ASM/udump/+asm1_ora_18846.trc:
ORA-00600: internal error code, arguments: [kfnsBackground03], [], [], [], [], [], [], []
Tue May 15 10:46:14 2012
Trace dumping is performing id=[cdmp_20120515104614]

 

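The excerpt above is from the ASM instance's alert log. Below are the errors reported by the database instance over the same period: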

Tue May 15 10:38:12 2012
kkjcre1p: unable to spawn jobq slave process
Tue May 15 10:38:12 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_cjq0_17957.trc:

Tue May 15 10:42:19 2012
PMON failed to acquire latch, see PMON dump
Tue May 15 10:43:04 2012
found dead shared server 'S006', pid = (90, 4)
Tue May 15 10:43:10 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_j000_19938.trc:
ORA-12012: error on auto execute of job 42579
ORA-27468: "EXFSYS.RLM$EVTCLEANUP" is locked by another process
Tue May 15 10:45:06 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_j002_23628.trc:
ORA-12012: error on auto execute of job 8888975
ORA-27468: "ORCL.P_DATA_C" is locked by another process
Tue May 15 10:45:10 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_j003_23959.trc:
ORA-12012: error on auto execute of job 8855572
ORA-27468: "ORCL.P_DATA" is locked by another process
Tue May 15 10:46:14 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_asmb_18844.trc:
ORA-15064: communication failure with ASM instance
ORA-00600: internal error code, arguments: [kfnsBackground03], [], [], [], [], [], [], []
Tue May 15 10:46:14 2012
ASMB: terminating instance due to error 15064
Tue May 15 10:46:15 2012
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/admin/ORCL/bdump/orcl1_diag_17903.trc
Tue May 15 10:46:16 2012
Shutting down instance (abort)
License high water mark = 52

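Judging from this particular instance crash, the problem appears to be related to resource exhaustion on the host. Before the failure, the database instance had already reported "kkjcre1p: unable to spawn jobq slave process" and "PMON failed to acquire latch".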
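To line up the warning signs against the moment of the crash, the alert log can be split into timestamped entries. The following is a minimal Python sketch, not an Oracle tool. The function names parse_alert_log and precursors and the 10-minute window are my own assumptions, and the sample text is abridged from the database alert log above.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the 10g alert log lines quoted in this post
TS_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_alert_log(text):
    """Group alert-log lines under the timestamp line that precedes them."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        try:
            ts = datetime.strptime(line, TS_FORMAT)
        except ValueError:
            # Not a timestamp: attach the message to the current entry
            if current is not None:
                current[1].append(line)
            continue
        current = (ts, [])
        entries.append(current)
    return entries


def precursors(entries, marker="terminating instance", window_minutes=10):
    """Return (timestamp, line) pairs logged in the window before the first marker line."""
    crash_ts = next(
        (ts for ts, lines in entries if any(marker in l for l in lines)), None
    )
    if crash_ts is None:
        return []
    start = crash_ts - timedelta(minutes=window_minutes)
    return [
        (ts, l) for ts, lines in entries if start <= ts < crash_ts for l in lines
    ]


sample = """Tue May 15 10:38:12 2012
kkjcre1p: unable to spawn jobq slave process
Tue May 15 10:42:19 2012
PMON failed to acquire latch, see PMON dump
Tue May 15 10:46:14 2012
ASMB: terminating instance due to error 15064
"""

for ts, line in precursors(parse_alert_log(sample)):
    print(ts, line)
```

Running the same function over the May 26 excerpt would return nothing before the ASMB termination, which matches the observation that not every occurrence shows signs of resource shortage.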

However, on other occasions when this error occurred, there was no clear indication of a resource shortage:

Sat May 26 09:47:49 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Sat May 26 09:49:44 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Sat May 26 09:52:23 2012
Errors in file /u01/app/oracle/admin/+ASM/udump/+asm1_ora_21722.trc:
ORA-00600: internal error code, arguments: [kfnsBackground03], [], [], [], [], [], [], []
Sat May 26 09:52:25 2012
Trace dumping is performing id=[cdmp_20120526095225]

The database alert log for the same moment shows:

Sat May 26 09:52:24 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_asmb_21720.trc:
ORA-15064: communication failure with ASM instance
ORA-00600: internal error code, arguments: [kfnsBackground03], [], [], [], [], [], [], []
Sat May 26 09:52:24 2012
ASMB: terminating instance due to error 15064
Sat May 26 09:52:25 2012
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/admin/ORCL/bdump/orcl1_diag_20837.trc
Sat May 26 09:52:26 2012
Shutting down instance (abort)
License high water mark = 46
Sat May 26 09:52:30 2012
Instance terminated by ASMB, pid = 21720
Sat May 26 09:52:31 2012
Instance terminated by USER, pid = 536

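On this occasion there was no other information at all; the database instance simply went down. However, every time the error occurs, the ASM instance's alert log contains the message: NOTE: database ORCL1:ORCL failed during msg 19, reply 2. This indicates a communication problem between the ASM instance and the database. kfnsBackground stands for Kernel Files Network Service Background. MSG 19 refers to IOSTAT and reply 2 means TIMEOUT, which indicates that a timeout occurred while ASM was performing I/O, causing the ASM error and then the instance crash.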
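Since these NOTE lines appear before every occurrence, it can help to pull them out of the ASM alert log and label them. Below is a small Python sketch. The code-to-name tables contain only the two meanings given in this post (msg 19 = IOSTAT, reply 2 = TIMEOUT); they are not taken from Oracle documentation, and the function name decode_notes is hypothetical.

```python
import re

# Matches lines such as:
#   NOTE: database ORCL1:ORCL failed during msg 19, reply 2
NOTE_RE = re.compile(r"NOTE: database (\S+) failed during msg (\d+), reply (\d+)")

# Meanings as stated in this post only; other codes are left undecoded
MSG_NAMES = {19: "IOSTAT"}
REPLY_NAMES = {2: "TIMEOUT"}


def decode_notes(log_text):
    """Return (database, message name, reply name) for each matching NOTE line."""
    results = []
    for line in log_text.splitlines():
        m = NOTE_RE.search(line)
        if m is None:
            continue
        db = m.group(1)
        msg = int(m.group(2))
        reply = int(m.group(3))
        results.append(
            (
                db,
                MSG_NAMES.get(msg, f"msg {msg}"),
                REPLY_NAMES.get(reply, f"reply {reply}"),
            )
        )
    return results


asm_log = """Sat May 26 09:47:49 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Sat May 26 09:49:44 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
"""

for db, msg_name, reply_name in decode_notes(asm_log):
    print(db, msg_name, reply_name)
```

A monitoring script could count these lines per hour and raise an alert once they start to appear, since in both excerpts they show up several minutes before the ORA-600.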

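This error is relatively rare. In all of MetaLink, only three articles relate to it. Two of them describe cases where the archive log destination ran out of space and the system hung, which eventually led to an I/O timeout and this error; the third gives no further information. The three cases involve 10.2.0.4 for AIX, 10.2.0.4 for Solaris and 10.2.0.3 for HP-UX, which suggests that the error is not platform-specific but is concentrated in versions 10.2.0.3 and 10.2.0.4.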

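Based on the analysis above, an operating system monitoring tool should be deployed so that resource usage can be observed at any time and used to support the analysis when a similar error occurs. Since there is no record of this problem occurring in 10.2.0.5, upgrading the database to 10.2.0.5 may avoid it.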
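As a rough illustration of the kind of monitoring meant here, the Python sketch below appends the 1-minute load average and the process count to a file at a fixed interval. It is only a stand-in for a proper monitoring tool. The function names, the output path and the CSV layout are all assumptions, and it relies on os.getloadavg() and a "ps -e" command being available on the host.

```python
import os
import subprocess
import time


def count_processes():
    """Count running processes using `ps -e` (header line excluded)."""
    out = subprocess.run(
        ["ps", "-e"], capture_output=True, text=True, check=True
    ).stdout
    return max(len(out.splitlines()) - 1, 0)


def take_sample():
    """Collect one snapshot of basic host load indicators."""
    load1, _load5, _load15 = os.getloadavg()
    return {
        "time": time.strftime("%Y-%m-%d %H:%M:%S"),
        "load1": load1,
        "processes": count_processes(),
    }


def format_sample(sample):
    """Render a snapshot as one CSV line: time,load1,processes."""
    return f"{sample['time']},{sample['load1']:.2f},{sample['processes']}"


def run(path, interval_seconds=60, samples=60):
    """Append `samples` snapshots to `path`, one every `interval_seconds`."""
    with open(path, "a") as f:
        for i in range(samples):
            f.write(format_sample(take_sample()) + "\n")
            f.flush()  # keep data on disk even if the host hangs later
            if i < samples - 1:
                time.sleep(interval_seconds)
```

For example, run("/tmp/os_samples.csv", interval_seconds=30, samples=120) would record one hour of data (the path is hypothetical). The timestamps can then be compared with the alert log entries around the time of an ORA-600.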

posted on 2012-08-24 14:40 chen11-1

