#
Oracle10g的DBA_TAB_STATISTICS视图的STATTYPE_LOCKED列没有正确的显示结果。
看一个简单的例子:
SQL> select * from v$version;
BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bi
PL/SQL Release 10.2.0.5.0 - Production
CORE 10.2.0.5.0 Production
TNS for Linux: Version 10.2.0.5.0 - Production
NLSRTL Version 10.2.0.5.0 - Production
SQL> create table t_part (id number, name varchar2(30))
2 partition by range (id)
3 (partition p1 values less than (10),
4 partition p2 values less than (20),
5 partition pmax values less than (maxvalue));
Table created.
SQL> select table_name, partition_name, stattype_locked from tb all_tab_statistics where wner = user and table_name = 'T_PART';
TABLE_NAME PARTITION_NAME STATT
------------------------------ ------------------------------ -----
T_PART
T_PART P1
T_PART P2
T_PART PMAX
SQL> exec dbms_stats.lock_partition_stats(user, 'T_PART', 'P1')
PL/SQL procedure successfully completed.
SQL> select table_name, partition_name, stattype_locked from all_tab_statistics where wner = user and table_name = 'T_PART';
TABLE_NAME PARTITION_NAME STATT
------------------------------ ------------------------------ -----
T_PART
T_PART P1
T_PART P2
T_PART PMAX
SQL> exec dbms_stats.gather_table_stats(user, 'T_PART')
PL/SQL procedure successfully completed.
SQL> select table_name, partition_name, last_analyzed, stattype_locked from all_tab_statistics where wner = user and table_name = 'T_PART';
TABLE_NAME PARTITION_NAME LAST_ANAL STATT
------------------------------ ------------------------------ --------- -----
T_PART 16-JUL-12
T_PART P1
T_PART P2 16-JUL-12
T_PART PMAX 16-JUL-12
可以看到在10.2环境中,LOCK_PARTITION_STATS过程是正常工作的,但是DBA_TAB_STATISTICS视图的STATTYPE_LOCKED列并没有正确的显示分区被锁定的结果。
而对于表来说,LOCK_TABLE_STATS过程执行后,STATTYPE_LOCKED的结果显示是正常的:
SQL> exec dbms_stats.lock_table_stats(user, 'T_PART')
PL/SQL procedure successfully completed.
SQL> select table_name, partition_name, last_analyzed, stattype_locked from all_tab_statistics where wner = user and table_name = 'T_PART';
TABLE_NAME PARTITION_NAME LAST_ANAL STATT
------------------------------ ------------------------------ --------- -----
T_PART 16-JUL-12 ALL
T_PART P1 ALL
T_PART P2 16-JUL-12 ALL
T_PART PMAX 16-JUL-12 ALL
这说明在10.2中,Oracle对于分区列的锁定的支持是存在问题的。查询了一下MOS,Oracle将这个问题确认为内部BUG:7240460,这个问题在11.1.0.7中被FIXED。
而在11.2中,这个问题以及不存在了:
SQL> select * from v$version;
BANNER
----------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE 11.2.0.3.0 Production
TNS for Solaris: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production
SQL> select owner, table_name, partition_name, stattype_locked
2 from dba_tab_statistics
3 where wner = 'TEST'
4 and table_name = 'T_PART';
OWNER TABLE_NAME PARTITION_NAME STATT
---------- ------------ --------------- -----
TEST T_PART
TEST T_PART P2
TEST T_PART P3
TEST T_PART P4
TEST T_PART P5
TEST T_PART PMAX
6 rows selected.
SQL> exec dbms_stats.lock_partition_stats('TEST', 'T_PART', 'P2')
PL/SQL procedure successfully completed.
SQL> select owner, table_name, partition_name, stattype_locked
2 from dba_tab_statistics
3 where wner = 'TEST'
4 and table_name = 'T_PART';
OWNER TABLE_NAME PARTITION_NAME STATT
---------- ------------ --------------- -----
TEST T_PART
TEST T_PART P2 ALL
TEST T_PART P3
TEST T_PART P4
TEST T_PART P5
TEST T_PART PMAX
6 rows selected
客户数据库出现ORA-4031错误,随后出现了大量的ORA-1003和ORA-604错误。
数据库版本为10.2.0.3 RAC for HP-UX,详细的报错信息为:
Mon Jul 16 15:30:30 2012
Errors in file /u01/app/oracle/admin/ORCL/udump/orcl2_ora_2389.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","select name,online$,contents...","sql area","tmp")
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","select name,online$,contents...","sql area","tmp")
Mon Jul 16 15:30:32 2012
Errors in file /u01/app/oracle/admin/ORCL/udump/orcl2_ora_2878.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 144 bytes of shared memory ("shared pool","select name,online$,contents...","Typecheck","coldef: qcopCreateCol")
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 144 bytes of shared memory ("shared pool","select name,online$,contents...","Typecheck","coldef: qcopCreateCol")
Mon Jul 16 15:30:32 2012
Errors in file /u01/app/oracle/admin/ORCL/udump/orcl2_ora_10030.trc:
ORA-00603: ORACLE server session terminated by fatal error tb
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 144 bytes of shared memory ("shared pool","select name,online$,contents...","Typecheck","coldef: qcopCreateCol")
ORA-00604: error occurred at recursive SQL level 1
ORA-04031: unable to allocate 144 bytes of shared memory ("shared pool","select name,online$,contents...","Typecheck","coldef: qcopCreateCol")
Mon Jul 16 15:30:39 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl2_cjq0_29269.trc:
ORA-00604: error occurred at recursive SQL level 2
ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","select o.owner#,o.name,o.nam...","sql area","kobjn : kkdcchs")
Mon Jul 16 15:30:40 2012
Trace dumping is performing id=[cdmp_20120716153040]
Mon Jul 16 15:32:19 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl2_cjq0_29269.trc:
ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
Mon Jul 16 15:33:59 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl2_cjq0_29269.trc:
ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
Mon Jul 16 15:35:39 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl2_cjq0_29269.trc:
ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
Mon Jul 16 15:37:19 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl2_cjq0_29269.trc:
ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
Mon Jul 16 15:39:00 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl2_cjq0_29269.trc:
ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
Mon Jul 16 15:40:40 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl2_cjq0_29269.trc:
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQ
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: err
Mon Jul 16 15:41:59 2012
Thread 2 advanced to log sequence 61522
Current log# 7 seq# 61522 mem# 0: +ORCL_CTL/orcl/onlinelog/group_7.263.611598065
Mon Jul 16 15:42:20 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl2_cjq0_29269.trc:
ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
Mon Jul 16 15:44:00 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl2_cjq0_29269.trc:
ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
Mon Jul 16 15:45:40 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl2_cjq0_29269.trc:
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQ
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: error occurred at recursive SQL level 2
ORA-01003: no statement parsed
ORA-00604: error occurred at recursive SQL level ORA-00604: err
可以看得,开始还是比较“正常”的ORA-4031错误,但是随着CJQ进程出现了ORA-4031的错误,数据库开始每1分40秒报一次ORA-1003和ORA-604的错误。这里CJQ进程似乎陷入了一个死循环中,持续不停的报错。查询MOS发现,在9.2中Oracle存在类似的BUG:Bug 3289063 ORA-1003 every 5 seconds after CJQ hits ORA-4031。这个bug已经在9.2.0.5和10.1.0.2中被FIXED,但是这个bug的现象和当前的错误完全一致,从这点上看,这个问题显然在10.2.0.3中仍然存在。
Oracle给出的解决方案也很简单,重启CJQ进程既可。除了在操作系统去kill -9杀掉cjq0对应的进程之外,直接通过alter system set job_queue_processes=0,然后在设置回原值,也有可能解决该问题。此外,尽量避免ORA-4031错误的产生是最根本的问题解决之道。
可以利用tar命令向磁盘备份文件,同时可以利用tar从磁带中恢复指定文件。
备份的命令为:
tar cvf /dev/rmt/0m filename…
恢复的命令为:
tar xvf /dev/rmt/0m filename…
从磁带上恢复文件的例子:
> tar xvf /dev/rmt/0m /archive/orcl_exp_report_1.dmp /archive/orcl_exp_report_2.dmp
x /archive/temp_exp/acdb_exp_report_1.dmp, 3198353408 bytes, 6246784 tape blocks
x /archive/temp_exp/acdb_exp_report_2.dmp, 3078950912 bytes, 6013576 tape blocks
tb恢复时指定的文件名必须是备份到磁带的文件名。
当用户密码即将过期时,在登录时Oracle会提示ORA-28002错误,但是并不会影响正常的登录。
本来认为这个信息并没有太大的影响,但是没想到这个错误会导致SET AUTOTRACE功能失效:
solaris*orcl-/home/oracle$ sqlplus test/test
SQL*Plus: Release 11.2.0.3.0 Production on Fri Jul 13 11:27:28 2012
Copyright (c) 1982, 2011, Oracle. tb All rights reserved.
ERROR:
ORA-28002: the password will expire within 1 days
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Oracle Label Security and Real Application Testing options
SQL> set autot on
ERROR:
ORA-28002: the password will expire within 1 days
SP2-0619: Error while connecting
SP2-0611: Error enabling STATISTICS report
SQL> alter user test identified by test;
User altered.
SQL> set autot on
ERROR:
ORA-24315: illegal attribute type
SP2-0619: Error while connecting
SP2-0611: Error enabling STATISTICS report
SQL> exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Oracle Label Security and Real Application Testing options
solaris*orcl-/home/oracle$ sqlplus test/test
SQL*Plus: Release 11.2.0.3.0 Production on Fri Jul 13 11:27:52 2012
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Oracle Label Security and Real Application Testing options
SQL> set autot on
SQL> insert into t values (1, 'a');
1 row created.
Execution Plan
----------------------------------------------------------
---------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------
| 0 | INSERT STATEMENT | | 1 | 12 | tb 1 (0)| 00:00:01 |
| 1 | LOAD TABLE CONVENTIONAL | T | | | | |
---------------------------------------------------------------------------------
Statistics
----------------------------------------------------------
71 recursive calls
9 db block gets
75 consistent gets
10 physical reads
1284 redo size
829 bytes sent via SQL*Net to client
785 bytes received via SQL*Net from client
3 SQL*Net roundtrips to/from client
9 sorts (memory)
0 sorts (disk)
1 rows processed
很显然,由于ORA-28002错误导致了SET AUTOTRACE ON功能启用时碰到了错误。当修改了当前的用户密码,则ORA-28002不再出现后,SET AUTOTRACE ON的功能恢复正常。
根据上面的信息其实可以判断,在启用SET AUTOTRACE ON功能时,sqlplus会自动创建一个新的会话来记录当前会话的统计信息。
而启用的新的会话会使用当前会话登录时保存的密码来进行登录,因此可以通过下面的例子来验证这个推论:
solaris*orcl-/home/oracle$ sqlplus test/test
SQL*Plus: Release 11.2.0.3.0 Production on Sun Jul 15 01:28:38 2012
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Oracle Label Security and Real Application Testing options
SQL> set autot on
SQL> set autot off
SQL> alter user test identified by test1;
User altered.
SQL> set autot on
ERROR:
ORA-01017: invalid username/password; logon denied
SP2-0619: Error while connecting
SP2-0611: Error enabling STATISTICS report
果然,在用户登录后,如果当前的密码被修改,是会导致SET AUTOTRACE ON启用时报错不正确的用户名密码错误的。
一个客户碰到的具体需求,分区表中有些分区所在的表空间被OFFLINE,导致在删除统计信息时报错。
下面通过例子来说明这个问题:
SQL> create table t_part_read (id number)
2 partition by range (id)
3 (partition p1 values less than (10) tablespace ts1,
4 partition p2 values less than (20) tablespace ts2,
5 partition pmax values less than (maxvalue) tablespace users);
Table created.
SQL> insert into t_part_read select rownum from tab;
54 rows created.
SQL> commit;
Commit complete.
SQL> exec dbms_stats.gather_table_stats(user, 'T_PART_READ')
PL/SQL procedure successfully completed.
SQL> alter tablespace ts1 read only;
Tablespace altered.
SQL> exec dbms_stats.gather_table_stats(user, 'T_PART_READ')
PL/SQL procedure successfully completed.
SQL> alter tablespace ts1 offline;
Tablespace altered.
SQL> exec dbms_stats.gather_table_stats(user, 'T_PART_READ')
BEGIN dbms_stats.gather_table_stats(user, 'T_PART_READ'); END;
*
ERROR at line 1:
ORA-00376: file 6 cannot be read at this time
ORA-01110: data file 6: '/u01/app/oracle/oradata/ORCL/datafile/tbo1_mf_ts1_7w8l5fz1_.dbf'
ORA-06512: at "SYS.DBMS_STATS", line 23829
ORA-06512: at "SYS.DBMS_STATS", line 23880
ORA-06512: at line 1
如果将表空间只读,并不会影响到表空间上的表或分区的统计信息的收集,因为收集过程只是读取,而收集的结果信息是写到SYSTEM表空间的。
但是如果分区所在的表空间处于OFFLINE状态,那么在统计信息收集的过程中就会报错。
有一个很简单的方法可以解决这个问题,就是将被OFFLINE影响的分区的统计信息锁定,这样Oracle在收集统计信息时就会跳过锁定的分区,通过这个办法就可以避免统计信息收集过程中的报错:
SQL> exec dbms_stats.lock_partition_stats(user, 'T_PART_READ', 'P1')
PL/SQL procedure successfully completed.
SQL> exec dbms_stats.gather_table_stats(user, 'T_PART_READ')
BEGIN dbms_stats.gather_table_stats(user, 'T_PART_READ'); END;
*
ERROR at line 1:
ORA-00376: file 6 cannot be read at this time
ORA-01110: data file 6: '/u01/app/oracle/oradata/ORCL/datafile/o1_mf_ts1_7w8l5fz1_.dbf'
ORA-06512: at "SYS.DBMS_STATS", line 23829
ORA-06512: at "SYS.DBMS_STATS", line 23880
ORA-06512: at line 1
SQL> exec dbms_stats.gather_table_stats(user, 'T_PART_READ', granularity => 'PARTITION')
PL/SQL procedure successfully completed.
即使锁定分区后,尝试收集统计信息仍然报错,这是因为Oracle默认除了要收集分区上的统计信息以外,还要收集表级的统计信息,而这就会造成被OFFLINE影响的分区也要被读取。
解决方法就是在收集统计信息的时候指定收集的粒度是分区,不收集表上的GLOBAL信息。
第一次碰到Exadata上的bug。
数据库环境Exadata V2-2,数据库版本为11.2.0.2,错误信息为:
Wed Apr 25 11:32:35 2012
Errors in file /u01/app/oracle/diag/rdbms/ods/orcl2/trace/orcl2_ora_9495.trc (incident=304808):
ORA-00600: internal error code, arguments: [kffmXpGet], [145], [69784], [], [], [], [], [], [], [], [], []
ORA-03135: connection lost contact
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_304808/orcl2_ora_9495_i304808.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
opidcl aborting process tb unknown ospid (9495) as a result of ORA-600
Wed Apr 25 11:32:36 2012
Dumping diagnostic data in directory=[cdmp_20120425113236], requested by (instance=2, sid=9495), summary=[incident=304808].
Wed Apr 25 11:32:36 2012
Sweep [inc][304808]: completed
Sweep [inc2][304808]: completed
Wed Apr 25 11:32:39 2012
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_pmon_10797.trc (incident=302224):
ORA-00600: internal error code, arguments: [kffmXpGet], [181], [87276], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl2/incident/incdir_302224/orcl2_pmon_10797_i302224.trc
Dumping diagnostic data in directory=[cdmp_20120425113240], requested by (instance=2, sid=10797 (PMON)), summary=[incident=302224].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_pmon_10797.trc:
ORA-00600: internal error code, arguments: [kffmXpGet], [181], [87276], [], [], [], [], [], [], [], [], []
PMON (ospid: 10797): terminating the instance due to error 472
Wed Apr 25 11:32:41 2012
opiodr aborting process unknown ospid (11248) as a result of ORA-1092
Wed Apr 25 11:32:41 2012
opiodr aborting process unknown ospid (1967) as a result of ORA-1092
Wed Apr 25 11:32:42 2012
ORA-1092 : opitsk aborting process
Wed Apr 25 11:32:43 2012
License high water mark = 573
Instance terminated by PMON, pid = 10797
USER (ospid: 8755): terminating the instance
Instance terminated by USER, pid = 8755
由于ORA-600[kffmXpGet]错误的出现,最终出现了ORA-1092的错误,并致使opitsk进程退出,导致数据库实例崩溃。
Oracle在MOS文档Bug 12387467 instance crash by ORA-600 [kffmxpget]中描述了这个问题,确认影响的版本为11.2.0.1、11.2.0.2、11.2.0.3,Oracle在11.2.0.2的Bundle Patch 16 for Exadata以及11.2.0.3 Bundle Patch 5 for Exadata中fixed了这个问题,Oracle计划在11.2.0.4和12.1中彻底Fixed该问题。
这个错误只会发生在Exadata中,因为导致错误的原因和Smart Scan有关。当会话执行Smart Scan操作时使用了Ctrl + C中止该操作,会导致PMON进程出现这个600错误。Oracle建议在执行数据文件的SHRINK操作前,中止并退出所有执行Exadata Smart Scan的会话。
Eygle的新书经历了4个多月的等待,终于面世了。
我应该是这本书的第一个读者,4个月前Eygle刚刚完成初稿的时候,我就完整的看过一遍了。
我平常看书比较慢,不过那次却看得很快。一方面是由于大部分案例都比较熟悉;另一方面得益于Eygle的文笔,把故障原因、分析过程、解决思路和处理过程描述得非常清晰,给人一种一气呵成的感觉。导致我这个帮忙审稿的,多次都陷入到具体的内容中了,虽然对于我来说看得tb很爽,但是对于审稿而言并不是一件好事。审稿应该始终站在一个中立的角度,而如果在审稿的过程中过于关注内容,就会忽略掉一些细节的问题。好在Eygle对于自己文章的严谨程度很高,因此通篇看完也没有发现多少不妥之处,估计也不会遗漏太多的问题。
书中的所有内容都来自真实的案例,而且其中有三个个重要的案例都是来自2011年12月30日到2011年12月31日这两天。在2012年元旦马上要来临之前,Eygle接连帮助三个客户进行了数据库的恢复,这件事刺激了Eygle,于是元旦回来,Eygle就开始构思并执笔他的新作。一个多月的时间,这本《数据安全警示录》就基本上完成了。
上次看得是电子版,拿到实体书后感觉这次的印刷质量还是很不错的,等有空的话还要再把这本书再看一遍。
客户的数据库出现了ORA-600(kfnsBackground03)错误。
数据库版本为10.2.0.3 RAC for HP-UX 11.23。这个错误在ASM实例和数据库实例都可能出现,如果发生在ASM实例,并不会导致ASM实例的崩溃,但是如果发生在数据库实例,则会导致数据库实例被强制关闭:
Tue May 15 10:28:05 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:32:50 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:33:05 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:34:44 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:43:05 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Tue May 15 10:46:13 2012
Errors in file /u01/app/oracle/admin/+ASM/udump/+asm1_ora_18846.trc:
ORA-00600: internal error code, arguments: [kfnsBackground03], [], [], [], [], [], [], []
Tue May tb 15 10:46:14 2012
Trace dumping is performing id=[cdmp_20120515104614]
上面是ASM实例的报错,下面是对应时刻数据库实例的报错:
Tue May 15 10:38:12 2012
kkjcre1p: unable to spawn jobq slave process
Tue May 15 10:38:12 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_cjq0_17957.trc:
Tue May 15 10:42:19 2012
PMON failed to acquire latch, see PMON dump
Tue May 15 10:43:04 2012
found dead shared server 'S006', pid = (90, 4)
Tue May 15 10:43:10 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_j000_19938.trc:
ORA-12012: error on auto execute of job 42579
ORA-27468: "EXFSYS.RLM$EVTCLEANUP" is locked by another process
Tue May 15 10:45:06 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_j002_23628.trc:
ORA-12012: error on auto execute of job 8888975
ORA-27468: "ORCL.P_DATA_C" is locked by another process
Tue May 15 10:45:10 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_j003_23959.trc:
ORA-12012: error on auto execute of job 8855572
ORA-27468: "ORCL.P_DATA" is locked by another process
Tue May 15 10:46:14 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_asmb_18844.trc:
ORA-15064: communication failure with ASM instance
ORA-00600: internal error code, arguments: [kfnsBackground03], [], [], [], [], [], [], []
Tue May 15 10:46:14 2012
ASMB: terminating instance due to error 15064
Tue May 15 10:46:15 2012
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/admin/ORCL/bdump/orcl1_diag_17903.trc
Tue May 15 10:46:16 2012
Shutting down instance (abort)
License high water mark = 52
如果从这次数据库的实例崩溃看,问题似乎和主机上的资源耗尽有关。在问题发生之前,数据库实例已经出现了kkjcre1p: unable to spawn jobq slave process和PMON failed to acquire latch的问题。
当时其他时刻出现这个错误时,似乎并没有确定的资源不足的信息:
Sat May 26 09:47:49 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Sat May 26 09:49:44 2012
NOTE: database ORCL1:ORCL failed during msg 19, reply 2
Sat May 26 09:52:23 2012
Errors in file /u01/app/oracle/admin/+ASM/udump/+asm1_ora_21722.trc:
ORA-00600: internal error code, arguments: [kfnsBackground03], [], [], [], [], [], [], []
Sat May 26 09:52:25 2012
Trace dumping is performing id=[cdmp_20120526095225]
对应这个时刻的数据库告警信息为:
Sat May 26 09:52:24 2012
Errors in file /u01/app/oracle/admin/ORCL/bdump/orcl1_asmb_21720.trc:
ORA-15064: communication failure with ASM instance
ORA-00600: internal error code, arguments: [kfnsBackground03], [], [], [], [], [], [], []
Sat May 26 09:52:24 2012
ASMB: terminating instance due to error 15064
Sat May 26 09:52:25 2012
System state dump is made for local instance
System State dumped to trace file /u01/app/oracle/admin/ORCL/bdump/orcl1_diag_20837.trc
Sat May 26 09:52:26 2012
Shutting down instance (abort)
License high water mark = 46
Sat May 26 09:52:30 2012
Instance terminated by ASMB, pid = 21720
Sat May 26 09:52:31 2012
Instance terminated by USER, pid = 536
这次错误的出现并没有任何其他的信息,数据库实例就直接DOWN掉了。不过每次在出现这个错误时,ASM实例上都会存在告警信息:NOTE: database ORCL1:ORCL failed during msg 19, reply 2。这说明ASM实例和数据库的通信存在了问题。kfnsBackground是Kernel Files Network Service Background的缩写。其中MSG 19是指IOSTAT,而reply 2指的是TIMEOUT,这说明ASM在进行io操作是出现了timeout导致了ASM的异常并导致实例的崩溃。
这个错误相对比较罕见,整个METALINK中,只有3篇文章和这个错误相关,其中两篇是和归档路径空间不足导致系统HANG住,最终导致IO的TIMEOUT,并产生了错误;而另外一篇则没有进一步的信息。其中这三次错误对应的版本分别是10.2.0.4 FOR AIX、10.2.0.4 FOR SOLARIS和10.2.0.3 FOR HPUX,这说明这个错误和平台没有关系,但是问题集中在10.2.0.3和10.2.0.4版本上。
根据上面的分析,应该部署操作系统信息监控工具,以便于随时观察系统资源的使用情况,在出现类似的错误可以进行辅助分析。由于这个问题没有出现在10.2.0.5中的记录,因此把数据库升级到10.2.0.5有可能避开这个问题。
前几天在给公司的员工讲一个案例的提到这个问题。
其实当时提到了这个特点,DROP TABLE会进入回收站,但是DROP PARTITION并不会,因此DROP PARTITION之后,数据无法简单的回复,只能通过逻辑或物理备份的方式来进行数据的回复。
SQL> create table t_drop (id number);
Table created.
SQL> drop table t_drop;
Table dropped.
SQL> select object_name, original_name from recyclebin;
OBJECT_NAME ORIGINAL_NAME
------------------------------ --------------------------------
BIN$xJhZqpmfWZXgRDzZK0pZWw==$0 T_DROP
SQL> create table t_part_drop (id number) partition by range (id)
2 (partition p1 values less than (10),
3 partition p2 values less than (20),
4 partition p3 values less than (30),
5 partition pmax values less than (maxvalue));
Table created.
SQL> insert into t_part_drop tb select rownum from user_objects;
176 rows created.
SQL> commit;
Commit complete.
SQL> alter table t_part_drop drop partition p1;
Table altered.
SQL> select object_name, original_name from recyclebin;
OBJECT_NAME ORIGINAL_NAME
------------------------------ --------------------------------
BIN$xJhZqpmfWZXgRDzZK0pZWw==$0 T_DROP
本来只是普及一下这个常识,不过有人问我Oracle为什么没有实现将删除分区放在回收站中。这个问题问的很好,因为如果这个功能很容易实现,那么Oracle肯定早就实现了,而到了11.2中Oracle仍然没有实现这个功能,那么一定说明这个功能不是无法实现,就是实现的困难太大。
回收站的实现并不复杂,当一张表被删除的时候,Oracle没有直接释放表在表空间上的空间占用,而是将表简单的打了个标识,这样在正常查询数据字典时就不会看到这张被删除的表,而如果需要恢复这张表时,只需要将标识位改回来既可。
那么同样是修改数据字典,为什么不能将被删除的分区通过标识的方法放到回收站中呢,这是因为,对于表而言,删除操作是将一个整理完全删除。而对于分区的删除,是删除整体中的一个部分。对于删除这个动作其实并没有太大的影响,但是回收站的功能不是为了删除,而是为了可以快速的恢复。对表而言,直接恢复整体不存在任何的问题,即使同名对象存在,也只需改个名字既可。而对于删除分区的恢复而言, tb 问题就不那么简单了。由于分区表并没有删除,因此这个表仍然可以继续进行操作,虽然某个分区被删除了,但是除非是范围分区中的MAXVALUE分区和列表分区中的DEFAULT分区,否则再插入原分区对应的数据时,并不会报错,而是会插入到其他分区中:
SQL> select * from t_part_drop partition (p2);
ID
----------
10
11
12
13
14
15
16
17
18
19
10 rows selected.
SQL> insert into t_part_drop values (5);
1 row created.
SQL> select * from t_part_drop partition (p2);
ID
----------
10
11
12
13
14
15
16
17
18
19
5
11 rows selected.
原表应该插入分区P1的数据,由于分区P1被删除,因此现在满足分区P2的条件,被插入到分区P2中,考虑这种情况下,如果直接恢复P1分区会怎样。
显然这不是一个简单的数据字典的修改就能解决的问题,不但涉及到分区数据改变的问题,还必然会带来全局和本地索引失效的问题,更重要的是,可能带来主键冲突的情况。
这还只是分区表进行了DML的情况,如果删除分区后,分区表又进行了DDL,比如新SPLIT了P1分区,那么删除分区的恢复操作就更无法进行了。
如果一个功能觉得很简单就可以实现,但是Oracle却一直没有实现,那么很可能实现这个功能并不像想象的那么简单。
11.2对于CREATE OR REPLACE TYPE语句进行了增加,增加了FORCE选项。
在11.2之前,只要有其他的表或TYPE依赖了当前对象,这个对象就无法进行REPLACE了:
SQL> create type t_num_tab is table of number;
2 /
Type created.
SQL> create type t_record is object (id number, n_tab t_num_tab);
2 /
Type created.
SQL> create or replace type t_num_tab is table of number(5);
2 /
create or replace type t_num_tab is table of number(5);
*
ERROR at line 1:
ORA-02303: cannot drop or replace a type with type or table dependents
这是11.2之前的情况,尝试执行CREATE OR REPLACE将会导致ORA-2303错误,在11.2中Oracle增加了FORCE功能,使得一个仅被TYPE所依赖的对象仍然可以执行REPLACE的操作:
SQL> create or replace type t_num_tab force is tbable of number(5);
2 /
Type created.
SQL> select * from v$version;
BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE 11.2.0.3.0 Production
TNS for Solaris: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production
有意思的是,FORCE语句的位置不对则不会生效,这时同样会报ORA-2303的错误,而并不会导致语句错误:
SQL> create or replace force type t_num_tab is table of number(5);
create or replace force type t_num_tab is table of number(5)
*
ERROR at line 1:
ORA-02303: cannot drop or replace a type with type or table dependents
SQL> create or replace type t_num_tab is table of number(5) force;
2 /
create or replace type t_num_tab is table of number(5) force;
*
ERROR at line 1:
ORA-02303: cannot drop or replace a type with type or table dependents
最后这个功能只对被TYPE所依赖的对象有效,一旦对象被表所依赖,则FORCE功能也不起作用:
SQL> create table t_type_tab (id number, c_tab t_num_tab)
2 nested table c_tab store tb as c_tab_tab;
Table created.
SQL> create or replace type t_num_tab force is table of number(6);
2 /
create or replace type t_num_tab force is table of number(6);
*
ERROR at line 1:
ORA-22866: cannot replace a type with table dependents
Oracle的错误信息也变成了ORA-22866。其实这时可以预料的,因为一旦创建了表,就相当于进行了实体化的工作,如果依赖的类型发生了变化,将会影响表中已有的数据的读写。不过其实Oracle可以做到更进一步,就是如果表段没有创建或者表中没有插入数据的情况下,允许对依赖的对象进行修改。