朋友反馈他们单位Oracle RAC中的一个节点异常关闭了,操作环境是IBM小机720,EMC存储,Oracle是11.2.0.4。收集了一些信息给我,查看该节点实例的alert.log文件发现了如下信息:
Thu Mar 31 14:29:18 2016 NOTE:Waiting for all pending writes to complete before de-registering: grpnum 3 Thu Mar 31 14:29:20 2016 WARNING: group 3 dismounted: failed to write virtual extent 263 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 263 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 265 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 265 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 265 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 265 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 265 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 265 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 265 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 265 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 260 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 260 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 246 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 246 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 246 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 246 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 246 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 246 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 246 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 246 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 260 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 260 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 260 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 260 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 260 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 260 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 261 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 261 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 261 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 261 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 261 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 261 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 261 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 261 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 244 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 244 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 244 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 244 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 257 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 257 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 244 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 244 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 257 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 257 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 244 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 257 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 244 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 257 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 257 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 257 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 259 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 259 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 259 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 259 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 259 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 259 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 259 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 259 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 254 of file 264 WARNING: group 3 dismounted: failed to write virtual extent 254 of file 264 Thu Mar 31 14:29:21 2016 WARNING: requested mirror side 1 of virtual extent 0 logical extent 0 offset 16384 is not allocated; I/O request failed WARNING: requested mirror side 2 of virtual extent 0 logical extent 1 offset 16384 is not allocated; I/O request failed WARNING: requested mirror side 3 of virtual extent 0 logical extent 2 offset 16384 is not allocated; I/O request failed Errors in file /opt/app/oracle/diag/rdbms/jtpsoft/jtpsoft1/trace/jtpsoft1_arc1_10551306.trc: ORA-00202: control file: '+GJJ_DG/jtpsoft/controlfile/current.260.841581789' ORA-15079: ASM file is closed ORA-15079: ASM file is closed ORA-15079: ASM file is closed System state dump requested by (instance=1, osid=15007960 (CKPT)), summary=[abnormal instance termination]. System State dumped to trace file /opt/app/oracle/diag/rdbms/jtpsoft/jtpsoft1/trace/jtpsoft1_diag_4784254_20160331142927.trc Thu Mar 31 14:29:28 2016 ORA-1092 : opitsk aborting process Thu Mar 31 14:29:34 2016 ORA-1092 : opitsk aborting process Thu Mar 31 14:29:34 2016 License high water mark = 181 Thu Mar 31 14:29:37 2016 Instance terminated by CKPT, pid = 15007960 USER (ospid: 15073490): terminating the instance Instance terminated by USER, pid = 15073490
从上面信息可以看到磁盘组3(也就是磁盘组+GJJ_DG)已经dismounted了并且ckpt进程终止了实例。如是查看跟踪文件/opt/app/oracle/diag/rdbms/jtpsoft/jtpsoft1/trace/jtpsoft1_arc1_10551306.trc,可以看到如下内容:
Trace file /opt/app/oracle/diag/rdbms/jtpsoft/jtpsoft1/trace/jtpsoft1_arc1_10551306.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
ORACLE_HOME = /opt/app/oracle/product/11.2.0/db_1
System name:	AIX
Node name:	p720a
Release:	1
Version:	6
Machine:	00C855104C00
Instance name: jtpsoft1
Redo thread mounted by this instance: 1
Oracle process number: 41
Unix process pid: 10551306, image: oracle@p720a (ARC1)
*** 2016-03-31 14:29:21.527
*** SESSION ID:(641.1) 2016-03-31 14:29:21.527
*** CLIENT ID:() 2016-03-31 14:29:21.527
*** SERVICE NAME:(SYS$BACKGROUND) 2016-03-31 14:29:21.527
*** MODULE NAME:() 2016-03-31 14:29:21.527
*** ACTION NAME:() 2016-03-31 14:29:21.527
 
*** TRACE FILE RECREATED AFTER BEING REMOVED ***
WARNING:failed xlate 2 
WARNING: requested mirror side 1 of virtual extent 0 logical extent 0 offset 16384 is not allocated; I/O request failed
WARNING:failed xlate 2 
WARNING: requested mirror side 2 of virtual extent 0 logical extent 1 offset 16384 is not allocated; I/O request failed
WARNING:failed xlate 2 
WARNING: requested mirror side 3 of virtual extent 0 logical extent 2 offset 16384 is not allocated; I/O request failed
DDE rules only execution for: ORA 202
----- START Event Driven Actions Dump ----
---- END Event Driven Actions Dump ----
----- START DDE Actions Dump -----
Executing SYNC actions
----- START DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (Async) -----
Successfully dispatched
----- END DDE Action: 'DB_STRUCTURE_INTEGRITY_CHECK' (SUCCESS, 0 csec) -----
Executing ASYNC actions
----- END DDE Actions Dump (total 0 csec) -----
*** 2016-03-31 14:29:21.533
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=1, mask=0x0)
----- Error Stack Dump -----
ORA-00202: control file: '+GJJ_DG/jtpsoft/controlfile/current.260.841581789'
ORA-15079: ASM file is closed
ORA-15079: ASM file is closed
ORA-15079: ASM file is closed
----- SQL Statement (None) -----
Current SQL information unavailable - no cursor.
----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
skdstdst()+40        bl       0000000109B3BDC0     FFFFFFFFFFF6D60 ? 000000001 ?
                                                   000000001 ? 000000000 ?
                                                   000000000 ? 000000001 ?
                                                   000000001 ? 000000000 ?
ksedst1()+112        call     skdstdst()           FFFFFFFFFFF6E38 ? 000002004 ?
                                                   110720F48 ? 10A6D6A74 ?
                                                   10A6D5F50 ? FFFFFFFFFFF7190 ?
                                                   FFFFFFFFFFF6F40 ?
                                                   2050033FFFF6E18 ?
ksedst()+40          call     ksedst1()            10A6D6A68 ? 7000000000291 ?
                                                   10A6D6A3C ? B000000000000 ?
                                                   10A6D5F50 ? 000000000 ?
                                                   400000000 ? 1D25C0C474A0C ?
dbkedDefDump()+1516  call     ksedst()             000000001 ? FFFFFFFFFFF72F0 ?
                                                   000000001 ? 000000000 ?
                                                   000000003 ? 00001ED50 ?
                                                   000000000 ? 000000000 ?
ksedmp()+72          call     dbkedDefDump()       1A006FE88 ? 9001000A0000E00 ?
                                                   FFFFFFFFFFF7CD0 ? 004C3009B ?
                                                   000000000 ? 000000001 ?
                                                   000000001 ? 0000000CC ?
kcccsi()+416         call     ksedmp()             CA000000CA ? 000000001 ?
                                                   000000031 ? 7000104BCDE1D56 ?
                                                   000000000 ? 7000104C8CC4478 ?
                                                   000000006 ? 000000030 ?
kccrhd()+1488        call     kcccsi()             000000020 ? 000000000 ?
                                                   110738B50 ? 10A344FE0 ?
                                                   FFFFFFFFFFF7FA0 ? 004C3009B ?
                                                   102D0B800 ? 11298BF30 ?
kccgft_refresh_hdr(  call     kccrhd()             FFFFFFFFFFF8100 ? 004C3009B ?
)+124                                              11297B0B0 ? FFFFFFFFFFFA5E0 ?
                                                   6B736C636D617266 ?
                                                   112939850 ? 10A38F248 ?
                                                   10A38F258 ?
kcc_noenq_refresh()  call     kccgft_refresh_hdr(  110E9DA20 ? 000000004 ?
+40                           )                    000000018 ? 11012D7C8 ?
                                                   000000021 ? 00000374E ?
                                                   000000001 ? FFFFFFFFFFF81F0 ?
kccocx_noenq()+292   call     kcc_noenq_refresh()  20B6CECD0 ? 000000102 ?
                                                   000000000 ? 000000000 ?
                                                   FFFFFFFFFFF8070 ?
                                                   44244288000537D0 ?
                                                   9000000000051AC ?
                                                   FFFFFFFFFFF8030 ?
...省略...
上面的内容也就是不能对磁盘组执行IO操作,并没有得到更有用的信息。如是查看asm的alert.log文件,可以看到如下信息:
Thu Mar 31 14:29:18 2016 WARNING: Waited 15 secs for write IO to PST disk 0 in group 3. WARNING: Waited 15 secs for write IO to PST disk 1 in group 3. WARNING: Waited 15 secs for write IO to PST disk 2 in group 3. WARNING: Waited 15 secs for write IO to PST disk 0 in group 3. WARNING: Waited 15 secs for write IO to PST disk 1 in group 3. WARNING: Waited 15 secs for write IO to PST disk 2 in group 3. Thu Mar 31 14:29:18 2016 NOTE: process _b000_+asm1 (22544528) initiating offline of disk 0.3043853018 (GJJ_DG_0000) with mask 0x7e in group 3 NOTE: process _b000_+asm1 (22544528) initiating offline of disk 1.3043853019 (GJJ_DG_0001) with mask 0x7e in group 3 NOTE: process _b000_+asm1 (22544528) initiating offline of disk 2.3043853020 (GJJ_DG_0002) with mask 0x7e in group 3 NOTE: checking PST: grp = 3 GMON checking disk modes for group 3 at 10 for pid 30, osid 22544528 ERROR: no read quorum in group: required 2, found 0 disks NOTE: checking PST for grp 3 done. NOTE: initiating PST update: grp = 3, dsk = 0/0xb56d82da, mask = 0x6a, op = clear NOTE: initiating PST update: grp = 3, dsk = 1/0xb56d82db, mask = 0x6a, op = clear NOTE: initiating PST update: grp = 3, dsk = 2/0xb56d82dc, mask = 0x6a, op = clear GMON updating disk modes for group 3 at 11 for pid 30, osid 22544528 ERROR: no read quorum in group: required 2, found 0 disks Thu Mar 31 14:29:18 2016 NOTE: cache dismounting (not clean) group 3/0x46ED7239 (GJJ_DG) NOTE: messaging CKPT to quiesce pins Unix process pid: 30277648, image: oracle@p720a (B001) Thu Mar 31 14:29:18 2016 NOTE: halting all I/Os to diskgroup 3 (GJJ_DG) Thu Mar 31 14:29:18 2016 NOTE: LGWR doing non-clean dismount of group 3 (GJJ_DG) NOTE: LGWR sync ABA=22.4306 last written ABA 22.4306 WARNING: Offline for disk GJJ_DG_0000 in mode 0x7f failed. WARNING: Offline for disk GJJ_DG_0001 in mode 0x7f failed. WARNING: Offline for disk GJJ_DG_0002 in mode 0x7f failed. Thu Mar 31 14:29:18 2016 kjbdomdet send to inst 2 detach from dom 3, sending detach message to inst 2 Thu Mar 31 14:29:18 2016 List of instances: 1 2 Dirty detach reconfiguration started (new ddet inc 1, cluster inc 20) Global Resource Directory partially frozen for dirty detach * dirty detach - domain 3 invalid = TRUE 305 GCS resources traversed, 0 cancelled Thu Mar 31 14:29:18 2016 ERROR: ORA-15130 in COD recovery for diskgroup 3/0x46ed7239 (GJJ_DG) ERROR: ORA-15130 thrown in RBAL for group number 3 Dirty Detach Reconfiguration complete Errors in file /opt/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_12386526.trc: ORA-15130: diskgroup "GJJ_DG" is being dismounted Thu Mar 31 14:29:18 2016 WARNING: dirty detached from domain 3 NOTE: cache dismounted group 3/0x46ED7239 (GJJ_DG) NOTE: cache deleting context for group GJJ_DG 3/0x46ed7239 SQL> alter diskgroup GJJ_DG dismount force /* ASM SERVER:1189966393 */ Thu Mar 31 14:29:18 2016 NOTE: No asm libraries found in the system ERROR: ORA-15130 in COD recovery for diskgroup 3/0x46ed7239 (GJJ_DG) ERROR: ORA-15130 thrown in RBAL for group number 3 Errors in file /opt/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_12386526.trc: ORA-15130: diskgroup "" is being dismounted ERROR: ORA-15130 in COD recovery for diskgroup 3/0x46ed7239 (GJJ_DG) ERROR: ORA-15130 thrown in RBAL for group number 3 Errors in file /opt/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_12386526.trc: ORA-15130: diskgroup "" is being dismounted ERROR: ORA-15130 in COD recovery for diskgroup 3/0x46ed7239 (GJJ_DG) ERROR: ORA-15130 thrown in RBAL for group number 3 Errors in file /opt/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_12386526.trc: ORA-15130: diskgroup "" is being dismounted Thu Mar 31 14:29:27 2016 NOTE: ASM client jtpsoft1:jtpsoft disconnected unexpectedly. NOTE: check client alert log. NOTE: Trace records dumped in trace file /opt/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_14024932.trc ASM Health Checker found 1 new failures Thu Mar 31 14:29:30 2016 ERROR: ORA-15130 in COD recovery for diskgroup 3/0x46ed7239 (GJJ_DG) ERROR: ORA-15130 thrown in RBAL for group number 3 Errors in file /opt/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_12386526.trc: ORA-15130: diskgroup "" is being dismounted ERROR: ORA-15130 in COD recovery for diskgroup 3/0x46ed7239 (GJJ_DG) ERROR: ORA-15130 thrown in RBAL for group number 3 Errors in file /opt/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_12386526.trc: ORA-15130: diskgroup "" is being dismounted ERROR: ORA-15130 in COD recovery for diskgroup 3/0x46ed7239 (GJJ_DG) ERROR: ORA-15130 thrown in RBAL for group number 3 Errors in file /opt/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_12386526.trc: ORA-15130: diskgroup "" is being dismounted Thu Mar 31 14:29:37 2016 GMON dismounting group 3 at 12 for pid 31, osid 30277648 Thu Mar 31 14:29:37 2016 NOTE: Disk GJJ_DG_0000 in mode 0x7f marked for de-assignment NOTE: Disk GJJ_DG_0001 in mode 0x7f marked for de-assignment NOTE: Disk GJJ_DG_0002 in mode 0x7f marked for de-assignment SUCCESS: diskgroup GJJ_DG was dismounted SUCCESS: alter diskgroup GJJ_DG dismount force /* ASM SERVER:1189966393 */ SUCCESS: ASM-initiated MANDATORY DISMOUNT of group GJJ_DG Thu Mar 31 14:29:37 2016 NOTE: diskgroup resource ora.GJJ_DG.dg is offline Thu Mar 31 14:29:44 2016 Errors in file /opt/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_15073504.trc: ORA-17503: ksfdopn:2 Failed to open file +GJJ_DG/jtpsoft/spfilejtpsoft.ora ORA-15001: diskgroup "GJJ_DG" does not exist or is not mounted
ASMCMD> lsdg State Type Rebal Sector Block AU Total_MB Free_MB Req_mir_free_MB Usable_file_MB Offline_disks Voting_files Name MOUNTED NORMAL N 512 4096 1048576 614400 606078 204800 200639 0 N BACK_DG/ MOUNTED NORMAL N 512 4096 1048576 10240 9310 2048 3631 0 Y CRS_DG/ MOUNTED NORMAL N 512 4096 1048576 1228800 858311 409600 224355 0 N GJJ_DG/
从上面的信息’Thu Mar 31 14:29:18 2016’与’WARNING: Waited 15 secs for write IO to PST disk 1 in group 3’可知在2016-03-31 14:29:18这个时间点,由于对ASM正常或高冗余磁盘所执行的ASM PST心跳检测出现了延迟,而且延迟时间超过了15秒,因此ASM实例dismount了ASM磁盘组。这种心跳方式会忽略对外部冗余磁盘组的检测,在ASM PST重新验证之前ASM实例会停止执行更多的PST心跳检测,但PST心跳延迟不会dismount外部冗余磁盘组。可能出现这种情况有以下几种原因:
1.由于多路径设备的某些物理路径脱机或丢失
2.多路径执行路径故障转移
3.服务器负载或存储/多路径/操作系统的维护,但这种情况可以排除因为负载不高,也没进行任何维护操作。
那么在出现这种情况时,可以通过以下几种方式来解决:
1.检查操作系统和存储对磁盘管理的响应时间
2.尽可能把磁盘的响应时间维持在15秒以下,但这将处于了操作系统,多路径软件与内核参数等多种因素,需要一一排查。
3.如果不能保证磁盘的响应时间维持在15秒以下,那么可以对ASM实例设置隐含参数_asm_hbetaiowait,对于这个参值数如果遇到bug 17274537可以设置120,并且这个bug在12.1.0.2中被修复了。
Thu Mar 31 14:30:05 2016
SQL> ALTER DISKGROUP GJJ_DG MOUNT  /* asm agent *//* {0:23:23972} */ 
NOTE: cache registered group GJJ_DG number=3 incarn=0x46ed72a8
NOTE: cache began mount (not first) of group GJJ_DG number=3 incarn=0x46ed72a8
NOTE: Assigning number (3,0) to disk (/dev/rhdiskpower5)
NOTE: Assigning number (3,1) to disk (/dev/rhdiskpower6)
NOTE: Assigning number (3,2) to disk (/dev/rhdiskpower7)
Thu Mar 31 14:30:05 2016
GMON querying group 3 at 14 for pid 27, osid 29163580
NOTE: cache opening disk 0 of grp 3: GJJ_DG_0000 path:/dev/rhdiskpower5
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: cache opening disk 1 of grp 3: GJJ_DG_0001 path:/dev/rhdiskpower6
NOTE: F1X0 found on disk 1 au 2 fcn 0.0
NOTE: cache opening disk 2 of grp 3: GJJ_DG_0002 path:/dev/rhdiskpower7
NOTE: F1X0 found on disk 2 au 2 fcn 0.0
NOTE: cache mounting (not first) normal redundancy group 3/0x46ED72A8 (GJJ_DG)
Thu Mar 31 14:30:05 2016
kjbdomatt send to inst 2
Thu Mar 31 14:30:05 2016
NOTE: attached to recovery domain 3
NOTE: redo buffer size is 256 blocks (1053184 bytes)
Thu Mar 31 14:30:05 2016
NOTE: LGWR attempting to mount thread 2 for diskgroup 3 (GJJ_DG)
NOTE: LGWR found thread 2 closed at ABA 22.4306
NOTE: LGWR mounted thread 2 for diskgroup 3 (GJJ_DG)
NOTE: LGWR opening thread 2 at fcn 0.383934 ABA 23.4307
NOTE: cache mounting group 3/0x46ED72A8 (GJJ_DG) succeeded
NOTE: cache ending mount (success) of group GJJ_DG number=3 incarn=0x46ed72a8
Thu Mar 31 14:30:05 2016
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 3
SUCCESS: diskgroup GJJ_DG was mounted
SUCCESS: ALTER DISKGROUP GJJ_DG MOUNT  /* asm agent *//* {0:23:23972} */
从上面的信息可以看到在 2016-03-31 14:30:05这个时间点asm_agent在执行mount磁盘组(GJJ_DG),在14:29:18 dismount磁盘组到14:30:05 mount磁盘之间的时间间隔是47秒。所以等我检查时ASM磁盘组(GJJ_DG)已经mount成功了,于是只能让客户去检查操作系统,多路径软件和光纤。