kksfbc child completion与ksdxexeotherwait引发CPU使用异常

某客户操作人员反应很慢不能操作,管理人员登录小机系统后发现CPU使用到了96%。而且这种情况持续了几个月。以下是登录后小机后载取的topas图,而且是周末,并没有人使用系统。小机是IBM的550,配置是2颗6核的CPU,内存是48G。
1

如是登录数据库执行以下脚本来查看当前数据库消耗CPU最多的进程在执行什么

Connected to Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 
Connected as gtp2
 


SQL> select s.sid,p.SPID,s.username,s.event,s.wait_time,s.state,s.seconds_in_wait,p.PROGRAM,s.MACHINE,
  2  (select  c.SQL_FULLTEXT from v$sqlarea c where c.SQL_ID=s.SQL_ID) sql_fulltext,
  3  (select  c.BIND_DATA from v$sqlarea c where c.SQL_ID=s.SQL_ID) BIND_DATA,s.SQL_ID
  4  from v$session s,v$process p
  5  where p.SPID in(491720,90116,127336,529102,987524,331990)
  6  and s.event not like'%SQL*Net%' and s.USERNAME='GTP2'
  7  order by s.wait_time desc
  8  ;
 
       SID SPID         USERNAME                       EVENT                                                             WAIT_TIME STATE               SECONDS_IN_WAIT PROGRAM                                          MACHINE                                                          SQL_FULLTEXT                                                                     BIND_DATA                                                                        SQL_ID
---------- ------------ ------------------------------ ---------------------------------------------------------------- ---------- ------------------- --------------- ------------------------------------------------ ---------------------------------------------------------------- -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- -------------
      1020 90116        GTP2                           kksfbc child completion                                                  -1 WAITED SHORT TIME             53742 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
      1020 987524       GTP2                           kksfbc child completion                                                  -1 WAITED SHORT TIME             53742 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
      1020 331990       GTP2                           kksfbc child completion                                                  -1 WAITED SHORT TIME             53742 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
      1020 491720       GTP2                           kksfbc child completion                                                  -1 WAITED SHORT TIME             53742 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
 
4 rows selected
 
SQL> select s.sid,p.SPID,s.username,s.event,s.wait_time,s.state,s.seconds_in_wait,p.PROGRAM,s.MACHINE,
  2  (select  c.SQL_FULLTEXT from v$sqlarea c where c.SQL_ID=s.SQL_ID) sql_fulltext,
  3  (select  c.BIND_DATA from v$sqlarea c where c.SQL_ID=s.SQL_ID) BIND_DATA,s.SQL_ID
  4  from v$session s,v$process p
  5  where p.SPID in(491720,90116,127336,529102,987524,331990)
  6  and s.event not like'%SQL*Net%' and s.USERNAME='GTP2'
  7  order by s.wait_time desc
  8  ;
 
       SID SPID         USERNAME                       EVENT                                                             WAIT_TIME STATE               SECONDS_IN_WAIT PROGRAM                                          MACHINE                                                          SQL_FULLTEXT                                                                     BIND_DATA                                                                        SQL_ID
---------- ------------ ------------------------------ ---------------------------------------------------------------- ---------- ------------------- --------------- ------------------------------------------------ ---------------------------------------------------------------- -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- -------------
      1020 90116        GTP2                           ksdxexeotherwait                                                         -1 WAITED SHORT TIME              3342 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
      1020 987524       GTP2                           ksdxexeotherwait                                                         -1 WAITED SHORT TIME              3342 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
      1020 331990       GTP2                           ksdxexeotherwait                                                         -1 WAITED SHORT TIME              3342 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
      1020 491720       GTP2                           ksdxexeotherwait                                                         -1 WAITED SHORT TIME              3342 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
 
4 rows selected
 

从上面的信息可以看到这些进程的等待事件为kksfbc child completion,ksdxexeotherwait。当看到这种情况时第一反应是不是遇到的BUG,以KKSFBC CHILD COMPLETION为关键字到MOS查询可以找到,该Bug的症状为进程不断spin且hang住、出现’KKSFBC CHILD COMPLETION’等待事件、还可能伴有’Waits for “cursor: pin S”‘等待事件,直接影响的版本有11.1.0.6、10.2.0.3和10.2.0.4。而我这里的版本是10.2.0.1。
2

对于该Bug的描述是在发生’kksfbc child completion’等待事件后会话陷入无休止的自旋(spins)中,这种自旋(spins)发生在由堆栈调用(stack call)kksSearchChildList->kkshgnc陷入对kksSearchChildList函数的无限循环中。需要更详细的stack call,如是对系统进程90116进行跟踪。

SQL> oradebug setospid 90116
Oracle pid: 40, Unix process pid: 90116, image: oracleorcl@dbserv
SQL> oradebug unlimit;
Statement processed.
SQL> oradebug short_stack;
ksdxfstk+002c< -ksdxcb+04e4<-sspuser+0068<-00004750<-kksfbc+0bb0<-kkspsc0+0f3c<-kksParseCursor+00d4<-opiosq0+0b10<-kpooprx+0168<-kpoal8+0400<-opiodr+0adc<-ttcpip+1004<-opitsk+1000<-opiino+0990<-opiodr+0adc<-opidrv+0474<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0070
SQL> oradebug dump processstate 10;
Statement processed.
SQL>  oradebug dump systemstate 266;
Statement processed.
SQL> oradebug tracefile_name
/oracle/admin/orcl/udump/orcl_ora_90116.trc

查看生成的跟踪文件orcl_ora_90116.trc有如下内容:

SO: 7000001486ab188, type: 4, owner: 70000014346c5a8, flag: INIT/-/-/0x00
    (session) sid: 1020 trans: 0, creator: 70000014346c5a8, flag: (41) USR/- BSY/-/-/-/-/-
              DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
              txn branch: 0
              oct: 0, prv: 0, sql: 7000001473dcf10, psql: 7000001225ac0c8, user: 82/GTP2
    O/S info: user: gtp-default, term: WIN-AUQ43P0UU9L, ospid: 6708:12196, machine: WORKGROUP\WIN-AUQ43P0UU9L
              program: w3wp.exe
    last wait for 'kksfbc child completion' blocking sess=0x0 seq=2831 wait_time=48850 seconds since wait started=572057
                =0, =0, =0
    Dumping Session Wait History
     for 'kksfbc child completion' count=1 wait_time=48850
                =0, =0, =0

可以从以上trace中看到会话确实曾长时间处于’kksfbc child completion’等待中,之后陷入无限自旋(spins)中消耗了大量CPU时间。但这里实际的表现又存有差异,引发无限循环的函数是kksfbc而不是kksSearchChildList(常规的调用序列是:kksParseCursor->kkspsc0->kksfbc ->kksSearchChildList->kkshgnc)。kksfbc意为K[Kernel]K[Kompile]S[Shared]F[Find]B[Best]C[Child]该函数用以在软解析时找寻合适的子游标,在10.2.0.2以后引入了mutex互斥体来取代原有的Cursor Pin机制,Mutex较Latch更为轻量级。虽然mutex的引入改变了众多cursor pin的内部机制,但kksfbc仍需要持有library cache latches才能扫描library cache hash chains。另一方面当kksfbc函数针对某个parent cursor找到合适child cursor后,可能使用KKSCHLPINx方法将该child cursor pin住,这个时候就需要exclusive地持有该child cursor相应的mutex。Oracle在10.2.0.4上提供了该Bug的one-off Patch
8575528,其在10.2.0.4 psu4以后的等价补丁为(Equivalent patch)为merge patch 9696904:8557428 9696904 7527908 Both fixes are needed. 6795880 superceded by 8575528 in 9696904 which includes extra files so may cause new conflicts。但merge patch 9696904目前仅有Linux x86/64平台上的版本,而问题数据库所在平台为IBM AIX on POWER Systems (64-bit),而且版本是10.2.0.1。那么要解决这个问题是不是没有办法了,其实不然,我们可以将数据库从10.2.0.1升级到10.2.0.5来解决这个BUG,在升级到10.2.0.5之后确实解决这个问题。