查询执行hash join出现ora-040300 (QERHJ hash-joi,kllcqas:kllsltba)的诊断案例

某公安系统开发人员在plsql中执行查询时报ORA-04030: 在尝试分配 123416 字节 (QERHJ hash-joi,kllcqas:kllsltba) 时进程内存不足。数据库是aix上的rac 11.2.0.4。当时朋友找到我,我当时给他的建议是直接把报错的查询直接在数据库服务器上执行看是否报错,因为在服务器上用sqlplus来执行是没有经过监听程序的,但后面给我的答复是在服务器上执行也报错。所以就远程连接来处理。

查看alert.log文件发现如下错误信息:

Errors in file /u01/app/oracle/diag/rdbms/test/test/incident/incdir_156708/test_ora_15859922_i156708.trc

ORA-04030: 在尝试分配 123416 字节 (QERHJ hash-joi,kllcqas:kllsltba) 时进程内存不足

从alert.log文件查看到的非缺省值参数信息如下 :

System parameters with non-default values:
  processes                = 1500
  sessions                 = 2272
  _realfree_heap_pagesize_hint= 256K
  _use_realfree_heap       = TRUE
  spfile                   = "+DATA_DG/xtrk/spfilextrk.ora"
  sga_target               = 14G
  control_files            = "+DATA_DG/xtrk/controlfile/current.261.851142581"
  control_files            = "+DATA_DG/xtrk/controlfile/current.260.851142581"
  db_block_size            = 8192
  _external_scn_rejection_threshold_hours= 1
  compatible               = "11.2.0.4.0"
  log_archive_format       = "%t_%s_%r.dbf"
  cluster_database         = TRUE
  db_create_file_dest      = "+DATA_DG"
  db_recovery_file_dest    = "+DATA_DG"
  db_recovery_file_dest_size= 400G
  thread                   = 1
  undo_tablespace          = "UNDOTBS1"
  instance_number          = 1
  remote_login_passwordfile= "EXCLUSIVE"
  db_domain                = ""
  dispatchers              = "(PROTOCOL=TCP) (SERVICE=xtrkXDB)"
  remote_listener          = "test-scan:1521"
  audit_file_dest          = "/u01/app/oracle/admin/test/adump"
  audit_trail              = "DB"
  db_name                  = "test"
  open_cursors             = 600
  pga_aggregate_target     = 10G
  diagnostic_dest          = "/u01/app/oracle"

如是查看错误跟踪文件:

Dump continued from file: /u01/app/oracle/diag/rdbms/xtrk/xtrk1/trace/xtrk1_ora_15859922.trc
ORA-04030: 在尝试分配 123416 字节 (QERHJ hash-joi,kllcqas:kllsltba) 时进程内存不足

========= Dump for incident 156708 (ORA 4030) ========
----- Beginning of Customized Incident Dump(s) -----
=======================================
TOP 10 MEMORY USES FOR THIS PROCESS
---------------------------------------

*** 2014-08-27 15:07:09.388
77%   86 MB, 727 chunks: "kllcqas:kllsltba          "  SQL
         QERHJ hash-joi  ds=11160a378  dsprt=110bcb0f8
 7% 8449 KB,  80 chunks: "QERHJ Bit vector          "  SQL
         QERHJ hash-joi  ds=11160a378  dsprt=110bcb0f8
 6% 6771 KB,   4 chunks: "kllcqc:kllcqslt           "  SQL
         QERHJ hash-joi  ds=11160a378  dsprt=110bcb0f8
 5% 5880 KB, 407 chunks: "free memory               "  SQL
         QERHJ hash-joi  ds=11160a378  dsprt=110bcb0f8
 2% 2056 KB,   2 chunks: "kllcqgf:kllsltba          "  SQL
         kxs-heap-w      ds=110bcb0f8  dsprt=1109f0120
 1%  651 KB,  28 chunks: "permanent memory          "  
         pga heap        ds=110005210  dsprt=0
 0%  376 KB,   1 chunk : "kfk_kfkio_freeq           "  
         KFK_IO_SUBHEAP  ds=1108dae38  dsprt=110005210
 0%  336 KB,   1 chunk : "kfkosd_p                  "  
         KFK_IO_SUBHEAP  ds=1108dae38  dsprt=110005210
 0%  259 KB,   8 chunks: "free memory               "  
         pga heap        ds=110005210  dsprt=0
 0%  253 KB,   2 chunks: "free memory               "  
         top call heap   ds=11011f360  dsprt=0

从上面报错进程内存分配信息来看报错的chunks: “kllcqas:kllsltba ” 大小是86M

=======================================
PRIVATE MEMORY SUMMARY FOR THIS PROCESS
---------------------------------------
******************************************************
PRIVATE HEAP SUMMARY DUMP
111 MB total:
   110 MB commented, 652 KB permanent
   606 KB free (0 KB in empty extents),
     109 MB,   1 heap:    "session heap   "            92 KB free held
------------------------------------------------------
Summary of subheaps at depth 1
110 MB total:
   110 MB commented, 96 KB permanent
    94 KB free (0 KB in empty extents),
     108 MB,   1 heap:    "kxs-heap-w     "            84 KB free held
------------------------------------------------------
Summary of subheaps at depth 2
109 MB total:
   108 MB commented, 45 KB permanent
    32 KB free (4 KB in empty extents),
     106 MB,   4 heaps:   "QERHJ hash-joi "            10 KB free held
------------------------------------------------------
Summary of subheaps at depth 3
106 MB total:
   100 MB commented, 41 KB permanent
  5880 KB free (0 KB in empty extents),
      91 MB, 727 chunks:  "kllcqas:kllsltba          " 5257 KB free held
    8786 KB,  80 chunks:  "QERHJ Bit vector          " 336 KB free held

从111 MB total可知出错进程只分配了111M的内存

=========================================
REAL-FREE ALLOCATOR DUMP FOR THIS PROCESS
-----------------------------------------
 
Dump of Real-Free Memory Allocator Heap [0x110959bf0]
mag=0xfefe0001 flg=0x5000003 fds=0x0 blksz=262144
blkdstbl=0x110959c00, iniblk=8192 maxblk=524288 numsegs=15
In-use num=402 siz=116391936, Freeable num=0 siz=0, Free num=0 siz=0
 
==========================================
INSTANCE-WIDE PRIVATE MEMORY USAGE SUMMARY
------------------------------------------
 
Dumping Work Area Table (level=1)
=====================================
 
  Global SGA Info
  ---------------
 
    global target:    10240 MB
    auto target:       8997 MB
    max pga:           2047 MB
    pga limit:         4095 MB
    pga limit known:  0
    pga limit errors:     0
 
    pga inuse:          275 MB
    pga alloc:          343 MB
    pga freeable:        11 MB
    pga freed:         4150 MB
    pga to free:          0 
    broker request:       0
 
    pga auto:            32 MB
    pga manual:           0 MB
 
    pga alloc  (max):  3043 MB
    pga auto   (max):    32 MB
    pga manual (max):     0 MB
 
    # workareas     :     4
    # workareas(max):     5
 
 

================================
PER-PROCESS PRIVATE MEMORY USAGE
--------------------------------
 
Private memory usage per Oracle process

当出现ora-04030时该实例分配的总PGA大小如下

-------------------------
Top 10 processes:
-------------------------
(percentage is of 420 MB total allocated memory)
27% pid 51: 111 MB used of 112 MB allocated  < = CURRENT PROC
 5% pid 13: 18 MB used of 22 MB allocated 
 5% pid 14: 18 MB used of 22 MB allocated 
 5% pid 36: 18 MB used of 22 MB allocated 
 5% pid 38: 18 MB used of 22 MB allocated 
 5% pid 39: 18 MB used of 22 MB allocated 
 5% pid 40: 18 MB used of 22 MB allocated 
 4% pid 10: 16 MB used of 17 MB allocated 
 4% pid 12: 13 MB used of 17 MB allocated 
 4% pid 20: 12 MB used of 15 MB allocated 

从上面的percentage is of 420 MB total allocated memory可知这个实例的PGA只分配了420M。从27% pid 51: 111 MB used of 112 MB allocated < = CURRENT PROC可知当前进程分配了112M,使用了111M。而pid=51,那么接下来查看pid=51的详细会话信息:

------------------------------------
Begin session detail for pid 51
  sid: 430 ser: 5 audsid: 400022 user: 239/TC_RKXT
    flags: (0x45) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
    flags2: (0x40009) -/-/INC
  pid: 51 O/S info: user: grid, term: UNKNOWN, ospid: 15859922
    image: oracle@xtrkdb1
  client details:
    O/S info: user: Administrator, term: RKXXWEB, ospid: 4120:2956
    machine: RKXX\RKXXWEB program: plsqldev.exe
    application name: PL/SQL Developer, hash value=1190136663
    action name: SQL Window - New, hash value=3399691616
  current SQL:
  SELECT
 b.master_relation 户关系, b.hu_id_new 户号,a.name 姓名,a.pid 身份证号码,b.when_in_logged 入户时间, b.in_category 入户理由, b.dob,
a.photo_flag 有无相片, substr(ORGNAME,7,10) 所属派出所,
b.quxcun_id,c.name 居委会名称
from t_person a, t_huji b, tc_jcyw.t_org o ,tc_jcyw.t_quxcun c
where a.person_id = b.person_id
      and b.org_id = o.sunit_code
      and b.quxcun_id=c.quxcun_id
      and b.zxbz = '0'
      and a.dob < =to_date( '1996.07.01','yyyy.mm.dd' )
and not Exists(select * from t_pid_accept m where m.iscancel is null and m.pid = a.pid )
End session detail for pid 51
------------------------------------
 
================
SWAP INFORMATION
----------------
swap info: free_mem = 22025.57M rsv = 128.00M 
           alloc = 44.48M avail = 32768.00M swap_free = 32723.52M
----- End of Customized Incident Dump(s) -----

*** 2014-08-27 15:07:09.417
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- Current SQL Statement for this session (sql_id=1gbsquwh58w4k) -----
SELECT
 b.master_relation 户关系, b.hu_id_new 户号,a.name 姓名,a.pid 身份证号码,b.when_in_logged 入户时间, b.in_category 入户理由, b.dob,
a.photo_flag 有无相片, substr(ORGNAME,7,10) 所属派出所,
b.quxcun_id,c.name 居委会名称
from t_person a, t_huji b, tc_jcyw.t_org o ,tc_jcyw.t_quxcun c
where a.person_id = b.person_id
      and b.org_id = o.sunit_code
      and b.quxcun_id=c.quxcun_id
      and b.zxbz = '0'
      and a.dob <=to_date( '1996.07.01','yyyy.mm.dd' )
and not Exists(select * from t_pid_accept m where m.iscancel is null and m.pid = a.pid )

从上面的信息可知swap大小为32768M,SGA为14G,PGA为10G,当这个实例的总PGA为分配了420M,当这个会话进程只分配111M内存时不应该会报告ora-04030错误.这时我还是亲自在服务器上执行上面的报错查询发现不报错,而开始我要朋友做测试,他是要别人做的,但给我的答案是报错,看来还是得亲自测试。这种情况在MOS有一篇文章(ID 758131.1)就是描述这种情况的并且说在aix下是一个特定的问题。造成这个问题的原因可能是因为操作系统资源限制,说当CRS启动它的资源时是使用root用户可能会出现ora-04030错误,如果是资源通过sqlplus,lsnrctl来启动则没有操作系统资源限制。如果我查看/etc/security/limits文件内容如下:

# cat limits

*
* Sizes are in multiples of 512 byte blocks, CPU time is in seconds
*
* fsize      - soft file size in blocks
* core       - soft core file size in blocks
* cpu        - soft per process CPU time limit in seconds
* data       - soft data segment size in blocks
* stack      - soft stack segment size in blocks
* rss        - soft real memory usage in blocks
* nofiles    - soft file descriptor limit
* fsize_hard - hard file size in blocks
* core_hard  - hard core file size in blocks
* cpu_hard   - hard per process CPU time limit in seconds
* data_hard  - hard data segment size in blocks
* stack_hard - hard stack segment size in blocks
* rss_hard   - hard real memory usage in blocks
* nofiles_hard - hard file descriptor limit
*
* The following table contains the default hard values if the
* hard values are not explicitly defined:
*
*   Attribute        Value
*   ==========    ============
*   fsize_hard    set to fsize
*   cpu_hard      set to cpu
*   core_hard         -1
*   data_hard         -1
*   stack_hard      8388608 
*   rss_hard          -1
*   nofiles_hard      -1
*
* NOTE:  A value of -1 implies "unlimited"
*

default:
        fsize = 2097151
        core = 2097151
        cpu = -1
        data = 262144
#       rss = 65536
        rss = -1
        stack = 65536
        nofiles = 2000

root:
        fsize = -1
        data = -1
        stack = -1
        core = -1
        rss = -1
        nofiles = -1
        stack = -1
        stack_hard = -1
        data_hard = -1
        cpu = -1
        core_hard = -1
        cpu_hard = -1
        fsize_hard = -1

oracle:
        fsize = -1
        data = -1
        stack = -1
        core = -1
        rss = -1
        nofiles = -1
        stack = -1
        stack_hard = -1
        data_hard = -1
        cpu = -1
        core_hard = -1
        cpu_hard = -1
        fsize_hard = -1

发现我朋友已经修改了root用户和oracle用户的权限但为什么还是报ora-04030错误了。原因为在oracle 11g rac安装不像以前的版本,现在安装Grid Infrastructure 和Oracle数据库软件推荐使用不同的用户来安装。通过查看这里用grid用户来安装Grid Infrastructure ,oracle用户来安装的oracle数据库软。下面的limits文件朋友只修改了root,oracle两个用户,但并没有修改grid用户对操作系统资源的使用限制。而在rac中,监听程序是集群软件来启动的也就是由grid用户来管理的。由于修改grid用户对操作系统资源的使用限制由于在limits文件中增加以下内容:

grid:
        fsize = -1
        data = -1
        stack = -1
        core = -1
        rss = -1
        nofiles = -1
        stack = -1
        stack_hard = -1
        data_hard = -1
        cpu = -1
        core_hard = -1
        cpu_hard = -1
        fsize_hard = -1

由于重启rac,在重启之后再执行上面报错语句一切正常了,造成这个问题的根本原因是没有修改oracle,grid用户对操作系统资源的使用限制所造成的,而这个问题应该是可以避免的,因为对这种操作系统资源的使用限制其实在安装rac之前是要做修改的在安装文档中也是有的,所以安装也是需要仔细阅读官方文档的。

发表评论

电子邮件地址不会被公开。