An Oracle Incident Caused by the AIX File System Cache

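At work today a colleague reported that the expdp export of the medical insurance database had not finished. At the same time, the medical insurance business staff reported that logins to the system succeeded only intermittently. The expdp export log is shown below: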

[IBMP740-1:root:/yb_oradata/RLZYbak/dpdump]#cat insur_changde_150921_2330.log
Export: Release 10.2.0.4.0 - 64bit Production on Monday, 21 September, 2015 23:30:00

Copyright (c) 2003, 2007, Oracle.  All rights reserved.
;;; 
Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
Starting "INSUR_CHANGDE"."SYS_EXPORT_SCHEMA_07":  insur_changde/******** directory=dump_RLZY dumpfile=insur_changde_150921_2330.dmp logfile=insur_changde_150921_2330.log 
Estimate in progress using BLOCKS method...
Processing object type SCHEMA_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 492.0 GB
Processing object type SCHEMA_EXPORT/USER
Processing object type SCHEMA_EXPORT/SYSTEM_GRANT
Processing object type SCHEMA_EXPORT/ROLE_GRANT
Processing object type SCHEMA_EXPORT/DEFAULT_ROLE
Processing object type SCHEMA_EXPORT/PRE_SCHEMA/PROCACT_SCHEMA
Processing object type SCHEMA_EXPORT/SYNONYM/SYNONYM
Processing object type SCHEMA_EXPORT/TYPE/TYPE_SPEC
Processing object type SCHEMA_EXPORT/DB_LINK
Processing object type SCHEMA_EXPORT/SEQUENCE/SEQUENCE
Processing object type SCHEMA_EXPORT/SEQUENCE/GRANT/OWNER_GRANT/OBJECT_GRANT
Processing object type SCHEMA_EXPORT/TABLE/TABLE
Processing object type SCHEMA_EXPORT/TABLE/GRANT/OWNER_GRANT/OBJECT_GRANT
Processing object type SCHEMA_EXPORT/TABLE/INDEX/INDEX
Processing object type SCHEMA_EXPORT/TABLE/CONSTRAINT/CONSTRAINT
Processing object type SCHEMA_EXPORT/TABLE/INDEX/STATISTICS/INDEX_STATISTICS
Processing object type SCHEMA_EXPORT/TABLE/COMMENT
Processing object type SCHEMA_EXPORT/PACKAGE/PACKAGE_SPEC
Processing object type SCHEMA_EXPORT/FUNCTION/FUNCTION
Processing object type SCHEMA_EXPORT/FUNCTION/GRANT/OWNER_GRANT/OBJECT_GRANT
Processing object type SCHEMA_EXPORT/PROCEDURE/PROCEDURE
Processing object type SCHEMA_EXPORT/PROCEDURE/GRANT/OWNER_GRANT/OBJECT_GRANT
Processing object type SCHEMA_EXPORT/PACKAGE/COMPILE_PACKAGE/PACKAGE_SPEC/ALTER_PACKAGE_SPEC
Processing object type SCHEMA_EXPORT/FUNCTION/ALTER_FUNCTION
Processing object type SCHEMA_EXPORT/PROCEDURE/ALTER_PROCEDURE
Processing object type SCHEMA_EXPORT/VIEW/VIEW
Processing object type SCHEMA_EXPORT/VIEW/GRANT/OWNER_GRANT/OBJECT_GRANT
Processing object type SCHEMA_EXPORT/PACKAGE/PACKAGE_BODY
Processing object type SCHEMA_EXPORT/TYPE/TYPE_BODY
Processing object type SCHEMA_EXPORT/TABLE/CONSTRAINT/REF_CONSTRAINT
Processing object type SCHEMA_EXPORT/TABLE/TRIGGER
Processing object type SCHEMA_EXPORT/TABLE/INDEX/FUNCTIONAL_AND_BITMAP/INDEX
Processing object type SCHEMA_EXPORT/TABLE/INDEX/STATISTICS/FUNCTIONAL_AND_BITMAP/INDEX_STATISTICS
Processing object type SCHEMA_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
Processing object type SCHEMA_EXPORT/JOB
. . exported "INSUR_CHANGDE"."MT_BIZ_SCENE_FIN"          51.86 GB 1111873243 rows
. . exported "INSUR_CHANGDE"."MT_FEE_FIN"                22.76 GB 133817090 rows

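The expdp log above contains no errors; it looks more as if the export process simply stopped. The next step is to query the dba_datapump_jobs view to see whether any expdp export jobs were terminated abnormally.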

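The query result shows that the insur_changde user has several expdp export jobs that terminated abnormally. Judging by the job naming convention, the most recent one is SYS_EXPORT_SCHEMA_07, and its state is idle. Re-attach to the SYS_EXPORT_SCHEMA_07 job to inspect its status.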

[IBMP740-1:oracle:/yb_oradata/RLZYbak]$expdp 'insur_changde/"power$20140224"' attach=SYS_EXPORT_SCHEMA_07

Export: Release 10.2.0.4.0 - 64bit Production on Tuesday, 22 September, 2015 16:51:51

Copyright (c) 2003, 2007, Oracle.  All rights reserved.

Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

Job: SYS_EXPORT_SCHEMA_07
  Owner: INSUR_CHANGDE                  
  Operation: EXPORT                         
  Creator Privs: FALSE                          
  GUID: 20448E2327C5015EE053C0A80201015E
  Start Time: Tuesday, 22 September, 2015 16:51:56
  Mode: SCHEMA                         
  Instance: RLZY
  Max Parallelism: 1
  EXPORT Job Parameters:
  Parameter Name      Parameter Value:
     CLIENT_COMMAND        insur_changde/******** directory=dump_RLZY dumpfile=insur_changde_150921_2330.dmp logfile=insur_changde_150921_2330.log 
  State: IDLING                         
  Bytes Processed: 80,139,523,792
  Percent Done: 41
  Current Parallelism: 1
  Job Error Count: 0
  Dump File: /yb_oradata/RLZYbak/dpdump/insur_changde_150921_2330.dmp
    bytes written: 80,145,354,752
  
Worker 1 Status:
  State: UNDEFINED                      
  Object Schema: INSUR_CHANGDE
  Object Name: LV_INDIPAR
  Object Type: SCHEMA_EXPORT/TABLE/TABLE_DATA
  Completed Objects: 3
  Total Objects: 1,225
  Completed Rows: 288,824,659
  Worker Parallelism: 1
  

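The output shows that the expdp job was exporting table LV_INDIPAR, that the worker state is UNDEFINED, and nothing else of use. So why did the expdp job terminate abnormally? The alert.log file contains the following errors, logged while the export was running.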

Starting control autobackup
Control autobackup written to SBT_TAPE device
	comment 'API Version 2.0,MMS Version 1.2.0.0',
	media 'backup_nw.023.RO'
	handle 'c-1589671076-20150921-00'
Mon Sep 21 23:30:02 2015
The value (30) of MAXTRANS parameter ignored.
kupprdp: master process DM00 started with pid=188, OS id=23527444
         to execute - SYS.KUPM$MCP.MAIN('SYS_EXPORT_SCHEMA_07', 'INSUR_CHANGDE', 'KUPC$C_1_20150921233002', 'KUPC$S_1_20150921233002', 0);
kupprdp: worker process DW01 started with worker id=1, pid=189, OS id=1704856
         to execute - SYS.KUPW$WORKER.MAIN('SYS_EXPORT_SCHEMA_07', 'INSUR_CHANGDE');
Tue Sep 22 00:24:18 2015
ksvcreate: Process(q001) creation failed
Tue Sep 22 00:24:38 2015
Process startup failed, error stack:
Tue Sep 22 00:24:39 2015
Errors in file /oracle/admin/RLZY/bdump/rlzy_psp0_7471450.trc:
ORA-27300: OS system dependent operation:fork failed with status: 12
ORA-27301: OS failure message: Not enough space
ORA-27302: failure occurred at: skgpspawn3
Tue Sep 22 00:24:39 2015
Process q001 died, see its trace file
Tue Sep 22 00:24:39 2015
ksvcreate: Process(q001) creation failed
Tue Sep 22 00:24:51 2015

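The alert log shows that the expdp job started at 23:30 on the 21st. Process creation began to fail at 00:24:18 on the 22nd, and at 00:24:39 the errors were written to the trace file /oracle/admin/RLZY/bdump/rlzy_psp0_7471450.trc, which contains the following.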

*** 2015-09-20 00:24:36.347
Process startup failed, error stack:
ORA-27300: OS system dependent operation:fork failed with status: 12
ORA-27301: OS failure message: Not enough space
ORA-27302: failure occurred at: skgpspawn3
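The "status: 12" in the ORA-27300 line is the operating system error number returned by the failed fork call. As a quick check, the Python sketch below looks up the symbolic name for errno 12. It only uses the standard errno module; the message text itself ("Not enough space") is what AIX prints for this error and may be worded differently on other platforms.

```python
import errno

# Status value reported in "ORA-27300: ... fork failed with status: 12"
status = 12

# Map the numeric errno to its symbolic name.
name = errno.errorcode[status]
print(status, "=", name)
```

ENOMEM means the operating system could not provide the memory needed to create the new process, which points the investigation at memory and paging space rather than at Oracle itself.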

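According to the MOS note Troubleshooting ORA-27300 ORA-27301 ORA-27302 errors (Doc ID 579365.1), these errors are mainly caused by memory or paging space being exhausted, so the next step is to check memory and paging space usage on the system.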

[IBMP740-1:root:/]#topas_nmon
+-topas_nmon--h=Help-------------Host=IBMP740-1------Refresh=2 secs---16:58.25
| Memory
|          Physical  PageSpace |        pages/sec  In     Out | FileSystemCache
|% Used       99.8%     68.0%  | to Paging Space   0.0    0.0 | (numperm) 49.6%
|% Free        0.2%     32.0%  | to File System  586.6   11.7 | Process   42.7%
|MB Used   63572.0MB 11142.9MB | Page Scans      126.5        | System     7.6%
|MB Free     108.0MB  5241.1MB | Page Cycles       0.0        | Free       0.2%
|Total(MB) 63680.0MB 16384.0MB | Page Steals     126.5        |           ------
|                              | Page Faults    1317.1        | Total    100.0%
|------------------------------------------------------------ | numclient 49.6%
|Min/Maxperm     1853MB(  3%)  55589MB( 90%) <--% of RAM       | maxclient 90.0%
|Min/Maxfree     960   1088       Total Virtual   78.2GB      | User      89.3%
|Min/Maxpgahead    2      8    Accessed Virtual   33.6GB 43.0%| Pinned     9.4%
|                                                             | lruable pages   15811872.0

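The output shows 63680.0 MB of physical memory and 16384.0 MB of paging space. 63572.0 MB of physical memory and 11142.9 MB of paging space are in use, leaving only 108.0 MB of physical memory free (0.2% of the total) and 5241.1 MB of paging space free (32% of the total). FileSystemCache (numperm) 49.6% means the AIX file system cache occupies 49.6% of physical memory, Process 42.7% means processes occupy 42.7%, System 7.6% means the system occupies 7.6%, and Free 0.2% means only 0.2% of physical memory is available. Maxperm = 90% and maxclient = 90% mean that the file system cache is allowed to grow to as much as 90% of physical memory.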
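The percentages on the nmon screen can be re-derived from its MB figures. The Python sketch below does that arithmetic; the helper name pct and the variable names are illustrative only and not part of any AIX tool, and the input numbers are copied from the output above.

```python
# Re-derive the nmon percentages from the MB figures on the screen above.
# All names are illustrative; the numbers are copied from the nmon output.

def pct(part_mb: float, total_mb: float) -> float:
    """Return part as a percentage of total, rounded to one decimal place."""
    return round(part_mb / total_mb * 100, 1)

phys_total_mb, phys_used_mb, phys_free_mb = 63680.0, 63572.0, 108.0
ps_total_mb, ps_used_mb = 16384.0, 11142.9

print("physical used %:    ", pct(phys_used_mb, phys_total_mb))
print("physical free %:    ", pct(phys_free_mb, phys_total_mb))
print("paging space used %:", pct(ps_used_mb, ps_total_mb))

# numperm is reported as 49.6% of RAM; estimate the file cache size from it.
file_cache_mb = phys_total_mb * 49.6 / 100
print("file system cache, approx. MB:", round(file_cache_mb))
```

The estimated cache size is roughly 30.8 GB, which agrees with the FILECACHE value that topas -M reports further down.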

Check the ten processes using the most memory on AIX (sorted on the SZ column); as shown below, they are all Oracle processes.

[IBMP740-1:root:/]#ps -ealf | head -1 ; ps -ealf | sort -rn +9 | head 
       F S      UID      PID     PPID   C PRI NI ADDR    SZ    WCHAN    STIME    TTY  TIME CMD
  240001 A   oracle  6553662        1   0  60 20 a31123590 115936            Jun 27      - 22:05 ora_lgwr_RLZY
  240001 A   oracle 57671750        1   0  60 20 c41744590 111768 f1000e0004ee48c8   Sep 16      - 28:48 oracleRLZY (LOCAL=NO)
  240001 A   oracle 61735218        1   0  60 20 c44fc4590 109912 f1000e00100440c8   Sep 16      - 31:40 oracleRLZY (LOCAL=NO)
  240001 A   oracle 58982776        1   0  60 20 fb447b590 109528 f1000e0004a0b8c8   Sep 16      - 12:57 oracleRLZY (LOCAL=NO)
  240001 A   oracle 26935684        1   0  60 20 f416f4590 108264            Jun 27      -  2:07 ora_arc1_RLZY
  240001 A   oracle 26870144        1   0  60 20 cf16cf590 108264            Jun 27      -  2:37 ora_arc0_RLZY
  240001 A   oracle  7536818        1   0  60 20 a71127590 108248            Jun 27      - 15:59 ora_cjq0_RLZY
  240001 A   oracle  7733430        1   0  60 20 8a0e0a590 106096            Jun 27      -  8:54 ora_dbw0_RLZY
  240001 A   oracle  8913722        1  24  72 20 864c86590 104764            Sep 16      - 18:14 oracleRLZY (LOCAL=NO)
  240001 A   oracle 26214712        1   0  60 20 944194590 104584          16:51:55      -  0:00 ora_dm00_RLZY

[IBMP740-1:root:/]#topas -M
Topas Monitor for host:    IBMP740-1   Interval:   2    Tue Sep 22 17:13:05 2015
================================================================================
REF1    SRAD  TOTALMEM  INUSE    FREE    FILECACHE  HOMETHRDS  CPUS
--------------------------------------------------------------------------------
   0     0     60.4G    60.3G    106.5    30.8G        748      0-31
   1     1       0.0      0.0      0.0      0.0        625      32-63
================================================================================
CPU     SRAD  TOTALDISP   LOCALDISP%  NEARDISP%   FARDISP%
------------------------------------------------------------
  36       1       439      100.0         0.0        0.0
  60       1       345      100.0         0.0        0.0
  56       1       184      100.0         0.0        0.0
   0       0       144      100.0         0.0        0.0
  32       1        93      100.0         0.0        0.0
  16       0        88      100.0         0.0        0.0
   8       0        54      100.0         0.0        0.0
  40       1        43      100.0         0.0        0.0
  12       0        36      100.0         0.0        0.0
  20       0        28      100.0         0.0        0.0
   4       0        28      100.0         0.0        0.0
  28       0        21      100.0         0.0        0.0
  44       1        18      100.0         0.0        0.0
  24       0        12      100.0         0.0        0.0
  52       1        11      100.0         0.0        0.0
  48       1         1      100.0         0.0        0.0
  17       0         0      0.0           0.0        0.0
  18       0         0      0.0           0.0        0.0
  19       0         0      0.0           0.0        0.0
  10       0         0      0.0           0.0        0.0
  21       0         0      0.0           0.0        0.0
  22       0         0      0.0           0.0        0.0
  23       0         0      0.0           0.0        0.0
   9       0         0      0.0           0.0        0.0
  25       0         0      0.0           0.0        0.0
  26       0         0      0.0           0.0        0.0
  27       0         0      0.0           0.0        0.0
   7       0         0      0.0           0.0        0.0
  29       0         0      0.0           0.0        0.0
  30       0         0      0.0           0.0        0.0
  31       0         0      0.0           0.0        0.0
   6       0         0      0.0           0.0        0.0
  33       1         0      0.0           0.0        0.0
   5       0         0      0.0           0.0        0.0

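The output shows that, apart from the memory used by the system itself, there is 60.4 GB of physical memory in total, of which 60.3 GB is in use and 106.5 MB is free; the file system cache accounts for 30.8 GB.
Use the operating system command vmo -a -F to view the virtual memory tunables.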

[IBMP740-1:root:/]#vmo -a -F
             ame_cpus_per_pool = n/a
               ame_maxfree_mem = n/a
           ame_min_ucpool_size = n/a
               ame_minfree_mem = n/a
               ams_loan_policy = n/a
  enhanced_affinity_affin_time = 1
enhanced_affinity_vmpool_limit = 10
                esid_allocator = 0
           force_relalias_lite = 0
             kernel_heap_psize = 65536
                  lgpg_regions = 0
                     lgpg_size = 0
               low_ps_handling = 1
                       maxfree = 1088
                       maxperm = 14230680
                        maxpin = 13137354
                       maxpin% = 80
                 memory_frames = 16302080
                 memplace_data = 0
          memplace_mapped_file = 0
        memplace_shm_anonymous = 0
            memplace_shm_named = 0
                memplace_stack = 0
                 memplace_text = 0
        memplace_unmapped_file = 0
                       minfree = 960
                       minperm = 474353
                      minperm% = 3
                     nokilluid = 0
                       npskill = 32768
                       npswarn = 131072
           num_locks_per_semid = 1
                     numpsblks = 4194304
               pinnable_frames = 14750156
           relalias_percentage = 0
                         scrub = 0
                      v_pinshm = 0
              vmm_default_pspa = 0
                vmm_klock_mode = 1
            wlm_memlimit_nonpg = 1
##Restricted tunables
               ame_sys_memview = n/a
                cpu_scale_memp = 8
         data_stagger_interval = 161
                         defps = 1
enhanced_affinity_attach_limit = 100
     enhanced_affinity_balance = 100
     enhanced_affinity_private = 40
      enhanced_memory_affinity = 1
                     framesets = 2
                     htabscale = n/a
                  kernel_psize = 65536
          large_page_heap_size = 0
               lru_file_repage = 0
             lru_poll_interval = 10
                     lrubucket = 131072
                    maxclient% = 90
                      maxperm% = 90
               mbuf_heap_psize = 65536
               memory_affinity = 1
          multiple_semid_lists = 0
                 munmap_npages = 16384
                     npsrpgmax = 262144
                     npsrpgmin = 196608
                   npsscrubmax = 262144
                   npsscrubmin = 196608
            num_sem_undo_lists = 0
             num_sems_per_lock = 1
              num_spec_dataseg = 0
                numperm_global = 1
             page_steal_method = 1
          psm_timeout_interval = 20000
             relalias_lockmode = 1
                      rpgclean = 0
                    rpgcontrol = 2
                    scrubclean = 0
                shm_1tb_shared = 12
           shm_1tb_unsh_enable = 1
              shm_1tb_unshared = 256
         soft_min_lgpgs_vmpool = 0
              spec_dataseg_int = 512
              strict_maxclient = 1
                strict_maxperm = 0
                   sync_npages = 0
                 thrpgio_inval = 1024
                thrpgio_npages = 1024
               vm_mmap_areload = 0
          vm_modlist_threshold = -1
              vm_pvlist_dohard = 0
              vm_pvlist_szpcnt = 0
               vmm_fork_policy = 1
            vmm_mpsize_support = 2
               vmm_vmap_policy = 0
                  vtiol_avg_ms = 200
                  vtiol_minreq = 25
            vtiol_minth_active = 1
                    vtiol_mode = 0
               vtiol_pgin_mode = 2
              vtiol_pgout_mode = 2
               vtiol_q_cpu_pct = 2500
          vtiol_thread_cpu_pct = 5000

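The settings that matter are maxclient% = 90 and maxperm% = 90, which allow the file system cache to use up to 90% of physical memory. The fix here is simply to lower maxclient% and maxperm% so that the system keeps free memory available for newly created processes. Adjust the two parameters as follows.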

[IBMP740-1:root:/]#vmo -p -o maxclient%=20
Modification to restricted tunable maxclient%, confirmation required yes/no yes
Setting maxclient% to 20 in nextboot file
Setting maxclient% to 20
Warning: a restricted tunable has been modified
[IBMP740-1:root:/]#vmo -p -o maxperm%=20
Modification to restricted tunable maxperm%, confirmation required yes/no yes
Setting maxperm% to 20 in nextboot file
Setting maxperm% to 20
Warning: a restricted tunable has been modified

After the change, check the tunables again.

[IBMP740-1:root:/]#vmo -a -F
             ame_cpus_per_pool = n/a
               ame_maxfree_mem = n/a
           ame_min_ucpool_size = n/a
               ame_minfree_mem = n/a
               ams_loan_policy = n/a
  enhanced_affinity_affin_time = 1
enhanced_affinity_vmpool_limit = 10
                esid_allocator = 0
           force_relalias_lite = 0
             kernel_heap_psize = 65536
                  lgpg_regions = 0
                     lgpg_size = 0
               low_ps_handling = 1
                       maxfree = 1088
                       maxperm = 3162370
                        maxpin = 13137354
                       maxpin% = 80
                 memory_frames = 16302080
                 memplace_data = 0
          memplace_mapped_file = 0
        memplace_shm_anonymous = 0
            memplace_shm_named = 0
                memplace_stack = 0
                 memplace_text = 0
        memplace_unmapped_file = 0
                       minfree = 960
                       minperm = 790590
                      minperm% = 5
                     nokilluid = 0
                       npskill = 32768
                       npswarn = 131072
           num_locks_per_semid = 1
                     numpsblks = 4194304
               pinnable_frames = 14770780
           relalias_percentage = 0
                         scrub = 0
                      v_pinshm = 0
              vmm_default_pspa = 0
                vmm_klock_mode = 1
            wlm_memlimit_nonpg = 1
##Restricted tunables
               ame_sys_memview = n/a
                cpu_scale_memp = 8
         data_stagger_interval = 161
                         defps = 1
enhanced_affinity_attach_limit = 100
     enhanced_affinity_balance = 100
     enhanced_affinity_private = 40
      enhanced_memory_affinity = 1
                     framesets = 2
                     htabscale = n/a
                  kernel_psize = 65536
          large_page_heap_size = 0
               lru_file_repage = 0
             lru_poll_interval = 10
                     lrubucket = 131072
                    maxclient% = 20
                      maxperm% = 20
               mbuf_heap_psize = 65536
               memory_affinity = 1
          multiple_semid_lists = 0
                 munmap_npages = 16384
                     npsrpgmax = 262144
                     npsrpgmin = 196608
                   npsscrubmax = 262144
                   npsscrubmin = 196608
            num_sem_undo_lists = 0
             num_sems_per_lock = 1
              num_spec_dataseg = 0
                numperm_global = 1
             page_steal_method = 1
          psm_timeout_interval = 20000
             relalias_lockmode = 1
                      rpgclean = 0
                    rpgcontrol = 2
                    scrubclean = 0
                shm_1tb_shared = 12
           shm_1tb_unsh_enable = 1
              shm_1tb_unshared = 256
         soft_min_lgpgs_vmpool = 0
              spec_dataseg_int = 512
              strict_maxclient = 1
                strict_maxperm = 0
                   sync_npages = 0
                 thrpgio_inval = 1024
                thrpgio_npages = 1024
               vm_mmap_areload = 0
          vm_modlist_threshold = -1
              vm_pvlist_dohard = 0
              vm_pvlist_szpcnt = 0
               vmm_fork_policy = 1
            vmm_mpsize_support = 2
               vmm_vmap_policy = 0
                  vtiol_avg_ms = 200
                  vtiol_minreq = 25
            vtiol_minth_active = 1
                    vtiol_mode = 0
               vtiol_pgin_mode = 2
              vtiol_pgout_mode = 2
               vtiol_q_cpu_pct = 2500
          vtiol_thread_cpu_pct = 5000

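The output confirms that the change has taken effect: the file system cache can now use at most 20% of physical memory, and maxperm has dropped from 14230680 to 3162370 pages. Note that minperm% now reads 5 instead of 3, although the command for that change is not shown above.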
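The page counts printed by vmo can be related to the MB values on the nmon screens. The Python sketch below converts them, assuming that vmo reports memory_frames and maxperm in 4 KB pages; the helper names are illustrative, and the numbers are copied from the two vmo listings and from the "lruable pages" line of nmon.

```python
# Convert vmo page counts to MB and to a percentage of lruable pages.
# Assumption: the values are counted in 4 KB pages. Names are illustrative.

PAGE_KB = 4

def pages_to_mb(pages: int) -> float:
    """Convert a count of 4 KB pages to megabytes."""
    return pages * PAGE_KB / 1024

def pct_of(pages: int, base_pages: int) -> float:
    """Return pages as a percentage of base_pages, to one decimal place."""
    return round(pages / base_pages * 100, 1)

lruable_pages = 15811872    # "lruable pages" from the nmon screen
memory_frames = 16302080    # vmo memory_frames
maxperm_before = 14230680   # vmo maxperm while maxperm% = 90
maxperm_after = 3162370     # vmo maxperm after setting maxperm% = 20

print("RAM MB:        ", pages_to_mb(memory_frames))
print("maxperm before:", round(pages_to_mb(maxperm_before)), "MB,",
      pct_of(maxperm_before, lruable_pages), "% of lruable pages")
print("maxperm after: ", round(pages_to_mb(maxperm_after)), "MB,",
      pct_of(maxperm_after, lruable_pages), "% of lruable pages")
```

Under that assumption the converted values line up with the Total(MB) and Min/Maxperm figures that nmon displays before and after the change, which suggests that maxperm is derived from maxperm% applied to the lruable page count.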

[IBMP740-1:root:/]#topas_nmon
+-topas_nmon--h=Help-------------Host=IBMP740-1------Refresh=2 secs---17:44.52
| Memory
|          Physical  PageSpace |        pages/sec  In     Out | FileSystemCache
|% Used       69.8%     67.2%  | to Paging Space   0.5    0.0 | (numperm) 19.3%
|% Free       30.2%     32.8%  | to File System 9455.5    8.4 | Process   42.9%
|MB Used   44476.3MB 11010.5MB | Page Scans     9562.1        | System     7.6%
|MB Free   19203.7MB  5373.5MB | Page Cycles       0.0        | Free      30.2%
|Total(MB) 63680.0MB 16384.0MB | Page Steals    9510.6        |           ------
|                              | Page Faults    7478.9        | Total    100.0%
|------------------------------------------------------------ | numclient 19.3%
|Min/Maxperm     3088MB(  5%)  12353MB( 20%) <--% of RAM       | maxclient 20.0%
|Min/Maxfree     960   1088       Total Virtual   78.2GB      | User      59.3%
|Min/Maxpgahead    2      8    Accessed Virtual   33.6GB 43.0%| Pinned     9.4%
|                                                             | lruable pages   15811872.0

Free physical memory is now 30.2%, the file system cache (FileSystemCache) is at 19.3%, and maxperm and maxclient are both 20%. The expdp export now runs normally.

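This incident shows that AIX, in order to improve I/O throughput, uses free physical memory as file system cache, and that with the default settings this cache may take up to 90% of physical memory. In a real production environment that default can easily exhaust memory, so the defaults recommended by AIX can themselves be a problem.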

Carelessly Written Predicate Data Types Can Have a Huge Impact on SQL Performance

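While tuning a system recently, I found many SQL statements whose predicates (WHERE conditions) ignore the column data types defined in the table and are written arbitrarily. As a result, an index that could have been used is not, and query performance is very poor. Two cases I ran into are described here.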

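The first case is a predicate that applies a data type conversion to the column, which prevents the CBO from using an index.
The SQL statement is shown below. It totals the medical expenses incurred at a social security center over one year for injuries and fractures.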

select a.hospital_id,
       c.hospital_name,
       count(distinct a.serial_no) rc,
       round(sum(b.real_pay), 2) ylfyze,
       round(sum(case
                   when b.fund_id in ('001') then
                    b.real_pay
                   else
                    0
                 end),
             2) tczc,
       round(sum(case
                   when b.fund_id in ('201') then
                    b.real_pay
                   else
                    0
                 end),
             2) zffy,
       round(sum(case
                   when b.fund_id in ('003', '999') then
                    b.real_pay
                   else
                    0
                 end),
             2) yyzf
  from mt_biz_fin a, mt_pay_record_fin b, bs_hospital c, bs_disease d
 where a.hospital_id = b.hospital_id
   and a.serial_no = b.serial_no
   and a.hospital_id = c.hospital_id
   and a.fin_disease = d.icd
   and d.center_id = a.center_id
   and a.valid_flag = 1
   and b.valid_flag = 1
   and a.biz_type = 12
   and a.pers_type in (1, 2)
   and (d.disease like '%伤%' or d.disease like '%骨折%')
   and a.center_id = '430740'
   and to_char(a.fin_date, 'yyyymmdd') >= '20140101'
   and to_char(a.fin_date, 'yyyymmdd') <= '20141231'
 group by a.hospital_id, c.hospital_name
 order by a.hospital_id

The execution of this SQL is shown below; it took 4 minutes 40 seconds.

SQL> set timing on
SQL> set autotrace traceonly
SQL> select c.hospital_id,
  2         c.hospital_name,
  3         count(distinct a.serial_no) rc,
  4         round(sum(b.real_pay), 2) ylfyze,
  5         round(sum(case
  6                     when b.fund_id in ('001') then
  7                      b.real_pay
  8                     else
  9                      0
 10                   end),
 11               2) tczc,
 12         round(sum(case
 13                     when b.fund_id in ('201') then
 14                      b.real_pay
 15                     else
 16                      0
 17                   end),
 18               2) zffy,
 19         round(sum(case
 20                     when b.fund_id in ('003', '999') then
 21                      b.real_pay
 22                     else
 23                      0
 24                   end),
 25               2) yyzf
 26    from mt_biz_fin a, mt_pay_record_fin b, bs_hospital c, bs_disease d
 27   where a.hospital_id = b.hospital_id
 28     and a.serial_no = b.serial_no
 29     and a.hospital_id = c.hospital_id
 30     and a.fin_disease = d.icd
 31     and d.center_id = a.center_id
 32     and a.valid_flag = 1
 33     and b.valid_flag = 1
 34     and a.biz_type = 12
 35     and a.pers_type in (1, 2)
 36     and (d.disease like '%伤%' or d.disease like '%骨折%')
 37     and a.center_id = '430740'
 38     and to_char(a.fin_date, 'yyyymmdd') >= '20140101'
 39     and to_char(a.fin_date, 'yyyymmdd') <= '20141231'
 40   group by c.hospital_id, c.hospital_name
 41   order by c.hospital_id
 42  ;

Elapsed: 00:04:39.59

Execution Plan
----------------------------------------------------------
Plan hash value: 1467084556

---------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name                 | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |                      |     1 |   148 |  4254  (20)| 00:00:04 |
|   1 |  SORT GROUP BY                   |                      |     1 |   148 |  4254  (20)| 00:00:04 |
|*  2 |   TABLE ACCESS BY INDEX ROWID    | MT_PAY_RECORD_FIN    |     1 |    31 |     1   (0)| 00:00:01 |
|   3 |    NESTED LOOPS                  |                      |     1 |   148 |  4252  (20)| 00:00:04 |
|   4 |     NESTED LOOPS                 |                      |     1 |   117 |  4251  (20)| 00:00:04 |
|   5 |      NESTED LOOPS                |                      |     3 |   252 |  4250  (20)| 00:00:04 |
|   6 |       INDEX FULL SCAN            | IDX_BS_HOSPITAL_NAME |  1227 | 39264 |     2   (0)| 00:00:01 |
|*  7 |       TABLE ACCESS BY INDEX ROWID| MT_BIZ_FIN           |     1 |    52 |     3   (0)| 00:00:01 |
|*  8 |        INDEX RANGE SCAN          | PK_MT_BIZ_FIN        |     1 |       |     3   (0)| 00:00:01 |
|*  9 |      TABLE ACCESS BY INDEX ROWID | BS_DISEASE           |     1 |    33 |     1   (0)| 00:00:01 |
|* 10 |       INDEX RANGE SCAN           | INX_BS_DISEASE_01    |     1 |       |     1   (0)| 00:00:01 |
|* 11 |     INDEX RANGE SCAN             | I_MT_PAY_RECORD_FIN_1|     1 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(TO_NUMBER("B"."VALID_FLAG")=1)
   7 - filter(TO_NUMBER("A"."VALID_FLAG")=1 AND (TO_NUMBER("A"."PERS_TYPE")=1 OR
              TO_NUMBER("A"."PERS_TYPE")=2))
   8 - access("A"."HOSPITAL_ID"="C"."HOSPITAL_ID" AND "A"."CENTER_ID"='430740')
       filter("A"."CENTER_ID"='430740' AND TO_NUMBER("A"."BIZ_TYPE")=12 AND
              TO_CHAR(INTERNAL_FUNCTION("A"."FIN_DATE"),'yyyymmdd')>='20140101' AND
              TO_CHAR(INTERNAL_FUNCTION("A"."FIN_DATE"),'yyyymmdd')<='20141231')
   9 - filter("D"."DISEASE" LIKE '%伤%' OR "D"."DISEASE" LIKE '%骨折%')
  10 - access("D"."CENTER_ID"='430740' AND "A"."FIN_DISEASE"="D"."ICD")
  11 - access("A"."HOSPITAL_ID"="B"."HOSPITAL_ID" AND "A"."SERIAL_NO"="B"."SERIAL_NO")


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
     161233  consistent gets
      83048  physical reads
        624  redo size
       1197  bytes sent via SQL*Net to client
        492  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          5  rows processed

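From table BS_HOSPITAL the query needs only the hospital_id and hospital_name columns, and BS_HOSPITAL has the index IDX_BS_HOSPITAL_NAME(hospital_name, hospital_id). The plan therefore starts with a full scan of IDX_BS_HOSPITAL_NAME and reads the values straight from the index without visiting the table; this is result set 1. MT_BIZ_FIN is then accessed through an index range scan on PK_MT_BIZ_FIN followed by a table access by rowid (result set 2), joined by nested loops with result set 1 as the driving row source. The result is then joined by nested loops to BS_DISEASE and to MT_PAY_RECORD_FIN (through index I_MT_PAY_RECORD_FIN_1), and finally grouped and sorted.

MT_BIZ_FIN actually has a composite index INDI_MT_BIZ_FIN_F_H(FIN_DATE, HOSPITAL_ID, BIZ_TYPE, TREATMENT_TYPE, CENTER_ID), and the query filters on fin_date, hospital_id, biz_type and center_id. However, the fin_date predicate is written as to_char(a.fin_date, 'yyyymmdd') >= '20140101' and to_char(a.fin_date, 'yyyymmdd') <= '20141231', while fin_date (the fee settlement time) is a DATE column. Because fin_date is converted to a string, the index INDI_MT_BIZ_FIN_F_H cannot be used. The condition is therefore rewritten as a.fin_date between to_date('20140101','yyyymmdd') and to_date('20141231','yyyymmdd'). Note that if fin_date carries a time-of-day component, this BETWEEN stops at midnight on 31 December 2014 and drops the rest of that day, which the original to_char form included; a.fin_date >= to_date('20140101','yyyymmdd') and a.fin_date < to_date('20150101','yyyymmdd') keeps the original semantics. The rewritten SQL is as follows: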

select  c.hospital_id,
       c.hospital_name,
       count(distinct a.serial_no) rc,
       round(sum(b.real_pay), 2) ylfyze,
       round(sum(case
                   when b.fund_id in ('001') then
                    b.real_pay
                   else
                    0
                 end),
             2) tczc,
       round(sum(case
                   when b.fund_id in ('201') then
                    b.real_pay
                   else
                    0
                 end),
             2) zffy,
       round(sum(case
                   when b.fund_id in ('003', '999') then
                    b.real_pay
                   else
                    0
                 end),
             2) yyzf
  from mt_biz_fin a, mt_pay_record_fin b, bs_hospital c, bs_disease d
 where a.hospital_id = b.hospital_id
   and a.serial_no = b.serial_no
   and a.hospital_id = c.hospital_id
   and a.fin_disease = d.icd
   and d.center_id = a.center_id
   and a.valid_flag = 1
   and b.valid_flag = 1
   and a.biz_type = 12
   and a.pers_type in (1, 2)
   and (d.disease like '%伤%' or d.disease like '%骨折%')
   and a.center_id = '430740'
   and a.fin_date between to_date('20140101','yyyymmdd') and to_date('20141231','yyyymmdd')
group by c.hospital_id, c.hospital_name
 order by c.hospital_id
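The effect of wrapping an indexed column in a function can be reproduced in miniature. The Python sketch below is an analogy only: it uses SQLite from the standard library rather than Oracle, strftime() stands in for to_char(), the table is a cut-down hypothetical version of mt_biz_fin, and idx_fin_date is an invented index name. It compares the query plans of the two predicate styles.

```python
import sqlite3

# Analogy only: SQLite stands in for Oracle, strftime() for to_char(),
# and this table is a cut-down, hypothetical version of mt_biz_fin.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mt_biz_fin (serial_no TEXT, hospital_id TEXT, fin_date TEXT)"
)
conn.execute("CREATE INDEX idx_fin_date ON mt_biz_fin (fin_date)")

def plan(sql: str) -> str:
    """Return the detail text of the first step of the SQLite query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return rows[0][-1]

# Function applied to the indexed column: the index cannot be searched.
wrapped = plan(
    "SELECT * FROM mt_biz_fin "
    "WHERE strftime('%Y%m%d', fin_date) >= '20140101' "
    "AND strftime('%Y%m%d', fin_date) <= '20141231'"
)

# Bare column compared with constants: a range search on the index is possible.
bare = plan(
    "SELECT * FROM mt_biz_fin "
    "WHERE fin_date >= '2014-01-01' AND fin_date < '2015-01-01'"
)

print("function on column:", wrapped)
print("bare column:       ", bare)
```

In SQLite the first plan is reported as a scan and the second as a search using the index. Oracle's optimizer and plan output differ in detail, but the principle is the same: leave the column bare and convert the literals instead.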

Running it for real gives the result below. Execution time is now stable at 1 to 2 seconds, which meets the customer's requirement.

SQL> set autotrace traceonly
SQL> select  c.hospital_id,
  2         c.hospital_name,
  3         count(distinct a.serial_no) rc,
  4         round(sum(b.real_pay), 2) ylfyze,
  5         round(sum(case
  6                     when b.fund_id in ('001') then
  7                      b.real_pay
  8                     else
  9                      0
 10                   end),
 11               2) tczc,
 12         round(sum(case
 13                     when b.fund_id in ('201') then
 14                      b.real_pay
 15                     else
 16                      0
 17                   end),
 18               2) zffy,
 19         round(sum(case
 20                     when b.fund_id in ('003', '999') then
 21                      b.real_pay
 22                     else
 23                      0
 24                   end),
 25               2) yyzf
 26    from mt_biz_fin a, mt_pay_record_fin b, bs_hospital c, bs_disease d
 27   where a.hospital_id = b.hospital_id
 28     and a.serial_no = b.serial_no
 29     and a.hospital_id = c.hospital_id
 30     and a.fin_disease = d.icd
 31     and d.center_id = a.center_id
 32     and a.valid_flag = 1
 33     and b.valid_flag = 1
 34     and a.biz_type = 12
 35     and a.pers_type in (1, 2)
 36     and (d.disease like '%伤%' or d.disease like '%骨折%')
 37     and a.center_id = '430740'
 38     and a.fin_date between to_date('20140101','yyyymmdd') and to_date('20141231','yyyymmdd')
 39  group by c.hospital_id, c.hospital_name
 40   order by c.hospital_id
 41  ;

Elapsed: 00:00:01.02

Execution Plan
----------------------------------------------------------
Plan hash value: 1467084556

---------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name                 | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |                      |    17 |  2516 |  1529  (15)| 00:00:02 |
|   1 |  SORT GROUP BY                   |                      |    17 |  2516 |  1529  (15)| 00:00:02 |
|*  2 |   TABLE ACCESS BY INDEX ROWID    | MT_PAY_RECORD_FIN    |     1 |    31 |     1   (0)| 00:00:01 |
|   3 |    NESTED LOOPS                  |                      |    17 |  2516 |  1528  (15)| 00:00:02 |
|   4 |     NESTED LOOPS                 |                      |    33 |  3861 |  1521  (15)| 00:00:02 |
|   5 |      NESTED LOOPS                |                      |   354 | 29736 |  1450  (16)| 00:00:02 |
|   6 |       INDEX FULL SCAN            | IDX_BS_HOSPITAL_NAME |  1227 | 39264 |     2   (0)| 00:00:01 |
|*  7 |       TABLE ACCESS BY INDEX ROWID| MT_BIZ_FIN           |     1 |    52 |     1   (0)| 00:00:01 |
|*  8 |        INDEX RANGE SCAN          | INDI_MT_BIZ_FIN_F_H  |     1 |       |     1   (0)| 00:00:01 |
|*  9 |      TABLE ACCESS BY INDEX ROWID | BS_DISEASE           |     1 |    33 |     1   (0)| 00:00:01 |
|* 10 |       INDEX RANGE SCAN           | INX_BS_DISEASE_01    |     1 |       |     1   (0)| 00:00:01 |
|* 11 |     INDEX RANGE SCAN             | I_MT_PAY_RECORD_FIN_1|     1 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter(TO_NUMBER("B"."VALID_FLAG")=1)
   7 - filter(TO_NUMBER("A"."VALID_FLAG")=1 AND (TO_NUMBER("A"."PERS_TYPE")=1 OR
              TO_NUMBER("A"."PERS_TYPE")=2))
   8 - access("A"."HOSPITAL_ID"="C"."HOSPITAL_ID" AND "A"."FIN_DATE">=TO_DATE(' 2014-01-01
              00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND "A"."CENTER_ID"='430740' AND "A"."FIN_DATE"<=TO_DATE('
              2014-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
       filter("A"."CENTER_ID"='430740' AND TO_NUMBER("A"."BIZ_TYPE")=12)
   9 - filter("D"."DISEASE" LIKE '%伤%' OR "D"."DISEASE" LIKE '%骨折%')
  10 - access("D"."CENTER_ID"='430740' AND "A"."FIN_DISEASE"="D"."ICD")
  11 - access("A"."HOSPITAL_ID"="B"."HOSPITAL_ID" AND "A"."SERIAL_NO"="B"."SERIAL_NO")


Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
      71411  consistent gets
          0  physical reads
          0  redo size
       1197  bytes sent via SQL*Net to client
        492  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          5  rows processed


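The second case is an implicit data type conversion in a predicate that prevents index use. The original SQL, shown below, queries the expenditure of one medical institution: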

select  a.hospital_id,
       count(distinct a.serial_no) rc,
       round(sum(b.real_pay), 2) ylfyze,
       round(sum(case
                   when b.fund_id in ('001') then
                    b.real_pay
                   else
                    0
                 end),
             2) tczc,
       round(sum(case
                   when b.fund_id in ('201') then
                    b.real_pay
                   else
                    0
                 end),
             2) zffy,
       round(sum(case
                   when b.fund_id in ('003', '999') then
                    b.real_pay
                   else
                    0
                 end),
             2) yyzf
  from mt_biz_fin a, mt_pay_record_fin b
 where a.hospital_id = b.hospital_id
   and a.serial_no = b.serial_no
   and a.valid_flag = '1'
   and b.valid_flag = '1'
   and a.biz_type = '12'
   and a.pers_type in ('1', '2')    
   and b.hospital_id=4307000231
group by a.hospital_id

The execution plan for this SQL is shown below; it ran for 1 minute 22 seconds:

SQL> set autotrace traceonly
SQL> select  a.hospital_id,
  2         count(distinct a.serial_no) rc,
  3         round(sum(b.real_pay), 2) ylfyze,
  4         round(sum(case
  5                     when b.fund_id in ('001') then
  6                      b.real_pay
  7                     else
  8                      0
  9                   end),
 10               2) tczc,
 11         round(sum(case
 12                     when b.fund_id in ('201') then
 13                      b.real_pay
 14                     else
 15                      0
 16                   end),
 17               2) zffy,
 18         round(sum(case
 19                     when b.fund_id in ('003', '999') then
 20                      b.real_pay
 21                     else
 22                      0
 23                   end),
 24               2) yyzf
 25    from mt_biz_fin a, mt_pay_record_fin b
 26   where a.hospital_id = b.hospital_id
 27     and a.serial_no = b.serial_no
 28     and a.valid_flag = '1'
 29     and b.valid_flag = '1'
 30     and a.biz_type = '12'
 31     and a.pers_type in ('1', '2')    
 32     and b.hospital_id=4307000231
 33  group by a.hospital_id
 34  ;

no rows selected

Elapsed: 00:01:22.20

Execution Plan
----------------------------------------------------------
Plan hash value: 3673479381

--------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name              | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                   |     1 |    61 |   127K (16)| 00:01:56 |
|   1 |  SORT GROUP BY               |                   |     1 |    61 |   127K (16)| 00:01:56 |
|*  2 |   TABLE ACCESS BY INDEX ROWID| MT_BIZ_FIN        |     1 |    30 |     1   (0)| 00:00:01 |
|   3 |    NESTED LOOPS              |                   |    45 |  2745 |   127K (16)| 00:01:56 |
|*  4 |     TABLE ACCESS FULL        | MT_PAY_RECORD_FIN |  8327 |   252K|   123K (16)| 00:01:53 |
|*  5 |     INDEX RANGE SCAN         | PK_MT_BIZ_FIN     |     1 |       |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("A"."BIZ_TYPE"='12' AND "A"."VALID_FLAG"='1' AND
              ("A"."PERS_TYPE"='1' OR "A"."PERS_TYPE"='2'))
   4 - filter(TO_NUMBER("B"."HOSPITAL_ID")=4307000231 AND "B"."VALID_FLAG"='1')
   5 - access("A"."HOSPITAL_ID"="B"."HOSPITAL_ID" AND "A"."SERIAL_NO"="B"."SERIAL_NO")


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
     572386  consistent gets
     383935  physical reads
          0  redo size
        638  bytes sent via SQL*Net to client
        481  bytes received via SQL*Net from client
          1  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          0  rows processed

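The execution plan shows MT_PAY_RECORD_FIN being accessed with a full table scan, even though the table has the index PK_MT_PAY_RECORD_FIN(HOSPITAL_ID, SERIAL_NO). Why was the index not used? The predicate in the query is b.hospital_id=4307000231, and the Predicate Information contains 4 - filter(TO_NUMBER("B"."HOSPITAL_ID")=4307000231). So hospital_id is a character column in the table, while the condition was written with a numeric literal. Oracle applies an implicit data type conversion to the column (TO_NUMBER), and the index on hospital_id can no longer be used. The condition has to be written as b.hospital_id='4307000231'. The modified SQL is: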

select  a.hospital_id,
       count(distinct a.serial_no) rc,
       round(sum(b.real_pay), 2) ylfyze,
       round(sum(case
                   when b.fund_id in ('001') then
                    b.real_pay
                   else
                    0
                 end),
             2) tczc,
       round(sum(case
                   when b.fund_id in ('201') then
                    b.real_pay
                   else
                    0
                 end),
             2) zffy,
       round(sum(case
                   when b.fund_id in ('003', '999') then
                    b.real_pay
                   else
                    0
                 end),
             2) yyzf
  from mt_biz_fin a, mt_pay_record_fin b
 where a.hospital_id = b.hospital_id
   and a.serial_no = b.serial_no
   and a.valid_flag = '1'
   and b.valid_flag = '1'
   and a.biz_type = '12'
   and a.pers_type in ('1', '2')    
   and b.hospital_id='4307000231'
group by a.hospital_id

Running it for real: now that the index can be used, the execution takes only about 0.01 seconds.

SQL> select  a.hospital_id,
  2         count(distinct a.serial_no) rc,
  3         round(sum(b.real_pay), 2) ylfyze,
  4         round(sum(case
  5                     when b.fund_id in ('001') then
  6                      b.real_pay
  7                     else
  8                      0
  9                   end),
 10               2) tczc,
 11         round(sum(case
 12                     when b.fund_id in ('201') then
 13                      b.real_pay
 14                     else
 15                      0
 16                   end),
 17               2) zffy,
 18         round(sum(case
 19                     when b.fund_id in ('003', '999') then
 20                      b.real_pay
 21                     else
 22                      0
 23                   end),
 24               2) yyzf
 25    from mt_biz_fin a, mt_pay_record_fin b
 26   where a.hospital_id = b.hospital_id
 27     and a.serial_no = b.serial_no
 28     and a.valid_flag = '1'
 29     and b.valid_flag = '1'
 30     and a.biz_type = '12'
 31     and a.pers_type in ('1', '2')    
 32     and b.hospital_id='4307000231'
 33  group by a.hospital_id
 34  ;

no rows selected

Elapsed: 00:00:00.01

Execution Plan
----------------------------------------------------------
Plan hash value: 3142857175

------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name                | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                     |     1 |    61 |   115   (1)| 00:00:01 |
|   1 |  SORT GROUP BY                 |                     |     1 |    61 |   115   (1)| 00:00:01 |
|*  2 |   TABLE ACCESS BY INDEX ROWID  | MT_PAY_RECORD_FIN   |     1 |    31 |     1   (0)| 00:00:01 |
|   3 |    NESTED LOOPS                |                     |   139 |  8479 |   115   (1)| 00:00:01 |
|*  4 |     TABLE ACCESS BY INDEX ROWID| MT_BIZ_FIN          |   139 |  4170 |    87   (2)| 00:00:01 |
|*  5 |      INDEX RANGE SCAN          | INDI_MT_BIZ_FIN_H_F |   371 |       |    19   (6)| 00:00:01 |
|*  6 |     INDEX RANGE SCAN           | PK_MT_PAY_RECORD_FIN|     1 |       |     1   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("B"."VALID_FLAG"='1')
   4 - filter("A"."VALID_FLAG"='1' AND ("A"."PERS_TYPE"='1' OR
              "A"."PERS_TYPE"='2'))
   5 - access("A"."HOSPITAL_ID"='4307000231')
       filter("A"."BIZ_TYPE"='12')
   6 - access("B"."HOSPITAL_ID"='4307000231' AND "A"."SERIAL_NO"="B"."SERIAL_NO")


Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        203  consistent gets
          0  physical reads
          0  redo size
        638  bytes sent via SQL*Net to client
        481  bytes received via SQL*Net from client
          1  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          0  rows processed

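The plan above shows that MT_PAY_RECORD_FIN is now accessed through index PK_MT_PAY_RECORD_FIN. The CBO does not access MT_PAY_RECORD_FIN first, however, because predicate transitivity came into play. The line 5 - access("A"."HOSPITAL_ID"='4307000231') in the Predicate Information shows that an index range scan on INDI_MT_BIZ_FIN_H_F runs first, even though the query contains no a.hospital_id='4307000231' condition. That condition is the product of predicate transitivity: given b.hospital_id='4307000231' and a.hospital_id=b.hospital_id, the CBO derives a.hospital_id='4307000231'.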

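While tuning this system I found many SQL statements like these two cases, all caused by ignoring column data types when the SQL was written. The statements come from different developers: some write predicates with the correct data type and some do not. Developers should check the column data type whenever they write a predicate and make the literal match it.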
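
As a starting point for that kind of check, here is a minimal sketch built on standard Oracle dictionary and dynamic performance views. The table and column names come from the queries above. Everything else is an assumption: that you are connected as the schema owner, and that you have select privilege on v$sql_plan. The second query only sees statements still cached in the shared pool, and the LIKE pattern is a rough heuristic, not a complete detector of implicit conversions.

```sql
-- Sketch: look up declared column types before writing literals.
-- Assumes you are connected as the schema owner; otherwise query
-- all_tab_columns and add an owner filter.
select table_name, column_name, data_type, data_length
  from user_tab_columns
 where table_name in ('MT_BIZ_FIN', 'MT_PAY_RECORD_FIN')
   and column_name in ('HOSPITAL_ID', 'VALID_FLAG', 'BIZ_TYPE', 'PERS_TYPE')
 order by table_name, column_name;

-- Sketch: list cached plan lines whose filter shows TO_NUMBER applied
-- to a column, the signature seen in the plans above.
select sql_id, child_number, id, object_name, filter_predicates
  from v$sql_plan
 where filter_predicates like '%TO_NUMBER("%'
 order by sql_id, child_number, id;
```

A column reported as VARCHAR2 or CHAR should be compared with a quoted literal, and a NUMBER column with an unquoted one.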

kksfbc child completion and ksdxexeotherwait causing abnormal CPU usage

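Operators at a customer site reported that the system was too slow to use. When the administrator logged in to the server, CPU usage was at 96%, and it had been that way for several months. The topas screenshot below was captured after logging in, on a weekend when nobody was using the system. The server is an IBM 550 with two 6-core CPUs and 48 GB of memory.
[Figure 1: topas screenshot]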

So I logged in to the database and ran the following script to see what the processes consuming the most CPU were executing:

Connected to Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 
Connected as gtp2
 


SQL> select s.sid,p.SPID,s.username,s.event,s.wait_time,s.state,s.seconds_in_wait,p.PROGRAM,s.MACHINE,
  2  (select  c.SQL_FULLTEXT from v$sqlarea c where c.SQL_ID=s.SQL_ID) sql_fulltext,
  3  (select  c.BIND_DATA from v$sqlarea c where c.SQL_ID=s.SQL_ID) BIND_DATA,s.SQL_ID
  4  from v$session s,v$process p
  5  where p.SPID in(491720,90116,127336,529102,987524,331990)
  6  and s.event not like'%SQL*Net%' and s.USERNAME='GTP2'
  7  order by s.wait_time desc
  8  ;
 
       SID SPID         USERNAME                       EVENT                                                             WAIT_TIME STATE               SECONDS_IN_WAIT PROGRAM                                          MACHINE                                                          SQL_FULLTEXT                                                                     BIND_DATA                                                                        SQL_ID
---------- ------------ ------------------------------ ---------------------------------------------------------------- ---------- ------------------- --------------- ------------------------------------------------ ---------------------------------------------------------------- -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- -------------
      1020 90116        GTP2                           kksfbc child completion                                                  -1 WAITED SHORT TIME             53742 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
      1020 987524       GTP2                           kksfbc child completion                                                  -1 WAITED SHORT TIME             53742 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
      1020 331990       GTP2                           kksfbc child completion                                                  -1 WAITED SHORT TIME             53742 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
      1020 491720       GTP2                           kksfbc child completion                                                  -1 WAITED SHORT TIME             53742 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
 
4 rows selected
 
SQL> select s.sid,p.SPID,s.username,s.event,s.wait_time,s.state,s.seconds_in_wait,p.PROGRAM,s.MACHINE,
  2  (select  c.SQL_FULLTEXT from v$sqlarea c where c.SQL_ID=s.SQL_ID) sql_fulltext,
  3  (select  c.BIND_DATA from v$sqlarea c where c.SQL_ID=s.SQL_ID) BIND_DATA,s.SQL_ID
  4  from v$session s,v$process p
  5  where p.SPID in(491720,90116,127336,529102,987524,331990)
  6  and s.event not like'%SQL*Net%' and s.USERNAME='GTP2'
  7  order by s.wait_time desc
  8  ;
 
       SID SPID         USERNAME                       EVENT                                                             WAIT_TIME STATE               SECONDS_IN_WAIT PROGRAM                                          MACHINE                                                          SQL_FULLTEXT                                                                     BIND_DATA                                                                        SQL_ID
---------- ------------ ------------------------------ ---------------------------------------------------------------- ---------- ------------------- --------------- ------------------------------------------------ ---------------------------------------------------------------- -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- -------------
      1020 90116        GTP2                           ksdxexeotherwait                                                         -1 WAITED SHORT TIME              3342 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
      1020 987524       GTP2                           ksdxexeotherwait                                                         -1 WAITED SHORT TIME              3342 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
      1020 331990       GTP2                           ksdxexeotherwait                                                         -1 WAITED SHORT TIME              3342 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
      1020 491720       GTP2                           ksdxexeotherwait                                                         -1 WAITED SHORT TIME              3342 oracleorcl@dbserv                                WORKGROUP\WIN-AUQ43P0UU9L                                                                                                                                                                                                          063cu7y841kmc
 
4 rows selected
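
A note on the query used above: it lists v$session and v$process without a join condition, so every matching session row is paired with every matching process row. That appears to be why SID 1020 shows up against four different SPIDs. The sketch below adds the usual s.paddr = p.addr join and quotes the SPID values, since v$process.SPID is a character column. It is a suggested variant, not the statement that produced the output above, and the trimmed column list is my own choice.

```sql
-- Sketch: one row per session, matched to its own server process.
select s.sid, p.spid, s.username, s.event, s.state,
       s.seconds_in_wait, s.sql_id
  from v$session s, v$process p
 where s.paddr = p.addr
   and p.spid in ('491720', '90116', '127336', '529102', '987524', '331990')
 order by s.seconds_in_wait desc;
```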
 

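The output shows these processes waiting on kksfbc child completion and ksdxexeotherwait. The first suspicion in a case like this is a bug, and searching MOS for the keyword KKSFBC CHILD COMPLETION does turn one up. Its symptoms are a process that spins continuously and hangs, the 'kksfbc child completion' wait event, and possibly 'cursor: pin S' waits as well. The directly affected versions are 11.1.0.6, 10.2.0.3 and 10.2.0.4, whereas the version here is 10.2.0.1.
[Figure 2: MOS search result for the bug]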

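According to the bug description, after a 'kksfbc child completion' wait the session falls into endless spinning, which happens when the stack call kksSearchChildList->kkshgnc loops forever inside kksSearchChildList. A more detailed call stack was needed, so I traced operating system process 90116.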

SQL> oradebug setospid 90116
Oracle pid: 40, Unix process pid: 90116, image: oracleorcl@dbserv
SQL> oradebug unlimit;
Statement processed.
SQL> oradebug short_stack;
ksdxfstk+002c<-ksdxcb+04e4<-sspuser+0068<-00004750<-kksfbc+0bb0<-kkspsc0+0f3c<-kksParseCursor+00d4<-opiosq0+0b10<-kpooprx+0168<-kpoal8+0400<-opiodr+0adc<-ttcpip+1004<-opitsk+1000<-opiino+0990<-opiodr+0adc<-opidrv+0474<-sou2o+0090<-opimai_real+01bc<-main+0098<-__start+0070
SQL> oradebug dump processstate 10;
Statement processed.
SQL>  oradebug dump systemstate 266;
Statement processed.
SQL> oradebug tracefile_name
/oracle/admin/orcl/udump/orcl_ora_90116.trc

The generated trace file orcl_ora_90116.trc contains the following:

SO: 7000001486ab188, type: 4, owner: 70000014346c5a8, flag: INIT/-/-/0x00
    (session) sid: 1020 trans: 0, creator: 70000014346c5a8, flag: (41) USR/- BSY/-/-/-/-/-
              DID: 0000-0000-00000000, short-term DID: 0000-0000-00000000
              txn branch: 0
              oct: 0, prv: 0, sql: 7000001473dcf10, psql: 7000001225ac0c8, user: 82/GTP2
    O/S info: user: gtp-default, term: WIN-AUQ43P0UU9L, ospid: 6708:12196, machine: WORKGROUP\WIN-AUQ43P0UU9L
              program: w3wp.exe
    last wait for 'kksfbc child completion' blocking sess=0x0 seq=2831 wait_time=48850 seconds since wait started=572057
                =0, =0, =0
    Dumping Session Wait History
     for 'kksfbc child completion' count=1 wait_time=48850
                =0, =0, =0

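The trace shows that the session did spend a long time in the 'kksfbc child completion' wait and then fell into endless spinning that burned a large amount of CPU time. The actual behavior differs from the bug description, though: the function looping forever is kksfbc rather than kksSearchChildList (the usual call sequence is kksParseCursor->kkspsc0->kksfbc->kksSearchChildList->kkshgnc). kksfbc stands for K[Kernel]K[Kompile]S[Shared]F[Find]B[Best]C[Child]; the function looks for a suitable child cursor during a soft parse. From 10.2.0.2 onward, mutexes replaced the original cursor pin mechanism, a mutex being more lightweight than a latch. Although mutexes changed much of the cursor pin internals, kksfbc still has to hold library cache latches in order to scan the library cache hash chains. In addition, once kksfbc finds a suitable child cursor under a parent cursor, it may pin that child cursor with the KKSCHLPINx method, which requires holding the child cursor's mutex in exclusive mode.

Oracle provides one-off patch 8575528 for this bug on 10.2.0.4. Its equivalent from 10.2.0.4 PSU4 onward is merge patch 9696904, described as: "8557428 9696904 7527908 Both fixes are needed. 6795880 superceded by 8575528 in 9696904 which includes extra files so may cause new conflicts". Merge patch 9696904 is currently available only for Linux x86/64, however, while the problem database runs on IBM AIX on POWER Systems (64-bit) at version 10.2.0.1. That does not leave us without a fix: upgrading the database from 10.2.0.1 to 10.2.0.5 resolves the bug, and after the upgrade to 10.2.0.5 the problem was indeed gone.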
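
Since kksfbc walks the child cursors of a parent cursor, it can be useful to see how many children the statement in question has and to confirm the exact version before matching the case against a bug note. The sketch below uses the SQL_ID reported for session 1020; the choice of columns is my own. A high version_count would not by itself prove this bug, so treat the output as supporting evidence only.

```sql
-- Sketch: child cursor count for the statement the spinning session was parsing.
select sql_id, version_count, loaded_versions, executions
  from v$sqlarea
 where sql_id = '063cu7y841kmc';

-- Confirm the exact database version before comparing with the affected releases.
select banner from v$version;
```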

Weblogic BEA-141281 unable to get file lock, will retry: troubleshooting

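Today an application server at a sister organization had to be moved from the test environment to the machine room and given a new IP address. The server was shut down before the move. When weblogic was started afterwards, the weblogic process came up but the console could not be reached, so the data source settings (that is, the jdbc connection string) could not be changed. The steps were as follows: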

Run the startup script manually (another colleague had in fact already run it once):

[root@cdydtest bin]# ./startWebLogic.sh
.
.
JAVA Memory arguments: -Xms300m -Xmx300m -XX:CompileThreshold=8000 -XX:PermSize=256m  -XX:MaxPermSize=256m
.
WLS Start Mode=Development
.
CLASSPATH=/usr/bea/patch_wls1036/profiles/default/sys_manifest_classpath/weblogic_patch.jar:/usr/bea/jdk1.6.0_20/lib/tools.jar:/usr/bea/wlserver_10.3/server/lib/weblogic_sp.jar:/usr/bea/wlserver_10.3/server/lib/weblogic.jar:/usr/bea/modules/features/weblogic.server.modules_10.3.6.0.jar:/usr/bea/wlserver_10.3/server/lib/webservices.jar:/usr/bea/modules/org.apache.ant_1.7.1/lib/ant-all.jar:/usr/bea/modules/net.sf.antcontrib_1.1.0.0_1-0b2/lib/ant-contrib.jar:/usr/bea/wlserver_10.3/common/derby/lib/derbyclient.jar:/usr/bea/wlserver_10.3/server/lib/xqrl.jar
.
PATH=/usr/bea/wlserver_10.3/server/bin:/usr/bea/modules/org.apache.ant_1.7.1/bin:/usr/bea/jdk1.6.0_20/jre/bin:/usr/bea/jdk1.6.0_20/bin:/usr/lib64/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
.
***************************************************
*  To start WebLogic Server, use a username and   *
*  password assigned to an admin-level user.  For *
*  server administration, use the WebLogic Server *
*  console at http://hostname:port/console        *
***************************************************
starting weblogic with Java version:
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
Starting WLS with line:
/usr/bea/jdk1.6.0_20/bin/java -client   -Xms300m -Xmx300m -XX:CompileThreshold=8000 -XX:PermSize=256m  -XX:MaxPermSize=256m -Dweblogic.Name=AdminServer -Djava.security.policy=/usr/bea/wlserver_10.3/server/lib/weblogic.policy  -Xverify:none  -da -Dplatform.home=/usr/bea/wlserver_10.3 -Dwls.home=/usr/bea/wlserver_10.3/server -Dweblogic.home=/usr/bea/wlserver_10.3/server   -Dweblogic.management.discover=true  -Dwlw.iterativeDev= -Dwlw.testConsole= -Dwlw.logErrorsToConsole= -Dweblogic.ext.dirs=/usr/bea/patch_wls1036/profiles/default/sysext_manifest_classpath  weblogic.Server

The output is redirected to a log file (47ggzj.log), so the full log is not shown here.

Check for java processes to determine whether weblogic has started:

[root@cdydtest bin]# ps -ef | grep java
root     11933     1  0 14:23 ?        00:00:17 /usr/bea/jdk1.6.0_20/bin/java -client -Xms300m -Xmx300m -XX:CompileThreshold=8000 -XX:PermSize=256m -XX:MaxPermSize=256m -Dweblogic.Name=AdminServer -Djava.security.policy=/usr/bea/wlserver_10.3/server/lib/weblogic.policy -Xverify:none -da -Dplatform.home=/usr/bea/wlserver_10.3 -Dwls.home=/usr/bea/wlserver_10.3/server -Dweblogic.home=/usr/bea/wlserver_10.3/server -Dweblogic.management.discover=true -Dwlw.iterativeDev= -Dwlw.testConsole= -Dwlw.logErrorsToConsole= -Dweblogic.ext.dirs=/usr/bea/patch_wls1036/profiles/default/sysext_manifest_classpath weblogic.Server

root     14675     1 20 15:10 pts/3    00:00:01 /usr/bea/jdk1.6.0_20/bin/java -client -Xms300m -Xmx300m -XX:CompileThreshold=8000 -XX:PermSize=256m -XX:MaxPermSize=256m -Dweblogic.Name=AdminServer -Djava.security.policy=/usr/bea/wlserver_10.3/server/lib/weblogic.policy -Xverify:none -da -Dplatform.home=/usr/bea/wlserver_10.3 -Dwls.home=/usr/bea/wlserver_10.3/server -Dweblogic.home=/usr/bea/wlserver_10.3/server -Dweblogic.management.discover=true -Dwlw.iterativeDev= -Dwlw.testConsole= -Dwlw.logErrorsToConsole= -Dweblogic.ext.dirs=/usr/bea/patch_wls1036/profiles/default/sysext_manifest_classpath weblogic.Server
root     14695 14567  0 15:10 pts/3    00:00:00 grep java

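The output shows two weblogic processes running (pids 11933 and 14675), because two colleagues each ran the startup script by hand. The weblogic console still could not be reached, so I checked the weblogic log file.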

[root@cdydtest base_domain]# cat 47ggzj.log
<Nov 19, 2015 3:10:34 PM CST> <Info> <Security> <BEA-090905> <Disabling CryptoJ JCE Provider self-integrity check for better startup performance. To enable this check, specify -Dweblogic.security.allowCryptoJDefaultJCEVerification=true>
<Nov 19, 2015 3:10:34 PM CST> <Info> <Security> <BEA-090906> <Changing the default Random Number Generator in RSA CryptoJ from ECDRBG to FIPS186PRNG. To disable this change, specify -Dweblogic.security.allowCryptoJDefaultPRNG=true>
<Nov 19, 2015 3:10:35 PM CST> <Info> <WebLogicServer> <BEA-000377> <Starting WebLogic Server with Java HotSpot(TM) 64-Bit Server VM Version 16.3-b01 from Sun Microsystems Inc.>
<Nov 19, 2015 3:10:45 PM CST> <Info> <Management> <BEA-141281> <unable to get file lock, will retry ...>
<Nov 19, 2015 3:10:55 PM CST> <Info> <Management> <BEA-141281> <unable to get file lock, will retry ...>
<Nov 19, 2015 3:11:05 PM CST> <Info> <Management> <BEA-141281> <unable to get file lock, will retry ...>
<Nov 19, 2015 3:11:15 PM CST> <Info> <Management> <BEA-141281> <unable to get file lock, will retry ...>
<Nov 19, 2015 3:11:25 PM CST> <Info> <Management> <BEA-141281> <unable to get file lock, will retry ...>
<Nov 19, 2015 3:11:35 PM CST> <Info> <Management> <BEA-141281> <unable to get file lock, will retry ...>
<Nov 19, 2015 3:11:45 PM CST> <Info> <Management> <BEA-141281> <unable to get file lock, will retry ...>
<Nov 19, 2015 3:11:55 PM CST> <Info> <Management> <BEA-141281> <unable to get file lock, will retry ...>
<Nov 19, 2015 3:12:05 PM CST> <Info> <Management> <BEA-141281> <unable to get file lock, will retry ...>
<Nov 19, 2015 3:12:15 PM CST> <Info> <Management> <BEA-141281> <unable to get file lock, will retry ...>
<Nov 19, 2015 3:12:25 PM CST> <Info> <Management> <BEA-141281> <unable to get file lock, will retry ...>
<Nov 19, 2015 3:12:35 PM CST> <Info> <Management> <BEA-141281> <unable to get file lock, will retry ...>
<Nov 19, 2015 3:12:45 PM CST> <Info> <Management> <BEA-141281> <unable to get file lock, will retry ...>
<Nov 19, 2015 3:12:55 PM CST> <Info> <Management> <BEA-141281> <unable to get file lock, will retry ...>
<Nov 19, 2015 3:13:06 PM CST> <Info> <Management> <BEA-141281> <unable to get file lock, will retry ...>

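The messages show that the file lock cannot be obtained: the weblogic process is running but cannot do any work. The root of the problem is the jdbc connection that needed changing. The database server's IP address had changed, but the jdbc data source configured in weblogic still pointed at the old address, so weblogic kept trying to connect during startup. During that time the weblogic service has not finished starting and the console cannot be reached. Once the default number of connection attempts is exhausted, weblogic gives up and carries on with the rest of the startup, but that takes a while. Meanwhile business users reported that they could not log in to the system. A colleague found that the weblogic console could not be reached either and ran the startup script a second time, which produced the file lock error. The fix is to kill both weblogic processes, delete the locked AdminServer.lok file, and run the weblogic startup script again, after which it starts normally.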

Delete the locked AdminServer.lok file:

[root@cdydtest /]#cd /usr/bea/user_projects/domains/base_domain/servers/AdminServer/tmp

[root@cdydtest tmp]# ls
AdminServer.lok  WebServiceUtils.ser  _WL_internal  _WL_user
[root@cdydtest tmp]# rm AdminServer.lok
rm: remove regular empty file `AdminServer.lok'? y

Run the startup script manually:

[root@cdydtest bin]# ./startWebLogic.sh
.
.
JAVA Memory arguments: -Xms300m -Xmx300m -XX:CompileThreshold=8000 -XX:PermSize=256m  -XX:MaxPermSize=256m
.
WLS Start Mode=Development
.
CLASSPATH=/usr/bea/patch_wls1036/profiles/default/sys_manifest_classpath/weblogic_patch.jar:/usr/bea/jdk1.6.0_20/lib/tools.jar:/usr/bea/wlserver_10.3/server/lib/weblogic_sp.jar:/usr/bea/wlserver_10.3/server/lib/weblogic.jar:/usr/bea/modules/features/weblogic.server.modules_10.3.6.0.jar:/usr/bea/wlserver_10.3/server/lib/webservices.jar:/usr/bea/modules/org.apache.ant_1.7.1/lib/ant-all.jar:/usr/bea/modules/net.sf.antcontrib_1.1.0.0_1-0b2/lib/ant-contrib.jar:/usr/bea/wlserver_10.3/common/derby/lib/derbyclient.jar:/usr/bea/wlserver_10.3/server/lib/xqrl.jar
.
PATH=/usr/bea/wlserver_10.3/server/bin:/usr/bea/modules/org.apache.ant_1.7.1/bin:/usr/bea/jdk1.6.0_20/jre/bin:/usr/bea/jdk1.6.0_20/bin:/usr/lib64/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/root/bin
.
***************************************************
*  To start WebLogic Server, use a username and   *
*  password assigned to an admin-level user.  For *
*  server administration, use the WebLogic Server *
*  console at http://hostname:port/console        *
***************************************************
starting weblogic with Java version:
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
Starting WLS with line:
/usr/bea/jdk1.6.0_20/bin/java -client   -Xms300m -Xmx300m -XX:CompileThreshold=8000 -XX:PermSize=256m  -XX:MaxPermSize=256m -Dweblogic.Name=AdminServer -Djava.security.policy=/usr/bea/wlserver_10.3/server/lib/weblogic.policy  -Xverify:none  -da -Dplatform.home=/usr/bea/wlserver_10.3 -Dwls.home=/usr/bea/wlserver_10.3/server -Dweblogic.home=/usr/bea/wlserver_10.3/server   -Dweblogic.management.discover=true  -Dwlw.iterativeDev= -Dwlw.testConsole= -Dwlw.logErrorsToConsole= -Dweblogic.ext.dirs=/usr/bea/patch_wls1036/profiles/default/sysext_manifest_classpath  weblogic.Server

Check the log:

[root@cdydtest base_domain]# cat 47ggzj.log
2015-11-19 15:23:46 Initializing Insur_CHANGDE from init-parameters:
PowerSI Version v0.2.9(Build20140901)
starttime:2015-11-19 15:23:46
trace_busiconn:1
trace_dbconn:0
hostname:cdydtest
jdbclogger:/usr/bea/user_projects/domains/base_domain/applications/Insur_CHANGDE/WEB-INF/jdbclogger.properties
homedir:/usr/bea/user_projects/domains/base_domain/applications/Insur_CHANGDE
companyname:????
jdbclogger.maxBatchCount:10
jdbclogger.minRuntime:1000
jdbclogger.needCaller:true
service_centerid:
applicationname:Insur_CHANGDE
scheduler_flag:0
logger:/usr/bea/user_projects/domains/base_domain/applications/Insur_CHANGDE/WEB-INF/log4j.properties
log_level:0
instancename:cdydtest.Insur_CHANGDE

<Nov 19, 2015 3:23:46 PM CST> <Notice> <LoggingService> <BEA-320400> <The log file /usr/bea/user_projects/domains/base_domain/servers/AdminServer/logs/base_domain.log will be rotated. Reopen the log file if tailing has stopped. This can happen on some platforms like Windows.>
<Nov 19, 2015 3:23:46 PM CST> <Notice> <LoggingService> <BEA-320401> <The log file has been rotated to /usr/bea/user_projects/domains/base_domain/servers/AdminServer/logs/base_domain.log00027. Log messages will continue to be logged in /usr/bea/user_projects/domains/base_domain/servers/AdminServer/logs/base_domain.log.>
<Nov 19, 2015 3:23:46 PM CST> <Notice> <Log Management> <BEA-170027> <The Server has established connection with the Domain level Diagnostic Service successfully.>
<Nov 19, 2015 3:23:46 PM CST> <Notice> <WebLogicServer> <BEA-000365> <Server state changed to ADMIN>
<Nov 19, 2015 3:23:46 PM CST> <Notice> <WebLogicServer> <BEA-000365> <Server state changed to RESUMING>
<Nov 19, 2015 3:23:46 PM CST> <Notice> <Server> <BEA-002613> <Channel "Default" is now listening on 10.138.130.251:7001 for protocols iiop, t3, ldap, snmp, http.>
<Nov 19, 2015 3:23:46 PM CST> <Notice> <Server> <BEA-002613> <Channel "Default[1]" is now listening on 127.0.0.1:7001 for protocols iiop, t3, ldap, snmp, http.>
<Nov 19, 2015 3:23:46 PM CST> <Notice> <WebLogicServer> <BEA-000331> <Started WebLogic Admin Server "AdminServer" for domain "base_domain" running in Development Mode>
<Nov 19, 2015 3:23:46 PM CST> <Notice> <WebLogicServer> <BEA-000365> <Server state changed to RUNNING>
<Nov 19, 2015 3:23:46 PM CST> <Notice> <WebLogicServer> <BEA-000360> <Server started in RUNNING mode>

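The output above shows that WebLogic started successfully. The root cause of this failure was operator carelessness: when the WebLogic console could not be reached, the operator ran the start script without first checking whether a WebLogic server was already running, and that is what caused the problem. When troubleshooting, always get a clear picture of the situation, gather the necessary information, and understand the cause before acting.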
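
The check that was skipped here can be sketched in a few lines. This is a minimal illustration, not part of the original procedure: it assumes the AdminServer listens on port 7001 of the local host (the port shown in the log above), and the function name port_in_use is hypothetical.

```python
import socket


def port_in_use(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 7001 is the AdminServer listen port from the log above (an assumption
    # for any other environment).
    if port_in_use("127.0.0.1", 7001):
        print("port 7001 is in use; a WebLogic instance may already be running")
    else:
        print("nothing is listening on port 7001")
```

A listening port only suggests that a server is up; confirming the owning process (for example with ps) is still worthwhile before deciding whether to run the start script.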

AIX JFS2 Filesystem Concurrent Mount Protection 0506-365 Failure

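A friend was helping a customer create a copy of a database for testing. The operating system was AIX and the storage was a V7000, so the copy was made with the storage subsystem's FlashCopy function. FlashCopy is one of the functions supported by IBM ESS storage servers and is mainly used for local backup and recovery. At a point in time T0, FlashCopy establishes a mapping between the source LUN and the target LUN; afterwards, whenever a data block (512 bytes) on the source LUN is updated, the original contents of that block are first copied to the target LUN. FlashCopy therefore preserves an image of the data as of T0, and if the data was complete and consistent at T0, the target LUN can be used for backup and recovery. However, mounting the filesystem on the target LUN failed with the following error: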

#mount /oracle/EP1/origlogB
mount: /dev/origlogBlv on /oracle/EP1/origlogB
0506-365 Cannot mount guarded filesystem.
The filesystem is potentially mounted on another node.

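Although AIX PowerHA allows volume groups to be accessed concurrently from multiple systems, mounting a JFS2 filesystem on several nodes at the same time corrupts the filesystem. Such simultaneous mounts can also crash the system when it detects that data or metadata on disk conflicts with the filesystem state held in memory. The only exception is a read-only mount, where no file or directory is modified.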

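AIX 7100-01 and 6100-07 introduced a feature called "Mount Guard" that prevents simultaneous or concurrent mounts of the same filesystem. When the feature is enabled for a filesystem and that filesystem appears to be mounted on another node, AIX refuses to mount it on any other node. Mount Guard is disabled by default and has to be enabled by the system administrator. It cannot be set on base operating system filesystems such as /, /usr and /var.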

Enabling Mount Guard
To enable Mount Guard permanently for a filesystem, use /usr/sbin/chfs:

# chfs -a mountguard=yes /mountpoint

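/mountpoint is now guarded against concurrent mounts. The same option can also be passed to crfs when the filesystem is created. To disable the protection permanently, run chfs again with mountguard=no: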

# chfs -a mountguard=no /mountpoint

/mountpoint is then no longer guarded, and concurrent mounts are not prevented.

To check the Mount Guard state of a filesystem, run:

# lsfs -q /mountpoint
Name            Nodename   Mount Pt               VFS   Size    Options    Auto Accounting
/dev/fslv34     --         /mountpoint            jfs2  4194304 rw         no   no
  (lv size: 4194304, fs size: 4194304, block size: 4096, sparse files: yes, inline log: no, inline log size: 0, EAformat: v1, Quota: no, DMAPI: no, VIX: yes, EFS: no, ISNAPSHOT: no, MAXEXT: 0, MountGuard: yes)

The /usr/sbin/mount command does not display the Mount Guard state.
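
Since the state only appears in the lsfs -q attribute list, a script that needs it has to extract the MountGuard field from that text. The following is a minimal sketch that assumes the output format shown above; the function name mount_guard_state is hypothetical, and the sample string is copied from the lsfs -q listing.

```python
import re


def mount_guard_state(lsfs_output: str):
    """Return 'yes' or 'no' from `lsfs -q` text, or None if the field is absent."""
    match = re.search(r"MountGuard:\s*(\w+)", lsfs_output)
    return match.group(1).lower() if match else None


# Sample copied from the lsfs -q output shown above.
sample = (
    "/dev/fslv34     --         /mountpoint            jfs2  4194304 rw         no   no\n"
    "  (lv size: 4194304, fs size: 4194304, block size: 4096, sparse files: yes, "
    "inline log: no, inline log size: 0, EAformat: v1, Quota: no, DMAPI: no, "
    "VIX: yes, EFS: no, ISNAPSHOT: no, MAXEXT: 0, MountGuard: yes)\n"
)
print(mount_guard_state(sample))
```

Returning None when the field is missing keeps the caller from confusing "not guarded" with "this AIX level does not report the attribute".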

Mounting a guarded filesystem
When a guarded filesystem is mounted concurrently, the second mount fails with the following error:

# mount /mountpoint
mount: /dev/fslv34 on /mountpoint:
Cannot mount guarded filesystem.
The filesystem is potentially mounted on another node

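After a system crash the filesystem may still carry the mount guard flag and refuse to be mounted. In that case, the guard can be overridden temporarily by mounting with the "noguard" option: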

# mount -o noguard /mountpoint
mount: /dev/fslv34 on /mountpoint:
Mount guard override for filesystem.
The filesystem is potentially mounted on another node.

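In this case the data was copied with FlashCopy, which clones the LUN and therefore the filesystem on it. The original filesystem had Mount Guard enabled, so the copy is guarded as well. There are two ways to mount the target filesystem:
1. Disable Mount Guard on the target filesystem.
2. Use mount -o noguard to override Mount Guard temporarily.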

Handling "rootcrs.pl execution failed" when running root.sh for Oracle 11gR2 Grid Infrastructure

On Oracle 11.2.0.4 on Redhat Linux 6.1, running the /u01/app/product/11.2.0/crs/root.sh script reported the following errors:

/u01/app/product/11.2.0/crs/bin/srvctl start nodeapps -n beiku1 ... failed
FirstNode configuration failed at /u01/app/product/11.2.0/crs/crs/install/crsconfig_lib.pm line 9379.
/u01/app/product/11.2.0/crs/perl/bin/perl -I/u01/app/product/11.2.0/crs/perl/lib -I/u01/app/product/11.2.0/crs/crs/install /u01/app/product/11.2.0/crs/crs/install/rootcrs.pl execution failed

The error shows that srvctl start nodeapps -n beiku1 failed, so try running that command manually:

[grid@beiku1 bin]$ ./srvctl start nodeapps -n beiku1
PRCR-1013 : Failed to start resource ora.ons
PRCR-1064 : Failed to start resource ora.ons on node beiku1
CRS-5016: Process "/u01/app/product/11.2.0/crs/opmn/bin/onsctli" spawned by agent "/u01/app/product/11.2.0/crs/bin/oraagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/u01/app/product/11.2.0/crs/log/beiku1/agent/crsd/oraagent_grid/oraagent_grid.log"
CRS-2674: Start of 'ora.ons' on 'beiku1' failed

The error is Start of 'ora.ons' on 'beiku1' failed, so check the $ORACLE_HOME/cfgtoollogs/crsconfig/rootcrs_$HOSTNAME.log file:

[grid@beiku1 crs]$ cd $ORACLE_HOME/cfgtoollogs/crsconfig/
[grid@beiku1 crsconfig]$ ls -lrt
total 332
-rwxrwxr-x 1 grid oinstall  81336 Aug 26 15:36 srvmcfg0.log
-rwxrwxr-x 1 grid oinstall  18719 Aug 26 15:36 srvmcfg1.log
-rwxrwxr-x 1 grid oinstall  23213 Aug 26 15:36 srvmcfg2.log
-rwxrwxr-x 1 grid oinstall  24700 Aug 26 15:36 srvmcfg3.log
-rwxrwxr-x 1 grid oinstall  10705 Aug 26 15:36 srvmcfg4.log
-rwxrwxr-x 1 grid oinstall  25594 Aug 26 15:37 srvmcfg5.log
-rwxrwxr-x 1 grid oinstall 132771 Aug 26 15:37 rootcrs_beiku1.log
[grid@beiku1 crsconfig]$ cat rootcrs_beiku1.log
2015-08-26 15:36:52: J2EE (OC4J) Container Resource Add Wallet ... passed ...
2015-08-26 15:36:52: Running as user grid: /u01/app/product/11.2.0/crs/bin/qosctl -autogenerate
2015-08-26 15:36:52: s_run_as_user2: Running /bin/su grid -c ' /u01/app/product/11.2.0/crs/bin/qosctl -autogenerate '
2015-08-26 15:36:54: Removing file /tmp/fileoriV8Q
2015-08-26 15:36:54: Successfully removed file: /tmp/fileoriV8Q
2015-08-26 15:36:54: /bin/su successfully executed

2015-08-26 15:36:54: qosctl output: User qosadmin added successfully.

User oc4jadmin added successfully.

2015-08-26 15:36:54: Running as user grid: /u01/app/product/11.2.0/crs/bin/crsctl query wallet -type APPQOSADMIN -user oc4jadmin
2015-08-26 15:36:54: s_run_as_user2: Running /bin/su grid -c ' /u01/app/product/11.2.0/crs/bin/crsctl query wallet -type APPQOSADMIN -user oc4jadmin '
2015-08-26 15:36:55: Removing file /tmp/fileHsIIY7
2015-08-26 15:36:55: Successfully removed file: /tmp/fileHsIIY7
2015-08-26 15:36:55: /bin/su successfully executed

2015-08-26 15:36:55: Running as user grid: /u01/app/product/11.2.0/crs/bin/crsctl query wallet -type APPQOSADMIN -user qosadmin
2015-08-26 15:36:55: s_run_as_user2: Running /bin/su grid -c ' /u01/app/product/11.2.0/crs/bin/crsctl query wallet -type APPQOSADMIN -user qosadmin '
2015-08-26 15:36:55: Removing file /tmp/fileQXtLZo
2015-08-26 15:36:55: Successfully removed file: /tmp/fileQXtLZo
2015-08-26 15:36:55: /bin/su successfully executed

2015-08-26 15:36:55: Invoking "/u01/app/product/11.2.0/crs/bin/srvctl add cvu"
2015-08-26 15:36:55: trace file=/u01/app/product/11.2.0/crs/cfgtoollogs/crsconfig/srvmcfg5.log
2015-08-26 15:36:55: Running as user grid: /u01/app/product/11.2.0/crs/bin/srvctl add cvu
2015-08-26 15:36:55:   Invoking "/u01/app/product/11.2.0/crs/bin/srvctl add cvu" as user "grid"
2015-08-26 15:36:55: Executing /bin/su grid -c "/u01/app/product/11.2.0/crs/bin/srvctl add cvu"
2015-08-26 15:36:55: Executing cmd: /bin/su grid -c "/u01/app/product/11.2.0/crs/bin/srvctl add cvu"
2015-08-26 15:36:57: add cvu ... success
2015-08-26 15:36:57: starting nodeapps...
2015-08-26 15:36:57: DHCP_flag=0
2015-08-26 15:36:57: nodes_to_start=beiku1
2015-08-26 15:37:18: exit value of start nodeapps/vip is 1
2015-08-26 15:37:18: output for start nodeapps is  PRCR-1013 : Failed to start resource ora.ons PRCR-1064 : Failed to start resource ora.ons on node beiku1 CRS-5016: Process "/u01/app/product/11.2.0/crs/opmn/bin/onsctli" spawned by agent "/u01/app/product/11.2.0/crs/bin/oraagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/u01/app/product/11.2.0/crs/log/beiku1/agent/crsd/oraagent_grid/oraagent_grid.log" CRS-2674: Start of 'ora.ons' on 'beiku1' failed
2015-08-26 15:37:18: output of startnodeapp after removing already started mesgs is PRCR-1013 : Failed to start resource ora.ons PRCR-1064 : Failed to start resource ora.ons on node beiku1 CRS-5016: Process "/u01/app/product/11.2.0/crs/opmn/bin/onsctli" spawned by agent "/u01/app/product/11.2.0/crs/bin/oraagent.bin" for action "start" failed: details at "(:CLSN00010:)" in "/u01/app/product/11.2.0/crs/log/beiku1/agent/crsd/oraagent_grid/oraagent_grid.log" CRS-2674: Start of 'ora.ons' on 'beiku1' failed
2015-08-26 15:37:18: /u01/app/product/11.2.0/crs/bin/srvctl start nodeapps -n beiku1 ... failed

Check the $GRID_HOME/opmn/logs/ons.log.* files for one of the following errors:
1.

[grid@beiku1 oraagent_grid]$ cd $ORACLE_HOME/opmn/logs/
[grid@beiku1 logs]$ ls -lrt
total 8
-rw-r--r-- 1 grid oinstall 576 Aug 26 15:48 ons.log.beiku1
-rw-r--r-- 1 grid oinstall 267 Aug 26 15:48 ons.out
[grid@beiku1 logs]$ cat ons.log.beiku1
[2015-08-26T15:37:02+08:00] [internal] getaddrinfo(::0, 6200, 1) failed (Hostname and service name not provided or found): Connection timed out

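If this error is present, the cause is that localhost does not resolve to 127.0.0.1 in /etc/hosts. The fix is to make sure that localhost is set up correctly in DNS and in /etc/hosts. Whether DNS or /etc/hosts is consulted depends on the name resolution settings in /etc/nsswitch.conf or /etc/netsvc.conf, depending on the platform. See MOS notes ID 942166.1 and ID 969254.1 for details.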

2.

[grid@beiku1 oraagent_grid]$ cd $ORACLE_HOME/opmn/logs/
[grid@beiku1 logs]$ ls -lrt
total 8
-rw-r--r-- 1 grid oinstall 576 Aug 26 15:48 ons.log.beiku1
-rw-r--r-- 1 grid oinstall 267 Aug 26 15:48 ons.out
[grid@beiku1 logs]$ cat ons.log.beiku1
[2015-08-26T15:37:02+08:00] [ons] [NOTIFICATION:1] [104] [ons-internal] ONS server initiated
[2015-08-26T15:37:02+08:00] [ons] [ERROR:1] [17] [ons-listener] any: BIND (Address already in use)
[2015-08-26T15:39:42+08:00] [ons] [NOTIFICATION:1] [104] [ons-internal] ONS server initiated
[2015-08-26T15:39:42+08:00] [ons] [ERROR:1] [17] [ons-listener] any: BIND (Address already in use)
[2015-08-26T15:48:40+08:00] [ons] [NOTIFICATION:1] [104] [ons-internal] ONS server initiated
[2015-08-26T15:48:40+08:00] [ons] [ERROR:1] [17] [ons-listener] any: BIND (Address already in use)

The cause is that another process is using an ONS port:

[grid@beiku1 logs]$ grep port $ORACLE_HOME/opmn/conf/ons.config
localport=6100          # line added by Agent
remoteport=6200         # line added by Agent

[root@beiku1 /]# lsof | grep 6200 | grep LISTEN
ons       16413      grid    6u     IPv6     162533                  TCP *:6200 (LISTEN)

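Process 16413, an ons process, is listening on port 6200. The fix is to make sure the port is not held by any other process. If the port was already taken before running rootupgrade.sh for an upgrade, the likely cause is that the old version's ons process is still running.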

3.

[grid@beiku1 oraagent_grid]$ cd $ORACLE_HOME/opmn/logs/
[grid@beiku1 logs]$ ls -lrt
total 8
-rw-r--r-- 1 grid oinstall 576 Aug 26 15:48 ons.log.beiku1
-rw-r--r-- 1 grid oinstall 267 Aug 26 15:48 ons.out
[grid@beiku1 logs]$ cat ons.log.beiku1
[2015-08-26T15:48:40+08:00] [ons] [NOTIFICATION:1] [104] [ons-internal] ONS server initiated
[2015-08-26T15:48:40+08:00] [ons] [ERROR:1] [17] [ons-listener] 0000:0000:0000:0000:0000:0000:0000:0001,6100: BIND (Cannot assign requested address)

This can happen when IPv6 is only partially configured; 11gR2 Grid Infrastructure does not support IPv6. The fix is to set the following parameter in $GRID_HOME/opmn/conf/ons.config and ons.config.:
interface=ipv4
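
The three log signatures above can be told apart mechanically. The sketch below is only an illustration of that triage: the function name classify_ons_error and the returned descriptions are hypothetical, and the matched substrings are taken from the ons.log excerpts shown in the three cases.

```python
def classify_ons_error(log_text: str) -> str:
    """Map an ons.log excerpt to one of the three causes described above."""
    # Case 1: localhost cannot be resolved.
    if "getaddrinfo(" in log_text and "failed" in log_text:
        return "case 1: localhost does not resolve correctly"
    # Case 2: another process already listens on an ONS port.
    if "BIND (Address already in use)" in log_text:
        return "case 2: an ONS port is held by another process"
    # Case 3: an IPv6 address cannot be bound.
    if "BIND (Cannot assign requested address)" in log_text:
        return "case 3: IPv6 is partially configured"
    return "unknown"


line = ("[2015-08-26T15:37:02+08:00] [ons] [ERROR:1] [17] [ons-listener] "
        "any: BIND (Address already in use)")
print(classify_ons_error(line))
```

Only the first matching signature is reported, so a log that mixes several errors still needs to be read in full.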

The failure here was case 2: ons process 16413 was holding port 6200, so the port has to be freed:

[root@beiku1 /]# lsof | grep 6200 | grep LISTEN
ons       16413      grid    6u     IPv6     162533                  TCP *:6200 (LISTEN)
[root@beiku1 /]# kill -9 16413

Then run root.sh again:

[root@beiku1 /]# ./u01/app/product/11.2.0/crs/root.sh
Performing root user operation for Oracle 11g

The following environment variables are set as:
    ORACLE_OWNER= grid
    ORACLE_HOME=  /u01/app/product/11.2.0/crs

Enter the full pathname of the local bin directory: [/usr/local/bin]:
The contents of "dbhome" have not changed. No need to overwrite.
The contents of "oraenv" have not changed. No need to overwrite.
The contents of "coraenv" have not changed. No need to overwrite.

Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Using configuration parameter file: /u01/app/product/11.2.0/crs/crs/install/crsconfig_params
User ignored Prerequisites during installation
Installing Trace File Analyzer
PRKO-2190 : VIP exists for node beiku1, VIP name beiku1-vip
Preparing packages for installation...
cvuqdisk-1.0.9-1
Configure Oracle Grid Infrastructure for a Cluster ... succeeded

After killing the process that held port 6200, root.sh ran successfully.

Redhat Linux DNS Configuration Guide

Oracle 11g RAC introduced the SCAN IP, and one way to use it is through DNS. This article walks through configuring DNS on Redhat Linux 5.4 in detail.
DNS configuration on Redhat Linux 5.4
Change the hostname before configuring DNS:

[root@beiku1 etc]# hostname beiku1.sbyy.com
[root@beiku1 etc]# vi /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               beiku1.sbyy.com localhost
::1             localhost6.localdomain6 localhost6
10.138.130.161 beiku1

[root@beiku1 etc]# vi /etc/sysconfig/network
NETWORKING=yes
NETWORKING_IPV6=no
HOSTNAME=beiku1.sbyy.com
GATEWAY=10.138.130.254

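I. Install the packages
On Redhat Linux 5.4 the packages whose names match "bind" are listed below (the kdebindings and ypbind packages only match the name and are not part of the DNS service):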

bind-9.3.6-4.P1.el5 
bind-libbind-devel-9.3.6-4.P1.el5 
kdebindings-devel-3.5.4-6.el5 
kdebindings-3.5.4-6.el5 
bind-devel-9.3.6-4.P1.el5 
bind-utils-9.3.6-4.P1.el5 
bind-chroot-9.3.6-4.P1.el5 
ypbind-1.19-12.el5 
system-config-bind-4.0.3-4.el5 
bind-libs-9.3.6-4.P1.el5 
bind-sdb-9.3.6-4.P1.el5 

Use rpm -qa | grep bind to check which of these packages are already installed:

[root@beiku1 soft]# rpm -qa | grep bind
bind-chroot-9.3.6-4.P1.el5
kdebindings-3.5.4-6.el5
ypbind-1.19-12.el5
bind-libs-9.3.6-4.P1.el5
bind-9.3.6-4.P1.el5
system-config-bind-4.0.3-4.el5
bind-utils-9.3.6-4.P1.el5

Install any missing packages with the following commands:

[root@beiku1 soft]# rpm -ivh bind-9.3.6-4.P1.el5.i386.rpm
warning: bind-9.3.6-4.P1.el5.i386.rpm: Header V3 DSA signature: NOKEY, key ID 37017186
Preparing...                ########################################### [100%]
        package bind-9.3.6-4.P1.el5.i386 is already installed
[root@beiku1 soft]# rpm -ivh caching-nameserver-9.3.6-4.P1.el5.i386.rpm
warning: caching-nameserver-9.3.6-4.P1.el5.i386.rpm: Header V3 DSA signature: NOKEY, key ID 37017186
Preparing...                ########################################### [100%]
   1:caching-nameserver     ########################################### [100%]

[root@beiku1 soft]# rpm -ivh install kdebindings-devel-3.5.4-6.el5.i386.rpm
error: open of install failed: No such file or directory
warning: kdebindings-devel-3.5.4-6.el5.i386.rpm: Header V3 DSA signature: NOKEY, key ID 37017186
[root@beiku1 soft]# rpm -ivh kdebindings-devel-3.5.4-6.el5.i386.rpm
warning: kdebindings-devel-3.5.4-6.el5.i386.rpm: Header V3 DSA signature: NOKEY, key ID 37017186
Preparing...                ########################################### [100%]
   1:kdebindings-devel      ########################################### [100%]
[root@beiku1 soft]# rpm -ivh bind-sdb-9.3.6-4.P1.el5.i386.rpm
warning: bind-sdb-9.3.6-4.P1.el5.i386.rpm: Header V3 DSA signature: NOKEY, key ID 37017186
Preparing...                ########################################### [100%]
   1:bind-sdb               ########################################### [100%]
[root@beiku1 soft]# rpm -ivh bind-libbind-devel-9.3.6-4.P1.el5.i386.rpm
warning: bind-libbind-devel-9.3.6-4.P1.el5.i386.rpm: Header V3 DSA signature: NOKEY, key ID 37017186
Preparing...                ########################################### [100%]
   1:bind-libbind-devel     ########################################### [100%]
[root@beiku1 soft]# rpm -ivh bind-devel-9.3.6-4.P1.el5.i386.rpm
warning: bind-devel-9.3.6-4.P1.el5.i386.rpm: Header V3 DSA signature: NOKEY, key ID 37017186
Preparing...                ########################################### [100%]
   1:bind-devel             ########################################### [100%]

The caching-nameserver-9.3.6-4.P1.el5 package must also be installed manually. Without it the named service cannot start and reports an error such as:

[root@beiku1 ~]# service named start
Locating /var/named/chroot//etc/named.conf failed:
[FAILED]

[root@beiku1 soft]# rpm -ivh caching-nameserver-9.3.6-4.P1.el5.i386.rpm
warning: caching-nameserver-9.3.6-4.P1.el5.i386.rpm: Header V3 DSA signature: NOKEY, key ID 37017186
Preparing...                ########################################### [100%]
   1:caching-nameserver     ########################################### [100%]

[root@beiku1 soft]# service named start
Starting named: [  OK  ]

II. Copy the template file
Because the chroot environment is installed, the main DNS configuration file lives under /var/named/chroot/etc:

[root@beiku1 soft]# cd /var/named/chroot/
[root@beiku1 chroot]# ls
dev  etc  proc  var
[root@beiku1 chroot]# cd etc
[root@beiku1 etc]# ls
localtime  named.caching-nameserver.conf  named.rfc1912.zones  rndc.key
[root@beiku1 etc]#

The named.caching-nameserver.conf file contains:

[root@beiku1 etc]# cat named.caching-nameserver.conf
//
// named.caching-nameserver.conf
//
// Provided by Red Hat caching-nameserver package to configure the
// ISC BIND named(8) DNS server as a caching only nameserver 
// (as a localhost DNS resolver only). 
//
// See /usr/share/doc/bind*/sample/ for example named configuration files.
//
// DO NOT EDIT THIS FILE - use system-config-bind or an editor
// to create named.conf - edits to this file will be lost on 
// caching-nameserver package upgrade.
//
options {
        listen-on port 53 { 127.0.0.1; };
        listen-on-v6 port 53 { ::1; };
        directory       "/var/named";
        dump-file       "/var/named/data/cache_dump.db";
        statistics-file "/var/named/data/named_stats.txt";
        memstatistics-file "/var/named/data/named_mem_stats.txt";

        // Those options should be used carefully because they disable port
        // randomization
        // query-source    port 53;
        // query-source-v6 port 53;

        allow-query     { localhost; };
        allow-query-cache { localhost; };
};
logging {
        channel default_debug {
                file "data/named.run";
                severity dynamic;
        };
};
view localhost_resolver {
        match-clients      { localhost; };
        match-destinations { localhost; };
        recursion yes;
        include "/etc/named.rfc1912.zones";
};

The file tells us not to edit it directly but to create a named.conf file and edit that instead; once named.conf exists, this file is no longer read. So copy named.caching-nameserver.conf to named.conf:

[root@beiku1 etc]# cp -p named.caching-nameserver.conf named.conf
[root@beiku1 etc]# ls
localtime  named.caching-nameserver.conf  named.conf  named.rfc1912.zones  rndc.key

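named.conf has been created. Use the -p option when copying so that ownership and permissions are preserved; otherwise starting the service fails with a permission denied error.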

III. Edit named.conf

[root@beiku1 etc]# vi named.conf
//
// named.caching-nameserver.conf
//
// Provided by Red Hat caching-nameserver package to configure the
// ISC BIND named(8) DNS server as a caching only nameserver
// (as a localhost DNS resolver only).
//
// See /usr/share/doc/bind*/sample/ for example named configuration files.
//
// DO NOT EDIT THIS FILE - use system-config-bind or an editor
// to create named.conf - edits to this file will be lost on
// caching-nameserver package upgrade.
//
options {
        listen-on port 53 { any; };
        listen-on-v6 port 53 { ::1; };
        directory       "/var/named";
        dump-file       "/var/named/data/cache_dump.db";
        statistics-file "/var/named/data/named_stats.txt";
        memstatistics-file "/var/named/data/named_mem_stats.txt";

        // Those options should be used carefully because they disable port
        // randomization
        // query-source    port 53;
        // query-source-v6 port 53;

        allow-query     { 10.138.130.0/24; };
        allow-query-cache { any; };
};
logging {
        channel default_debug {
                file "data/named.run";
                severity dynamic;
        };
};
view localhost_resolver {
        match-clients      { 10.138.130.0/24; };
        match-destinations { any; };
        recursion yes;
        include "/etc/named.rfc1912.zones";
};

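What these directives mean:
options
Global configuration.
listen-on port 53 { any; };
The DNS service listens on port 53 on all interfaces.
listen-on-v6 port 53 { ::1; };
For IPv6, listen only on the loopback address.
directory "/var/named";
The directory holding the zone files, that is, /var/named inside the chroot environment.
dump-file "/var/named/data/cache_dump.db";
The file the cache is dumped to.
statistics-file "/var/named/data/named_stats.txt";
The file query statistics are written to.
memstatistics-file "/var/named/data/named_mem_stats.txt";
The file memory usage statistics are written to.
allow-query { 10.138.130.0/24; };
The clients allowed to query; changed here to the local subnet.
allow-query-cache { any; };
The clients allowed to query the cache; any means everyone.
logging {
channel default_debug {
file "data/named.run";
severity dynamic;
};
};
Sets the log location: the file is written under /var/named/chroot/var/named/data/.
view localhost_resolver {
match-clients { 10.138.130.0/24; };
match-destinations { any; };
recursion yes;
include "/etc/named.rfc1912.zones";
};

This block defines a view.
match-clients specifies which clients the view applies to.
match-destinations specifies which destination addresses the view applies to.
That completes named.conf. The view ends with include "/etc/named.rfc1912.zones";, so that file is configured next. Different views can also be created for different sets of clients.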
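
To see what an address list such as { 10.138.130.0/24; } admits, the membership test can be reproduced with Python's ipaddress module. This is a simplified sketch, not how named evaluates its lists: it handles only plain networks (not keywords such as any or localhost), and the function name client_allowed is hypothetical.

```python
import ipaddress


def client_allowed(client_ip: str, acl_networks) -> bool:
    """Return True if client_ip falls inside any network in acl_networks."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(net) for net in acl_networks)


# The subnet used for allow-query and match-clients in named.conf above.
allow_query = ["10.138.130.0/24"]
print(client_allowed("10.138.130.162", allow_query))  # True
print(client_allowed("10.138.131.5", allow_query))    # False
```

A client outside 10.138.130.0/24 is therefore refused by allow-query and also falls outside the only view defined in this configuration.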

IV. Define the zones

[root@beiku1 etc]# vi  named.rfc1912.zones
// named.rfc1912.zones:
//
// Provided by Red Hat caching-nameserver package 
//
// ISC BIND named zone configuration for zones recommended by
// RFC 1912 section 4.1 : localhost TLDs and address zones
// 
// See /usr/share/doc/bind*/sample/ for example named configuration files.
//
zone "." IN {
        type hint;
        file "named.ca";
};

zone "sbyy.com" IN {
        type master;
        file "sbyy.zone";
        allow-update { none; };
};

zone "130.138.10.in-addr.arpa" IN {
        type master;
        file "named.sbyy";
        allow-update { none; };
};

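What these directives mean:
zone "." is the root zone.
zone "sbyy.com" defines the forward lookup zone.
zone "130.138.10.in-addr.arpa" defines the reverse lookup zone.
IN is the Internet record class.
type hint marks the root zone as a hint zone.
type master marks the zone as a master (primary) zone.
file "named.ca"; the zone file is named.ca.
file "sbyy.zone"; the forward lookup zone file is sbyy.zone.
file "named.sbyy"; the reverse lookup zone file is named.sbyy.
allow-update { none; }; controls whether clients may update the zone dynamically; none disallows it.
The named.ca file lists the 13 root servers.
The sbyy.zone file holds the forward lookup data.
The named.sbyy file holds the reverse lookup data.
The zone definitions are now complete; next, edit the zone data files.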
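
The reverse zone name is the network's octets in reverse order followed by in-addr.arpa, which is easy to get wrong by hand. A minimal sketch of the derivation follows; it assumes an IPv4 network on an octet boundary (/8, /16 or /24), and the function name reverse_zone_name is hypothetical.

```python
import ipaddress


def reverse_zone_name(network: str) -> str:
    """Return the in-addr.arpa zone name for an IPv4 /8, /16 or /24 network."""
    net = ipaddress.ip_network(network)
    if net.version != 4 or net.prefixlen not in (8, 16, 24):
        raise ValueError("only IPv4 /8, /16 and /24 networks are handled")
    # Keep the network octets only, then reverse their order.
    octets = str(net.network_address).split(".")[: net.prefixlen // 8]
    return ".".join(reversed(octets)) + ".in-addr.arpa"


print(reverse_zone_name("10.138.130.0/24"))  # 130.138.10.in-addr.arpa
```

For the 10.138.130.0/24 subnet used here, the result matches the zone declared in named.rfc1912.zones.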

V. Create the zone data files from the templates

[root@beiku1 etc]# cd /var/named/chroot/var/named/
[root@beiku1 named]# ls
data  localdomain.zone  localhost.zone  named.broadcast  named.ca  named.ip6.local  named.local  named.zero  slaves

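As shown, /var/named inside the chroot environment contains several template files. named.ca is the data file for the root zone. Copy localhost.zone to sbyy.zone, which becomes the forward lookup zone file, and copy named.local to named.sbyy, which becomes the reverse lookup zone file. The file names must match those given in /etc/named.rfc1912.zones.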

[root@beiku1 named]# cp -p localhost.zone sbyy.zone
[root@beiku1 named]# cp -p named.local named.sbyy
[root@beiku1 named]# ls 
data              named.broadcast  named.local  sbyy.zone
localdomain.zone  named.ca         named.sbyy   slaves
localhost.zone    named.ip6.local  named.zero

The copies succeeded, so the forward and reverse zone data files now exist.

VI. Define the zone data files
1. Define the forward lookup zone file

[root@beiku1 named]# vi sbyy.zone
$TTL    86400
@               IN SOA  beiku1.sbyy.com.       root.sbyy.com. (
                                        44              ; serial (d. adams)
                                        3H              ; refresh
                                        15M              ; retry
                                        1W              ; expiry
                                        1D )            ; minimum

@              IN NS           beiku1.sbyy.com.


beikuscan      IN A            10.138.130.167
beikuscan      IN A            10.138.130.168
beikuscan      IN A            10.138.130.169
beiku2         IN A            10.138.130.162
beiku1         IN A            10.138.130.161

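Explanation of each line in the forward zone file:
$TTL 86400
The default time to live for records is 86400 s (24 h).

@ IN SOA beiku1.sbyy.com. root.sbyy.com. (
This is the SOA record; a zone may contain only one.
@ stands for the zone itself (sbyy.com).
IN is the Internet record class.
SOA is the start of authority record; beiku1.sbyy.com. names the primary DNS server for the zone.
root.sbyy.com. is the administrator's mailbox (root@sbyy.com).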

44 ; serial (d. adams)
3H ; refresh
15M ; retry
1W ; expiry
1D ) ; minimum

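These values are mainly used to synchronize the master DNS with the slave DNS:
44 Serial number. It must be incremented whenever the data on the master changes; the slave compares serial numbers to decide whether to synchronize.
3H Refresh. The slave checks the master every three hours.
15M Retry. If a refresh attempt fails, the slave retries every 15 minutes.
1W Expiry. If the slave has not reached the master for one week, it stops serving the zone.
1D Minimum. How long a negative answer (a name that does not exist) is cached.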
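
The unit suffixes in the SOA timers map to seconds as follows. This is a small sketch for checking the values by hand; it assumes a single number with an optional one-letter unit (S, M, H, D or W) and does not handle combined forms, and the function name ttl_to_seconds is hypothetical.

```python
UNIT_SECONDS = {"S": 1, "M": 60, "H": 3600, "D": 86400, "W": 604800}


def ttl_to_seconds(value: str) -> int:
    """Convert a zone-file time such as '3H', '15M' or '86400' to seconds."""
    value = value.strip().upper()
    if value.isdigit():
        return int(value)  # a bare number is already in seconds
    number, unit = value[:-1], value[-1]
    if not number.isdigit() or unit not in UNIT_SECONDS:
        raise ValueError(f"unsupported time value: {value!r}")
    return int(number) * UNIT_SECONDS[unit]


for field, raw in [("refresh", "3H"), ("retry", "15M"), ("expiry", "1W"), ("minimum", "1D")]:
    print(field, ttl_to_seconds(raw))
```

The 1D minimum therefore equals the 86400 seconds written out in $TTL.
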
@ IN NS beiku1.sbyy.com.

This is the NS record; it names beiku1.sbyy.com as the name server. A zone needs at least one NS record.

beiku1 IN A 10.138.130.161
Sets the IP address of beiku1 to 10.138.130.161.

beikuscan IN A 10.138.130.167
Sets an IP address of beikuscan to 10.138.130.167.

beikuscan IN A 10.138.130.168
Sets an IP address of beikuscan to 10.138.130.168.

beikuscan IN A 10.138.130.169
Sets an IP address of beikuscan to 10.138.130.169.

beiku2 IN A 10.138.130.162
Sets the IP address of beiku2 to 10.138.130.162.

The forward zone file is complete; next, define the reverse zone file.

2. Define the reverse lookup zone file

[root@beiku1 named]# vi named.sbyy
$TTL    86400
@       IN      SOA     beiku1.sbyy.com. root.sbyy.com.  (
                                      1997022702 ; Serial
                                      120      ; Refresh
                                      120      ; Retry
                                      3600000    ; Expire
                                      86400 )    ; Minimum
@        IN      NS     beiku1.sbyy.com.

167     IN      PTR     beikuscan.sbyy.com.
168     IN      PTR     beikuscan.sbyy.com.
169     IN      PTR     beikuscan.sbyy.com.
162     IN      PTR     beiku2.sbyy.com. 
161     IN      PTR     beiku1.sbyy.com.

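The reverse zone file is configured much like the forward one: swap the positions of the IP address and the name, and replace A with PTR. In the reverse zone only the last octet of the address is written, and the name is fully qualified with a trailing dot.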
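
Because the PTR lines are a mechanical rewrite of the A records, they can be generated rather than typed. The sketch below assumes a /24 reverse zone, so only the last octet is used as the owner name; the function name ptr_records is hypothetical, and the input pairs are the A records from sbyy.zone.

```python
def ptr_records(a_records, domain):
    """Build PTR lines for a /24 reverse zone from (host, ipv4) pairs."""
    lines = []
    for host, ip in a_records:
        last_octet = ip.split(".")[-1]
        # The target must be fully qualified, hence the trailing dot.
        lines.append(f"{last_octet}     IN      PTR     {host}.{domain}.")
    return lines


a_records = [
    ("beikuscan", "10.138.130.167"),
    ("beikuscan", "10.138.130.168"),
    ("beikuscan", "10.138.130.169"),
    ("beiku2", "10.138.130.162"),
    ("beiku1", "10.138.130.161"),
]
print("\n".join(ptr_records(a_records, "sbyy.com")))
```

Generating both files from one list keeps forward and reverse lookups consistent when hosts are added later.
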
The basic DNS configuration is now complete, so check whether DNS works.
First restart the DNS service:

[root@beiku1 etc]# service named restart
Stopping named: [  OK  ]
Starting named: [  OK  ]

The DNS service started successfully.
Before querying, the client must be pointed at the DNS server in /etc/resolv.conf:

[root@beiku1 etc]# vi /etc/resolv.conf
search sbyy.com
nameserver       10.138.130.161


[root@beiku1 etc]# service named restart
Stopping named: [  OK  ]
Starting named: [  OK  ]

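Keywords and their meaning:
nameserver gives the IP address of a DNS server. There may be several nameserver lines, each with one IP address.
Queries follow the order of the nameserver lines in the file; a later nameserver is queried only when the ones before it do not respond.
domain declares the host's domain name. Many programs use it, for example mail systems, and it is also used when a DNS query is made for a host name without a domain. If no domain is given, it is derived from the host name by removing everything before the first dot ( . ).
search lists the domains to try, in order. When a host name without a domain is looked up, it is tried in each domain listed by search.
domain and search cannot be combined; if both are present, the one that appears last is used.
sortlist sorts the returned addresses in a specific order. Its arguments are network/netmask pairs, in any order.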
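
The rules above, in particular that the last of domain and search wins, can be expressed as a small parser. This is an illustrative sketch rather than the resolver's actual implementation: it handles only the nameserver, domain and search keywords, and the function name parse_resolv_conf is hypothetical.

```python
def parse_resolv_conf(text: str) -> dict:
    """Collect nameserver entries and the effective search list."""
    nameservers, search = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped[0] in "#;":
            continue  # skip blank lines and comments
        fields = stripped.split()
        if len(fields) < 2:
            continue
        keyword, args = fields[0], fields[1:]
        if keyword == "nameserver":
            nameservers.append(args[0])  # file order is query order
        elif keyword == "domain":
            search = [args[0]]           # domain and search override each other
        elif keyword == "search":
            search = args                # whichever appears last wins
    return {"nameservers": nameservers, "search": search}


# The resolv.conf contents used in this article.
sample = "search sbyy.com\nnameserver       10.138.130.161\n"
print(parse_resolv_conf(sample))
```

With search sbyy.com in effect, a short name such as beiku1 is looked up as beiku1.sbyy.com.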

Now query with the nslookup tool:

[root@beiku1 named]# nslookup beiku1.sbyy.com
Server:         10.138.130.161
Address:        10.138.130.161#53

Name:   beiku1.sbyy.com
Address: 10.138.130.161

[root@beiku1 named]# nslookup beiku2.sbyy.com
Server:         10.138.130.161
Address:        10.138.130.161#53

Name:   beiku2.sbyy.com
Address: 10.138.130.162

[root@beiku1 named]# nslookup beikuscan.sbyy.com
Server:         10.138.130.161
Address:        10.138.130.161#53

Name:   beikuscan.sbyy.com
Address: 10.138.130.169
Name:   beikuscan.sbyy.com
Address: 10.138.130.167
Name:   beikuscan.sbyy.com
Address: 10.138.130.168

[root@beiku1 named]# nslookup beiku1
Server:         10.138.130.161
Address:        10.138.130.161#53

Name:   beiku1.sbyy.com
Address: 10.138.130.161

[root@beiku1 named]# nslookup beiku2
Server:         10.138.130.161
Address:        10.138.130.161#53

Name:   beiku2.sbyy.com
Address: 10.138.130.162

[root@beiku1 named]# nslookup beikuscan
Server:         10.138.130.161
Address:        10.138.130.161#53

Name:   beikuscan.sbyy.com
Address: 10.138.130.168
Name:   beikuscan.sbyy.com
Address: 10.138.130.169
Name:   beikuscan.sbyy.com
Address: 10.138.130.167

[root@beiku1 named]# nslookup 10.138.130.161
Server:         10.138.130.161
Address:        10.138.130.161#53

161.130.138.10.in-addr.arpa     name = beiku1.sbyy.com.

[root@beiku1 named]# nslookup 10.138.130.162
Server:         10.138.130.161
Address:        10.138.130.161#53

162.130.138.10.in-addr.arpa     name = beiku2.sbyy.com.

[root@beiku1 named]# nslookup 10.138.130.167
Server:         10.138.130.161
Address:        10.138.130.161#53

167.130.138.10.in-addr.arpa     name = beikuscan.sbyy.com.

[root@beiku1 named]# nslookup 10.138.130.168
Server:         10.138.130.161
Address:        10.138.130.161#53

168.130.138.10.in-addr.arpa     name = beikuscan.sbyy.com.

[root@beiku1 named]# nslookup 10.138.130.169
Server:         10.138.130.161
Address:        10.138.130.161#53

169.130.138.10.in-addr.arpa     name = beikuscan.sbyy.com.

DNS resolution works as expected. So far only the master DNS server has been configured, and it is working correctly; next, configure a slave DNS server.

Configuring the slave DNS server
The slave DNS is set up in much the same way as the master.
I. Install the packages

[root@beiku2 soft]# rpm -qa | grep bind
bind-chroot-9.3.6-4.P1.el5
kdebindings-3.5.4-6.el5
system-config-bind-4.0.3-4.el5
ypbind-1.19-12.el5
bind-libs-9.3.6-4.P1.el5
bind-9.3.6-4.P1.el5
bind-utils-9.3.6-4.P1.el5
[root@beiku2 soft]# rpm -ivh kdebindings-devel-3.5.4-6.el5.i386.rpm
warning: kdebindings-devel-3.5.4-6.el5.i386.rpm: Header V3 DSA signature: NOKEY, key ID 37017186
Preparing...                ########################################### [100%]
   1:kdebindings-devel      ########################################### [100%]
[root@beiku2 soft]# rpm -ivh caching-nameserver-9.3.6-4.P1.el5.i386.rpm
warning: caching-nameserver-9.3.6-4.P1.el5.i386.rpm: Header V3 DSA signature: NOKEY, key ID 37017186
Preparing...                ########################################### [100%]
   1:caching-nameserver     ########################################### [100%]
[root@beiku2 soft]# rpm -ivh bind-sdb-9.3.6-4.P1.el5.i386.rpm
warning: bind-sdb-9.3.6-4.P1.el5.i386.rpm: Header V3 DSA signature: NOKEY, key ID 37017186
Preparing...                ########################################### [100%]
   1:bind-sdb               ########################################### [100%]
[root@beiku2 soft]# rpm -ivh bind-libbind-devel-9.3.6-4.P1.el5.i386.rpm
warning: bind-libbind-devel-9.3.6-4.P1.el5.i386.rpm: Header V3 DSA signature: NOKEY, key ID 37017186
Preparing...                ########################################### [100%]
   1:bind-libbind-devel     ########################################### [100%]
[root@beiku2 soft]# rpm -ivh bind-devel-9.3.6-4.P1.el5.i386.rpm
warning: bind-devel-9.3.6-4.P1.el5.i386.rpm: Header V3 DSA signature: NOKEY, key ID 37017186
Preparing...                ########################################### [100%]
   1:bind-devel             ########################################### [100%]

II. Copy the template file

[root@beiku2 /]# cd /var/named/chroot/etc
[root@beiku2 etc]# ls -lrt
total 24
-rw-r--r-- 1 root root  3519 Feb 27  2006 localtime
-rw-r----- 1 root named  955 Jul 30  2009 named.rfc1912.zones
-rw-r----- 1 root named 1230 Jul 30  2009 named.caching-nameserver.conf
-rw-r----- 1 root named  113 Nov 15  2014 rndc.key

[root@beiku2 etc]# cp -p named.caching-nameserver.conf named.conf

III. Edit named.conf

[root@beiku2 etc]# vi named.conf
//
// named.caching-nameserver.conf
//
// Provided by Red Hat caching-nameserver package to configure the
// ISC BIND named(8) DNS server as a caching only nameserver
// (as a localhost DNS resolver only).
//
// See /usr/share/doc/bind*/sample/ for example named configuration files.
//
// DO NOT EDIT THIS FILE - use system-config-bind or an editor
// to create named.conf - edits to this file will be lost on
// caching-nameserver package upgrade.
//
options {
        listen-on port 53 { any; };
        listen-on-v6 port 53 { ::1; };
        directory       "/var/named";
        dump-file       "/var/named/data/cache_dump.db";
        statistics-file "/var/named/data/named_stats.txt";
        memstatistics-file "/var/named/data/named_mem_stats.txt";

        // Those options should be used carefully because they disable port
        // randomization
        // query-source    port 53;
        // query-source-v6 port 53;

        allow-query     { 10.138.130.0/24; };
        allow-query-cache { any; };
};
logging {
        channel default_debug {
                file "data/named.run";
                severity dynamic;
        };
};
view localhost_resolver {
        match-clients      { 10.138.130.0/24; };
        match-destinations { any; };
        recursion yes;
        include "/etc/named.rfc1912.zones";
};

This is the same as the configuration on the primary DNS server.

4. Define the zone files

[root@beiku2 etc]# vi named.rfc1912.zones
// named.rfc1912.zones:
//
// Provided by Red Hat caching-nameserver package
//
// ISC BIND named zone configuration for zones recommended by
// RFC 1912 section 4.1 : localhost TLDs and address zones
//
// See /usr/share/doc/bind*/sample/ for example named configuration files.
//

zone "sbyy.com" IN {
        type slave;
        masters {10.138.130.161;};
        file "slaves/sbyy.com";
};

zone "0.138.10.in-addr.arpa" IN {
        type slave;
        masters {10.138.130.161;};
        file "slaves/named.sbyy";
};

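The zone definitions on the secondary DNS server differ from those on the primary in a few ways:
On the secondary, type must be changed to slave.
masters { 10.138.130.161; }; is required and must give the IP address of the primary DNS server.
file "slaves/sbyy.com";
file "slaves/named.sbyy";
Why place the zone data files under the slaves directory? Because that directory is owned by user named and group named. The named daemon runs as the named user and has to write the transferred zone data itself, so the files must live in a directory it is allowed to write to.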
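
The slave zone stanzas above all follow the same pattern, so they can be generated rather than typed by hand. The following is a minimal Python sketch under that assumption; the helper name slave_zone_stanza is made up for illustration and is not part of BIND.

```python
def slave_zone_stanza(zone: str, master_ip: str, filename: str) -> str:
    """Render a slave zone stanza in the same layout as the ones shown above.

    The data file is always placed under slaves/, the directory that the
    named user can write to.
    """
    return (
        f'zone "{zone}" IN {{\n'
        f"        type slave;\n"
        f"        masters {{ {master_ip}; }};\n"
        f'        file "slaves/{filename}";\n'
        f"}};\n"
    )


if __name__ == "__main__":
    # The two zones defined in this article.
    print(slave_zone_stanza("sbyy.com", "10.138.130.161", "sbyy.com"))
    print(slave_zone_stanza("0.138.10.in-addr.arpa", "10.138.130.161", "named.sbyy"))
```

The output is meant to be pasted into named.rfc1912.zones; the sketch does not validate the zone name or the IP address.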

[root@beiku2 etc]# cd /var/named/chroot/var/named/
[root@beiku2 named]# ls -lrt
total 44
drwxrwx--- 2 named named 4096 Jul 27  2004 slaves
drwxrwx--- 2 named named 4096 Aug 26  2004 data
-rw-r----- 1 root  named  427 Jul 30  2009 named.zero
-rw-r----- 1 root  named  426 Jul 30  2009 named.local
-rw-r----- 1 root  named  424 Jul 30  2009 named.ip6.local
-rw-r----- 1 root  named 1892 Jul 30  2009 named.ca
-rw-r----- 1 root  named  427 Jul 30  2009 named.broadcast
-rw-r----- 1 root  named  195 Jul 30  2009 localhost.zone
-rw-r----- 1 root  named  198 Jul 30  2009 localdomain.zone
[root@beiku2 named]# cd slaves
[root@beiku2 slaves]# ls -lrt
total 0

As shown, the slaves directory is owned by user and group named, and it is currently empty.
Now restart the DNS service.

[root@beiku2 slaves]# service named restart
Stopping named: [  OK  ]
Starting named: [  OK  ]

The service started successfully. Next, check the log to see what was reported during startup.

[root@beiku2 slaves]# tail /var/log/messages
Aug 25 23:41:49 beiku2 named[30421]: the working directory is not writable
Aug 25 23:41:49 beiku2 named[30421]: running
Aug 25 23:41:49 beiku2 named[30421]: zone 0.138.10.in-addr.arpa/IN/localhost_resolver: Transfer started.
Aug 25 23:41:49 beiku2 named[30421]: transfer of '0.138.10.in-addr.arpa/IN' from 10.138.130.161#53: connected using 10.138.130.162#44647
Aug 25 23:41:49 beiku2 named[30421]: zone 0.138.10.in-addr.arpa/IN/localhost_resolver: transferred serial 1997022700
Aug 25 23:41:49 beiku2 named[30421]: transfer of '0.138.10.in-addr.arpa/IN' from 10.138.130.161#53: end of transfer
Aug 25 23:41:49 beiku2 named[30421]: zone sbyy.com/IN/localhost_resolver: Transfer started.
Aug 25 23:41:49 beiku2 named[30421]: transfer of 'sbyy.com/IN' from 10.138.130.161#53: connected using 10.138.130.162#56490
Aug 25 23:41:49 beiku2 named[30421]: zone sbyy.com/IN/localhost_resolver: transferred serial 42
Aug 25 23:41:49 beiku2 named[30421]: transfer of 'sbyy.com/IN' from 10.138.130.161#53: end of transfer

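The log shows the secondary DNS server pulling both zones from the primary (10.138.130.161), recording the transferred serial numbers (1997022700 and 42), and completing each transfer. The log entries are quite detailed.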
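
When there are many zones, reading the log by eye gets tedious. As a rough sketch, the "transferred serial" lines can be pulled out with a regular expression. The pattern below is derived only from the log lines shown above and may need adjusting for other BIND versions; the function name transferred_serials is hypothetical.

```python
import re

# Matches the part of a syslog line that looks like:
#   zone sbyy.com/IN/localhost_resolver: transferred serial 42
# The view name after /IN is optional.
TRANSFER_RE = re.compile(r"zone (\S+?)/IN(?:/\S+)?: transferred serial (\d+)")


def transferred_serials(log_lines):
    """Return {zone: serial} for every completed transfer found in log_lines."""
    serials = {}
    for line in log_lines:
        match = TRANSFER_RE.search(line)
        if match:
            serials[match.group(1)] = int(match.group(2))
    return serials


# Two lines copied from the log output above.
SAMPLE = [
    "Aug 25 23:41:49 beiku2 named[30421]: zone 0.138.10.in-addr.arpa/IN/localhost_resolver: transferred serial 1997022700",
    "Aug 25 23:41:49 beiku2 named[30421]: zone sbyy.com/IN/localhost_resolver: transferred serial 42",
]

if __name__ == "__main__":
    for zone, serial in transferred_serials(SAMPLE).items():
        print(zone, serial)
```

In practice the input would be the lines of /var/log/messages rather than the SAMPLE list.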
Next, look at the slaves directory again.

[root@beiku2 slaves]# ls -lrt
total 8
-rw-r--r-- 1 named named 414 Aug 25 23:41 sbyy.com
-rw-r--r-- 1 named named 451 Aug 25 23:41 named.sbyy

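The slaves directory was empty a moment ago and now contains two files, sbyy.com and named.sbyy. These are the files named in the zone definitions above. The file names are arbitrary, but the contents match the zone data on the primary DNS server.
Look at the contents of the two files.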

[root@beiku2 slaves]# cat sbyy.com
$ORIGIN .
$TTL 86400      ; 1 day
sbyy.com                IN SOA  sbyy.com. root.sbyy.com. (
                                42         ; serial
                                10800      ; refresh (3 hours)
                                900        ; retry (15 minutes)
                                604800     ; expire (1 week)
                                86400      ; minimum (1 day)
                                )
                        NS      sbyy.com.
                        A       127.0.0.1
                        AAAA    ::1
$ORIGIN sbyy.com.
beiku1                  A       10.138.130.161
beikuscan1              A       10.138.130.167
beikuscan2              A       10.138.130.168
beikuscan3              A       10.138.130.169
beiku2                  A       10.138.130.162

[root@beiku2 slaves]# cat named.sbyy
$ORIGIN .
$TTL 86400      ; 1 day
0.138.10.in-addr.arpa   IN SOA  localhost. root.localhost. (
                                1997022700 ; serial
                                28800      ; refresh (8 hours)
                                14400      ; retry (4 hours)
                                3600000    ; expire (5 weeks 6 days 16 hours)
                                86400      ; minimum (1 day)
                                )
                        NS      localhost.
$ORIGIN 0.138.10.in-addr.arpa.
1                       PTR     localhost.
161                     PTR     beiku1.sbyy.com
167                     PTR     beikuscan1.sbyy.com
168                     PTR     beikuscan2.sbyy.com
169                     PTR     beikuscan3.sbyy.com
162                     PTR     beiku2.sbyy.com

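The contents of both files match the zone data on the primary DNS server, and they are neatly formatted. Both files were generated automatically by the zone transfer.
Now test whether the primary and secondary DNS servers work correctly.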

[root@beiku2 slaves]# vi /etc/resolv.conf
search sbyy.com
nameserver 10.138.130.161
nameserver 10.138.130.162

Both the primary and the secondary DNS servers are now listed. Next, test with nslookup.

[root@beiku2 slaves]# nslooup beiku1
-bash: nslooup: command not found
[root@beiku2 slaves]# nslookup beiku1
Server:         10.138.130.161
Address:        10.138.130.161#53

Name:   beiku1.sbyy.com
Address: 10.138.130.161

[root@beiku2 slaves]# nslookup beiku2
Server:         10.138.130.161
Address:        10.138.130.161#53

Name:   beiku2.sbyy.com
Address: 10.138.130.162

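Resolution works, and it is still the primary DNS server at 10.138.130.161 that answers.
Next, stop the primary DNS server at 10.138.130.161 and see whether the secondary at 10.138.130.162 keeps working.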

[root@beiku1 named]# service named stop
Stopping named: [  OK  ]

Test again with nslookup.

[root@beiku2 slaves]# nslookup beiku1
Server:         10.138.130.162
Address:        10.138.130.162#53

Name:   beiku1.sbyy.com
Address: 10.138.130.161

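Resolution still succeeds, but this time the answer comes from the secondary DNS server at 10.138.130.162 rather than the primary at 10.138.130.161. When the primary goes down, the secondary keeps serving clients. This setup also allows simple load balancing: point half of the clients at 10.138.130.161 as their first nameserver and 10.138.130.162 as their second, and point the other half at 10.138.130.162 first and 10.138.130.161 second. Both servers then answer queries in normal operation, and if one of them goes down the other keeps serving all clients. At this point both the primary and the secondary DNS servers are configured and working.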
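
The client-side split described above amounts to alternating the nameserver order in each client's /etc/resolv.conf. Here is a minimal Python sketch of that idea; the client names and both function names are hypothetical, and pushing the generated files out to the clients is left aside.

```python
def split_clients(clients, dns_a, dns_b):
    """Give every other client the opposite nameserver order."""
    plan = {}
    for index, client in enumerate(clients):
        plan[client] = [dns_a, dns_b] if index % 2 == 0 else [dns_b, dns_a]
    return plan


def resolv_conf(search_domain, nameservers):
    """Render resolv.conf content in the same form as the file shown earlier."""
    lines = [f"search {search_domain}"]
    lines += [f"nameserver {ip}" for ip in nameservers]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # client1..client4 are placeholder host names.
    plan = split_clients(
        ["client1", "client2", "client3", "client4"],
        "10.138.130.161",
        "10.138.130.162",
    )
    for client, servers in plan.items():
        print(f"# {client}")
        print(resolv_conf("sbyy.com", servers))
```

Because the resolver normally tries the nameservers in the listed order, each server ends up as first choice for about half of the clients while remaining the fallback for the rest.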

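Next, one more experiment: add a record on the primary and see whether the secondary picks it up. Adding the record on the secondary would be pointless, because changes made there are never propagated back to the primary.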

[root@beiku1 named]# vi sbyy.zone
$TTL    86400
@               IN SOA  @       root (
                                        43              ; serial (d. adams)
                                        2M              ; refresh
                                        2M              ; retry
                                        1W              ; expiry
                                        1D )            ; minimum

                IN NS           @
                IN A            127.0.0.1
                IN AAAA         ::1


beiku1          IN A            10.138.130.161
beikuscan      IN A            10.138.130.167
beikuscan      IN A            10.138.130.168
beikuscan      IN A            10.138.130.169
beiku2          IN A            10.138.130.162
www             IN A            10.138.130.170

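The record www IN A 10.138.130.170 has been added. After any change on the primary, the zone's serial number must be incremented; otherwise the secondary will not transfer the zone again. The serial has been incremented here. With the default settings, however, the refresh interval between primary and secondary is 3H, which makes the effect hard to observe, so refresh is changed to 2M and retry is changed to 2M as well. The secondary now checks the primary every two minutes, and if a check fails it tries again two minutes later. Next, make the corresponding change in the reverse zone file.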

[root@beiku1 named]# vi named.sbyy
$TTL    86400
@       IN      SOA     beiku1.sbyy.com. root.sbyy.com.  (
                                      1997022703 ; Serial
                                      120      ; Refresh
                                      120      ; Retry
                                      3600000    ; Expire
                                      86400 )    ; Minimum
@        IN      NS     beiku1.sbyy.com.

167     IN      PTR     beikuscan.sbyy.com.
168     IN      PTR     beikuscan.sbyy.com.
169     IN      PTR     beikuscan.sbyy.com.
162     IN      PTR     beiku2.sbyy.com.
161     IN      PTR     beiku1.sbyy.com.
170     IN      PTR     www.sbyy.com.

The reverse zone is now updated as well. Restart the DNS service.

[root@beiku1 named]# service named restart
Stopping named: [  OK  ]
Starting named: [  OK  ]

The restart succeeded. Wait a few minutes, then check the forward zone file on the secondary.

[root@beiku2 slaves]# cat sbyy.com
$ORIGIN .
$TTL 86400      ; 1 day
sbyy.com                IN SOA  beiku1.sbyy.com. root.sbyy.com. (
                                45         ; serial
                                120        ; refresh (2 minutes)
                                120        ; retry (2 minutes)
                                604800     ; expire (1 week)
                                86400      ; minimum (1 day)
                                )
                        NS      beiku1.sbyy.com.
$ORIGIN sbyy.com.
beiku1                  A       10.138.130.161
beiku2                  A       10.138.130.162
beikuscan               A       10.138.130.167
                        A       10.138.130.168
                        A       10.138.130.169
www                     A       10.138.130.170

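The record just added on the primary has been transferred to the secondary, and the serial number, refresh interval, and retry interval have been updated too. Now check the reverse zone file on the secondary.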

[root@beiku2 slaves]# cat named.sbyy
$ORIGIN .
$TTL 86400      ; 1 day
0.138.10.in-addr.arpa   IN SOA  localhost. root.localhost. (
                                1997022702 ; serial
                                28800      ; refresh (8 hours)
                                14400      ; retry (4 hours)
                                3600000    ; expire (5 weeks 6 days 16 hours)
                                86400      ; minimum (1 day)
                                )
                        NS      localhost.
$ORIGIN 0.138.10.in-addr.arpa.
1                       PTR     localhost.
161                     PTR     beiku1.sbyy.com
167                     PTR     beikuscan1.sbyy.com
168                     PTR     beikuscan2.sbyy.com
169                     PTR     beikuscan3.sbyy.com
162                     PTR     beiku2.sbyy.com
170                     PTR     www.sbyy.com

The secondary has picked up the reverse zone change as well. This completes the DNS configuration.
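
The rule used in the test above, that the secondary only transfers a zone again when the primary's serial is higher, can be sketched in Python. Serial numbers are 32-bit values that wrap around, and to my understanding they are compared with the serial number arithmetic of RFC 1982. The function name serial_is_newer is made up for this illustration.

```python
SERIAL_BITS = 32
MODULUS = 2 ** SERIAL_BITS
HALF = 2 ** (SERIAL_BITS - 1)


def serial_is_newer(primary: int, secondary: int) -> bool:
    """Return True if primary is 'greater than' secondary in RFC 1982 terms.

    The distance is measured modulo 2**32, so a serial that has wrapped
    around past 4294967295 still counts as newer. A distance of exactly
    2**31 is undefined in the RFC and is treated as not newer here.
    """
    distance = (primary - secondary) % MODULUS
    return 0 < distance < HALF


if __name__ == "__main__":
    print(serial_is_newer(43, 42))                  # serial was incremented
    print(serial_is_newer(42, 42))                  # record added, serial forgotten
    print(serial_is_newer(1997022703, 1997022700))  # reverse zone in this article
```

If the serial is left unchanged after an edit, the comparison is false and the secondary keeps its old copy until the serial is raised.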

Introducing SQL quality audits into software development avoids unnecessary SQL tuning work

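Today I helped a sister department tune the program that sends data for the unified collection of the five social insurances. The tuning itself was simple: it mainly removed a full table scan and a Cartesian product that should never have been executed. The real question is why they appeared in the first place. Did the Oracle optimizer choose the wrong execution plan? No. The cause was a flaw in the table design; had the tables been designed sensibly around the business, this tuning would not have been needed. I raised this issue back when I worked at the company, but it was not taken seriously. Now that I am on the customer side, I am the one putting out the fire.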

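Below is the query that the social security system runs each month to send the amounts payable by every employer, for each insurance type, to the five-insurance collection system: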

Select t.Pay_Object_Id,
       t.Pay_Object_Code,
       t.Pay_Object_Name,
       t.Insr_Detail_Code,
       t.asgn_tenet,
       t.asgn_order,
       t.use_pred_insr,
       Sum(t.Topay_Money) as topay_money,
       Sum(Pay_Money) as pay_money,
       Sum(Pred_Money) as pred_money,
       to_char(sysdate, 'yyyy-mm-dd') as pay_time,
       t.corp_type_code
  From (Select T1.Corp_Id As Pay_Object_Id,
               T1.Insr_Detail_Code,
               T1.Corp_Code As Pay_Object_Code,
               T1.Corp_Name As Pay_Object_Name,
               T1.asgn_tenet,
               T1.asgn_order,
               T1.use_pred_insr,
               Decode(Sign(T1.pay_Money),
                      -1,
                      T1.pay_Money,
                      Decode(Sign(T1.pay_Money -
                                  Decode(Sign(T1.pay_Money),
                                         -1,
                                         0,
                                         Nvl(T2.Pred_Money, 0))),
                             -1,
                             0,
                             T1.pay_Money -
                             Decode(Sign(T1.pay_Money),
                                    -1,
                                    0,
                                    Nvl(T2.Pred_Money, 0)))) As pay_Money,
               T1.toPay_Money,
               Nvl(T2.Pred_Money, 0) As Pred_Money,
               T1.corp_type_code
          from (select t11.Corp_Id,
                       t11.Corp_Code,
                       t11.Corp_Name,
                       t11.Insr_Detail_Code,
                       sum(t11.Topay_Money) as Topay_Money,
                       t11.corp_type_code,
                       sum(t11.Pay_Money) as Pay_Money,
                       t11.asgn_tenet,
                       t11.asgn_order,
                       t11.use_pred_insr
                  from (Select b.Corp_Id,
                               a.Corp_Code,
                               a.Corp_Name,
                               b.insr_detail_code,
                               a.corp_type_code,
                               Sum(b.Pay_Money - nvl(b.Payed_Money, 0)) As Topay_Money,
                               Sum(b.Pay_Money - nvl(b.Payed_Money, 0)) As Pay_Money,
                               c.asgn_tenet,
                               c.asgn_order,
                               c.use_pred_insr
                          From Bs_Corp a, Lv_Insr_Topay b, lv_scheme_detail c
                         Where a.Corp_Id = b.Corp_Id
                           and ((b.payed_flag = 0 and
                               nvl(b.busi_asg_no, 0) = 0) or
                               (b.payed_flag = 2))
                           and nvl(b.indi_pay_flag, 0) = 0
                           and c.scheme_id = 1
                           and b.insr_detail_code=c.insr_detail_code
                           and not exists
                         (select 'x'
                                  from lv_busi_bill lbb, lv_busi_record lbr
                                 where b.corp_id = lbr.pay_object_id
                                   and lbb.busi_bill_sn = lbr.busi_bill_sn
                                   and lbb.pay_object = 1
                                   and lbb.audit_flag = 0)
                           and c.insr_detail_code = b.insr_detail_code
                           and b.calc_prd <= '201508'
                           and b.insr_detail_code in
                               (select distinct insr_detail_code
                                  from lv_scheme_detail
                                 where scheme_id = 1)
                           and b.topay_type in
                               (select topay_type
                                  from lv_busi_type_topay
                                 where busi_type = 1)
                           and b.src_type = 1
                           and a.center_id = '430701'
                         Group By b.Corp_Id,
                                  b.Insr_Detail_Code,
                                  c.use_pred_insr,
                                  a.Corp_Code,
                                  a.Corp_Name,
                                  a.corp_type_code,
                                  c.asgn_tenet,
                                  c.asgn_order,
                                  c.use_pred_insr) t11
                 group by t11.Corp_Id,
                          t11.Corp_Code,
                          t11.Corp_Name,
                          t11.Insr_Detail_Code,
                          t11.corp_type_code,
                          t11.asgn_tenet,
                          t11.asgn_order,
                          t11.use_pred_insr) T1,
               (select t21.corp_id,
                       sum(t21.pred_money) as pred_money,
                       t21.Insr_Detail_Code
                  from (Select a.Corp_Id,
                               decode(c.use_pred_insr,
                                      null,
                                      b.insr_detail_code,
                                      c.use_pred_insr) as Insr_Detail_Code,
                               sum(decode(1, 0, 0, 1, b.Pred_Money)) as pred_money
                          From Bs_Corp a, Lv_Pred_Money b, lv_scheme_detail c
                         Where a.Corp_Id = b.Corp_Id
                           and c.insr_detail_code = b.insr_detail_code
                           and c.scheme_id = 1
                           and decode(c.use_pred_insr,
                                      null,
                                      c.insr_detail_code,
                                      c.use_pred_insr) = c.insr_detail_code
                         group by a.corp_id,
                                  c.use_pred_insr,
                                  b.insr_detail_code) t21
                 group by t21.corp_id, t21.Insr_Detail_Code) T2
         Where T1.Corp_Id = T2.Corp_Id(+)
           And T1.Insr_Detail_Code = T2.Insr_Detail_Code(+)) t
 where not exists (select 'X'
          from lv_busi_bill a, lv_busi_record b
         where a.busi_bill_sn = b.busi_bill_sn
           and a.audit_flag = 0
           and a.pay_object = 1
           and b.PAY_OBJECT_ID = t.PAY_OBJECT_ID
           and b.INSR_DETAIL_CODE = t.insr_detail_code)
 Group By t.pay_money,
          t.Pay_Object_Id,
          t.Pay_Object_Code,
          t.Pay_Object_Name,
          t.corp_type_code,
          t.insr_detail_code,
          t.asgn_tenet,
          t.asgn_order,
          t.use_pred_insr
Having Sum(t.pay_Money) = 0
 order by t.Pay_Object_Name, t.asgn_order
 

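The execution statistics were as follows:
[Screenshot: execution statistics]
The execution time was 1481 seconds, which is unacceptable.

The execution plan was as follows:
[Screenshot: execution plan]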

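The execution plan does a full table scan on lv_busi_record, which holds about 20 million rows. That is clearly wrong. Why is no index used? Because none was created when the table was designed and built. The table grows continuously: early on, with little data, a full table scan has no visible performance impact, but as the system runs and the data accumulates, the query gets slower and slower. The other problem is the Cartesian product between lv_scheme_detail and Bs_Corp. Why does it occur? There is no join condition between the two tables. At first I assumed the developer had forgotten to write one, but checking the table definitions showed that the two tables have no column on which they can be joined, and the query relies on the final GROUP BY to remove the duplicates.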

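All I can do here is create an index on lv_busi_record according to the business rules. The Cartesian join between lv_scheme_detail and Bs_Corp cannot be fixed this way, because changing the table structure would mean changing the application. After indexing lv_busi_record, the results were as follows.
The execution statistics:
[Screenshot: execution statistics]

[Screenshot]
The execution time dropped to about 14 seconds; going from 1481 to 14 is roughly a hundredfold improvement. The fix is simple, but my point is that this problem should never have occurred. If we software vendors design carefully, write SQL carefully, and test carefully during the design, development, and testing phases, in other words if SQL quality audits are part of the process, problems like this can be avoided.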
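
The effect of a missing index on the access path can be reproduced on a small scale. The sketch below uses Python's built-in sqlite3 module, not Oracle, so the plan text and the optimizer are different; it only illustrates the general point that the same query switches from a scan to an index lookup once an index exists. The choice of pay_object_id as the indexed column and the index name are assumptions made for the illustration; the article does not say which columns were indexed.

```python
import sqlite3


def plan_details(conn, sql):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
# A cut-down stand-in for lv_busi_record with only the columns used in the query.
conn.execute(
    "CREATE TABLE lv_busi_record ("
    "busi_bill_sn INTEGER, pay_object_id INTEGER, insr_detail_code TEXT)"
)

QUERY = "SELECT 1 FROM lv_busi_record WHERE pay_object_id = 100"

# Without an index the only option is to read the whole table.
before = plan_details(conn, QUERY)

# Hypothetical index on the correlation column of the NOT EXISTS subquery.
conn.execute("CREATE INDEX idx_lbr_pay_object ON lv_busi_record (pay_object_id)")
after = plan_details(conn, QUERY)

print("before:", before)
print("after: ", after)
```

A check like this, run against every new statement before release, is one cheap form of the SQL quality audit argued for above.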

Oracle 10g Data Guard broker ORA-16607 troubleshooting case

To make Data Guard easier to manage, you can configure the Data Guard broker. The broker was configured as follows:

[oracle@oracle11g ~]$ dgmgrl xxx/xxxxx@xxx
DGMGRL for Linux: Version 10.2.0.5.0 - Production

Copyright (c) 2000, 2005, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.
Connected.
DGMGRL> help

The following commands are available:

add            Add a standby database to the broker configuration
connect        Connect to an Oracle instance
create         Create a broker configuration
disable        Disable a configuration, a database, or Fast-Start Failover
edit           Edit a configuration, database, or instance
enable         Enable a configuration, a database, or Fast-Start Failover
exit           Exit the program
failover       Change a standby database to be the primary database
help           Display description and syntax for a command
quit           Exit the program
reinstate      Change a disabled database into a viable standby database
rem            Comment to be ignored by DGMGRL
remove         Remove a configuration, database, or instance
show           Display information about a configuration, database, or instance
shutdown       Shutdown a currently running Oracle instance
start          Start Fast-Start Failover observer
startup        Start an Oracle database instance
stop           Stop Fast-Start Failover observer
switchover     Switch roles between the primary database and a standby database

Use "help <command>" to see syntax for individual commands

DGMGRL> show configuration
Error: ORA-16532: Data Guard broker configuration does not exist

Configuration details cannot be determined by DGMGRL
DGMGRL> help create

Create a broker configuration

Syntax:

  CREATE CONFIGURATION <configuration name> AS
    PRIMARY DATABASE IS <database name>
    CONNECT IDENTIFIER IS <connect identifier>;

Create the broker configuration

DGMGRL> create configuration 'broker_dg' as primary database is test connect identifier is test;
Configuration "broker_dg" created with primary database "test"
DGMGRL> show configuration

Configuration
  Name:                broker_dg
  Enabled:             NO
  Protection Mode:     MaxPerformance
  Fast-Start Failover: DISABLED
  Databases:
    test - Primary database

Current status for "broker_dg":
DISABLED

DGMGRL> help show configuration

Display information about a configuration, database, or instance

Syntax:

  SHOW CONFIGURATION;

  SHOW DATABASE [VERBOSE] <database name> [<property name>];

  SHOW INSTANCE [VERBOSE] <instance name> [<property name>]
    [ON DATABASE <database name>];

DGMGRL> help add

Add a standby database to the broker configuration

Syntax:

  ADD DATABASE <database name> AS
    CONNECT IDENTIFIER IS <connect identifier>
    MAINTAINED AS {PHYSICAL|LOGICAL};

Add the standby database (physical standby test_dg) to the configuration

DGMGRL> add database test_dg as connect identifier is test_dg maintained as physical;
Database "test_dg" added
DGMGRL> show configuration

Configuration
  Name:                broker_dg
  Enabled:             NO
  Protection Mode:     MaxPerformance
  Fast-Start Failover: DISABLED
  Databases:
    test    - Primary database
    test_dg - Physical standby database

Current status for "broker_dg":
DISABLED

Enable the broker configuration

DGMGRL> enable configuration
Enabled.

Showing the broker configuration now reports the following error:

DGMGRL> show configuration

Configuration
  Name:                broker_dg
  Enabled:             YES
  Protection Mode:     MaxPerformance
  Fast-Start Failover: DISABLED
  Databases:
    test    - Primary database
    test_dg - Physical standby database

Current status for "broker_dg":
Warning: ORA-16607: one or more databases have failed

Show the status report for the primary database test

DGMGRL> show database test statusreport
STATUS REPORT
       INSTANCE_NAME   SEVERITY ERROR_TEXT

Show the status report for the standby database test_dg, which returns the following error:

DGMGRL> show database test_dg statusreport
Error: ORA-16664: unable to receive the result from a remote database

Show the detailed information for the primary database test

DGMGRL> show database verbose test

Database
  Name:            test
  Role:            PRIMARY
  Enabled:         YES
  Intended State:  ONLINE
  Instance(s):
    test

  Properties:
    InitialConnectIdentifier        = 'test'
    ObserverConnectIdentifier       = ''
    LogXptMode                      = 'ASYNC'
    Dependency                      = ''
    DelayMins                       = '0'
    Binding                         = 'OPTIONAL'
    MaxFailure                      = '0'
    MaxConnections                  = '1'
    ReopenSecs                      = '300'
    NetTimeout                      = '180'
    LogShipping                     = 'ON'
    PreferredApplyInstance          = ''
    ApplyInstanceTimeout            = '0'
    ApplyParallel                   = 'AUTO'
    StandbyFileManagement           = 'AUTO'
    ArchiveLagTarget                = '0'
    LogArchiveMaxProcesses          = '10'
    LogArchiveMinSucceedDest        = '1'
    DbFileNameConvert               = '/u03/app/oracle/oradata/test/, /u01/app/oracle/oradata/test/, /u03/app/oracle/oradata/test_ldg/, /u01/app/oracle/oradata/test/'
    LogFileNameConvert              = '/u03/app/oracle/oradata/test/, /u01/app/oracle/oradata/test/, /u03/app/oracle/oradata/test_ldg/, /u01/app/oracle/oradata/test/'
    FastStartFailoverTarget         = ''
    StatusReport                    = '(monitor)'
    InconsistentProperties          = '(monitor)'
    InconsistentLogXptProps         = '(monitor)'
    SendQEntries                    = '(monitor)'
    LogXptStatus                    = '(monitor)'
    RecvQEntries                    = '(monitor)'
    HostName                        = 'xxxxxx'
    SidName                         = 'test'
    LocalListenerAddress            = '(ADDRESS=(PROTOCOL=tcp)(HOST=xxxxxx)(PORT=1521))'
    StandbyArchiveLocation          = '/u02/archive/'
    AlternateLocation               = ''
    LogArchiveTrace                 = '0'
    LogArchiveFormat                = '%t_%s_%r.dbf'
    LatestLog                       = '(monitor)'
    TopWaitEvents                   = '(monitor)'

Current status for "test":
SUCCESS

Show the detailed information for the standby database test_dg, which reports the following error:

DGMGRL> show database verbose test_dg

Database
  Name:            test_dg
  Role:            PHYSICAL STANDBY
  Enabled:         YES
  Intended State:  ONLINE
  Instance(s):
    test_dg

  Properties:
    InitialConnectIdentifier        = 'test_dg'
    ObserverConnectIdentifier       = ''
    LogXptMode                      = 'ASYNC'
    Dependency                      = ''
    DelayMins                       = '0'
    Binding                         = 'OPTIONAL'
    MaxFailure                      = '0'
    MaxConnections                  = '1'
    ReopenSecs                      = '300'
    NetTimeout                      = '180'
    LogShipping                     = 'ON'
    PreferredApplyInstance          = ''
    ApplyInstanceTimeout            = '0'
    ApplyParallel                   = 'AUTO'
    StandbyFileManagement           = 'AUTO'
    ArchiveLagTarget                = '0'
    LogArchiveMaxProcesses          = '2'
    LogArchiveMinSucceedDest        = '1'
    DbFileNameConvert               = '/u03/app/oracle/oradata/test_ldg/, /u03/app/oracle/oradata/test/, /u01/app/oracle/oradata/test/, /u03/app/oracle/oradata/test/'
    LogFileNameConvert              = '/u03/app/oracle/oradata/test_ldg/, /u03/app/oracle/oradata/test/, /u01/app/oracle/oradata/test/, /u03/app/oracle/oradata/test/'
    FastStartFailoverTarget         = ''
    StatusReport                    = '(monitor)'
    InconsistentProperties          = '(monitor)'
    InconsistentLogXptProps         = '(monitor)'
    SendQEntries                    = '(monitor)'
    LogXptStatus                    = '(monitor)'
    RecvQEntries                    = '(monitor)'
    HostName                        = 'jingyong1'
    SidName                         = 'test_dg'
    LocalListenerAddress            = '(ADDRESS=(PROTOCOL=tcp)(HOST=jingyong1)(PORT=1521))'
    StandbyArchiveLocation          = '/u03/app/oracle/archive/'
    AlternateLocation               = ''
    LogArchiveTrace                 = '0'
    LogArchiveFormat                = '%t_%s_%r.dbf'
    LatestLog                       = '(monitor)'
    TopWaitEvents                   = '(monitor)'

Current status for "test_dg":
Error: ORA-16664: unable to receive the result from a remote database

The fault is clearly on the physical standby test_dg. Check the standby's broker log drctest_dg.log, which in Oracle 10g is stored in the bdump directory:

DG 2015-08-04-17:07:48        0 2 0 NSV0: Failed to connect to remote database test. Error is ORA-12514
DG 2015-08-04-17:07:48        0 2 0 NSV0: Failed to send message to site test. Error code is ORA-12514.
DG 2015-08-04-17:07:48        0 2 0 DMON: Database test returned ORA-12514
DG 2015-08-04-17:07:48        0 2 0       for opcode = CTL_GET_STATUS, phase = BEGIN, req_id = 1.1.886847999
DG 2015-08-04-17:07:59        0 2 0 RSM 0 received GETPROP request: rid=0x02010000, pid=54
DG 2015-08-04-17:07:59        0 2 0 Database Resource: Get Property InconsistentProperties
DG 2015-08-04-17:07:59        0 2 0 RSM Warning: Property 'ArchiveLagTarget' has inconsistent values:METADATA='0', SPFILE='', DATABASE='0'
DG 2015-08-04-17:07:59        0 2 0 RSM0: HEALTH CHECK WARNING: ORA-16714: the value of property ArchiveLagTarget is inconsistent with the database setting
DG 2015-08-04-17:07:59        0 2 0 RSM Warning: Property 'LogArchiveMaxProcesses' has inconsistent values:METADATA='2', SPFILE='', DATABASE='2'
DG 2015-08-04-17:07:59        0 2 0 RSM0: HEALTH CHECK WARNING: ORA-16714: the value of property LogArchiveMaxProcesses is inconsistent with the database setting
DG 2015-08-04-17:07:59        0 2 0 RSM Warning: Property 'LogArchiveMinSucceedDest' has inconsistent values:METADATA='1', SPFILE='', DATABASE='1'
DG 2015-08-04-17:07:59        0 2 0 RSM0: HEALTH CHECK WARNING: ORA-16714: the value of property LogArchiveMinSucceedDest is inconsistent with the database setting
DG 2015-08-04-17:07:59        0 2 0 SPFILE is missing value for property 'LogArchiveTrace' with sid='test_dg'
DG 2015-08-04-17:07:59        0 2 0 RSM Warning: Property 'LogArchiveTrace' has inconsistent values:METADATA='0', SPFILE='(missing)', DATABASE='0'
DG 2015-08-04-17:07:59        0 2 0 RSM0: HEALTH CHECK WARNING: ORA-16714: the value of property LogArchiveTrace is inconsistent with the database setting
DG 2015-08-04-17:07:59        0 2 0 SPFILE is missing value for property 'LogArchiveFormat' with sid='test_dg'
DG 2015-08-04-17:07:59        0 2 0 RSM Warning: Property 'LogArchiveFormat' has inconsistent values:METADATA='%t_%s_%r.dbf', SPFILE='(missing)', DATABASE='%t_%s_%r.dbf'
DG 2015-08-04-17:07:59        0 2 0 RSM0: HEALTH CHECK WARNING: ORA-16714: the value of property LogArchiveFormat is inconsistent with the database setting
DG 2015-08-04-17:07:59        0 2 0 Database Resource GetProperty succeeded
DG 2015-08-04-17:07:59  2010000 4 886848003 DMON: MON_PROPERTY operation completed
DG 2015-08-04-17:07:59        0 2 0 NSV0: Failed to connect to remote database test. Error is ORA-12514
DG 2015-08-04-17:07:59        0 2 0 NSV0: Failed to send message to site test. Error code is ORA-12514.
DG 2015-08-04-17:07:59        0 2 0 DMON: Database test returned ORA-12514
DG 2015-08-04-17:07:59        0 2 0       for opcode = MON_PROPERTY, phase = NULL, req_id = 1.1.886848003
DG 2015-08-04-17:08:03        0 2 0 DRCX: could not find task req_id=1.1.886847999 for PROBE.

The following entries stand out in the log above:

RSM Warning: Property 'ArchiveLagTarget' has inconsistent values:METADATA='0', SPFILE='', DATABASE='0'
DG 2015-08-04-17:07:59        0 2 0 RSM0: HEALTH CHECK WARNING: ORA-16714: the value of property ArchiveLagTarget is inconsistent with the database setting
DG 2015-08-04-17:07:59        0 2 0 RSM Warning: Property 'LogArchiveMaxProcesses' has inconsistent values:METADATA='2', SPFILE='', DATABASE='2'
DG 2015-08-04-17:07:59        0 2 0 RSM0: HEALTH CHECK WARNING: ORA-16714: the value of property LogArchiveMaxProcesses is inconsistent with the database setting
DG 2015-08-04-17:07:59        0 2 0 RSM Warning: Property 'LogArchiveMinSucceedDest' has inconsistent values:METADATA='1', SPFILE='', DATABASE='1'
DG 2015-08-04-17:07:59        0 2 0 RSM0: HEALTH CHECK WARNING: ORA-16714: the value of property LogArchiveMinSucceedDest is inconsistent with the database setting
DG 2015-08-04-17:07:59        0 2 0 SPFILE is missing value for property 'LogArchiveTrace' with sid='test_dg'
DG 2015-08-04-17:07:59        0 2 0 RSM Warning: Property 'LogArchiveTrace' has inconsistent values:METADATA='0', SPFILE='(missing)', DATABASE='0'
DG 2015-08-04-17:07:59        0 2 0 RSM0: HEALTH CHECK WARNING: ORA-16714: the value of property LogArchiveTrace is inconsistent with the database setting
DG 2015-08-04-17:07:59        0 2 0 SPFILE is missing value for property 'LogArchiveFormat' with sid='test_dg'
DG 2015-08-04-17:07:59        0 2 0 RSM Warning: Property 'LogArchiveFormat' has inconsistent values:METADATA='%t_%s_%r.dbf', SPFILE='(missing)', DATABASE='%t_%s_%r.dbf'
DG 2015-08-04-17:07:59        0 2 0 RSM0: HEALTH CHECK WARNING: ORA-16714: the value of property LogArchiveFormat is inconsistent with the database setting

The entry
'ArchiveLagTarget' has inconsistent values:METADATA='0', SPFILE='', DATABASE='0'
means that for archive_lag_target the value in the spfile (empty) differs from the database and metadata values (both 0).

'LogArchiveMaxProcesses' has inconsistent values:METADATA='2', SPFILE='', DATABASE='2' means that for log_archive_max_processes the value in the spfile differs from the database and metadata values (both 2).

'LogArchiveMinSucceedDest' has inconsistent values:METADATA='1', SPFILE='', DATABASE='1' means that for log_archive_min_succeed_dest the value in the spfile differs from the database and metadata values (both 1).

'LogArchiveTrace' has inconsistent values:METADATA='0', SPFILE='(missing)', DATABASE='0' means that for log_archive_trace the value is missing from the spfile, while the database and metadata values are both 0.

'LogArchiveFormat' has inconsistent values:METADATA='%t_%s_%r.dbf', SPFILE='(missing)', DATABASE='%t_%s_%r.dbf' means that for log_archive_format the value is missing from the spfile, while the database and metadata values are both '%t_%s_%r.dbf'.
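
Picking these warnings out of a long broker log by hand is error-prone. As a rough sketch, they can be extracted with a regular expression built from the message format seen in this log; the format may differ in other Oracle versions, and the function names are hypothetical.

```python
import re

# Built from the "RSM Warning" lines in the drc log shown above.
WARNING_RE = re.compile(
    r"Property '(?P<name>\w+)' has inconsistent values:"
    r"METADATA='(?P<metadata>[^']*)', "
    r"SPFILE='(?P<spfile>[^']*)', "
    r"DATABASE='(?P<database>[^']*)'"
)


def inconsistent_properties(log_lines):
    """Return {property: {'metadata': ..., 'spfile': ..., 'database': ...}}."""
    found = {}
    for line in log_lines:
        match = WARNING_RE.search(line)
        if match:
            found[match.group("name")] = {
                key: match.group(key) for key in ("metadata", "spfile", "database")
            }
    return found


def spfile_only_mismatches(found):
    """Names of properties where only the spfile value is the odd one out."""
    return sorted(
        name
        for name, values in found.items()
        if values["metadata"] == values["database"]
        and values["spfile"] != values["database"]
    )


# Two lines copied from the log above.
SAMPLE = [
    "DG 2015-08-04-17:07:59        0 2 0 RSM Warning: Property 'ArchiveLagTarget' has inconsistent values:METADATA='0', SPFILE='', DATABASE='0'",
    "DG 2015-08-04-17:07:59        0 2 0 RSM Warning: Property 'LogArchiveFormat' has inconsistent values:METADATA='%t_%s_%r.dbf', SPFILE='(missing)', DATABASE='%t_%s_%r.dbf'",
]

if __name__ == "__main__":
    print(spfile_only_mismatches(inconsistent_properties(SAMPLE)))
```

All five properties in this case fall into the "spfile only" group, which is why the fix that follows touches only the spfile.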

Correct the inconsistent parameters listed above:

SQL> alter system set log_archive_max_processes=2 scope=spfile;

System altered.

SQL> alter system set archive_lag_target=0 scope=spfile;

System altered.

SQL> alter system set log_archive_min_succeed_dest=1 scope=spfile;

System altered.

SQL> alter system set log_archive_trace=0 scope=spfile;

System altered.


SQL> alter system set log_archive_format='%t_%s_%r.dbf' scope=spfile;

System altered.
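The ALTER SYSTEM statements above can also be derived mechanically from the broker log. The following is a minimal Python sketch, not part of the original procedure: the names spfile_fixes and PROPERTY_TO_PARAMETER are hypothetical, and the mapping covers only the five properties discussed here. It emits a statement only when the SPFILE value is empty or "(missing)" and METADATA and DATABASE agree; anything else is left for manual review.

```python
import re

# Broker property -> init parameter. Assumption: limited to the five
# properties discussed in this post; other properties are skipped.
PROPERTY_TO_PARAMETER = {
    "ArchiveLagTarget": "archive_lag_target",
    "LogArchiveMaxProcesses": "log_archive_max_processes",
    "LogArchiveMinSucceedDest": "log_archive_min_succeed_dest",
    "LogArchiveTrace": "log_archive_trace",
    "LogArchiveFormat": "log_archive_format",
}

# Matches the "RSM Warning" lines written to the broker log.
WARNING_RE = re.compile(
    r"Property '(?P<prop>\w+)' has inconsistent values:"
    r"\s*METADATA='(?P<metadata>[^']*)', "
    r"SPFILE='(?P<spfile>[^']*)', "
    r"DATABASE='(?P<database>[^']*)'"
)


def spfile_fixes(log_text):
    """Return ALTER SYSTEM statements for properties missing from the spfile."""
    statements = []
    for m in WARNING_RE.finditer(log_text):
        param = PROPERTY_TO_PARAMETER.get(m.group("prop"))
        if param is None:
            continue  # property outside the mapping: review by hand
        if m.group("spfile") not in ("", "(missing)"):
            continue  # spfile holds a real, different value: review by hand
        if m.group("metadata") != m.group("database"):
            continue  # broker and instance disagree: review by hand
        value = m.group("database")
        # Numeric values are written bare, everything else is quoted.
        literal = value if value.isdigit() else "'" + value + "'"
        stmt = "alter system set {}={} scope=spfile;".format(param, literal)
        if stmt not in statements:
            statements.append(stmt)
    return statements


SAMPLE_LOG = """\
DG 2015-08-04-17:07:59        0 2 0 RSM Warning: Property 'LogArchiveTrace' has inconsistent values:METADATA='0', SPFILE='(missing)', DATABASE='0'
DG 2015-08-04-17:07:59        0 2 0 RSM Warning: Property 'LogArchiveFormat' has inconsistent values:METADATA='%t_%s_%r.dbf', SPFILE='(missing)', DATABASE='%t_%s_%r.dbf'
"""

if __name__ == "__main__":
    for statement in spfile_fixes(SAMPLE_LOG):
        print(statement)
```

Run against the two sample warning lines taken from the log above, the sketch produces the same log_archive_trace and log_archive_format statements that were entered by hand. The generated statements should still be reviewed before they are run.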

Check the broker configuration again:

DGMGRL> show database verbose test_dg

Database
  Name:            test_dg
  Role:            PHYSICAL STANDBY
  Enabled:         YES
  Intended State:  ONLINE
  Instance(s):
    test_dg

  Properties:
    InitialConnectIdentifier        = 'test_dg'
    ObserverConnectIdentifier       = ''
    LogXptMode                      = 'ASYNC'
    Dependency                      = ''
    DelayMins                       = '0'
    Binding                         = 'OPTIONAL'
    MaxFailure                      = '0'
    MaxConnections                  = '1'
    ReopenSecs                      = '300'
    NetTimeout                      = '180'
    LogShipping                     = 'ON'
    PreferredApplyInstance          = ''
    ApplyInstanceTimeout            = '0'
    ApplyParallel                   = 'AUTO'
    StandbyFileManagement           = 'AUTO'
    ArchiveLagTarget                = '0'
    LogArchiveMaxProcesses          = '2'
    LogArchiveMinSucceedDest        = '1'
    DbFileNameConvert               = '/u03/app/oracle/oradata/test_ldg/, /u03/app/oracle/oradata/test/, /u01/app/oracle/oradata/test/, /u03/app/oracle/oradata/test/'
    LogFileNameConvert              = '/u03/app/oracle/oradata/test_ldg/, /u03/app/oracle/oradata/test/, /u01/app/oracle/oradata/test/, /u03/app/oracle/oradata/test/'
    FastStartFailoverTarget         = ''
    StatusReport                    = '(monitor)'
    InconsistentProperties          = '(monitor)'
    InconsistentLogXptProps         = '(monitor)'
    SendQEntries                    = '(monitor)'
    LogXptStatus                    = '(monitor)'
    RecvQEntries                    = '(monitor)'
    HostName                        = 'jingyong1'
    SidName                         = 'test_dg'
    LocalListenerAddress            = '(ADDRESS=(PROTOCOL=tcp)(HOST=jingyong1)(PORT=1521))'
    StandbyArchiveLocation          = '/u03/app/oracle/archive/'
    AlternateLocation               = ''
    LogArchiveTrace                 = '0'
    LogArchiveFormat                = '%t_%s_%r.dbf'
    LatestLog                       = '(monitor)'
    TopWaitEvents                   = '(monitor)'

Current status for "test_dg":
SUCCESS
DGMGRL> show configuration

Configuration
  Name:                broker_dg
  Enabled:             YES
  Protection Mode:     MaxPerformance
  Fast-Start Failover: DISABLED
  Databases:
    test    - Primary database
    test_dg - Physical standby database

Current status for "broker_dg":
SUCCESS

The database information in the broker configuration is now displayed successfully.

Oracle 10g: resolving ORA-19953 when converting a physical standby to a logical standby

The environment is Red Hat Linux 5.4 x86-64 with Oracle 10.2.0.5. Converting a physical standby to a logical standby failed with ORA-19953:

SQL> alter database recover to logical standby test;
alter database recover to logical standby test
*
ERROR at line 1:
ORA-19953: database should not be open

The alert.log contains the following:

Incomplete Recovery applied until change 720500
Sun Jun 28 19:50:45 CST 2015
Media Recovery Complete (test_ldg)
Begin: Standby Redo Logfile archival
End: Standby Redo Logfile archival
RESETLOGS after incomplete recovery UNTIL CHANGE 720500
Resetting resetlogs activation ID 2174774786 (0x81a06e02)
Online log /u03/app/oracle/oradata/test_ldg/redo01.log: Thread 1 Group 1 was previously cleared
Online log /u03/app/oracle/oradata/test_ldg/redo02.log: Thread 1 Group 2 was previously cleared
Online log /u03/app/oracle/oradata/test_ldg/redo03.log: Thread 1 Group 3 was previously cleared
Standby became primary SCN: 720498
Sun Jun 28 19:50:48 CST 2015
Setting recovery target incarnation to 3
Sun Jun 28 19:50:48 CST 2015
ACTIVATE STANDBY: Complete - Database shutdown required (test_ldg)
Sun Jun 28 19:50:48 CST 2015
ORA-19953 signalled during: alter database recover to logical standby test...

MOS has a bug filed for this problem (Bug ID 9207121). Its content is as follows:

Type:                     B - Defect
Fixed in Product Version:
Severity:                 2 - Severe Loss of Service
Product Version:          10.2.0.4
Status:                   33 - Suspended, Req'd Info not Avail
Platform:                 226 - Linux x86-64
Created:                  11-Dec-2009
Platform Version:         RED HAT ENTERPRISE LINUX 5
Updated:                  05-Feb-2015
Base Bug:                 N/A
Database Version:         10.2.0.4
Affects Platforms:        Generic
Product Source:           Oracle

Related Products

Line:                     Oracle Database Products
Family:                   Oracle Database Suite
Area:                     Oracle Database
Product:                  5 - Oracle Database - Enterprise Edition

Hdr: 9207121 10.2.0.4 RDBMS 10.2.0.4 DATAGUARD_LSBY PRODID-5 PORTID-226 ORA-19953
Abstract: ORA-19953 CREATING LOGICAL STANDBY

*** 12/11/09 12:35 pm ***

PROBLEM:
--------
ct has a 3-node RAC primary(db_name=TCIP, unique_name=TCIP)
and a single node physical standby db_name=TCIP,unique_name=TCIPvl) using
spfile.

Converting this physical standby to logical standby failed.
When executing on the standby side
SQL> alter database recover to logical standby TCIPvl;
the db_name in the spfile is not changed to TCIPvl.




DIAGNOSTIC ANALYSIS:
--------------------
The following outlines the steps:
- Verified that primary and physical standby are in sync. (around 2009 12/11
12:30)
- stopped recovery at physical standby (Fri Dec 11 12:35:10 2009)
- build dictionary on primary  (Fri Dec 11 12:55:29 2009 log seq 9976)
   SQL> DBMS_LOGSTDBY.BUILD;
- switched logs on primary (all instances 3 times)
- verified on the standby side that the logs containing dictionary
information were archived and arrived (but not applied) on the standby
- executed "alter database recover to logical standby TCIPvl" on standby (Fri
Dec 11 13:05:35 2009)
- the above SQL did not show any errors on the screen. However I noticed the
following:
. the db_name was not changed in spfile.  (verified using pfile create
pfile='/tmp/whatever.ora" from spfile)
. the standby's alert log shows ORA-19953.
. did not see the following message in the alert log.
    *** DBNEWID utility started ***
     DBID will be changed from 3890508598 to new DBID of 70593532 for
database ORCL10
     DBNAME will be changed from ORCL10 to new DBNAME of ORCL10S
     Starting datafile conversion
    ...
- verified that spfile is writable as the changes to archive_dest_3 was
effective in spfile.
- performed "alter system set db_name='TCIPvl' scope=spfile sid='*' ' on
standby
- shutdown standby, then startup mount
  got ORA-1103 "database name '%s' in control file is not '%s' on the command
line.

WORKAROUND:
-----------

RELATED BUGS:
-------------

REPRODUCIBILITY:
----------------
at ct site.

TEST CASE:
----------

STACK TRACE:
------------

SUPPORTING INFORMATION:
-----------------------
- alert logs from primary and standby, as well as the pfile from the standby
after "recover to logical standy.." was excuted.
- The converting physical-> logical work was done between 2009 12/11 12:30 -
13:10

24 HOUR CONTACT INFORMATION FOR P1 BUGS:
----------------------------------------

DIAL-IN INFORMATION:
--------------------

IMPACT DATE:
------------

*** 12/11/09 12:58 pm ***
*** 12/11/09 12:58 pm *** (CHG: Sta->16)
*** 12/11/09 01:00 pm *** (CHG: Sta->10)
*** 01/08/10 12:44 pm ***
*** 01/12/10 10:55 am *** (CHG: Sta->33)
*** 02/04/15 11:54 pm ***
*** 02/04/15 11:54 pm ***
*** 02/04/15 11:54 pm ***

The bug is filed against 10.2.0.4 on Linux x86-64, whereas my environment is 10.2.0.5, but the symptoms are the same. The diagnostic steps given in the bug are:

The following outlines the steps:
- Verified that primary and physical standby are in sync. (around 2009 12/11
12:30)
- stopped recovery at physical standby (Fri Dec 11 12:35:10 2009)
- build dictionary on primary  (Fri Dec 11 12:55:29 2009 log seq 9976)
   SQL> DBMS_LOGSTDBY.BUILD;
- switched logs on primary (all instances 3 times)

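After running DBMS_LOGSTDBY.BUILD on the primary to build the data dictionary, switch logs three times on the primary (there are three redo log groups by default; on RAC, do this three times on every instance) to make sure the dictionary is shipped to the physical standby. Then rerun the conversion on the standby.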

SQL> alter system switch logfile;

System altered.

SQL> alter system switch logfile;

System altered.

SQL> alter system switch logfile;

System altered.

SQL> alter database recover to logical standby test;

Database altered.

The conversion succeeded. The alert.log contains the following:

alter database recover to logical standby test
Sun Jun 28 20:12:29 CST 2015
Media Recovery Start: Managed Standby Recovery (test_ldg)
Sun Jun 28 20:12:29 CST 2015
Managed Standby Recovery not using Real Time Apply
Media Recovery Log /u03/app/oracle/archive/test_ldg/1_71_876665479.dbf
Media Recovery Log /u03/app/oracle/archive/test_ldg/1_72_876665479.dbf
Media Recovery Log /u03/app/oracle/archive/test_ldg/1_73_876665479.dbf
Sun Jun 28 20:12:31 CST 2015
Incomplete Recovery applied until change 722225
Sun Jun 28 20:12:31 CST 2015
Media Recovery Complete (test_ldg)
Begin: Standby Redo Logfile archival
End: Standby Redo Logfile archival
RESETLOGS after incomplete recovery UNTIL CHANGE 722225
Resetting resetlogs activation ID 2174774786 (0x81a06e02)
Online log /u03/app/oracle/oradata/test_ldg/redo01.log: Thread 1 Group 1 was previously cleared
Online log /u03/app/oracle/oradata/test_ldg/redo02.log: Thread 1 Group 2 was previously cleared
Online log /u03/app/oracle/oradata/test_ldg/redo03.log: Thread 1 Group 3 was previously cleared
Standby became primary SCN: 722223
Sun Jun 28 20:12:34 CST 2015
Setting recovery target incarnation to 3
Sun Jun 28 20:12:34 CST 2015
Converting standby mount to primary mount.
Sun Jun 28 20:12:34 CST 2015
ACTIVATE STANDBY: Complete - Database mounted as primary (test_ldg)
*** DBNEWID utility started ***
DBID will be changed from 2174811906 to new DBID of 2181762994 for database TEST
DBNAME will be changed from TEST to new DBNAME of TEST
Starting datafile conversion
kcv_lh_or_upgrade: 10.2 upgrading 1 incarnations
Setting recovery target incarnation to 1
Datafile conversion complete
Failed to find temporary file: /u03/app/oracle/oradata/test_ldg/temp01.dbf
Database name changed to TEST.
Modify parameter file and generate a new password file before restarting.
Database ID for database TEST changed to 2181762994.
All previous backups and archived redo logs for this database are unusable.
Database has been shutdown, open with RESETLOGS option.
Succesfully changed database name and ID.
*** DBNEWID utility finished succesfully ***
Completed: alter database recover to logical standby test
Sun Jun 28 20:12:44 CST 2015
destination database instance is 'started' not 'mounted'

The line "Completed: alter database recover to logical standby test" above confirms that the test database was converted from a physical standby to a logical standby.
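The two alert.log excerpts in this post differ in one telling line: the failed attempt ends with "ORA-19953 signalled during: alter database recover to logical standby test...", while the successful one ends with "Completed: alter database recover to logical standby test". A minimal Python sketch of a check built on that difference follows. The function name conversion_status and the status strings are hypothetical, and it only looks at these two message shapes.

```python
import re

# Shape of the failure line seen in the first alert.log excerpt.
ERROR_RE = re.compile(
    r"^(ORA-\d+) signalled during: "
    r"alter database recover to logical standby",
    re.IGNORECASE,
)
# Shape of the success line seen in the second alert.log excerpt.
COMPLETED_PREFIX = "completed: alter database recover to logical standby"


def conversion_status(alert_text):
    """Classify the most recent 'recover to logical standby' attempt."""
    status = "not attempted"
    for raw_line in alert_text.splitlines():
        line = raw_line.strip()
        match = ERROR_RE.match(line)
        if match:
            # Later lines override earlier ones, so the last attempt wins.
            status = "failed: " + match.group(1)
        elif line.lower().startswith(COMPLETED_PREFIX):
            status = "completed"
    return status


if __name__ == "__main__":
    failed = "ORA-19953 signalled during: alter database recover to logical standby test..."
    succeeded = "Completed: alter database recover to logical standby test"
    print(conversion_status(failed))
    print(conversion_status(failed + "\n" + succeeded))
```

Because later lines override earlier ones, a log that contains the failed attempt followed by the successful retry, as in this case, is reported as completed.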