ORACLE 10G RAC 监听服务异常中断(offline)分析
作者:
1 问题描述
管理人员发现Oracle10g RAC集群第一个节点主机上的 ora.apprac1.LISTENER_apprac1.lsnr与ora.apprac1.ons服务处于offline状态,把这两个服务手工重启后隔一段这两个服务又自动变成offline。
[@more@]
2 故障分析
2.1 检查数据库运行状态
mailto:oracle@apprac1$]oracle@apprac1$ crstat
HA Resource Target State
----------- ------ -----
ora.apprac1.LISTENER_apprac1.lsnr OFFLINE OFFLINE
ora.apprac1.ons ONLINE OFFLINE
ora.apprac1.vip ONLINE ONLINE on apprac1
ora.apprac2.LISTENER_apprac2.lsnr ONLINE ONLINE on apprac2
ora.apprac2.ons ONLINE ONLINE on apprac2
ora.apprac2.vip ONLINE ONLINE on apprac2
ora.dgyd.db ONLINE ONLINE on apprac1
ora.dgyd.inst1.inst ONLINE ONLINE on apprac1
ora.dgyd.dgyd2.inst ONLINE ONLINE on apprac2
故障现象是ora.apprac1.LISTENER_apprac1.lsnr服务与ora.apprac1.ons处于offline状态。
2.2 数据库报警日志分析
alert_inst1.log
Sun Feb 28 21:58:31 2010
Process startup failed, error stack:
Sun Feb 28 21:58:31 2010
Errors in file /oracle/admin/dgyd/bdump/inst1_psp0_356718.trc:
ORA-27300: OS system dependent operation:fork failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgpspawn5
ORA-27303: additional information: skgpspawn5
Sun Feb 28 21:58:32 2010
Process m000 died, see its trace file
Sun Feb 28 21:58:32 2010
ksvcreate: Process(m000) creation failed
Sun Feb 28 21:59:58 2010
Process startup failed, error stack:
Sun Feb 28 21:59:58 2010
Errors in file /oracle/admin/dgyd/bdump/inst1_psp0_356718.trc:
ORA-27300: OS system dependent operation:fork failed with status: 2
ORA-27301: OS failure message: No such file or directory
ORA-27302: failure occurred at: skgpspawn5
ORA-27303: additional information: skgpspawn5
Sun Feb 28 21:59:59 2010
Process J000 died, see its trace file
Sun Feb 28 21:59:59 2010
kkjcre1p: unable to spawn jobq slave process
Sun Feb 28 21:59:59 2010
Errors in file /oracle/admin/dgyd/bdump/inst1_cjq0_184864.trc:
从数据库日志可以看出:在故障发生期间,由于OS资源原因,oracle数据库的jobq slave进程无法启动。
2.3 操作系统资源使用情况分析
mailto:oracle@apprac1$]oracle@apprac1$ vmstat 1 20
System configuration: lcpu=12 mem=45312MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
3 0 9861313 9273 0 0 0 0 0 0 809 101745 6087 18 3 79 0
0 0 9861327 9259 0 0 0 0 0 0 1694 119757 9918 21 3 75 0
1 0 9834443 36207 0 39 0 44 543 0 411 44502 4505 7 3 87 2
0 0 9836958 33116 0 6 0 0 0 0 741 271949 25123 15 5 79 1
1 0 9834389 35685 0 0 0 0 0 0 376 92789 9585 17 2 81 0
1 0 9834382 35691 0 0 0 0 0 0 553 95995 5002 17 2 80 0
2 0 9834381 35692 0 0 0 0 0 0 259 86042 4031 14 2 84 0
1 0 9834378 35599 0 6 0 0 0 0 700 120049 6419 33 3 64 0
1 0 9834377 35600 0 0 0 0 0 0 282 99036 4170 16 2 82 0
0 0 9834371 35606 0 0 0 0 0 0 433 100106 5049 20 2 78 0
1 0 9834377 35536 0 4 0 0 0 0 791 111374 5407 24 3 73 0
0 0 9834377 35536 0 0 0 0 0 0 1143 111069 6707 26 3 70 0
2 0 9834379 35502 0 2 0 0 0 0 1162 103518 6228 20 3 75 2
1 0 9834379 35470 0 2 0 0 0 0 1178 101210 5693 19 3 73 5
从vmstat结果可以看到:在故障发生期间,操作系统可用内存非常少。
2.4 检查主机配置
prtconf
System Model: IBM,9117-570
Machine Serial Number: 65DCAEC
Processor Type: PowerPC_POWER5
Number Of Processors: 6
Processor Clock Speed: 1900 MHz
CPU Type: 64-bit
Kernel Type: 64-bit
LPAR Info: 2 apprac1
Memory Size: 45312 MB
Good Memory Size: 45312 MB
Platform Firmware level: SF240_284
Firmware Version: IBM,SF240_284
Console Login: enable
Auto Restart: true
Full Core: false
主机内存:45312 MB
2.5 数据库内存参数
sga_target = 32212254720
pga_aggregate_target = 4294967296
数据库SGA+PGA = 30G+4G = 34G
2.6 数据库监听日志
listener_apprac1.log
(CONNECT_DATA=(SID=inst1)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))) * (ADDRESS=(
PROTOCOL=tcp)(HOST=192.168.100.8)(PORT=3476)) * establish * inst1 * 12518
TNS-12518: TNS:listener could not hand off client connection
TNS-12549: TNS:operating system resource quota exceeded
TNS-12560: TNS:protocol adapter error
TNS-00519: Operating system resource quota exceeded
IBM/AIX RISC System/6000 Error: 12: Not enough space
01-MAR-2010 16:59:52 * (CONNECT_DATA=(SID=inst1)(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))) * (ADDRESS=(
PROTOCOL=tcp)(HOST=192.168.100.8)(PORT=3477)) * establish * inst1 * 12518
TNS-12518: TNS:listener could not hand off client connection
TNS-12549: TNS:operating system resource quota exceeded
TNS-12560: TNS:protocol adapter error
TNS-00519: Operating system resource quota exceeded
IBM/AIX RISC System/6000 Error: 12: Not enough space
从数据库监听日志可以看到:在故障发生期间,操作系统可用内存不足导致监听程序无法spawn服务器进程。
2.7 ora.apprac1.LISTENER_apprac1.lsnr服务日志
ora.apprac1.LISTENER_apprac1.lsnr.log
2010-02-28 20:40:16.504: [ RACG][1] [582038][1][ora.apprac1.LISTENER_apprac1.lsnr]: clsrcexecut: scls_process_spawn() error, status = 1, command = /oracle/product/db/bin/racgeut
Category: 1234
Operation: scls_process_spawn
Location: read_pipe
Other: EOF on read pipe
OS Error: Not enough space
2010-02-28 20:40:16.508: [ RACG][1] [582038][1][ora.apprac1.LISTENER_apprac1.lsnr]: end for resource = ora.apprac1.LISTENER_apprac1.lsnr, action = check, status = 1, time = 0.136s
从ora.apprac1.LISTENER_apprac1.lsnr.log日志可以看到:在故障发生期间,由于OS资源原因,监听程序无法spawn服务器进程。
2.8 数据库集群进程crsd.log日志
crsd.log
2010-02-28 20:38:18.396: [ CRSEVT][629]32CAAMonitorHandler :: 0:Could not execute /oracle/product/crs/bin/racgwrap(check) for ora.dgyd.db
category: 1234, operation: scls_process_spawn, loc: read_pipe, OS error: 12, other: EOF on read pipe
2010-02-28 20:38:18.406: [ CRSAPP][629]32CheckResource error for ora.dgyd.db error code = -1
2010-02-28 20:40:16.514: [ CRSAPP][636]32CheckResource error for ora.apprac1.LISTENER_apprac1.lsnr error code = 1
2010-02-28 20:40:16.754: [ CRSRES][636]32In stateChanged, ora.apprac1.LISTENER_apprac1.lsnr target is ONLINE
2010-02-28 20:40:16.796: [ CRSRES][636]32ora.apprac1.LISTENER_apprac1.lsnr on apprac1 went OFFLINE unexpectedly
……
2010-02-28 20:45:51.989: [ CRSAPP][653]32CheckResource error for ora.apprac1.ons error code = 1
2010-02-28 20:45:51.991: [ CRSRES][653]32In stateChanged, ora.apprac1.ons target is ONLINE
2010-02-28 20:45:51.991: [ CRSRES][653]32ora.apprac1.ons on apprac1 went OFFLINE unexpectedly
……
从crsd.log日志可以看出:在故障发生期间,监听程序无法spawn服务器进程导致ora.apprac1.LISTENER_apprac1.lsnr服务异常中断offline,由于该服务offline又进一步导致ora.apprac1.ons服务异常中断offline 。
3 故障解决方法
故障解决思路:降低数据库SGA内存占用量,以留出更多的内存给数据库服务器进程使用,并重启数据库及操作系统,以释放文件型内存。
操作步骤如下:
1. 修改sga_target参数,从30G修改为26G
alter system set sga_target=26G scope=spfile sid='inst1';
2. 关闭数据库
shutdown immediate
3. 关闭数据库集群
crsctl stop crs
4. 关闭HACMP
smitty clstop
5. 重启数据库服务器
shutdown -Fr
6. 启动HACMP
smitty clstart
7. 启动数据库集群
crsctl start crs
8. 启动数据库
startup
进行以上处理后,ora.apprac1.LISTENER_apprac1.lsnr与ora.apprac1.ons服务恢复正常。
4 总结及建议
通过对出现故障时数据库运行状态、数据库日志、数据库集群日志、主机性能、主机日志等综合分析后认为:故障发生时操作系统可用内存(free)非常少,数据库监听程序没能申请到足够的内存去spawn服务器进程,导致ora.apprac1.LISTENER_apprac1.lsnr服务异常中断(offline),由于该服务offline又进一步导致了ora.apprac1.ons服务异常中断(offline) 。
通常情况下,在数据库服务器中占用系统内存的组件包括:数据库SGA、数据库PGA、数据库服务器进程、文件系统缓存、操作系统内核等。必须要合理设置数据库SGA及PGA参数,以留出足够的内存给数据库服务器进程(与客户端连接请求一一对应)。
metalink工程师给出的建议如下:
ACTION PLAN
=========
Hi,
1. From the current collected diagnose information...the cause of this issue may be that there is not enough free memory on OS.
2. The possible reasons that OS memory is not enough are the below:
-- Memory leak occurred.
-- OS memory was occupied by some other processes which used a lot of memory.
3. I have checked and researched furtherly...but regret that can not find the relevant known issues or Bugs which reported the issue about memory leak on 10.2.0.4 version.
So please help to try to follow up the below action plans in order to avoid this issue occurring again in the future:
1). Try to decrease the SGA size...so that more memory can be given to OS.
2). Monitor the memory usage to check and find which processes are costing a lot of memory by the below methods:
-- ps -efo "vsz,pid,user,cpu,thcount,comm" | sort -n
In addition...the following 2 Metalink/MOS Notes are for your reference...You can review them from Metalink/MOS website:
-- Note 477522.1:How To Troubleshoot Memory Leaks on UNIX
-- Note 835254.1:Script To Monitor RDBMS Session UGA and PGA Current And