Diagnosing "IPC Send timeout" Issues on Oracle RAC Databases (repost)
Symptoms of IPC Send timeout
A fairly common problem on Oracle RAC databases is the "IPC Send timeout". Once "IPC Send timeout" appears in the database alert log, it is often accompanied by ORA-29740 or "Waiting for clusterware split-brain resolution", and a database instance may terminate abnormally or be evicted from the cluster.
For example:
Instance 1's alert log:
Thu Jul 02 05:24:50 2012
IPC Send timeout detected. Sender: ospid 6143755 <== sender
Receiver: inst 2 binc 1323620776 ospid 49715160 <== receiver
Thu Jul 02 05:24:51 2012
IPC Send timeout to 1.7 inc 120 for msg type 65516 from opid 13
Thu Jul 02 05:24:51 2012
Communications reconfiguration: instance_number 2
Waiting for clusterware split-brain resolution <== split brain detected
Thu Jul 02 05:24:51 2012
Trace dumping is performing id=[cdmp_20120702052451]
Thu Jul 02 05:34:51 2012
Evicting instance 2 from cluster <== ten minutes later, instance 2 is evicted from the cluster

Instance 2's alert log:
Thu Jul 02 05:24:50 2012
IPC Send timeout detected. Receiver ospid 49715160 <== receiver
Thu Jul 02 05:24:50 2012
Errors in file /u01/oracle/product/admin/sales/bdump/sales2_lms6_49715160.trc:
Thu Jul 02 05:24:51 2012
Waiting for clusterware split-brain resolution
Thu Jul 02 05:24:51 2012
Trace dumping is performing id=[cdmp_20120702052451]
Thu Jul 02 05:35:02 2012
Errors in file /u01/oracle/product/admin/sales/bdump/sales2_lmon_6257780.trc:
ORA-29740: evicted by member 0, group incarnation 122 <== instance 2 hits ORA-29740 and is evicted from the cluster
Thu Jul 02 05:35:02 2012
LMON: terminating instance due to error 29740
Thu Jul 02 05:35:02 2012
Errors in file /u01/oracle/product/admin/sales/bdump/sales2_lms7_49453031.trc:
ORA-29740: evicted by member , group incarnation
The main processes handling communication between RAC instances are LMON, LMD, and LMS. Normally, after a message is sent to another instance, the sender expects the receiver to reply with an acknowledgment. If that acknowledgment does not arrive within the specified time (300 seconds by default), the sender concludes that the message never reached the receiver, and an "IPC Send timeout" is raised.
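To make the send-and-acknowledge pattern concrete, here is a minimal, purely illustrative Python sketch. The 300-second constant mirrors the default mentioned above; everything else (the toy UDP exchange, the b"ACK" reply) is an assumption for illustration, not Oracle's actual IPC implementation.

import socket

SEND_TIMEOUT = 300  # seconds; mirrors the default acknowledgment window described above

def send_with_ack(peer, payload):
    """Send one message and wait for the receiver's acknowledgment.

    Illustrative only: Oracle's LMON/LMD/LMS processes use their own IPC
    layer over the private interconnect, not this toy UDP protocol.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(SEND_TIMEOUT)
    try:
        sock.sendto(payload, peer)
        ack, _ = sock.recvfrom(4096)   # block until the receiver answers
        if ack != b"ACK":
            raise RuntimeError("unexpected reply from receiver")
    except socket.timeout:
        # No acknowledgment within the window: this is the condition the
        # alert log reports as "IPC Send timeout".
        raise TimeoutError("no ack within %d seconds" % SEND_TIMEOUT)
    finally:
        sock.close()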
This problem usually comes down to one of the following:
1. Network problems causing packet loss or communication failures.
2. Host resource (CPU or memory) shortages that keep these processes from being scheduled or leave them unresponsive.
3. Oracle bugs.
4. On AIX, the APAR IZ97457 fix has not been applied.
Example: "IPC Send timeout" caused by a network problem
Instance 1's alert log shows that the receiver is process 49715160 on node 2:
Thu Jul 02 05:24:50 2012
IPC Send timeout detected. Sender: ospid 6143755 <== sender
Receiver: inst 2 binc 1323620776 ospid 49715160 <== receiver
The OSWatcher vmstat output for node 2 at that time shows no CPU or memory pressure, but the OSWatcher netstat output shows a large volume of packets passing through the private-interconnect NIC in the minutes before the problem:
Node 2:
zzz Thu Jul 02 05:12:38 CDT 2012
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en1 1500 10.182.3 10.182.3.2 4073847798 0 512851119 0 0 <== 4073847798 - 4073692530 = 155268 packets / 30 seconds
zzz Thu Jul 02 05:13:08 CDT 2012
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en1 1500 10.182.3 10.182.3.2 4074082951 0 513107924 0 0 <== 4074082951 - 4073847798 = 235153 packets / 30 seconds
Node 1:
zzz Thu Jul 02 05:12:54 CDT 2012
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en1 1500 10.182.3 10.182.3.1 502159550 0 4079190700 0 0 <== 502159550 - 501938658 = 220892 packets / 30 seconds
zzz Thu Jul 02 05:13:25 CDT 2012
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en1 1500 10.182.3 10.182.3.1 502321317 0 4079342048 0 0 <== 502321317 - 502159550 = 161767 packets / 30 seconds
When this system is operating normally, it transfers only a few thousand packets every 30 seconds:
zzz Thu Jul 02 04:14:09 CDT 2012
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en1 1500 10.182.3 10.182.3.2 4074126796 0 513149195 0 0 <== 4074126796 - 4074122374 = 4422 packets / 30 seconds
A sudden burst of traffic like this can disrupt network transmission, and lost UDP or IP packets can cause the same error. In this situation, engage the network administrators to check the network. In some cases the problem stopped recurring after the private-network switch was rebooted or replaced. (Note that normal traffic volume varies with the hardware and the workload.)
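Doing these subtractions by hand gets tedious for a long archive; the short Python sketch below automates them. It assumes the OSWatcher netstat layout shown above ("zzz <timestamp>" separator lines followed by netstat -i style rows) and the column order Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll; adjust the indexes for other platforms.

import sys

def packet_deltas(path, ifname="en1"):
    """Print per-interval packet deltas for one interface from an
    OSWatcher netstat archive (format as in the samples above)."""
    samples = []  # (timestamp, ipkts, opkts)
    ts = None
    with open(path) as f:
        for line in f:
            if line.startswith("zzz"):
                ts = line[3:].strip()          # snapshot timestamp
            elif line.startswith(ifname) and ts:
                cols = line.split()
                # Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
                samples.append((ts, int(cols[4]), int(cols[6])))
    for (_, i0, o0), (t1, i1, o1) in zip(samples, samples[1:]):
        print("%s: +%d in, +%d out" % (t1, i1 - i0, o1 - o0))

if __name__ == "__main__":
    packet_deltas(sys.argv[1])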
Example: "IPC Send timeout" caused by high CPU load
Instance 1's alert log shows that the receiver is process 1596935 on node 2:
Fri Aug 01 02:04:29 2008
IPC Send timeout detected. Sender: ospid 1506825 <== sender
Receiver: inst 2 binc -298848812 ospid 1596935 <== receiver
The OSWatcher vmstat output for node 2 at that time:
zzz ***Fri Aug 01 02:01:51 CST 2008
System Configuration: lcpu=32 mem=128000MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
25 1 7532667 19073986 0 0 0 0 5 0 9328 88121 20430 32 10 47 11
58 0 7541201 19065392 0 0 0 0 0 0 11307 177425 10440 87 13 0 0 <== idle is 0: the CPUs are 100% busy
61 1 7552592 19053910 0 0 0 0 0 0 11122 206738 10970 85 15 0 0
zzz ***Fri Aug 01 02:03:52 CST 2008
System Configuration: lcpu=32 mem=128000MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
25 1 7733673 18878037 0 0 0 0 5 0 9328 88123 20429 32 10 47 11
81 0 7737034 18874601 0 0 0 0 0 0 9081 209529 14509 87 13 0 0 <== the run queue (r=81) is very high
80 0 7736142 18875418 0 0 0 0 0 0 9765 156708 14997 91 9 0 0 <== idle is 0: the CPUs are 100% busy
This example shows that when the host's CPU load is very high, the receiving process cannot respond to the sender in time, which triggers the "IPC Send timeout".
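A quick way to scan a long OSWatcher vmstat archive for moments like this is to flag samples where idle CPU hits 0 or the run queue exceeds the logical CPU count. The sketch below assumes the AIX vmstat column layout shown above (17 columns: r b avm fre re pi po fr sr cy in sy cs us sy id wa) and lcpu=32 from the "System Configuration" line; both are assumptions to adjust for other hosts.

def flag_cpu_saturation(path, lcpu=32):
    """Print vmstat samples that look CPU-saturated: idle at 0, or a
    run queue ("r") larger than the number of logical CPUs."""
    with open(path) as f:
        for line in f:
            cols = line.split()
            # Data rows have exactly 17 numeric columns; header,
            # separator, and "zzz" timestamp lines are skipped.
            if len(cols) != 17 or not cols[0].isdigit():
                continue
            runq, idle = int(cols[0]), int(cols[15])
            if idle == 0 or runq > lcpu:
                print("saturated (r=%d, id=%d%%): %s" % (runq, idle, line.rstrip()))

if __name__ == "__main__":
    import sys
    flag_cpu_saturation(sys.argv[1])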
Common bugs behind IPC Send timeout
On 10g, the common culprits are Bug 5190596 and Bug 6200820. Both mostly show up on 10.2.0.3 and 10.2.0.4 and are fixed in 10.2.0.5; see these MOS notes for details:
LMON dumps LMS0 too often during DRM leading to IPC send timeout [ID 5190596.8]
'IPC Send Timeout Detected' errors between QMON Processes after RAC reconfiguration [ID 458912.1]
On 11g, the common bugs are Bug 6200820 and Bug 7653579; see these MOS notes for details:
Bug 6200820 AQ node affinity not reconfigured after RAC reconfiguration (QMNC timeouts)
Bug 7653579 - IPC send timeout in RAC after only short period [ID 7653579.8]
IPC Send timeout caused by a missing APAR IZ97457 fix on AIX
Regarding this, the MOS note
AIX VIO: Block Lost or IPC Send Timeout Possible Without Fix of APAR IZ97457 [ID 1305174.1]
gives the following description:
Applies to:
Oracle Server - Enterprise Edition - Version 9.2.0.2 and later
IBM AIX on POWER Systems (64-bit)
Symptoms
Environment with IBM AIX VIO experiences one or some or all of the following symptoms:
Packet Loss
Cache Fusion "block lost"
IPC Send timeout
Instance Eviction
SKGXPSEGRCV: MESSAGE TRUNCATED user data nnnn bytes payload nnnn bytes
Cause
AIX issue APAR IZ97457 - A VIOS Server will not forward traffic from its VIO Clients to the external network
Solution
Please engage your OS vendor for fix.
Oracle's recommendation is to apply the fix. The APAR IZ97457 is described as follows:
Error description
A VIOS Server will not forward traffic from its VIO Clients to the external network.
Packets from the VIO Client travel to the hypervisor (phype) but the packets are dropped by the hypervisor as it attempts to deliver the packet to the VIO Server's trunk adapter.
The hypervisor will have dropped the packets because there are no buffers to place the data in. On the VIO Server, interrupts are not activating the trunk adapter to read and remove data from its buffers. This results in having full buffers at the trunk adapter.
Since the trunk adapter's buffers are full, phype cannot deliver the data and so VIO Clients cannot get packets through the SEA adapter and out to the network.
The problem was discovered on P7 systems where Vlans on the SEA are used.
"Hypervisor Receive" errors on the trunk adapter will increase as this problem occurs and the VIO Clients are not able to reach the outside network.
Problem summary
Unresponsive VIO Clients with traffic not forwarded to external network.
Problem conclusion
Ensure proper locking around receive scheduling operations.
As this shows, the IZ97457 fix addresses the case where the network buffer pool fills up. AIX users are advised to check whether this APAR has been applied.
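On AIX, the instfix command reports whether a given APAR is installed, e.g. instfix -ik IZ97457. The small Python wrapper below automates the check; it is a hypothetical helper that assumes instfix exits nonzero when the fix's filesets are missing, which matches its documented behavior.

import subprocess

def apar_installed(apar="IZ97457"):
    """Return True if AIX reports the given APAR's filesets installed.

    Runs the standard AIX command `instfix -ik <APAR>`; the assumption
    here is that a zero exit status means all filesets for the fix
    were found.
    """
    result = subprocess.run(["instfix", "-ik", apar],
                            capture_output=True, text=True)
    return result.returncode == 0

if __name__ == "__main__":
    print("IZ97457 installed:", apar_installed("IZ97457"))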