In-Depth Analysis of AIX Dump Files from an IBM P55A Host Crash: A Detailed Walkthrough

Published by: 风哥 Category: ITPUX技术网 Updated: 2022-02-12

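A customer's P550 crashed every Saturday night after 22:00, and the cause needed to be identified:
1. Check the logs. Every crash had produced a system dump: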
p550a:/#sysdumpdev -L
0453-039
Device name: /dev/hd6
Major device number: 10
Minor device number: 2
Size: 93209600 bytes
Uncompressed Size: 835558845 bytes
Date/Time: Thu Jul 22 01:17:44 BEIST 2010
Dump status: 0
dump completed successfully
Dump copy filename: /var/adm/ras/vmcore.1.Z
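
The report above is a set of "key: value" lines, so the points that matter before opening the dump (status, copy location, size) can be pulled out mechanically. The following is only a sketch: the sample text is pasted from the output above, and the helper names `parse_sysdumpdev` and `dump_summary` are made up for illustration.

```python
# Sample text copied from the `sysdumpdev -L` output shown above.
SAMPLE = """\
Device name: /dev/hd6
Major device number: 10
Minor device number: 2
Size: 93209600 bytes
Uncompressed Size: 835558845 bytes
Date/Time: Thu Jul 22 01:17:44 BEIST 2010
Dump status: 0
dump completed successfully
Dump copy filename: /var/adm/ras/vmcore.1.Z
"""

def parse_sysdumpdev(text):
    """Return the 'key: value' lines of the report as a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # lines without ': ' (e.g. the status message) are skipped
            info[key.strip()] = value.strip()
    return info

def dump_summary(text):
    """Condense the report into the values needed to locate and size the dump."""
    info = parse_sysdumpdev(text)
    size = int(info["Size"].split()[0])
    raw = int(info["Uncompressed Size"].split()[0])
    return {
        "ok": info["Dump status"] == "0",   # the report pairs status 0 with success
        "copy": info["Dump copy filename"],
        "size_mib": size / 2**20,
        "raw_mib": raw / 2**20,
        "ratio": raw / size,                # compression ratio of the dump
    }

summary = dump_summary(SAMPLE)
print(summary)
```

For this report the dump is about 89 MiB compressed and about 797 MiB uncompressed, so /tmp/ibmsupt (or wherever the copy is unpacked) needs room for the larger figure.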
2. Analyze the dump files
p550a:/tmp/ibmsupt#kdb vmcore.0 /unix
The specified kernel file is a 64-bit kernel
vmcore.0 mapped from @ 700000000000000 to @ 70000003401245f
Preserving 1317350 bytes of symbol table
First symbol __mulh
Component Names:
1) minidump [2 entries]
2) dmp_minimal [9 entries]
3) proc [345 entries]
4) thrd [2387 entries]
5) rasct [1 entries]
6) ldr [2 entries]
7) errlg [3 entries]
8) mtrc [26 entries]
9) lfs [2 entries]
10) bos [2 entries]
11) ipc [7 entries]
12) vmm [13 entries]
13) alloc_kheap [512 entries]
14) alloc_other [21 entries]
15) rtastrc [8 entries]
16) sisraid [4 entries]
17) aixpcm [9 entries]
18) scdisk [19 entries]
19) lvm [2 entries]
20) jfs2 [1 entries]
21) tty [4 entries]
22) netstat [10 entries]
23) goent_dd [7 entries]
24) dump_failures [1 entries]
25) dump_statistics [1 entries]
Component Dump Table has 3398 entries
START END
0000000000001000 0000000003BBA050 start+000FD8
F00000002FF47600 F00000002FFDC920 __ublock+000000
000000002FF22FF4 000000002FF22FF8 environ+000000
000000002FF22FF8 000000002FF22FFC errno+000000
F100070F00000000 F100070F10000000 pvproc+000000
F100070F10000000 F100070F18000000 pvthread+000000
PFT:
PVT:
id....................0002
raddr.....0000000000686000 eaddr.....F200800030000000
size..............00040000 align.............00001000
valid..1 ros....0 fixlmb.1 seg....0 wimg...2
[kdb_read_mem] no real storage @ F1000000107145D8
Dump analysis on CHRP_SMP_PCI POWER_PC POWER_5 machine with 8 available CPU(s) (64-bit registers)
Processing symbol table...
.......................done
[kdb_read_mem] no real storage @ F1000000106C145B
ERROR: Unable to acess nfs_syms
Unable to initialize module: /usr/lib/ras/autoload/nfs64.kdb
(2)> stat
SYSTEM_CONFIGURATION:
CHRP_SMP_PCI POWER_PC POWER_5 machine with 8 available CPU(s) (64-bit registers)
SYSTEM STATUS:
sysname... AIX
nodename.. p550a
release... 3
version... 5
build date Jan 10 2006
build time 10:56:32
label..... 0602A_53E
machine... 000B27ACD600
nid....... 0B27ACD6
time of crash: Thu Jul 8 00:53:14 2010
age of system: 95 day, 19 hr., 57 min., 14 sec.
xmalloc debug: disabled
CRASH INFORMATION:
CPU 2 CSA 018BDE00 at time of crash, error code for LEDs: 30000000
pvthread+000C00 STACK:
[00075FEC]v_delpft+000108 (F200800020000008 [??])
[0010AA88]v_relframe+000464 (??, ??, ??)
[001027E4]v_pageout+0006D0 (??, ??, ??)
[00141A20]v_steal+00043C (??, ??, ??, ??)
[00144EF4]v_fblru_scan+0003B8 (??)
[001403D4]v_lru+00035C (??)
[001414D0]v_memp_lru+00023C (??)
[00207FEC]v_prememp_lru+000020 (??)
[002A2474].backt+000080 ()
____ Exception (F000000030017780) ____
iar : 00000000002A23F4 msr : 80000000000010B2 cr : 42000024
lr : 00000000001408D4 ctr : 0000000000000000 xer : 00000000
mq : 00000000 asr : 00000000F372A001
r0 : 0000000000207FCC r1 : 0FFFFFFFF4017E90 r2 : 0000000001491C28
r3 : 0000000000000000 r4 : F10001002CBA1100 r5 : 0000000003B90000
r6 : 0000000000000000 r7 : 0000000000000000 r8 : 0000000000000106
r9 : 0000000000000000 r10 : 00000000001408D4 r11 : F000000030017780
r12 : 80000000000010B2 r13 : F10001002CB82400 r14 : 00000000DEADBEEF
r15 : 000000000101A9C0 r16 : 00000000DEADBEEF r17 : 00000000DEADBEEF
r18 : 00000000DEADBEEF r19 : 00000000DEADBEEF r20 : 00000000DEADBEEF
r21 : 00000000DEADBEEF r22 : 00000000DEADBEEF r23 : 00000000DEADBEEF
r24 : 00000000DEADBEEF r25 : 00000000DEADBEEF r26 : 00000000DEADBEEF
r27 : 00000000DEADBEEF r28 : 00000000DEADBEEF r29 : 00000000DEADBEEF
r30 : 0000000003B90000 r31 : 0000000000000000

prev 0000000000000000 stackfix 0000000000000000 int_ticks 00
kjmpbuf 0000000000000000 excbranch 0000000000000000 no_pfault 00
intpri 0B backt 00 flags 00
fpscr 0000000000000000 fpscrx 00000000 fpowner 00
fpeu 00 fpinfo 00 alloc F000
o_iar 0000000000000000 o_toc 0000000000000000
o_arg1 0000000000000000 o_vaddr 0000000000000000
krlockp 0000000000000000
Except :
csr 0000000000000000 dsisr 0000000000000000
esid 0000000000000000 dar 0000000000000000 dsirr 0000000000000106
[002A23F4].backt+000000 ()
[kdb_get_memory] no real storage @ FFFFFFFF4017EA0
(2)>
(2)> status
CPU TID TSLOT PID PSLOT PROC_NAME
0 2005 2 2004 2 wait
1 12025 18 D01A 13 wait
2 C019 12 4008 4 lrud
3 1502B 21 F01E 15 wait
4 12D 32768 120 16384 wait
5 3133 32771 3126 16387 wait
6 E01D 14 600C 6 psmd
7 5137 32773 512A 16389 wait
8-63 Disabled
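Analysis: at the time of the crash, CPU 2 (the CPU that crashed) was running lrud, the VMM page-replacement daemon, and its stack (v_lru, v_steal, v_pageout, v_relframe, v_delpft) shows it was in the middle of stealing and paging out pages. psmd was running on CPU 6, and every other CPU was idle in wait.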
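
Picking the non-idle CPUs out of the `status` table is easy to script when several dumps have to be compared. A minimal sketch, assuming the six-column layout shown above; `busy_cpus` is a hypothetical helper name and the text is pasted from this session.

```python
# `status` output pasted from the vmcore.0 session above.
STATUS_VMCORE0 = """\
CPU TID TSLOT PID PSLOT PROC_NAME
0 2005 2 2004 2 wait
1 12025 18 D01A 13 wait
2 C019 12 4008 4 lrud
3 1502B 21 F01E 15 wait
4 12D 32768 120 16384 wait
5 3133 32771 3126 16387 wait
6 E01D 14 600C 6 psmd
7 5137 32773 512A 16389 wait
8-63 Disabled
"""

def busy_cpus(status_text):
    """Map CPU number -> (process name, thread slot) for CPUs not running wait."""
    busy = {}
    for line in status_text.splitlines()[1:]:   # skip the header row
        fields = line.split()
        if len(fields) != 6:
            continue                            # e.g. the "8-63 Disabled" line
        cpu, _tid, tslot, _pid, _pslot, name = fields
        if name != "wait":
            busy[int(cpu)] = (name, int(tslot))
    return busy

print(busy_cpus(STATUS_VMCORE0))
```

The thread slot is kept because it is the argument passed to the `f` subcommand in this session (for example `f 14` for psmd).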

(2)> f 2
pvthread+000200 STACK:
Use current context [F00000002FF47600] of cpu 0
[002934FC].h_cede+000014 ()
[00049BE0]waitproc+00047C ()
[0013B0F4]procentry+000010 (??, ??, ??, ??)
(2)> f 14
pvthread+000E00 STACK:
Use current context [01941E00] of cpu 6
[002B0528]slock+0001D8 (0000000003B90010, 0000000001941B70 [??])
[00009558].simple_lock+000058 ()
[0020D6EC]v_prelru_remlist+000060 (??, ??, ??, ??)
[002A25FC]begfst+0000A0 ()
____ Exception (F000000030017780) ____
iar : 0000000000065090 msr : 8000000000009032 cr : 22022024
lr : 00000000000BBBD0 ctr : 00000000000D0508 xer : 20000000
mq : 00000000 asr : 00000001EF72A001
r0 : 0000000022022024 r1 : 0FFFFFFFF4017BE0 r2 : 0000000001491C28
r3 : 0000000000065090 r4 : 8000000000001032 r5 : 0000000022022024
r6 : 0FFFFFFFF4017C70 r7 : 0000000000000000 r8 : 00000000010209C0
r9 : 0000000000000001 r10 : 0000000000000000 r11 : 0000000000000000
r12 : 0000000000297EC8 r13 : F10001002CB82C00 r14 : 0000000001000085
r15 : F1000100100B0600 r16 : 00000000510000F0 r17 : 0000000000000003
r18 : 0000000000000001 r19 : 0000000000000000 r20 : F10001002CB82D78
r21 : 00000000FFFEFBFF r22 : 0000000000000000 r23 : F00000002FF47600
r24 : 0000000000000000 r25 : 0000000000000000 r26 : F100070F00001800
r27 : F100070F10000E00 r28 : F100010010088000 r29 : F10001002CB82C00
r30 : 0000000000000001 r31 : 0000000000000001
prev 0000000000000000 stackfix 0000000000000000 int_ticks 00
kjmpbuf 0000000000000000 excbranch 0000000000000000 no_pfault 00
intpri 0B backt 00 flags 00
fpscr 0000000000000000 fpscrx 00000000 fpowner 00
fpeu 00 fpinfo 00 alloc F000
o_iar 0000000000000000 o_toc 0000000000000000
o_arg1 0000000000000000 o_vaddr 0000000000000000
krlockp 0000000000000000
Except :
csr 0000000000000000 dsisr 0000000040000000 bit set: DSISR_PFT
esid 0000000009101400 dar F1000110151EB000 dsirr 0000000000000106
[00065090]et_wait+000344 (0000000000065090, 8000000000001032, 0000000022022024 [??])
[000CF134]vm_psmd_flush_pending+0000D8 (??, ??, ??)
[000CE78C]vm_psmd_demote+0002BC (??, ??, ??, ??)
[000CFEA8]psmd_kthread+000098 (??)
[0013DAC0]threadentry+000014 (??, ??, ??, ??)

(2)> symptom
[kdb_get_memory] no real storage @ FFFFFFFF4017EA0
PIDS/5765G0300 LVLS/530 PCSS/SPI1 MS/300 FLDS/v_delpft VALU/7c20f800 FLDS/v_relfram VALU/464
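
The symptom string is a list of KEY/value tokens, with each FLDS entry followed by a VALU entry. The sketch below only splits the string and pairs those entries; it does not attempt to interpret the keys, and the helper names are invented for this example.

```python
# Symptom string pasted from the `symptom` output above.
SYMPTOM = ("PIDS/5765G0300 LVLS/530 PCSS/SPI1 MS/300 "
           "FLDS/v_delpft VALU/7c20f800 FLDS/v_relfram VALU/464")

def parse_symptom(symptom):
    """Split the string into (key, value) tuples, keeping order and repeats."""
    pairs = []
    for token in symptom.split():
        key, _, value = token.partition("/")
        pairs.append((key, value))
    return pairs

def field_values(pairs):
    """Pair each FLDS entry with the VALU entry that immediately follows it."""
    out = []
    for (k1, v1), (k2, v2) in zip(pairs, pairs[1:]):
        if k1 == "FLDS" and k2 == "VALU":
            out.append((v1, v2))
    return out

print(field_values(parse_symptom(SYMPTOM)))
```

The second pair lines up with the crash stack: the value 464 is the offset printed in the frame `v_relframe+000464`, with the function name truncated in the symptom string.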

p550a:/tmp/ibmsupt#kdb vmcore.1 /unix
The specified kernel file is a 64-bit kernel
vmcore.1 mapped from @ 700000000000000 to @ 700000031cd9029
Preserving 1317350 bytes of symbol table
First symbol __mulh
Component Names:
1) minidump [2 entries]
2) dmp_minimal [9 entries]
3) proc [315 entries]
4) thrd [2310 entries]
5) rasct [1 entries]
6) ldr [2 entries]
7) errlg [3 entries]
8) mtrc [26 entries]
9) lfs [2 entries]
10) bos [3 entries]
11) ipc [7 entries]
12) vmm [13 entries]
13) alloc_kheap [512 entries]
14) alloc_other [21 entries]
15) rtastrc [8 entries]
16) eidedd [1 entries]
17) sisraid [4 entries]
18) aixpcm [9 entries]
19) scdisk [19 entries]
20) lvm [2 entries]
21) jfs2 [1 entries]
22) tty [4 entries]
23) netstat [10 entries]
24) goent_dd [7 entries]
25) dump_statistics [1 entries]
Component Dump Table has 3292 entries
START END
0000000000001000 0000000003BBA050 start+000FD8
F00000002FF47600 F00000002FFDC920 __ublock+000000
000000002FF22FF4 000000002FF22FF8 environ+000000
000000002FF22FF8 000000002FF22FFC errno+000000
F100070F00000000 F100070F10000000 pvproc+000000
F100070F10000000 F100070F18000000 pvthread+000000
PFT:
PVT:
id....................0002
raddr.....0000000000686000 eaddr.....F200800030000000
size..............00040000 align.............00001000
valid..1 ros....0 fixlmb.1 seg....0 wimg...2
Dump analysis on CHRP_SMP_PCI POWER_PC POWER_5 machine with 8 available CPU(s) (64-bit registers)
Processing symbol table...
.......................done
(4)> stat
SYSTEM_CONFIGURATION:
CHRP_SMP_PCI POWER_PC POWER_5 machine with 8 available CPU(s) (64-bit registers)
SYSTEM STATUS:
sysname... AIX
nodename.. p550a
release... 3
version... 5
build date Jan 10 2006
build time 10:56:32
label..... 0602A_53E
machine... 000B27ACD600
nid....... 0B27ACD6
time of crash: Thu Jul 22 01:17:43 2010
age of system: 14 day, 21 min., 51 sec.
xmalloc debug: disabled
CRASH INFORMATION:
CPU 4 CSA 018FFE00 at time of crash, error code for LEDs: 30000000
pvthread+000C00 STACK:
[00075FEC]v_delpft+000108 (F200800020000008 [??])
[0010AA88]v_relframe+000464 (??, ??, ??)
[001027E4]v_pageout+0006D0 (??, ??, ??)
[00141A20]v_steal+00043C (??, ??, ??, ??)
[00144EF4]v_fblru_scan+0003B8 (??)
[001403D4]v_lru+00035C (??)
[001414D0]v_memp_lru+00023C (??)
[00207FEC]v_prememp_lru+000020 (??)
[002A2474].backt+000080 ()
____ Exception (F000000030017780) ____
iar : 00000000002A23F4 msr : 80000000000010B2 cr : 42000024
lr : 00000000001408D4 ctr : 0000000000000000 xer : 00000000
mq : 00000000 asr : 00000000F376A001
r0 : 0000000000207FCC r1 : 0FFFFFFFF4017E90 r2 : 0000000001491C28
r3 : 0000000000000000 r4 : F10001002CBA1100 r5 : 0000000003B90000
r6 : 0000000000000000 r7 : 0000000000000000 r8 : 0000000000000106
r9 : 0000000000000000 r10 : 00000000001408D4 r11 : F000000030017780
r12 : 80000000000010B2 r13 : F10001002CB82400 r14 : 00000000DEADBEEF
r15 : 000000000101A9C0 r16 : 00000000DEADBEEF r17 : 00000000DEADBEEF
r18 : 00000000DEADBEEF r19 : 00000000DEADBEEF r20 : 00000000DEADBEEF
r21 : 00000000DEADBEEF r22 : 00000000DEADBEEF r23 : 00000000DEADBEEF
r24 : 00000000DEADBEEF r25 : 00000000DEADBEEF r26 : 00000000DEADBEEF
r27 : 00000000DEADBEEF r28 : 00000000DEADBEEF r29 : 00000000DEADBEEF
r30 : 0000000003B90000 r31 : 0000000000000000

prev 0000000000000000 stackfix 0000000000000000 int_ticks 00
kjmpbuf 0000000000000000 excbranch 0000000000000000 no_pfault 00
intpri 0B backt 00 flags 00
fpscr 0000000000000000 fpscrx 00000000 fpowner 00
fpeu 00 fpinfo 00 alloc F000
o_iar 0000000000000000 o_toc 0000000000000000
o_arg1 0000000000000000 o_vaddr 0000000000000000
krlockp 0000000000000000
Except :
csr 0000000000000000 dsisr 0000000000000000
esid 0000000000000000 dar 0000000000000000 dsirr 0000000000000106
[002A23F4].backt+000000 ()
[kdb_get_memory] no real storage @ FFFFFFFF4017EA0
(4)> status
CPU TID TSLOT PID PSLOT PROC_NAME
0 E01D 14 600C 6 psmd
1 12025 18 D01A 13 wait
2 12809F 296 370C4 55 bpbkar
3 1502B 21 F01E 15 wait
4 C019 12 4008 4 lrud
5 3133 32771 3126 16387 wait
6 4135 32772 4128 16388 wait
7 5137 32773 512A 16389 wait
8-63 Disabled
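Analysis: in this dump the crashing CPU is CPU 4, again running lrud with the same page-replacement stack, while CPU 2 was running bpbkar, which is a NetBackup (NBU) process.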
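
Whether two dumps show the same failure can be checked by comparing the function and offset of every frame in the crash stack. The following sketch does that for the two stacks printed by `stat` above; the regular expression is an assumption based on the frame format visible in this session, not a documented kdb format.

```python
import re

# Crash stacks pasted from the `stat` output of vmcore.0 and vmcore.1.
STACK_VMCORE0 = """\
[00075FEC]v_delpft+000108 (F200800020000008 [??])
[0010AA88]v_relframe+000464 (??, ??, ??)
[001027E4]v_pageout+0006D0 (??, ??, ??)
[00141A20]v_steal+00043C (??, ??, ??, ??)
[00144EF4]v_fblru_scan+0003B8 (??)
[001403D4]v_lru+00035C (??)
[001414D0]v_memp_lru+00023C (??)
[00207FEC]v_prememp_lru+000020 (??)
[002A2474].backt+000080 ()
"""
STACK_VMCORE1 = """\
[00075FEC]v_delpft+000108 (F200800020000008 [??])
[0010AA88]v_relframe+000464 (??, ??, ??)
[001027E4]v_pageout+0006D0 (??, ??, ??)
[00141A20]v_steal+00043C (??, ??, ??, ??)
[00144EF4]v_fblru_scan+0003B8 (??)
[001403D4]v_lru+00035C (??)
[001414D0]v_memp_lru+00023C (??)
[00207FEC]v_prememp_lru+000020 (??)
[002A2474].backt+000080 ()
"""

# "[address]function+offset", where the function may start with a dot.
FRAME = re.compile(r"^\[([0-9A-F]{8})\](\.?\w+)\+([0-9A-F]+)")

def frames(stack_text):
    """Return [(function, offset), ...] from innermost to outermost frame."""
    result = []
    for line in stack_text.splitlines():
        match = FRAME.match(line.strip())
        if match:
            result.append((match.group(2), int(match.group(3), 16)))
    return result

same_failure = frames(STACK_VMCORE0) == frames(STACK_VMCORE1)
print("identical crash stack:", same_failure)
print(" <- ".join(name for name, _ in frames(STACK_VMCORE0)))
```

Both dumps fail at `v_delpft+000108` with an identical call chain, which suggests a single recurring failure in the page-replacement path rather than two unrelated crashes.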

(4)> f 12
pvthread+000C00 STACK:
[002A23F4].backt+000000 ()
[kdb_get_memory] no real storage @ FFFFFFFF4017EA0
(4)> f 14
pvthread+000E00 STACK:
Use current context [0187BE00] of cpu 0
[002B0794]slock+000444 (00000000000034E0, F100070F10000C00 [??])
[00009558].simple_lock+000058 ()
[0020D7C4]v_prelru_addlist+000060 (??, ??, ??, ??)
[002A25FC]begfst+0000A0 ()
____ Exception (F000000030017780) ____
iar : 00000000000093FC msr : 8000000000009032 cr : 82008042
lr : 00000000000BB908 ctr : 00000000000CD990 xer : 20000000
mq : 00000000 asr : 00000000FC65A001
r0 : 0000000082008042 r1 : 0FFFFFFFF4017BF0 r2 : 0000000001491C28
r3 : 00000000000093FC r4 : 8000000000001032 r5 : 0000000082008042
r6 : 0FFFFFFFF4017B40 r7 : 0000000000000000 r8 : 0000000000000000
r9 : 000000000101A9C0 r10 : 000000000000E01D r11 : 000000000101A9C0
r12 : 0000000000297EBC r13 : F10001002CB82C00 r14 : 00000000DEADBEEF
r15 : 00000000DEADBEEF r16 : 00000000DEADBEEF r17 : 0000000000000010
r18 : 000000000000FFFF r19 : 00000000000669D0 r20 : 0000000000000000
r21 : 00000000000003C0 r22 : 0000000003B90000 r23 : 000000000109E6C8
r24 : 000000000109E848 r25 : 0000000000000000 r26 : 0000000000000001
r27 : 0000000000000001 r28 : 0000000000000013 r29 : 0000000000000000
r30 : 0000000000000000 r31 : 000000000000000B
prev 0000000000000000 stackfix 0000000000000000 int_ticks 00
kjmpbuf 0000000000000000 excbranch 0000000000000000 no_pfault 00
intpri 0B backt 00 flags 00
fpscr 0000000000000000 fpscrx 00000000 fpowner 00
fpeu 00 fpinfo 00 alloc F000
o_iar 0000000000000000 o_toc 0000000000000000
o_arg1 0000000000000000 o_vaddr 0000000000000000
krlockp 0000000000000000
Except :
csr 0000000000000000 dsisr 0000000040000000 bit set: DSISR_PFT
esid 0000000019003400 dar F100010030C2C000 dsirr 0000000000000106
[000093FC].unlock_enable_mem+0000F0 ()
[000BB904]vm_lru_addlist_87_23+0000B4 (??, ??, ??, ??, ??, ??)
[000CF1D8]vm_psmd_flush_pending+00017C (??, ??, ??)
[000CFAA0]vm_psmd_promote+0001E8 (??, ??, ??, ??)
[000CFEB4]psmd_kthread+0000A4 (??)
[0013DAC0]threadentry+000014 (??, ??, ??, ??)

(4)> f 296
pvthread+012800 STACK:
Use current context [F00000002FF47600] of cpu 2
WARNING: bad IAR: 1001E3E0, display stack from LR: 1001F6CC

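From the analysis above, the P550 was busy with page replacement and an NBU backup job each time it crashed. A check of the backup server showed a backup policy scheduled for exactly 22:00 every Saturday night.
3. Check the memory settings: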
p550a:/tmp/ibmsupt#vmo -a |grep perm
maxperm = 1562847
maxperm% = 80
minperm = 390711
minperm% = 20
strict_maxperm = 0
p550a:/tmp/ibmsupt#vmo -a |grep client
maxclient% = 80
strict_maxclient = 1
p550a:/tmp/ibmsupt#vmo -a |grep lru_file_repage
lru_file_repage = 1
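
The page counts and percentages reported by `vmo` can be cross-checked with a little arithmetic. This sketch assumes a 4 KB page size and that maxperm and minperm are the page counts corresponding to maxperm% and minperm% of the same memory base; both assumptions come from general AIX behaviour, not from this session.

```python
PAGE_SIZE = 4096  # assumption: 4 KB pages

# Values copied from the `vmo -a` output above.
vmo = {"maxperm": 1562847, "maxperm%": 80, "minperm": 390711, "minperm%": 20}

def implied_total_pages(pages, percent):
    """Total pages implied by a page count that represents `percent` of the base."""
    return pages * 100 / percent

def to_gib(pages):
    """Convert a page count to GiB."""
    return pages * PAGE_SIZE / 2**30

total_from_max = implied_total_pages(vmo["maxperm"], vmo["maxperm%"])
total_from_min = implied_total_pages(vmo["minperm"], vmo["minperm%"])

print(f"memory base implied by maxperm: {to_gib(total_from_max):.2f} GiB")
print(f"memory base implied by minperm: {to_gib(total_from_min):.2f} GiB")
print(f"file cache ceiling (maxperm):   {to_gib(vmo['maxperm']):.2f} GiB")
```

Both settings imply the same base of roughly 7.45 GiB, so with maxperm% = 80 file pages are allowed to occupy close to 6 GiB of it, which is the room a large backup read can fill.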

vmstat 2 5
kthr     memory              page              faults        cpu
----- ------------- ----------------------- ------------ -----------
 r  b    avm    fre  re  pi  po  fr  sr  cy  in   sy  cs us sy id wa
 0  0 882318  24052   0   0   0   0   0   0  21 1730 124  0  0 99  0
 0  0 882319  24051   0   0   0   0   0   0  16 1959 139  0  0 99  0
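This is the memory usage at ordinary times, with no backup running. Paging space usage is already fairly high and fre is low (about 24,000 4 KB pages, roughly 94 MB), so the following parameters were set: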
vmo -p -o minperm%=5
vmo -p -o maxclient%=20
vmo -p -o maxperm%=20
vmo -p -o lru_file_repage=0
Check memory again:
vmstat 2 5
kthr     memory              page              faults        cpu
----- ------------- ----------------------- ------------ -----------
 r  b    avm    fre  re  pi  po  fr  sr  cy  in   sy  cs us sy id wa
 0  1 882332 767703   0   0   0   0   0   0  22 1537 456  0  0 99  1
 1  0 882333 767698   0   0   0   0   0   0  12 1334  99  0  0 99  0
 0  0 882333 767698   0   0   0   0   0   0  14 1474 126  0  0 99  0
Free memory recovered almost immediately: fre rose from about 24,000 pages to about 767,000 pages.
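
To put the change in more familiar units, the avm and fre columns of the two vmstat samples can be averaged and converted. A rough sketch, assuming vmstat reports both columns in 4 KB pages; the column positions are taken from the header shown above and the data rows are pasted from the two runs.

```python
PAGE_KB = 4            # assumption: avm and fre are counted in 4 KB pages
AVM_COL, FRE_COL = 2, 3  # positions in "r b avm fre re pi po ..."

BEFORE = """\
0 0 882318 24052 0 0 0 0 0 0 21 1730 124 0 0 99 0
0 0 882319 24051 0 0 0 0 0 0 16 1959 139 0 0 99 0
"""
AFTER = """\
0 1 882332 767703 0 0 0 0 0 0 22 1537 456 0 0 99 1
1 0 882333 767698 0 0 0 0 0 0 12 1334 99 0 0 99 0
0 0 882333 767698 0 0 0 0 0 0 14 1474 126 0 0 99 0
"""

def mean_column(rows_text, column):
    """Average one numeric column over all data rows."""
    values = [int(line.split()[column]) for line in rows_text.splitlines() if line.strip()]
    return sum(values) / len(values)

def pages_to_mib(pages):
    """Convert a page count to MiB."""
    return pages * PAGE_KB / 1024

fre_before = pages_to_mib(mean_column(BEFORE, FRE_COL))
fre_after = pages_to_mib(mean_column(AFTER, FRE_COL))
avm_delta = mean_column(AFTER, AVM_COL) - mean_column(BEFORE, AVM_COL)

print(f"fre before: {fre_before:.0f} MiB, after: {fre_after:.0f} MiB")
print(f"avm change: {avm_delta:.0f} pages")
```

Free memory goes from roughly 94 MiB to roughly 2999 MiB while avm barely moves, which is consistent with the released memory having been file cache rather than working storage.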
