Incident report: Oracle failure caused by the p570a host hanging

Published by: 风哥 | Category: ITPUX技术网 | Updated: 2022-02-12

Environment: Oracle 11.2.0.1 + RAC + AIX 6.1, hosting two databases

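1 Problem description
At around 15:00 on November 29, 2010, host p570a could no longer be reached over telnet and applications could not open new connections, which seriously affected the business. We arrived at the customer site at 16:00 and carried out emergency handling. This report summarizes the emergency handling and the problem analysis.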
2 Emergency handling
We logged in to host p570a through the HMC console. Every command we entered failed with an out-of-memory error, as shown below:

root@p570a:/> errpt|more
ksh: 0403-031 The fork function failed. There is not enough memory available.
ksh: 0403-031 The fork function failed. There is not enough memory available.
root@p570a:/> ps -ef | grep LOCAL=NO|wc -l
ksh: 0403-031 The fork function failed. There is not enough memory available.
root@p570a:/> ls
ksh: 0403-031 The fork function failed. There is not enough memory available.

After obtaining the customer's consent, we restarted host p570a from the HMC console.

3 P570a fault analysis

3.1 Operating system errpt
p570a@root#errpt|more
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
A6DF45AA   1129164210 I O RMCdaemon      The daemon is started.
EC0BCCD4   1129164110 T H ent1           ETHERNET DOWN
67145A39   1129163910 U S SYSDUMP        SYSTEM DUMP
F48137AC   1129163810 U O minidump       COMPRESSED MINIMAL DUMP
1104AA28   1129163810 T S SYSPROC        SYSTEM RESET INTERRUPT RECEIVED
9DBCFDEE   1129164110 T O errdemon       ERROR LOGGING TURNED ON
B6267342   1126235510 P H hdisk3         DISK OPERATION ERROR
B6267342   1125235510 P H hdisk3         DISK OPERATION ERROR
C5C09FFA   1125062110 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
C5C09FFA   1125051010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
C5C09FFA   1124144010 P S SYSVMM         SOFTWARE PROGRAM ABNORMALLY TERMINATED
(the same C5C09FFA entry with timestamp 1124144010 repeats for the rest of the listing)

p570a@root#errpt -aj C5C09FFA |more
---------------------------------------------------------------------------
LABEL:           PGSP_KILL
IDENTIFIER:      C5C09FFA
Date/Time:       Thu Nov 25 06:21:13 BEIST 2010
Sequence Number: 99122
Machine Id:      00C6E9C54C00
Node Id:         p570a
Class:           S
Type:            PERM
WPAR:            Global
Resource Name:   SYSVMM

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes
SYSTEM RUNNING OUT OF PAGING SPACE

Failure Causes
INSUFFICIENT PAGING SPACE DEFINED FOR THE SYSTEM
PROGRAM USING EXCESSIVE AMOUNT OF PAGING SPACE

The errpt timestamps are in MMDDhhmmYY form, so the earliest PGSP_KILL entries date from November 24 at 14:40. From that point the system was already reporting that it had run out of paging space, which means physical memory had been exhausted well before the outage.

3.2 Database alert log
alert_gdtest1.log contains a large number of errors like the following, starting on November 24:

Wed Nov 24 22:36:15 2010
ORA-27302: failure occurred at: skgpspawn3
ORA-27301: OS failure message: Not enough space
ORA-27300: OS system dependent operation:fork failed with status: 12
Errors in file /oracle/app/oracle/diag/rdbms/gdtest/gdtest1/trace/gdtest1_psp0_352314.trc:
Process startup failed, error stack:
Thu Nov 25 02:56:24 2010
Process q000 died, see its trace file
Thu Nov 25 02:56:13 2010
ORA-27302: failure occurred at: skgpspawn3
ORA-27301: OS failure message: Not enough space
ORA-27300: OS system dependent operation:fork failed with status: 12
Errors in file /oracle/app/oracle/diag/rdbms/gdtest/gdtest1/trace/gdtest1_psp0_352314.trc:
Process startup failed, error stack:
Instance terminated by USER, pid = 144242
USER (ospid: 144242): terminating the instance due to error 443
Process LMHB died, see its trace file
ORA-27302: failure occurred at: skgpspawn3
ORA-27301: OS failure message: Not enough space
ORA-27300: OS system dependent operation:fork failed with status: 12
Errors in file /oracle/app/oracle/diag/rdbms/ggdtest/gdtest1/trace/gdtest1_ora_144242.trc:

The database on node p570a went down because physical memory and paging space were both exhausted, so Oracle could no longer fork the processes it needed.

3.3 Listener.log
TNS-12500: TNS:listener failed to start a dedicated server process
TNS-12540: TNS:internal limit restriction exceeded
TNS-12560: TNS:protocol adapter error
TNS-00510: Internal limit restriction exceeded
IBM/AIX RISC System/6000 Error: 12: Not enough space

The listener log shows the same condition: the listener could not start server processes for incoming connections.

3.4 Checking physical memory and the Oracle memory parameters
Physical memory

p570a
AIX
System Model: IBM,9117-MMA
Memory Size: 15232 MB
Console Login: enable
Auto Restart: true
Full Core: false

Total physical memory is about 15 GB.

Database A

SQL> show sga
Total System Global Area 2137886720 bytes
Fixed Size                  2208496 bytes
Variable Size            1207962896 bytes
Database Buffers          922746880 bytes
Redo Buffers                4968448 bytes
SQL> show parameter sga
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
lock_sga                             boolean     FALSE
pre_page_sga                         boolean     FALSE
sga_max_size                         big integer 2G
sga_target                           big integer 2G
SQL> show parameter pga
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
pga_aggregate_target                 big integer 1G
SQL> show parameter instance_name
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
instance_name                        string      gd1

Database A is therefore configured to use about 3 GB of physical memory (2 GB SGA plus 1 GB PGA target).
Database B

SQL> show sga
Total System Global Area 8551575552 bytes
Fixed Size                  2223904 bytes
Variable Size            1778385120 bytes
Database Buffers         6761218048 bytes
Redo Buffers                9748480 bytes
SQL> show parameter sga
NAME                                 TYPE        VALUE
lock_sga                             boolean     FALSE
pre_page_sga                         boolean     FALSE
sga_max_size                         big integer 8G
sga_target                           big integer 8G
SQL> show parameter instance_name
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
instance_name                        string      gd2
SQL> show parameter pga
NAME                                 TYPE        VALUE
pga_aggregate_target                 big integer 2G

Database B is therefore configured to use about 10 GB of physical memory (8 GB SGA plus 2 GB PGA target), a large share of the host's total memory.
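The per-instance figures in section 3.4 can be rechecked from the raw byte counts. The following is a minimal sketch in Python, not part of the original report: the dictionary layout and the function name `configured_gib` are illustrative, and it treats `pga_aggregate_target` as if it were fully used, although in Oracle 11g that parameter is a target rather than a hard limit.

```python
GIB = 1024 ** 3

# Byte counts and PGA targets copied from the SQL*Plus listings in section 3.4.
# The dictionary layout itself is an illustrative choice, not from the report.
INSTANCES = {
    "gd1": {
        "sga_total": 2137886720,
        "sga_parts": [2208496, 1207962896, 922746880, 4968448],
        "pga_target_gib": 1,
    },
    "gd2": {
        "sga_total": 8551575552,
        "sga_parts": [2223904, 1778385120, 6761218048, 9748480],
        "pga_target_gib": 2,
    },
}


def configured_gib(inst):
    """Return SGA size plus PGA target in GiB for one instance."""
    # Sanity check: the four SGA components should add up to the reported total.
    if sum(inst["sga_parts"]) != inst["sga_total"]:
        raise ValueError("SGA components do not add up to the reported total")
    return inst["sga_total"] / GIB + inst["pga_target_gib"]


for name, inst in INSTANCES.items():
    print(f"{name}: {configured_gib(inst):.2f} GiB configured")
```

Both instances come out slightly below the round 3 GB and 10 GB figures, because the reported SGA totals are a little under the 2G and 8G targets.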
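4 Summary and recommendations
4.1 Root cause analysis
The host has about 15 GB of physical memory, and 13 GB of it was allocated to the two databases (3 GB for database A and 10 GB for database B), leaving only about 2 GB for the operating system. As the number of application connections grows, or when connections are not released, physical memory and paging space are easily exhausted, which brings the database down and hangs the host.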
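4.2 Actions taken and recommendations
1) The Oracle memory parameters of the gzcdc database were set too high and needed to be reduced. After discussion with the application vendor and the customer, the SGA of gzcdc was reduced to 5 GB and its PGA target set to 1 GB. Taking gzcdc to be database B, this leaves roughly 6 GB for the operating system by the accounting used above (15 GB minus 3 GB for database A and 6 GB for gzcdc; the original report gives this figure as 7 GB).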
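The memory budget before and after the change can be written out as a short calculation. This is a sketch under stated assumptions rather than a sizing method from the report: it rounds the host's 15232 MB to 15 GB as the report does, assumes gzcdc is database B, counts each PGA target as fully consumed, and uses a hypothetical helper name `os_headroom_gb`.

```python
TOTAL_GB = 15  # 15232 MB, rounded the same way the report rounds it


def os_headroom_gb(total_gb, instances):
    """Memory left for the OS after subtracting every instance's SGA + PGA target."""
    committed = sum(sga + pga for sga, pga in instances.values())
    return total_gb - committed


# (SGA GB, PGA target GB) per database; labels A and B follow section 3.4.
BEFORE = {"A": (2, 1), "B": (8, 2)}
# Assumption: gzcdc is database B, reduced to SGA 5 GB and PGA 1 GB.
AFTER = {"A": (2, 1), "B": (5, 1)}

print("headroom before:", os_headroom_gb(TOTAL_GB, BEFORE), "GB")
print("headroom after: ", os_headroom_gb(TOTAL_GB, AFTER), "GB")
```

Under those assumptions the headroom is about 2 GB before the change and about 6 GB after it. Real usage can exceed this estimate, since server processes and PGA allocations grow with the number of connections.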

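Site notice: this article was compiled and published by 风哥. Please keep this notice when reposting. The site makes no guarantee about the consequences of using any of its content; readers should use it with care.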