大页配置错误导致内存耗尽进而导致PMON (ospid: 12917): terminating the instance due to error 471
现象
客户反馈说有一个库不能启动,报错:
1 2 3 4 5 6 7 8 9 10 11 12 | SQL> startup ORACLE instance started. Total System Global Area 7918845952 bytes Fixed Size 8639328 bytes Variable Size 2197818528 bytes Database Buffers 5687476224 bytes Redo Buffers 24911872 bytes Database mounted. ORA-03113: end-of-file on communication channel Process ID: 12459 Session ID: 201 Serial number: 23649 |
分析
于是我连接到服务器上,使用nomount到mount再到open的方式来启动,nomount和mount都正常,但是最后open阶段报错了,告警日志如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 | 2022-03-03T21:10:21.910113+08:00 alter database open 2022-03-03T21:10:21.963402+08:00 Ping without log force is disabled: instance mounted in exclusive mode. 2022-03-03T21:10:21.998833+08:00 Beginning crash recovery of 1 threads 2022-03-03T21:10:25.417324+08:00 Warning: VKTM detected a time stall. Time drifts can result in unexpected behavior such as time-outs. Please see the VKTM trace file for more details: /u01/app/oracle/diag/rdbms/ais/ais/trace/ais_vktm_12924.trc 2022-03-03T21:11:11.410737+08:00 Instance Critical Process (pid: 19, ospid: 12954, DBW0) died unexpectedly 2022-03-03T21:11:20.449081+08:00 PMON (ospid: 12917): terminating the instance due to error 471 2022-03-03T21:11:20.449081+08:00 PMON (ospid: 12917): terminating the instance due to error 471 2022-03-03T21:11:21.377591+08:00 System state dump requested by (instance=1, osid=12917 (PMON)), summary=[abnormal instance termination]. System State dumped to trace file /u01/app/oracle/diag/rdbms/ais/ais/trace/ais_diag_12938_20220303211121.trc 2022-03-03T21:11:22.482920+08:00 Dumping diagnostic data in directory=[cdmp_20220303211118], requested by (instance=1, osid=12917 (PMON)), summary=[abnormal instance termination]. 2022-03-03T21:11:23.604548+08:00 Instance terminated by PMON, pid = 12917 2022-03-03T21:11:23.608875+08:00 Warning: 2 processes are still attach to shmid 15: 2022-03-03T21:11:23.608875+08:00 Warning: 2 processes are still attach to shmid 15: (size: 36864 bytes, creator pid: 12913, last attach/detach pid: 12934) 2022-03-03T21:11:24.438847+08:00 USER (ospid: 13315): terminating the instance due to error 471 2022-03-03T21:11:24.443487+08:00 Instance terminated by USER, pid = 13315 |
报错是“PMON (ospid: 12917): terminating the instance due to error 471”,PMON死掉了。
查询系统日志如下:
1 2 3 4 5 6 7 8 9 10 11 12 13 | [root@ais ~]# more /var/log/messages | grep oom | tail -n 10more /var/log/messages | grep oom | tail -n 10 Mar 3 21:05:31 ais kernel: [<ffffffffb57c1c6d>] ? oom_unkillable_task+0xcd/0x120 Mar 3 21:05:31 ais kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name Mar 3 21:05:31 ais kernel: systemd invoked oom-killer: gfp_mask=0x3000d0, order=2, oom_score_adj=0 Mar 3 21:05:31 ais kernel: [<ffffffffb57c227d>] oom_kill_process+0x2cd/0x490 Mar 3 21:05:31 ais kernel: [<ffffffffb57c1c6d>] ? oom_unkillable_task+0xcd/0x120 Mar 3 21:05:31 ais kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name Mar 3 21:11:05 ais kernel: systemd invoked oom-killer: gfp_mask=0x3000d0, order=2, oom_score_adj=0 Mar 3 21:11:05 ais kernel: [<ffffffffb57c227d>] oom_kill_process+0x2cd/0x490 Mar 3 21:11:05 ais kernel: [<ffffffffb57c1c6d>] ? oom_unkillable_task+0xcd/0x120 Mar 3 21:11:05 ais kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [root@ais ~]# [root@ais ~]# |
可以看到,发生了OOM,oracle是被系统杀死了,出现这个情况,一般是因为内存或资源要耗尽,系统强制kill掉进程,查看内存:
1 2 3 4 5 6 7 | [root@ais ~]# free -h total used free shared buff/cache available Mem: 125G 124G 611M 8.8M 346M 245M Swap: 0B 0B 0B [root@ais ~]# [root@ais ~]# cat /etc/sysctl.conf | grep nr_hugepages vm.nr_hugepages=88850 |
可以看到,系统剩余不到250MB的内存,而大页配置了88850*2/1024=174GB,远超系统内存。
这才想起来,原来我在配置大页的时候,使用了另一个系统的脚本,然后下午重启了主机,所以导致问题出现。
1 2 3 4 5 | [root@ais ~]# history | grep nr 160 2022-03-03 21:01:29 root fs.aio-max-nr = 1048576 235 2022-02-26 21:04:21 root echo -e 'vm.nr_hugepages=88850' >> /etc/sysctl.conf 236 2022-02-26 21:04:23 root echo 88850 > /proc/sys/vm/nr_hugepages 370 2022-03-03 21:17:02 root history | grep nr |
这一点也可以从历史命令中证实。
解决
修改大页的配置:
1 2 | echo -e 'vm.nr_hugepages=51200' >> /etc/sysctl.conf echo 51200 > /proc/sys/vm/nr_hugepages |
配置完成后:
1 2 3 4 5 | [root@ais ~]# free -h total used free shared buff/cache available Mem: 125G 104G 19G 8.8M 1.4G 20G Swap: 0B 0B 0B [root@ais ~]# |
然后重新启动oracle,正常:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | [root@ais ~]# su - oracle Last login: Thu Mar 3 21:01:37 CST 2022 on pts/4 [oracle@ais ~]$ sas SQL*Plus: Release 12.2.0.1.0 Production on Thu Mar 3 21:16:11 2022 Copyright (c) 1982, 2016, Oracle. All rights reserved. Connected to an idle instance. SQL> startup ORACLE instance started. Total System Global Area 7918845952 bytes Fixed Size 8639328 bytes Variable Size 2197818528 bytes Database Buffers 5687476224 bytes Redo Buffers 24911872 bytes Database mounted. Database opened. SQL> exit Disconnected from Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production |
参考
有关大页请参考: