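When GreenPlum Triggers Automatic Failover and Segment Drift, and How to Troubleshoot It (OOM)
Tags: original, troubleshooting, GreenPlum, segment, OOM, automatic failover, instance drift
Introduction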
If an OOM has occurred, the gpcc alert notifications will contain an entry titled "[告警]Out of memory errors" ("[Alert] Out of memory errors").
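Case 1 (the most common): an OOM occurred
One typical trigger for automatic failover is an OOM. In that case the master log file contains entries such as:
"FTS: cannot establish libpq connection (content=0, dbid=11): could not fork new process for connection: Cannot allocate memory" or "FATAL: Out of memory. Failed on request of size 144 bytes. (context 'GPORCAmemory pool')" or "FATAL: the database system is in recovery mode".
If no swap is configured, memory exhaustion leads directly to OOM, and in severe cases it causes automatic segment failover. A longer excerpt from such an incident: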
    FTS: cannot establish libpq connection (content=0, dbid=11): could not fork new process for connection: Cannot allocate memory
    The previous session was reset because its gang was disconnected (session id = 6072). The new session id = 109485
    FATAL: Out of memory. Failed on request of size 144 bytes. (context 'GPORCAmemory pool')
    FATAL: the database system is in recovery mode
    gang was lost due to cluster reconfiguration(cdbgang_async.c:97)
    rejecting TCP connection to master using internalconnection protocol
    Any temporary tables for this session have been dropped because the gang was disconnected (session id = 85341)
    failed to acquire resources on one or more segments
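The signatures in the log excerpt above can be counted with a small helper. This is a minimal sketch, not an official tool: the function name `scan_oom_signatures` is hypothetical, the pattern list is taken only from the messages quoted in this article, and the log location shown in the usage comment assumes a default Greenplum 6 layout.

```shell
# Count OOM / failover signatures in log text.
# Reads the files given as arguments, or stdin when none are given.
# Output: one "count<TAB>pattern" line per signature.
scan_oom_signatures() {
    local input pattern count
    input=$(cat -- "$@")
    while IFS= read -r pattern; do
        # grep -c prints 0 and exits 1 when nothing matches; "|| true" keeps going
        count=$(printf '%s\n' "$input" | grep -c -F -- "$pattern" || true)
        printf '%s\t%s\n' "$count" "$pattern"
    done <<'EOF'
Cannot allocate memory
Out of memory
the database system is in recovery mode
gang was lost due to cluster reconfiguration
EOF
}

# Usage (assumed default log location on the master host):
#   scan_oom_signatures "$MASTER_DATA_DIRECTORY"/pg_log/gpdb-*.csv
```

Non-zero counts for the first two patterns point to memory exhaustion; the last two indicate that sessions were affected by a segment going into recovery or by a cluster reconfiguration.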
Simulating an OOM error:
    # Create a test container limited to 2 GB of RAM with no extra swap
    docker rm -f gpdbtest
    docker run -itd --name gpdbtest -h gpdb6 \
      -m 2GB --memory-swap=2GB \
      -p 28180:28080 \
      --privileged=true lhrbest/greenplum:6.23.1 /usr/sbin/init

    # Inside the container: start Greenplum, set the per-segment memory
    # limit to 1024 MB, restart, then start GPCC
    docker exec -it gpdbtest bash
    su - gpadmin
    gpstart -am
    gpconfig -c gp_vmem_protect_limit -v 1024
    gpstop -M fast -ar
    gpcc start

    # GPCC web UI: http://192.168.88.162:28180

    # On the host: watch the container's memory usage
    docker stats gpdbtest

    # Generate load with sysbench until memory runs out
    psql -c 'drop database if exists sbtest2;'
    psql -c 'create database sbtest2;'

    sysbench /usr/share/sysbench/oltp_common.lua --db-driver=pgsql --pgsql-host=127.0.0.1 --pgsql-port=5432 \
      --pgsql-user=gpadmin --pgsql-password=lhr --pgsql-db=sbtest2 \
      --time=300 --table-size=2000000 --tables=10 --threads=10 \
      --events=999999999 prepare

    # Check whether any segment has failed over
    psql -c 'select * from gp_segment_configuration;'
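After the simulation, the output of `gp_segment_configuration` shows whether a failover happened. As a rough sketch, the filter below flags rows whose `status` is `d` (down) or whose `role` differs from `preferred_role` (the instance has drifted). The function name `find_drifted_segments` is hypothetical, and it assumes unaligned, tuples-only psql output with the six columns in exactly the order shown in the usage comment.

```shell
# Read rows "dbid|content|role|preferred_role|status|hostname" from stdin
# and report segments that are down or not running in their preferred role.
find_drifted_segments() {
    awk -F'|' '
        # status "d" means the instance is marked down
        $5 == "d" { print "DOWN dbid=" $1 " content=" $2 " host=" $6; next }
        # role differs from preferred_role: primary and mirror have swapped
        $3 != $4  { print "DRIFTED dbid=" $1 " content=" $2 " host=" $6 " role=" $3 " preferred=" $4 }
    '
}

# Usage (run as gpadmin on the master):
#   psql -d postgres -Atc "select dbid, content, role, preferred_role, status, hostname
#                          from gp_segment_configuration order by content, dbid" \
#     | find_drifted_segments
```

Empty output means every instance is up and in its preferred role.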
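Case 2: exceeding the maximum process count exhausts system resources
When the maximum number of processes is exceeded, the log reports errors such as: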
    could not fork new process for connection: Resource temporarily unavailable

    could not fork new process for connection: Resource temporarily unavailable (seg0 119.10.25.26:6000)

    FATAL: InitMotionLayerIPC: failed to create thread (ic_udpifc.c:1488)
    DETAIL: pthread_create() failed with err 11 (seg11 19.10.25.26:7003)
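On Linux, error 11 is `EAGAIN`, whose message is "Resource temporarily unavailable", so both errors above point to a process or thread limit being hit. One way to confirm this is to compare the number of threads owned by `gpadmin` with its `nproc` limit. The helper below is only an illustrative sketch: `limit_usage` is a hypothetical name, the 80% warning threshold is an arbitrary assumption, and the usage comment assumes the procps version of `ps`.

```shell
# Print how much of a limit is consumed.
# Arguments: label, used count, limit (a number or "unlimited"),
#            optional warning threshold in percent (default 80).
limit_usage() {
    local label=$1 used=$2 limit=$3 threshold=${4:-80}
    local pct
    if [ "$limit" = "unlimited" ]; then
        printf 'OK %s: %s used, no limit\n' "$label" "$used"
        return 0
    fi
    pct=$(( used * 100 / limit ))   # integer percentage, rounded down
    if [ "$pct" -ge "$threshold" ]; then
        printf 'WARN %s: %s/%s (%s%%)\n' "$label" "$used" "$limit" "$pct"
    else
        printf 'OK %s: %s/%s (%s%%)\n' "$label" "$used" "$limit" "$pct"
    fi
}

# Usage (run as gpadmin on the affected host):
#   limit_usage nproc "$(ps -L -u gpadmin -o lwp= | wc -l)" "$(ulimit -u)"
#   limit_usage pids  "$(ps -eL -o lwp= | wc -l)" "$(cat /proc/sys/kernel/pid_max)"
```

Threads count against `nproc` just as processes do, which is why the sketch counts lightweight processes (`ps -L`) rather than processes alone.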
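This error is most often caused by kernel parameters and resource limits that were never adjusted correctly. Fix them as follows: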
    # Make sure pam_limits is present and loaded for login sessions
    ls -l /lib64/security/pam_limits.so
    echo "session required /lib64/security/pam_limits.so" >> /etc/pam.d/login

    # Raise the open-file and process limits for all users
    cat >> /etc/security/limits.conf <<"EOF"
    * soft nofile 655350
    * hard nofile 655350
    * soft nproc 655350
    * hard nproc 655350
    gpadmin soft priority -20
    EOF

    # The default 4096 nproc cap in 20-nproc.conf overrides limits.conf
    sed -i 's/4096/655350/' /etc/security/limits.d/20-nproc.conf
    cat /etc/security/limits.d/20-nproc.conf

    # Raise the kernel-wide file, inotify, and PID limits
    cat >> /etc/sysctl.conf <<"EOF"
    fs.file-max=9000000
    fs.inotify.max_user_instances = 1000000
    fs.inotify.max_user_watches = 1000000
    kernel.pid_max=4194304
    EOF

    sysctl -p
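Because the fix above appends with `>>`, running it twice leaves duplicate lines in `/etc/pam.d/login`, `limits.conf`, and `sysctl.conf`. A small guard avoids that. This is a sketch under the assumption that an exact whole-line match is a good enough duplicate test; `append_once` is a hypothetical helper name, not part of any system tool.

```shell
# Append a line to a file only if that exact line is not already present.
# Arguments: line, file. Creates the file if it does not exist.
append_once() {
    local line=$1 file=$2
    # -x: match the whole line, -F: treat the text literally, -q: no output
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage (as root), replacing the plain ">>" appends:
#   append_once "session required /lib64/security/pam_limits.so" /etc/pam.d/login
#   append_once "* soft nproc 655350" /etc/security/limits.conf
#   append_once "kernel.pid_max=4194304" /etc/sysctl.conf
```

The new limits only apply to new login sessions, so after logging in again as `gpadmin` it is worth checking `ulimit -n` and `ulimit -u`, and `sysctl kernel.pid_max` for the kernel setting, before restarting the cluster.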