Building PostgreSQL 11 High Availability with Pacemaker + Corosync + pcs
Tags: Corosync, Pacemaker, pcs, PG High Availability
Introduction
Common PostgreSQL high-availability tools include Pgpool-II, keepalived, Patroni + etcd, repmgr, and so on. Architectures covered previously include:
- PG 14 + Pgpool-II + Watchdog for high availability (automatic failover + read/write splitting + load balancing)
- PG 14 + Pgpool-II + Watchdog for high availability
- PG high availability with primary/standby streaming replication + keepalived
- PG high-availability cluster with Patroni + etcd + HAProxy + keepalived + Prometheus + Grafana monitoring
- PostgreSQL high availability with repmgr (1 primary + 2 standbys + 1 witness) + Pgpool-II for switchover + read/write splitting + load balancing
- 【DB宝61】PostgreSQL read/write splitting and load balancing with Pgpool-II
- 【DB宝60】PG 12 high availability: building a 1-primary-2-standby streaming replication environment and switchover testing
Today we introduce another high-availability approach: PostgreSQL based on Pacemaker + Corosync + pcs. Pacemaker handles resource failover, Corosync provides heartbeat detection, and together they automate the management of the HA architecture. The heartbeat checks whether a server is still providing service; if a server is detected as abnormal it is considered down, and Pacemaker then moves its resources to another node. pcs is the configuration tool for Corosync and Pacemaker.
Pacemaker is the most widely used open-source Cluster Resource Manager (CRM) on Linux. It relies on the messaging and membership services provided by the cluster infrastructure (Corosync or Heartbeat) to detect failures at the node and resource level and to recover resources, maximizing the availability of cluster services. It is the control center of the whole HA cluster and manages the state and behavior of all cluster resources; clients configure, manage, and monitor the cluster through Pacemaker.
Pacemaker website: https://clusterlabs.org/pacemaker/
Pacemaker on GitHub: https://github.com/ClusterLabs/pacemaker
The Corosync cluster engine is a Group Communication System that provides the additional building blocks applications need for high availability. Both Corosync and Heartbeat operate at the messaging layer, providing heartbeat detection between hosts; once the monitored primary service is found to be down, the cluster switches over to a standby node to keep the system available. In practice Corosync is usually chosen for heartbeat detection and is paired with Pacemaker as the resource manager to build an HA system.
pcs is the configuration tool for Corosync and Pacemaker. It lets users easily view, modify, and create Pacemaker-based clusters. pcs includes pcsd (the pcs daemon), which acts as a remote server for pcs and provides a Web UI. All managed changes to Pacemaker and its configuration properties can be made through pcs.
In fact, Pacemaker + Corosync + pcs is also the management stack used for "Installing an MSSQL 2017 Always On Availability Group on Linux"; see: https://www.xmmup.com/zailinuxzhonganzhuangmssql-2017-always-on-availability-group.html
Note: please read the Summary section at the end of this article first!
Environment
IP | Port | Mapped host port | Role | OS | Software
---|---|---|---|---|---
172.72.6.81 | 5432 | 64381 | Primary | CentOS 7.6 | PG 11.12 + Pacemaker 1.1.23 + Corosync + pcs
172.72.6.82 | 5432 | 64382 | Standby | CentOS 7.6 | PG 11.12 + Pacemaker 1.1.23 + Corosync + pcs
172.72.6.83 | 5432 | 64383 | Standby | CentOS 7.6 | PG 11.12 + Pacemaker 1.1.23 + Corosync + pcs
Use docker to spin up the environment quickly:
```
docker rm -f lhrpgpcs81
docker run -d --name lhrpgpcs81 -h lhrpgpcs81 \
  --net=pg-network --ip 172.72.6.81 \
  -p 64381:5432 \
  -v /sys/fs/cgroup:/sys/fs/cgroup \
  --privileged=true lhrbest/lhrpgall:2.0 \
  /usr/sbin/init

docker rm -f lhrpgpcs82
docker run -d --name lhrpgpcs82 -h lhrpgpcs82 \
  --net=pg-network --ip 172.72.6.82 \
  -p 64382:5432 \
  -v /sys/fs/cgroup:/sys/fs/cgroup \
  --privileged=true lhrbest/lhrpgall:2.0 \
  /usr/sbin/init

docker rm -f lhrpgpcs83
docker run -d --name lhrpgpcs83 -h lhrpgpcs83 \
  --net=pg-network --ip 172.72.6.83 \
  -p 64383:5432 \
  -v /sys/fs/cgroup:/sys/fs/cgroup \
  --privileged=true lhrbest/lhrpgall:2.0 \
  /usr/sbin/init

cat >> /etc/hosts <<"EOF"
172.72.6.81  lhrpgpcs81
172.72.6.82  lhrpgpcs82
172.72.6.83  lhrpgpcs83
172.72.6.84  vip-master
172.72.6.85  vip-slave
EOF

# adjust on all 3 nodes
systemctl stop pg94 pg96 pg10 pg11 pg12 pg13 postgresql-13
systemctl disable pg94 pg96 pg10 pg11 pg12 pg13 postgresql-13
mount -o remount,size=4G /dev/shm
```
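Before going further, it is worth a quick sanity check that all three containers are running and can reach each other by name. A minimal sketch run from the docker host, assuming the image includes the ping utility:

```
# confirm the three containers are up
docker ps --filter "name=lhrpgpcs8"

# confirm node 1 can resolve and reach the other two nodes by hostname
docker exec lhrpgpcs81 ping -c 1 lhrpgpcs82
docker exec lhrpgpcs81 ping -c 1 lhrpgpcs83
```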
Installing and Configuring the Cluster
Install dependencies and cluster software, and start pcsd
Run on all 3 nodes:
```
yum install -y pacemaker corosync pcs
systemctl enable pacemaker corosync pcsd
systemctl start pcsd
systemctl status pcsd
```
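It can also help to record exactly which versions yum pulled in, since the PostgreSQL-version limitation discussed in the Summary depends on the Pacemaker and resource-agents release; a quick check:

```
# show the installed cluster-stack versions
rpm -q pacemaker corosync pcs resource-agents
pacemakerd --version
corosync -v
```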
Set the hacluster password
Run on all 3 nodes:
```
echo "hacluster:lhr" | chpasswd
```
Cluster authentication
Run on any one node:
```
[root@lhrpgpcs81 /]# pcs cluster auth -u hacluster -p lhr lhrpgpcs81 lhrpgpcs82 lhrpgpcs83
lhrpgpcs83: Authorized
lhrpgpcs82: Authorized
lhrpgpcs81: Authorized
```
Create the cluster and sync the configuration
Run on any one node:
```
[root@lhrpgpcs81 /]# pcs cluster setup --last_man_standing=1 --name pgcluster lhrpgpcs81 lhrpgpcs82 lhrpgpcs83
Destroying cluster on nodes: lhrpgpcs81, lhrpgpcs82, lhrpgpcs83...
lhrpgpcs82: Stopping Cluster (pacemaker)...
lhrpgpcs83: Stopping Cluster (pacemaker)...
lhrpgpcs81: Stopping Cluster (pacemaker)...
lhrpgpcs83: Successfully destroyed cluster
lhrpgpcs81: Successfully destroyed cluster
lhrpgpcs82: Successfully destroyed cluster

Sending 'pacemaker_remote authkey' to 'lhrpgpcs81', 'lhrpgpcs82', 'lhrpgpcs83'
lhrpgpcs81: successful distribution of the file 'pacemaker_remote authkey'
lhrpgpcs82: successful distribution of the file 'pacemaker_remote authkey'
lhrpgpcs83: successful distribution of the file 'pacemaker_remote authkey'
Sending cluster config files to the nodes...
lhrpgpcs81: Succeeded
lhrpgpcs82: Succeeded
lhrpgpcs83: Succeeded

Synchronizing pcsd certificates on nodes lhrpgpcs81, lhrpgpcs82, lhrpgpcs83...
lhrpgpcs83: Success
lhrpgpcs82: Success
lhrpgpcs81: Success
Restarting pcsd on the nodes in order to reload the certificates...
lhrpgpcs83: Success
lhrpgpcs82: Success
lhrpgpcs81: Success
```
Start the cluster

```
systemctl status pacemaker corosync pcsd
pcs cluster enable --all
pcs cluster start --all
pcs cluster status
pcs status
crm_mon -Afr -1
systemctl status pacemaker corosync pcsd
```

Output:
```
[root@lhrpgpcs81 /]# systemctl status pacemaker corosync pcsd
● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:pacemakerd
           https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html

● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview

● pcsd.service - PCS GUI and remote configuration interface
   Loaded: loaded (/usr/lib/systemd/system/pcsd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-05-26 09:17:43 CST; 1min 4s ago
     Docs: man:pcsd(8)
           man:pcs(8)
 Main PID: 1682 (pcsd)
   CGroup: /docker/fb1552ae5063103b6754dc18a049673c6ee97fb59bc767374aeef2768d3607be/system.slice/pcsd.service
           └─1682 /usr/bin/ruby /usr/lib/pcsd/pcsd

May 26 09:17:42 lhrpgpcs81 systemd[1]: Starting PCS GUI and remote configuration interface...
May 26 09:17:43 lhrpgpcs81 systemd[1]: Started PCS GUI and remote configuration interface.

[root@lhrpgpcs81 /]# pcs cluster start --all
lhrpgpcs81: Starting Cluster (corosync)...
lhrpgpcs82: Starting Cluster (corosync)...
lhrpgpcs83: Starting Cluster (corosync)...
lhrpgpcs81: Starting Cluster (pacemaker)...
lhrpgpcs83: Starting Cluster (pacemaker)...
lhrpgpcs82: Starting Cluster (pacemaker)...

[root@lhrpgpcs81 /]# systemctl status pacemaker corosync pcsd
● pacemaker.service - Pacemaker High Availability Cluster Manager
   Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; disabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-05-26 09:19:09 CST; 3s ago
     Docs: man:pacemakerd
           https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html
 Main PID: 1893 (pacemakerd)
   CGroup: /docker/fb1552ae5063103b6754dc18a049673c6ee97fb59bc767374aeef2768d3607be/system.slice/pacemaker.service
           ├─1893 /usr/sbin/pacemakerd -f
           ├─1894 /usr/libexec/pacemaker/cib
           ├─1895 /usr/libexec/pacemaker/stonithd
           ├─1896 /usr/libexec/pacemaker/lrmd
           ├─1897 /usr/libexec/pacemaker/attrd
           ├─1898 /usr/libexec/pacemaker/pengine
           └─1899 /usr/libexec/pacemaker/crmd

May 26 09:19:11 lhrpgpcs81 crmd[1899]: notice: Connecting to cluster infrastructure: corosync
May 26 09:19:11 lhrpgpcs81 crmd[1899]: notice: Quorum acquired
May 26 09:19:11 lhrpgpcs81 attrd[1897]: notice: Node lhrpgpcs83 state is now member
May 26 09:19:11 lhrpgpcs81 attrd[1897]: notice: Node lhrpgpcs82 state is now member
May 26 09:19:11 lhrpgpcs81 crmd[1899]: notice: Node lhrpgpcs81 state is now member
May 26 09:19:11 lhrpgpcs81 crmd[1899]: notice: Node lhrpgpcs82 state is now member
May 26 09:19:11 lhrpgpcs81 crmd[1899]: notice: Node lhrpgpcs83 state is now member
May 26 09:19:11 lhrpgpcs81 crmd[1899]: notice: The local CRM is operational
May 26 09:19:11 lhrpgpcs81 crmd[1899]: notice: State transition S_STARTING -> S_PENDING
May 26 09:19:13 lhrpgpcs81 crmd[1899]: notice: Fencer successfully connected

● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-05-26 09:19:08 CST; 4s ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
  Process: 1861 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
 Main PID: 1869 (corosync)
   CGroup: /docker/fb1552ae5063103b6754dc18a049673c6ee97fb59bc767374aeef2768d3607be/system.slice/corosync.service
           └─1869 corosync

May 26 09:19:08 lhrpgpcs81 corosync[1869]: [QUORUM] Members[2]: 1 2
May 26 09:19:08 lhrpgpcs81 corosync[1869]: [MAIN  ] Completed service synchronization, ready to provide service.
May 26 09:19:08 lhrpgpcs81 corosync[1861]: Starting Corosync Cluster Engine (corosync): [  OK  ]
May 26 09:19:08 lhrpgpcs81 systemd[1]: Started Corosync Cluster Engine.
May 26 09:19:08 lhrpgpcs81 corosync[1869]: [TOTEM ] A new membership (172.72.6.81:13) was formed. Members joined: 3
May 26 09:19:08 lhrpgpcs81 corosync[1869]: [CPG   ] downlist left_list: 0 received
May 26 09:19:08 lhrpgpcs81 corosync[1869]: [CPG   ] downlist left_list: 0 received
May 26 09:19:08 lhrpgpcs81 corosync[1869]: [CPG   ] downlist left_list: 0 received
May 26 09:19:08 lhrpgpcs81 corosync[1869]: [QUORUM] Members[3]: 1 2 3
May 26 09:19:08 lhrpgpcs81 corosync[1869]: [MAIN  ] Completed service synchronization, ready to provide service.

● pcsd.service - PCS GUI and remote configuration interface
   Loaded: loaded (/usr/lib/systemd/system/pcsd.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2022-05-26 09:17:43 CST; 1min 29s ago
     Docs: man:pcsd(8)
           man:pcs(8)
 Main PID: 1682 (pcsd)
   CGroup: /docker/fb1552ae5063103b6754dc18a049673c6ee97fb59bc767374aeef2768d3607be/system.slice/pcsd.service
           └─1682 /usr/bin/ruby /usr/lib/pcsd/pcsd

May 26 09:17:42 lhrpgpcs81 systemd[1]: Starting PCS GUI and remote configuration interface...
May 26 09:17:43 lhrpgpcs81 systemd[1]: Started PCS GUI and remote configuration interface.

[root@lhrpgpcs81 /]# pcs cluster status
Cluster Status:
 Stack: corosync
 Current DC: lhrpgpcs81 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
 Last updated: Thu May 26 10:59:02 2022
 Last change: Thu May 26 10:58:18 2022 by hacluster via crmd on lhrpgpcs81
 3 nodes configured
 0 resource instances configured

PCSD Status:
  lhrpgpcs81: Online
  lhrpgpcs82: Online
  lhrpgpcs83: Online

[root@lhrpgpcs81 /]# pcs status
Cluster name: pgcluster

WARNINGS:
No stonith devices and stonith-enabled is not false

Stack: corosync
Current DC: lhrpgpcs81 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Thu May 26 10:59:19 2022
Last change: Thu May 26 10:58:18 2022 by hacluster via crmd on lhrpgpcs81

3 nodes configured
0 resource instances configured

Online: [ lhrpgpcs81 lhrpgpcs82 lhrpgpcs83 ]

No resources

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@lhrpgpcs81 /]# crm_mon -Afr -1
Stack: corosync
Current DC: lhrpgpcs81 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Thu May 26 10:59:40 2022
Last change: Thu May 26 10:58:18 2022 by hacluster via crmd on lhrpgpcs81

3 nodes configured
0 resource instances configured

Online: [ lhrpgpcs81 lhrpgpcs82 lhrpgpcs83 ]

No resources

Node Attributes:
* Node lhrpgpcs81:
* Node lhrpgpcs82:
* Node lhrpgpcs83:

Migration Summary:
* Node lhrpgpcs81:
* Node lhrpgpcs82:
* Node lhrpgpcs83:
```
Installing and Configuring PostgreSQL
The yum-installed Pacemaker is version 1.1.23, and its pgsql resource agent supports PostgreSQL only up to PG 11.
Check which PostgreSQL versions the pgsql resource agent handles:
```
[root@lhrpgpcs81 /]# cat /usr/lib/ocf/resource.d/heartbeat/pgsql | grep ocf_version_cmp
    ocf_version_cmp "$version" "9.3"
    ocf_version_cmp "$version" "10"
    ocf_version_cmp "$version" "9.4"
```
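To see every parameter the pgsql resource agent accepts (useful when adapting the resource-creation script further below), you can also query the agent metadata through pcs; this is generic pcs usage rather than anything specific to this environment:

```
# list the parameters and defaults of the ocf:heartbeat:pgsql resource agent
pcs resource describe ocf:heartbeat:pgsql
```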
Modify parameters on the primary and enable archiving
PostgreSQL is already installed in my image, so there is nothing to install; only basic configuration is needed (firewall, remote access, and so on). Configure and start the primary:
```
vi /pg11/pgdata/postgresql.conf
port=5432
# comment out the unix_socket_directories line
wal_level='replica'
archive_mode='on'
archive_command='test ! -f /pg11/archive/%f && cp %p /pg11/archive/%f'
max_wal_senders=10
wal_keep_segments=256
wal_sender_timeout=60s

# fix ownership
chown postgres.postgres -R /pg11

# edit the service unit
vi /etc/systemd/system/pg11.service
User=postgres
Environment=PGPORT=5432

systemctl daemon-reload
systemctl status pg11
systemctl restart pg11
```
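After the restart, a quick check that the replication-related parameters actually took effect can save debugging later. A minimal sketch, run as root on the primary (the archive directory may already exist in the image; creating it is harmless):

```
# make sure the directory referenced by archive_command exists and is writable by postgres
mkdir -p /pg11/archive && chown postgres.postgres /pg11/archive

# verify the settings streaming replication and archiving rely on
su - postgres -c "psql -h /tmp -p 5432 -c \"select name, setting from pg_settings where name in ('wal_level','archive_mode','max_wal_senders','wal_keep_segments');\""
```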
Create a replication user on the primary
```
su - postgres
psql -h /tmp
psql -h 192.168.88.35 -p64381 -U postgres

create user lhrrep with replication password 'lhr';
```
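The standbys and the pgsql resource agent will connect as lhrrep, so pg_hba.conf on every node must allow replication (and normal) connections from the cluster subnet. The exact rules depend on what the image already ships with; a minimal sketch for this 172.72.6.0/24 network could be:

```
# allow replication and normal connections from the cluster subnet (adjust the auth method as needed)
cat >> /pg11/pgdata/pg_hba.conf <<"EOF"
host    replication     lhrrep     172.72.6.0/24     md5
host    all             all        172.72.6.0/24     md5
EOF

# reload the primary so the new rules take effect
su - postgres -c "/pg11/pg11/bin/pg_ctl reload -D /pg11/pgdata"
```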
Build the standby nodes
Run on nodes 2 and 3:
```
systemctl stop pg11
rm -rf /pg11/pgdata
pg_basebackup -h lhrpgpcs81 -U lhrrep -p 5432 -D /pg11/pgdata --wal-method=stream --checkpoint=fast --progress --verbose
chown postgres.postgres -R /pg11/pgdata
```
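Optionally, before handing the standbys over to the cluster, you can confirm the copied data directory looks sane; pg_controldata ships with PG 11 and the paths below follow this environment's layout:

```
# ownership must be postgres:postgres and the control file must be readable
ls -ld /pg11/pgdata
su - postgres -c "/pg11/pg11/bin/pg_controldata /pg11/pgdata | grep -i 'cluster state'"
```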
Stop the primary

```
systemctl stop pg11
```
Configuring the DB Cluster
Run as root:
```
cat > lhrpg_cluster_setup.sh <<"EOF"
pcs cluster cib pgsql_cfg

pcs -f pgsql_cfg property set no-quorum-policy="ignore"
pcs -f pgsql_cfg property set stonith-enabled="false"
pcs -f pgsql_cfg resource defaults resource-stickiness="INFINITY"
pcs -f pgsql_cfg resource defaults migration-threshold="1"

pcs -f pgsql_cfg resource create vip-master IPaddr2 \
   ip="172.72.6.84" \
   nic="eth0" \
   cidr_netmask="24" \
   op start   timeout="60s" interval="0s"  on-fail="restart" \
   op monitor timeout="60s" interval="10s" on-fail="restart" \
   op stop    timeout="60s" interval="0s"  on-fail="block"

pcs -f pgsql_cfg resource create vip-slave IPaddr2 \
   ip="172.72.6.85" \
   nic="eth0" \
   cidr_netmask="24" \
   meta migration-threshold="0" \
   op start   timeout="60s" interval="0s"  on-fail="stop" \
   op monitor timeout="60s" interval="10s" on-fail="restart" \
   op stop    timeout="60s" interval="0s"  on-fail="ignore"

pcs -f pgsql_cfg resource create pgsql pgsql \
   pgctl="/pg11/pg11/bin/pg_ctl" \
   psql="/pg11/pg11/bin/psql" \
   pgdata="/pg11/pgdata" \
   config="/pg11/pgdata/postgresql.conf" \
   rep_mode="async" \
   node_list="lhrpgpcs81 lhrpgpcs82 lhrpgpcs83" \
   master_ip="172.72.6.84" \
   repuser="lhrrep" \
   primary_conninfo_opt="password=lhr keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
   restart_on_promote='true' \
   op start   timeout="60s" interval="0s" on-fail="restart" \
   op monitor timeout="60s" interval="4s" on-fail="restart" \
   op monitor timeout="60s" interval="3s" on-fail="restart" role="Master" \
   op promote timeout="60s" interval="0s" on-fail="restart" \
   op demote  timeout="60s" interval="0s" on-fail="stop" \
   op stop    timeout="60s" interval="0s" on-fail="block" \
   op notify  timeout="60s" interval="0s"

pcs -f pgsql_cfg resource master msPostgresql pgsql \
   master-max=1 master-node-max=1 clone-max=5 clone-node-max=1 notify=true

pcs -f pgsql_cfg resource group add master-group vip-master
pcs -f pgsql_cfg resource group add slave-group vip-slave

pcs -f pgsql_cfg constraint colocation add master-group with master msPostgresql INFINITY
pcs -f pgsql_cfg constraint order promote msPostgresql then start master-group symmetrical=false score=INFINITY
pcs -f pgsql_cfg constraint order demote msPostgresql then stop master-group symmetrical=false score=0

pcs -f pgsql_cfg constraint colocation add slave-group with slave msPostgresql INFINITY
pcs -f pgsql_cfg constraint order promote msPostgresql then start slave-group symmetrical=false score=INFINITY
pcs -f pgsql_cfg constraint order demote msPostgresql then stop slave-group symmetrical=false score=0

pcs cluster cib-push pgsql_cfg
EOF
```
Run the shell script; it produces a configuration file named pgsql_cfg and pushes it into the cluster:

```
chmod +x lhrpg_cluster_setup.sh
sh lhrpg_cluster_setup.sh
```

Output:
```
[root@lhrpgpcs81 ~]# chmod + lhrpg_cluster_setup.sh
[root@lhrpgpcs81 ~]# sh lhrpg_cluster_setup.sh
Warning: Defaults do not apply to resources which override them with their own defined values
Warning: Defaults do not apply to resources which override them with their own defined values
Assumed agent name 'ocf:heartbeat:IPaddr2' (deduced from 'IPaddr2')
Assumed agent name 'ocf:heartbeat:IPaddr2' (deduced from 'IPaddr2')
Assumed agent name 'ocf:heartbeat:pgsql' (deduced from 'pgsql')
Adding msPostgresql master-group (score: INFINITY) (Options: first-action=promote then-action=start symmetrical=false)
Adding msPostgresql master-group (score: 0) (Options: first-action=demote then-action=stop symmetrical=false)
Adding msPostgresql slave-group (score: INFINITY) (Options: first-action=promote then-action=start symmetrical=false)
Adding msPostgresql slave-group (score: 0) (Options: first-action=demote then-action=stop symmetrical=false)
CIB updated
[root@lhrpgpcs81 ~]#
[root@lhrpgpcs81 ~]# ll
total 12
-rw-r--r-- 1 root root 2675 May 26 09:47 lhrpg_cluster_setup.sh
-rw-r--r-- 1 root root 8091 May 26 09:47 pgsql_cfg
[root@lhrpgpcs81 ~]# cat pgsql_cfg
<cib crm_feature_set="3.0.14" validate-with="pacemaker-2.10" epoch="21" num_updates="0" admin_epoch="0" cib-last-written="Thu May 26 14:54:22 2022" update-origin="lhrpgpcs81" update-client="crmd" update-user="hacluster" have-quorum="1" dc-uuid="1">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="false"/>
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.23-1.el7_9.1-9acf116022"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
        <nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name" value="pgcluster"/>
        <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
        <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="1" uname="lhrpgpcs81"/>
      <node id="2" uname="lhrpgpcs82"/>
      <node id="3" uname="lhrpgpcs83"/>
    </nodes>
    <resources>
      <master id="msPostgresql">
        <primitive class="ocf" id="pgsql" provider="heartbeat" type="pgsql">
          <instance_attributes id="pgsql-instance_attributes">
            <nvpair id="pgsql-instance_attributes-config" name="config" value="/pg11/pgdata/postgresql.conf"/>
            <nvpair id="pgsql-instance_attributes-master_ip" name="master_ip" value="172.72.6.84"/>
            <nvpair id="pgsql-instance_attributes-node_list" name="node_list" value="lhrpgpcs81 lhrpgpcs82 lhrpgpcs83"/>
            <nvpair id="pgsql-instance_attributes-pgctl" name="pgctl" value="/pg11/pg11/bin/pg_ctl"/>
            <nvpair id="pgsql-instance_attributes-pgdata" name="pgdata" value="/pg11/pgdata"/>
            <nvpair id="pgsql-instance_attributes-primary_conninfo_opt" name="primary_conninfo_opt" value="password=lhr keepalives_idle=60 keepalives_interval=5 keepalives_count=5"/>
            <nvpair id="pgsql-instance_attributes-psql" name="psql" value="/pg11/pg11/bin/psql"/>
            <nvpair id="pgsql-instance_attributes-rep_mode" name="rep_mode" value="async"/>
            <nvpair id="pgsql-instance_attributes-repuser" name="repuser" value="lhrrep"/>
            <nvpair id="pgsql-instance_attributes-restart_on_promote" name="restart_on_promote" value="true"/>
          </instance_attributes>
          <operations>
            <op id="pgsql-demote-interval-0s" interval="0s" name="demote" on-fail="stop" timeout="60s"/>
            <op id="pgsql-methods-interval-0s" interval="0s" name="methods" timeout="5s"/>
            <op id="pgsql-monitor-interval-4s" interval="4s" name="monitor" on-fail="restart" timeout="60s"/>
            <op id="pgsql-monitor-interval-3s" interval="3s" name="monitor" on-fail="restart" role="Master" timeout="60s"/>
            <op id="pgsql-notify-interval-0s" interval="0s" name="notify" timeout="60s"/>
            <op id="pgsql-promote-interval-0s" interval="0s" name="promote" on-fail="restart" timeout="60s"/>
            <op id="pgsql-start-interval-0s" interval="0s" name="start" on-fail="restart" timeout="60s"/>
            <op id="pgsql-stop-interval-0s" interval="0s" name="stop" on-fail="block" timeout="60s"/>
          </operations>
        </primitive>
        <meta_attributes id="msPostgresql-meta_attributes">
          <nvpair id="msPostgresql-meta_attributes-master-node-max" name="master-node-max" value="1"/>
          <nvpair id="msPostgresql-meta_attributes-clone-max" name="clone-max" value="5"/>
          <nvpair id="msPostgresql-meta_attributes-notify" name="notify" value="true"/>
          <nvpair id="msPostgresql-meta_attributes-master-max" name="master-max" value="1"/>
          <nvpair id="msPostgresql-meta_attributes-clone-node-max" name="clone-node-max" value="1"/>
        </meta_attributes>
      </master>
      <group id="master-group">
        <primitive class="ocf" id="vip-master" provider="heartbeat" type="IPaddr2">
          <instance_attributes id="vip-master-instance_attributes">
            <nvpair id="vip-master-instance_attributes-cidr_netmask" name="cidr_netmask" value="24"/>
            <nvpair id="vip-master-instance_attributes-ip" name="ip" value="172.72.6.84"/>
            <nvpair id="vip-master-instance_attributes-nic" name="nic" value="eth0"/>
          </instance_attributes>
          <operations>
            <op id="vip-master-monitor-interval-10s" interval="10s" name="monitor" on-fail="restart" timeout="60s"/>
            <op id="vip-master-start-interval-0s" interval="0s" name="start" on-fail="restart" timeout="60s"/>
            <op id="vip-master-stop-interval-0s" interval="0s" name="stop" on-fail="block" timeout="60s"/>
          </operations>
        </primitive>
      </group>
      <group id="slave-group">
        <primitive class="ocf" id="vip-slave" provider="heartbeat" type="IPaddr2">
          <instance_attributes id="vip-slave-instance_attributes">
            <nvpair id="vip-slave-instance_attributes-cidr_netmask" name="cidr_netmask" value="24"/>
            <nvpair id="vip-slave-instance_attributes-ip" name="ip" value="172.72.6.85"/>
            <nvpair id="vip-slave-instance_attributes-nic" name="nic" value="eth0"/>
          </instance_attributes>
          <meta_attributes id="vip-slave-meta_attributes">
            <nvpair id="vip-slave-meta_attributes-migration-threshold" name="migration-threshold" value="0"/>
          </meta_attributes>
          <operations>
            <op id="vip-slave-monitor-interval-10s" interval="10s" name="monitor" on-fail="restart" timeout="60s"/>
            <op id="vip-slave-start-interval-0s" interval="0s" name="start" on-fail="stop" timeout="60s"/>
            <op id="vip-slave-stop-interval-0s" interval="0s" name="stop" on-fail="ignore" timeout="60s"/>
          </operations>
        </primitive>
      </group>
    </resources>
    <constraints>
      <rsc_colocation id="colocation-master-group-msPostgresql-INFINITY" rsc="master-group" rsc-role="Started" score="INFINITY" with-rsc="msPostgresql" with-rsc-role="Master"/>
      <rsc_order first="msPostgresql" first-action="promote" id="order-msPostgresql-master-group-INFINITY" score="INFINITY" symmetrical="false" then="master-group" then-action="start"/>
      <rsc_order first="msPostgresql" first-action="demote" id="order-msPostgresql-master-group-0" score="0" symmetrical="false" then="master-group" then-action="stop"/>
      <rsc_colocation id="colocation-slave-group-msPostgresql-INFINITY" rsc="slave-group" rsc-role="Started" score="INFINITY" with-rsc="msPostgresql" with-rsc-role="Slave"/>
      <rsc_order first="msPostgresql" first-action="promote" id="order-msPostgresql-slave-group-INFINITY" score="INFINITY" symmetrical="false" then="slave-group" then-action="start"/>
      <rsc_order first="msPostgresql" first-action="demote" id="order-msPostgresql-slave-group-0" score="0" symmetrical="false" then="slave-group" then-action="stop"/>
    </constraints>
    <rsc_defaults>
      <meta_attributes id="rsc_defaults-options">
        <nvpair id="rsc_defaults-options-resource-stickiness" name="resource-stickiness" value="INFINITY"/>
        <nvpair id="rsc_defaults-options-migration-threshold" name="migration-threshold" value="1"/>
      </meta_attributes>
    </rsc_defaults>
  </configuration>
  <status>
    <node_state id="2" uname="lhrpgpcs82" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
      <lrm id="2">
        <lrm_resources/>
      </lrm>
    </node_state>
    <node_state id="3" uname="lhrpgpcs83" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
      <lrm id="3">
        <lrm_resources/>
      </lrm>
    </node_state>
    <node_state id="1" uname="lhrpgpcs81" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
      <lrm id="1">
        <lrm_resources/>
      </lrm>
    </node_state>
  </status>
</cib>
[root@lhrpgpcs81 ~]#
```
Checking Cluster Status
Restart the cluster
The database itself does not need to be enabled at boot; the cluster software starts it automatically.
```
systemctl status pacemaker corosync pcsd
systemctl restart pacemaker corosync pcsd
pcs cluster start --all
pcs cluster status
pcs status
pcs status corosync
```

Output:
```
[root@lhrpgpcs81 ~]# pcs cluster start --all
lhrpgpcs81: Starting Cluster (corosync)...
lhrpgpcs82: Starting Cluster (corosync)...
lhrpgpcs83: Starting Cluster (corosync)...
lhrpgpcs83: Starting Cluster (pacemaker)...
lhrpgpcs81: Starting Cluster (pacemaker)...
lhrpgpcs82: Starting Cluster (pacemaker)...
[root@lhrpgpcs81 ~]#
[root@lhrpgpcs81 ~]# pcs cluster status
Cluster Status:
 Stack: corosync
 Current DC: lhrpgpcs81 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
 Last updated: Thu May 26 15:08:59 2022
 Last change: Thu May 26 15:08:05 2022 by root via crm_attribute on lhrpgpcs81
 3 nodes configured
 7 resource instances configured

PCSD Status:
  lhrpgpcs82: Online
  lhrpgpcs83: Online
  lhrpgpcs81: Online

[root@lhrpgpcs81 ~]#
[root@lhrpgpcs81 ~]# pcs status
Cluster name: pgcluster
Stack: corosync
Current DC: lhrpgpcs81 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Thu May 26 15:09:02 2022
Last change: Thu May 26 15:08:05 2022 by root via crm_attribute on lhrpgpcs81

3 nodes configured
7 resource instances configured

Online: [ lhrpgpcs81 lhrpgpcs82 lhrpgpcs83 ]

Full list of resources:

 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ lhrpgpcs81 ]
     Slaves: [ lhrpgpcs82 lhrpgpcs83 ]
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Started lhrpgpcs81
 Resource Group: slave-group
     vip-slave (ocf::heartbeat:IPaddr2): Started lhrpgpcs82

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@lhrpgpcs81 ~]# pcs status corosync

Membership information
----------------------
    Nodeid      Votes Name
         1          1 lhrpgpcs81 (local)
         2          1 lhrpgpcs82
         3          1 lhrpgpcs83
```
Note: the cluster is configured correctly when both vip-master and vip-slave are in the Started state. Otherwise, check your configuration for mistakes.
Check local cluster and resource status
```
[root@lhrpgpcs81 ~]# crm_mon -Afr -1
Stack: corosync
Current DC: lhrpgpcs81 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Thu May 26 15:09:44 2022
Last change: Thu May 26 15:08:05 2022 by root via crm_attribute on lhrpgpcs81

3 nodes configured
7 resource instances configured

Online: [ lhrpgpcs81 lhrpgpcs82 lhrpgpcs83 ]

Full list of resources:

 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ lhrpgpcs81 ]
     Slaves: [ lhrpgpcs82 lhrpgpcs83 ]
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Started lhrpgpcs81
 Resource Group: slave-group
     vip-slave (ocf::heartbeat:IPaddr2): Started lhrpgpcs82

Node Attributes:
* Node lhrpgpcs81:
    + master-pgsql                      : 1000
    + pgsql-data-status                 : LATEST
    + pgsql-master-baseline             : 00000000050000D0
    + pgsql-status                      : PRI
* Node lhrpgpcs82:
    + master-pgsql                      : -INFINITY
    + pgsql-data-status                 : STREAMING|ASYNC
    + pgsql-status                      : HS:async
* Node lhrpgpcs83:
    + master-pgsql                      : -INFINITY
    + pgsql-data-status                 : STREAMING|ASYNC
    + pgsql-status                      : HS:async

Migration Summary:
* Node lhrpgpcs82:
* Node lhrpgpcs83:
* Node lhrpgpcs81:
```
Check streaming replication status on the primary

```
su - postgres
psql -p 5432 -h /tmp

select * from pg_stat_replication;

  pid  | usesysid | usename | application_name | client_addr | client_hostname | client_port |         backend_start         | backend_xmin |   state   | sent_lsn  | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+---------+------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+-----------+-----------+-----------+------------+-----------+-----------+------------+---------------+------------
  5430 |    16419 | lhrrep  | lhrpgpcs83       | 172.72.6.83 |                 |       44416 | 2022-05-26 15:08:04.377407+08 |              | streaming | 0/9000060 | 0/9000060 | 0/9000060 | 0/9000060  |           |           |            |             0 | async
 10832 |    16419 | lhrrep  | lhrpgpcs82       | 172.72.6.82 |                 |       51560 | 2022-05-26 15:27:51.976977+08 |              | streaming | 0/9000060 | 0/9000060 | 0/9000060 | 0/9000060  |           |           |            |             0 | async
(2 rows)
```
Check where the VIPs are bound

```
[postgres@lhrpgpcs81 ~]$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
102: eth0@if103: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:48:06:51 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.72.6.81/24 brd 172.72.6.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet 172.72.6.84/24 brd 172.72.6.255 scope global secondary eth0
       valid_lft forever preferred_lft forever

[root@lhrpgpcs82 /]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
108: eth0@if109: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:48:06:52 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.72.6.82/24 brd 172.72.6.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet 172.72.6.85/24 brd 172.72.6.255 scope global secondary eth0
       valid_lft forever preferred_lft forever

[root@lhrpgpcs83 /]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
110: eth0@if111: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:48:06:53 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.72.6.83/24 brd 172.72.6.255 scope global eth0
       valid_lft forever preferred_lft forever
```
Verifying Data Synchronization
Insert data on the primary:

```
psql -h 192.168.88.35 -p64381 -U postgres
create table test(id int);
insert into test values(1);
```
Query from a standby:

```
psql -h 192.168.88.35 -p64382 -U postgres
select * from test;
```
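In practice, applications should connect through the VIPs rather than a fixed node address, so that sessions follow the roles after a failover. A hedged example using this environment's addresses, run from a host that can reach the 172.72.6.0/24 network (for example inside one of the containers):

```
# read-write traffic goes to vip-master, which always sits on the current primary
psql -h 172.72.6.84 -p 5432 -U postgres -c "insert into test values(2);"

# read-only traffic can go to vip-slave, which sits on one of the standbys
psql -h 172.72.6.85 -p 5432 -U postgres -c "select * from test;"
```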
Failure Simulation
Stop the primary; node 2 is promoted to primary and node 3 automatically re-points to node 2:
```
[root@lhrpgpcs81 ~]# su - postgres
Last login: Thu May 26 15:28:35 CST 2022 on pts/0
[postgres@lhrpgpcs81 ~]$ pg_ctl stop -D /pg11/pgdata
waiting for server to shut down.... done
server stopped
[postgres@lhrpgpcs81 ~]$ exit
logout
[root@lhrpgpcs81 ~]#
[root@lhrpgpcs81 ~]# pcs status
Cluster name: pgcluster
Stack: corosync
Current DC: lhrpgpcs81 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Thu May 26 15:35:58 2022
Last change: Thu May 26 15:30:16 2022 by root via crm_attribute on lhrpgpcs81

3 nodes configured
7 resource instances configured

Online: [ lhrpgpcs81 lhrpgpcs82 lhrpgpcs83 ]

Full list of resources:

 Master/Slave Set: msPostgresql [pgsql]
     Slaves: [ lhrpgpcs82 lhrpgpcs83 ]
     Stopped: [ lhrpgpcs81 ]
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Stopped
 Resource Group: slave-group
     vip-slave (ocf::heartbeat:IPaddr2): Started lhrpgpcs82

Failed Resource Actions:
* pgsql_monitor_3000 on lhrpgpcs81 'not running' (7): call=23, status=complete, exitreason='', last-rc-change='Thu May 26 15:35:48 2022', queued=0ms, exec=0ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@lhrpgpcs81 ~]# pcs status
Cluster name: pgcluster
Stack: corosync
Current DC: lhrpgpcs81 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Thu May 26 15:36:18 2022
Last change: Thu May 26 15:36:11 2022 by root via crm_attribute on lhrpgpcs82

3 nodes configured
7 resource instances configured

Online: [ lhrpgpcs81 lhrpgpcs82 lhrpgpcs83 ]

Full list of resources:

 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ lhrpgpcs82 ]
     Slaves: [ lhrpgpcs83 ]
     Stopped: [ lhrpgpcs81 ]
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Started lhrpgpcs82
 Resource Group: slave-group
     vip-slave (ocf::heartbeat:IPaddr2): Started lhrpgpcs83

Failed Resource Actions:
* pgsql_monitor_3000 on lhrpgpcs81 'not running' (7): call=23, status=complete, exitreason='', last-rc-change='Thu May 26 15:35:48 2022', queued=0ms, exec=0ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@lhrpgpcs81 ~]# crm_mon -Arf -1
Stack: corosync
Current DC: lhrpgpcs81 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Thu May 26 15:37:45 2022
Last change: Thu May 26 15:36:11 2022 by root via crm_attribute on lhrpgpcs82

3 nodes configured
7 resource instances configured

Online: [ lhrpgpcs81 lhrpgpcs82 lhrpgpcs83 ]

Full list of resources:

 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ lhrpgpcs82 ]
     Slaves: [ lhrpgpcs83 ]
     Stopped: [ lhrpgpcs81 ]
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Started lhrpgpcs82
 Resource Group: slave-group
     vip-slave (ocf::heartbeat:IPaddr2): Started lhrpgpcs83

Node Attributes:
* Node lhrpgpcs81:
    + master-pgsql                      : -INFINITY
    + pgsql-data-status                 : DISCONNECT
    + pgsql-status                      : STOP
* Node lhrpgpcs82:
    + master-pgsql                      : 1000
    + pgsql-data-status                 : LATEST
    + pgsql-master-baseline             : 0000000009011078
    + pgsql-status                      : PRI
* Node lhrpgpcs83:
    + master-pgsql                      : -INFINITY
    + pgsql-data-status                 : STREAMING|ASYNC
    + pgsql-status                      : HS:async

Migration Summary:
* Node lhrpgpcs82:
* Node lhrpgpcs83:
* Node lhrpgpcs81:
   pgsql: migration-threshold=1 fail-count=1 last-failure='Thu May 26 15:35:48 2022'

Failed Resource Actions:
* pgsql_monitor_3000 on lhrpgpcs81 'not running' (7): call=23, status=complete, exitreason='', last-rc-change='Thu May 26 15:35:48 2022', queued=0ms, exec=0ms

postgres=# select * from pg_stat_replication;
  pid  | usesysid | usename | application_name | client_addr | client_hostname | client_port |         backend_start         | backend_xmin |   state   | sent_lsn  | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-------+----------+---------+------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+-----------+-----------+-----------+------------+-----------+-----------+------------+---------------+------------
 19348 |    16419 | lhrrep  | lhrpgpcs83       | 172.72.6.83 |                 |       44920 | 2022-05-26 15:36:10.651723+08 |              | streaming | 0/9011120 | 0/9011120 | 0/9011120 | 0/9011120  |           |           |            |             0 | async
(1 row)
```
Repairing Node 1
When the failed master is brought back, the temporary lock file must be removed before the cluster can change its role: run rm -rf /var/lib/pgsql/tmp/PGSQL.lock.
You do not need to run pg_ctl start to bring the database back; just restart the cluster services (systemctl restart corosync pacemaker pcsd) and the cluster software will start the database automatically.
After node 1 comes back, it runs as a standby of node 2:
```
[root@lhrpgpcs81 ~]# find / -name PGSQL.lock
/var/lib/pgsql/tmp/PGSQL.lock
[root@lhrpgpcs81 ~]# rm -rf /var/lib/pgsql/tmp/PGSQL.lock
[root@lhrpgpcs81 ~]# systemctl restart corosync pacemaker pcsd
[root@lhrpgpcs81 ~]# pcs status
Cluster name: pgcluster
Stack: corosync
Current DC: lhrpgpcs83 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Thu May 26 15:59:08 2022
Last change: Thu May 26 15:58:46 2022 by root via crm_attribute on lhrpgpcs82

3 nodes configured
7 resource instances configured

Online: [ lhrpgpcs81 lhrpgpcs82 lhrpgpcs83 ]

Full list of resources:

 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ lhrpgpcs82 ]
     Slaves: [ lhrpgpcs81 lhrpgpcs83 ]
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Started lhrpgpcs82
 Resource Group: slave-group
     vip-slave (ocf::heartbeat:IPaddr2): Started lhrpgpcs83

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[root@lhrpgpcs81 ~]# crm_mon -Arf -1
Stack: corosync
Current DC: lhrpgpcs83 (version 1.1.23-1.el7_9.1-9acf116022) - partition with quorum
Last updated: Thu May 26 16:00:00 2022
Last change: Thu May 26 15:58:46 2022 by root via crm_attribute on lhrpgpcs82

3 nodes configured
7 resource instances configured

Online: [ lhrpgpcs81 lhrpgpcs82 lhrpgpcs83 ]

Full list of resources:

 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ lhrpgpcs82 ]
     Slaves: [ lhrpgpcs81 lhrpgpcs83 ]
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Started lhrpgpcs82
 Resource Group: slave-group
     vip-slave (ocf::heartbeat:IPaddr2): Started lhrpgpcs83

Node Attributes:
* Node lhrpgpcs81:
    + master-pgsql                      : -INFINITY
    + pgsql-data-status                 : STREAMING|ASYNC
    + pgsql-status                      : HS:async
* Node lhrpgpcs82:
    + master-pgsql                      : 1000
    + pgsql-data-status                 : LATEST
    + pgsql-master-baseline             : 0000000005000140
    + pgsql-status                      : PRI
* Node lhrpgpcs83:
    + master-pgsql                      : -INFINITY
    + pgsql-data-status                 : STREAMING|ASYNC
    + pgsql-status                      : HS:async

Migration Summary:
* Node lhrpgpcs81:
* Node lhrpgpcs82:
* Node lhrpgpcs83:
```
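If you would rather not restart the whole cluster stack on the repaired node, clearing the recorded failure (after removing the PGSQL.lock file as described above) is usually enough for Pacemaker to start the resource there again; this is standard Pacemaker housekeeping rather than anything specific to this setup:

```
# clear the failure history of the pgsql resource so node 1 becomes eligible again
pcs resource cleanup pgsql

# confirm the node has rejoined as a standby
crm_mon -Afr -1
```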
Common Cluster Operations

```
pcs status                           # check cluster status
pcs resource show                    # list configured resources
pcs resource create ClusterIP IPaddr2 ip=192.168.0.120 cidr_netmask=32   # create a virtual-IP resource
pcs resource cleanup <resource_id>   # when a resource is failed/unmanaged, clean up the failure records and reset its state
pcs resource list                    # list available resource agents
pcs resource restart <resource_id>   # restart a resource
pcs resource enable <resource_id>    # enable (start) a resource
pcs resource disable <resource_id>   # disable (stop) a resource
pcs resource delete <resource_id>    # delete a resource
crm_mon -Arf -1                      # show replication state and resources
```
Modifying the Cluster Configuration

```
cibadmin --query > tmp.xml              # dump the current cluster configuration to tmp.xml
vi tmp.xml                              # edit the XML file
cibadmin --replace --xml-file tmp.xml   # replace the current cluster configuration with the modified XML
```
cibadmin is a low-level administrative command for manipulating the CIB (Cluster Information Base). It can dump, update, or modify all or part of the CIB, delete the entire CIB, or perform various other CIB administration operations.
The cluster configuration is the key component of the CIB in a Pacemaker cluster: it determines how the cluster should behave and what its final running state is, because only a correct configuration can drive resources into the desired state. A cluster configuration normally consists of four sections: cluster options (crm_config), cluster nodes (nodes), cluster resources (resources), and resource constraints (constraints), all of which can be inspected with cibadmin --query.
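For example, instead of dumping the whole CIB you can pull back just one of those four sections with cibadmin's scope option:

```
# dump only the resources section of the CIB
cibadmin --query --scope resources

# dump only the constraints section
cibadmin --query --scope constraints
```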
Parameter Walkthrough

```
# save the CIB configuration to a file
pcs cluster cib pgsql_cfg
# ignore quorum at the Pacemaker level
pcs -f pgsql_cfg property set no-quorum-policy="ignore"
# disable STONITH
pcs -f pgsql_cfg property set stonith-enabled="false"
# set resource stickiness to keep resources from migrating back after a node recovers
pcs -f pgsql_cfg resource defaults resource-stickiness="INFINITY"
# set how many failures trigger a migration
pcs -f pgsql_cfg resource defaults migration-threshold="3"
# define the master virtual IP
pcs -f pgsql_cfg resource create vip-master IPaddr2 ip="192.168.174.210" cidr_netmask="24" op start timeout="60s" interval="0s" on-fail="restart" op monitor timeout="60s" interval="10s" on-fail="restart" op stop timeout="60s" interval="0s" on-fail="block"
# define the slave virtual IP
pcs -f pgsql_cfg resource create vip-slave IPaddr2 ip="192.168.174.211" cidr_netmask="24" op start timeout="60s" interval="0s" on-fail="restart" op monitor timeout="60s" interval="10s" on-fail="restart" op stop timeout="60s" interval="0s" on-fail="block"
# define the pgsql cluster resource
# adjust pgctl, psql, pgdata, config, etc. to your environment; node_list takes the node hostnames and master_ip takes the master VIP
pcs -f pgsql_cfg resource create pgsql pgsql pgctl="/usr/bin/pg_ctl" psql="/usr/bin/psql" pgdata="/var/lib/pgsql/data/" config="/var/lib/pgsql/data/postgresql.conf" rep_mode="sync" node_list="master slave1" master_ip="192.168.174.210" repuser="replicator" primary_conninfo_opt="password=123456 keepalives_idle=60 keepalives_interval=5 keepalives_count=5" restore_command="cp /var/lib/data/pgsql/xlog_archive/%f %p" restart_on_promote='true' op start timeout="60s" interval="0s" on-fail="restart" op monitor timeout="60s" interval="4s" on-fail="restart" op monitor timeout="60s" interval="3s" on-fail="restart" role="Master" op promote timeout="60s" interval="0s" on-fail="restart" op demote timeout="60s" interval="0s" on-fail="stop" op stop timeout="60s" interval="0s" on-fail="block"
# set up master/slave mode; clone-max=2 for two nodes
pcs -f pgsql_cfg resource master pgsql-cluster pgsql master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
# create the master IP group
pcs -f pgsql_cfg resource group add master-group vip-master
# create the slave IP group
pcs -f pgsql_cfg resource group add slave-group vip-slave
# bind the master IP group to the master node
pcs -f pgsql_cfg constraint colocation add master-group with master pgsql-cluster INFINITY
# start the master group after promotion
pcs -f pgsql_cfg constraint order promote pgsql-cluster then start master-group symmetrical=false score=INFINITY
# stop the master group after demotion
pcs -f pgsql_cfg constraint order demote pgsql-cluster then stop master-group symmetrical=false score=0
# bind the slave IP group to a slave node
pcs -f pgsql_cfg constraint colocation add slave-group with slave pgsql-cluster INFINITY
# start the slave group after promotion
pcs -f pgsql_cfg constraint order promote pgsql-cluster then start slave-group symmetrical=false score=INFINITY
# stop the slave group after demotion
pcs -f pgsql_cfg constraint order demote pgsql-cluster then stop slave-group symmetrical=false score=0
# push the configuration file into the CIB
pcs cluster cib-push pgsql_cfg
```
Summary
1. The yum-installed Pacemaker is version 1.1.23, which for PostgreSQL supports at most PG 11. Because PG 12 changed how standbys are configured, when Pacemaker starts PostgreSQL it still generates a recovery.conf file, so PG 12 fails with "FATAL: using recovery command file "recovery.conf" is not supported".
```
[root@lhrpgpcs82 pgdata]# more recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=172.72.6.84 port=5432 user=lhrrep application_name=lhrpgpcs82 password=lhr keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
restore_command = ''
recovery_target_timeline = 'latest'
```
Even after I compiled and installed the latest pacemaker-Pacemaker-2.1.3-rc2.tar.gz, the recovery.conf file was still generated (this may well be a limitation on my side; corrections are welcome). I therefore recommend using repmgr or Patroni for automatic failover management of PG 12 and later.
Compiling and installing Pacemaker 2.1.3 on CentOS 7.6: https://www.xmmup.com/zaicentos-7-6shangbianyianzhuangpacemaker-2-1-3.html
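For context, PG 12 removed recovery.conf entirely: a standby is marked by an empty standby.signal file and the recovery settings move into postgresql.conf / postgresql.auto.conf. A hand-built equivalent of the generated recovery.conf above would look roughly like this (the /pg12/pgdata path is illustrative only; the pgsql resource agent itself does not do this, which is why it fails on PG 12):

```
# PG 12+ equivalent of the recovery.conf shown above (illustrative data-directory path)
touch /pg12/pgdata/standby.signal

cat >> /pg12/pgdata/postgresql.auto.conf <<"EOF"
primary_conninfo = 'host=172.72.6.84 port=5432 user=lhrrep application_name=lhrpgpcs82 password=lhr keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
restore_command = ''
recovery_target_timeline = 'latest'
EOF
```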
2. The $PGDATA directory must be owned by the postgres user.
3. Do not set the unix_socket_directories parameter in postgresql.conf; otherwise startup fails with a permission error.
4. The shared memory segment needs to be enlarged: mount -o remount,size=4G /dev/shm
5. Log files:
```
tailf /var/log/messages
tailf /var/log/pacemaker.log
tailf /var/log/cluster/corosync.log
```