Monitoring Greenplum with Prometheus + Grafana
Tags: Grafana, GreenPlum, greenplum_exporter, Prometheus, PrometheusAlert, alert emails, monitoring, DingTalk alerts
Introduction to Prometheus and Grafana
1. About Prometheus
Prometheus is an open-source monitoring/alerting system and time-series database (TSDB) originally developed at SoundCloud, written in Go. It has a very active open-source community, and its performance is sufficient for clusters of tens of thousands of hosts. Its architecture consists of the following components:
- Prometheus Server: pulls monitoring data from exporters, stores it, and provides the flexible PromQL query language (see the query sketch after this list).
- Exporters: collect performance data from target objects (hosts, containers, ...) and expose it over HTTP for the Prometheus server to scrape.
- Visualization: visualizing the monitoring data is essential to any monitoring solution. Prometheus once shipped its own console tool but abandoned it in favor of Grafana, a better product from the open-source community that integrates seamlessly with Prometheus and offers excellent dashboards.
- Alertmanager: users define alerting rules over the monitoring data, and the rules trigger alerts. Once Alertmanager receives an alert, it sends notifications through predefined channels such as Email, PagerDuty, and webhooks.
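As a minimal sketch of the pull model and PromQL (assuming a Prometheus server is already running on localhost:9090, as set up later in this post), metrics can be queried straight from the shell via the HTTP API; `up` is a metric Prometheus itself records for every scrape target:

```bash
# Instant query via the HTTP API; "up" is 1 for targets whose last scrape succeeded
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up'
# Any PromQL expression works the same way, e.g. targets that are currently down:
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up == 0'
```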
2. About Grafana
Grafana is a cross-platform, open-source metrics analysis and visualization tool: it queries collected data, visualizes it, and sends timely notifications. Its six main strengths are:
- 1. Visualization: fast and flexible client-side graphs, with panel plugins offering many ways to render metrics and logs; the official library provides a rich set of dashboard panels such as heatmaps, line charts, and tables.
- 2. Data sources: Graphite, InfluxDB, OpenTSDB, Prometheus, Elasticsearch, CloudWatch, KairosDB, and more.
- 3. Notifications: define alert rules on your most important metrics visually; Grafana evaluates them continuously and notifies you via Slack, PagerDuty, and others when thresholds are crossed.
- 4. Mixed data sources: mix different data sources in the same graph, choosing the data source per query or even defining custom ones.
- 5. Annotations: annotate graphs with rich events from different data sources; hovering over an event shows its full metadata and tags.
- 6. Filters: ad-hoc filters let you create key/value filters on the fly, automatically applied to all queries that use that data source.
Implementing Greenplum monitoring
Greenplum can be monitored much like PostgreSQL, with a few differences:
- implement a Greenplum-specific exporter to collect the metrics;
- build a visualization dashboard with Grafana;
- configure alerting rules in Prometheus.
1. The Greenplum exporter
Following the approach of the PostgreSQL exporter, a Greenplum exporter was implemented; the project lives at:
https://github.com/tangyibo/greenplum_exporter
greenplum_exporter adds metrics for client connections, per-account connections, segment storage, cluster synchronization status, database locks, and more. The supported metrics are listed below.
2. Supported metrics
No. | Metric | Type | Labels | Unit | Description | Source query | GP version
---|---|---|---|---|---|---|---
1 | greenplum_cluster_state | Gauge | version; master (master hostname); standby (standby hostname) | boolean | Whether the cluster is reachable: 1 = up; 0 = down | SELECT count(*) from gp_dist_random('gp_id'); select version(); SELECT hostname from gp_segment_configuration where content=-1 and role='p'; | ALL
2 | greenplum_cluster_uptime | Gauge | - | int | Time since the instance started | select extract(epoch from now() - pg_postmaster_start_time()); | ALL
3 | greenplum_cluster_sync | Gauge | - | int | Master-to-standby sync state: 1 = normal; 0 = abnormal | SELECT count(*) from pg_stat_replication where state='streaming' | ALL
4 | greenplum_cluster_max_connections | Gauge | - | int | Maximum number of connections | show max_connections; show superuser_reserved_connections; | ALL
5 | greenplum_cluster_total_connections | Gauge | - | int | Current number of connections | select count(*) total, count(*) filter(where current_query='') idle, count(*) filter(where current_query<>'') active, count(*) filter(where current_query<>'' and not waiting) running, count(*) filter(where current_query<>'' and waiting) waiting from pg_stat_activity where procpid <> pg_backend_pid(); | ALL
6 | greenplum_cluster_idle_connections | Gauge | - | int | Number of idle connections | same as No. 5 | ALL
7 | greenplum_cluster_active_connections | Gauge | - | int | Number of active queries | same as No. 5 | ALL
8 | greenplum_cluster_running_connections | Gauge | - | int | Number of queries currently executing | same as No. 5 | ALL
9 | greenplum_cluster_waiting_connections | Gauge | - | int | Number of queries waiting to execute | same as No. 5 | ALL
10 | greenplum_node_segment_status | Gauge | hostname; address; dbid; content; preferred_role; port; replication_port | int | Segment status: 1 (U) = up; 0 (D) = down | select * from gp_segment_configuration; | ALL
11 | greenplum_node_segment_role | Gauge | hostname; address; dbid; content; preferred_role; port; replication_port | int | Segment role: 1 (P) = primary; 2 (M) = mirror | same as No. 10 | ALL
12 | greenplum_node_segment_mode | Gauge | hostname; address; dbid; content; preferred_role; port; replication_port | int | Segment mode: 1 (S) = Synced; 2 (R) = Resyncing; 3 (C) = Change Tracking; 4 (N) = Not Syncing | same as No. 10 | ALL
13 | greenplum_node_segment_disk_free_mb_size | Gauge | hostname | MB | Free disk space per segment host (MB) | SELECT dfhostname as segment_hostname, sum(dfspace)/count(dfspace)/(1024*1024) as segment_disk_free_gb from gp_toolkit.gp_disk_free GROUP BY dfhostname | ALL
14 | greenplum_cluster_total_connections_per_client | Gauge | client | int | Total connections per client | select usename, count(*) total, count(*) filter(where current_query='') idle, count(*) filter(where current_query<>'') active from pg_stat_activity group by 1; | ALL
15 | greenplum_cluster_idle_connections_per_client | Gauge | client | int | Idle connections per client | same as No. 14 | ALL
16 | greenplum_cluster_active_connections_per_client | Gauge | client | int | Active connections per client | same as No. 14 | ALL
17 | greenplum_cluster_total_online_user_count | Gauge | - | int | Number of accounts currently online | same as No. 14 | ALL
18 | greenplum_cluster_total_client_count | Gauge | - | int | Number of currently connected clients | same as No. 14 | ALL
19 | greenplum_cluster_total_connections_per_user | Gauge | usename | int | Total connections per account | select client_addr, count(*) total, count(*) filter(where current_query='') idle, count(*) filter(where current_query<>'') active from pg_stat_activity group by 1; | ALL
20 | greenplum_cluster_idle_connections_per_user | Gauge | usename | int | Idle connections per account | same as No. 19 | ALL
21 | greenplum_cluster_active_connections_per_user | Gauge | usename | int | Active connections per account | same as No. 19 | ALL
22 | greenplum_cluster_config_last_load_time_seconds | Gauge | - | int | Time the server configuration was last loaded | SELECT pg_conf_load_time() | Only GPOSS6 and GPDB6
23 | greenplum_node_database_name_mb_size | Gauge | dbname | MB | Storage used by each database | SELECT dfhostname as segment_hostname, sum(dfspace)/count(dfspace)/(1024*1024) as segment_disk_free_gb from gp_toolkit.gp_disk_free GROUP BY dfhostname | ALL
24 | greenplum_node_database_table_total_count | Gauge | dbname | - | Total number of tables in each database | SELECT count(*) as total from information_schema.tables where table_schema not in ('gp_toolkit','information_schema','pg_catalog'); | ALL
25 | greenplum_exporter_total_scraped | Counter | - | int | - | - | ALL
26 | greenplum_exporter_total_error | Counter | - | int | - | - | ALL
27 | greenplum_exporter_scrape_duration_second | Gauge | - | int | - | - | ALL
28 | greenplum_server_users_name_list | Gauge | - | int | List of user names | SELECT usename from pg_catalog.pg_user; | ALL
29 | greenplum_server_users_total_count | Gauge | - | int | Total number of users | same as No. 28 | ALL
30 | greenplum_server_locks_table_detail | Gauge | pid; datname; usename; locktype; mode; application_name; state; lock_satus; query | int | Lock details | SELECT * from pg_locks | ALL
31 | greenplum_server_database_hit_cache_percent_rate | Gauge | - | float | Buffer cache hit ratio | select sum(blks_hit)/(sum(blks_read)+sum(blks_hit))*100 from pg_stat_database; | ALL
32 | greenplum_server_database_transition_commit_percent_rate | Gauge | - | float | Transaction commit ratio | select sum(xact_commit)/(sum(xact_commit)+sum(xact_rollback))*100 from pg_stat_database; | ALL
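Once Prometheus scrapes the exporter (configured in a later section), these metrics combine with ordinary PromQL. A hedged sketch of two spot checks, assuming a Prometheus server on localhost:9090:

```bash
# Cluster reachability (1 = up, 0 = down)
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=greenplum_cluster_state'
# Connection usage as a percentage of max_connections
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=100 * greenplum_cluster_total_connections / greenplum_cluster_max_connections'
```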
3. Building a Grafana dashboard
Based on the metrics above, a Grafana dashboard can be configured; see:
https://github.com/tangyibo/greenplum_exporter/blob/master/grafana/greenplum_dashboard.json
Installing the Greenplum monitoring stack
This section walks through the installation on CentOS 7.
Installing Prometheus
Download:
https://prometheus.io/download/
https://github.com/prometheus/prometheus
```bash
docker run -d --name lhrgpPromethues -h lhrgpPromethues \
  --net=lhrnw --ip 172.72.6.50 \
  -p 2222:22 -p 23389:3389 -p 29090:9090 -p 29093:9093 -p 23000:3000 -p 29297:9297 \
  -v /sys/fs/cgroup:/sys/fs/cgroup \
  --privileged=true \
  lhrbest/lhrcentos76:9.0 \
  /usr/sbin/init

docker exec -it lhrgpPromethues bash

cd /soft
# wget https://ghproxy.com/https://github.com/prometheus/prometheus/releases/download/v2.42.0/prometheus-2.42.0.linux-amd64.tar.gz
wget https://github.com/prometheus/prometheus/releases/download/v2.42.0/prometheus-2.42.0.linux-amd64.tar.gz
tar -zxvf prometheus-2.42.0.linux-amd64.tar.gz -C /usr/local/
ln -s /usr/local/prometheus-2.42.0.linux-amd64 /usr/local/prometheus
ln -s /usr/local/prometheus/prometheus /usr/local/bin/prometheus

# Run directly, or create a systemd service (see below)
prometheus --config.file=/usr/local/prometheus/prometheus.yml --storage.tsdb.path=/usr/local/prometheus/data/ --web.enable-lifecycle --storage.tsdb.retention.time=60d &

lsof -i:9090
netstat -tulnp | grep 9090

# Web UI: http://192.168.8.8:29090
```
- `--web.enable-lifecycle`: enables hot-reloading the configuration without restarting Prometheus, triggered with `curl -X POST http://ip:9090/-/reload`.
- `--storage.tsdb.retention.time`: data is kept for 15 days by default; this flag controls the retention period.
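Before starting or hot-reloading, the configuration can be validated with promtool, which ships in the same tarball:

```bash
# Syntax-check prometheus.yml before (re)loading it
/usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml
```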
Creating a systemd service
```bash
cat > /etc/systemd/system/prometheus.service <<EOF
[Unit]
Description=prometheus
After=network.target

[Service]
Type=simple
User=root
ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --storage.tsdb.path=/usr/local/prometheus/data/ --web.enable-lifecycle --storage.tsdb.retention.time=60d
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
```
Starting Prometheus
```bash
systemctl daemon-reload
systemctl start prometheus
systemctl status prometheus
```
Accessing the web UI
Visit the following address to verify the installation: http://192.168.8.8:29090
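Prometheus also exposes simple health endpoints that are handy for scripted checks:

```bash
# Both should return HTTP 200 once the server is up and ready to serve queries
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9090/-/healthy
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9090/-/ready
```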
Installing greenplum_exporter
- Github: https://github.com/tangyibo/greenplum_exporter
- Gitee: https://gitee.com/inrgihc/greenplum_exporter
```bash
wget https://github.com/tangyibo/greenplum_exporter/releases/download/v1.1/greenplum_exporter-1.1-rhel7.x86_64.rpm
rpm -ivh greenplum_exporter-1.1-rhel7.x86_64.rpm

echo 'GPDB_DATA_SOURCE_URL=postgres://gpadmin:lhr@172.72.6.40:5432/postgres?sslmode=disable' > /usr/local/greenplum_exporter/etc/greenplum.conf
# Format: postgres://[user, must be gpadmin]:[gpadmin's password]@[database IP]:[database port]/[database, must be postgres]?[param]=[value]&[param]=[value]

systemctl enable greenplum_exporter
systemctl start greenplum_exporter
systemctl status greenplum_exporter

netstat -tulnp | grep 9297
curl http://172.72.6.50:9297/metrics
```
Notes:
1. The exporter supports Greenplum 5.x and 6.x.
2. By default it installs to /usr/local/greenplum_exporter.
Visit the following address to verify the installation:
http://192.168.8.8:29297/metrics
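A quick sanity check from the shell, filtering for a few of the metric names listed in the table above:

```bash
curl -s http://192.168.8.8:29297/metrics | grep -E '^greenplum_cluster_(state|uptime|total_connections)'
```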
To monitor multiple Greenplum clusters, configure one service per cluster:
```bash
echo 'GPDB_DATA_SOURCE_URL=postgres://gpadmin:lhr@172.72.6.40:5432/postgres?sslmode=disable' > /usr/local/greenplum_exporter/etc/greenplum_172.18.0.15.conf

cat > /etc/systemd/system/greenplum_exporter_172.18.0.15.service <<"EOF"
[Unit]
Description=greenplum exporter
After=network.target

[Service]
Type=simple
User=prometheus
WorkingDirectory=/usr/local/greenplum_exporter/
EnvironmentFile=/usr/local/greenplum_exporter/etc/greenplum_172.18.0.15.conf
ExecStart=/usr/local/greenplum_exporter/bin/greenplum_exporter --web.listen-address=0.0.0.0:9298 --log.level=error
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl start greenplum_exporter_172.18.0.15
systemctl status greenplum_exporter_172.18.0.15

# test
curl http://127.0.0.1:9298/metrics
netstat -tulnp | grep 929
```
Be sure to give each instance a different web.listen-address port.
The exporter can also be run directly from the command line:
```bash
export GPDB_DATA_SOURCE_URL=postgres://gpadmin:password@10.17.20.11:5432/postgres?sslmode=disable
./greenplum_exporter --web.listen-address="0.0.0.0:9297" --web.telemetry-path="/metrics" --log.level=error
```
Multiple instances can also be run with Docker:
```bash
docker run -d -p 9297:9297 \
  -e GPDB_DATA_SOURCE_URL=postgres://gpadmin:password@10.17.20.11:5432/postgres?sslmode=disable \
  inrgihc/greenplum-exporter:latest
```
Configuring Prometheus
1. Add greenplum_exporter to the scrape target list in prometheus.yml
```yaml
# vim /usr/local/prometheus/prometheus.yml
  - job_name: 'greenplum'
    static_configs:
      - targets: ['127.0.0.1:9297']
        labels:
          instance: GPDB_GP40
```
Hot-reload the configuration via the API, or simply restart Prometheus:
```bash
curl -XPOST http://localhost:9090/-/reload
```
Visit: http://192.168.8.8:29090/targets?search=
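The same target health is available from the HTTP API, which is convenient for scripting (the jq filter here is just one way to slice the response):

```bash
# Requires jq; shows each scrape target's job and health ("up"/"down")
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
```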
Installing Grafana
Download: https://grafana.com/grafana/download?pg=get&plcmt=selfmanaged-box1-cta1
```bash
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-9.3.6-1.x86_64.rpm
sudo yum install -y grafana-enterprise-9.3.6-1.x86_64.rpm

systemctl daemon-reload
systemctl enable grafana-server.service
systemctl start grafana-server
systemctl status grafana-server
```
Visit: http://192.168.8.8:23000/?orgId=1 (default credentials: admin/admin)
Configuring Grafana
After logging in to Grafana as admin, configure the following:
① Add a Prometheus data source
Set the URL to: http://127.0.0.1:9090
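Alternatively, the data source can be declared in a file instead of clicking through the UI; a minimal sketch using Grafana's standard file-based provisioning (the file name is arbitrary):

```bash
cat > /etc/grafana/provisioning/datasources/prometheus.yaml <<"EOF"
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://127.0.0.1:9090
    isDefault: true
EOF
systemctl restart grafana-server
```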
② Import the Greenplum dashboard
Grafana Dashboard ID: 13822
Grafana Dashboard URL: https://grafana.com/grafana/dashboards/13822
Paste the JSON from https://github.com/tangyibo/greenplum_exporter/blob/master/grafana/greenplum_dashboard.json into the import dialog, or simply enter the dashboard ID 13822 and click Load.
After importing, the monitoring dashboard appears.
Monitoring the Greenplum hosts
Install node_exporter on all Greenplum hosts (master and segments).
Download: https://prometheus.io/download/
```bash
# wget https://ghproxy.com/https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
tar -zxvf node_exporter-1.5.0.linux-amd64.tar.gz
mv ./node_exporter-1.5.0.linux-amd64/node_exporter /usr/local/bin/

# node_exporter --help
nohup /usr/local/bin/node_exporter &
lsof -i:9100

# Create the node_exporter.service unit
cat > /etc/systemd/system/node_exporter.service << EOF
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
User=root
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

systemctl enable node_exporter
systemctl start node_exporter
systemctl status node_exporter
```
Add the hosts to the Prometheus configuration:
```yaml
# vim /usr/local/prometheus/prometheus.yml
  - job_name: 'GreenPlum_nodes'
    static_configs:
      - targets: ['172.72.6.40:9100']
        labels:
          instance: mdw_172.72.6.40
      - targets: ['172.72.6.41:9100']
        labels:
          instance: mdw_172.72.6.41
      - targets: ['172.72.6.42:9100']
        labels:
          instance: mdw_172.72.6.42
```
Hot-reload the configuration via the API, or simply restart Prometheus:
```bash
curl -XPOST http://localhost:9090/-/reload
```
Visit: http://192.168.8.8:29090/targets?search=
Finally, import host-monitoring dashboard templates in Grafana: 16098, 1860, 12633.
Configuring alerting
Installing Alertmanager and PrometheusAlert
PrometheusAlert is used here because it simplifies notification configuration and is less error-prone; see https://www.xmmup.com/jiankongyunweigaojinggongjuzhiprometheusalert.html.
Download:
https://prometheus.io/download/#alertmanager
```bash
# alertmanager
wget https://ghproxy.com/https://github.com/prometheus/alertmanager/releases/download/v0.25.0/alertmanager-0.25.0.linux-amd64.tar.gz
tar -zxvf alertmanager-0.25.0.linux-amd64.tar.gz -C /usr/local/
ln -s /usr/local/alertmanager-0.25.0.linux-amd64 /usr/local/alertmanager
ln -s /usr/local/alertmanager-0.25.0.linux-amd64/alertmanager /usr/local/bin/alertmanager

# PrometheusAlert: open the releases page, pick the build for your platform, download and unzip,
# e.g. the Linux build (https://github.com/feiyu563/PrometheusAlert/releases/download/v4.8.2/linux.zip)
wget https://github.com/feiyu563/PrometheusAlert/releases/download/v4.8.2/linux.zip
unzip linux.zip
mv linux /usr/local/prometheusAlert
cd /usr/local/prometheusAlert/
chmod +x PrometheusAlert

# Run PrometheusAlert
./PrometheusAlert    # to run in the background: nohup ./PrometheusAlert &

# Once started, open http://127.0.0.1:8080 in a browser.
# The default login and password are configured in app.conf.
vi /usr/local/prometheusAlert/conf/app.conf

# DingTalk alert logo URL
logourl=https://www.malibucity.org/images/GraphicLinks/2/button%20template%20interior_alert_1.png
# DingTalk recovery logo URL
rlogourl=https://www.malibucity.org/images/GraphicLinks/2/button%20template%20interior_alert_1.png
# Enable alert history: 0 = off, 1 = on
AlertRecord=1
# Enable scheduled deletion of alert history: 0 = off, 1 = on
RecordLive=1
# Retention period for alert history, in days
RecordLiveDay=7
#---------------------↓webhook-----------------------
# Enable the DingTalk channel (multiple channels may be enabled at once): 0 = off, 1 = on
open-dingding=1
# Default DingTalk robot URL
ddurl=https://oapi.dingtalk.com/robot/send?access_token=XXXXXXXXX
# @everyone (0 = off, 1 = on)
dd_isatall=1
#---------------------↓email-----------------------
# Enable email
open-email=1
# SMTP server address
Email_host=smtp.qq.com
# SMTP server port
Email_port=465
# Email account
Email_user=2168723934@qq.com
# Email password
Email_password=XXXXXXXX
# Email subject
Email_title=Prometheus Ops Alert
# Default recipients
Default_emails=lhrbest@qq.com,646634621@qq.com

# alertmanager configuration
cat > /usr/local/alertmanager/alertmanager.yml <<"EOF"
global:
  resolve_timeout: 5m
route:
  group_by: ['instance']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10m
  receiver: 'web.hook.prometheusalert'
receivers:
  - name: 'web.hook.prometheusalert'
    webhook_configs:
      - url: 'http://127.0.0.1:8080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=XXXXXXXXXXXX&at=1828888888'
      - url: 'http://127.0.0.1:8080/prometheusalert?type=email&tpl=prometheus-email&email=lhrbest@qq.com,646634621@qq.com'
EOF

# alertmanager configuration (recommended: leave recipients to PrometheusAlert's routing instead)
cat > /usr/local/alertmanager/alertmanager.yml <<"EOF"
global:
  resolve_timeout: 5m
route:
  group_by: ['instance']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 60m
  receiver: 'web.hook.prometheusalert'
receivers:
  - name: 'web.hook.prometheusalert'
    webhook_configs:
      - url: 'http://127.0.0.1:8080/prometheusalert?type=dd&tpl=prometheus-dd'
      - url: 'http://127.0.0.1:8080/prometheusalert?type=email&tpl=prometheus-email'
EOF

# Start alertmanager:
nohup /usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml --cluster.advertise-address="0.0.0.0:9093" &

[root@lhrdb soft]# netstat -tulnp | grep 9093
tcp6       0      0 :::9093                 :::*                    LISTEN      3074174/alertmanage
[root@lhrprometheus ~]# lsof -i:9093
COMMAND     PID USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
alertmana 21970 root    8u  IPv6 39999134      0t0  TCP *:copycat (LISTEN)
```
Configuring Alertmanager in Prometheus
```yaml
# vi /usr/local/prometheus/prometheus.yml
# Add the Alertmanager-related configuration:
rule_files:
  - "/usr/local/prometheus/*rule.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['127.0.0.1:9093']
```

```bash
curl -XPOST http://localhost:9090/-/reload
```
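After the reload, confirm that Prometheus has discovered Alertmanager:

```bash
# The activeAlertmanagers list should contain 127.0.0.1:9093
curl -s http://localhost:9090/api/v1/alertmanagers
```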
Customizing the alert templates
Edit them in the PrometheusAlert web UI:
DingTalk alert template
```
{{ $var := .externalURL}}{{ range $k,$v:=.alerts }}
{{if eq $v.status "resolved"}}
## [Prometheus recovery notice]({{$v.generatorURL}})
#### [{{$v.labels.alertname}}]({{$var}})
##### Current time: {{GetCSTtime ""}}
- Severity: {{$v.labels.severity}}
- Start time: {{GetCSTtime $v.startsAt}}
- End time: {{GetCSTtime $v.endsAt}}
- Affected host: **{{$v.labels.instance}}**
- Details: {{$v.annotations.description}}
![Prometheus]()
{{else}}
## [Prometheus alert notice]({{$v.generatorURL}})
#### [{{$v.labels.alertname}}]({{$var}})
##### Current time: {{GetCSTtime ""}}
- Severity: {{$v.labels.severity}}
- Start time: {{GetCSTtime $v.startsAt}}
- End time: {{""}}
- Affected host: **{{$v.labels.instance}}**
- Details: {{$v.annotations.description}}
##
![Prometheus]()
{{end}}
{{ end }}
{{ $urimsg:=""}}{{ range $key,$value:=.commonLabels }}{{$urimsg = print $urimsg $key "%3D%22" $value "%22%2C" }}{{end}}[★★★ Click here to silence this alert ★★★]({{$var}}/#/silences/new?filter=%7B{{SplitString $urimsg 0 -3}}%7D)
```
Email alert template
```
{{ $var := .externalURL}}{{ range $k,$v:=.alerts }}
{{if eq $v.status "resolved"}}
<h1><a href={{$v.generatorURL}}>Prometheus recovery notice</a></h1>
<h2><a href={{$var}}>{{$v.labels.alertname}}</a></h2>
<h5>Severity: {{$v.labels.severity}}</h5>
<h5>Start time: {{GetCSTtime $v.startsAt}}</h5>
<h5>End time: {{GetCSTtime $v.endsAt}}</h5>
<h5>Affected host: {{$v.labels.instance}}</h5>
<h5>Details: {{$v.annotations.description}}</h5>
<img src=https://www.malibucity.org/images/GraphicLinks/2/button%20template%20interior_alert_1.png />
{{else}}
<h1><a href={{$v.generatorURL}}>Prometheus alert notice</a></h1>
<h2><a href={{$var}}>{{$v.labels.alertname}}</a></h2>
<h5>Severity: {{$v.labels.severity}}</h5>
<h5>Start time: {{GetCSTtime $v.startsAt}}</h5>
<h5>End time: {{""}}</h5>
<h5>Affected host: {{$v.labels.instance}}</h5>
<h5>Details: {{$v.annotations.description}}</h5>
<img src=https://www.malibucity.org/images/GraphicLinks/2/button%20template%20interior_alert_1.png />
{{end}}
{{ end }}
```
Customizing alert routing
If no DingTalk or email recipients are configured in /usr/local/alertmanager/alertmanager.yml, or if notifications need to be customized per alert, configure alert routing in PrometheusAlert; see https://www.xmmup.com/jiankongyunweigaojinggongjuzhiprometheusalert.html#gao_jing_lu_you_gong_neng.
Alert routing filters on the labels carried in Prometheus alert messages and forwards each alert to a specific template and that template's receivers; the overall design is similar to Alertmanager's routing.
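For comparison, Alertmanager's own routing achieves a similar effect by matching on labels. A hedged sketch only (the `dingtalk-critical` receiver name is made up for illustration, and this file is not part of the setup above):

```bash
cat > /tmp/alertmanager-route-example.yml <<"EOF"
# Illustration of label-based routing in Alertmanager itself
route:
  receiver: 'web.hook.prometheusalert'    # default receiver
  routes:
    - match:
        severity: critical                # alerts carrying severity=critical ...
      receiver: 'dingtalk-critical'       # ... go to a dedicated channel
receivers:
  - name: 'web.hook.prometheusalert'
    webhook_configs:
      - url: 'http://127.0.0.1:8080/prometheusalert?type=email&tpl=prometheus-email'
  - name: 'dingtalk-critical'
    webhook_configs:
      - url: 'http://127.0.0.1:8080/prometheusalert?type=dd&tpl=prometheus-dd'
EOF
```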
Alert rules
Linux host alerts
```bash
cat > /usr/local/prometheus/linux-alert2-rule.yml <<"EOF"
groups:
- name: host-monitoring-alerts
  rules:
  - alert: HostDown
    expr: up{job="Linux"} == 0
    for: 5m
    labels:
      status: critical
    annotations:
      summary: "{{$labels.instance}}: server is down"
      description: "{{$labels.instance}}: server has been unreachable for over 5 minutes"
  - alert: HighCpuUsage
    expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance) * 100) > 90
    for: 5m
    labels:
      status: warning
    annotations:
      summary: "{{$labels.mountpoint}} CPU usage is too high!"
      description: "{{$labels.mountpoint}} CPU usage is above 90% (current: {{$value}}%)"
  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 80
    for: 5m
    labels:
      status: severe
    annotations:
      summary: "{{$labels.mountpoint}} memory usage is too high!"
      description: "{{$labels.mountpoint}} memory usage is above 80% (current: {{$value}}%)"
  - alert: HighDiskIO
    expr: avg(irate(node_disk_io_time_seconds_total[1m])) by(instance) * 100 > 60
    for: 5m
    labels:
      status: severe
    annotations:
      summary: "{{$labels.mountpoint}} disk I/O usage is too high!"
      description: "{{$labels.mountpoint}} disk I/O is above 60% (current: {{$value}})"
  - alert: NetworkInboundHigh
    expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
    for: 5m
    labels:
      status: severe
    annotations:
      summary: "{{$labels.mountpoint}} inbound network bandwidth is too high!"
      description: "{{$labels.mountpoint}} inbound network bandwidth is persistently above 100M (RX rate: {{$value}})"
  - alert: NetworkOutboundHigh
    expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr*|lo*'}[5m])) by (instance)) / 100) > 102400
    for: 5m
    labels:
      status: severe
    annotations:
      summary: "{{$labels.mountpoint}} outbound network bandwidth is too high!"
      description: "{{$labels.mountpoint}} outbound network bandwidth is persistently above 100M (TX rate: {{$value}})"
  - alert: TcpEstablishedHigh
    expr: node_netstat_Tcp_CurrEstab > 1000
    for: 5m
    labels:
      status: severe
    annotations:
      summary: "{{$labels.mountpoint}} too many TCP_ESTABLISHED connections!"
      description: "{{$labels.mountpoint}} TCP_ESTABLISHED count is above 1000 (current: {{$value}})"
  - alert: DiskSpaceUsage
    expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
    for: 5m
    labels:
      status: severe
    annotations:
      summary: "{{$labels.mountpoint}} disk partition usage is too high!"
      description: "{{$labels.mountpoint}} disk partition usage is above 80% (current: {{$value}}%)"
EOF
```
Greenplum database alerts
```bash
cat > /usr/local/prometheus/greenplum-alert-rule.yml <<"EOF"
groups:
- name: 'Greenplum'
  rules:
  - alert: GreenplumClusterDown
    expr: greenplum_up == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Greenplum down (instance {{ $labels.instance }})"
      description: "The Greenplum database is down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: GreenplumTooManyConnections
    expr: greenplum_cluster_total_connections > 240
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Greenplum too many connections (instance {{ $labels.instance }})"
      description: "Greenplum server has too many connections\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: GreenplumStandbyDown
    expr: greenplum_cluster_sync == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Greenplum standby down (instance {{ $labels.instance }})"
      description: "Greenplum standby node is down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: GreenplumSegmentDown
    expr: greenplum_node_segment_status == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Greenplum segment down (instance {{ $labels.instance }})"
      description: "Greenplum segment node is down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
EOF
```
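The rule files can be syntax-checked with promtool before reloading:

```bash
/usr/local/prometheus/promtool check rules /usr/local/prometheus/linux-alert2-rule.yml /usr/local/prometheus/greenplum-alert-rule.yml
```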
Applying the rules
```bash
curl -XPOST http://localhost:9090/-/reload
```
Results
Visit: http://192.168.8.8:29090/alerts?search=
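Rule and alert state can also be inspected via the HTTP API:

```bash
# Firing/pending alerts
curl -s http://localhost:9090/api/v1/alerts
# All loaded rule groups and their evaluation state
curl -s http://localhost:9090/api/v1/rules
```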
Email alert:
DingTalk alert:
Summary
1. The metric data in the latest greenplum_exporter-1.1-rhel7.x86_64.rpm release has some issues (https://github.com/tangyibo/greenplum_exporter), and the author has not published fixes; changing the SQL behind a metric requires editing the source and recompiling.
2. Dashboard templates can be downloaded from https://grafana.com/grafana/dashboards/.
3. The dashboard mentioned above (Grafana Dashboard ID 13822, https://grafana.com/grafana/dashboards/13822) handles multiple Greenplum clusters poorly: it cannot distinguish clusters by job, so their metrics end up mixed together.
Current workarounds:
- Run a separate Prometheus + Grafana stack per cluster.
- Use my modified dashboard at https://grafana.com/grafana/dashboards/18122 (just load ID 18122).
4. There is another modified version at https://github.com/ChrisYuan/greenplum_exporter, also derived from https://github.com/tangyibo/greenplum_exporter; in my testing the changes are minor, and the original feels better.
References
https://blog.csdn.net/inrgihc/article/details/108686638
https://dbswitch.gitee.io/docs-site/#/docs/monitor/monitor-greenplum
https://github.com/tangyibo/greenplum_exporter
https://github.com/ChrisYuan/greenplum_exporter
https://github.com/feiyu563/PrometheusAlert