Deploying a Prometheus + Grafana Monitoring Stack on Huawei Cloud ECS

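ECS specifications

| Role | Spec | OS | Disk | Count | Purpose |
| --- | --- | --- | --- | --- | --- |
| **ECS-Monitor** | 2 vCPU / 4 GiB (s6.medium.2) | Ubuntu 22.04 | 40 GiB SSD | 1 | Runs Prometheus, Grafana, and Alertmanager |
| **ECS-Target** | 2 vCPU / 2 GiB (s6.small.2) | Ubuntu 22.04 | 40 GiB SSD | 1~N | Monitored node, runs Node Exporter |

Network plan

| Item | Setting |
| --- | --- |
| VPC | CIDR 192.168.0.0/16 |
| Subnet | CIDR 192.168.0.0/24 |
| EIP | Must be bound to ECS-Monitor so the Grafana page is reachable |

Security group inbound rules

| Port | Protocol | Source | Purpose |
| --- | --- | --- | --- |
| **22** | TCP | 0.0.0.0/0 or your IP | SSH login |
| **3000** | TCP | Your local IP | Grafana Web UI |
| **9090** | TCP | Your local IP | Prometheus Web UI |
| **9093** | TCP | Your local IP | Alertmanager Web UI |
| **9100** | TCP | ECS-Monitor private IP | Node Exporter metrics endpoint |

I'm using Huawei Cloud ECS here and connecting with MobaXterm. Log in with the password you set, then create a dedicated user and directories.

```bash
# Create the prometheus system user (no login shell, no home directory)
sudo useradd --no-create-home --shell /bin/false prometheus
# Create the config and data directories
sudo mkdir -p /etc/prometheus
sudo mkdir -p /var/lib/prometheus
# Set directory ownership
sudo chown -R prometheus:prometheus /etc/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
```

Download and install Prometheus

```bash
# Refresh the package index
sudo apt update
# Download the Prometheus binary tarball
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.54.1/prometheus-2.54.1.linux-amd64.tar.gz
```

The download takes a while, so be patient. My connection was simply too slow, so I installed with apt instead. In case anyone needs them, here are the commands:

```bash
sudo apt update
sudo apt install -y prometheus
```

Once the installation finishes, open the Prometheus Web UI (port 9090) in a browser. The IP address I use is the public IP of my own VM.

Step 3: Install Alertmanager

```bash
sudo apt install -y prometheus-alertmanager
sudo systemctl status prometheus-alertmanager
```

Configure alert notifications (email example)

```bash
sudo vim /etc/alertmanager/alertmanager.yml
```

(If that path does not exist on your system, check which file the service actually loads with `systemctl cat prometheus-alertmanager`; the apt package may keep it under /etc/prometheus/.)

Enter the following:

```yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: 'your_email@qq.com'
  smtp_auth_username: 'your_email@qq.com'
  smtp_auth_password: 'your_auth_code'  # QQ Mail authorization code, not the account password
  smtp_require_tls: false
route:
  group_by: ['alertname', 'instance']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'target_email@qq.com'
        send_resolved: true  # also notify when the alert resolves
```

Replace the email addresses above with your own. If you installed with apt like I did, the file ships with a default configuration: delete all of it, use the configuration above, and append one more block:

```yaml
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
```

Restart the services. With the apt packages, the Alertmanager unit is named prometheus-alertmanager:

```bash
sudo systemctl restart prometheus
sudo systemctl restart prometheus-alertmanager
```

Install Grafana

Add the official Grafana repository

```bash
# Install dependencies
sudo apt install -y wget software-properties-common apt-transport-https
# Add the GPG key
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
# Add the repository
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
# Refresh the package index
sudo apt update
```

Install and start Grafana

```bash
sudo apt install -y grafana
# Start the service and enable it at boot
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
sudo systemctl status grafana-server
```

At this point I found that port 3000 was blocked for me, so I switched Grafana to port 3030. First check the current setting:

```bash
grep -n http_port /etc/grafana/grafana.ini
```

If the output shows `;http_port = 3000` or `http_port = 3000`, run:

```bash
sudo sed -i 's/^[;#]*[[:space:]]*http_port[[:space:]]*=[[:space:]]*3000/http_port = 3030/' /etc/grafana/grafana.ini
```

Then restart and confirm Grafana is listening on the new port:

```bash
sudo systemctl restart grafana-server
sudo ss -tlnp | grep 3030
```

If that looks normal, you can open Grafana in the browser. If you change the port like this, allow 3030 instead of 3000 in the security group. Log in with username admin and password admin.

Install Node Exporter on the monitored nodes

Run the following on **every ECS-Target**. I only use one here. Create another VM on Huawei Cloud (not shown again) and log in to it.

```bash
sudo apt update
# Download the latest Node Exporter release
cd /tmp
curl -s https://api.github.com/repos/prometheus/node_exporter/releases/latest \
  | grep browser_download_url \
  | grep linux-amd64 \
  | cut -d '"' -f 4 \
  | wget -qi -
```

My connection was too slow for this as well, so I installed with apt:

```bash
sudo apt update
sudo apt install -y prometheus-node-exporter
```

Start the service. If port 9100 shows up as listening (for example with `sudo ss -tlnp | grep 9100`), the installation succeeded.

Configure Prometheus scraping on ECS-Monitor

```bash
sudo vim /etc/prometheus/prometheus.yml
```

Replace the entire contents with:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'example'
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
rule_files:
  - '/etc/prometheus/rules.yml'
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    scrape_timeout: 5s
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100', '192.168.x.x:9100']
```

Change the 192.168.x.x address to the private IP of the server being monitored.

Create the rules file. The heredoc delimiter is quoted so the shell leaves `$labels` untouched:

```bash
sudo tee /etc/prometheus/rules.yml > /dev/null << 'EOF'
groups:
  - name: instance-down
    interval: 15s
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "Instance {{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute"
  - name: resource-usage
    interval: 15s
    rules:
      - alert: CPUHigh
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage too high"
          description: "CPU usage on instance {{ $labels.instance }} has exceeded 80% for 5 minutes"
      - alert: MemoryHigh
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage too high"
          description: "Memory usage on instance {{ $labels.instance }} has exceeded 85% for 5 minutes"
      - alert: DiskHigh
        expr: (1 - (node_filesystem_avail_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"})) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage too high"
          description: "Usage of disk {{ $labels.mountpoint }} on instance {{ $labels.instance }} has exceeded 80%"
EOF
```

You can copy this as-is. Afterwards, validate the configuration:

```bash
promtool check config /etc/prometheus/prometheus.yml
```

Then restart:

```bash
sudo systemctl restart prometheus
sudo systemctl restart prometheus-alertmanager
```

In Grafana, go to Add new connection, search for prometheus, and enter http://localhost:9090 as the server URL.

Next, import a dashboard. Open http://116.204.78.22:3030/dashboard/import directly in the browser (116.204.78.22 is my public IP, so substitute your own), type 1860 into the "Import via grafana.com" box, and click Import.

On ECS-Monitor:

```bash
sudo vim /etc/prometheus/prometheus.yml
```

Find:

```yaml
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100', '192.168.0.15:9100']
```

Change it to:

```yaml
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100', '192.168.0.15:9100']
```

After saving:

```bash
promtool check config /etc/prometheus/prometheus.yml
sudo systemctl restart prometheus
```

The whole Prometheus + Node Exporter + Grafana monitoring chain is now up and running. Everyday URLs:

- Grafana: http://116.204.78.22:3030
- Prometheus targets: http://116.204.78.22:9090/targets
- Alerts: http://116.204.78.22:9090/alerts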
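
The sizing table allows for 1~N ECS-Target nodes, and hand-editing the `targets` list for each new node is easy to get wrong. Below is a minimal sketch of one way to generate the scrape job from a list of private IPs. The function name `render_node_job` is a hypothetical helper, not part of Prometheus, and `192.168.0.16` is a made-up second target used only for illustration. It assumes every target exposes Node Exporter on port 9100 and that `localhost:9100` stays in the list, as in the configuration above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Prometheus tool): print a node_exporter scrape job
# for the given private IPs, each scraped on Node Exporter's port 9100.
render_node_job() {
  # Always keep the monitor host itself, as in the hand-written config.
  local targets="'localhost:9100'"
  local ip
  for ip in "$@"; do
    targets="${targets}, '${ip}:9100'"
  done
  printf "  - job_name: 'node_exporter'\n"
  printf "    static_configs:\n"
  printf "      - targets: [%s]\n" "$targets"
}

# Example: one real target from this walkthrough plus one made-up address.
render_node_job 192.168.0.15 192.168.0.16
```

The output is indented to sit under `scrape_configs:` in /etc/prometheus/prometheus.yml. After pasting it in, running `promtool check config /etc/prometheus/prometheus.yml` before restarting Prometheus catches indentation mistakes.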