Author: Trần Vinh

Contact:

devopsedu.vn: vinhtran
facebook: N/A
linkedin: N/A

In this post I walk through deploying a monitoring stack built from Prometheus, Grafana, Alertmanager, and Node Exporter that watches a host and sends alerts to Telegram. The configuration files and docker-compose.yml are already on GitHub, so you only need to clone the repository and change a few values to get it running.

Contents

Deploying a monitoring stack with Prometheus, Grafana, Alertmanager, and Node Exporter

1. Components

Prometheus: scrapes and stores metrics from services.
Grafana: visualizes the data held in Prometheus.
Alertmanager: groups, routes, and delivers the alerts that Prometheus fires.
Node Exporter: exposes host metrics (CPU, RAM, disk, and so on).
Telegram: receives the alerts from Alertmanager.

2. Requirements

A server with Docker and Docker Compose installed.
A Telegram bot to deliver the alerts (you can create one with BotFather).

3. Steps

Step 1: Clone the repository from GitHub

The configuration files and docker-compose.yml are in my GitHub repository. Clone it with the commands below; they use the SSH URL, so the machine needs an SSH key registered with GitHub.

# git clone git@github.com:vindevops99/DevOps-Tools.git
# cd DevOps-Tools/monitor

Step 2: Configure the Telegram bot

Create a Telegram bot through BotFather and copy its API token.
Create a Telegram group, add the bot to it, and get the group's chat ID. To find the chat ID, send any message in the group and then open the following URL (the ID appears as chat.id in the JSON response, and group IDs are negative numbers):
```
https://api.telegram.org/bot<YOUR_BOT_TOKEN>/getUpdates
```
Here <YOUR_BOT_TOKEN> is the bot's API token.
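
The response from getUpdates is JSON, and it can be tedious to read by eye. As a minimal sketch, assuming you have copied that JSON into a string, the helper below lists every chat the bot has seen. The name extract_chat_ids and the sample payload are invented for illustration; only the result / message / chat / id layout comes from the Bot API.

```python
import json


def extract_chat_ids(get_updates_json: str) -> dict[int, str]:
    """Map chat ID -> chat title (or chat type) for every chat in a getUpdates response."""
    payload = json.loads(get_updates_json)
    chats: dict[int, str] = {}
    for update in payload.get("result", []):
        # Ordinary messages arrive under "message"; other update types
        # (for example "my_chat_member") also carry a "chat" object.
        for value in update.values():
            if isinstance(value, dict) and "chat" in value:
                chat = value["chat"]
                chats[chat["id"]] = chat.get("title", chat.get("type", ""))
    return chats


# Invented sample shaped like a Bot API getUpdates response.
SAMPLE = """
{"ok": true, "result": [
  {"update_id": 1, "message": {"message_id": 7, "text": "hi",
    "chat": {"id": -1001234567890, "title": "DevOps Alerts", "type": "supergroup"}}}
]}
"""

if __name__ == "__main__":
    for chat_id, title in extract_chat_ids(SAMPLE).items():
        print(chat_id, title)
```

The number printed for your group is the value to use for chat_id in the Alertmanager configuration.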

Open alertmanager/config.yml and change the following values:

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'instance', 'ip', 'host', 'job', 'severity', 'teamowner']
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 3h
  receiver: teams-devops
  routes:
    - match:
        severity: warning
      continue: true
      receiver: teams-devops
    - match:
        severity: critical
      continue: true
      receiver: teams-devops

receivers:
  - name: 'teams-devops'
    telegram_configs:
      - send_resolved: true
        http_config:
          follow_redirects: true
        api_url: "https://api.telegram.org"
        bot_token: "<YOUR_BOT_TOKEN>"  # Replace with your bot token
        chat_id: <YOUR_CHAT_ID>        # Replace with your chat ID (an integer)
        message: '{{ template "telegram.default.message" . }}'
        parse_mode: HTML
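
Before relying on Alertmanager, it can help to confirm that the token and chat ID work on their own. The sketch below builds a Bot API sendMessage request with the same parse_mode as the config above. The function name and the example token and chat ID are placeholders I made up; replace them with your real values and pass the request to urlopen to actually send it.

```python
import json
from urllib.request import Request, urlopen


def build_send_message(bot_token: str, chat_id: int, text: str) -> Request:
    """Build (but do not send) a POST request for the Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    body = json.dumps(
        {"chat_id": chat_id, "text": text, "parse_mode": "HTML"}
    ).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder credentials: nothing is sent here, the URL is only printed.
    request = build_send_message("123456:EXAMPLE", -1001234567890, "<b>test</b> message")
    print(request.get_method(), request.full_url)
    # To send for real, with valid credentials:
    #     with urlopen(request, timeout=5) as response:
    #         print(response.status)
```

If the bot posts the test message in the group, the same two values should work in telegram_configs.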

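Step 3: Review and edit the configuration files

1. Prometheus Configuration (prometheus/prometheus.yml)

This file defines the jobs that scrape metrics from each service. Add or edit jobs as needed. Prometheus only forwards alerts when the file also has an alerting.alertmanagers section pointing at Alertmanager; the excerpt below does not show one, so check that the copy in the repository does.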

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/alert-rules/*.yml

scrape_configs:
  - job_name: 'node_exporter'
    scrape_interval: 30s
    static_configs:
      - targets: ['node_exporter:9100']
        labels:
          instance: monitor-server
          host: monitor-server
          ip: 10.0.10.60
          site: company
          application: node_exporter
          infrastructure: prometheus
          company: company
          teamowner: DevOps
          environment: Prod
          tag: infrastructure

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: monitor-server
          host: monitor-server
          ip: 10.0.10.60
          site: company
          application: prometheus
          infrastructure: prometheus
          company: company
          teamowner: DevOps
          environment: Prod
          tag: infrastructure

  - job_name: 'grafana'
    static_configs:
      - targets: ['grafana:3000']
        labels:
          instance: monitor-server
          host: monitor-server
          ip: 10.0.10.60
          site: company
          application: grafana
          infrastructure: prometheus
          company: company
          teamowner: DevOps
          environment: Prod
          tag: infrastructure

  - job_name: 'alertmanager'
    static_configs:
      - targets: ['alertmanager:9093']
        labels:
          instance: monitor-server
          host: monitor-server
          ip: 10.0.10.60
          site: company
          application: alertmanager
          infrastructure: prometheus
          company: company
          teamowner: DevOps
          environment: Prod
          tag: infrastructure

2. Alert Rules (prometheus/alert-rules/alert-rules.yml)

This file holds the alerting rules. Add or edit rules as needed.

groups:
  - name: targets
    rules:
      - alert: service_down
        expr: up == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "Service {{ $labels.job }} is down"

  - name: host
    rules:
      - alert: Disk Usage High
        expr: (node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_avail_bytes{mountpoint="/"}) / node_filesystem_size_bytes{mountpoint="/"} > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Root disk usage more than 80%"
          description: "Root disk usage more than 80%"

      - alert: CPU Usage High
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage more than 80%"
          description: "CPU usage more than 80%"

      - alert: Memory Usage High
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage more than 80%"
          description: "Memory usage more than 80%"
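
To see what the thresholds in these rules mean, the sketch below redoes the disk, memory, and CPU arithmetic with plain numbers. The function names and sample values are made up for illustration; Prometheus evaluates the real expressions against live series.

```python
def usage_ratio(total_bytes: float, available_bytes: float) -> float:
    """Same arithmetic as (total - available) / total in the disk and memory rules."""
    return (total_bytes - available_bytes) / total_bytes


def cpu_busy_percent(idle_rates: list[float]) -> float:
    """100 - avg(idle rate per core) * 100, as in the CPU rule."""
    return 100 - (sum(idle_rates) / len(idle_rates)) * 100


GIB = 1024 ** 3

if __name__ == "__main__":
    # A 100 GiB root filesystem with 15 GiB still available.
    disk = usage_ratio(total_bytes=100 * GIB, available_bytes=15 * GIB)
    print(f"disk ratio {disk:.2f}, above threshold: {disk > 0.8}")

    # Two cores: one idle 25% of the time, one never idle.
    cpu = cpu_busy_percent([0.25, 0.0])
    print(f"cpu busy {cpu:.1f}%, above threshold: {cpu > 80}")
```

Crossing the threshold is not enough on its own: because each rule sets for: 5m, the condition has to hold for five minutes before the alert fires.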

3. Grafana Configuration

Grafana is already provisioned with Prometheus as its data source. You can add your own dashboards under grafana/provisioning/dashboards.

4. Alertmanager Template (alertmanager/template/default.tmpl)

This file defines the template for the alerts sent to Telegram. Edit the message text as needed. Alertmanager loads the file only if its path is listed under templates: in alertmanager/config.yml; otherwise the built-in telegram.default.message is used.

{{ define "text_alert_list" }}{{ range . }}
{{ if .Labels.alertname }}<b>alertname:</b> {{ .Labels.alertname }}{{ end }}
{{ if .Labels.severity }}<b>severity:</b> {{ .Labels.severity }}{{ end }}
{{ if .Annotations.description }}<b>description:</b> {{ .Annotations.description }}{{ end }}
{{ if .Labels.instance }}<b>instance:</b> {{ .Labels.instance }}{{ end }}
{{ if .Labels.ip }}<b>ip:</b> {{ .Labels.ip }}{{ end }}
{{ if .Labels.teamowner }}<b>teamowner:</b> {{ .Labels.teamowner }}{{ end }}
{{ end }}{{ end }}

{{ define "telegram.default.message" }}{{ if gt (len .Alerts.Firing) 0 }}<b>🔥 Company Infra Alerts Firing:</b> {{ template "text_alert_list" .Alerts.Firing }}{{ if gt (len .Alerts.Resolved) 0 }}
{{ end }}{{ end }}{{ if gt (len .Alerts.Resolved) 0 }}<b>✅ Company Infra Alerts Resolved: </b>{{ template "text_alert_list" .Alerts.Resolved }}{{ end }}{{ end }}
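
To preview roughly what a message built by this template looks like, here is a Python approximation. It is only a sketch: it mirrors which fields are printed and in what order, but it does not reproduce Go template whitespace exactly, and the function names and sample alert are my own.

```python
FIELDS = ("alertname", "severity", "description", "instance", "ip", "teamowner")


def render_alert(labels: dict[str, str], annotations: dict[str, str]) -> str:
    """Emit one '<b>name:</b> value' line for each field that is present."""
    # description comes from annotations, every other field from labels.
    values = {**labels, "description": annotations.get("description", "")}
    return "\n".join(
        f"<b>{name}:</b> {values[name]}" for name in FIELDS if values.get(name)
    )


def render_message(firing: list[dict], resolved: list[dict]) -> str:
    """Join a firing section and a resolved section, skipping empty ones."""
    parts = []
    if firing:
        body = "\n\n".join(render_alert(a["labels"], a["annotations"]) for a in firing)
        parts.append("<b>🔥 Company Infra Alerts Firing:</b>\n" + body)
    if resolved:
        body = "\n\n".join(render_alert(a["labels"], a["annotations"]) for a in resolved)
        parts.append("<b>✅ Company Infra Alerts Resolved:</b>\n" + body)
    return "\n\n".join(parts)


if __name__ == "__main__":
    sample = {
        "labels": {"alertname": "service_down", "severity": "critical",
                   "instance": "monitor-server", "ip": "10.0.10.60"},
        "annotations": {"description": "Service node_exporter is down"},
    }
    print(render_message([sample], []))
```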

Step 4: Start the stack

Once the configuration is done, change into the directory that contains docker-compose.yml and start the stack with:

# docker-compose up -d

The containers start in the background, and the services are then reachable at:

Prometheus: http://localhost:9090
Grafana: http://localhost:3000 (log in with admin/admin)
Alertmanager: http://localhost:9093
Node Exporter: http://localhost:9100/metrics
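
Rather than opening each URL by hand, a small script can confirm that all four services answer. This is a sketch under assumptions: it uses the ports listed above, the /-/healthy endpoints of Prometheus and Alertmanager, and Grafana's /api/health; the function names are mine.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

ENDPOINTS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "grafana": "http://localhost:3000/api/health",
    "alertmanager": "http://localhost:9093/-/healthy",
    "node_exporter": "http://localhost:9100/metrics",
}


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code, or 0 if the service is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        return error.code
    except (URLError, OSError):
        return 0


def check_all(fetch=http_status) -> dict[str, bool]:
    """Report, per service, whether its endpoint answered with HTTP 200."""
    return {name: fetch(url) == 200 for name, url in ENDPOINTS.items()}


if __name__ == "__main__":
    for name, healthy in check_all().items():
        print(f"{name}: {'up' if healthy else 'DOWN'}")
```

The fetch parameter exists only so the check can be exercised without a running stack, for example by passing a function that returns a fixed status code.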

Step 5: Check the Telegram alerts

When an alert fires (for example, CPU usage above 80% for 5 minutes), Alertmanager sends a notification to Telegram using the template in alertmanager/template/default.tmpl.
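
You do not have to wait for a real problem to test the Telegram path. One option, sketched below, is to post a synthetic alert straight to Alertmanager's v2 API. The alert name TelegramTest, its labels, and the function names are invented for this example; the URL assumes Alertmanager is reachable on localhost:9093 as listed above.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.request import Request, urlopen

ALERTMANAGER_URL = "http://localhost:9093/api/v2/alerts"


def build_test_alert(now: datetime, minutes: int = 5) -> list[dict]:
    """Build one synthetic alert that expires `minutes` after `now`."""
    return [{
        "labels": {
            "alertname": "TelegramTest",  # hypothetical name, not a real rule
            "severity": "warning",
            "instance": "monitor-server",
            "teamowner": "DevOps",
        },
        "annotations": {"description": "Manual test alert, safe to ignore"},
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send(alerts: list[dict], url: str = ALERTMANAGER_URL) -> int:
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = Request(
        url,
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request, timeout=5) as response:
        return response.status


if __name__ == "__main__":
    # Only prints the payload; call send(...) yourself once the stack is running.
    print(json.dumps(build_test_alert(datetime.now(timezone.utc)), indent=2))
```

After calling send with this payload, the message should reach the group once group_wait (30s in the config from step 2) has elapsed.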

4. Conclusion

With these steps you have a monitoring stack running Prometheus, Grafana, Alertmanager, and Node Exporter on Docker. It tracks the health of the host and sends timely alerts through Telegram. If you have questions or run into problems, leave a comment or contact me. Good luck! 😊