
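1. Overview

1.1 Background

Our production environment runs several hundred machines. For two years we learned about failures only through manual inspection and user reports, a purely reactive way of operating. In 2019 we started migrating everything to Prometheus + Grafana. The stack has now run stably for more than five years, covers four layers (hosts, containers, middleware, business metrics) and collects more than 20 million metric data points per day.

Prometheus uses a pull model: it actively scrapes metrics from each target, which is fundamentally different from push-style setups such as Zabbix active agents. The monitoring side stays in control, so a dead target shows up as a failed scrape within one scrape interval, and you never end up with "the agent died but the monitoring system has no idea". Storage is handled by Prometheus's own built-in TSDB. In our tests a single node ingests on the order of a million samples per second, and typical queries return within milliseconds.

Grafana handles visualization. It supports dozens of data sources and a wide range of panel types, from line charts to heatmaps to topology graphs. Together with Alertmanager for alerting, the three components cover the whole monitoring pipeline.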

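1.2 Technical characteristics

Pull model + service discovery: Prometheus scrapes metrics itself and works with service discovery mechanisms such as Consul and Kubernetes, so newly deployed services are picked up automatically without manual configuration. We run more than 400 microservice instances in production, all registered through Kubernetes service discovery with no operator involvement.

PromQL query language: this is Prometheus's core strength. It supports vector arithmetic, aggregation functions and prediction functions. An expression such as predict_linear(node_filesystem_avail_bytes[6h], 24*3600) < 0 extrapolates the last 6 hours of data and fires when the disk is predicted to be full within 24 hours. The learning curve is steeper than SQL, but it is very productive once you are familiar with it.

Local TSDB + remote storage: data is stored on local disk by default, and a single node handles most workloads. For larger volumes you can attach Thanos, VictoriaMetrics or similar systems for long-term storage and global queries. We keep 15 days of hot data locally, and Thanos Sidecar ships blocks to S3 as cold storage with one year of retention.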

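1.3 Use cases

Cloud-native monitoring: for Kubernetes clusters, Docker containers and microservice architectures, Prometheus is the de facto standard. It is a CNCF graduated project with a mature ecosystem and ready-made exporters for most software.

Medium and large infrastructure monitoring: a single Prometheus server handles anything from a few dozen to a few thousand hosts. Beyond that, scale out with federation or Thanos.

Business metrics: instrument the application with a client SDK to expose QPS, latency, error rate and similar metrics, then view and alert on them on the same platform as the infrastructure metrics.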

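1.4 Environment requirements

Component	Version	Notes
Operating system	CentOS 7+ / Ubuntu 20.04+	Ubuntu 22.04 LTS recommended; kernel 5.15+ has better cgroup v2 support
Prometheus	2.45+ (LTS) or 2.53+	Use an LTS release in production; the 2.45.x series was the LTS line at the time of writing
Grafana	10.0+	The 10.x UI was reworked and performs noticeably better; go straight to 10.2+
Node Exporter	1.7+	In our experience, versions below 1.6 leak memory on ARM
Hardware	4 cores / 8 GB minimum	Enough for up to 500 targets; above 1000 targets use 8 cores / 16 GB and SSD storage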

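2. Detailed Steps

2.1 Preparation

2.1.1 System check

# Check the OS release
cat /etc/os-release

# Check CPU and memory; Prometheus needs memory, and scraping 1000 targets takes roughly 4-6 GB
free -h
nproc

# Check disk space; reserve at least 100 GB for the TSDB data directory
df -h

# Check time synchronization; Prometheus is sensitive to clock skew, and an offset of more than 1 minute leads to inconsistent data
timedatectl status
# If NTP is disabled, enable it right away
sudo timedatectl set-ntp true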

2.1.2 Create the user and directories

# Create the prometheus user without a login shell
sudo useradd --no-create-home --shell /bin/false prometheus

# Create the directory layout
sudo mkdir -p /etc/prometheus
sudo mkdir -p /var/lib/prometheus
sudo mkdir -p /etc/prometheus/rules
sudo mkdir -p /etc/prometheus/file_sd

# Set ownership
sudo chown -R prometheus:prometheus /etc/prometheus
sudo chown -R prometheus:prometheus /var/lib/prometheus

2.1.3 Firewall configuration

# Default ports: Prometheus 9090, Grafana 3000, Node Exporter 9100
sudo ufw allow 9090/tcp
sudo ufw allow 3000/tcp
sudo ufw allow 9100/tcp
sudo ufw reload

# CentOS uses firewalld
sudo firewall-cmd --permanent --add-port=9090/tcp
sudo firewall-cmd --permanent --add-port=3000/tcp
sudo firewall-cmd --permanent --add-port=9100/tcp
sudo firewall-cmd --reload

2.2 Core configuration

2.2.1 Installing Prometheus (binary release)

# Download Prometheus 2.53.0
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.53.0/prometheus-2.53.0.linux-amd64.tar.gz

# Unpack
tar xzf prometheus-2.53.0.linux-amd64.tar.gz
cd prometheus-2.53.0.linux-amd64

# Copy the binaries
sudo cp prometheus /usr/local/bin/
sudo cp promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus
sudo chown prometheus:prometheus /usr/local/bin/promtool

# Copy the console templates
sudo cp -r consoles /etc/prometheus/
sudo cp -r console_libraries /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus/consoles
sudo chown -R prometheus:prometheus /etc/prometheus/console_libraries

# Verify the installation
prometheus --version
# Output looks like: prometheus, version 2.53.0 (branch: HEAD, revision: ...)

2.2.2 Prometheus main configuration file

sudo tee /etc/prometheus/prometheus.yml > /dev/null <

	

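Note: we settled on a scrape_interval of 15 seconds after repeated testing. At 10 seconds, Prometheus CPU usage rises noticeably once there are more than 500 targets; at 30 seconds, short spikes are missed. 15 seconds gives the best trade-off between cost and resolution.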
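To make the trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes that the ingest rate is approximately the number of active series divided by the scrape interval, which only holds when every series is scraped at the same interval. The helper name samples_per_sec is made up for this illustration, and 487632 is simply the active-series count shown in the capacity-planning output later in this article.

```shell
#!/usr/bin/env bash
# samples_per_sec SERIES INTERVAL_SECONDS -> approximate samples ingested per second.
# Assumption: all series share one scrape interval (integer division, rounds down).
samples_per_sec() {
  local series=$1 interval=$2
  echo $(( series / interval ))
}

for interval in 10 15 30; do
  echo "interval=${interval}s -> $(samples_per_sec 487632 "$interval") samples/s"
done
```

Halving the interval roughly doubles the write load, which is why interval changes are worth testing before they are rolled out fleet-wide.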

2.2.3 File-based service discovery

# Node list
sudo tee /etc/prometheus/file_sd/nodes.yml > /dev/null <

	

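Note: file-based service discovery is more flexible than static_configs. Prometheus watches the files and picks up changes automatically, with no restart. In production we run a script that syncs the machine list from the CMDB into this file every 5 minutes.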
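As a minimal sketch of how such a target file could be produced, the snippet below renders a file_sd document and writes it with a rename, so that Prometheus is unlikely to read a half-written file. The function names render_file_sd and write_atomically, the role label and the example addresses are assumptions for illustration only. The output is JSON, which is also valid YAML, so it can be stored under a .yml name.

```shell
#!/usr/bin/env bash
# render_file_sd ROLE HOST... -> prints one file_sd target group as JSON.
# Every host is assumed to run Node Exporter on port 9100.
render_file_sd() {
  local role=$1; shift
  local targets="" host
  for host in "$@"; do
    targets+="${targets:+, }\"${host}:9100\""
  done
  printf '[{"targets": [%s], "labels": {"role": "%s"}}]\n' "$targets" "$role"
}

# write_atomically FILE -> copies stdin to a temp file next to FILE, then renames it.
write_atomically() {
  local dest=$1 tmp
  tmp=$(mktemp "${dest}.XXXXXX")
  cat > "$tmp"
  mv "$tmp" "$dest"
}

render_file_sd app-server 10.0.1.10 10.0.1.11
```

A typical call would be `render_file_sd app-server 10.0.1.10 | write_atomically /etc/prometheus/file_sd/nodes_app-server.yml`. Creating the temp file in the destination directory keeps the rename on one filesystem.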

2.2.4 Prometheus systemd service

	
sudo tee /etc/systemd/system/prometheus.service > /dev/null <

	

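Parameter notes:

--storage.tsdb.retention.time=15d: keep data for 15 days; adjust to your disk size. Each target produces roughly 1-2 MB of data per day, so 500 targets retained for 15 days need about 7.5-15 GB.

--storage.tsdb.retention.size=50GB: size-based limit. Whichever of the time and size limits is hit first applies. This is a safety net against filling the disk.

--web.enable-lifecycle: enables hot reload of the configuration through the HTTP API: curl -X POST http://localhost:9090/-/reload. Always enable it in production, otherwise every configuration change needs a restart.

--query.max-concurrency=20: number of concurrent queries; the default is 20. That may not be enough with many Grafana panels, and we raised it to 40.

--storage.tsdb.min-block-duration=2h and --storage.tsdb.max-block-duration=2h: with Thanos Sidecar both must be set to 2h, otherwise the Sidecar upload runs into problems.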
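The retention arithmetic above can be sketched as a small helper. It takes the rule of thumb from the text (1-2 MB per target per day) at face value; real usage depends on series count and churn, and the function name estimate_disk_mb is made up for this example.

```shell
#!/usr/bin/env bash
# estimate_disk_mb TARGETS DAYS MB_PER_TARGET_PER_DAY -> total MB (integer).
# Assumption: disk usage grows linearly with targets and retention days.
estimate_disk_mb() {
  local targets=$1 days=$2 mb_per_day=$3
  echo $(( targets * days * mb_per_day ))
}

low=$(estimate_disk_mb 500 15 1)
high=$(estimate_disk_mb 500 15 2)
echo "500 targets, 15 days: ${low}-${high} MB"
```

For 500 targets and 15 days this prints a range of 7500-15000 MB, so a 50GB size limit leaves comfortable headroom.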

2.2.5 Installing Node Exporter

# Download
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar xzf node_exporter-1.8.1.linux-amd64.tar.gz

# Install
sudo cp node_exporter-1.8.1.linux-amd64/node_exporter /usr/local/bin/
sudo useradd --no-create-home --shell /bin/false node_exporter
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

# systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <

	

Note: always set --collector.filesystem.mount-points-exclude, otherwise metrics from virtual filesystems such as /sys and /proc are collected, which adds volume and no value. --collector.systemd enables monitoring of systemd unit states, which is very useful when tracking down service failures.

2.2.6 Installing Grafana

# Add the Grafana APT repository
sudo apt install -y apt-transport-https software-properties-common
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list

# Install
sudo apt update
sudo apt install -y grafana

# Edit the configuration
sudo tee /etc/grafana/grafana.ini > /dev/null <

	

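Note: min_refresh_interval is set to 10s so that nobody can set a dashboard refresh of 1 second and take Prometheus down with queries. This really happened to us: a colleague set a 1-second refresh, 20 panels queried at once, and the Prometheus query queue filled up.

2.2.7 Docker deployment (alternative)

# Create docker-compose.yml
mkdir -p /opt/monitoring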
cat > /opt/monitoring/docker-compose.yml <

	

2.3 Startup and verification

2.3.1 Starting the services

# Check the configuration syntax first
promtool check config /etc/prometheus/prometheus.yml
# Output: Checking /etc/prometheus/prometheus.yml
#  SUCCESS: /etc/prometheus/prometheus.yml is valid prometheus config file

# Start Prometheus
sudo systemctl daemon-reload
sudo systemctl start prometheus
sudo systemctl enable prometheus

# Check the status
sudo systemctl status prometheus
# Expect Active: active (running)

# Read the startup log and make sure there are no errors
journalctl -u prometheus -n 50 --no-pager


	

2.3.2 Functional verification

# Check that Prometheus is running
curl -s http://localhost:9090/-/healthy
# Output: Prometheus Server is Healthy.

curl -s http://localhost:9090/-/ready
# Output: Prometheus Server is Ready.

# List the registered targets
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | head -30

# Check Node Exporter
curl -s http://localhost:9100/metrics | head -20

# Check Grafana
curl -s http://localhost:3000/api/health
# Output: {"commit":"...","database":"ok","version":"10.2.3"}

# Run a simple PromQL query
curl -s 'http://localhost:9090/api/v1/query?query=up' | python3 -m json.tool
# The up value of every target should be 1


	

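2.3.3 Adding the Prometheus data source to Grafana

# Add the data source through the API
# (credentials are passed with -u because an "@" inside the password breaks user:pass@host URLs)
curl -X POST http://localhost:3000/api/datasources \
  -u 'admin:P@ssw0rd_Change_Me' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",
    "access": "proxy",
    "isDefault": true,
    "jsonData": {
      "timeInterval": "15s",
      "queryTimeout": "60s",
      "httpMethod": "POST"
    }
  }'
# Output: {"datasource":{"id":1,...},"id":1,"message":"Datasource added","name":"Prometheus"}

Note: httpMethod is set to POST rather than GET because complex PromQL queries can be very long. GET requests are limited by URL length, and Nginx or a load balancer may reject anything above roughly 8 KB. We hit this in production: a query aggregating over 20 labels came back with 414 URI Too Long when sent as GET.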

3. Example Code and Configuration

3.1 Complete configuration examples

3.1.1 Production-grade prometheus.yml

# File: /etc/prometheus/prometheus.yml
# Intended for: medium-sized production environments (200-800 targets)
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s
  external_labels:
    cluster: 'prod-bj-01'
    environment: 'production'
    region: 'cn-beijing'

rule_files:
  - "/etc/prometheus/rules/node_rules.yml"
  - "/etc/prometheus/rules/container_rules.yml"
  - "/etc/prometheus/rules/app_rules.yml"
  - "/etc/prometheus/rules/recording_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - '10.0.1.50:9093'
            - '10.0.1.51:9093'
            - '10.0.1.52:9093'
      timeout: 10s
      api_version: v2

scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: /metrics
    scheme: http

  # Node Exporter host monitoring
  - job_name: 'node-exporter'
    file_sd_configs:
      - files:
          - '/etc/prometheus/file_sd/nodes_*.yml'
        refresh_interval: 30s
    relabel_configs:
      # Extract the host name from the address
      - source_labels: [__address__]
        regex: '(.+):(\d+)'
        target_label: hostname
        replacement: '${1}'
      # Drop targets that carry the ignore label
      - source_labels: [__meta_ignore]
        regex: 'true'
        action: drop

  # Kubernetes service discovery - Pod monitoring
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
        kubeconfig_file: /etc/prometheus/kubeconfig
        namespaces:
          names:
            - default
            - app-prod
            - middleware
    relabel_configs:
      # Only scrape Pods annotated with prometheus.io/scrape
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the path given in the annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Use the port given in the annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Add the namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      # Add the pod name label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

  # MySQL Exporter
  - job_name: 'mysql-exporter'
    file_sd_configs:
      - files:
          - '/etc/prometheus/file_sd/mysql.yml'
        refresh_interval: 60s
    scrape_interval: 30s
    scrape_timeout: 15s

  # Redis Exporter
  - job_name: 'redis-exporter'
    file_sd_configs:
      - files:
          - '/etc/prometheus/file_sd/redis.yml'
        refresh_interval: 60s

  # Nginx VTS Exporter
  - job_name: 'nginx-vts'
    file_sd_configs:
      - files:
          - '/etc/prometheus/file_sd/nginx.yml'
        refresh_interval: 60s

  # Blackbox probing
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    file_sd_configs:
      - files:
          - '/etc/prometheus/file_sd/blackbox_http.yml'
        refresh_interval: 60s
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: '10.0.1.60:9115'

  # Federation - pull aggregated metrics from a child Prometheus
  - job_name: 'federation-staging'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'
    static_configs:
      - targets:
          - '10.0.3.10:9090'
        labels:
          federated_from: 'staging-cluster'
    scrape_interval: 30s
    scrape_timeout: 25s


	

	3.1.2 Recording Rules預(yù)聚合規(guī)則

	
# 文件路徑：/etc/prometheus/rules/recording_rules.yml
# 預(yù)聚合規(guī)則能大幅降低查詢時的計算量
# 我們線上一個Dashboard從加載8秒降到了0.5秒，就是靠預(yù)聚合
groups:
-name:node_recording_rules
 interval:15s
 rules:
  # CPU使用率預(yù)聚合
  -record:instanceratio
   expr:|
     1 - avg by (instance) (
      rate(node_cpu_seconds_total{mode="idle"}[5m])
     )

  # 內(nèi)存使用率預(yù)聚合
  -record:instanceratio
   expr:|
     1 - (
      node_memory_MemAvailable_bytes
      / node_memory_MemTotal_bytes
     )

  # 磁盤使用率預(yù)聚合
  -record:instanceratio
   expr:|
     1 - (
      node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"}
      / node_filesystem_size_bytes{mountpoint="/",fstype!="tmpfs"}
     )

  # 網(wǎng)絡(luò)接收速率
  -record:instancerate5m
   expr:|
     rate(node_network_receive_bytes_total{device!~"lo|veth.*|docker.*|br.*"}[5m])

  # 網(wǎng)絡(luò)發(fā)送速率
  -record:instancerate5m
   expr:|
     rate(node_network_transmit_bytes_total{device!~"lo|veth.*|docker.*|br.*"}[5m])

  # 磁盤IO使用率
  -record:instanceratio
   expr:|
     rate(node_disk_io_time_seconds_total[5m])

-name:app_recording_rules
 interval:15s
 rules:
  # HTTP請求QPS
  -record:jobrate5m
   expr:|
     sum by (job) (rate(http_requests_total[5m]))

  # HTTP請求延遲P99
  -record:jobp99
   expr:|
     histogram_quantile(0.99,
      sum by (job, le) (
       rate(http_request_duration_seconds_bucket[5m])
      )
     )

  # HTTP錯誤率
  -record:jobratio5m
   expr:|
     sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
     / sum by (job) (rate(http_requests_total[5m]))


	

3.1.3 Grafana provisioning

# File: /etc/grafana/provisioning/datasources/prometheus.yml
# Grafana loads these data sources at startup; nothing has to be configured in the UI
apiVersion: 1

datasources:
  - name: Prometheus-Prod
    type: prometheus
    access: proxy
    url: http://10.0.1.40:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: '15s'
      queryTimeout: '60s'
      httpMethod: POST
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo
    version: 1

  - name: Prometheus-Staging
    type: prometheus
    access: proxy
    url: http://10.0.3.10:9090
    isDefault: false
    editable: false
    jsonData:
      timeInterval: '15s'
      queryTimeout: '60s'
      httpMethod: POST
    version: 1

# File: /etc/grafana/provisioning/dashboards/default.yml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: 'Infrastructure'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true


	

	3.1.4 告警規(guī)則文件

	
# 文件路徑：/etc/prometheus/rules/node_rules.yml
groups:
-name:node_alerts
 rules:
  -alert:NodeDown
   expr:up{job="node-exporter"}==0
   for:2m
   labels:
    severity:critical
   annotations:
    summary:"節(jié)點{{ $labels.instance }}宕機"
    description:"節(jié)點{{ $labels.instance }}已經(jīng)超過2分鐘無法訪問"

  -alert:NodeCPUHigh
   expr:instanceratio>0.85
   for:5m
   labels:
    severity:warning
   annotations:
    summary:"節(jié)點{{ $labels.instance }}CPU使用率過高"
    description:"CPU使用率{{ $value | humanizePercentage }}，持續(xù)超過5分鐘"

  -alert:NodeMemoryHigh
   expr:instanceratio>0.90
   for:5m
   labels:
    severity:warning
   annotations:
    summary:"節(jié)點{{ $labels.instance }}內(nèi)存使用率過高"
    description:"內(nèi)存使用率{{ $value | humanizePercentage }}，持續(xù)超過5分鐘"

  -alert:NodeDiskWillFull
   expr:predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h],24*3600)0.90
   for:5m
   labels:
    severity:critical
   annotations:
    summary:"節(jié)點{{ $labels.instance }}磁盤使用率超過90%"
    description:"磁盤使用率{{ $value | humanizePercentage }}"


	

3.2 Real-world cases

Case 1: syncing the target list from the CMDB

Scenario: we have more than 400 servers, so maintaining the file_sd files by hand is not realistic. A script pulls the machine list from the CMDB API every 5 minutes and generates the Prometheus file_sd configuration.

Implementation:

#!/bin/bash
# File: /opt/scripts/sync_cmdb_targets.sh
# Purpose: sync the machine list from the CMDB into Prometheus file_sd files
# Crontab: */5 * * * * /opt/scripts/sync_cmdb_targets.sh

set -euo pipefail

CMDB_API="http://cmdb.internal:8080/api/v1/hosts"
CMDB_TOKEN="your-cmdb-api-token"
OUTPUT_DIR="/etc/prometheus/file_sd"
TEMP_FILE=$(mktemp)
LOG_FILE="/var/log/prometheus/cmdb_sync.log"

log() {
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >> "$LOG_FILE"
}

# Fetch the host list from the CMDB; -w appends the HTTP status code as the last line
response=$(curl -s -w "\n%{http_code}" \
  -H "Authorization: Bearer ${CMDB_TOKEN}" \
  "${CMDB_API}?status=running&page_size=1000")

http_code=$(echo "$response" | tail -1)
body=$(echo "$response" | head -n -1)

if [ "$http_code" != "200" ]; then
  log "ERROR: CMDB API returned ${http_code}"
  exit 1
fi

# Parse the JSON with jq and write one file_sd file per role
for role in app-server db-server cache-server gateway; do
  echo "$body" | jq -r --arg role "$role" '
    [
      {
        "targets": [.data[] | select(.role == $role) | .ip + ":9100"],
        "labels": {
          "env": "production",
          "role": $role,
          "dc": (.data[0].datacenter // "unknown")
        }
      }
    ]' > "${TEMP_FILE}"

  target_count=$(echo "$body" | jq --arg role "$role" '[.data[] | select(.role == $role)] | length')

  # Only replace the live file when the role has targets, so an empty CMDB answer never wipes the list
  if [ "$target_count" -gt 0 ]; then
    mv "${TEMP_FILE}" "${OUTPUT_DIR}/nodes_${role}.yml"
    log "INFO: synced ${role}, ${target_count} targets"
  else
    log "WARN: no targets found for ${role}, skipping update"
  fi
done

rm -f "${TEMP_FILE}"
log "INFO: CMDB sync finished"

Output:

[2024-12-15 1001] INFO: synced app-server, 186 targets
[2024-12-15 1001] INFO: synced db-server, 24 targets
[2024-12-15 1002] INFO: synced cache-server, 18 targets
[2024-12-15 1002] INFO: synced gateway, 8 targets
[2024-12-15 1002] INFO: CMDB sync finished


	

Case 2: Prometheus storage capacity planning script

Scenario: we are often asked "how big should the Prometheus disk be", so we wrote a script that estimates storage needs from the current ingest rate.

Implementation:

#!/bin/bash
# File: /opt/scripts/prometheus_capacity_plan.sh
# Purpose: estimate storage requirements from the current metric volume

PROM_URL="http://localhost:9090"

echo "========== Prometheus storage capacity plan =========="
echo ""

# Current number of active time series
active_series=$(curl -s "${PROM_URL}/api/v1/query?query=prometheus_tsdb_head_series" |
  jq -r '.data.result[0].value[1]')
echo "Active time series: ${active_series}"

# Samples ingested per second
samples_per_sec=$(curl -s "${PROM_URL}/api/v1/query?query=rate(prometheus_tsdb_head_samples_appended_total[5m])" |
  jq -r '.data.result[0].value[1]' | xargs printf "%.0f")
echo "Samples ingested per second: ${samples_per_sec}"

# Current size of the TSDB blocks
tsdb_size=$(curl -s "${PROM_URL}/api/v1/query?query=prometheus_tsdb_storage_blocks_bytes" |
  jq -r '.data.result[0].value[1]')
tsdb_size_gb=$(echo "scale=2; ${tsdb_size}/1024/1024/1024" | bc)
echo "Current TSDB size: ${tsdb_size_gb} GB"

# Retention setting
retention=$(curl -s "${PROM_URL}/api/v1/status/runtimeinfo" |
  jq -r '.data.storageRetention')
echo "Retention policy: ${retention}"

# Estimate the daily volume (about 1-2 bytes per sample after compression)
bytes_per_sample=1.5
daily_bytes=$(echo "scale=2; ${samples_per_sec} * 86400 * ${bytes_per_sample}" | bc)
daily_gb=$(echo "scale=2; ${daily_bytes}/1024/1024/1024" | bc)
echo ""
echo "---------- Capacity estimate ----------"
echo "Daily volume (estimated): ${daily_gb} GB"

for days in 7 15 30 90; do
  total=$(echo "scale=2; ${daily_gb} * ${days}" | bc)
  # Add 20% headroom
  total_with_buffer=$(echo "scale=2; ${total} * 1.2" | bc)
  echo "Keeping ${days} days needs: ${total_with_buffer} GB (incl. 20% headroom)"
done

echo ""
echo "Advice: expand the disk once usage passes 70%; do not wait until 80%"

Output:

========== Prometheus storage capacity plan ==========

Active time series: 487632
Samples ingested per second: 32508
Current TSDB size: 28.47 GB

Retention policy: 15d

---------- Capacity estimate ----------
Daily volume (estimated): 3.91 GB
Keeping 7 days needs: 32.84 GB (incl. 20% headroom)
Keeping 15 days needs: 70.38 GB (incl. 20% headroom)
Keeping 30 days needs: 140.76 GB (incl. 20% headroom)
Keeping 90 days needs: 422.28 GB (incl. 20% headroom)

Advice: expand the disk once usage passes 70%; do not wait until 80%


	

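4. Best Practices and Caveats

4.1 Best practices

4.1.1 Performance tuning

Storage - tuning retention and compaction: with large data volumes the default compaction can drive disk I/O up sharply. Setting both --storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration to 2h turns local compaction off, which is what Thanos Sidecar needs; we recommend it mainly for Thanos setups. Set retention according to actual need. In our experience 15 days locally works well, and queries beyond 15 days go through Thanos.

# Show the current TSDB status
curl -s http://localhost:9090/api/v1/status/tsdb | python3 -m json.tool

# Compaction runs automatically in the background. As far as we know the documented TSDB admin API
# only offers snapshot, delete_series and clean_tombstones, so there is no endpoint to trigger it by hand.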


	

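Query tuning - use recording rules: complex PromQL queries that dashboards run over and over slow Prometheus down. Turning frequent queries into recording rules pre-aggregates the result, and query latency can drop from seconds to milliseconds. One of our dashboards used to take 12 seconds to load and went down to 800 milliseconds after we added recording rules.

# Check the rule file syntax
promtool check rules /etc/prometheus/rules/recording_rules.yml

# Test a PromQL expression
promtool query instant http://localhost:9090 'instanceratio'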


	

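Scrape tuning - set scrape_interval per job: not every target needs to be scraped every 15 seconds. 15 seconds is fine for infrastructure metrics, business metrics can use 10 seconds, and slowly changing metrics such as hardware information only need 60 seconds. Setting scrape_interval per job reduced our scrape load by about 30%.

Label tuning - control time series cardinality: this is the number one Prometheus performance killer. If a label can take tens of thousands of values (user IDs, request URLs), the number of time series explodes. We learned this the hard way: a developer exposed user_id as a label, the series count went from 500,000 to 8 million within a day, and Prometheus was OOM-killed.

# Show the metrics with the most series
curl -s http://localhost:9090/api/v1/status/tsdb |
  jq '.data.seriesCountByMetricName | sort_by(-.value) | .[0:10]'

# Show the labels with the most values
curl -s http://localhost:9090/api/v1/status/tsdb |
  jq '.data.labelValueCountByLabelName | sort_by(-.value) | .[0:10]'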
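The cardinality blow-up follows from simple multiplication: in the worst case, the number of series for one metric is the product of the distinct values of each label. The sketch below illustrates this. The label counts are invented for the example, and real metrics usually stay below the worst case because not every combination occurs.

```shell
#!/usr/bin/env bash
# series_count N1 N2 ... -> worst-case series for one metric,
# where each argument is the number of distinct values of one label.
series_count() {
  local total=1 n
  for n in "$@"; do
    total=$(( total * n ))
  done
  echo "$total"
}

# Hypothetical metric with method (5), status (4) and endpoint (3) labels
echo "without user_id: $(series_count 5 4 3)"
# The same metric after adding a user_id label with 10000 distinct values
echo "with user_id:    $(series_count 5 4 3 10000)"
```

One extra unbounded label turns 60 series into 600000, which is why identifiers such as user IDs belong in logs or traces rather than in metric labels.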


	

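4.1.2 Security hardening

Basic auth: recent Prometheus 2.x releases support basic auth natively through the web configuration file. Enable it in production; an unprotected Prometheus lets anyone query your data.

# /etc/prometheus/web.yml
basic_auth_users:
  admin: $2a$12$KmR3iR5eJx5Oj5Yl5FpNOuJGQwMOsKOqJ7Mcp7hVQ8sKqGzLkjS6

# Generate a bcrypt password hash (prompts for the password)
htpasswd -nBC 12 "" | tr -d ':\n'

# Point Prometheus at the web configuration on startup
# --web.config.file=/etc/prometheus/web.yml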


	

TLS: traffic between Prometheus and exporters is plain HTTP by default. That is acceptable on an internal network, but TLS is mandatory across data centers or where compliance requires it.

# /etc/prometheus/web.yml, full configuration
tls_server_config:
  cert_file: /etc/prometheus/ssl/prometheus.crt
  key_file: /etc/prometheus/ssl/prometheus.key
  client_auth_type: RequireAndVerifyClientCert
  client_ca_file: /etc/prometheus/ssl/ca.crt

basic_auth_users:
  admin: $2a$12$KmR3iR5eJx5Oj5Yl5FpNOuJGQwMOsKOqJ7Mcp7hVQ8sKqGzLkjS6


	

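Network isolation: have Prometheus listen only on an internal IP, not on 0.0.0.0. If Grafana must be reachable from outside, put Nginx in front as a reverse proxy and add an IP allowlist and a WAF.

# Prometheus listens on the internal network only
--web.listen-address=10.0.1.40:9090

# Nginx reverse proxy for Grafana
# /etc/nginx/conf.d/grafana.conf

API access control: with --web.enable-admin-api enabled, data can be deleted through the API, so be careful in production. Enable it only temporarily when needed, or restrict the admin API in Nginx to the operations hosts.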

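4.1.3 High availability

Two Prometheus replicas: the simplest HA setup is two Prometheus instances with identical configuration scraping the same targets. Both send alerts to Alertmanager, whose deduplication prevents duplicate notifications. The two data sets differ slightly (timestamps are off by milliseconds), which rarely matters for monitoring.

Thanos: use Thanos when you need global queries and long-term storage. Each Prometheus gets a Thanos Sidecar next to it, blocks are uploaded to object storage (S3/MinIO), and Thanos Query handles global queries and deduplication. We have run this in production for three years across 5 Prometheus instances, and querying feels essentially the same as against a single Prometheus.

# Thanos Sidecar start command
thanos sidecar \
  --tsdb.path=/var/lib/prometheus \
  --prometheus.url=http://localhost:9090 \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902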


	

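Backups: the Prometheus TSDB supports snapshot backups that do not interrupt normal operation. The snapshot endpoint is part of the admin API, so it requires --web.enable-admin-api.

# Create a snapshot
curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot
# Snapshots are stored under /var/lib/prometheus/snapshots/

# Scheduled backup script
# Runs every day at 03:00 and keeps 7 days
0 3 * * * /opt/scripts/prometheus_backup.sh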


	

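4.2 Caveats

4.2.1 Configuration caveats

WARNING: getting any of the following wrong can cause data loss or a monitoring outage. Back up before you change anything.

After you shorten --storage.tsdb.retention.time, data outside the new window is deleted during the next cleanup and cannot be recovered. Confirm that the historical data is no longer needed before making the change.

Do not change external_labels casually once they are set. Thanos and federation rely on these labels for deduplication. After a change the instance is treated as a new data source and queries over historical data break.

A mistake in relabel_configs can drop targets unintentionally or overwrite labels. After editing, validate with promtool check config, then hot-reload through /-/reload and check the Targets page.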
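The validate-then-reload routine described above can be wrapped in a small helper. This is a sketch under a few assumptions: the function name safe_reload and the PROMTOOL/CURL override variables are made up for illustration, Prometheus runs with --web.enable-lifecycle, and the default URL is the local instance used throughout this article.

```shell
#!/usr/bin/env bash
# The two commands can be overridden, for example to dry-run the script.
PROMTOOL=${PROMTOOL:-promtool}
CURL=${CURL:-curl}

# safe_reload CONFIG [URL] -> validate CONFIG first, reload only if it passes.
# Returns 1 if validation fails and 2 if the reload request fails.
safe_reload() {
  local config=$1 url=${2:-http://localhost:9090}
  if ! "$PROMTOOL" check config "$config" >/dev/null; then
    echo "config check failed, not reloading" >&2
    return 1
  fi
  # -f makes curl return non-zero on an HTTP error status
  if ! "$CURL" -fsS -X POST "${url}/-/reload" >/dev/null; then
    echo "reload request failed" >&2
    return 2
  fi
  echo "reloaded"
}

# Usage: safe_reload /etc/prometheus/prometheus.yml
```

Distinct return codes let a cron job or CI step tell a rejected configuration apart from an unreachable server. A look at the Targets page afterwards is still worthwhile, since a syntactically valid relabel rule can drop targets.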

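4.2.2 Common errors

Symptom	Cause	Fix
Target shows "context deadline exceeded"	scrape_timeout is shorter than the target's response time	Increase scrape_timeout or make the exporter respond faster
Many "out of order sample" log lines	Timestamps arrive out of order, usually because clocks are not synchronized	Check NTP and keep the clock offset between all nodes below 1 second
Prometheus is OOM-killed right after startup	Too many time series; loading the head block exhausts memory	Add memory or remove high-cardinality metrics (--storage.tsdb.no-lockfile only disables the data directory lock file and does not reduce memory use)
Grafana panel shows "No data"	Wrong data source configuration or invalid PromQL	Test the query in the Prometheus UI first and confirm that it returns data
Configuration not applied after hot reload	The file has an error, the reload is rejected and Prometheus keeps the previous configuration	Read the Prometheus log, watch prometheus_config_last_reload_successful and pre-check with promtool check config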

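4.2.3 Compatibility

Versions: the Prometheus 2.x TSDB format is completely incompatible with 1.x, and there is no direct upgrade path. Within 2.x, releases are backward compatible, but avoid skipping too many versions at once and verify in a test environment first.

Platforms: Node Exporter may expose different metrics on different Linux distributions. For example, CentOS 7 uses cgroup v1 and Ubuntu 22.04 uses cgroup v2, so the paths of container-related metrics differ.

Component dependencies: in our experience Grafana 10.x needs Prometheus 2.40+, because some API calls fail against older Prometheus versions. Thanos Sidecar also has Prometheus version requirements; check the Thanos compatibility matrix.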

5. Troubleshooting and Monitoring

5.1 Troubleshooting

5.1.1 Reading the logs

# Follow the Prometheus log
sudo journalctl -u prometheus -f --no-pager

# Show recent errors
sudo journalctl -u prometheus --since "1 hour ago" | grep -iE "error|warn|fatal"

# Follow the Grafana log
sudo tail -f /var/log/grafana/grafana.log

# Follow the Node Exporter log
sudo journalctl -u node_exporter -f --no-pager


	

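5.1.2 Common problems

Problem 1: Prometheus does not start because the TSDB is corrupted

We have seen this twice, both times after an unexpected power loss. The Prometheus WAL (Write-Ahead Log) had not been flushed to disk, and the TSDB check failed on restart.

# Look for the error
journalctl -u prometheus -n 100 | grep -iE "corrupt|error|wal"
# Typical message: opening storage failed: repair failed

# Prometheus attempts a WAL repair by itself at startup. To our knowledge promtool has no
# "tsdb repair" subcommand; use it to inspect the blocks instead
promtool tsdb list /var/lib/prometheus

# If the automatic repair fails, move the damaged WAL away and restart
# (this loses roughly the last 2 hours of data that had not been persisted into blocks)
sudo systemctl stop prometheus
ls -la /var/lib/prometheus/wal/
# Back up the WAL, then recreate the directory
sudo mv /var/lib/prometheus/wal /var/lib/prometheus/wal.bak
sudo mkdir /var/lib/prometheus/wal
sudo chown prometheus:prometheus /var/lib/prometheus/wal
sudo systemctl start prometheus

Resolution:

Let Prometheus attempt its automatic repair on startup and inspect the blocks with promtool tsdb list

If the repair fails, back up and remove the WAL directory

Restart Prometheus and check data integrity

Afterwards add a UPS or a battery-backed RAID controller so that a power loss no longer corrupts data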

Problem 2: Prometheus is killed by the OOM killer

# Confirm that the OOM killer was responsible
dmesg | grep -iE "oom|killed process"
journalctl -k | grep -i "oom"

# Current memory usage
curl -s 'http://localhost:9090/api/v1/query?query=process_resident_memory_bytes' |
  jq -r '.data.result[0].value[1]' | awk '{printf "%.2f GB\n", $1/1024/1024/1024}'

# Number of time series, the main driver of memory consumption
curl -s 'http://localhost:9090/api/v1/query?query=prometheus_tsdb_head_series' |
  jq -r '.data.result[0].value[1]'

Resolution:

Be alert once the series count passes 5 million; above 10 million you generally need 8 cores / 32 GB or more

Find high-cardinality metrics on the TSDB Status page, which lists the metrics with the most series

Drop labels you do not need at scrape time with relabel_configs

Split Prometheus into several instances by business line or environment

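Problem 3: a target is DOWN although the service is healthy

# Curl the target's metrics endpoint by hand
curl -v http://10.0.1.10:9100/metrics 2>&1 | head -20

# Check network connectivity
telnet 10.0.1.10 9100

# Check DNS from the Prometheus host (reverse lookup shown here; if targets are configured by name, query that name)
dig +short -x 10.0.1.10

# Show target details from Prometheus
curl -s http://localhost:9090/api/v1/targets |
  jq '.data.activeTargets[] | select(.health=="down") | {instance: .labels.instance, lastError: .lastError}'

Resolution:

Check the firewall rules and confirm that port 9100 is open to the Prometheus server

Check whether the exporter is bound to 127.0.0.1 instead of 0.0.0.0

If service discovery is in use, check that the discovered address is correct

Check whether scrape_timeout is too short; slow exporters need a longer timeout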

Problem 4: high cardinality degrades performance

# Top 10 metrics by series count
curl -s http://localhost:9090/api/v1/status/tsdb |
  jq -r '.data.seriesCountByMetricName | sort_by(-.value) | .[0:10][] | "\(.name): \(.value)"'

# Top 10 labels by number of values
curl -s http://localhost:9090/api/v1/status/tsdb |
  jq -r '.data.labelValueCountByLabelName | sort_by(-.value) | .[0:10][] | "\(.name): \(.value)"'

# Series count of one specific metric
curl -s 'http://localhost:9090/api/v1/query?query=count(http_requests_total)' |
  jq '.data.result[0].value[1]'

Resolution:

Identify the high-cardinality labels and agree with the developers on removing the unnecessary ones

Drop high-cardinality labels after the scrape with metric_relabel_configs

If historical data is the cause, delete the affected series through the admin API (requires --web.enable-admin-api):

# Delete the data of specific series (dangerous, verify in a test environment first)
curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=http_requests_total{user_id!=""}'
# Reclaim the disk space of the deleted data
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones


	

5.1.3 Debug mode

# Enable Prometheus debug logging (very verbose, switch it off again when you are done)
# Edit the systemd unit and add --log.level=debug
sudo systemctl edit prometheus
# Add to the [Service] section:
# ExecStart=
# ExecStart=/usr/local/bin/prometheus --log.level=debug ...other flags

# The log level is a startup flag. As far as we know Prometheus has no HTTP endpoint
# for changing it at runtime, so apply the change with a restart
sudo systemctl restart prometheus

# Enable Grafana debug logging
# Edit /etc/grafana/grafana.ini
# [log]
# level = debug

# Inspect Prometheus's own metrics when chasing performance problems
curl -s http://localhost:9090/metrics | grep prometheus_engine_query_duration
curl -s http://localhost:9090/metrics | grep prometheus_tsdb


	

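5.2 Performance monitoring

5.2.1 Key metrics

# Key metrics of Prometheus itself
# -G with --data-urlencode keeps braces and quotes in the PromQL from breaking the URL
# Actual scrape interval (P99)
curl -s -G http://localhost:9090/api/v1/query --data-urlencode 'query=prometheus_target_interval_length_seconds{quantile="0.99"}' | jq .

# Query engine duration
curl -s -G http://localhost:9090/api/v1/query --data-urlencode 'query=prometheus_engine_query_duration_seconds{quantile="0.99"}' | jq .

# WAL size
curl -s -G http://localhost:9090/api/v1/query --data-urlencode 'query=prometheus_tsdb_wal_storage_size_bytes' | jq .

# Memory usage
curl -s -G http://localhost:9090/api/v1/query --data-urlencode 'query=process_resident_memory_bytes{job="prometheus"}' | jq .

# Number of failing targets (count, not sum: the matching samples all have the value 0)
curl -s -G http://localhost:9090/api/v1/query --data-urlencode 'query=count(up{job="node-exporter"} == 0)' | jq .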


	

5.2.2 Metric reference

Metric	Normal range	Alert threshold	Notes
prometheus_tsdb_head_series	Depends on scale	>5000000	Active time series; watch memory above 5 million
prometheus_target_scrape_pool_exceeded_target_limit_total	0	>0	Target count exceeded the limit; adjust target_limit
prometheus_engine_query_duration_seconds{quantile="0.99"}	<2s	>10s	P99 query latency; above 10 seconds the queries are too heavy
process_resident_memory_bytes	<70% of total memory	>80% of total memory	Memory usage; OOM risk above 80%
prometheus_tsdb_compactions_failed_total	0	>0	Compaction failed, possibly due to lack of disk space
prometheus_rule_evaluation_failures_total	0	>0	Rule evaluation failed; check the PromQL
		

5.2.3 Alert rules for Prometheus self-monitoring

# File: /etc/prometheus/rules/prometheus_self_rules.yml
groups:
  - name: prometheus_self_monitoring
    rules:
      - alert: PrometheusTargetDown
        expr: up{job="prometheus"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus instance {{ $labels.instance }} is down"

      # Note: the two sides of this division carry different job/instance labels, so it only
      # returns data if both series share a matching label set (for example a common host label
      # joined with on())
      - alert: PrometheusHighMemory
        expr: process_resident_memory_bytes{job="prometheus"} / node_memory_MemTotal_bytes * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus memory usage above 80%"
          description: "Current memory usage: {{ $value | humanize }}%"

      - alert: PrometheusHighQueryDuration
        expr: prometheus_engine_query_duration_seconds{quantile="0.99"} > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus P99 query latency above 10 seconds"

      - alert: PrometheusTSDBCompactionsFailed
        expr: increase(prometheus_tsdb_compactions_failed_total[1h]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus TSDB compaction failed"
          description: "A compaction failed within the last hour; check disk space and TSDB status"

      - alert: PrometheusRuleEvaluationFailures
        expr: increase(prometheus_rule_evaluation_failures_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus rule evaluation failed"

      - alert: PrometheusHighScrapeInterval
        expr: prometheus_target_interval_length_seconds{quantile="0.99"} > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 scrape interval above 30 seconds, scrapes may be backing up"

      - alert: PrometheusHighCardinality
        expr: prometheus_tsdb_head_series > 5000000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "More than 5 million time series"
          description: "Current series count: {{ $value }}; keep an eye on memory usage"


	

5.3 Backup and restore

5.3.1 Backup strategy

#!/bin/bash
# File: /opt/scripts/prometheus_backup.sh
# Purpose: Prometheus TSDB snapshot backup (needs --web.enable-admin-api)
# Crontab: 0 3 * * * /opt/scripts/prometheus_backup.sh

set -euo pipefail

PROM_URL="http://localhost:9090"
BACKUP_DIR="/data/backup/prometheus"
TSDB_PATH="/var/lib/prometheus"
KEEP_DAYS=7
DATE=$(date +%Y%m%d_%H%M%S)
LOG_FILE="/var/log/prometheus/backup.log"

log() {
  echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >> "$LOG_FILE"
}

# Create the snapshot
log "INFO: creating TSDB snapshot"
snapshot_response=$(curl -s -X POST "${PROM_URL}/api/v1/admin/tsdb/snapshot")
snapshot_name=$(echo "$snapshot_response" | jq -r '.data.name')

if [ -z "$snapshot_name" ] || [ "$snapshot_name" = "null" ]; then
  log "ERROR: snapshot creation failed: ${snapshot_response}"
  exit 1
fi

log "INFO: snapshot created: ${snapshot_name}"

# Compress the snapshot
mkdir -p "${BACKUP_DIR}"
tar czf "${BACKUP_DIR}/prometheus_snapshot_${DATE}.tar.gz" \
  -C "${TSDB_PATH}/snapshots" "${snapshot_name}"

backup_size=$(du -sh "${BACKUP_DIR}/prometheus_snapshot_${DATE}.tar.gz" | awk '{print $1}')
log "INFO: backup size: ${backup_size}"

# Remove the snapshot directory to free disk space
rm -rf "${TSDB_PATH}/snapshots/${snapshot_name}"

# Remove expired backups; count them before deleting, otherwise the count is always 0
deleted_count=$(find "${BACKUP_DIR}" -name "prometheus_snapshot_*.tar.gz" -mtime +${KEEP_DAYS} | wc -l)
find "${BACKUP_DIR}" -name "prometheus_snapshot_*.tar.gz" -mtime +${KEEP_DAYS} -delete
log "INFO: removed ${deleted_count} expired backups"

log "INFO: backup finished"


	

	5.3.2 恢復(fù)流程

	停止Prometheus服務(wù)：

	
sudo systemctl stop prometheus


	

	恢復(fù)數(shù)據(jù)：

	
# 備份當(dāng)前數(shù)據(jù)目錄
sudo mv /var/lib/prometheus /var/lib/prometheus.old

# 解壓備份
sudo mkdir -p /var/lib/prometheus
sudo tar xzf /data/backup/prometheus/prometheus_snapshot_20241215_030001.tar.gz 
  -C /var/lib/prometheus --strip-components=1

# 設(shè)置權(quán)限
sudo chown -R prometheus:prometheus /var/lib/prometheus


	

Verify integrity:

# List the restored TSDB blocks with promtool to confirm they are readable
promtool tsdb list /var/lib/prometheus


	

Restart the service:

sudo systemctl start prometheus

# Verify the restored data: count the series returned for "up"
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq '.data.result | length'
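The restore steps use a fixed archive name. A small helper can pick the newest archive instead. This is a sketch that assumes the prometheus_snapshot_YYYYmmdd_HHMMSS.tar.gz naming from the backup script, whose timestamps sort chronologically as plain text. latest_backup is a hypothetical name.

```shell
#!/bin/bash
# Sketch: print the newest snapshot archive in a directory.
# latest_backup is a hypothetical helper; it relies on the timestamp
# suffix in the file name rather than on file modification times.
latest_backup() {
  local dir="$1"
  local newest=""
  local f
  for f in "$dir"/prometheus_snapshot_*.tar.gz; do
    # an unmatched glob stays literal, so skip anything that is not a file
    [ -f "$f" ] || continue
    # globs expand in sorted order, so the last match is the newest
    newest="$f"
  done
  [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

The unpack step could then read `sudo tar xzf "$(latest_backup /data/backup/prometheus)" -C /var/lib/prometheus --strip-components=1`. The function returns a non-zero status when the directory holds no archive.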


	

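6. Summary

6.1 Key Technical Points

Prometheus's pull model is the source of its architectural advantage: the monitoring side stays in control and notices a dead target on the next scrape. A scrape_interval of 15 seconds is the best cost/benefit trade-off; scraping 500 targets keeps CPU overhead under 10%.

The TSDB storage engine's performance bottleneck is time series cardinality, not data volume. 500,000 series run comfortably in 4 GB of memory, but 5 million series need at least 16 GB. Controlling label cardinality is the core skill in operating Prometheus.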
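Because every distinct combination of label values is its own series, the worst-case series count of one metric is the product of the number of values each label can take. The sketch below shows that arithmetic. The function name max_series and the label counts in the usage note are made up for illustration.

```shell
#!/bin/bash
# Sketch: upper bound on the series count of one metric.
# max_series is a hypothetical helper; each argument is the number of
# distinct values one label can take.
max_series() {
  local total=1
  local n
  for n in "$@"; do
    total=$(( total * n ))
  done
  echo "$total"
}
```

For a metric with 5 methods, 10 status codes, and 200 paths, `max_series 5 10 200` gives 10000. Adding a label with 1000 possible values multiplies that by 1000, which is how a single unbounded label pushes an instance toward the multi-million range.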

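Recording rules are the first tool to reach for when optimizing query performance. Rewriting the complex PromQL that dashboards run over and over as pre-aggregation rules can cut query latency by an order of magnitude. Our production dashboards' average load time dropped from 6 seconds to 1.2 seconds.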
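As a rough illustration of what such a pre-aggregation rule looks like, the sketch below writes a minimal rule file. The helper write_recording_rule, the metric http_requests_total, and the rule name job:http_requests:rate5m are illustrative assumptions, not rules from this article's setup.

```shell
#!/bin/bash
# Sketch: write a minimal recording-rule file.
# write_recording_rule, http_requests_total and the rule name are
# illustrative assumptions.
write_recording_rule() {
  local out="$1"
  cat > "$out" <<'EOF'
groups:
  - name: example_recording_rules
    rules:
      # precompute the per-job request rate so dashboards read one cheap series
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
EOF
}
```

After `write_recording_rule /etc/prometheus/rules/example.yml`, the file can be validated with `promtool check rules`, and a dashboard panel would then query job:http_requests:rate5m instead of the full expression.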

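Grafana's provisioning mechanism delivers configuration as code: data sources and dashboards can both be managed as YAML files, and together with Git version control this gives consistent environments and traceable changes.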
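A data source provisioning file is short enough to generate from a script. The sketch below assumes a Prometheus reachable at http://localhost:9090 and a data source named Prometheus. Both values, and the helper name write_datasource, are assumptions for illustration.

```shell
#!/bin/bash
# Sketch: write a Grafana data source provisioning file.
# write_datasource is a hypothetical helper; the data source name and
# default URL are assumptions for a Prometheus on the same host.
write_datasource() {
  local out="$1"
  local prom_url="${2:-http://localhost:9090}"
  cat > "$out" <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: ${prom_url}
    isDefault: true
EOF
}
```

Written to /etc/grafana/provisioning/datasources/prometheus.yml and committed to Git, the file is picked up the next time grafana-server starts.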

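Choosing a high-availability setup: at small scale, run two Prometheus replicas plus an Alertmanager cluster; at medium to large scale, move to Thanos or VictoriaMetrics. Our team evolved from dual replicas to Thanos, and the transition was smooth.

Security cannot be ignored: basic auth plus TLS is the baseline, the admin API needs access control, and Grafana should have anonymous access and sign-up disabled.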

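6.2 Further Learning

Thanos for global monitoring: when a single Prometheus can no longer cope, or you need cross-cluster queries, Thanos is currently the most mature option. Focus on deploying and tuning the Sidecar mode, the Store Gateway, and the Compactor.

Learning resource: Thanos official documentation https://thanos.io/tip/thanos/getting-started.md/

Practice tip: build a minimal Thanos (Sidecar + Query + Store) in a test environment first, and plan the production rollout only after the data path works end to end.

Advanced PromQL: master subqueries, the predict_linear prediction function, and histogram_quantile quantile calculation. These come up constantly when writing alert rules and dashboards.

Learning resource: the PromQL cheat sheet from PromLabs https://promlabs.com/promql-cheat-sheet/

Practice tip: practice on the Graph page of the Prometheus UI, starting with simple rate/sum and gradually combining more complex expressions.

OpenTelemetry integration: the trend in monitoring is the convergence of metrics, traces, and logs. Prometheus already supports receiving metrics over the OpenTelemetry protocol, and Grafana is pushing integration with Tempo (traces) and Loki (logs).

Learning resource: OpenTelemetry official documentation https://opentelemetry.io/docs/

Practice tip: pilot the OpenTelemetry SDK on a single service first and correlate its metrics with its traces.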

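6.3 References

Prometheus official documentation - the most authoritative reference, with detailed explanations of configuration parameters

Grafana official documentation - guide to dashboard configuration and data source integration

Prometheus GitHub - source code and issues; many hard problems are answered in the issues

Awesome Prometheus Alerts - a community-maintained collection of alert rules, ready to use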

Appendix

A. Command Cheat Sheet

# Prometheus
promtool check config /etc/prometheus/prometheus.yml   # check config syntax
promtool check rules /etc/prometheus/rules/*.yml       # check rule syntax
promtool tsdb repair /var/lib/prometheus               # repair the TSDB; run `promtool tsdb --help` first to confirm your version provides it
curl -X POST http://localhost:9090/-/reload            # hot-reload config (needs --web.enable-lifecycle)
curl -X POST http://localhost:9090/api/v1/admin/tsdb/snapshot   # create a snapshot (needs --web.enable-admin-api)
curl -s http://localhost:9090/api/v1/targets | jq .    # show target status
curl -s http://localhost:9090/api/v1/alerts | jq .     # show active alerts
curl -s http://localhost:9090/api/v1/status/tsdb | jq .   # show TSDB status

# Grafana
grafana-cli plugins list-remote                        # list available plugins
grafana-cli plugins install grafana-piechart-panel     # install a plugin
sudo systemctl restart grafana-server                  # restart Grafana
curl -s http://admin:pass@localhost:3000/api/datasources | jq .   # list data sources

# Node Exporter
curl -s http://localhost:9100/metrics | grep node_cpu  # show CPU metrics
curl -s http://localhost:9100/metrics | wc -l          # count metric lines


	

B. Configuration Parameter Reference

Prometheus startup flags:

Flag	Default	Description
--config.file	prometheus.yml	Path to the main configuration file
--storage.tsdb.path	data/	TSDB data directory
--storage.tsdb.retention.time	15d	How long data is retained
--storage.tsdb.retention.size	0 (unlimited)	Upper limit on retained data size
--storage.tsdb.min-block-duration	2h	Minimum block duration
--storage.tsdb.max-block-duration	36h (10% of retention)	Maximum block duration; set to 2h when using Thanos
--web.listen-address	0.0.0.0:9090	Listen address
--web.enable-lifecycle	false	Enable the hot-reload and shutdown APIs
--web.enable-admin-api	false	Enable the admin API (deleting data, etc.)
--query.max-concurrency	20	Maximum number of concurrent queries
--query.timeout	2m	Query timeout
--query.max-samples	50000000	Maximum number of samples a single query may load
		

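prometheus.yml global settings:

Parameter	Default	Description
scrape_interval	1m	Global scrape interval; 15s recommended for production
scrape_timeout	10s	Scrape timeout; must not exceed scrape_interval
evaluation_interval	1m	Rule evaluation interval; best kept equal to scrape_interval
external_labels	none	External labels that identify the source in federation and remote storage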
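Prometheus refuses to load a configuration whose scrape_timeout is larger than its scrape_interval, so it can be worth checking the pair before editing prometheus.yml. The sketch below assumes both values are given as whole seconds. check_scrape_timing is a hypothetical helper, not part of promtool.

```shell
#!/bin/bash
# Sketch: validate scrape timing before editing prometheus.yml.
# check_scrape_timing is a hypothetical helper; both arguments are whole seconds.
check_scrape_timing() {
  local interval="$1"
  local timeout="$2"
  if [ "$timeout" -gt "$interval" ]; then
    echo "scrape_timeout (${timeout}s) exceeds scrape_interval (${interval}s)" >&2
    return 1
  fi
  return 0
}
```

`check_scrape_timing 15 10` succeeds, while `check_scrape_timing 15 30` prints a message and returns a non-zero status. The authoritative check remains `promtool check config` on the finished file.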
		

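C. Glossary

Term	Definition
Time Series	A data stream uniquely identified by a metric name and a set of labels; each data point holds a timestamp and a value
Cardinality	The number of time series of a metric, determined by the number of label combinations. High cardinality is a performance killer
Pull Model	Prometheus actively fetches metrics from targets, as opposed to a push model
Service Discovery	The mechanism for finding monitoring targets automatically; supports Consul, K8s, files, and other methods
Recording Rule	A pre-aggregation rule that stores the result of a complex query as a new time series to speed up queries
TSDB (Time Series Database)	Prometheus's built-in time series database, responsible for storing and querying data
WAL (Write-Ahead Log)	A log written before the data itself, so that data is not lost after a crash
Compaction	The TSDB process that merges small blocks into larger ones to improve query efficiency
Exporter	A component that exposes metrics by converting a third-party system's metrics into the Prometheus format
PromQL (Prometheus Query Language)	Prometheus's query language, with support for vector operations and aggregation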

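Symptom	Cause	Fix
Target status shows "context deadline exceeded"	scrape_timeout is shorter than the target's response time	Increase scrape_timeout, or speed up the exporter
"out of order sample" appears in the logs in bulk	Timestamps arrive out of order, usually because clocks are not synchronized	Check NTP sync and keep clock skew across all nodes under 1 second
Prometheus is OOM-killed right after startup	Too many time series; loading the head block exhausts memory	Add memory, or find the high-cardinality metrics (for example with promtool tsdb analyze) and remove them
Grafana panel shows "No data"	Wrong data source configuration or a PromQL syntax error	Test the query in the Prometheus UI first and confirm it returns data
Config changes do not take effect after a hot reload	The config file has a syntax error, so the reload failed quietly	Check the Prometheus log, and pre-check with promtool check config

Metric	Normal range	Alert threshold	Description
prometheus_tsdb_head_series	Depends on scale	>5000000	Active time series; above 5 million, watch memory
prometheus_target_scrape_pool_exceeded_target_limit_total	0	>0	Target count exceeded the limit; adjust target_limit
prometheus_engine_query_duration_seconds{quantile="0.99"}	<2s	>10s	P99 query latency; above 10 seconds the queries are too heavy
process_resident_memory_bytes	<70% of total memory	>80% of total memory	Memory usage; above 80% there is a risk of OOM
prometheus_tsdb_compactions_failed_total	0	>0	Compaction failures, possibly caused by low disk space
prometheus_rule_evaluation_failures_total	0	>0	Rule evaluation failures; check the PromQL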



Original title: Saying Goodbye to Monitoring Blind Spots: Prometheus + Grafana Enterprise Monitoring in Practice

Source: WeChat ID magedu-Linux, WeChat official account 馬哥Linux運維. Please credit the source when reposting.
