Configs · April 10, 2026 · 1 min read

Prometheus — Open Source Monitoring & Alerting Toolkit

Prometheus is the CNCF-graduated monitoring system and time series database. Pull-based metrics collection, powerful PromQL queries, and built-in alerting for cloud-native infrastructure.

AI Open Source · Community
Quick Start

Try it first, then decide whether to dig deeper


docker run -d --name prometheus -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus:latest

Create prometheus.yml in the current directory before running the command above:

global:
  scrape_interval: 15s
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

Open http://localhost:9090 and start querying metrics with PromQL.
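The same queries can also be sent to the server's HTTP API instead of the web UI. The following is a minimal sketch, assuming the Quick Start container is reachable at http://localhost:9090; the helper names (build_query_url, parse_instant_vector, query) are illustrative and not part of any Prometheus client library.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the Quick Start container is listening here.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(expr, base_url=PROMETHEUS_URL):
    """Return the instant-query URL (/api/v1/query) for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def parse_instant_vector(payload):
    """Turn an instant-vector response into a list of (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    return [
        (sample["metric"], float(sample["value"][1]))
        for sample in payload["data"]["result"]
    ]


def query(expr, base_url=PROMETHEUS_URL):
    """Run an instant query against a live server (needs network access)."""
    url = build_query_url(expr, base_url)
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_instant_vector(json.load(resp))
```

With the container running, calling query("up") should return one entry per scrape target, with a value of 1.0 for targets whose last scrape succeeded.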

Introduction

Prometheus is an open-source monitoring system and time series database, originally built at SoundCloud and now a CNCF graduated project (same status as Kubernetes). It collects metrics from configured targets at given intervals, evaluates rule expressions, displays results, and triggers alerts when specified conditions are observed.

With 63.5K+ GitHub stars and an Apache-2.0 license, Prometheus is the de facto standard for cloud-native monitoring, deeply integrated with Kubernetes and the entire CNCF ecosystem.

What Prometheus Does

  • Metrics Collection: Pull-based metrics scraping from instrumented applications and exporters
  • Time Series DB: Efficient local storage optimized for time series data with compression
  • PromQL: Powerful query language for slicing, dicing, and aggregating time series data
  • Alerting: Alert rules with Alertmanager for routing, grouping, and notification
  • Service Discovery: Auto-discover targets from Kubernetes, Consul, DNS, EC2, and more
  • Exporters: 500+ exporters for databases, hardware, messaging, storage, and cloud services
  • Federation: Hierarchical federation for scaling across multiple Prometheus instances

Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ Targets      │◀────│  Prometheus  │────▶│  Alertmanager│
│ (Exporters)  │pull │  Server      │     │  (Notify)    │
│              │     │  TSDB + Rules│     └──────────────┘
└──────────────┘     └──────┬───────┘
                            │
                     ┌──────┴───────┐
                     │  Grafana     │
                     │  (Visualize) │
                     └──────────────┘

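Key design principle: Prometheus is pull-based. It scrapes metrics from HTTP endpoints rather than having applications push them. Because the server initiates every scrape, a failed scrape is visible immediately (the synthetic up series for that target drops to 0), which makes it easier to detect when a target is down.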
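To make the pull model concrete, here is a rough standard-library sketch of what a target has to offer: an HTTP endpoint that returns the current metric values in the text exposition format. The names REQUESTS, render_counter, MetricsHandler, and serve are hypothetical, port 8000 is an arbitrary choice, and label-value escaping is left out for brevity. A real application would normally use an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter state: tuple of (label, value) pairs -> count.
REQUESTS = {(("method", "GET"), ("status", "200")): 3}


def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers GET /metrics, which is all a scrape amounts to."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_counter(
            "http_requests_total", "Total HTTP requests", REQUESTS
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever serving /metrics (port 8000 is an arbitrary choice)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

The application never needs to know where Prometheus runs. It only answers requests, and an unanswered request is itself the signal that the target is down.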

Self-Hosting

Docker Compose (Full Stack)

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert.rules.yml:/etc/prometheus/alert.rules.yml
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

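  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    command:
      # Read host metrics from the mounted paths, not the container's own /proc and /sys
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"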

volumes:
  prometheus-data:

Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - "alert.rules.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]

  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

PromQL Essentials

# Instant vector — current value
up{job="node"}

# Range vector: values over time
node_cpu_seconds_total[5m]

# Rate — per-second rate of increase
rate(http_requests_total[5m])

# Aggregation — sum across instances
sum(rate(http_requests_total[5m])) by (method, status)

# Histogram quantile — P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Arithmetic — error percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m])) * 100

# Prediction — disk full in 4 hours?
predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0

Instrumenting Your App

Go

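package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var httpRequests = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests",
    },
    []string{"method", "status"},
)

func init() { prometheus.MustRegister(httpRequests) }

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        httpRequests.WithLabelValues(r.Method, "200").Inc() // count each request
    })
    http.Handle("/metrics", promhttp.Handler()) // endpoint Prometheus scrapes
    log.Fatal(http.ListenAndServe(":8080", nil)) // port 8080 is an arbitrary choice
}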

Python

import time

from prometheus_client import Counter, start_http_server

REQUEST_COUNT = Counter('http_requests_total', 'Total requests', ['method', 'status'])

start_http_server(8000)  # Expose /metrics on port 8000
REQUEST_COUNT.labels(method='GET', status='200').inc()

while True:  # Keep the process alive so Prometheus can scrape it
    time.sleep(1)

Popular Exporters

Exporter             Metrics
Node Exporter        CPU, memory, disk, network (Linux)
cAdvisor             Container resource usage
MySQL Exporter       Query performance, connections
PostgreSQL Exporter  Database stats, replication
Redis Exporter       Memory, keys, commands
Blackbox Exporter    HTTP, DNS, TCP, ICMP probes
NGINX Exporter       Requests, connections, status

Prometheus vs Alternatives

Feature            Prometheus            InfluxDB       Datadog      VictoriaMetrics
Open Source        Yes (Apache-2.0)      Partial        No           Yes (Apache-2.0)
Collection         Pull-based            Push-based     Agent        Pull + Push
Query              PromQL                InfluxQL/Flux  Proprietary  MetricsQL
CNCF               Graduated             No             No           No
Long-term storage  Needs remote storage  Built-in       Built-in     Built-in
Kubernetes         Native                Plugin         Agent        Native

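FAQ

Q: How long does Prometheus retain data? A: 15 days by default. Adjust it with --storage.tsdb.retention.time. For long-term storage, use a remote storage solution such as Thanos or Cortex.

Q: Is Prometheus suitable for log collection? A: No. Prometheus is built specifically for numeric metrics. For logs, consider Loki from Grafana Labs, which pairs well with Prometheus.

Q: How many metrics can a single Prometheus instance scrape? A: A single instance can handle millions of active time series. Very large environments can scale horizontally with federation or Thanos/Mimir.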
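Retention and series count together determine how much local disk an instance needs. The sketch below applies the rough sizing rule from the Prometheus storage documentation (retention in seconds × ingested samples per second × bytes per sample). The default of 2 bytes per sample is the upper end of the commonly quoted 1 to 2 byte rule of thumb, and the function name and example numbers are illustrative, so treat the result as an order-of-magnitude estimate.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Estimate local TSDB disk usage for a single Prometheus instance.

    Assumes every active series yields one sample per scrape interval.
    """
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample


# Illustrative numbers: 150k series, 15s scrape interval, default 15d retention.
estimate = estimate_disk_bytes(150_000, 15, 15)
estimate_gb = estimate / 1e9
```

For these example inputs the estimate comes to roughly 26 GB. Doubling either the retention period or the number of active series doubles it.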
