What Prometheus Does
- Metrics Collection: Pull-based metrics scraping from instrumented applications and exporters
- Time Series DB: Efficient local storage optimized for time series data with compression
- PromQL: Powerful query language for slicing, dicing, and aggregating time series data
- Alerting: Alert rules with Alertmanager for routing, grouping, and notification
- Service Discovery: Auto-discover targets from Kubernetes, Consul, DNS, EC2, and more
- Exporters: 500+ exporters for databases, hardware, messaging, storage, and cloud services
- Federation: Hierarchical federation for scaling across multiple Prometheus instances
Architecture
```
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│   Targets    │◀─────│  Prometheus  │─────▶│ Alertmanager │
│ (Exporters)  │ pull │    Server    │      │   (Notify)   │
│              │      │ TSDB + Rules │      └──────────────┘
└──────────────┘      └──────┬───────┘
                             │
                      ┌──────┴───────┐
                      │   Grafana    │
                      │ (Visualize)  │
                      └──────────────┘
```

Key design principle: pull-based collection. Prometheus scrapes metrics over HTTP from each target rather than having applications push metrics. This makes it straightforward to detect when a target is down: a failed scrape records `up == 0` for that target.
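That down-detection can be sketched in a few lines: the scraper itself records an `up` sample of 1 or 0 per target depending on whether the scrape succeeded. This is illustrative Python, not Prometheus internals, and the target address is assumed to have nothing listening on it:

```python
# Toy sketch of pull-based scraping: the scraper records `up` for every
# target, so a dead target shows up as up == 0 with no heartbeat logic.
import urllib.request

def scrape(target, timeout=2.0):
    """Return (up, body): up=1 if the /metrics endpoint answered, else 0."""
    try:
        with urllib.request.urlopen(f"http://{target}/metrics",
                                    timeout=timeout) as resp:
            return 1, resp.read().decode()
    except OSError:  # connection refused, timeout, DNS failure, ...
        return 0, ""

# A target that is not listening simply yields up == 0.
up, _ = scrape("127.0.0.1:59999")  # assumed-unused port
```

In push-based systems the collector must instead infer "down" from a missing heartbeat; here the failed scrape is the signal.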
Self-Hosting
Docker Compose (Full Stack)
```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert.rules.yml:/etc/prometheus/alert.rules.yml
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    command:
      # Point the exporter at the host filesystems mounted above.
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"

volumes:
  prometheus-data:
```

Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - "alert.rules.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]

  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
```

PromQL Essentials
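Most of the queries below lean on `rate()` over monotonically increasing counters. As a mental model, this simplified sketch captures the core semantics (per-second increase over the window, tolerating counter resets); the real `rate()` additionally extrapolates to the window edges, so the numbers differ slightly in practice:

```python
# Simplified model of PromQL rate(): per-second increase of a counter
# over a window, treating any decrease as a counter reset.
def simple_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), oldest first."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # On a reset the counter restarts from zero, so the whole new
        # value counts as increase.
        increase += cur - prev if cur >= prev else cur
    return increase / (samples[-1][0] - samples[0][0])

# A 60s window scraped every 15s, with a counter reset at t=30:
samples = [(0, 100), (15, 160), (30, 20), (45, 60), (60, 120)]
simple_rate(samples)  # → 3.0 per second (60 + 20 + 40 + 60 = 180 over 60s)
```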
```promql
# Instant vector — current value
up{job="node"}

# Range vector — values over time
node_cpu_seconds_total[5m]

# Rate — per-second rate of increase
rate(http_requests_total[5m])

# Aggregation — sum across instances
sum(rate(http_requests_total[5m])) by (method, status)

# Histogram quantile — P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Arithmetic — error percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m])) * 100

# Prediction — disk full in 4 hours?
predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
```

Instrumenting Your App
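Under the hood, the client libraries shown below do little more than keep labeled counters in memory and render them in the Prometheus text exposition format when `/metrics` is scraped. A minimal hand-rolled sketch (illustrative only; the class and method names are not the real client API):

```python
# Hand-rolled sketch of a labeled counter and its text exposition output.
# Real applications should use an official client library instead.
from collections import defaultdict

class CounterVec:
    def __init__(self, name, help_text, label_names):
        self.name, self.help, self.label_names = name, help_text, label_names
        self.values = defaultdict(float)

    def inc(self, *label_values, amount=1.0):
        self.values[tuple(label_values)] += amount

    def expose(self):
        # HELP/TYPE comments, then one sample line per label combination.
        lines = [f"# HELP {self.name} {self.help}",
                 f"# TYPE {self.name} counter"]
        for label_values, value in sorted(self.values.items()):
            labels = ",".join(f'{n}="{v}"'
                              for n, v in zip(self.label_names, label_values))
            lines.append(f"{self.name}{{{labels}}} {value}")
        return "\n".join(lines) + "\n"

requests = CounterVec("http_requests_total", "Total HTTP requests",
                      ["method", "status"])
requests.inc("GET", "200")
requests.inc("GET", "200")
requests.inc("POST", "500")
# expose() now yields lines such as:
#   http_requests_total{method="GET",status="200"} 2.0
```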
Go
```go
import "github.com/prometheus/client_golang/prometheus"

var httpRequests = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests",
    },
    []string{"method", "status"},
)

func init() { prometheus.MustRegister(httpRequests) }

// In a handler:
httpRequests.WithLabelValues("GET", "200").Inc()
```

Python
```python
from prometheus_client import Counter, start_http_server

REQUEST_COUNT = Counter('http_requests_total', 'Total requests', ['method', 'status'])
REQUEST_COUNT.labels(method='GET', status='200').inc()

start_http_server(8000)  # Expose /metrics on port 8000
```

Popular Exporters
| Exporter | Metrics |
|---|---|
| Node Exporter | CPU, memory, disk, network (Linux) |
| cAdvisor | Container resource usage |
| MySQL Exporter | Query performance, connections |
| PostgreSQL Exporter | Database stats, replication |
| Redis Exporter | Memory, keys, commands |
| Blackbox Exporter | HTTP, DNS, TCP, ICMP probes |
| NGINX Exporter | Requests, connections, status |
Prometheus vs Alternatives
| Feature | Prometheus | InfluxDB | Datadog | VictoriaMetrics |
|---|---|---|---|---|
| Open Source | Yes (Apache-2.0) | Partial | No | Yes (Apache-2.0) |
| Collection | Pull-based | Push-based | Agent | Pull + Push |
| Query | PromQL | InfluxQL/Flux | Proprietary | MetricsQL |
| CNCF | Graduated | No | No | No |
| Long-term storage | Needs remote | Built-in | Built-in | Built-in |
| Kubernetes | Native | Plugin | Agent | Native |
FAQ
Q: How long does Prometheus keep data?
A: 15 days by default. Adjust it with --storage.tsdb.retention.time. For long-term storage, a remote solution such as Thanos or Cortex is recommended.
Q: Is Prometheus suitable for log collection?
A: No. Prometheus is built specifically for numeric metrics. For logs, use Loki (also from Grafana Labs), which pairs well with Prometheus.
Q: How many metrics can one Prometheus instance scrape?
A: A single instance can handle millions of active time series. For very large environments, scale out horizontally with federation or Thanos/Mimir.
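For reference, the retention limits mentioned above are plain server flags; a sketch (paths and values illustrative):

```shell
# Keep data for 90 days OR until the TSDB reaches 50GB, whichever
# limit is hit first. Both flags are optional; the time default is 15d.
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=90d \
  --storage.tsdb.retention.size=50GB
```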
Sources and Credits
- GitHub: prometheus/prometheus — 63.5K+ ⭐ | Apache-2.0
- Website: prometheus.io