What is Prometheus — Open Source Monitoring & Alerting Toolkit?

Prometheus is a CNCF-graduated monitoring system and time series database. It provides pull-based metrics collection, the PromQL query language, and built-in alerting for cloud-native infrastructure.

Is Prometheus — Open Source Monitoring & Alerting Toolkit free to use?

Yes. Prometheus — Open Source Monitoring & Alerting Toolkit is freely available on TokRepo. Check the Source & Thanks section on the asset page for the specific open-source license.

How do I install Prometheus — Open Source Monitoring & Alerting Toolkit?

Visit the asset page on TokRepo and click "Copy for agent" to get the installation instructions. Most assets can be installed with a single command.

Prometheus — Open Source Monitoring & Alerting Toolkit

What Prometheus Does

Metrics Collection: Pull-based metrics scraping from instrumented applications and exporters
Time Series DB: Efficient local storage optimized for time series data with compression
PromQL: Powerful query language for slicing, dicing, and aggregating time series data
Alerting: Alert rules with Alertmanager for routing, grouping, and notification
Service Discovery: Auto-discover targets from Kubernetes, Consul, DNS, EC2, and more
Exporters: 500+ exporters for databases, hardware, messaging, storage, and cloud services
Federation: Hierarchical federation for scaling across multiple Prometheus instances

Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ Targets      │◀────│  Prometheus  │────▶│  Alertmanager│
│ (Exporters)  │pull │  Server      │     │  (Notify)    │
│              │     │  TSDB + Rules│     └──────────────┘
└──────────────┘     └──────┬───────┘
                            │
                     ┌──────┴───────┐
                     │  Grafana     │
                     │  (Visualize) │
                     └──────────────┘

Key design principle: Prometheus is pull-based. It scrapes metrics from HTTP endpoints instead of having applications push them. A failed scrape is itself a signal (the target's up metric is set to 0), which makes it easier to detect when a target is down.
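
The pull model can be illustrated with a minimal Python sketch. This is not how Prometheus is implemented: scrape_once and the fetch callables are hypothetical names used only for illustration. It shows that the scraper, not the target, records whether a scrape succeeded, so a dead target becomes visible without any cooperation from the target.

```python
from typing import Callable, Dict


def scrape_once(targets: Dict[str, Callable[[], str]]) -> Dict[str, int]:
    """Pull from every target and record up=1 on success, up=0 on failure.

    `targets` maps a target name to a hypothetical fetch function that
    returns the body of its /metrics endpoint or raises on failure.
    """
    up = {}
    for name, fetch in targets.items():
        try:
            fetch()          # the scraper initiates the request
            up[name] = 1
        except Exception:    # timeout, connection refused, bad response
            up[name] = 0
    return up


def healthy() -> str:
    """Stand-in for a target that answers its scrape."""
    return "http_requests_total 42\n"


def down() -> str:
    """Stand-in for a target that cannot be reached."""
    raise ConnectionError("connection refused")


if __name__ == "__main__":
    print(scrape_once({"node-exporter:9100": healthy, "db:9104": down}))
```

With a push model, a silent application and a healthy but idle one look the same; here the second target is explicitly marked as down.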

Self-Hosting

Docker Compose (Full Stack)

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert.rules.yml:/etc/prometheus/alert.rules.yml
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

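  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    command:
      # Point the exporter at the mounted host paths; without these flags
      # it reads the container's own /proc and /sys.
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"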

volumes:
  prometheus-data:

Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - "alert.rules.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]

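  # Requires access to the Kubernetes API (for example, Prometheus running
  # in-cluster); omit this job when using only the Docker Compose setup.
  - job_name: "kubernetes-pods"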
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
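
To make the relabel rule in the kubernetes-pods job concrete, here is a simplified Python model of action: keep. The function name keep and the sample pod labels are hypothetical. It assumes the documented behavior that source label values are joined with a separator (";" by default) and that the regex must match the whole joined value. Prometheus uses RE2 syntax, so Python's re module is only equivalent for simple patterns like this one.

```python
import re
from typing import Dict, List


def keep(targets: List[Dict[str, str]], source_labels: List[str],
         regex: str, separator: str = ";") -> List[Dict[str, str]]:
    """Simplified model of a relabel rule with `action: keep`."""
    pattern = re.compile(regex)
    kept = []
    for labels in targets:
        # Labels that are absent are treated as empty strings.
        value = separator.join(labels.get(name, "") for name in source_labels)
        # Relabel regexes are anchored, so the whole value must match.
        if pattern.fullmatch(value):
            kept.append(labels)
    return kept


SCRAPE = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"

# Hypothetical discovered pods, for illustration only.
pods = [
    {"__meta_kubernetes_pod_name": "api", SCRAPE: "true"},
    {"__meta_kubernetes_pod_name": "db"},                      # no annotation
    {"__meta_kubernetes_pod_name": "cache", SCRAPE: "false"},
]

if __name__ == "__main__":
    for pod in keep(pods, [SCRAPE], "true"):
        print(pod["__meta_kubernetes_pod_name"])
```

Only pods annotated with prometheus.io/scrape: "true" survive the rule; everything else is dropped before scraping.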

PromQL Essentials

# Instant vector — current value
up{job="node"}

# Range vector — values over time
node_cpu_seconds_total[5m]

# Rate — per-second rate of increase
rate(http_requests_total[5m])

# Aggregation — sum across instances
sum(rate(http_requests_total[5m])) by (method, status)

# Histogram quantile — P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Arithmetic — error percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m])) * 100

# Prediction — disk full in 4 hours?
predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
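
As a rough intuition for what rate() computes, the following Python sketch derives a per-second increase from raw counter samples. It is a simplification, not the PromQL implementation: the real rate() also extrapolates to the edges of the range window, which this sketch skips. The name simple_rate and the sample data are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase of a counter over the given samples.

    Simplification: uses only the first and last sample timestamps and
    does no extrapolation to the window boundaries.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (for example, a restart),
        # so the new value counts as an increase from zero.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


if __name__ == "__main__":
    steady = [(0.0, 0.0), (60.0, 60.0), (120.0, 120.0)]
    with_reset = [(0.0, 100.0), (60.0, 160.0), (120.0, 20.0)]
    print(simple_rate(steady))
    print(simple_rate(with_reset))
```

The reset handling is why rate() should be applied to counters before aggregating with sum(), not after: once series are summed, individual resets can no longer be detected.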

Instrumenting Your App

Go

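package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// httpRequests counts handled requests, partitioned by method and status.
var httpRequests = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests",
    },
    []string{"method", "status"},
)

func init() { prometheus.MustRegister(httpRequests) }

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        httpRequests.WithLabelValues(r.Method, "200").Inc()
        w.Write([]byte("ok"))
    })
    http.Handle("/metrics", promhttp.Handler()) // scrape endpoint
    http.ListenAndServe(":8080", nil)           // port chosen for illustration
}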

Python

import time

from prometheus_client import Counter, start_http_server

REQUEST_COUNT = Counter('http_requests_total', 'Total requests', ['method', 'status'])

start_http_server(8000)  # Expose /metrics on port 8000 (runs in a background thread)
REQUEST_COUNT.labels(method='GET', status='200').inc()

while True:  # Keep the process alive so Prometheus can scrape it
    time.sleep(1)
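
For orientation, this is roughly what a client library serves on /metrics. The sketch below renders a counter in the Prometheus text exposition format using only the standard library. It is a simplified illustration, not the prometheus_client implementation: render_counter is a hypothetical helper, and it skips the escaping of special characters in label values that a real client performs.

```python
from typing import Dict, Tuple

LabelSet = Tuple[Tuple[str, str], ...]  # e.g. (("method", "GET"), ("status", "200"))


def render_counter(name: str, help_text: str,
                   samples: Dict[LabelSet, float]) -> str:
    """Render one counter in the text exposition format (no escaping)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        pairs = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{pairs}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    samples = {(("method", "GET"), ("status", "200")): 1.0}
    print(render_counter("http_requests_total", "Total requests", samples), end="")
```

Each distinct combination of label values becomes its own line, and therefore its own time series in Prometheus.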

Popular Exporters

Exporter	Metrics
Node Exporter	CPU, memory, disk, network (Linux)
cAdvisor	Container resource usage
MySQL Exporter	Query performance, connections
PostgreSQL Exporter	Database stats, replication
Redis Exporter	Memory, keys, commands
Blackbox Exporter	HTTP, DNS, TCP, ICMP probes
NGINX Exporter	Requests, connections, status

Prometheus vs Alternatives

Feature	Prometheus	InfluxDB	Datadog	VictoriaMetrics
Open Source	Yes (Apache-2.0)	Partial	No	Yes (Apache-2.0)
Collection	Pull-based	Push-based	Agent	Pull + Push
Query	PromQL	InfluxQL/Flux	Proprietary	MetricsQL
CNCF	Graduated	No	No	No
Long-term storage	Needs remote storage	Built-in	Built-in	Built-in
Kubernetes	Native	Plugin	Agent	Native

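FAQ

Q: How long does Prometheus retain data? A: 15 days by default. You can change this with --storage.tsdb.retention.time. For long-term storage, a remote storage solution such as Thanos or Cortex is recommended.

Q: Is Prometheus suitable for log collection? A: No. Prometheus is built for numeric metrics. For logs, Loki (from Grafana Labs) is recommended; it is designed to work alongside Prometheus.

Q: How many metrics can a single Prometheus instance scrape? A: A single instance can handle millions of active time series. Very large environments can use federation, or scale horizontally with Thanos or Mimir.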

Source & Thanks

GitHub: prometheus/prometheus — 63.5K+ ⭐ | Apache-2.0
Website: prometheus.io