# Prometheus — Open Source Monitoring & Alerting Toolkit

> Prometheus is the CNCF-graduated monitoring system and time series database. Pull-based metrics collection, powerful PromQL queries, and built-in alerting for cloud-native infrastructure.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use

```bash
docker run -d --name prometheus -p 9090:9090 \
  -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus:latest
```

Create `prometheus.yml` in the current directory before starting the container:
```yaml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
```

Open `http://localhost:9090` — start querying metrics with PromQL.
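
Queries can also be issued outside the web UI through the HTTP API. The sketch below uses only the Python standard library to build an instant-query URL for the `/api/v1/query` endpoint and to pull label sets and values out of a response. The helper names `build_query_url` and `parse_instant_vector` and the sample payload are illustrative assumptions, not part of Prometheus.

```python
import json
from urllib.parse import urlencode

def build_query_url(base_url, expr):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"

def parse_instant_vector(body):
    """Map each series' sorted label pairs to its float value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    results = {}
    for series in payload["data"]["result"]:
        labels = tuple(sorted(series["metric"].items()))
        _timestamp, value = series["value"]  # the value arrives as a string
        results[labels] = float(value)
    return results

# Hand-written sample shaped like a response to the query `up`.
SAMPLE_BODY = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"__name__": "up", "job": "prometheus", "instance": "localhost:9090"},
         "value": [1700000000.0, "1"]},
    ]},
})

url = build_query_url("http://localhost:9090", 'up{job="prometheus"}')
values = parse_instant_vector(SAMPLE_BODY)
print(url)
print(values)
```

Against a live server, fetching `url` with `urllib.request.urlopen` and passing the response body to `parse_instant_vector` should give the same shape of result.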

## Intro

**Prometheus** is an open-source monitoring system and time series database, originally built at SoundCloud and now a CNCF graduated project (same status as Kubernetes). It collects metrics from configured targets at given intervals, evaluates rule expressions, displays results, and triggers alerts when specified conditions are observed.

With 63.5K+ GitHub stars and an Apache-2.0 license, Prometheus is the de facto standard for cloud-native monitoring and integrates closely with Kubernetes and the wider CNCF ecosystem.

## What Prometheus Does

- **Metrics Collection**: Pull-based metrics scraping from instrumented applications and exporters
- **Time Series DB**: Efficient local storage optimized for time series data with compression
- **PromQL**: Powerful query language for slicing, dicing, and aggregating time series data
- **Alerting**: Alert rules with Alertmanager for routing, grouping, and notification
- **Service Discovery**: Auto-discover targets from Kubernetes, Consul, DNS, EC2, and more
- **Exporters**: 500+ exporters for databases, hardware, messaging, storage, and cloud services
- **Federation**: Hierarchical federation for scaling across multiple Prometheus instances

## Architecture

```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ Targets      │◀────│  Prometheus  │────▶│  Alertmanager│
│ (Exporters)  │pull │  Server      │     │  (Notify)    │
│              │     │  TSDB + Rules│     └──────────────┘
└──────────────┘     └──────┬───────┘
                            │
                     ┌──────┴───────┐
                     │  Grafana     │
                     │  (Visualize) │
                     └──────────────┘
```

Key design principle: **Pull-based** — Prometheus scrapes metrics from HTTP endpoints, rather than having applications push metrics. This makes it easier to detect when a target is down.
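
Why pulling makes down-detection easy can be shown with a toy scrape loop: if the fetch fails for any reason, the scraper itself records `up = 0` for that target. This is a simplified sketch, not Prometheus's implementation. The parser handles only a small subset of the text exposition format (no timestamps, no escaping), and `scrape`, `healthy_fetch`, and `broken_fetch` are hypothetical names.

```python
def parse_exposition(text):
    """Parse a minimal subset of the text exposition format into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples

def scrape(target, fetch):
    """Scrape one target; `fetch` is any callable returning the /metrics body."""
    try:
        samples = parse_exposition(fetch(target))
        samples["up"] = 1.0
    except Exception:
        # Timeout, refused connection, or bad payload: the target counts as down.
        samples = {"up": 0.0}
    return samples

def healthy_fetch(target):
    return '# TYPE http_requests_total counter\nhttp_requests_total{method="GET"} 42\n'

def broken_fetch(target):
    raise ConnectionError("connection refused")

print(scrape("app:8000", healthy_fetch))
print(scrape("app:8000", broken_fetch))
```

With a push model, a silent application and a dead application look the same to the collector; here the failed scrape is itself a data point.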

## Self-Hosting

### Docker Compose (Full Stack)

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert.rules.yml:/etc/prometheus/alert.rules.yml
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
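    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    command:
      # Read host metrics from the mounted paths, not the container's own /proc and /sys
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"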

volumes:
  prometheus-data:
```

### Configuration

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - "alert.rules.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]

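  # Requires Prometheus to run inside a Kubernetes cluster (in-cluster credentials);
  # remove this job when using the Docker Compose setup.
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"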
```
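
The `keep` rule in the `kubernetes-pods` job can be approximated in a few lines. The sketch below assumes the documented relabeling behavior: source label values are joined with `;`, and the regex must match the whole joined string. The `keep` function and the pod dictionaries are made up for illustration, and Python's `re` only approximates the RE2 syntax Prometheus uses.

```python
import re

def keep(labels, source_labels, regex, separator=";"):
    """Return True if the target survives a `keep` relabel rule."""
    # A missing label contributes an empty string.
    joined = separator.join(labels.get(name, "") for name in source_labels)
    # Relabel regexes are anchored at both ends, hence fullmatch.
    return re.fullmatch(regex, joined) is not None

ANNOTATION = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"
pods = [
    {"__meta_kubernetes_pod_name": "api-1", ANNOTATION: "true"},
    {"__meta_kubernetes_pod_name": "api-2", ANNOTATION: "false"},
    {"__meta_kubernetes_pod_name": "batch-1"},  # annotation absent
]
kept = [p["__meta_kubernetes_pod_name"] for p in pods if keep(p, [ANNOTATION], "true")]
print(kept)  # ['api-1']
```

Pods without the annotation are dropped as well, because the empty string does not match `true`.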

## PromQL Essentials

```promql
# Instant vector — current value
up{job="node"}

# Range vector — values over time
node_cpu_seconds_total[5m]

# Rate — per-second rate of increase
rate(http_requests_total[5m])

# Aggregation — sum across instances
sum(rate(http_requests_total[5m])) by (method, status)

# Histogram quantile — P95 latency
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Arithmetic — error percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m])) * 100

# Prediction — disk full in 4 hours?
predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600) < 0
```
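
To build intuition for `rate()` and `histogram_quantile()`, here is a rough Python approximation of both. It is a sketch under simplifying assumptions: `simple_rate` handles counter resets but skips the window-boundary extrapolation that Prometheus performs, and `simple_histogram_quantile` covers only classic cumulative buckets with `0 < q <= 1`. Function names and sample data are invented for the example.

```python
import math

def simple_rate(samples):
    """Per-second increase; samples are (timestamp_s, counter_value), oldest first."""
    if len(samples) < 2:
        return None
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter restarted from zero, so count the new value in full.
        increase += value if value < prev else value - prev
        prev = value
    return increase / (samples[-1][0] - samples[0][0])

def simple_histogram_quantile(q, buckets):
    """buckets: (upper_bound, cumulative_count) pairs, sorted, ending with +Inf."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for i, (upper_bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(upper_bound):
                return buckets[i - 1][0]  # cannot interpolate into the +Inf bucket
            if count == lower_count:
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count

counter = [(0, 100.0), (60, 160.0), (120, 10.0), (180, 70.0)]  # reset at t=120
latency = [(0.1, 50.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)]
print(simple_rate(counter))
print(simple_histogram_quantile(0.95, latency))
```

The interpolation step is why quantile accuracy depends on bucket boundaries: the estimate can only be as fine-grained as the bucket that contains the requested rank.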

## Instrumenting Your App

### Go

```go
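package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// httpRequests counts handled requests, partitioned by method and status code.
var httpRequests = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests",
    },
    []string{"method", "status"},
)

func init() { prometheus.MustRegister(httpRequests) }

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        httpRequests.WithLabelValues(r.Method, "200").Inc()
        w.Write([]byte("ok"))
    })
    http.Handle("/metrics", promhttp.Handler()) // endpoint Prometheus scrapes
    http.ListenAndServe(":8080", nil)           // port 8080 is an arbitrary choice
}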

### Python

```python
import time

from prometheus_client import Counter, start_http_server

REQUEST_COUNT = Counter('http_requests_total', 'Total requests', ['method', 'status'])

if __name__ == '__main__':
    start_http_server(8000)  # Expose /metrics on port 8000
    while True:  # Keep the process alive so Prometheus can scrape it
        REQUEST_COUNT.labels(method='GET', status='200').inc()
        time.sleep(1)
```
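
To see roughly what a scrape of `/metrics` returns for a labelled counter, the standard-library sketch below renders the text exposition format by hand. `SketchCounter` is a toy class written for this example; the real client libraries additionally handle label escaping, thread safety, and further metric types.

```python
from collections import defaultdict

class SketchCounter:
    """Toy labelled counter that renders itself in the text exposition format."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = label_names
        self.values = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        # One time series per distinct combination of label values.
        key = tuple(labels[name] for name in self.label_names)
        self.values[key] += amount

    def render(self):
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} counter"]
        for key, value in sorted(self.values.items()):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return "\n".join(lines) + "\n"

requests = SketchCounter("http_requests_total", "Total requests", ["method", "status"])
requests.inc(method="GET", status="200")
requests.inc(method="GET", status="200")
requests.inc(method="POST", status="500")
print(requests.render())
```

Each distinct label combination becomes its own line, and therefore its own time series, in the scraped output.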

## Popular Exporters

| Exporter | Metrics |
|----------|---------|
| Node Exporter | CPU, memory, disk, network (Linux) |
| cAdvisor | Container resource usage |
| MySQL Exporter | Query performance, connections |
| PostgreSQL Exporter | Database stats, replication |
| Redis Exporter | Memory, keys, commands |
| Blackbox Exporter | HTTP, DNS, TCP, ICMP probes |
| NGINX Exporter | Requests, connections, status |

## Prometheus vs Alternatives

| Feature | Prometheus | InfluxDB | Datadog | VictoriaMetrics |
|---------|-----------|----------|---------|-----------------|
| Open Source | Yes (Apache-2.0) | Partial | No | Yes (Apache-2.0) |
| Collection | Pull-based | Push-based | Agent | Pull + Push |
| Query | PromQL | InfluxQL/Flux | Proprietary | MetricsQL |
| CNCF | Graduated | No | No | No |
| Long-term storage | Needs remote storage | Built-in | Built-in | Built-in |
| Kubernetes | Native | Plugin | Agent | Native |

## FAQ

**Q: How long can Prometheus retain data?**
A: 15 days by default. Adjust via `--storage.tsdb.retention.time`. For long-term storage, use remote storage solutions like Thanos or Cortex.
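
For capacity planning, the Prometheus storage documentation gives a rough formula: disk needed is about retention time in seconds, times ingested samples per second, times bytes per sample, with an average of roughly 1 to 2 bytes per sample. The sketch below applies it to hypothetical numbers (100,000 active series, 15s scrape interval, 30 days of retention, 2 bytes per sample); real usage varies with churn and compression.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days,
                        bytes_per_sample=2.0):
    """Rough TSDB disk estimate: retention * ingestion rate * sample size."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample

needed = estimate_disk_bytes(active_series=100_000, scrape_interval_s=15,
                             retention_days=30)
print(f"{needed / 1e9:.1f} GB")
```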

**Q: Is Prometheus good for log collection?**
A: No. Prometheus stores numeric metrics only. For logs, use Loki (from Grafana Labs), which uses the same label model and pairs well with Prometheus.

**Q: How many metrics can a single Prometheus instance handle?**
A: A single instance can handle millions of active time series, given enough memory. For very large scale, use federation or Thanos/Mimir for horizontal scaling.
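
How quickly a metric approaches that limit depends on label cardinality, since every distinct combination of label values is a separate series. A small worked example with made-up label counts shows how one unbounded label changes the worst case:

```python
from math import prod

def worst_case_series(label_cardinalities):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())

bounded = {"method": 5, "status": 10, "instance": 20}
unbounded = {**bounded, "user_id": 50_000}  # hypothetical high-cardinality label

print(worst_case_series(bounded))    # 1000
print(worst_case_series(unbounded))  # 50000000
```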

## Source & Thanks

- GitHub: [prometheus/prometheus](https://github.com/prometheus/prometheus) — 63.5K+ ⭐ | Apache-2.0
- Website: [prometheus.io](https://prometheus.io)

---
Source: https://tokrepo.com/en/workflows/prometheus-open-source-monitoring-alerting-toolkit-ed3a8de4
Author: AI Open Source