Thanos — Global Prometheus with Unlimited Retention and High Availability

Introduction

Prometheus is the de facto standard for cloud-native monitoring, but a single instance is limited in both retention (bounded by local disk) and query scope (only the data that one server scraped). Thanos extends Prometheus with object storage (S3, GCS, Azure, MinIO) for effectively unlimited retention, global querying across clusters, and high availability through deduplication of replicated data, while staying compatible with the Prometheus HTTP API.

With over 13,000 GitHub stars, Thanos is used by eBay, Adidas, GitLab, and hundreds of companies running multi-cluster Kubernetes. It's a CNCF Incubating project.

What Thanos Does

Thanos adds components that run alongside or on top of Prometheus. Sidecar uploads completed TSDB blocks to object storage, Store Gateway serves historical data from the bucket, Query fans PromQL queries out to multiple sources and merges the results, Compactor compacts and downsamples older blocks, Ruler evaluates recording and alerting rules against the global view, and Receive accepts Prometheus remote_write when running a Sidecar isn't practical.
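As an illustration of one of these components, the fragment below sketches a Compactor container. The data directory and the three retention periods are assumptions chosen for the example, not recommendations; the bucket config path reuses /etc/thanos/bucket.yml from the sidecar setup on this page.

```yaml
# Sketch only: a container entry for the Thanos Compactor.
# Run a single Compactor per bucket; it is not designed to run concurrently.
- name: thanos-compact
  image: quay.io/thanos/thanos:v0.36.1
  args:
    - compact
    - --wait                                   # keep running instead of exiting after one pass
    - --data-dir=/var/thanos/compact           # assumed scratch directory
    - --objstore.config-file=/etc/thanos/bucket.yml
    - --retention.resolution-raw=30d           # assumed: keep raw samples 30 days
    - --retention.resolution-5m=180d           # assumed: keep 5m aggregates 180 days
    - --retention.resolution-1h=1y             # assumed: keep 1h aggregates 1 year
```

Leaving the retention flags at their defaults keeps data indefinitely, which matches the "unlimited retention" use case; setting them trades history for a smaller bucket.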

Architecture Overview

                         [Object Storage (S3 / GCS / Azure)]
                                    ^
                                    |
                          [Thanos Sidecar]    <-- upload old blocks
                                    |
  cluster A:  Prometheus  ----------+
                                    |
                          [Thanos Sidecar]
                                    |
  cluster B:  Prometheus  ----------+

                          [Thanos Store]    --> reads old blocks from bucket
                          [Thanos Compactor] --> downsample + compact
                          [Thanos Ruler]     --> global alert eval
                                    ^
                                    |
                          [Thanos Query]     <-- Grafana queries here
                                    |
                           Prometheus API (fully compatible)
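The bottom half of the diagram can be sketched as a Deployment for Thanos Query. The names thanos-query and thanos-store, the prometheus-0.prometheus pod address, and the "replica" label are assumptions for illustration; in a real multi-cluster setup each endpoint must be an address reachable from the Query pod.

```yaml
# Sketch only: Thanos Query wired to one sidecar and one Store Gateway.
apiVersion: apps/v1
kind: Deployment
metadata: { name: thanos-query }
spec:
  replicas: 1
  selector:
    matchLabels: { app: thanos-query }
  template:
    metadata:
      labels: { app: thanos-query }
    spec:
      containers:
        - name: thanos-query
          image: quay.io/thanos/thanos:v0.36.1
          args:
            - query
            - --http-address=0.0.0.0:10902     # Prometheus-compatible HTTP API
            - --grpc-address=0.0.0.0:10901
            # One --endpoint per source; 10901 is the default gRPC port
            - --endpoint=prometheus-0.prometheus:10901   # assumed sidecar address
            - --endpoint=thanos-store:10901              # assumed Store Gateway service
            - --query.replica-label=replica              # assumed label used for dedup
          ports:
            - { name: http, containerPort: 10902 }
            - { name: grpc, containerPort: 10901 }
```

Query itself is stateless, so it can be scaled horizontally behind a Service without extra coordination.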

Self-Hosting & Configuration

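# Minimal Kubernetes Thanos stack: Prometheus with a Thanos Sidecar
apiVersion: apps/v1
kind: StatefulSet
metadata: { name: prometheus }
spec:
  serviceName: prometheus
  selector:
    matchLabels: { app: prometheus }   # required; must match the pod labels below
  template:
    metadata:
      labels: { app: prometheus }
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:v2.53.0
          args:
            # args replace the image defaults, so restate the paths; prometheus.yml must set external_labels
            - --config.file=/etc/prometheus/prometheus.yml
            - --storage.tsdb.path=/prometheus
            - --storage.tsdb.retention.time=24h
            # equal min/max block durations disable local compaction, which sidecar uploads require
            - --storage.tsdb.min-block-duration=2h
            - --storage.tsdb.max-block-duration=2h
            - --web.enable-lifecycle
          volumeMounts:
            - { name: data, mountPath: /prometheus }
        - name: thanos-sidecar
          image: quay.io/thanos/thanos:v0.36.1
          args:
            - sidecar
            - --tsdb.path=/prometheus
            - --prometheus.url=http://localhost:9090
            - --objstore.config-file=/etc/thanos/bucket.yml
          volumeMounts:
            - { name: data, mountPath: /prometheus }
            - { name: bucket-cfg, mountPath: /etc/thanos }
      volumes:
        - name: bucket-cfg
          secret: { secretName: thanos-objstore }   # assumed Secret containing bucket.yml
  volumeClaimTemplates:
    - metadata: { name: data }
      spec: { accessModes: [ReadWriteOnce], resources: { requests: { storage: 20Gi } } }   # assumed size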

# Thanos Query serves the Prometheus HTTP API, so Grafana can treat it as a Prometheus data source:
# point the Grafana data source at http://thanos-query:10902
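The sidecar's --objstore.config-file flag expects a small YAML file describing the bucket. A minimal sketch for S3 might look like the following; the bucket name, endpoint, and region are placeholders, and credentials are deliberately left out so they can be supplied by whatever mechanism your environment uses.

```yaml
# /etc/thanos/bucket.yml (sketch; all values below are placeholders)
type: S3
config:
  bucket: thanos-metrics              # assumed bucket name
  endpoint: s3.us-east-1.amazonaws.com
  region: us-east-1
  # access_key / secret_key can be set here, but keeping them out of the
  # file and using environment or instance credentials is usually safer.
```

The same file can be shared by every component that reads or writes the bucket, so Sidecar, Store Gateway, and Compactor all see the same blocks.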

Key Features

Unlimited retention — blocks in object storage, pay for S3 bytes
Global query — federated PromQL across clusters
High availability — query dedup across replicated Prometheus pairs
Downsampling — 5m and 1h aggregates for fast long-range queries
Remote write — Receive component accepts Prometheus remote_write
Compatible API — drop-in for Grafana, Alertmanager
CNCF Incubating — stable governance, 300+ contributors
Cache layers — Index cache (Redis/memcached) for bucket reads
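Query-time deduplication relies on each Prometheus replica carrying a distinguishing external label. A minimal sketch, assuming a label named "replica" and a cluster named cluster-a (both illustrative):

```yaml
# prometheus.yml for the first replica of an HA pair (sketch).
# The second replica uses the same file with replica: b.
global:
  scrape_interval: 1m
  external_labels:
    cluster: cluster-a   # assumed cluster name; identifies the source in global queries
    replica: a           # assumed label name; differs per replica
```

Starting Thanos Query with --query.replica-label=replica tells it to ignore that label when merging series, so the two replicas appear as one gap-filled series.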

Comparison with Similar Tools

Feature	Thanos	Cortex	Grafana Mimir	VictoriaMetrics	M3DB
Architecture	Sidecar + object store	Microservices	Microservices (Cortex fork)	Single-binary cluster	Distributed TSDB
Storage	S3/GCS/Azure	S3/GCS/Azure	S3/GCS/Azure	Local + replication	DFS + RocksDB
Query API	Prometheus	Prometheus	Prometheus	Prometheus + own	Prometheus (via proxy)
Setup	Moderate	Complex	Complex	Very simple	Complex
Cost	Low (S3 pricing)	Low	Low	Moderate	High
Best For	Pragmatic HA + retention	Multi-tenant	Grafana shops	Simplicity + speed	Uber-scale

FAQ

Q: Thanos or VictoriaMetrics, which is better? A: VictoriaMetrics is simpler to operate (it can run as a single binary) and is often reported to be faster; Thanos stores data in inexpensive object storage and layers HA and long-term retention onto unmodified Prometheus. Pick Thanos if you already run Prometheus; consider VictoriaMetrics if you are starting fresh.

Q: Do I need the Receive component? A: Only if Prometheus can't run close to your apps (e.g., short-lived edge workloads). Otherwise Sidecar + object storage is the recommended pattern.
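For the Receive path, the Prometheus side is a single remote_write entry. The hostname thanos-receive is an assumption; 19291 is the default remote-write port for Thanos Receive.

```yaml
# prometheus.yml fragment (sketch): push samples to Thanos Receive
# instead of relying on a sidecar next to this Prometheus.
remote_write:
  - url: http://thanos-receive:19291/api/v1/receive   # assumed service name
```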

Q: How much does object storage cost? A: Often cents per GB per month. A cluster scraping 100K series at 1m resolution for a year might cost $5–20/month in S3 — dramatically cheaper than SSD block storage.
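A rough back-of-the-envelope check of that figure, assuming about 1.5 bytes per compressed sample and an object storage price of $0.023 per GB-month (both are assumptions, not numbers from the Thanos project):

```latex
\begin{align*}
\text{samples per year} &= 100{,}000 \times (60 \times 24 \times 365) = 5.256 \times 10^{10} \\
\text{stored bytes} &\approx 5.256 \times 10^{10} \times 1.5\ \text{B} \approx 78.8\ \text{GB} \\
\text{monthly cost} &\approx 78.8\ \text{GB} \times \$0.023/\text{GB-month} \approx \$1.81
\end{align*}
```

This counts raw sample data only, once a full year has accumulated. Block indexes, the 5m and 1h downsampled copies, and request or transfer charges all add to it, so an actual bill can land well above this raw figure.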

Q: Does Thanos replace Prometheus? A: No — it augments it. Your Prometheus instances still do scrape + short-term local storage; Thanos handles long-term and global view.

Sources

GitHub: https://github.com/thanos-io/thanos
Docs: https://thanos.io
Foundation: CNCF (Incubating)
License: Apache-2.0
