Thanos — Global Prometheus with Unlimited Retention and High Availability
Thanos extends Prometheus with global query, unlimited storage via object storage, and HA replication. It is the proven way to run Prometheus at multi-cluster, multi-year scale without changing your existing workflow.
Install with prior review
This asset requires review. The copied prompt requests a dry run, shows the writes, and continues only after confirmation.
npx -y tokrepo@latest install 63ff1c2c-37c8-11f1-9bc6-00163e2b0d79 --target codex
Dry-run first, confirm the writes, then run this command.
What it is
Thanos extends Prometheus with global query capabilities, unlimited storage via object storage (S3, GCS, Azure), and high-availability replication. It is the proven way to run Prometheus at multi-cluster, multi-year scale without changing your existing Prometheus setup. Thanos runs as a set of sidecar and gateway components alongside your Prometheus instances.
Thanos is designed for SRE and platform teams running Prometheus across multiple clusters who need a unified query interface and long-term metric retention.
How it saves time or tokens
Prometheus alone has limited local storage and no built-in multi-cluster querying. Running Prometheus at scale requires either over-provisioning disk or losing old metrics. Thanos solves both problems: a sidecar uploads Prometheus blocks to cheap object storage for indefinite retention, and the Query component federates queries across all Prometheus instances. You keep your existing Prometheus setup and add Thanos components alongside it.
How to use
- Run the Thanos sidecar next to each Prometheus instance:
  thanos sidecar \
    --tsdb.path=/prometheus \
    --prometheus.url=http://localhost:9090 \
    --objstore.config-file=bucket.yml
- Configure object storage in bucket.yml:
  type: S3
  config:
    bucket: thanos-metrics
    endpoint: s3.amazonaws.com
    access_key: '...'
    secret_key: '...'
- Run the Thanos Query component for global queries:
  thanos query \
    --store=sidecar-1:10901 \
    --store=sidecar-2:10901 \
    --store=store-gateway:10901
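Thanos Query serves a Prometheus-compatible UI and HTTP API on its HTTP port (10902 by default); open it to query every connected Prometheus instance at once. Recent Thanos releases deprecate --store in favor of --endpoint, so check the flag name against your version.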
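Because Thanos Query speaks the Prometheus HTTP API, scripts can read the global view directly. The following is a minimal sketch, not an official client: the host name thanos-query.example.internal is hypothetical, 10902 is assumed to be the HTTP port in use, and dedup is the Thanos query parameter that toggles replica deduplication.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url: str, promql: str, dedup: bool = True) -> str:
    """Return an instant-query URL for the Prometheus-compatible API."""
    params = {"query": promql, "dedup": "true" if dedup else "false"}
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


def instant_query(base_url: str, promql: str, dedup: bool = True) -> list:
    """Run the query and return the result list from the JSON response."""
    with urlopen(build_query_url(base_url, promql, dedup), timeout=10) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]


# Hypothetical endpoint; replace with your own Thanos Query address.
print(build_query_url("http://thanos-query.example.internal:10902", "up"))
```

Calling instant_query with the same arguments would return one entry per series across all clusters that the Query component can reach.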
Example
A complete Thanos deployment architecture:
Cluster A: Prometheus + Thanos Sidecar ─┐
                                        ├─> Thanos Query ─> Grafana
Cluster B: Prometheus + Thanos Sidecar ─┤
                                        │
Object Storage (S3) <─ Sidecar uploads ─┘
        │
        └──> Thanos Store Gateway ──> Thanos Query
             (serves historical data)
Grafana points at Thanos Query as its Prometheus data source and gets a unified view across all clusters and time ranges.
Related on TokRepo
- Monitoring tools — Browse observability and monitoring tools
- DevOps tools — Explore infrastructure tooling
Common pitfalls
- Not deploying a Store Gateway for historical queries. Without it, Thanos Query can only reach live Prometheus instances. The Store Gateway serves data from object storage for long-term queries.
- Forgetting to configure compaction. Thanos Compact merges and downsamples historical blocks in object storage. Without it, the bucket accumulates many small blocks and queries over long time ranges slow down.
- Running Thanos Query without deduplication. If you run HA Prometheus pairs, set --query.replica-label to the external label that distinguishes the replicas so that Thanos Query deduplicates their series.
- Starting with an overly complex configuration instead of defaults. Begin with the minimal setup, verify it works, then customize incrementally. This catches configuration errors early and keeps troubleshooting straightforward.
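The deduplication pitfall is easier to see with a toy model. The sketch below is not the algorithm Thanos uses internally; it only illustrates the core idea that series differing solely in the replica label collapse into one. The label name replica and the sample data are assumptions made for the example.

```python
def dedup_series(series, replica_label="replica"):
    """Merge series whose labels match once the replica label is removed.

    Each series is a (labels_dict, samples) pair, where samples is a list of
    (timestamp, value) tuples. For a timestamp present in several replicas the
    first replica seen wins; gaps are filled from the other replicas.
    """
    merged = {}
    for labels, samples in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        points = merged.setdefault(key, {})
        for ts, value in samples:
            points.setdefault(ts, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


# Hypothetical HA pair: replica "a" missed the scrape at t=60, "b" missed t=30.
ha_pair = [
    ({"__name__": "up", "job": "api", "replica": "a"}, [(0, 1.0), (30, 1.0)]),
    ({"__name__": "up", "job": "api", "replica": "b"}, [(0, 1.0), (60, 1.0)]),
]
result = dedup_series(ha_pair)
print(result)
```

Without the replica label being declared, the same query would return two separate series with gaps instead of one continuous series.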
Frequently asked questions
Does Thanos replace Prometheus?
No. Thanos runs alongside Prometheus. You keep your existing Prometheus instances for scraping and alerting. Thanos adds global querying, long-term storage, and HA replication on top of Prometheus.
How much does long-term storage cost?
Object storage is cheap: S3 costs about $0.023/GB/month. With compaction and sensible retention settings, a year of metrics from a medium cluster might cost a few dollars per month in storage.
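The storage estimate is simple arithmetic. A minimal sketch, assuming the $0.023/GB/month figure quoted above and a hypothetical 100 GB of retained blocks; request and data-transfer charges are ignored.

```python
S3_PRICE_PER_GB_MONTH = 0.023  # figure quoted in the text; check current pricing


def monthly_storage_cost(stored_gb: float, price_per_gb: float = S3_PRICE_PER_GB_MONTH) -> float:
    """Return the monthly object storage bill in dollars (storage only)."""
    return round(stored_gb * price_per_gb, 2)


# Hypothetical bucket holding 100 GB of Thanos blocks.
print(monthly_storage_cost(100))
```

At that price, 100 GB comes to roughly $2.30 per month, which is consistent with the "few dollars" estimate.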
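Does Thanos downsample old data?
Yes. Thanos Compact downsamples older blocks to 5-minute and 1-hour resolution. Downsampled blocks make queries over long time ranges much faster while keeping enough resolution for trend analysis. They are stored in addition to the raw blocks, so storage shrinks only if you also set a shorter retention for raw data (--retention.resolution-raw).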
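To illustrate what downsampling keeps, the sketch below collapses raw samples into fixed 5-minute windows and records count, sum, min, and max for each window. It is a simplified model rather than the Thanos implementation or block format; the function name and sample data are hypothetical.

```python
from collections import defaultdict


def downsample(samples, window_seconds=300):
    """Aggregate (timestamp_seconds, value) samples into fixed windows."""
    windows = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of its window.
        windows[ts - ts % window_seconds].append(value)
    return {
        start: {"count": len(vs), "sum": sum(vs), "min": min(vs), "max": max(vs)}
        for start, vs in sorted(windows.items())
    }


# Hypothetical raw data: one sample every 60 seconds for 10 minutes.
raw = [(t, float(t // 60)) for t in range(0, 600, 60)]
aggregated = downsample(raw)
print(aggregated)
```

Ten raw samples become two aggregate records, and an average can still be recovered as sum divided by count, which is why trend queries remain possible at lower resolution.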
How does Thanos compare with Cortex or Mimir?
Thanos and Cortex/Mimir solve the same problem (scaling Prometheus) with different architectures. Thanos uses a sidecar model that is simpler to deploy alongside existing Prometheus. Cortex/Mimir uses a push-based model in which Prometheus sends samples to a central cluster via remote write. Thanos is simpler to adopt; Mimir may scale better for very large deployments.
Can Thanos evaluate rules across clusters?
Yes. Thanos Ruler evaluates recording and alerting rules against Thanos Query (the global view), enabling rules that span multiple Prometheus instances.
Similar assets
CockroachDB — Distributed SQL for the Global Cloud
CockroachDB is a cloud-native, distributed SQL database designed for high availability, effortless horizontal scale, and geographic data placement. PostgreSQL-compatible wire protocol with serializable transactions across regions.
Cortex — Horizontally Scalable Long-Term Storage for Prometheus
Cortex is a CNCF project that provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus metrics, letting you run Prometheus-as-a-Service at scale.
Prometheus Node Exporter — Hardware and OS Metrics for Unix Systems
Node Exporter is the official Prometheus exporter for machine-level metrics, exposing CPU, memory, disk, filesystem, and network statistics from Linux and other Unix systems via an HTTP endpoint.
Redux — Predictable Global State Management for JS Apps
Redux is the original predictable state container for JavaScript apps. Modern Redux uses Redux Toolkit (RTK) which reduces boilerplate 80% and includes RTK Query for server state. Still the standard for large-scale React apps.