Deployment + Monitoring + Observability Stack
Ten tools for developers who ship to production: deployment targets (Vercel / Kamal / Coolify), error tracking, OpenTelemetry, metrics, logs, dashboards, uptime, and alerting, chained in a deliberate order to actually catch the next outage.
What's in this pack
This is the stack a working backend engineer would assemble the week before their app gets real users — not the heroic post-outage scramble. Every pick here is open-source-first, runs on a $20 VPS or smaller, and plugs into the next tool in the chain. The order matters: each layer feeds the next.
| # | Pick | Layer | What it does |
|---|---|---|---|
| 1 | Vercel CLI | deploy (PaaS) | preview URL on every git push, zero config for Next/Nuxt/Astro |
| 2 | Kamal | deploy (container) | zero-downtime Docker deploys to any bare VPS — Basecamp's tool |
| 3 | Coolify | deploy (self-hosted PaaS) | open-source Vercel/Heroku replacement for your own server |
| 4 | Sentry | errors + APM | exception capture, release health, performance traces |
| 5 | OpenTelemetry Collector | telemetry pipeline | vendor-neutral fan-in for traces, metrics, logs |
| 6 | Prometheus | metrics | pull-based time-series DB, the industry default |
| 7 | Grafana Loki | logs | log aggregation that thinks like Prometheus — cheap, indexed by label |
| 8 | Grafana | dashboards | the wall display every other tool plugs into |
| 9 | Uptime Kuma | uptime + status page | self-hosted heartbeat that pages you when the site dies |
| 10 | Prometheus Alertmanager | alert routing | dedupe, group, route alerts to PagerDuty / Slack / email |
Install in this order (deploy → errors → telemetry pipeline → metrics → logs → uptime → alerts → dashboards)
The order is deliberate. Don't install dashboards first. Empty dashboards teach you nothing. Wire the data sources first; the dashboard is the last 10% of the work.
- Pick one deploy target. Vercel CLI if you're shipping a JS framework and want preview URLs on every PR. Kamal if you've outgrown Heroku-style pricing and want to own the box. Coolify if you want the Vercel UX on your own hardware. Pick one. Skip the other two.
- Sentry next. Errors are the single highest-signal telemetry you'll add. Five lines of SDK init and you start catching exceptions you didn't know existed. Set up release tracking from day one so you can answer "did this start with the last deploy?"
- OpenTelemetry Collector. Don't lock yourself to one vendor's SDK. The Collector is a single Go binary that receives OTLP from your app and fans out to Sentry, Prometheus, Loki, or anything else. Configure it once, swap backends without touching app code.
- Prometheus for metrics. Scrape `/metrics` from your app, your Node Exporter, your database exporters. The four golden signals (latency, traffic, errors, saturation) go here.
- Loki for logs. If you already use Prometheus, Loki is the obvious log store: same label model, same query language flavor, runs on the same VM. Don't index every JSON field; index by service, env, level, and let `LogQL` filter the rest.
- Uptime Kuma for the heartbeat. External-perspective ping. Catches the outages your internal stack can't see (DNS, TLS cert, CDN). Public status page included.
- Alertmanager wired to Prometheus. Alerts should fire on symptoms (p95 latency > 2s, error rate > 1%), not causes (CPU > 80%). Route P1 to pager, P2 to Slack, P3 to a daily digest.
- Grafana last. Now that data is flowing, build three dashboards: one for the on-call engineer (latency, error rate, recent deploys), one for the product owner (signups, conversions, cost per user), one for the exec (uptime %, MAU, week-over-week). Generic dashboards get ignored.
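The symptom-based alerting step above can be sketched in a few lines. This is an illustration of the decision logic only, not how Prometheus or Alertmanager are configured: in a real setup the thresholds would live in Prometheus alerting rules and the routing in Alertmanager. The `Window` shape, the `classify` helper, and the mapping of "both symptoms" to P1 and "one symptom" to P2 are assumptions made for the example; only the two thresholds (p95 latency > 2s, error rate > 1%) come from the text.

```python
from dataclasses import dataclass
from statistics import quantiles

# Thresholds from the text; the severity mapping further down is an assumption.
P95_LATENCY_LIMIT_S = 2.0
ERROR_RATE_LIMIT = 0.01


@dataclass
class Window:
    """One evaluation window of request data (hypothetical shape)."""
    latencies_s: list[float]  # per-request latency in seconds
    errors: int               # requests that failed in the window
    total: int                # all requests in the window


def p95(latencies_s: list[float]) -> float:
    # quantiles() needs at least two samples; treat tiny windows as "no signal".
    if len(latencies_s) < 2:
        return 0.0
    # n=20 yields 19 cut points; the last one is the 95th percentile.
    return quantiles(latencies_s, n=20)[-1]


def classify(window: Window) -> tuple[str, str]:
    """Return (severity, route) from user-facing symptoms only."""
    error_rate = window.errors / window.total if window.total else 0.0
    slow = p95(window.latencies_s) > P95_LATENCY_LIMIT_S
    failing = error_rate > ERROR_RATE_LIMIT
    if slow and failing:
        return ("P1", "pager")
    if slow or failing:
        return ("P2", "slack")
    return ("ok", "none")
```

Note that CPU usage never appears as an input: a busy but healthy box classifies as `ok`, which is the point of alerting on symptoms rather than causes.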
Tradeoffs you'll hit
- Vercel vs Kamal vs Coolify — Vercel = zero-ops, scales to zero, gets expensive at scale and you don't own the stack. Kamal = own the box, Docker is the only abstraction, cheap and predictable. Coolify = the middle ground; self-hosted UI on top of Docker. Most teams ship the MVP on Vercel, migrate to Kamal/Coolify when the bill hits $500/mo.
- Sentry SaaS vs self-hosted: Self-hosted Sentry needs half a dozen or more backing services (Kafka, Postgres, Redis, ClickHouse, among others). For under 100k events/month, a SaaS plan (free or entry-level paid) is genuinely cheaper than your time. Self-host only when you've outgrown those plans and have ops bandwidth.
- Prometheus + Loki + Grafana vs Datadog: Datadog is the polished hosted incumbent. The open stack costs ~$20/mo in VPS instead of a Datadog bill that often runs $300+/mo. Tradeoff: you babysit the stack. Below ~10 services, open-source wins on cost and lock-in; above ~50, Datadog's ergonomics start to matter.
- Push vs pull metrics — Prometheus is pull (it scrapes you). If you run serverless or short-lived jobs, pull doesn't work — use a Pushgateway, or switch to OpenTelemetry push to a Collector. Don't fight the model.
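For the push-vs-pull tradeoff, here is a minimal sketch of what a short-lived job might send to a Pushgateway before it exits. It uses only the standard library to show the wire format; in practice an official Prometheus client library would normally do this. The gateway address, job name, and metric name in the usage note are hypothetical, and the helpers `render_exposition` and `build_push_request` are names invented for this example.

```python
import urllib.parse
import urllib.request


def render_exposition(metrics: dict[str, float]) -> bytes:
    """Render gauge samples in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # The format expects a newline after the last sample.
    return ("\n".join(lines) + "\n").encode("utf-8")


def build_push_request(
    gateway_url: str, job: str, metrics: dict[str, float]
) -> urllib.request.Request:
    """Build (but do not send) a PUT that replaces the metrics for this job."""
    job_path = urllib.parse.quote(job, safe="")
    url = f"{gateway_url.rstrip('/')}/metrics/job/{job_path}"
    return urllib.request.Request(
        url,
        data=render_exposition(metrics),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
```

A batch job would call `urllib.request.urlopen(build_push_request("http://localhost:9091", "nightly_backup", {"backup_duration_seconds": 12.5}), timeout=5)` as its last step; Prometheus then scrapes the Pushgateway like any other target.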
Common pitfalls
- Alerting on causes, not symptoms. "CPU > 80%" pages you at 3am for a workload that's fine. "User-facing p95 > 2s" pages you only when it matters. Tune for symptoms; investigate causes after waking up.
- No release annotation in Grafana. Half of all incidents start "right after the deploy." Wire your deploy script to POST a Grafana annotation on every release. The marker on the timeline saves 20 minutes per incident.
- Indexing every log field. Loki's whole point is that it doesn't. If you add 50 labels per log line, cardinality explodes and the cheap log store becomes expensive. Index by service, env, level — grep the rest.
- One alert channel for everything. P1 (site down) → phone. P2 (degraded) → Slack with @channel. P3 (anomaly) → daily digest. Mix them and either you ignore the pager or you ignore the digest. Both fail.
- No external uptime check. Your internal Prometheus thinks the service is up. Cloudflare or your CDN is dropping 30% of requests in `eu-west`. Uptime Kuma from a different network catches this. Five minutes to set up.
Frequently asked questions
Do I really need all ten of these? It looks like a lot.
You need one from each layer, not all ten. The pack lists alternatives within layers (three deploy targets, two metric paths via Prometheus or OTel) — pick the one that fits your scale. The minimum viable stack for a 1-person indie ship is: Vercel CLI + Sentry + Uptime Kuma. Add Prometheus + Grafana + Alertmanager when you have a second engineer. Add Loki + OpenTelemetry Collector when you're past 10 services. Don't install ahead of need.
What's the realistic monthly cost for this whole stack?
For a small team: Vercel free or $20/mo, Sentry free tier (5k errors/mo) or $26/mo, then a single $5-20 VPS to host Prometheus + Loki + Grafana + Uptime Kuma + Alertmanager together (they're all light on RAM). Total: about $5/mo on the free tiers, up to roughly $66/mo with both paid plans, for production observability that catches real outages. Compare to Datadog at $15-31 per host per month, often $300+/mo for the same coverage.
How does this overlap with the LLM Observability pack?
LLM Observability (Langfuse, Phoenix, AgentOps) is the application-semantic layer — prompt traces, token costs, eval scores. This Deploy + Monitor + Observability pack is the infrastructure layer — is the container alive, is the HTTP p95 acceptable, did the deploy break the error rate. You want both. The OpenTelemetry Collector in this pack can ingest LLM traces from Langfuse/Phoenix and forward them alongside infra metrics, so on-call sees both on one Grafana dashboard.
Why Kamal over Docker Swarm or Nomad?
Kamal is opinionated to the point of being boring, which is what you want for deploys. It only does zero-downtime container rollouts and proxy-based routing (Traefik in Kamal 1, kamal-proxy since Kamal 2): no scheduler, no service mesh, no YAML cathedral. For 1-10 servers it's the simplest thing that works. Swarm is in maintenance mode; Nomad is great but the operational footprint is larger than a small team needs. Reach for k8s only when you have someone whose full-time job is k8s.
Can I use this stack with a serverless backend (AWS Lambda, Cloudflare Workers)?
Yes, but the scrape model breaks. For serverless, use OpenTelemetry SDKs that push traces and metrics to the OpenTelemetry Collector via OTLP. The Collector then writes to Prometheus (via remote_write) and Loki, and everything else in the pack works unchanged. Uptime Kuma still pings the public URL, Sentry's SDK works in Lambda/Workers runtimes, and Grafana dashboards don't care where the data came from.