Log Analysis and Search Pack
Ten picks for the engineer reading logs at 3 a.m.: structured loggers, a shipping and storage stack (Fluent Bit → Loki / Elasticsearch / ClickHouse), local SQL tailing with lnav, Sentry error grouping, and MCP servers so an AI agent can query traces and alerts directly.
What this pack solves
It's 3 a.m. The pager says 5xx rate jumped. You SSH in, tail -f a file that's already rotated, grep for an exception that's actually three exceptions sharing a substring, and 40 minutes later you've narrowed it to "something in checkout." That's the problem this pack kills.
The goal isn't observability theatre — no fifteen dashboards no one opens. The goal is: structured logs go in one end, a question comes out the other, and the question can be asked by you, a teammate, or an AI agent with MCP access.
Every pick here is open-source or has a self-hostable open-source core. The full pipeline runs on a single mid-size VM up to about 50 GB/day of log volume; past that, you split Loki/ClickHouse onto their own boxes. No vendor lock-in, no per-GB pricing surprises.
Install in this order
- winston (Node) or Loguru (Python): start with structured logging in your app. JSON output, one log line per event, and every line carries timestamp, level, service, trace_id. If your logs aren't structured at the source, every downstream tool is fighting your formatter instead of doing its job.
- Fluent Bit: the shipper. Tails files / journald / Docker logs, parses JSON, adds host labels, batches, retries, and ships to your store. Tiny C binary, ~5 MB RSS, runs as a sidecar or DaemonSet. The non-negotiable middle layer.
- Grafana Loki: the store, default pick. Indexes labels (not full text), uses object storage, cheap to run. Best when you ship structured JSON and search by service=checkout level=error. LogQL feels like PromQL; five minutes to learn if you know Prometheus.
- Elasticsearch: alternative store when you need full-text search across log message bodies, not just labels. Heavier (JVM, more disk) but unbeatable when the question is "find every log mentioning OrderId=abc-123 anywhere". Pair with Kibana for the UI.
- ClickHouse: alternative store when you have a lot of logs (>100 GB/day) and want SQL. Columnar, eats compressed JSON for breakfast, and aggregations that take Elasticsearch 30 seconds can run in about 1 second. The right pick at scale.
- lnav: the local terminal log navigator. SQL queries against log files directly, live tailing, format auto-detection, and syntax highlighting that makes errors stand out. The tool you reach for when you're SSH'd into one box and the centralized store isn't relevant. Single binary, no daemon.
- Sentry — error grouping + alerting. Different from Loki/ES/CH — those store all logs; Sentry catches exceptions and stack traces, groups duplicates intelligently, sends an alert when a new error appears or volume spikes. Self-hostable.
- SigNoz MCP Server — Model Context Protocol bridge. Lets Claude / ChatGPT / Cursor query SigNoz's traces, logs, and alerts conversationally. "What's the slowest endpoint in the last hour?" → real answer from real data, not hallucinated.
- ClickHouse MCP: the safer MCP pick when your store is ClickHouse. Read-only by default, drop-table protection, parameterized queries. Hand it to an agent without panicking that it'll DROP DATABASE production.
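To make the first step concrete, here is a minimal sketch of "one JSON line per event, every line has timestamp, level, service, trace_id". It uses Python's standard logging module as a stand-in for Loguru so it runs without extra packages. The JsonFormatter class, the build_logger helper, and the "fields" key passed through extra are all illustrative names, not part of any library mentioned in this pack.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            # trace_id and fields arrive via `extra=`; both key names are assumptions.
            "trace_id": getattr(record, "trace_id", None),
            "event": record.getMessage(),
        }
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, default=str)


def build_logger(service: str, stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()  # stands in for stdout so the output can be inspected
log = build_logger("checkout", buffer)
log.error("payment_failed", extra={"trace_id": "abc123", "fields": {"order_id": "o-42"}})
line = json.loads(buffer.getvalue())
```

In a real service you would pass sys.stdout instead of the buffer and let Fluent Bit pick the lines up from there. The point is that the event name and fields are passed as data, never concatenated into a message string.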
How the pipeline fits together
[ your app ]
     │
     ▼  winston / Loguru (structured JSON to stdout)
     │
[ Fluent Bit ]  (parses, labels, batches)
     │
     ├──▶ Loki           ← cheap, label-indexed
     ├──▶ Elasticsearch  ← full-text-heavy queries
     ├──▶ ClickHouse     ← high-volume SQL analytics
     │
     ├──▶ Sentry         ← errors only, grouped + alerted
     │
     ▼  read paths:
        - lnav (local file, no daemon)
        - Grafana (Loki UI)
        - Kibana (ES UI)
        - SigNoz MCP (AI agent → traces/logs/alerts)
        - ClickHouse MCP (AI agent → SQL, read-only)
The critical insight: pick one store, not all three. Loki is the right default for 80% of teams. Move to Elasticsearch only if full-text search across message bodies is a daily need. Move to ClickHouse only when log volume + query latency push you off Loki. The pack lists all three because the right answer depends on your traffic shape — not because you should install all three.
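The decision rule above can be written down as a tiny function. This is only a sketch of the pack's rule of thumb: the pick_store name and its parameters are hypothetical, the 100 GB/day threshold is the one quoted in the install list, and the order of the checks (volume first, then full-text) is an assumption about which pressure wins when both apply.

```python
def pick_store(gb_per_day: float, daily_full_text_search: bool,
               needs_sql_analytics: bool = False) -> str:
    """Return the single log store suggested by the pack's rule of thumb."""
    # Volume or SQL analytics pushes you off Loki towards ClickHouse.
    if gb_per_day > 100 or needs_sql_analytics:
        return "clickhouse"
    # Daily "find this string anywhere" questions favour Elasticsearch.
    if daily_full_text_search:
        return "elasticsearch"
    # Everything else: the cheap, label-indexed default.
    return "loki"


default_choice = pick_store(gb_per_day=10, daily_full_text_search=False)
```

Whatever the inputs, the function returns exactly one store, which is the whole point.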
Tradeoffs you'll hit
- Loki vs Elasticsearch vs ClickHouse — Loki is cheapest to run and easiest to operate, but its full-text search is genuinely weak (substring matches across millions of lines are slow). Elasticsearch is the opposite: heavy to run, brilliant at "find this string anywhere." ClickHouse is the SQL nuclear option — incredibly fast at aggregations but you write SQL, not LogQL/KQL. Pick the one whose tradeoff matches your usual question.
- winston vs Loguru vs pino vs zap: winston is the Node default but pino is faster (and the pino ecosystem has caught up). Loguru is the Python default but structlog is more flexible if you have complex context binding. This pack picks the defaults; switch later if you hit a real limit.
- Sentry vs the log store: Sentry overlaps with your log store on error capture. Worth running both: Sentry for the "new error appeared, page on-call" loop; the log store for the "reconstruct the request sequence" loop. They're different jobs.
- MCP server vs custom agent tools: MCP standardizes how agents call your tools, so any MCP-aware client (Claude Desktop, Cursor, ChatGPT) can use the same SigNoz/ClickHouse access. Custom OpenAI function-calling is more flexible per-agent but doesn't port. MCP wins for any tool you'll expose to more than one agent runtime.
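What "standardizes how agents call your tools" means on the wire: MCP is built on JSON-RPC 2.0, and a client invokes a tool with a tools/call request. The sketch below only builds that request body. The mcp_tool_call helper and the "query_logs" tool name are hypothetical; the actual tool names depend on the server you install.

```python
import json


def mcp_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Build the JSON-RPC 2.0 body an MCP client sends to invoke a tool."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


# "query_logs" is a made-up tool name used only for illustration.
request = mcp_tool_call(1, "query_logs", {"service": "checkout", "level": "error"})
decoded = json.loads(request)
```

Because every MCP-aware client sends this same shape, the server side is written once and reused across agent runtimes.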
Common pitfalls
- Logging strings instead of structured fields: log.info("user " + userId + " failed") is unsearchable. log.info({ event: "login_failed", userId }) is queryable in any of the stores. This is the single change that makes 80% of the rest of the stack worthwhile.
- Fluent Bit without flow control: under burst load, Fluent Bit's tail input can OOM. Set Mem_Buf_Limit and enable file-based buffering before you discover this in production.
- Loki labels with high cardinality: never label by user_id, request_id, trace_id. Every unique label set becomes its own stream with its own index entries and chunks, so one accidental high-cardinality label can 100x your bill. Keep labels to service, env, level, host.
- Sentry sample rate at 100%: fine until your background job spams the same error 50k times in 10 minutes and you hit your quota. Use the SDK's before_send to deduplicate aggressive loops at the source.
- MCP server exposed to read-write by default: many MCP server docs show the read-write example first. For ClickHouse MCP specifically, the read-only mode (set in env) is the only safe default when an agent is on the other end. Audit the config.
- Indexing log messages as schema — ES/CH will let every JSON field become a column or mapping. Six months later you have 12,000 fields, half of them typos from one buggy service. Normalize event names and field names at the logger, not at the store.
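The label-cardinality pitfall is easy to see with numbers. The sketch below splits an event into a small label set and a JSON body, then counts how many distinct streams a batch of events would create. It does not talk to Loki; split_for_loki and stream_count are hypothetical helpers, and the allowlist is the one suggested in the list above.

```python
import json

ALLOWED_LABELS = ("service", "env", "level", "host")


def split_for_loki(event: dict) -> tuple[dict, str]:
    """Keep only allowlisted keys as labels; everything else stays in the log line."""
    labels = {k: str(event[k]) for k in ALLOWED_LABELS if k in event}
    body = {k: v for k, v in event.items() if k not in ALLOWED_LABELS}
    return labels, json.dumps(body, sort_keys=True)


def stream_count(events: list[dict], label_keys: tuple) -> int:
    """Count the distinct label sets (streams) a batch of events would produce."""
    return len({tuple((k, str(e.get(k))) for k in label_keys) for e in events})


events = [
    {"service": "checkout", "env": "prod", "level": "error",
     "host": "web-1", "user_id": f"u-{i}"}
    for i in range(1000)
]
safe_streams = stream_count(events, ALLOWED_LABELS)
risky_streams = stream_count(events, ALLOWED_LABELS + ("user_id",))
labels, body = split_for_loki(events[0])
```

With the allowlist, a thousand events from one host land in a single stream. Add user_id as a label and the same thousand events become a thousand streams. The user_id is still there in the body, where a query-time filter can find it.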
Frequently asked questions
Do I really need all three of Loki, Elasticsearch, and ClickHouse?
No — pick one. The pack lists all three because the right answer depends on your shape. Loki is the default for ~80% of teams: cheap, label-indexed, easy to run. Pick Elasticsearch if your daily question is 'find this string anywhere in any message body' (it's much better at unstructured full-text). Pick ClickHouse when you cross ~100 GB/day or need real SQL analytics on logs. Running all three is fine for a comparison week, painful as a permanent state.
Where does AI fit in this stack — is the SigNoz MCP just a chat UI?
It's more than a chat UI. The MCP server exposes traces, logs, and alerts as tools an AI agent can call autonomously. Practical examples: a Claude agent triages a Sentry alert by querying SigNoz for the trace, pulls the corresponding logs from Loki, and writes a one-paragraph incident summary into your ticket — all from one prompt. The ClickHouse MCP plays the same role for SQL-style log analytics, with read-only enforced so the agent can't drop a table.
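To show what "read-only enforced" protects against, here is a deliberately naive statement guard. It is not how ClickHouse MCP implements its protection; is_read_only and the prefix list are assumptions made for illustration. Real enforcement belongs in the database itself, for example a dedicated user with ClickHouse's readonly setting, with a check like this as a second line of defence at most.

```python
import re

# Statement types treated as safe to run; an illustrative list, not exhaustive.
READ_ONLY_PREFIXES = ("select", "show", "describe", "explain")


def is_read_only(sql: str) -> bool:
    """Accept a single statement whose first keyword is on the read-only list."""
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    # Reject stacked statements such as "SELECT 1; DROP TABLE logs".
    if len(statements) != 1:
        return False
    first_word = re.match(r"[A-Za-z]+", statements[0])
    return bool(first_word) and first_word.group(0).lower() in READ_ONLY_PREFIXES


allowed = is_read_only("SELECT count() FROM logs WHERE level = 'error'")
blocked = is_read_only("DROP DATABASE production")
```

The guard errs on the side of refusing: a query with a semicolon inside a string literal, or with a leading comment, is rejected even though it might be harmless.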
Why winston/Loguru instead of just printing JSON manually?
Three reasons. First: structured fields are added by API, not string concatenation, so they're consistent across the codebase. Second: log levels, sampling, and transports (file / stdout / network) are decoupled from call sites. Third: ecosystems — winston has 100+ transports, Loguru integrates with FastAPI/Django out of the box. You could roll your own with json.dumps, but you'll re-invent these features within a month.
Is Sentry redundant if I already ship error logs to Loki?
No, the jobs differ. Loki/ES/CH store everything indiscriminately and answer 'show me the sequence around this request.' Sentry deduplicates exceptions by stack trace, groups them as 'issues,' tracks first-seen / regression / volume spike, and pages you when a new issue appears. Treat Sentry as your error inbox and your log store as the witness: both serve you, neither replaces the other.
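The grouping idea can be sketched in a few lines: hash the exception type plus the stack frames, and count events per hash. This is a simplified illustration, not Sentry's actual grouping algorithm. The fingerprint and group_issues helpers, and the event dictionaries, are made up for the example.

```python
import hashlib
from collections import Counter


def fingerprint(exc_type: str, frames: list[str]) -> str:
    """Derive a short, stable ID from the exception type and its stack frames."""
    key = exc_type + "|" + "|".join(frames)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]


def group_issues(events: list[dict]) -> Counter:
    """Count events per fingerprint; the message text is ignored on purpose."""
    return Counter(fingerprint(e["type"], e["frames"]) for e in events)


events = [
    {"type": "TimeoutError", "frames": ["checkout.pay", "http.post"],
     "message": "order 1 timed out"},
    {"type": "TimeoutError", "frames": ["checkout.pay", "http.post"],
     "message": "order 2 timed out"},
    {"type": "KeyError", "frames": ["cart.load"], "message": "'sku'"},
]
issues = group_issues(events)
```

Three events collapse into two issues because the two timeouts share a type and a stack, even though their messages differ. A grep over message text would get this wrong in both directions.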
Can this whole pack run on one VM, or do I need a Kubernetes cluster?
One mid-size VM (16 vCPU, 32 GB RAM, 500 GB SSD) comfortably handles up to ~50 GB/day of log volume with Loki + Fluent Bit + self-hosted Sentry. Past 50 GB/day, move Loki's chunks to S3-compatible object storage and give ClickHouse/Elasticsearch their own nodes. You don't need Kubernetes for this: docker-compose is fine and arguably preferable below 50 GB/day. Add k8s when you have the ops appetite to maintain it, not because the log pipeline requires it.
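A back-of-envelope check on those numbers, using the 500 GB disk and the 50 GB/day threshold quoted above. The helper names are hypothetical, and compression_ratio is an assumption you should measure on your own logs; the default of 1.0 means no compression at all and ignores everything else sharing the disk.

```python
def days_of_retention(disk_gb: float, ingest_gb_per_day: float,
                      compression_ratio: float = 1.0) -> float:
    """Days of logs that fit on disk; compression_ratio is raw size / stored size."""
    return disk_gb * compression_ratio / ingest_gb_per_day


def avg_mb_per_second(ingest_gb_per_day: float) -> float:
    """Average ingest rate in MB/s, using 1 GB = 1000 MB and 86,400 s per day."""
    return ingest_gb_per_day * 1000 / 86_400


retention_days = days_of_retention(disk_gb=500, ingest_gb_per_day=50)
ingest_rate = avg_mb_per_second(50)
```

At 50 GB/day the average rate is well under 1 MB/s, so throughput is rarely what breaks first on a single VM. Disk is: uncompressed, 500 GB holds about ten days, which is why object storage enters the picture at that volume.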