Jaeger — CNCF Distributed Tracing Platform
Jaeger is a CNCF-graduated distributed tracing system for monitoring microservice-based architectures. Track requests across services, identify latency hotspots, and understand root causes of failures in complex distributed systems.
What it is
Jaeger is a CNCF-graduated distributed tracing system designed for monitoring and troubleshooting microservice-based architectures. It tracks requests as they flow across multiple services, showing the full call chain with timing data for each hop.
Jaeger helps developers identify latency hotspots, understand service dependencies, and diagnose root causes of failures in complex distributed systems. It accepts OpenTelemetry data natively and stores traces in Elasticsearch, OpenSearch, or Cassandra, with Kafka available as a buffer in front of storage.
How it saves time or tokens
Debugging latency in a microservice architecture without distributed tracing means grepping logs across dozens of services. Jaeger provides a visual timeline of every service call in a request, immediately showing where time is spent.
For AI-assisted debugging, Jaeger's structured trace data can be exported as JSON and fed to an LLM for analysis. The model can identify patterns like cascading timeouts or retry storms that are hard to spot manually.
Jaeger's documentation and active community also reduce the time spent on integration problems. AI coding assistants can follow the documented OpenTelemetry setup patterns, which tends to mean fewer correction rounds and fewer tokens spent.
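As a rough illustration of the JSON export idea above, the sketch below condenses a trace into its slowest spans before handing it to a model. It assumes the trace JSON has the shape the Jaeger UI produces when downloading a trace (a `data` list whose entries hold `spans` and `processes`, with durations in microseconds). The function name `slowest_spans` and the sample data are hypothetical.

```python
import json


def slowest_spans(trace_json: str, top_n: int = 3) -> list[dict]:
    """Return the top_n longest spans of the first trace in a Jaeger JSON export."""
    trace = json.loads(trace_json)["data"][0]
    processes = trace["processes"]
    rows = [
        {
            # processID points into the trace-level "processes" map.
            "service": processes[span["processID"]]["serviceName"],
            "operation": span["operationName"],
            # Assumed unit: microseconds, converted here to milliseconds.
            "duration_ms": span["duration"] / 1000,
        }
        for span in trace["spans"]
    ]
    return sorted(rows, key=lambda row: row["duration_ms"], reverse=True)[:top_n]


# Hypothetical sample data in the assumed export shape.
SAMPLE = json.dumps({
    "data": [{
        "traceID": "abc123",
        "spans": [
            {"spanID": "1", "operationName": "GET /checkout",
             "duration": 120000, "processID": "p1"},
            {"spanID": "2", "operationName": "charge",
             "duration": 95000, "processID": "p2"},
            {"spanID": "3", "operationName": "reserve",
             "duration": 8000, "processID": "p3"},
        ],
        "processes": {
            "p1": {"serviceName": "frontend"},
            "p2": {"serviceName": "payment"},
            "p3": {"serviceName": "inventory"},
        },
    }]
})

summary = slowest_spans(SAMPLE, top_n=2)
print(json.dumps(summary, indent=2))
```

A short summary like this is usually cheaper to send to a model than the full trace, at the cost of dropping tags and logs that may matter for some investigations.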
How to use
- Run Jaeger all-in-one for development:
# Ports: 16686 = Jaeger UI, 4317 = OTLP gRPC, 4318 = OTLP HTTP
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest
- Instrument your application with OpenTelemetry:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Batch finished spans and send them to Jaeger's OTLP gRPC port (4317).
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint='http://localhost:4317'))
)
# Register the provider globally so trace.get_tracer() uses it.
trace.set_tracer_provider(provider)
- Open the Jaeger UI at http://localhost:16686 to search and visualize traces.
- Use the service dependency graph to understand how your services connect.
Example
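from opentelemetry import trace

# Assumes the tracer provider from the setup step is already registered.
# `order` and `call_payment_service` stand in for your own application code.
tracer = trace.get_tracer('my-service')
with tracer.start_as_current_span('process-order') as span:
    span.set_attribute('order.id', '12345')
    result = call_payment_service(order)
    span.set_attribute('payment.status', result.status)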
Related on TokRepo
- AI Tools for Monitoring — Monitoring and observability tools
- AI Tools for DevOps — DevOps and infrastructure tools
Common pitfalls
- Tracing every request in production. At high throughput, tracing everything overwhelms storage. Use sampling (1% or adaptive) to capture representative traces without drowning in data.
- Not propagating trace context across service boundaries. If any service in the chain does not propagate the trace ID, the trace breaks into disconnected fragments.
- Using the all-in-one deployment in production. It stores traces in memory and loses them on restart. Use Elasticsearch or Cassandra for production storage.
- Failing to review community discussions and changelogs before upgrading. Breaking changes in major versions can disrupt existing workflows. Pin versions in production and test upgrades in staging first.
Frequently Asked Questions
What is distributed tracing?
Distributed tracing tracks a single request as it flows through multiple microservices. Each service creates a span (a unit of work) with timing data. Spans are linked by a shared trace ID, creating a tree that shows the full request lifecycle across services.
How does Jaeger compare to Zipkin?
Both are open-source distributed tracing systems. Jaeger is CNCF-graduated with a more active community and better OpenTelemetry integration. Zipkin is simpler to deploy and has broader language support for legacy instrumentation. Both support the same core tracing concepts.
Does Jaeger work with OpenTelemetry?
Yes. Jaeger natively accepts OpenTelemetry data via the OTLP protocol. You instrument your applications with OpenTelemetry SDKs and export traces directly to Jaeger. This is the recommended approach for new deployments.
Which storage backends does Jaeger support?
Jaeger supports Elasticsearch, OpenSearch, Cassandra, Kafka (as a buffer), and an in-memory store for development. Elasticsearch is the most common production choice due to its query capabilities and operational maturity.
Can Jaeger trace AI agent workflows?
Yes. You can create spans for each step in an AI agent workflow: LLM calls, tool invocations, retrieval operations. This gives visibility into where time and tokens are spent in AI agent pipelines, helping optimize both latency and cost.