Introduction
Dynamo is an inference serving framework built for datacenter-scale deployments. It provides a Rust-based routing engine that sits in front of model backends like vLLM, SGLang, and TensorRT-LLM, handling request distribution, KV-cache-aware scheduling, and disaggregated prefill/decode execution.
What Dynamo Does
- Routes inference requests across a fleet of GPU workers, balancing load by GPU utilization and KV-cache locality
- Supports disaggregated serving where prefill and decode phases run on separate hardware
- Integrates with vLLM, SGLang, and TensorRT-LLM as pluggable backends
- Provides KV-cache-aware scheduling to minimize redundant computation
- Scales from single-node development to multi-node Kubernetes clusters
Architecture Overview
Dynamo consists of a Rust routing engine, a Python control plane, and backend adapters. The router receives incoming requests and dispatches them based on model placement, GPU utilization, and KV-cache locality. In disaggregated mode, the prefill phase of a request runs on high-throughput nodes and the decode phase runs on latency-optimized nodes, so the KV cache built during prefill must be made available to the decode worker. A NATS-based message bus coordinates state between components.
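The dispatch decision described above can be sketched as a small scoring function. This is an illustrative sketch only, not Dynamo's actual router (which is written in Rust): the names Worker, score, and pick_worker, the weights, and the linear scoring formula are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical view of a GPU worker as seen by the router."""
    name: str
    models: set[str]        # models placed on this worker
    gpu_utilization: float  # 0.0 (idle) to 1.0 (saturated)


def score(worker: Worker, prompt_tokens: int, cached_tokens: int,
          locality_weight: float = 1.0, load_weight: float = 1.0) -> float:
    """Higher is better: reward KV-cache overlap, penalize busy GPUs.

    The linear combination and default weights are assumptions for
    illustration; a real router may use a different cost model.
    """
    overlap = min(cached_tokens, prompt_tokens) / max(prompt_tokens, 1)
    return locality_weight * overlap - load_weight * worker.gpu_utilization


def pick_worker(workers: list[Worker], model: str, prompt_tokens: int,
                cached_tokens: dict[str, int]) -> Worker:
    """Choose a worker for one request.

    cached_tokens maps worker name to the number of this prompt's tokens
    already present in that worker's KV cache.
    """
    # Model placement: only workers that host the model are candidates.
    candidates = [w for w in workers if model in w.models]
    if not candidates:
        raise LookupError(f"no worker serves model {model!r}")
    return max(
        candidates,
        key=lambda w: score(w, prompt_tokens, cached_tokens.get(w.name, 0)),
    )
```

With these assumed weights, a moderately busy worker that already holds most of a prompt's KV cache can beat an idle worker with none of it. Raising load_weight shifts the balance back toward the idle worker.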
Self-Hosting & Configuration
- Install via pip for single-node setups or Helm charts for Kubernetes
- Configure model placement and routing policies via YAML files
- Backend selection (vLLM, SGLang, TensorRT-LLM) is specified per model
- Metrics are exposed in Prometheus format for monitoring
- Supports both HTTP and gRPC APIs for client connections
Key Features
- Disaggregated serving splits prefill and decode across hardware tiers
- KV-cache-aware routing reduces redundant computation for prompts that share a prefix
- Rust routing engine handles thousands of concurrent requests with low latency
- Pluggable backend architecture supports multiple inference engines
- Kubernetes-native deployment with auto-scaling support
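To show why KV-cache-aware routing helps, here is a minimal sketch of one common way to measure how much of a prompt a worker has already cached: split the token sequence into fixed-size blocks and chain-hash them so that each hash identifies the whole prefix up to that block. The block size, the hashing scheme, and the function names are assumptions for illustration, not Dynamo's implementation.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block (assumed; real engines use larger blocks)


def block_hashes(tokens: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Return one hash per full block, each chained to the previous hash.

    Chaining means a hash matches only if every earlier block matched too,
    so a hit on block i implies the entire prefix up to block i is cached.
    """
    hashes = []
    prev = ""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        payload = prev + ":" + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(prev)
    return hashes


def cached_prefix_blocks(tokens: list[int], worker_cache: set[str]) -> int:
    """Count the leading blocks of this prompt present in a worker's cache."""
    count = 0
    for h in block_hashes(tokens):
        if h not in worker_cache:
            break  # the reusable prefix ends at the first miss
        count += 1
    return count
```

For example, two prompts that start with the same eight-token system prompt share their first two blocks, so a worker that served the first prompt can skip recomputing those blocks for the second.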
Comparison with Similar Tools
- vLLM: high-throughput inference engine; Dynamo adds fleet-level routing and disaggregated serving on top
- TGI (Text Generation Inference): Hugging Face's serving stack; Dynamo supports multiple backends and datacenter-scale orchestration
- Ray Serve: general-purpose model serving; Dynamo is specialized for LLM inference patterns
- Triton Inference Server: NVIDIA's multi-framework server; Dynamo adds LLM-specific optimizations like KV-cache routing
- KServe: Kubernetes ML serving; Dynamo provides deeper LLM-aware scheduling
FAQ
Q: Do I need Kubernetes to run Dynamo? A: No. Single-node mode needs only a pip install. Kubernetes is needed for multi-node deployments.
Q: Which GPU types are supported? A: Any GPU supported by the underlying backend (vLLM, SGLang, TensorRT-LLM), including NVIDIA A100, H100, and consumer GPUs.
Q: What is disaggregated serving? A: It separates the prefill phase (processing the input prompt) from the decode phase (generating tokens), allowing each to run on hardware optimized for its workload.
Q: Is Dynamo open source? A: Yes. It is released under the Apache 2.0 license.
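The prefill/decode split described in the FAQ can be made concrete with a toy sketch. The "model" here is a stand-in that derives the next token from a sum, and the names KVCache, prefill, decode, and serve_disaggregated are hypothetical; the point is only to show the handoff of cached state between the two phases, not how Dynamo or any backend implements it.

```python
from dataclasses import dataclass


@dataclass
class KVCache:
    """Stand-in for the per-token attention state produced by prefill."""
    entries: list[int]


def toy_next_token(cache: KVCache) -> int:
    # Toy replacement for a model forward pass: depends on all cached entries.
    return sum(cache.entries) % 100


def prefill(prompt_tokens: list[int]) -> tuple[KVCache, int]:
    """Process the whole prompt in one pass; return the cache and first token."""
    cache = KVCache(entries=list(prompt_tokens))
    return cache, toy_next_token(cache)


def decode(cache: KVCache, first_token: int, max_new_tokens: int) -> list[int]:
    """Generate tokens one at a time, extending the cache after each step."""
    output = [first_token]
    while len(output) < max_new_tokens:
        cache.entries.append(output[-1])
        output.append(toy_next_token(cache))
    return output


def serve_disaggregated(prompt_tokens: list[int], max_new_tokens: int) -> list[int]:
    # Phase 1 would run on a prefill worker.
    cache, first_token = prefill(prompt_tokens)
    # Copying the cache stands in for transferring it to another machine.
    transferred = KVCache(entries=list(cache.entries))
    # Phase 2 would run on a decode worker, using only the transferred state.
    return decode(transferred, first_token, max_new_tokens)
```

Because decode depends only on the transferred cache and the first token, the two phases can be placed on different hardware without changing the generated output.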