# Dynamo: Datacenter-Scale Distributed Inference Serving Framework

> A Rust-based inference serving framework designed for datacenter-scale deployments, supporting disaggregated serving, dynamic routing, and integration with vLLM, SGLang, and TensorRT-LLM.

## Quick Use

```bash
pip install ai-dynamo
dynamo serve --model meta-llama/Llama-3-8B --backend vllm
# Or deploy on Kubernetes:
dynamo deploy --config cluster.yaml
```

## Introduction

Dynamo is an inference serving framework built for datacenter-scale deployments. It provides a Rust-based routing engine that sits in front of model backends like vLLM, SGLang, and TensorRT-LLM, handling request distribution, KV-cache-aware scheduling, and disaggregated prefill/decode execution.

## What Dynamo Does

- Routes inference requests across a fleet of GPU workers with intelligent load balancing
- Supports disaggregated serving where prefill and decode phases run on separate hardware
- Integrates with vLLM, SGLang, and TensorRT-LLM as pluggable backends
- Provides KV-cache-aware scheduling to minimize redundant computation
- Scales from single-node development to multi-node Kubernetes clusters

## Architecture Overview

Dynamo consists of a Rust routing engine, a Python control plane, and backend adapters. The router receives incoming requests and dispatches them based on model placement, GPU utilization, and KV-cache locality. In disaggregated mode, prefill requests are sent to high-throughput nodes while decode requests go to latency-optimized nodes. A NATS-based message bus coordinates state between components.

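
The dispatch decision described above can be illustrated with a small, self-contained sketch. It is not Dynamo's router and uses none of its APIs: the `Worker` record, the `pick_worker` function, the scoring formula, and the `load_weight` parameter are all assumptions made up for illustration. The idea is only to show how KV-cache locality and utilization might be combined into one score, with prefill and decode workers selected from separate pools.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker record; not part of Dynamo's API."""
    name: str
    role: str    # "prefill" or "decode"
    load: float  # fraction of capacity in use, 0.0 to 1.0
    cached_prefixes: list = field(default_factory=list)  # token tuples


def shared_prefix_len(a, b):
    """Count the leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def cache_overlap(worker, tokens):
    """Length of the longest cached prefix on this worker matching the prompt."""
    return max(
        (shared_prefix_len(p, tokens) for p in worker.cached_prefixes),
        default=0,
    )


def pick_worker(workers, role, tokens, load_weight=0.5):
    """Pick the worker of the given role with the highest score.

    Assumed scoring rule: cached fraction of the prompt minus a load penalty.
    """
    candidates = [w for w in workers if w.role == role]
    if not candidates:
        raise ValueError(f"no workers with role {role!r}")

    def score(w):
        hit = cache_overlap(w, tokens) / max(len(tokens), 1)
        return hit - load_weight * w.load

    return max(candidates, key=score)


fleet = [
    Worker("prefill-0", "prefill", load=0.2, cached_prefixes=[(1, 2, 3)]),
    Worker("prefill-1", "prefill", load=0.1),
    Worker("decode-0", "decode", load=0.5),
    Worker("decode-1", "decode", load=0.3),
]
prompt = (1, 2, 3, 4)

# Disaggregated mode: each phase is routed to its own pool.
print(pick_worker(fleet, "prefill", prompt).name)  # prefill-0
print(pick_worker(fleet, "decode", prompt).name)   # decode-1
```

In this toy fleet, `prefill-0` wins the prefill phase despite its higher load because three of the four prompt tokens are already cached there, while the decode phase falls back to the least-loaded decode worker since neither has a cached prefix. A production router would track cache state at block granularity and refresh it from the message bus rather than from an in-memory list.
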
## Self-Hosting & Configuration

- Install via pip for single-node setups or Helm charts for Kubernetes
- Configure model placement and routing policies via YAML files
- Backend selection (vLLM, SGLang, TensorRT-LLM) is specified per model
- Metrics are exposed in Prometheus format for monitoring
- Supports both HTTP and gRPC APIs for client connections

## Key Features

- Disaggregated serving splits prefill and decode across hardware tiers
- KV-cache-aware routing reduces redundant computation for similar prompts
- Rust routing engine handles thousands of concurrent requests with low latency
- Pluggable backend architecture supports multiple inference engines
- Kubernetes-native deployment with auto-scaling support

## Comparison with Similar Tools

- **vLLM**: high-throughput inference engine; Dynamo adds fleet-level routing and disaggregated serving on top
- **TGI**: Hugging Face serving; Dynamo supports multiple backends and datacenter-scale orchestration
- **Ray Serve**: general-purpose serving; Dynamo is specialized for LLM inference patterns
- **Triton Inference Server**: NVIDIA multi-framework server; Dynamo adds LLM-specific optimizations like KV-cache routing
- **KServe**: Kubernetes ML serving; Dynamo provides deeper LLM-aware scheduling

## FAQ

**Q: Do I need Kubernetes to run Dynamo?**
A: No. Single-node mode works with just pip install. Kubernetes is needed for multi-node deployments.

**Q: Which GPU types are supported?**
A: Any GPU supported by the underlying backend (vLLM, SGLang, TensorRT-LLM), including NVIDIA A100, H100, and consumer GPUs.

**Q: What is disaggregated serving?**
A: It separates the prefill phase (processing the input prompt) from the decode phase (generating tokens), allowing each to run on hardware optimized for its workload.

**Q: Is Dynamo open source?**
A: Yes. It is released under the Apache 2.0 license.

## Sources

- https://github.com/ai-dynamo/dynamo
- https://docs.ai-dynamo.dev

---

Source: https://tokrepo.com/en/workflows/dynamo-datacenter-scale-distributed-inference-serving-2ff611c4

Author: Script Depot