# Dynamo — Datacenter-Scale Distributed Inference Serving Framework

> A Rust-based inference serving framework designed for datacenter-scale deployments, supporting disaggregated serving, dynamic routing, and integration with vLLM, SGLang, and TensorRT-LLM.

## Quick Use
```bash
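# Install the Dynamo package from PyPI
pip install ai-dynamo
# Serve a model on a single node using the vLLM backend
dynamo serve --model meta-llama/Meta-Llama-3-8B --backend vllm
# Or deploy on Kubernetes:
dynamo deploy --config cluster.yaml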
```

## Introduction
Dynamo is an inference serving framework built for datacenter-scale deployments. It provides a Rust-based routing engine that sits in front of model backends like vLLM, SGLang, and TensorRT-LLM, handling request distribution, KV-cache-aware scheduling, and disaggregated prefill/decode execution.

## What Dynamo Does
- Routes inference requests across a fleet of GPU workers, balancing load by model placement, GPU utilization, and KV-cache locality
- Supports disaggregated serving where prefill and decode phases run on separate hardware
- Integrates with vLLM, SGLang, and TensorRT-LLM as pluggable backends
- Provides KV-cache-aware scheduling to minimize redundant computation
- Scales from single-node development to multi-node Kubernetes clusters

## Architecture Overview
Dynamo consists of a Rust routing engine, a Python control plane, and backend adapters. The router receives incoming requests and dispatches them based on model placement, GPU utilization, and KV-cache locality. In disaggregated mode, prefill requests are sent to high-throughput nodes while decode requests go to latency-optimized nodes. A NATS-based message bus coordinates state between components.
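
The dispatch decision described above can be sketched as a scoring function. This is a minimal illustration, not Dynamo's router: the `Worker` fields, the `"prefill"`/`"decode"` role labels, and the weights are all assumptions made for the example, and a real router would compute cache overlap per request rather than store a single number per worker.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    role: str                   # "prefill" or "decode" (hypothetical labels)
    gpu_utilization: float      # 0.0 (idle) to 1.0 (saturated)
    cached_prefix_tokens: int   # prompt tokens assumed to be in this worker's KV cache


def score(worker: Worker, prompt_tokens: int,
          cache_weight: float = 1.0, load_weight: float = 1.0) -> float:
    """Higher is better: reward KV-cache reuse, penalize busy GPUs."""
    reuse = min(worker.cached_prefix_tokens, prompt_tokens) / max(prompt_tokens, 1)
    return cache_weight * reuse - load_weight * worker.gpu_utilization


def pick_worker(workers, role: str, prompt_tokens: int) -> Worker:
    """Restrict to the pool for this phase, then take the best-scoring worker."""
    candidates = [w for w in workers if w.role == role]
    if not candidates:
        raise ValueError(f"no workers with role {role!r}")
    return max(candidates, key=lambda w: score(w, prompt_tokens))


workers = [
    Worker("prefill-0", "prefill", gpu_utilization=0.50, cached_prefix_tokens=0),
    Worker("prefill-1", "prefill", gpu_utilization=0.75, cached_prefix_tokens=768),
    Worker("decode-0", "decode", gpu_utilization=0.25, cached_prefix_tokens=0),
]

chosen_prefill = pick_worker(workers, "prefill", prompt_tokens=1024)
chosen_decode = pick_worker(workers, "decode", prompt_tokens=1024)
print(chosen_prefill.name, chosen_decode.name)  # prefill-1 decode-0
```

Here `prefill-1` wins despite being busier, because reusing 768 of 1024 cached prompt tokens outweighs the extra load under these example weights.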

## Self-Hosting & Configuration
- Install via pip for single-node setups or Helm charts for Kubernetes
- Configure model placement and routing policies via YAML files
- Backend selection (vLLM, SGLang, TensorRT-LLM) is specified per model
- Metrics are exposed in Prometheus format for monitoring
- Supports both HTTP and gRPC APIs for client connections
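
For readers unfamiliar with the Prometheus text exposition format mentioned above, the sketch below renders a metric in that format. The metric name `router_requests_total` and its label are hypothetical and chosen only for illustration; they are not taken from Dynamo.

```python
def render_prometheus(metrics):
    """Render metrics as Prometheus text exposition lines.

    `metrics` maps a metric name to (type, help text, samples), where each
    sample is a (labels dict, value) pair.
    """
    lines = []
    for name, (metric_type, help_text, samples) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        for labels, value in samples:
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            if label_str:
                lines.append(f"{name}{{{label_str}}} {value}")
            else:
                lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric, for illustration only.
example = {
    "router_requests_total": (
        "counter",
        "Requests dispatched by the router.",
        [({"backend": "vllm"}, 42)],
    ),
}
rendered = render_prometheus(example)
print(rendered, end="")
```

A Prometheus server scrapes text like this over HTTP at a fixed interval, so any monitoring stack that understands the format can consume it.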

## Key Features
- Disaggregated serving splits prefill and decode across hardware tiers
- KV-cache-aware routing reduces redundant computation for prompts that share a prefix
- Rust routing engine handles thousands of concurrent requests with low latency
- Pluggable backend architecture supports multiple inference engines
- Kubernetes-native deployment with auto-scaling support
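
One common way to make routing KV-cache-aware is to hash the prompt in fixed-size blocks and count how many leading blocks a worker already holds. The sketch below shows that idea under stated assumptions: the block size, the chained SHA-256 hashing, and the worker names are illustrative choices, not a description of Dynamo's internals.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block; hypothetical, real engines use larger blocks


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block, chaining in the previous hash.

    Chaining means a hash identifies the entire prefix up to that block,
    not just the block's own tokens.
    """
    hashes = []
    parent = b""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


def cached_prefix_blocks(request_hashes, worker_cache):
    """Count leading blocks of the request already present in a worker's cache."""
    count = 0
    for h in request_hashes:
        if h not in worker_cache:
            break
        count += 1
    return count


# Two requests share an 8-token system prompt; worker-a served the first one.
system_prompt = list(range(8))
earlier_request = system_prompt + [100, 101, 102, 103]
new_request = system_prompt + [200, 201, 202, 203]

worker_caches = {
    "worker-a": set(block_hashes(earlier_request)),
    "worker-b": set(),
}

request_hashes = block_hashes(new_request)
overlap = {
    name: cached_prefix_blocks(request_hashes, cache)
    for name, cache in worker_caches.items()
}
best = max(overlap, key=overlap.get)
print(overlap, best)  # {'worker-a': 2, 'worker-b': 0} worker-a
```

Routing the new request to `worker-a` lets it skip prefill for the two shared blocks and compute only the final, differing block.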

## Comparison with Similar Tools
- **vLLM** — high-throughput inference engine; Dynamo adds fleet-level routing and disaggregated serving on top
- **TGI** — Hugging Face serving; Dynamo supports multiple backends and datacenter-scale orchestration
- **Ray Serve** — general-purpose serving; Dynamo is specialized for LLM inference patterns
- **Triton Inference Server** — NVIDIA multi-framework server; Dynamo adds LLM-specific optimizations like KV-cache routing
- **KServe** — Kubernetes ML serving; Dynamo provides deeper LLM-aware scheduling

## FAQ
**Q: Do I need Kubernetes to run Dynamo?**
A: No. Single-node mode needs only `pip install ai-dynamo`. Kubernetes is needed for multi-node deployments.

**Q: Which GPU types are supported?**
A: Any GPU supported by the underlying backend (vLLM, SGLang, TensorRT-LLM), including NVIDIA A100, H100, and consumer GPUs.

**Q: What is disaggregated serving?**
A: It separates the prefill phase (processing the input prompt) from the decode phase (generating tokens), allowing each to run on hardware optimized for its workload.
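
The handoff between the two phases can be illustrated with a toy example. Everything here is a stand-in: `KVCache` holds token IDs instead of tensors, and `toy_next_token` replaces a real model forward pass with a deterministic formula so the flow is easy to follow. In a real deployment the two functions would run on different workers and the cache would be transferred between them.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Stand-in for per-token key/value tensors; here just the tokens seen so far."""
    tokens: list = field(default_factory=list)


def toy_next_token(cache: KVCache) -> int:
    # Placeholder for a model forward pass; deterministic so the flow is traceable.
    return (sum(cache.tokens) + len(cache.tokens)) % 50


def prefill(prompt_tokens):
    """Prefill phase: process the whole prompt in one pass, build the cache,
    and produce the first output token."""
    cache = KVCache(tokens=list(prompt_tokens))
    first_token = toy_next_token(cache)
    return cache, first_token


def decode(cache: KVCache, first_token: int, max_new_tokens: int):
    """Decode phase: extend the transferred cache one token at a time."""
    output = [first_token]
    cache.tokens.append(first_token)
    while len(output) < max_new_tokens:
        next_token = toy_next_token(cache)
        output.append(next_token)
        cache.tokens.append(next_token)
    return output


# Phase 1 would run on a prefill worker, phase 2 on a decode worker.
cache, first = prefill([1, 2, 3])
generated = decode(cache, first, max_new_tokens=3)
print(generated)  # [9, 19, 39]
```

Prefill touches every prompt token at once and is compute-heavy, while decode runs many small sequential steps, which is why the two phases benefit from differently tuned hardware.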

**Q: Is Dynamo open source?**
A: Yes. It is released under the Apache 2.0 license.

## Sources
- https://github.com/ai-dynamo/dynamo
- https://docs.ai-dynamo.dev

---
Source: https://tokrepo.com/en/workflows/dynamo-datacenter-scale-distributed-inference-serving-2ff611c4
Author: Script Depot