Dynamo — Datacenter-Scale Distributed Inference Serving Framework

Introduction

Dynamo is an inference serving framework built for datacenter-scale deployments. It provides a Rust-based routing engine that sits in front of model backends like vLLM, SGLang, and TensorRT-LLM, handling request distribution, KV-cache-aware scheduling, and disaggregated prefill/decode execution.

What Dynamo Does

Routes inference requests across a fleet of GPU workers with intelligent load balancing
Supports disaggregated serving where prefill and decode phases run on separate hardware
Integrates with vLLM, SGLang, and TensorRT-LLM as pluggable backends
Provides KV-cache-aware scheduling to minimize redundant computation
Scales from single-node development to multi-node Kubernetes clusters

Architecture Overview

Dynamo consists of a Rust routing engine, a Python control plane, and backend adapters. The router receives incoming requests and dispatches them based on model placement, GPU utilization, and KV-cache locality. In disaggregated mode, the prefill phase of each request runs on high-throughput nodes and the decode phase runs on latency-optimized nodes. A NATS-based message bus coordinates state between components.
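
The dispatch decision described above can be illustrated with a small sketch. This is not Dynamo's router (which is written in Rust); it is a minimal Python model that assumes a worker's KV cache can be summarized as a set of prefix-block hashes and that load is the fraction of busy request slots. The names Worker, prefix_block_hashes, pick_worker, and the two weights are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical view of one GPU worker as seen by a router."""
    name: str
    active_requests: int   # current load (illustrative signal)
    capacity: int          # maximum concurrent requests
    cached_blocks: set = field(default_factory=set)  # hashes of cached prefix blocks


BLOCK_SIZE = 4  # tokens per KV block; kept tiny for illustration


def prefix_block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block together with everything before it, so a block
    matches only when the entire preceding prefix matches too."""
    return [
        hash(tuple(tokens[:end]))
        for end in range(block_size, len(tokens) + 1, block_size)
    ]


def cached_prefix_blocks(worker, block_hashes):
    """Count the leading blocks already present in the worker's cache."""
    count = 0
    for h in block_hashes:
        if h not in worker.cached_blocks:
            break
        count += 1
    return count


def pick_worker(workers, tokens, cache_weight=1.0, load_weight=1.0):
    """Return the worker with the best cache-overlap versus load trade-off."""
    block_hashes = prefix_block_hashes(tokens)
    total = max(len(block_hashes), 1)

    def score(w):
        overlap = cached_prefix_blocks(w, block_hashes) / total
        load = w.active_requests / w.capacity
        return cache_weight * overlap - load_weight * load

    candidates = [w for w in workers if w.active_requests < w.capacity]
    if not candidates:
        raise RuntimeError("no worker has free capacity")
    return max(candidates, key=score)


# Usage: a busier worker that already holds two of three prefix blocks
# can still beat a nearly idle worker with a cold cache.
prompt = list(range(12))
warm = Worker("warm", 6, 10, set(prefix_block_hashes(prompt)[:2]))
idle = Worker("idle", 1, 10)
print(pick_worker([warm, idle], prompt).name)  # prints "warm"
```

Setting cache_weight to 0 turns the same function into plain least-loaded balancing, which makes the effect of cache locality on placement easy to compare.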

Self-Hosting & Configuration

Install via pip for single-node setups or Helm charts for Kubernetes
Configure model placement and routing policies via YAML files
Backend selection (vLLM, SGLang, TensorRT-LLM) is specified per model
Metrics are exposed in Prometheus format for monitoring
Supports both HTTP and gRPC APIs for client connections
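
As a rough illustration of per-model backend selection, the sketch below validates a configuration after it has been loaded from YAML into a Python dictionary. The key names (models, name, backend, replicas, routing, policy) are assumptions made for this example and are not Dynamo's actual schema; only the three backend names come from the text above.

```python
SUPPORTED_BACKENDS = {"vllm", "sglang", "tensorrt-llm"}  # backends named in the text

# Hypothetical config, shaped as it might look after loading a YAML file.
# Every key name here is an illustrative assumption, not Dynamo's schema.
EXAMPLE_CONFIG = {
    "models": [
        {"name": "chat-large", "backend": "vllm", "replicas": 4},
        {"name": "code-small", "backend": "sglang", "replicas": 2},
    ],
    "routing": {"policy": "kv_cache_aware"},
}


def validate_config(config):
    """Return a list of human-readable problems; an empty list means it passed."""
    problems = []
    models = config.get("models", [])
    if not models:
        problems.append("no models defined")
    seen = set()
    for i, model in enumerate(models):
        name = model.get("name")
        if not name:
            problems.append(f"models[{i}]: missing name")
        elif name in seen:
            problems.append(f"models[{i}]: duplicate name {name!r}")
        else:
            seen.add(name)
        backend = model.get("backend")
        if backend not in SUPPORTED_BACKENDS:
            problems.append(f"models[{i}]: unsupported backend {backend!r}")
        replicas = model.get("replicas", 1)
        if not isinstance(replicas, int) or replicas < 1:
            problems.append(f"models[{i}]: replicas must be a positive integer")
    return problems


print(validate_config(EXAMPLE_CONFIG))  # prints []
```

Checking the backend name per model before starting any workers surfaces typos early, which matters more as the number of models and nodes grows.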

Key Features

Disaggregated serving splits prefill and decode across hardware tiers
KV-cache-aware routing reduces redundant computation for prompts that share a prefix
Rust routing engine handles thousands of concurrent requests with low latency
Pluggable backend architecture supports multiple inference engines
Kubernetes-native deployment with auto-scaling support
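
The disaggregated serving feature listed first above can be pictured as a two-stage pipeline. The sketch below is a toy model under stated assumptions: it runs no real model, uses simple round-robin selection within each pool, and represents the transferred KV cache with a hypothetical KVHandle object. DisaggregatedScheduler and its methods are illustrative names, not Dynamo APIs.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class KVHandle:
    """Stand-in for the KV cache produced by prefill (hypothetical type)."""
    request_id: str
    prompt_tokens: int
    prefill_worker: str


class DisaggregatedScheduler:
    """Round-robin over two separate worker pools, one per phase."""

    def __init__(self, prefill_workers, decode_workers):
        if not prefill_workers or not decode_workers:
            raise ValueError("each phase needs at least one worker")
        self._prefill = cycle(prefill_workers)
        self._decode = cycle(decode_workers)

    def run_prefill(self, request_id, prompt_tokens):
        """Process the whole prompt once on a prefill worker."""
        worker = next(self._prefill)
        return KVHandle(request_id, prompt_tokens, worker)

    def run_decode(self, handle, max_new_tokens):
        """Generate tokens on a decode worker, reusing the handed-over KV state."""
        worker = next(self._decode)
        return {
            "request_id": handle.request_id,
            "prefill_worker": handle.prefill_worker,
            "decode_worker": worker,
            "generated_tokens": max_new_tokens,  # placeholder: no model is run
        }

    def serve(self, request_id, prompt_tokens, max_new_tokens):
        """Run both phases in order for a single request."""
        handle = self.run_prefill(request_id, prompt_tokens)
        return self.run_decode(handle, max_new_tokens)


scheduler = DisaggregatedScheduler(["prefill-0", "prefill-1"], ["decode-0"])
for rid in ("r1", "r2"):
    result = scheduler.serve(rid, prompt_tokens=512, max_new_tokens=64)
    print(rid, result["prefill_worker"], "->", result["decode_worker"])
```

Keeping the two pools separate lets each be sized independently, for example more prefill workers for long-prompt traffic and more decode workers for long generations.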

Comparison with Similar Tools

vLLM — high-throughput inference engine; Dynamo adds fleet-level routing and disaggregated serving on top
TGI — Hugging Face serving; Dynamo supports multiple backends and datacenter-scale orchestration
Ray Serve — general-purpose serving; Dynamo is specialized for LLM inference patterns
Triton Inference Server — NVIDIA multi-framework server; Dynamo adds LLM-specific optimizations like KV-cache routing
KServe — Kubernetes ML serving; Dynamo provides deeper LLM-aware scheduling

FAQ

Q: Do I need Kubernetes to run Dynamo? A: No. Single-node mode needs only a pip install. Kubernetes is required only for multi-node deployments.

Q: Which GPU types are supported? A: Any GPU supported by the underlying backend (vLLM, SGLang, TensorRT-LLM), including NVIDIA A100, H100, and consumer GPUs.

Q: What is disaggregated serving? A: It separates the prefill phase (processing the input prompt) from the decode phase (generating tokens), allowing each to run on hardware optimized for its workload.

Q: Is Dynamo open source? A: Yes. It is released under the Apache 2.0 license.
