Scripts · Jul 1, 2026 · 3 min read

Dynamo — Datacenter-Scale Distributed Inference Serving Framework

A Rust-based inference serving framework designed for datacenter-scale deployments, supporting disaggregated serving, dynamic routing, and integration with vLLM, SGLang, and TensorRT-LLM.

Direct install command
npx -y tokrepo@latest install 2ff611c4-758b-11f1-9bc6-00163e2b0d79 --target codex

Run this only after confirming the install plan in a dry run.

Introduction

Dynamo is an inference serving framework built for datacenter-scale deployments. It provides a Rust-based routing engine that sits in front of model backends like vLLM, SGLang, and TensorRT-LLM, handling request distribution, KV-cache-aware scheduling, and disaggregated prefill/decode execution.

What Dynamo Does

  • Routes inference requests across a fleet of GPU workers, balancing load by GPU utilization and KV-cache locality
  • Supports disaggregated serving where prefill and decode phases run on separate hardware
  • Integrates with vLLM, SGLang, and TensorRT-LLM as pluggable backends
  • Provides KV-cache-aware scheduling to minimize redundant computation
  • Scales from single-node development to multi-node Kubernetes clusters

Architecture Overview

Dynamo consists of a Rust routing engine, a Python control plane, and backend adapters. The router receives incoming requests and dispatches them based on model placement, GPU utilization, and KV-cache locality. In disaggregated mode, prefill requests are sent to high-throughput nodes while decode requests go to latency-optimized nodes. A NATS-based message bus coordinates state between components.
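
The dispatch decision described above can be sketched in a few lines of Python. This is a minimal illustration of KV-cache-aware routing in general, not Dynamo's actual router (which is written in Rust). The names `Worker`, `block_hashes`, and `pick_worker`, the block size of 4 tokens, and the `load_weight` value are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from hashlib import sha256

BLOCK_SIZE = 4  # tokens per KV-cache block (illustrative value, not a Dynamo default)


def block_hashes(tokens: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each full block of the prompt, chaining in the previous hash so
    that a block's hash identifies the whole prefix up to and including it."""
    hashes, prev = [], ""
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        prev = sha256((prev + ",".join(map(str, block))).encode()).hexdigest()
        hashes.append(prev)
    return hashes


@dataclass
class Worker:
    name: str
    active_requests: int = 0
    cached_blocks: set[str] = field(default_factory=set)


def cached_prefix_len(worker: Worker, hashes: list[str]) -> int:
    """Number of leading prompt blocks already present in the worker's cache."""
    n = 0
    for h in hashes:
        if h not in worker.cached_blocks:
            break
        n += 1
    return n


def pick_worker(workers: list[Worker], tokens: list[int],
                load_weight: float = 0.5) -> Worker:
    """Prefer workers that can reuse more cached blocks, penalised by load."""
    hashes = block_hashes(tokens)

    def score(w: Worker) -> float:
        return cached_prefix_len(w, hashes) - load_weight * w.active_requests

    best = max(workers, key=score)
    best.active_requests += 1
    best.cached_blocks.update(hashes)  # the worker holds these blocks after prefill
    return best


workers = [Worker("gpu-0"), Worker("gpu-1")]
system_prompt = list(range(8))  # two full blocks shared by both requests
first = pick_worker(workers, system_prompt + [100, 101, 102, 103])
second = pick_worker(workers, system_prompt + [200, 201, 202, 203])
print(first.name, second.name)
```

Both requests land on the same worker here: the second request shares two cached blocks with the first, and that reuse outweighs the load penalty of one in-flight request. Raising `load_weight` shifts the balance toward spreading requests evenly.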

Self-Hosting & Configuration

  • Install via pip for single-node setups or Helm charts for Kubernetes
  • Configure model placement and routing policies via YAML files
  • Backend selection (vLLM, SGLang, TensorRT-LLM) is specified per model
  • Metrics are exposed in Prometheus format for monitoring
  • Supports both HTTP and gRPC APIs for client connections
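
As an illustration of per-model backend selection, the sketch below validates a placement configuration after it has been loaded into a Python dictionary (for example, by a YAML parser). The schema is hypothetical: the keys `models`, `name`, `backend`, and `replicas`, the backend identifier spellings, and the model names are assumptions for the example and are not taken from Dynamo's configuration format.

```python
from dataclasses import dataclass

# Backends named on this page; the lowercase identifiers are assumed spellings.
SUPPORTED_BACKENDS = {"vllm", "sglang", "tensorrt-llm"}


@dataclass(frozen=True)
class ModelPlacement:
    name: str
    backend: str
    replicas: int


def parse_placements(config: dict) -> list[ModelPlacement]:
    """Validate a hypothetical placement config and return typed entries."""
    placements = []
    for entry in config.get("models", []):
        backend = entry["backend"].lower()
        if backend not in SUPPORTED_BACKENDS:
            raise ValueError(f"model {entry['name']!r}: unknown backend {backend!r}")
        replicas = int(entry.get("replicas", 1))  # default to a single replica
        if replicas < 1:
            raise ValueError(f"model {entry['name']!r}: replicas must be >= 1")
        placements.append(ModelPlacement(entry["name"], backend, replicas))
    return placements


# Hypothetical config, shaped like the result of loading a YAML file.
config = {
    "models": [
        {"name": "chat-model", "backend": "vLLM", "replicas": 2},
        {"name": "code-model", "backend": "TensorRT-LLM"},
    ]
}

placements = parse_placements(config)
for p in placements:
    print(f"{p.name}: backend={p.backend} replicas={p.replicas}")
```

Rejecting unknown backends at load time means a typo in a config file fails fast instead of surfacing later as a routing error.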

Key Features

  • Disaggregated serving splits prefill and decode across hardware tiers
  • KV-cache-aware routing reduces redundant computation for similar prompts
  • Rust routing engine handles thousands of concurrent requests with low latency
  • Pluggable backend architecture supports multiple inference engines
  • Kubernetes-native deployment with auto-scaling support
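
To make the first feature in the list above concrete, here is a toy sketch of the prefill/decode split. It is an assumption-laden illustration, not Dynamo code: `KVCache`, `prefill`, `decode`, and the arithmetic "model" in `toy_next_token` are invented for the example, and a real deployment would transfer the cache between machines rather than pass a Python object.

```python
from dataclasses import dataclass

VOCAB = 50  # toy vocabulary size


@dataclass
class KVCache:
    """Stand-in for the per-token key/value tensors a real model would keep."""
    entries: list[int]


def toy_next_token(cache: KVCache) -> int:
    # Toy "model": the next token depends on every cached position.
    return sum(cache.entries) % VOCAB


def prefill(prompt: list[int]) -> tuple[KVCache, int]:
    """Prefill worker: process the whole prompt in one pass and return the
    resulting cache together with the first generated token."""
    cache = KVCache(entries=list(prompt))
    return cache, toy_next_token(cache)


def decode(cache: KVCache, first_token: int, max_new_tokens: int) -> list[int]:
    """Decode worker: extend the cache one token at a time."""
    out = [first_token]
    while len(out) < max_new_tokens:
        cache.entries.append(out[-1])  # cache grows by one position per step
        out.append(toy_next_token(cache))
    return out


prompt = [3, 1, 4, 1, 5]
cache, first = prefill(prompt)    # would run on a prefill node
tokens = decode(cache, first, 4)  # the cache is handed to a decode node
print(tokens)
```

The two functions touch the cache differently: `prefill` builds it in a single pass over the prompt, while `decode` reads and extends it once per generated token. That difference in access pattern is why the two phases can benefit from different hardware.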

Comparison with Similar Tools

  • vLLM — high-throughput inference engine; Dynamo adds fleet-level routing and disaggregated serving on top
  • TGI — Hugging Face serving; Dynamo supports multiple backends and datacenter-scale orchestration
  • Ray Serve — general-purpose serving; Dynamo is specialized for LLM inference patterns
  • Triton Inference Server — NVIDIA multi-framework server; Dynamo adds LLM-specific optimizations like KV-cache routing
  • KServe — Kubernetes ML serving; Dynamo provides deeper LLM-aware scheduling

FAQ

Q: Do I need Kubernetes to run Dynamo? A: No. A single-node setup needs only the pip installation; Kubernetes is required only for multi-node deployments.

Q: Which GPU types are supported? A: Any GPU supported by the underlying backend (vLLM, SGLang, TensorRT-LLM), including NVIDIA A100, H100, and consumer GPUs.

Q: What is disaggregated serving? A: It separates the prefill phase (processing the input prompt) from the decode phase (generating tokens), allowing each to run on hardware optimized for its workload.

Q: Is Dynamo open source? A: Yes. It is released under the Apache 2.0 license.
