SGLang — Fast LLM Serving with RadixAttention
SGLang is a high-performance serving framework for LLMs and multimodal models. 25.3K+ GitHub stars. RadixAttention prefix caching, speculative decoding, structured outputs. NVIDIA/AMD/Intel/TPU. Apache 2.0.
What it is
SGLang is a serving framework optimized for running large language models and multimodal models at high throughput. Its signature feature is RadixAttention, a prefix caching mechanism that reuses KV cache across requests sharing common prompt prefixes.
The framework targets ML engineers and platform teams who need to serve LLMs in production with low latency. It supports NVIDIA, AMD, and Intel GPUs as well as Google TPUs, making it hardware-flexible.
How it saves time or tokens
RadixAttention caches KV states for shared prompt prefixes across requests. When multiple users send requests with the same system prompt, SGLang avoids recomputing attention over the shared prefix. Speculative decoding further reduces latency by drafting several tokens per step and verifying them in parallel. Together, these optimizations translate into higher throughput per GPU dollar.
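As a back-of-the-envelope illustration (the numbers are hypothetical, not SGLang benchmarks), consider 100 requests that share a 500-token system prompt:

# Hypothetical estimate of prefill work saved by prefix caching
system_prompt_tokens = 500   # shared prefix length
user_tokens = 50             # unique tokens per request
requests = 100

without_cache = requests * (system_prompt_tokens + user_tokens)
with_cache = system_prompt_tokens + requests * user_tokens  # prefix prefilled once

print(without_cache)  # 55000
print(with_cache)     # 5500
print(f'{1 - with_cache / without_cache:.0%} less prefill work')  # 90%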
How to use
- Install SGLang via pip with your target hardware backend.
- Launch the server pointing to a model checkpoint.
- Send requests via the OpenAI-compatible API endpoint.
# Install
pip install "sglang[all]"
# Launch server with a model
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3-8B-Instruct \
--port 30000
# Query the server
curl http://localhost:30000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model": "meta-llama/Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'
Example
import sglang as sgl

@sgl.function
def qa_pipeline(s, question):
    # Compose the prompt; the shared system message becomes a cacheable prefix
    s += sgl.system('You are a helpful assistant.')
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen('answer', max_tokens=256))

# Start an in-process runtime and make it the default backend
runtime = sgl.Runtime(model_path='meta-llama/Meta-Llama-3-8B-Instruct')
sgl.set_default_backend(runtime)

state = qa_pipeline.run(question='What is RadixAttention?')
print(state['answer'])
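When many questions share the same system prompt, the frontend's run_batch executes them together, which is exactly the access pattern RadixAttention rewards (a sketch reusing the runtime set up above):

# All prompts share the system prefix, so its KV states are computed once
states = qa_pipeline.run_batch([
    {'question': 'What is RadixAttention?'},
    {'question': 'What is speculative decoding?'},
])
for s in states:
    print(s['answer'])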
Related on TokRepo
- Local LLM tools — Compare SGLang with vLLM for self-hosted inference
- AI gateway solutions — Route requests across multiple serving backends
Common pitfalls
- RadixAttention benefits diminish when requests have no shared prefixes; batch requests with common system prompts to maximize cache hits.
- GPU memory must accommodate both the model weights and the KV cache; underprovisioning causes OOM errors on long-context workloads (see the launch sketch after this list).
- Structured output mode (JSON schema enforcement) adds decoding overhead; disable it for free-form generation tasks.
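For the memory pitfall above, the size of the KV cache pool relative to total GPU memory can be tuned at launch. A hedged sketch: --mem-fraction-static controls the fraction of GPU memory reserved for model weights plus the KV cache pool, and 0.8 is an illustrative value to try when hitting OOM, not a universal recommendation.

# Shrink the static memory pool if long-context requests trigger OOM
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --port 30000 \
  --mem-fraction-static 0.8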
Frequently Asked Questions
How does SGLang compare to vLLM?
Both are high-performance LLM serving frameworks. SGLang differentiates with RadixAttention for prefix caching and a frontend language for composing LLM programs. vLLM uses PagedAttention for memory efficiency. In benchmarks, SGLang often shows higher throughput for workloads with shared prefixes.
Which hardware does SGLang support?
SGLang supports NVIDIA GPUs (CUDA), AMD GPUs (ROCm), Intel GPUs and CPUs, and Google TPUs. The framework auto-detects available hardware and selects the appropriate backend during installation.
Is SGLang compatible with the OpenAI API?
Yes. SGLang exposes an OpenAI-compatible HTTP endpoint at /v1/chat/completions and /v1/completions. This means you can swap SGLang into any application that currently calls the OpenAI API by changing the base URL.
How does RadixAttention work?
RadixAttention organizes KV cache entries in a radix tree indexed by token prefixes. When a new request shares a prefix with a cached request, SGLang reuses the cached KV states instead of recomputing them. This is especially effective for chat applications where all requests share the same system prompt.
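To make the lookup concrete, here is a toy prefix-matching sketch over token IDs. It is an illustration of the idea only, using a plain trie rather than a compressed radix tree, and it ignores the KV tensors, node splitting, and eviction that SGLang's real implementation manages:

# Toy prefix lookup over token IDs (illustrative; not SGLang internals)
class TrieNode:
    def __init__(self):
        self.children = {}       # token id -> TrieNode
        self.has_kv = False      # stands in for a cached KV entry

def insert(root, tokens):
    node = root
    for t in tokens:
        node = node.children.setdefault(t, TrieNode())
        node.has_kv = True

def longest_cached_prefix(root, tokens):
    '''Count leading tokens whose KV states are already cached.'''
    node, matched = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node = node.children[t]
        matched += 1
    return matched

root = TrieNode()
insert(root, [1, 2, 3, 4])                         # first request
print(longest_cached_prefix(root, [1, 2, 3, 9]))   # 3: reuse three tokens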
Does SGLang support multimodal models?
Yes. SGLang supports multimodal models that process both text and images. You load the multimodal checkpoint the same way as a text-only model, and the framework handles image tokenization and attention computation internally.
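Launching a vision-language model looks the same as the text-only case; only the checkpoint changes. A sketch assuming a Qwen2-VL checkpoint (the model path is illustrative; check the supported-models list for your SGLang version):

# Same launch command, multimodal checkpoint (illustrative model path)
python -m sglang.launch_server \
  --model-path Qwen/Qwen2-VL-7B-Instruct \
  --port 30000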
Citations (3)
- SGLang GitHub — SGLang provides RadixAttention for automatic KV cache reuse across requests
- SGLang Paper (arXiv) — RadixAttention uses a radix tree to manage prefix-based KV cache sharing
- Speculative Decoding Paper (arXiv) — Speculative decoding predicts multiple tokens to reduce inference latency
Source & Thanks
Created by SGLang Project. Licensed under Apache 2.0. sgl-project/sglang — 25,300+ GitHub stars