Scripts · Mar 31, 2026 · 2 min read

SGLang — Fast LLM Serving with RadixAttention

SGLang is a high-performance serving framework for LLMs and multimodal models. 25.3K+ GitHub stars. RadixAttention prefix caching, speculative decoding, structured outputs. NVIDIA/AMD/Intel/TPU. Apach

TL;DR
SGLang serves LLMs and multimodal models with RadixAttention prefix caching, speculative decoding, and structured output support.
§01

What it is

SGLang is a serving framework optimized for running large language models and multimodal models at high throughput. Its signature feature is RadixAttention, a prefix caching mechanism that reuses KV cache across requests sharing common prompt prefixes.

The framework targets ML engineers and platform teams who need to serve LLMs in production with low latency. It runs on NVIDIA, AMD, and Intel hardware as well as Google TPUs, so a deployment is not tied to a single vendor.

§02

How it saves time or tokens

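RadixAttention caches KV states for prompt prefixes shared across requests. When multiple users send requests with the same system prompt, SGLang computes attention for that prefix once and reuses the result, so only each request's unique suffix goes through prefill. Speculative decoding further reduces latency: a cheap draft step proposes several tokens ahead, and the target model verifies them in a single forward pass instead of generating them one at a time. Together these optimizations raise throughput per GPU dollar.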
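
To make the saving concrete, here is a minimal back-of-the-envelope sketch in Python. It only counts prompt tokens that need prefill, with and without prefix reuse; it does not call SGLang. The function name `prefill_tokens` and the token counts are hypothetical, chosen for illustration.

```python
def prefill_tokens(prefix_len: int, suffix_lens: list[int], cached: bool) -> int:
    """Count prompt tokens that must go through prefill for a set of requests.

    Each request is modeled as the shared prefix followed by its own suffix.
    """
    if not suffix_lens:
        return 0
    if cached:
        # The shared prefix is computed once, then reused by later requests.
        return prefix_len + sum(suffix_lens)
    # Without prefix reuse, every request recomputes its whole prompt.
    return sum(prefix_len + n for n in suffix_lens)


# Hypothetical workload: a 500-token system prompt and four user messages.
suffixes = [40, 25, 60, 35]
without_cache = prefill_tokens(500, suffixes, cached=False)
with_cache = prefill_tokens(500, suffixes, cached=True)
print(without_cache, with_cache)
```

Under these assumed numbers the prefill work drops from 2160 tokens to 660. The longer the shared prefix relative to the per-request suffix, the larger the gap.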

§03

How to use

  1. Install SGLang via pip with your target hardware backend.
  2. Launch the server pointing to a model checkpoint.
  3. Send requests via the OpenAI-compatible API endpoint.
# Install (quote the extras so shells like zsh do not expand the brackets)
pip install "sglang[all]"

# Launch server with a model
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --port 30000

# Query the server
curl http://localhost:30000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'
§04

Example

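import sglang as sgl

# An SGLang program: each `s += ...` appends a chat turn to the prompt state.
@sgl.function
def qa_pipeline(s, question):
    s += sgl.system('You are a helpful assistant.')
    s += sgl.user(question)
    # The generated text is stored under the key 'answer'.
    s += sgl.assistant(sgl.gen('answer', max_tokens=256))

# Start an in-process runtime and make it the default backend.
runtime = sgl.Runtime(model_path='meta-llama/Meta-Llama-3-8B-Instruct')
sgl.set_default_backend(runtime)

state = qa_pipeline.run(question='What is RadixAttention?')
print(state['answer'])

# Release the GPU and stop the runtime's worker processes.
runtime.shutdown()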
§05


Common pitfalls

  • RadixAttention benefits diminish when requests have no shared prefixes; batch requests with common system prompts to maximize cache hits.
  • GPU memory must accommodate both the model weights and the KV cache; underprovisioning causes OOM errors on long-context workloads.
  • Structured output mode (JSON schema enforcement) adds decoding overhead; disable it for free-form generation tasks.
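
For the memory pitfall above, a rough sizing sketch can help before you pick a GPU. It uses the standard per-token KV cache estimate (keys plus values, per layer, per KV head). The layer and head counts are the published Llama 3 8B configuration; the 24 GiB GPU, the 16 GiB weight footprint, and the 10% reserve are assumptions for illustration, and the helper names are hypothetical. Real deployments also spend memory on activations and runtime buffers, so treat the result as an upper bound.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(gpu_bytes: int, weight_bytes: int,
                      per_token_bytes: int, reserve_fraction: float = 0.1) -> int:
    """Tokens of KV cache that fit after weights and a safety reserve."""
    usable = gpu_bytes * (1 - reserve_fraction) - weight_bytes
    return max(0, int(usable // per_token_bytes))


GIB = 1024 ** 3

# Llama 3 8B: 32 layers, 8 KV heads (grouped-query attention), head_dim 128, fp16.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

# Assumed hardware: 24 GiB GPU, roughly 16 GiB of fp16 weights.
capacity = max_cached_tokens(24 * GIB, 16 * GIB, per_token)
print(per_token, capacity)
```

With these assumptions each token costs 128 KiB of cache, which leaves room for roughly 45,000 tokens shared across all concurrent requests. A handful of long-context requests can exhaust that quickly.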

Frequently Asked Questions

How does SGLang compare to vLLM?

Both are high-performance LLM serving frameworks. SGLang differentiates with RadixAttention for prefix caching and a frontend language for composing LLM programs. vLLM uses PagedAttention for memory efficiency. In benchmarks, SGLang often shows higher throughput for workloads with shared prefixes.

What hardware does SGLang support?

SGLang supports NVIDIA GPUs (CUDA), AMD GPUs (ROCm), Intel GPUs and CPUs, and Google TPUs. Installation steps differ by backend, so follow the install instructions for your target hardware.

Does SGLang provide an OpenAI-compatible API?

Yes. SGLang exposes OpenAI-compatible HTTP endpoints at /v1/chat/completions and /v1/completions. This means you can swap SGLang into any application that currently calls the OpenAI API by changing the base URL.
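
As a sketch of what "changing the base URL" looks like from Python, the snippet below builds a chat completion request with only the standard library. The helper names `build_chat_request` and `send` are hypothetical, the port assumes a server launched locally on 30000, and the model string is assumed to match whatever `--model-path` the server was started with.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Assumed local deployment; only the base URL differs from a hosted API.
MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
request = build_chat_request("http://localhost:30000", MODEL, "Hello")
print(request.full_url)
# Call send(request) once the server is running.
```

Because the route and payload follow the OpenAI chat format, existing client code usually needs only a different base URL and model name.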

What is RadixAttention and why does it matter?

RadixAttention organizes KV cache entries in a radix tree indexed by token prefixes. When a new request shares a prefix with a cached request, SGLang reuses the cached KV states instead of recomputing them. This is especially effective for chat applications where all requests share the same system prompt.
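
The lookup idea can be illustrated with a toy prefix tree in Python. This is a simplified stand-in, not SGLang's implementation: it uses an uncompressed trie with one node per token (a radix tree merges single-child chains into one edge), stores no KV tensors, and never evicts. The class names and token ids are hypothetical.

```python
class PrefixNode:
    """One node per token; a real cache would also reference KV blocks here."""

    def __init__(self) -> None:
        self.children: dict[int, "PrefixNode"] = {}


class ToyPrefixCache:
    def __init__(self) -> None:
        self.root = PrefixNode()

    def match(self, tokens: list[int]) -> int:
        """Return the length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for token in tokens:
            child = node.children.get(token)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: list[int]) -> None:
        """Record `tokens` so later requests can reuse them as a prefix."""
        node = self.root
        for token in tokens:
            node = node.children.setdefault(token, PrefixNode())

    def admit(self, tokens: list[int]) -> tuple[int, int]:
        """Return (tokens reused from cache, tokens that still need prefill)."""
        reused = self.match(tokens)
        self.insert(tokens)
        return reused, len(tokens) - reused


cache = ToyPrefixCache()
system_prompt = [1, 2, 3, 4]              # hypothetical token ids

first = cache.admit(system_prompt + [10, 11])   # cold cache
second = cache.admit(system_prompt + [20])      # shares the system prompt
third = cache.admit(system_prompt + [10, 12])   # shares system prompt + one token
print(first, second, third)
```

The first request computes all six tokens, the second reuses the four system prompt tokens, and the third reuses five because it also overlaps with the first request's opening token. Matching on the longest shared prefix, rather than on whole prompts, is what lets multi-turn chats and few-shot templates benefit as well.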

Can SGLang serve multimodal models?

Yes. SGLang supports multimodal models that process both text and images. You load the multimodal checkpoint the same way as a text-only model, and the framework handles image tokenization and attention computation internally.


Source & Thanks

Created by SGLang Project. Licensed under Apache 2.0. sgl-project/sglang — 25,300+ GitHub stars
