Introduction
Text Generation Inference (TGI) is the LLM serving stack Hugging Face uses to power its Inference Endpoints product. Written in Rust + Python + CUDA, it focuses on what matters in production: high throughput, low latency, multi-tenant fairness, and cost efficiency.
With over 11,000 GitHub stars, TGI is used by Hugging Face, IBM watsonx, and many enterprises as the foundation for their LLM API. It supports continuous batching, tensor parallelism, multiple quantization formats, and PEFT adapters loaded on the fly.
What TGI Does
TGI exposes an OpenAI-compatible HTTP API backed by a Rust router that handles batching, auth, and observability, and forwards token-generation requests over gRPC to Python+CUDA "shard" processes. Multi-GPU tensor parallelism, paged attention, FlashAttention, and custom CUDA kernels deliver high throughput.
Architecture Overview
Clients (OpenAI SDK, curl, custom)
        |
        v
[TGI Router (Rust)]
  request batching, auth,
  Prometheus metrics, tracing
        |
        | gRPC to shards
        v
[TGI Shards (Python + CUDA)]
  tensor parallelism
  paged attention + continuous batching
  FlashAttention v2
  quantization (bitsandbytes, GPTQ, AWQ, EETQ, FP8)
        |
        v
[Model loaded from HF Hub or local path]
  safetensors / GGUF
  PEFT / LoRA adapters at runtime
Self-Hosting & Configuration
# Multi-GPU + quantization for a 70B model on 4xA100
docker run --gpus all --shm-size=16g -p 8080:80 \
-v $PWD/data:/data \
-e HF_TOKEN=hf_xxx \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-3.1-70B-Instruct \
--num-shard 4 \
--quantize bitsandbytes-nf4 \
--max-batch-total-tokens 32768 \
--max-input-length 8192 \
--max-total-tokens 12288
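Once the container is up, it can be smoke-tested against TGI's native /health and /generate endpoints. The sketch below uses only the Python standard library and assumes the host/port from the docker command above; nothing is sent over the network until `generate` is actually called.

```python
# Smoke test for a local TGI server via its native HTTP API.
import json
import urllib.request

TGI_URL = "http://localhost:8080"

def build_generate_payload(prompt: str, max_new_tokens: int = 64) -> dict:
    """Request body for TGI's /generate endpoint."""
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }

def generate(prompt: str) -> str:
    """POST a single prompt and return the generated text."""
    req = urllib.request.Request(
        f"{TGI_URL}/generate",
        data=json.dumps(build_generate_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["generated_text"]

# urllib.request.urlopen(f"{TGI_URL}/health")      # liveness probe (needs a running server)
# print(generate("What is tensor parallelism?"))   # one completion
```

GET /info on the same host returns the loaded model id and the serving limits configured above, which is handy for verifying the deployment.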
# Hot-load LoRA adapters via the API
curl -X POST http://localhost:8080/info/loras \
-d '{"adapters":[{"name":"customer-1","path":"/data/adapters/cust1"}]}'
# Use any OpenAI client unchanged
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="tgi")
for chunk in client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "Stream me a haiku."}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="")
Key Features
- OpenAI-compatible API — drop-in for any OpenAI client/library
- Continuous batching — fill GPU with new requests as old ones finish
- Tensor parallelism — shard a model across multiple GPUs
- Quantization — bitsandbytes (NF4/INT8), GPTQ, AWQ, EETQ, FP8
- PEFT/LoRA at runtime — load adapters without reloading the base model
- Prometheus metrics + OpenTelemetry — production-grade observability
- Speculative decoding — n-gram or draft model speedups
- Wide model support — Llama, Mistral, Qwen, Phi, Gemma, MPT, BLOOM, Falcon, ...
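The runtime-adapter feature above can be exercised per request: TGI's generation parameters accept an adapter_id naming a loaded adapter, so one base model can serve many fine-tunes. A minimal sketch, assuming an adapter named "customer-1" (as in the earlier curl example) has already been made available to the server:

```python
# Per-request LoRA routing via TGI's native /generate endpoint.
import json
import urllib.request

def build_lora_payload(prompt: str, adapter_id: str, max_new_tokens: int = 64) -> dict:
    """Request body that selects a specific loaded LoRA adapter."""
    return {
        "inputs": prompt,
        "parameters": {
            "adapter_id": adapter_id,  # must match a loaded adapter's name
            "max_new_tokens": max_new_tokens,
        },
    }

def generate_with_adapter(prompt: str, adapter_id: str,
                          url: str = "http://localhost:8080") -> str:
    """Run one completion through the chosen adapter."""
    req = urllib.request.Request(
        f"{url}/generate",
        data=json.dumps(build_lora_payload(prompt, adapter_id)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["generated_text"]

# generate_with_adapter("Summarize my last ticket.", "customer-1")  # needs a running server
```

Requests that omit adapter_id fall through to the base model, so tenants with and without adapters can share one deployment.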
Comparison with Similar Tools
| Feature | TGI | vLLM | TensorRT-LLM | OpenLLM | llama.cpp server |
|---|---|---|---|---|---|
| OpenAI API | Yes | Yes | Via Triton | Yes | Yes |
| Continuous batching | Yes | Yes (PagedAttention) | Yes | Yes (via vLLM) | Limited |
| Tensor parallel | Yes | Yes | Yes | Via vLLM | No |
| LoRA adapters | Yes (hot) | Yes | Yes | Yes | No |
| Quantization | Many | Many | INT4/FP8 (custom) | Many | GGUF Q-formats |
| Best For | HF-anchored stacks | Throughput / community | NVIDIA-only max perf | BentoML pipelines | CPU/edge serving |
FAQ
Q: TGI vs vLLM? A: vLLM pioneered PagedAttention and has broader community model coverage; TGI integrates more tightly with the HF ecosystem (auth, license gating, Inference Endpoints) and ships production-grade observability. Benchmark both on your model.
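Since both servers speak the OpenAI protocol, a rough single-stream comparison is easy to script. The sketch below approximates decode throughput by counting streamed chunks (roughly one token each on most servers); it is an illustration, not a rigorous benchmark:

```python
# Rough single-stream throughput probe for any OpenAI-compatible endpoint.
import time

def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput helper, kept separate so the arithmetic is easy to check."""
    return n_tokens / elapsed_s if elapsed_s > 0 else 0.0

def benchmark(client, prompt: str, model: str = "tgi") -> float:
    """Stream one completion and return approximate decode tokens/sec."""
    start = time.perf_counter()
    n_chunks = 0
    for chunk in client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    ):
        if chunk.choices[0].delta.content:
            n_chunks += 1  # one chunk is roughly one token on most servers
    return tokens_per_second(n_chunks, time.perf_counter() - start)

# from openai import OpenAI
# client = OpenAI(base_url="http://localhost:8080/v1", api_key="tgi")
# print(benchmark(client, "Explain paged attention in two sentences."))
```

Point the same script at a vLLM endpoint by changing only base_url, and run it at several concurrency levels before drawing conclusions.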
Q: Is TGI free to use? A: Yes — Apache-2.0 (since v3.0). Earlier 1.x versions used a more restrictive license. Always check the version's LICENSE for clarity.
Q: Does TGI support multimodal models? A: Yes, for many vision-language models (LLaVA, Idefics, PaliGemma). Text and image inputs are accepted in the OpenAI vision API format, with streaming text output.
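For such models, an image travels in the standard OpenAI chat format as a base64 data URL inside the message content list. A minimal sketch (the file name and helper are illustrative):

```python
# Build an OpenAI-vision-format message carrying one inline image.
import base64

def build_vision_message(prompt: str, image_bytes: bytes,
                         mime: str = "image/png") -> dict:
    """Return a user message with a text part and a base64 image part."""
    data_url = f"data:{mime};base64,{base64.b64encode(image_bytes).decode()}"
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }

# with open("chart.png", "rb") as f:
#     msg = build_vision_message("What does this chart show?", f.read())
# client.chat.completions.create(model="tgi", messages=[msg])  # needs a VLM deployment
```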
Q: Can I run TGI without Docker? A: Yes — install from source (Rust + Python + CUDA). Docker is just the convenient packaging path. Most production deployments use the official image.
Sources
- GitHub: https://github.com/huggingface/text-generation-inference
- Docs: https://huggingface.co/docs/text-generation-inference
- Company: Hugging Face
- License: Apache-2.0