Apr 14, 2026 · 3 min read

Text Generation Inference (TGI) — Hugging Face Production LLM Server

TGI is Hugging Face's production-grade LLM inference server. It powers HF Inference Endpoints with continuous batching, tensor parallelism, quantization, and OpenAI-compatible APIs — handling thousands of requests per second.

Script Depot · Community

Introduction

Text Generation Inference (TGI) is the LLM serving stack Hugging Face uses to power its Inference Endpoints product. Written in Rust + Python + CUDA, it focuses on what matters in production: high throughput, low latency, multi-tenant fairness, and cost efficiency.

With over 11,000 GitHub stars, TGI is used by Hugging Face, IBM watsonx, and many enterprises as the foundation for their LLM API. It supports continuous batching, tensor parallelism, multiple quantization formats, and PEFT adapters loaded on the fly.

What TGI Does

TGI exposes an OpenAI-compatible HTTP API that delegates to a Rust router (handles batching, auth, observability) which forwards token-generation requests to a Python+CUDA "shard" process. Multi-GPU tensor parallelism, paged attention, FlashAttention, and custom CUDA kernels deliver high throughput.
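Besides the OpenAI-compatible routes, TGI exposes a native /generate endpoint that takes a JSON body with an inputs string and a parameters object. A minimal sketch of that request shape (the helper name build_generate_payload is illustrative, not part of TGI):

```python
import json

# Build the JSON body for TGI's native /generate endpoint.
# build_generate_payload is an illustrative helper, not a TGI API.
def build_generate_payload(prompt: str, max_new_tokens: int = 64,
                           temperature: float = 0.7) -> dict:
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }

payload = build_generate_payload("What is tensor parallelism?")
print(json.dumps(payload))
# To send it against a local server (assumed at localhost:8080):
#   requests.post("http://localhost:8080/generate", json=payload)
```

The router validates and batches these requests before handing them to the shards over gRPC.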

Architecture Overview

Clients (OpenAI SDK, curl, custom)
      |
[TGI Router (Rust)]
   request batching, auth,
   prom metrics, tracing
      |
   gRPC to shards
      |
[TGI Shards (Python + CUDA)]
   tensor parallelism
   paged attention + continuous batching
   FlashAttention v2
   quantization (bitsandbytes, gptq, awq, eetq, fp8)
      |
[Model loaded from HF Hub or local path]
   safetensors / GGUF
   PEFT / LoRA adapters at runtime
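The continuous-batching behavior in the shard layer can be illustrated with a toy scheduler: rather than waiting for a whole batch to drain, a freed slot is refilled from the queue on every decode step. All names below are illustrative; real TGI schedules at the token level with a KV-cache token budget:

```python
from collections import deque

def simulate_continuous_batching(request_lengths, batch_size):
    """Toy model: each request needs `length` decode steps; a slot is
    refilled from the queue the moment its request finishes."""
    queue = deque(request_lengths)
    slots = []           # remaining decode steps per active request
    steps = 0
    while queue or slots:
        # refill free slots immediately (the "continuous" part)
        while queue and len(slots) < batch_size:
            slots.append(queue.popleft())
        # one decode step advances every active request
        slots = [s - 1 for s in slots]
        slots = [s for s in slots if s > 0]
        steps += 1
    return steps

# Mixed-length requests in two slots: short requests free their slot
# early, so later requests start sooner than with static batching.
print(simulate_continuous_batching([4, 2, 4, 2], batch_size=2))  # → 6
# Static batches of 2 would cost max(4,2) + max(4,2) = 8 steps.
```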

Self-Hosting & Configuration

# Multi-GPU + quantization for a 70B model on 4xA100
docker run --gpus all --shm-size=16g -p 8080:80 \
  -v $PWD/data:/data \
  -e HF_TOKEN=hf_xxx \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-70B-Instruct \
  --num-shard 4 \
  --quantize bitsandbytes-nf4 \
  --max-batch-total-tokens 32768 \
  --max-input-length 8192 \
  --max-total-tokens 12288
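The three token limits above interact: --max-total-tokens caps a single sequence (prompt plus output), while --max-batch-total-tokens is the token budget for the whole batch. A rough sketch of the worst-case concurrency this buys (pure arithmetic, not a TGI API):

```python
# Flag values from the docker run example above.
max_batch_total_tokens = 32768   # token budget across the whole batch
max_total_tokens = 12288         # prompt + generated tokens per sequence

# Worst case: every sequence uses its full per-sequence allowance.
worst_case_concurrency = max_batch_total_tokens // max_total_tokens
print(worst_case_concurrency)  # → 2

# Shorter requests pack far better, e.g. ~1k-token chat turns:
typical_total = 1024
print(max_batch_total_tokens // typical_total)  # → 32
```

In practice continuous batching means real concurrency floats between these extremes as requests of different lengths come and go.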

# Multi-LoRA: declare adapters at launch (-e LORA_ADAPTERS=... or
# --lora-adapters), then select one per request with adapter_id
curl -X POST http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"Summarize my ticket","parameters":{"adapter_id":"customer-1"}}'
# Use any OpenAI client unchanged
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="tgi")
for chunk in client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "Stream me a haiku."}],
    stream=True,
):
    print(chunk.choices[0].delta.content or "", end="")

Key Features

  • OpenAI-compatible API — drop-in for any OpenAI client/library
  • Continuous batching — fill GPU with new requests as old ones finish
  • Tensor parallelism — shard a model across multiple GPUs
  • Quantization — bitsandbytes (NF4/INT8), GPTQ, AWQ, EETQ, FP8
  • PEFT/LoRA at runtime — load adapters without reloading the base model
  • Prom metrics + OpenTelemetry — production-grade observability
  • Speculative decoding — n-gram or draft model speedups
  • Wide model support — Llama, Mistral, Qwen, Phi, Gemma, MPT, BLOOM, Falcon, ...
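As a back-of-the-envelope check on why NF4 quantization makes the 70B example above fit on 4xA100s, weight memory is roughly parameter count times bytes per parameter (KV cache and activations come on top; all figures approximate):

```python
params = 70e9  # approximate parameter count for a 70B model

# Approximate storage cost per parameter for common formats.
bytes_per_param = {"fp16": 2.0, "int8": 1.0, "nf4": 0.5}
for fmt, b in bytes_per_param.items():
    gb = params * b / 1e9
    print(f"{fmt}: ~{gb:.0f} GB of weights")
# fp16: ~140 GB of weights (multi-GPU even before KV cache)
# int8:  ~70 GB
# nf4:   ~35 GB (fits comfortably across 4 shards)
```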

Comparison with Similar Tools

Feature              | TGI                | vLLM                   | TensorRT-LLM         | OpenLLM           | llama.cpp server
OpenAI API           | Yes                | Yes                    | Via Triton           | Yes               | Yes
Continuous batching  | Yes                | Yes (PagedAttention)   | Yes                  | Yes (via vLLM)    | Limited
Tensor parallel      | Yes                | Yes                    | Yes                  | Via vLLM          | No
LoRA adapters        | Yes (multi-LoRA)   | Yes                    | Yes                  | Yes               | No
Quantization         | Many               | Many                   | INT4/FP8 (custom)    | Many              | GGUF Q-formats
Best for             | HF-anchored stacks | Throughput / community | NVIDIA-only max perf | BentoML pipelines | CPU/edge serving

FAQ

Q: TGI vs vLLM? A: vLLM has more aggressive batching (PagedAttention) and broader community model support; TGI integrates tighter with HF (auth, license gating, Inference Endpoints) and has stable production observability. Benchmark both on your model.

Q: Is TGI free to use? A: Yes — Apache-2.0 (since v3.0). Earlier 1.x versions used a more restrictive license. Always check the version's LICENSE for clarity.

Q: Does TGI support multimodal models? A: Yes — it serves many vision-language models (LLaVA, Idefics, PaliGemma), accepting text plus image inputs via the OpenAI vision API message format, with streaming output.
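In the OpenAI vision format, a message's content is a list of typed parts rather than a plain string. A sketch of that payload (the image URL is a placeholder):

```python
# Build an OpenAI-style vision message: content is a list of typed parts.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url",
         "image_url": {"url": "https://example.com/cat.png"}},  # placeholder
    ],
}
print(len(message["content"]))  # → 2
# Send it with the same OpenAI client shown earlier:
#   client.chat.completions.create(model="tgi", messages=[message])
```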

Q: Can I run TGI without Docker? A: Yes — install from source (Rust + Python + CUDA). Docker is just the convenient packaging path. Most production deployments use the official image.
