Text Generation Inference (TGI) — Hugging Face Production LLM Server
TGI is Hugging Face's production-grade LLM inference server. It powers HF Inference Endpoints with continuous batching, tensor parallelism, quantization, and OpenAI-compatible APIs — handling thousands of requests per second.
Safe staging for this asset
This asset is staged first. The copied prompt tells the agent to inspect the staged files and ask before activating scripts, MCP config, or global config.
npx -y tokrepo@latest install e08ad222-37db-11f1-9bc6-00163e2b0d79 --target codex
Stages files first; activation requires review of the staged README and plan.
What it is
Text Generation Inference (TGI) is Hugging Face's production-grade LLM inference server. It powers Hugging Face Inference Endpoints and provides continuous batching, tensor parallelism, quantization (GPTQ, AWQ, bitsandbytes), and an OpenAI-compatible API out of the box. TGI handles thousands of concurrent requests efficiently on GPU hardware.
TGI is designed for ML engineers and platform teams who need to self-host LLMs with production-level throughput and latency. It supports most popular open-weight models from Llama, Mistral, Falcon, and other families.
How it saves time or tokens
TGI's continuous batching dynamically groups incoming requests to maximize GPU utilization. Instead of processing one request at a time or waiting for a fixed batch to fill, TGI starts generating tokens immediately and adds new requests to the running batch. This dramatically reduces latency under load compared to naive serving approaches.
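The difference between the two scheduling styles can be illustrated with a toy simulation. This is a sketch under simplifying assumptions, not TGI's actual scheduler: every request is assumed to arrive at step 0, each decode step produces one token per running request, and prefill cost and memory limits are ignored. The function names `static_batching` and `continuous_batching` are hypothetical.

```python
from collections import deque


def static_batching(lengths, batch_size):
    """Finish step per request when fixed batches run to completion.

    lengths[i] is the number of tokens request i needs to generate.
    """
    finish = [0] * len(lengths)
    clock = 0
    for start in range(0, len(lengths), batch_size):
        batch = range(start, min(start + batch_size, len(lengths)))
        # The whole batch is held until its longest request is done.
        clock += max(lengths[i] for i in batch)
        for i in batch:
            finish[i] = clock
    return finish


def continuous_batching(lengths, batch_size):
    """Finish step per request when free slots are refilled every step."""
    finish = [0] * len(lengths)
    waiting = deque(range(len(lengths)))
    running = {}  # request index -> tokens still to generate
    clock = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the next step.
        while waiting and len(running) < batch_size:
            i = waiting.popleft()
            running[i] = lengths[i]
        clock += 1  # one decode step for every running request
        for i in list(running):
            running[i] -= 1
            if running[i] <= 0:
                finish[i] = clock
                del running[i]
    return finish


lengths = [2, 8, 2, 2]
print("static:    ", static_batching(lengths, batch_size=2))
print("continuous:", continuous_batching(lengths, batch_size=2))
```

In this toy setup the short requests no longer wait behind the 8-token request, which is the effect the paragraph above describes.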
Because the API is OpenAI-compatible, you can switch an application from OpenAI to a self-hosted model by changing the base URL in its client configuration. Beyond that endpoint configuration, no code changes are needed.
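A minimal sketch of what "changing the base URL" can look like, using only the Python standard library. The helper `build_chat_request` and the placeholder API key are hypothetical; the local URL assumes TGI is published on port 8080 as in this page's example. The request is only constructed here, not sent.

```python
import json
import urllib.request

# The only value that differs between the hosted and self-hosted setups.
OPENAI_BASE_URL = "https://api.openai.com/v1"
TGI_BASE_URL = "http://localhost:8080/v1"  # assumes the port mapping used on this page


def build_chat_request(base_url, model, user_text, api_key="unused"):
    """Build (but do not send) a chat completions request."""
    url = base_url.rstrip("/") + "/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_chat_request(TGI_BASE_URL, "meta-llama/Llama-3.1-8B-Instruct", "Hello")
print(req.full_url)

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Applications that use the OpenAI Python SDK would make the equivalent change through the client's base URL setting.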
How to use
- Pull the TGI Docker image and start serving a model, specifying the model name, GPU allocation, and port mapping.
- Send requests to the OpenAI-compatible endpoint at http://localhost:8080/v1/chat/completions.
- For multi-GPU setups, set --num-shard to the number of GPUs for tensor parallelism.
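The steps above can be sketched as a small helper that assembles the launch command and adds --num-shard only for multi-GPU setups. The function `tgi_docker_command` is hypothetical, and the volume mount and token from this page's example are left out for brevity; the image name, `--model-id`, and `--num-shard` are taken from this page.

```python
import shlex


def tgi_docker_command(model_id, num_shard=1, host_port=8080,
                       image="ghcr.io/huggingface/text-generation-inference:latest"):
    """Return the docker argument list for serving model_id with TGI."""
    if num_shard < 1:
        raise ValueError("num_shard must be at least 1")
    cmd = [
        "docker", "run", "--gpus", "all",
        "-p", f"{host_port}:80",  # host port -> container port 80
        image,
        "--model-id", model_id,
    ]
    if num_shard > 1:
        # Tensor parallelism: one shard per GPU.
        cmd += ["--num-shard", str(num_shard)]
    return cmd


# Print a copy-pasteable command for a two-GPU machine.
print(shlex.join(tgi_docker_command("meta-llama/Llama-3.1-8B-Instruct", num_shard=2)))
```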
Example
# Serve a model with one Docker command
docker run --gpus all -p 8080:80 \
  -v "$PWD/data:/data" \
  -e HF_TOKEN=hf_xxx \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct

# Query with curl (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
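Once the curl call returns, the reply text has to be pulled out of the JSON body. The sketch below assumes the response follows the OpenAI chat completions shape (`choices[0].message.content`); the `sample` string is hand-written for illustration and is not captured TGI output, and `extract_reply` is a hypothetical helper.

```python
import json


def extract_reply(response_body):
    """Return the assistant text from a chat completions JSON body."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]


# Hand-written sample in the OpenAI response shape (not real server output).
sample = json.dumps({
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hi there!"},
            "finish_reason": "stop",
        }
    ]
})

print(extract_reply(sample))
```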
Related on TokRepo
- Local LLM tools -- compare self-hosted LLM inference options
- AI tools for coding -- development tools using local models
Common pitfalls
- TGI primarily targets NVIDIA GPUs with CUDA support; AMD ROCm support exists but is less tested. CPU-only inference is not supported for production workloads.
- Model downloads can be large (8B parameter models are 15-30GB); ensure sufficient disk space and a fast network connection for the initial pull.
- Quantized models (GPTQ, AWQ) reduce memory requirements but may degrade output quality for complex reasoning tasks; benchmark with your specific use case.
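A rough back-of-the-envelope estimate helps when sizing disk and GPU memory. The sketch below counts weights only, assuming a fixed number of bits per weight; it ignores the KV cache, activations, and the extra metadata that quantization formats store, so real usage will be higher. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# An 8B-parameter model at 16-bit, 8-bit, and 4-bit weights.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(8e9, bits):.1f} GB")
```

Going from 16-bit to 8-bit or 4-bit weights is where a 2x to 4x reduction comes from; whether the quality trade-off is acceptable still has to be benchmarked on your own workload.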
Frequently Asked Questions
Which models does TGI support?
TGI supports most popular open-weight LLMs including Llama, Mistral, Falcon, StarCoder, Gemma, and others hosted on Hugging Face Hub. The server auto-detects model architecture and applies appropriate optimizations.
What is continuous batching?
Continuous batching dynamically adds new requests to a running batch without waiting for the current batch to complete. This maximizes GPU utilization and reduces average latency compared to static batching, especially under variable load.
Can TGI run a model across multiple GPUs?
Yes. TGI supports tensor parallelism across multiple GPUs using the --num-shard flag. A model too large for one GPU can be split across two or more GPUs on the same machine.
Is TGI compatible with the OpenAI API?
Yes. TGI exposes an OpenAI-compatible /v1/chat/completions endpoint. Any application using the OpenAI Python SDK or REST API can point to TGI by changing the base URL, with no other code changes needed.
Which quantization methods does TGI support?
TGI supports GPTQ, AWQ, and bitsandbytes quantization. These methods reduce model memory requirements by 2-4x, enabling larger models to run on smaller GPU configurations at the cost of minor quality trade-offs.