Text Generation Inference (TGI) — Hugging Face Production LLM Server
TGI is Hugging Face's production-grade LLM inference server. It powers HF Inference Endpoints with continuous batching, tensor parallelism, quantization, and OpenAI-compatible APIs — handling thousands of concurrent requests on GPU hardware.
What it is
Text Generation Inference (TGI) is Hugging Face's production-grade LLM inference server. It powers Hugging Face Inference Endpoints and provides continuous batching, tensor parallelism, quantization (GPTQ, AWQ, bitsandbytes), and an OpenAI-compatible API out of the box. TGI handles thousands of concurrent requests efficiently on GPU hardware.
TGI is designed for ML engineers and platform teams who need to self-host LLMs with production-level throughput and latency. It supports most popular open-weight models from the Llama, Mistral, Falcon, and other model families.
How it saves time or tokens
TGI's continuous batching dynamically groups incoming requests to maximize GPU utilization. Instead of processing one request at a time or waiting for a fixed batch to fill, TGI starts generating tokens immediately and adds new requests to the running batch. This dramatically reduces latency under load compared to naive serving approaches.
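A minimal client-side sketch of what this looks like in practice, assuming the server from the example section below is already running on localhost:8080:

# Fire four requests concurrently; TGI folds each arriving request into
# the running batch rather than queuing it behind a fixed-size batch.
for i in 1 2 3 4; do
  curl -s http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
         "messages": [{"role": "user", "content": "Write a haiku about GPUs"}]}' &
done
wait   # all four responses stream back without waiting on one another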
The OpenAI-compatible API means you can switch from OpenAI to a self-hosted model by changing a single base URL in your application. No code changes beyond the endpoint configuration.
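For example, the OpenAI Python SDK (v1+) reads its base URL and API key from the environment, so the switch can be as small as the sketch below (your_app.py is a hypothetical stand-in for your existing application):

# Repoint an existing OpenAI-SDK application at the local TGI server
export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=dummy   # TGI does not check the key by default
python your_app.py            # hypothetical script; its code is unchanged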
How to use
- Pull the TGI Docker image and start serving a model: specify the model ID, GPU allocation, and port mapping.
- Send requests to the OpenAI-compatible endpoint at http://localhost:8080/v1/chat/completions.
- For multi-GPU setups, set --num-shard to the number of GPUs to enable tensor parallelism (a multi-GPU command follows the example below).
Example
# Serve a model with one Docker command
docker run --gpus all -p 8080:80 \
  -v $PWD/data:/data \
  -e HF_TOKEN=hf_xxx \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct

# Query with curl (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
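A multi-GPU variant of the same command, as a sketch assuming two GPUs on one machine; --shm-size gives NCCL the shared memory it needs for cross-GPU communication:

# Shard the model across two GPUs with tensor parallelism
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  -e HF_TOKEN=hf_xxx \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --num-shard 2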
Related on TokRepo
- Local LLM tools -- compare self-hosted LLM inference options
- AI tools for coding -- development tools using local models
Common pitfalls
- TGI requires NVIDIA GPUs with CUDA support; AMD ROCm support exists but is less tested. CPU-only inference is not supported for production workloads.
- Model downloads can be large (8B parameter models are 15-30GB); ensure sufficient disk space and a fast network connection for the initial pull.
- Quantized models (GPTQ, AWQ) reduce memory requirements but may degrade output quality for complex reasoning tasks; benchmark with your specific use case (a quantized launch command follows this list).
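A quantized launch for comparison, as a sketch: --quantize bitsandbytes quantizes the weights on the fly at load time, while the gptq and awq options expect pre-quantized checkpoints.

# Serve the same model with 8-bit bitsandbytes quantization to cut VRAM use
docker run --gpus all -p 8080:80 \
  -v $PWD/data:/data \
  -e HF_TOKEN=hf_xxx \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --quantize bitsandbytes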
Frequently Asked Questions
Which models does TGI support?
TGI supports most popular open-weight LLMs, including Llama, Mistral, Falcon, StarCoder, Gemma, and others hosted on the Hugging Face Hub. The server auto-detects the model architecture and applies the appropriate optimizations.
What is continuous batching?
Continuous batching dynamically adds new requests to a running batch without waiting for the current batch to complete. This maximizes GPU utilization and reduces average latency compared to static batching, especially under variable load.
Can TGI serve a model across multiple GPUs?
Yes. TGI supports tensor parallelism across multiple GPUs using the --num-shard flag. A model too large for one GPU can be split across two or more GPUs on the same machine.
Is TGI compatible with the OpenAI API?
Yes. TGI exposes an OpenAI-compatible /v1/chat/completions endpoint. Any application using the OpenAI Python SDK or REST API can point to TGI by changing the base URL, with no other code changes needed.
Which quantization methods does TGI support?
TGI supports GPTQ, AWQ, and bitsandbytes quantization. These methods reduce model memory requirements by 2-4x, enabling larger models to run on smaller GPU configurations at the cost of minor quality trade-offs.
Citations (3)
- TGI GitHub — TGI is Hugging Face's production LLM inference server
- TGI Documentation — Continuous batching and tensor parallelism for LLM serving
- OpenAI API Reference — OpenAI-compatible API specification