Scripts · Apr 14, 2026 · 3 min read

Text Generation Inference (TGI) — Hugging Face Production LLM Server

TGI is Hugging Face's production-grade LLM inference server. It powers HF Inference Endpoints with continuous batching, tensor parallelism, quantization, and an OpenAI-compatible API, and it scales to thousands of concurrent requests on GPU hardware.

TL;DR
TGI serves LLMs in production with continuous batching, tensor parallelism, quantization, and an OpenAI-compatible API.
§01

What it is

Text Generation Inference (TGI) is Hugging Face's production-grade LLM inference server. It powers Hugging Face Inference Endpoints and provides continuous batching, tensor parallelism, quantization (GPTQ, AWQ, bitsandbytes), and an OpenAI-compatible API out of the box. TGI handles thousands of concurrent requests efficiently on GPU hardware.

TGI is designed for ML engineers and platform teams who need to self-host LLMs with production-level throughput and latency. It supports most popular open-weight models from Llama, Mistral, Falcon, and other families.

§02

How it saves time or tokens

TGI's continuous batching dynamically groups incoming requests to maximize GPU utilization. Instead of processing one request at a time or waiting for a fixed batch to fill, TGI starts generating tokens immediately and adds new requests to the running batch. This dramatically reduces latency under load compared to naive serving approaches.

The OpenAI-compatible API means you can switch from OpenAI to a self-hosted model by changing a single base URL in your application. No code changes beyond the endpoint configuration.
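
As a minimal sketch, assuming the application uses the official OpenAI Python SDK (v1 or later), which reads its endpoint and key from environment variables, the switch needs no code changes at all:

# Point an OpenAI-SDK-based application at a local TGI server
export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=placeholder  # the SDK requires some key; a stock TGI server typically does not check it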

§03

How to use

  1. Pull the TGI Docker image and start serving a model: specify the model name, GPU allocation, and port mapping.
  2. Send requests to the OpenAI-compatible endpoint at http://localhost:8080/v1/chat/completions.
  3. For multi-GPU setups, set --num-shard to the number of GPUs for tensor parallelism.
§04

Example

# Serve a model with one Docker command
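# (HF_TOKEN is only required for gated models such as Llama)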
docker run --gpus all -p 8080:80 \
  -v $PWD/data:/data \
  -e HF_TOKEN=hf_xxx \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct

# Query with curl (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
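
For step 3 above, the same command extends to tensor parallelism across multiple GPUs. The sketch below assumes two GPUs (the --num-shard value is illustrative) and adds --shm-size 1g, which the TGI docs recommend so NCCL has enough shared memory for cross-GPU communication.

# Multi-GPU variant: shard the model across two GPUs
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  -e HF_TOKEN=hf_xxx \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --num-shard 2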
§05

Common pitfalls

  • TGI requires NVIDIA GPUs with CUDA support; AMD ROCm support exists but is less tested. CPU-only inference is not supported for production workloads.
  • Model downloads can be large (an 8B-parameter model is roughly 16 GB of bf16 weights, more if fp32 files are included); ensure sufficient disk space and a fast network connection for the initial pull. A quick pre-flight check is sketched after this list.
  • Quantized models (GPTQ, AWQ) reduce memory requirements but may degrade output quality for complex reasoning tasks; benchmark with your specific use case.
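
A minimal pre-flight check for the first two pitfalls, run on the Docker host before pulling anything (the data path is illustrative):

# Confirm the driver sees your GPU(s) and the data volume has room for the weights
nvidia-smi
df -h $PWD/data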

Frequently Asked Questions

What models does TGI support?

TGI supports most popular open-weight LLMs including Llama, Mistral, Falcon, StarCoder, Gemma, and others hosted on Hugging Face Hub. The server auto-detects model architecture and applies appropriate optimizations.

How does continuous batching work?

Continuous batching dynamically adds new requests to a running batch without waiting for the current batch to complete. This maximizes GPU utilization and reduces average latency compared to static batching, especially under variable load.

Can TGI run on multiple GPUs?

Yes. TGI supports tensor parallelism across multiple GPUs using the --num-shard flag. A model too large for one GPU can be split across two or more GPUs on the same machine.

Is the API compatible with OpenAI clients?

Yes. TGI exposes an OpenAI-compatible /v1/chat/completions endpoint. Any application using the OpenAI Python SDK or REST API can point to TGI by changing the base URL, with no other code changes needed.

What quantization methods does TGI support?

TGI supports GPTQ, AWQ, and bitsandbytes quantization. These methods reduce model memory requirements by 2-4x, enabling larger models to run on smaller GPU configurations at the cost of minor quality trade-offs.
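
As an illustrative sketch, on-the-fly 8-bit quantization with bitsandbytes needs no special checkpoint, while GPTQ and AWQ expect a pre-quantized model from the Hub (the flag values below assume a recent TGI release):

# On-the-fly 8-bit quantization with bitsandbytes
docker run --gpus all -p 8080:80 \
  -v $PWD/data:/data \
  -e HF_TOKEN=hf_xxx \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --quantize bitsandbytes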
