Text Generation Inference (TGI) — Hugging Face Production LLM Server
TGI is Hugging Face's production-grade LLM inference server. It powers HF Inference Endpoints with continuous batching, tensor parallelism, quantization, and OpenAI-compatible APIs, and is built to serve thousands of concurrent requests.
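This asset is staged safely first
The copied instruction asks the agent to read the staged files and to confirm before activating scripts, MCP configuration, or global configuration.
npx -y tokrepo@latest install e08ad222-37db-11f1-9bc6-00163e2b0d79 --target codex
The command stages the files first; read the staged README and install plan before activating anything.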
What it is
Text Generation Inference (TGI) is Hugging Face's production-grade LLM inference server. It powers Hugging Face Inference Endpoints and provides continuous batching, tensor parallelism, quantization (GPTQ, AWQ, bitsandbytes), and an OpenAI-compatible API out of the box. TGI handles thousands of concurrent requests efficiently on GPU hardware.
TGI is designed for ML engineers and platform teams who need to self-host LLMs with production-level throughput and latency. It supports most popular open-weight models from Llama, Mistral, Falcon, and other families.
How it saves time or tokens
TGI's continuous batching dynamically groups incoming requests to maximize GPU utilization. Instead of processing one request at a time or waiting for a fixed batch to fill, TGI starts generating tokens immediately and adds new requests to the running batch. This dramatically reduces latency under load compared to naive serving approaches.
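The scheduling difference can be sketched with a toy model. The code below is not TGI's scheduler: it assumes every request needs a known number of decode steps, ignores prefill, memory limits, and token budgets, and uses the hypothetical helper names `static_batching_steps` and `continuous_batching_steps` purely for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Total decode steps when each fixed batch runs until its longest request ends."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Total decode steps when a freed slot is refilled from the queue right away."""
    queue = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while queue or running:
        # Fill any free slots before the next decode step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Every in-flight request emits one token; finished requests leave the batch.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


# Hypothetical workload: number of tokens each request generates.
lengths = [2, 10, 2, 2, 10, 2]
static_total = static_batching_steps(lengths, batch_size=2)
continuous_total = continuous_batching_steps(lengths, batch_size=2)
print(static_total, continuous_total)
```

With this workload the static version needs 22 steps and the continuous version 16, because short requests no longer wait for the long request that shares their batch.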
Because the API is OpenAI-compatible, you can switch an application from OpenAI to a self-hosted model by changing its base URL. In most cases the only other change is the model name; the request and response code stays the same.
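A minimal sketch of that switch, using only the Python standard library. The helper `build_chat_request` is hypothetical, the local URL assumes the `8080` port mapping used on this page, and the `api_key` value is a placeholder that a local server started without authentication is assumed to ignore.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message, api_key="unused"):
    """Build an OpenAI-style chat completion request for any compatible server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Only the base URL differs between the hosted API and a local TGI server.
OPENAI_BASE_URL = "https://api.openai.com/v1"
TGI_BASE_URL = "http://localhost:8080/v1"  # assumption: port mapping from this page

request = build_chat_request(TGI_BASE_URL, "meta-llama/Llama-3.1-8B-Instruct", "Hello")
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without a server.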
How to use
- Pull the TGI Docker image and start serving a model: specify the model name, GPU allocation, and port mapping.
- Send requests to the OpenAI-compatible endpoint at http://localhost:8080/v1/chat/completions.
- For multi-GPU setups, set --num-shard to the number of GPUs for tensor parallelism.
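The steps above can be sketched as a small launcher helper that assembles the `docker run` arguments, including `--num-shard`. The function name `build_tgi_command` and its defaults are hypothetical. The `--shm-size 1g` option and the bare `-e HF_TOKEN` (which forwards the variable from the host environment) are assumptions about a typical setup, not requirements stated on this page.

```python
import os
import shlex


def build_tgi_command(model_id, num_shard=1, host_port=8080, data_dir="data"):
    """Assemble a `docker run` argument list for a TGI container."""
    if num_shard < 1:
        raise ValueError("num_shard must be at least 1")
    return [
        "docker", "run", "--gpus", "all",
        "--shm-size", "1g",                        # assumption: shared memory for multi-GPU runs
        "-p", f"{host_port}:80",                   # host port -> container port 80
        "-v", f"{os.path.abspath(data_dir)}:/data",
        "-e", "HF_TOKEN",                          # forward the token from the host environment
        "ghcr.io/huggingface/text-generation-inference:latest",
        "--model-id", model_id,
        "--num-shard", str(num_shard),             # one shard per GPU
    ]


command = build_tgi_command("meta-llama/Llama-3.1-8B-Instruct", num_shard=2)
print(shlex.join(command))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues.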
Example
# Serve a model with one Docker command (hf_xxx is a placeholder for your own token)
docker run --gpus all -p 8080:80 \
  -v "$PWD/data:/data" \
  -e HF_TOKEN=hf_xxx \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct

# Query with curl (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
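On the client side, the reply text sits inside the JSON body returned by the endpoint. The sketch below assumes the response follows the OpenAI chat completion shape (`choices[0].message.content`). The helper `extract_reply` is hypothetical, and the sample body is hand-written for illustration, not captured server output.

```python
import json


def extract_reply(response_body):
    """Return the assistant text from an OpenAI-style chat completion body."""
    data = json.loads(response_body)
    choices = data.get("choices", [])
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


# Hand-written sample in the OpenAI response shape, not real server output.
sample_body = json.dumps({
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hi there!"},
            "finish_reason": "stop",
        }
    ]
})
reply = extract_reply(sample_body)
print(reply)
```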
Related on TokRepo
- Local LLM tools -- compare self-hosted LLM inference options
- AI tools for coding -- development tools using local models
Common pitfalls
- TGI is built primarily for NVIDIA GPUs with CUDA support; AMD ROCm support exists but is less tested. CPU-only inference is not supported for production workloads.
- Model downloads can be large (an 8B-parameter model is roughly 15-30GB, depending on weight precision); ensure sufficient disk space and a fast network connection for the initial pull.
- Quantized models (GPTQ, AWQ) reduce memory requirements but may degrade output quality for complex reasoning tasks; benchmark with your specific use case.
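The size figures above follow from simple arithmetic: weight size is roughly parameter count times bytes per parameter. The sketch below is a back-of-the-envelope estimate only. It assumes idealized bytes per parameter, counts weights alone, and ignores the KV cache, activations, and quantization metadata; `weight_size_gb` is a hypothetical helper.

```python
# Idealized storage cost per parameter for common precisions (assumption).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_size_gb(num_params, precision):
    """Approximate size of the model weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


estimates = {
    precision: weight_size_gb(8e9, precision) for precision in BYTES_PER_PARAM
}
for precision, size in estimates.items():
    print(f"{precision}: {size:.0f} GB")
```

For an 8B-parameter model this gives about 32 GB at fp32 and 16 GB at fp16, and 8 GB or 4 GB for 8-bit and 4-bit weights, which is where a 2-4x reduction relative to fp16 comes from.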
FAQ
Which models does TGI support?
TGI supports most popular open-weight LLMs, including Llama, Mistral, Falcon, StarCoder, Gemma, and others hosted on the Hugging Face Hub. The server auto-detects the model architecture and applies appropriate optimizations.
What is continuous batching?
Continuous batching dynamically adds new requests to a running batch without waiting for the current batch to complete. This maximizes GPU utilization and reduces average latency compared to static batching, especially under variable load.
Can TGI run a model across multiple GPUs?
Yes. TGI supports tensor parallelism across multiple GPUs using the --num-shard flag. A model too large for one GPU can be split across two or more GPUs on the same machine.
Is TGI compatible with the OpenAI API?
Yes. TGI exposes an OpenAI-compatible /v1/chat/completions endpoint. An application using the OpenAI Python SDK or REST API can point to TGI by changing the base URL, usually with no other code changes.
Which quantization methods does TGI support?
TGI supports GPTQ, AWQ, and bitsandbytes quantization. These methods reduce weight memory by roughly 2-4x relative to 16-bit weights, enabling larger models to run on smaller GPU configurations at some cost in output quality.