Skills · April 14, 2026 · 1 min read

Text Generation Inference (TGI) — Hugging Face Production LLM Server

TGI is Hugging Face's production-grade LLM inference server. It powers HF Inference Endpoints and provides continuous batching, tensor parallelism, quantization, and an OpenAI-compatible API for high-throughput serving.

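Agent ready

This asset is staged safely first. The copied instruction asks the agent to read the staged files and confirm before activating scripts, MCP configuration, or global configuration.

Stage only · 29/100 · Policy: staging required
Agent entry point: any MCP/CLI agent
Type: Skill
Install: Stage only
Trust level: Community
Entry: step-1.md
Safe staging command:
npx -y tokrepo@latest install e08ad222-37db-11f1-9bc6-00163e2b0d79 --target codex

Stage the files first; read the staged README and install plan before activating.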

TL;DR
TGI serves LLMs in production with continuous batching, tensor parallelism, quantization, and an OpenAI-compatible API.
§01

What it is

Text Generation Inference (TGI) is Hugging Face's production-grade LLM inference server. It powers Hugging Face Inference Endpoints and provides continuous batching, tensor parallelism, quantization (GPTQ, AWQ, bitsandbytes), and an OpenAI-compatible API out of the box. TGI handles thousands of concurrent requests efficiently on GPU hardware.

TGI is designed for ML engineers and platform teams who need to self-host LLMs with production-level throughput and latency. It supports most popular open-weight models from Llama, Mistral, Falcon, and other families.

§02

How it saves time or tokens

TGI's continuous batching dynamically groups incoming requests to maximize GPU utilization. Instead of processing one request at a time or waiting for a fixed batch to fill, TGI starts generating tokens immediately and adds new requests to the running batch. This dramatically reduces latency under load compared to naive serving approaches.
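To make the difference concrete, here is a toy model, not TGI's actual scheduler. It assumes each request needs a fixed number of decode steps, the server has a fixed number of batch slots, and every request is already queued at time zero. The function names `static_steps` and `continuous_steps` are made up for this sketch.

```shell
#!/bin/sh
# Toy model of batching (an illustration, not TGI's actual scheduler).
# Usage: <function> CAPACITY STEPS...  where each STEPS value is the number
# of decode steps one queued request needs.

# Static batching: requests run in fixed groups of CAPACITY, and the next
# group starts only after the longest request in the current group finishes.
static_steps() (
  capacity=$1; shift
  printf '%s\n' "$@" | awk -v cap="$capacity" '
    { if ($1 + 0 > longest + 0) longest = $1 + 0
      if (NR % cap == 0) { total += longest; longest = 0 } }
    END { print total + longest }'
)

# Continuous batching: as soon as any slot frees up, the next queued
# request takes it, so short requests never wait for a long neighbour.
continuous_steps() (
  capacity=$1; shift
  printf '%s\n' "$@" | awk -v cap="$capacity" '
    BEGIN { for (i = 0; i < cap; i++) slot[i] = 0 }
    { best = 0
      for (i = 1; i < cap; i++) if (slot[i] < slot[best]) best = i
      slot[best] += $1
      if (slot[best] > finish) finish = slot[best] }
    END { print finish + 0 }'
)

# One long request (8 steps) queued with three short ones, two slots.
echo "static:     $(static_steps 2 8 2 2 2) steps"
echo "continuous: $(continuous_steps 2 8 2 2 2) steps"
```

With these inputs the static schedule needs 10 steps and the continuous one needs 8, because the three short requests share one slot while the long request occupies the other.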

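Because the API is OpenAI-compatible, you can switch an application from OpenAI to a self-hosted model by pointing it at a different base URL. In most cases nothing else in the code changes; the endpoint configuration (base URL, plus whatever API key and model name the client sends) is the only part that differs.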
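A minimal sketch of that switch, assuming the application reads its endpoint from environment variables. `LLM_BASE_URL`, `LLM_API_KEY`, and the helpers `chat_url` and `ask` are hypothetical names for illustration, not part of TGI or any OpenAI tooling; the local address is the one used in the steps on this page.

```shell
#!/bin/sh
# Hypothetical settings: the base URL is the value that differs between a
# hosted OpenAI-style API and a local TGI server.
LLM_BASE_URL="${LLM_BASE_URL:-http://localhost:8080/v1}"
# Placeholder key; assumption: a default local server does not check it.
LLM_API_KEY="${LLM_API_KEY:-unused}"

# Build the chat-completions URL from whatever base URL is configured.
chat_url() {
  printf '%s/chat/completions\n' "${LLM_BASE_URL%/}"
}

# Send one chat request; $1 is a complete JSON request body.
# Defined only: nothing is sent until you call it.
ask() {
  curl -sS "$(chat_url)" \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $LLM_API_KEY" \
    -d "$1"
}

echo "requests go to: $(chat_url)"
```

Exporting a different `LLM_BASE_URL` before sourcing the script redirects every `ask` call without touching the request-building code.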

§03

How to use

  1. Pull the TGI Docker image and start serving a model: specify the model name, GPU allocation, and port mapping.
  2. Send requests to the OpenAI-compatible endpoint at http://localhost:8080/v1/chat/completions.
  3. For multi-GPU setups, set --num-shard to the number of GPUs for tensor parallelism.
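
Steps 1 and 3 can be sketched together as one launch command. This is a dry-run sketch: `run` and `serve_sharded` are hypothetical helpers that print the command unless `DRY_RUN=0` is set, and the image tag, port, volume, and model are the ones used in the Example section. `--shm-size 1g` is an addition here, on the assumption that multi-GPU communication needs more shared memory than Docker's default.

```shell
#!/bin/sh
# Hypothetical helper: print the command by default; execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = 0 ]; then "$@"; else echo "$*"; fi
}

# Hypothetical helper: serve model $1 sharded across $2 GPUs on one machine.
serve_sharded() {
  # "-e HF_TOKEN" passes the token through from the host environment.
  run docker run --gpus all --shm-size 1g -p 8080:80 \
    -v "$PWD/data:/data" \
    -e HF_TOKEN \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id "$1" \
    --num-shard "$2"
}

# Dry run: shows the command for two GPUs without starting a container.
serve_sharded meta-llama/Llama-3.1-8B-Instruct 2
```

Set `--num-shard` to the number of GPUs the container can see; the printed command can be reviewed first and then executed with `DRY_RUN=0`.
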
§04

Example

# Serve a model with one Docker command
docker run --gpus all -p 8080:80 \
  -v "$PWD/data:/data" \
  -e HF_TOKEN=hf_xxx \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-3.1-8B-Instruct

# Query with curl (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

§06

Common pitfalls

  • TGI is built primarily for NVIDIA GPUs with CUDA support; AMD ROCm support exists but is less tested. CPU-only inference is not a supported path for production workloads.
  • Model downloads can be large: an 8B-parameter model is roughly 16 GB in 16-bit precision and about twice that in 32-bit. Ensure sufficient disk space and a fast network connection for the initial pull.
  • Quantized models (GPTQ, AWQ) reduce memory requirements but may degrade output quality for complex reasoning tasks; benchmark with your specific use case.
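
The download sizes above follow from simple arithmetic: weight storage is roughly the parameter count times the bytes per parameter. The sketch below uses a hypothetical helper, `weights_gb`, that takes the parameter count in billions and the bits per weight and prints decimal gigabytes. It covers weights only; the KV cache and activations need additional GPU memory at runtime.

```shell
#!/bin/sh
# Hypothetical helper: estimate weight storage in GB (10^9 bytes).
# $1 = parameters in billions, $2 = bits per weight.
weights_gb() {
  awk -v params="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", params * bits / 8 }'
}

echo "8B at 32-bit: $(weights_gb 8 32) GB"
echo "8B at 16-bit: $(weights_gb 8 16) GB"
echo "8B at 4-bit:  $(weights_gb 8 4) GB"
```

Real checkpoints, and quantized formats in particular, carry some extra data such as scaling factors, so treat these figures as lower bounds when sizing disks and GPUs.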

FAQ

What models does TGI support?

TGI supports most popular open-weight LLMs including Llama, Mistral, Falcon, StarCoder, Gemma, and others hosted on Hugging Face Hub. The server auto-detects model architecture and applies appropriate optimizations.

How does continuous batching work?

Continuous batching dynamically adds new requests to a running batch without waiting for the current batch to complete. This maximizes GPU utilization and reduces average latency compared to static batching, especially under variable load.

Can TGI run on multiple GPUs?

Yes. TGI supports tensor parallelism across multiple GPUs using the --num-shard flag. A model too large for one GPU can be split across two or more GPUs on the same machine.

Is the API compatible with OpenAI clients?

Yes. TGI exposes an OpenAI-compatible /v1/chat/completions endpoint. Any application using the OpenAI Python SDK or REST API can point to TGI by changing the base URL, with no other code changes needed.

What quantization methods does TGI support?

TGI supports GPTQ, AWQ, and bitsandbytes quantization. These methods reduce model memory requirements by roughly 2-4x relative to 16-bit weights (8-bit halves the weight size, 4-bit quarters it), enabling larger models to run on smaller GPU configurations at the cost of some quality trade-offs.
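
As a sketch of how a method is selected at launch: TGI's launcher takes a `--quantize` option. The helper names below (`run`, `serve_quantized`) are hypothetical, the command is only printed unless `DRY_RUN=0` is set, and the accepted values are limited here to the three methods named in this answer; check the launcher help of your TGI version for the full list. GPTQ and AWQ generally expect a checkpoint that was already quantized with that method, so the model ID is passed in as an argument.

```shell
#!/bin/sh
# Hypothetical helper: print the command by default; execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = 0 ]; then "$@"; else echo "$*"; fi
}

# Hypothetical helper: serve model $1 using quantization method $2.
serve_quantized() {
  case "$2" in
    gptq|awq|bitsandbytes) ;;   # the methods named in this FAQ
    *) echo "unsupported method in this sketch: $2" >&2; return 1 ;;
  esac
  # "-e HF_TOKEN" passes the token through from the host environment.
  run docker run --gpus all -p 8080:80 \
    -v "$PWD/data:/data" \
    -e HF_TOKEN \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id "$1" \
    --quantize "$2"
}

# Dry run: quantize the example model at load time with bitsandbytes.
serve_quantized meta-llama/Llama-3.1-8B-Instruct bitsandbytes
```

Whichever method you pick, compare outputs against the unquantized model on your own prompts before relying on it.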
