
Text Embeddings Inference — High-Performance Embedding Server by Hugging Face

A blazing-fast inference server for text embedding and reranking models. TEI serves Sentence Transformers and cross-encoder models with optimized Rust and CUDA kernels, token-based dynamic batching, and an OpenAI-compatible API.

Introduction

Text Embeddings Inference (TEI) is Hugging Face's production-grade server for deploying text embedding and reranking models. Written in Rust with custom CUDA kernels, it delivers low-latency, high-throughput inference for RAG pipelines, semantic search, and classification workloads.

What TEI Does

  • Serves text embedding models with optimized Rust and CUDA backends
  • Supports reranking and cross-encoder models for two-stage retrieval
  • Provides token-based dynamic batching to maximize GPU utilization
  • Exposes OpenAI-compatible API endpoints for drop-in integration (an example follows this list)
  • Handles SPLADE sparse embeddings alongside dense vectors
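
The OpenAI-compatible route makes TEI a drop-in replacement for hosted embedding APIs. Below is a minimal sketch using the official openai Python client, assuming a TEI container is already serving a model and is mapped to localhost:8080 (the model name and port are illustrative).

```python
# Minimal sketch: calling TEI's OpenAI-compatible embeddings route.
# Assumes TEI is listening on localhost:8080 with a model already loaded.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # TEI's OpenAI-compatible prefix
    api_key="unused",                     # TEI does not validate the key by default
)

response = client.embeddings.create(
    model="BAAI/bge-base-en-v1.5",        # illustrative; TEI serves whatever model it was started with
    input=["What is dynamic batching?", "TEI serves embedding models."],
)

for item in response.data:
    print(len(item.embedding))            # embedding dimensionality, e.g. 768
```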

Architecture Overview

TEI is a Rust server using Tokio for async I/O and custom CUDA kernels for transformer inference. Incoming requests are grouped into batches based on token count rather than request count, ensuring optimal GPU utilization across varying input lengths. Model weights are loaded via candle (Rust ML framework) or PyTorch backends depending on the model architecture. Continuous batching serves new requests without waiting for the current batch to finish.
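
To make the batching idea concrete, here is a toy Python sketch of token-budget packing. It illustrates the concept only, not TEI's actual Rust implementation, and the token budget value is arbitrary.

```python
# Toy illustration of token-based dynamic batching: requests are packed into a
# batch until a total token budget is reached, rather than a fixed request count.
from typing import List, Tuple

def pack_batches(requests: List[Tuple[str, int]], max_batch_tokens: int) -> List[List[str]]:
    """requests: (text, token_count) pairs; returns batches that respect the token budget."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for text, n_tokens in requests:
        # Flush the running batch once adding this request would exceed the budget.
        if current and current_tokens + n_tokens > max_batch_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches

# Short and long inputs land in differently sized batches, so the per-batch
# token count (and therefore GPU work per forward pass) stays roughly constant.
print(pack_batches([("short query", 5), ("a medium passage", 40), ("a very long document", 480)],
                   max_batch_tokens=512))
```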

Self-Hosting & Configuration

  • Docker images available for NVIDIA GPU, AMD GPU, Intel GPU, and CPU-only deployment
  • Supports models from the Hugging Face Hub with a sentence-transformers or cross-encoder tag, provided TEI implements the underlying architecture
  • Configuration via command-line flags: --max-batch-tokens, --max-concurrent-requests, --dtype
  • Health check and metrics endpoints for production monitoring (a check is sketched after this list)
  • Reduced-precision inference, including FP16 and quantized INT8 modes
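
For monitoring, TEI exposes a health route and a Prometheus-style metrics route. Below is a minimal liveness check sketch, assuming the server is reachable on localhost:8080.

```python
# Minimal liveness and metrics check against a local TEI instance.
import requests

health = requests.get("http://localhost:8080/health", timeout=5)
print("healthy" if health.ok else f"unhealthy: {health.status_code}")

# Prometheus-format metrics (request counts, batch sizes, latencies, ...).
metrics = requests.get("http://localhost:8080/metrics", timeout=5)
print("\n".join(metrics.text.splitlines()[:5]))
```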

Key Features

  • Written in Rust for minimal memory overhead and maximum throughput
  • Token-based dynamic batching outperforms fixed-size batching on variable-length inputs
  • Flash Attention integration for long-context embedding models
  • Prometheus metrics endpoint for observability
  • Supports sequence classification, embedding, and reranking in a single server (a rerank example follows this list)
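
As a sketch of the reranking path, the request below posts a query and candidate passages to TEI's rerank route using the requests library, assuming a cross-encoder such as BAAI/bge-reranker-base is the model being served (model name, port, and example texts are illustrative).

```python
# Sketch of a two-stage retrieval rerank call against a local TEI instance.
import requests

payload = {
    "query": "What is token-based dynamic batching?",
    "texts": [
        "TEI groups requests by total token count to keep GPU batches full.",
        "The Eiffel Tower is located in Paris.",
    ],
}
resp = requests.post("http://localhost:8080/rerank", json=payload, timeout=30)
resp.raise_for_status()

# Each result carries the original passage index and a relevance score.
for result in sorted(resp.json(), key=lambda r: r["score"], reverse=True):
    print(result["index"], round(result["score"], 3))
```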

Comparison with Similar Tools

  • Sentence Transformers (Python) — the training and inference library; TEI wraps models in a production server with batching and concurrency
  • Infinity — Python-based embedding server; TEI's Rust backend offers lower latency and higher throughput
  • vLLM — optimized for generative LLM serving; TEI is purpose-built for embedding and reranking workloads
  • Triton Inference Server — general-purpose model server; TEI provides simpler setup specifically for embedding models

FAQ

Q: Can TEI serve generative LLMs? A: No. TEI is specialized for embedding and reranking. Use Text Generation Inference (TGI) for generative models.

Q: Does it work without a GPU? A: Yes. CPU-only Docker images are available, though throughput is significantly lower.

Q: How do I use it with LangChain or LlamaIndex? A: Both frameworks support TEI via the Hugging Face inference endpoint integration or the OpenAI-compatible API.
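
For example, LangChain's OpenAI embeddings wrapper can be pointed at TEI's OpenAI-compatible route. This is a hypothetical sketch assuming TEI runs on localhost:8080, with check_embedding_ctx_length disabled so raw strings are sent instead of OpenAI token IDs.

```python
# Hypothetical sketch: using TEI behind LangChain via the OpenAI-compatible API.
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    base_url="http://localhost:8080/v1",   # TEI's OpenAI-compatible prefix
    api_key="unused",                      # TEI ignores the key by default
    model="BAAI/bge-base-en-v1.5",         # illustrative model name
    check_embedding_ctx_length=False,      # send raw strings, not OpenAI token IDs
)

vectors = embeddings.embed_documents(["TEI behind LangChain", "semantic search"])
print(len(vectors), len(vectors[0]))       # number of texts, embedding dimensionality
```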

Q: What embedding models work best? A: Most models on the Hub tagged sentence-transformers are supported, as long as TEI implements the underlying architecture. Popular choices include the BGE, E5, and GTE families.
