
Text Embeddings Inference — High-Performance Embedding Server by Hugging Face

A blazing-fast inference server for text embedding and reranking models. TEI serves Sentence Transformers and cross-encoder models with optimized Rust and CUDA kernels, token-based dynamic batching, and an OpenAI-compatible API.

Introduction

Text Embeddings Inference (TEI) is Hugging Face's production-grade server for deploying text embedding and reranking models. Written in Rust with custom CUDA kernels, it delivers low-latency, high-throughput inference for RAG pipelines, semantic search, and classification workloads.

What TEI Does

  • Serves text embedding models with optimized Rust and CUDA backends
  • Supports reranking and cross-encoder models for two-stage retrieval
  • Provides token-based dynamic batching to maximize GPU utilization
  • Exposes OpenAI-compatible API endpoints for drop-in integration (an example follows this list)
  • Handles SPLADE sparse embeddings alongside dense vectors
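
The OpenAI-compatible route makes TEI a drop-in replacement for hosted embedding APIs. Below is a minimal sketch using the official openai Python client, assuming a TEI container is already serving a model and is mapped to localhost:8080 (the model name and port are illustrative).

```python
# Minimal sketch: calling TEI's OpenAI-compatible embeddings route.
# Assumes TEI is listening on localhost:8080 with a model already loaded.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # TEI's OpenAI-compatible prefix
    api_key="unused",                     # TEI does not validate the key by default
)

response = client.embeddings.create(
    model="BAAI/bge-base-en-v1.5",        # illustrative; TEI serves whatever model it was started with
    input=["What is dynamic batching?", "TEI serves embedding models."],
)

for item in response.data:
    print(len(item.embedding))            # embedding dimensionality, e.g. 768
```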

Architecture Overview

TEI is a Rust server using Tokio for async I/O and custom CUDA kernels for transformer inference. Incoming requests are grouped into batches based on token count rather than request count, ensuring optimal GPU utilization across varying input lengths. Model weights are loaded via candle (Rust ML framework) or PyTorch backends depending on the model architecture. Continuous batching serves new requests without waiting for the current batch to finish.
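
To make the batching idea concrete, here is a toy Python sketch of token-budget packing. It illustrates the concept only, not TEI's actual Rust implementation, and the token budget value is arbitrary.

```python
# Toy illustration of token-based dynamic batching: requests are packed into a
# batch until a total token budget is reached, rather than a fixed request count.
from typing import List, Tuple

def pack_batches(requests: List[Tuple[str, int]], max_batch_tokens: int) -> List[List[str]]:
    """requests: (text, token_count) pairs; returns batches that respect the token budget."""
    batches: List[List[str]] = []
    current: List[str] = []
    current_tokens = 0
    for text, n_tokens in requests:
        # Flush the running batch once adding this request would exceed the budget.
        if current and current_tokens + n_tokens > max_batch_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches

# Short and long inputs land in differently sized batches, so the per-batch
# token count (and therefore GPU work per forward pass) stays roughly constant.
print(pack_batches([("short query", 5), ("a medium passage", 40), ("a very long document", 480)],
                   max_batch_tokens=512))
```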

Self-Hosting & Configuration

  • Docker images available for NVIDIA GPU, AMD GPU, Intel GPU, and CPU-only deployment
  • Supports models from the Hugging Face Hub with a sentence-transformers or cross-encoder tag, provided TEI implements the underlying architecture
  • Configuration via command-line flags: --max-batch-tokens, --max-concurrent-requests, --dtype
  • Health check and metrics endpoints for production monitoring (a check is sketched after this list)
  • Reduced-precision inference, including FP16 and quantized INT8 modes
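
For monitoring, TEI exposes a health route and a Prometheus-style metrics route. Below is a minimal liveness check sketch, assuming the server is reachable on localhost:8080.

```python
# Minimal liveness and metrics check against a local TEI instance.
import requests

health = requests.get("http://localhost:8080/health", timeout=5)
print("healthy" if health.ok else f"unhealthy: {health.status_code}")

# Prometheus-format metrics (request counts, batch sizes, latencies, ...).
metrics = requests.get("http://localhost:8080/metrics", timeout=5)
print("\n".join(metrics.text.splitlines()[:5]))
```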

Key Features

  • Written in Rust for minimal memory overhead and maximum throughput
  • Token-based dynamic batching outperforms fixed-size batching on variable-length inputs
  • Flash Attention integration for long-context embedding models
  • Prometheus metrics endpoint for observability
  • Supports sequence classification, embedding, and reranking in a single server (a rerank example follows this list)
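
As a sketch of the reranking path, the request below posts a query and candidate passages to TEI's rerank route using the requests library, assuming a cross-encoder such as BAAI/bge-reranker-base is the model being served (model name, port, and example texts are illustrative).

```python
# Sketch of a two-stage retrieval rerank call against a local TEI instance.
import requests

payload = {
    "query": "What is token-based dynamic batching?",
    "texts": [
        "TEI groups requests by total token count to keep GPU batches full.",
        "The Eiffel Tower is located in Paris.",
    ],
}
resp = requests.post("http://localhost:8080/rerank", json=payload, timeout=30)
resp.raise_for_status()

# Each result carries the original passage index and a relevance score.
for result in sorted(resp.json(), key=lambda r: r["score"], reverse=True):
    print(result["index"], round(result["score"], 3))
```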

Comparison with Similar Tools

  • Sentence Transformers (Python) — the training and inference library; TEI wraps models in a production server with batching and concurrency
  • Infinity — Python-based embedding server; TEI's Rust backend offers lower latency and higher throughput
  • vLLM — optimized for generative LLM serving; TEI is purpose-built for embedding and reranking workloads
  • Triton Inference Server — general-purpose model server; TEI provides simpler setup specifically for embedding models

FAQ

Q: Can TEI serve generative LLMs? A: No. TEI is specialized for embedding and reranking. Use Text Generation Inference (TGI) for generative models.

Q: Does it work without a GPU? A: Yes. CPU-only Docker images are available, though throughput is significantly lower.

Q: How do I use it with LangChain or LlamaIndex? A: Both frameworks support TEI via the Hugging Face inference endpoint integration or the OpenAI-compatible API.
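
For example, LangChain's OpenAI embeddings wrapper can be pointed at TEI's OpenAI-compatible route. This is a hypothetical sketch assuming TEI runs on localhost:8080, with check_embedding_ctx_length disabled so raw strings are sent instead of OpenAI token IDs.

```python
# Hypothetical sketch: using TEI behind LangChain via the OpenAI-compatible API.
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    base_url="http://localhost:8080/v1",   # TEI's OpenAI-compatible prefix
    api_key="unused",                      # TEI ignores the key by default
    model="BAAI/bge-base-en-v1.5",         # illustrative model name
    check_embedding_ctx_length=False,      # send raw strings, not OpenAI token IDs
)

vectors = embeddings.embed_documents(["TEI behind LangChain", "semantic search"])
print(len(vectors), len(vectors[0]))       # number of texts, embedding dimensionality
```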

Q: What embedding models work best? A: Most models on the Hub tagged sentence-transformers are supported, as long as TEI implements the underlying architecture. Popular choices include the BGE, E5, and GTE families.
