Skills · May 2, 2026 · 1 min read

Text Embeddings Inference — High-Performance Embedding Server by Hugging Face

An inference server for text embedding and reranking models. TEI serves supported Sentence Transformers and cross-encoder (reranker) architectures with optimized Rust and CUDA kernels, token-based dynamic batching, and an OpenAI-compatible embeddings API.

Universal CLI install command
npx tokrepo install 19c58bfa-45e0-11f1-9bc6-00163e2b0d79

Introduction

Text Embeddings Inference (TEI) is Hugging Face's production-grade server for deploying text embedding and reranking models. Written in Rust with custom CUDA kernels, it delivers low-latency, high-throughput inference for RAG pipelines, semantic search, and classification workloads.

What TEI Does

  • Serves text embedding models with optimized Rust and CUDA backends
  • Supports reranking and cross-encoder models for two-stage retrieval
  • Provides token-based dynamic batching to maximize GPU utilization
  • Exposes OpenAI-compatible API endpoints for drop-in integration
  • Handles SPLADE sparse embeddings alongside dense vectors
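
As a minimal sketch of the embedding use case, the snippet below builds and sends a request to TEI's native /embed route using only the Python standard library. It assumes a TEI instance is already listening at http://localhost:8080, which is a placeholder address; the helper names build_embed_request and embed are illustrative and not part of TEI.

```python
import json
import urllib.request

TEI_URL = "http://localhost:8080"  # assumed address of a running TEI instance


def build_embed_request(texts, base_url=TEI_URL):
    """Build (without sending) a POST request for the /embed route."""
    body = json.dumps({"inputs": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, base_url=TEI_URL, timeout=30):
    """Send the request and return one vector (list of floats) per input text."""
    request = build_embed_request(texts, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Against a live server, embed(["What is deep learning?"]) should return a list containing a single vector whose length depends on the loaded model.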

Architecture Overview

TEI is a Rust server that uses Tokio for async I/O and custom CUDA kernels for transformer inference. Incoming requests are grouped into batches by total token count rather than by request count, which keeps GPU utilization steady across varying input lengths. Inference runs through one of several backends, such as candle (a Rust ML framework), ONNX Runtime, or a Python/PyTorch backend, depending on the build and the model architecture. Requests that arrive while a batch is running wait in a queue and are scheduled into the next batch.
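
The idea of batching by token count can be illustrated with a short sketch. This is a simplified greedy grouping written for illustration only, not TEI's actual scheduler; PendingRequest and batch_by_token_budget are hypothetical names, and the budget plays the role that --max-batch-tokens plays in the server.

```python
from dataclasses import dataclass


@dataclass
class PendingRequest:
    request_id: str
    num_tokens: int  # token count after tokenization


def batch_by_token_budget(queue, max_batch_tokens):
    """Greedily group queued requests so each batch stays within the token budget."""
    batches, current, used = [], [], 0
    for req in queue:
        if req.num_tokens > max_batch_tokens:
            raise ValueError(f"{req.request_id} exceeds the token budget on its own")
        # Close the current batch when adding this request would overflow it.
        if current and used + req.num_tokens > max_batch_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(req)
        used += req.num_tokens
    if current:
        batches.append(current)
    return batches
```

With a budget of 1000 tokens, requests of 300, 300, 500, 50, and 900 tokens end up in three batches: the first two together, the next two together, and the last one alone. A fixed batch size of two would instead pair the 50-token and 900-token inputs regardless of how much padding that causes.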

Self-Hosting & Configuration

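  • Docker images are available for CPU-only and NVIDIA GPU deployment, with separate image tags per GPU architecture; availability for AMD and Intel hardware depends on the release, so check the official image list before deploying
  • Supports models from the Hugging Face Hub whose architecture TEI implements, including many sentence-transformers embedding models and cross-encoder rerankers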
  • Configuration via command-line flags: --max-batch-tokens, --max-concurrent-requests, --dtype
  • Health check and metrics endpoints for production monitoring
  • Half-precision (FP16) inference on GPU via --dtype float16, with float32 as the alternative
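
To show how the flags above fit together, here is a small sketch that assembles a docker run command as an argument list. The function name tei_docker_command is hypothetical, the numeric values are illustrative rather than tuned recommendations, and the image reference is left to the caller because the right tag depends on your hardware and TEI release. The sketch assumes the container listens on port 80, as in the project's published Docker examples.

```python
import shlex


def tei_docker_command(image, model_id, *, host_port=8080, use_gpu=True,
                       max_batch_tokens=16384, max_concurrent_requests=512,
                       dtype=None):
    """Return an argv list that launches TEI in Docker with the given settings."""
    cmd = ["docker", "run", "--rm", "-p", f"{host_port}:80"]
    if use_gpu:
        cmd += ["--gpus", "all"]  # requires the NVIDIA container toolkit
    cmd += [
        image,
        "--model-id", model_id,
        "--max-batch-tokens", str(max_batch_tokens),
        "--max-concurrent-requests", str(max_concurrent_requests),
    ]
    if dtype is not None:
        cmd += ["--dtype", dtype]  # omitted by default so the server picks one
    return cmd


def as_shell(cmd):
    """Render the argv list as a copy-pasteable shell command."""
    return shlex.join(cmd)
```

Passing the list to subprocess.run avoids shell quoting issues, while as_shell(...) is convenient for logging the exact command that was launched.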

Key Features

  • Written in Rust for low memory overhead and high throughput
  • Token-based dynamic batching reduces padding waste on variable-length inputs compared with fixed-size batching
  • Flash Attention integration for long-context embedding models
  • Prometheus metrics endpoint for observability
  • Supports embedding, reranking, and sequence classification models with the same server binary (each instance serves one model)
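
For the reranking case, the following sketch builds a request for the /rerank route and orders the candidates by the returned scores. It assumes the server was started with a reranker model and that the response is a JSON list of objects carrying an "index" and a "score"; the base URL and the helper names are placeholders chosen for illustration.

```python
import json
import urllib.request


def build_rerank_request(query, texts, base_url="http://localhost:8080"):
    """Build (without sending) a POST request for the /rerank route."""
    body = json.dumps({"query": query, "texts": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/rerank",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def order_by_score(texts, rerank_response):
    """Pair each candidate text with its score, best match first.

    `rerank_response` is the decoded JSON body: a list of objects that each
    carry the candidate's position in `texts` ("index") and a relevance "score".
    """
    ranked = sorted(rerank_response, key=lambda item: item["score"], reverse=True)
    return [(texts[item["index"]], item["score"]) for item in ranked]
```

In a two-stage retrieval pipeline, the texts would typically be the top results from an embedding search, and the reordered list is what gets passed on to the downstream consumer.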

Comparison with Similar Tools

  • Sentence Transformers (Python): the training and inference library; TEI wraps models in a production server with batching and concurrency
  • Infinity: a Python-based embedding server; TEI's Rust backend is designed for low latency and high throughput, though relative performance depends on the model and hardware
  • vLLM: optimized for generative LLM serving; TEI is purpose-built for embedding and reranking workloads
  • Triton Inference Server: a general-purpose model server; TEI provides a simpler setup specifically for embedding models

FAQ

Q: Can TEI serve generative LLMs? A: No. TEI is specialized for embedding and reranking. Use Text Generation Inference (TGI) for generative models.

Q: Does it work without a GPU? A: Yes. CPU-only Docker images are available, though throughput is significantly lower.

Q: How do I use it with LangChain or LlamaIndex? A: Both frameworks can connect to a running TEI server, either through their Hugging Face embedding integrations or through the OpenAI-compatible API.
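
The OpenAI-compatible route can also be exercised directly, which is a useful check before wiring up a framework. The sketch below assumes the route is /v1/embeddings on a server at http://localhost:8080 and that the response follows the OpenAI embeddings shape (a "data" list whose items carry "index" and "embedding"). The model value is a placeholder, and the helper names are illustrative.

```python
import json
import urllib.request


def build_openai_embeddings_request(texts, model, base_url="http://localhost:8080"):
    """Build (without sending) an OpenAI-style embeddings request."""
    body = json.dumps({"input": texts, "model": model}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_embeddings(response_json):
    """Return the embedding vectors ordered by their "index" field."""
    items = sorted(response_json["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

Any client that lets you override the OpenAI base URL should be able to target the same route, which is what makes the drop-in integration possible.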

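Q: What embedding models work best? A: TEI supports many Hub models tagged sentence-transformers, provided their architecture is implemented. Popular choices include the BGE, E5, and GTE families; the best fit depends on your language, domain, and latency budget.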
