Cette page est affichée en anglais. Une traduction française est en cours.
ConfigsMay 24, 2026·3 min de lecture

TensorRT-LLM — High-Performance LLM Inference on NVIDIA GPUs

NVIDIA's open-source library for optimizing and deploying large language models with state-of-the-art inference performance on NVIDIA hardware.

Prêt pour agents

Cet actif peut être lu et installé directement par les agents

TokRepo expose une commande CLI universelle, un contrat d'installation, le metadata JSON, un plan selon l'adaptateur et le contenu raw pour aider les agents à juger l'adaptation, le risque et les prochaines actions.

Native · 98/100Policy : autoriser
Surface agent
Tout agent MCP/CLI
Type
Skill
Installation
Single
Confiance
Confiance : Established
Point d'entrée
TensorRT-LLM
Commande CLI universelle
npx tokrepo install 92079e30-57ad-11f1-9bc6-00163e2b0d79

Introduction

TensorRT-LLM is NVIDIA's open-source Python library that provides an easy-to-use API for defining, optimizing, and running LLM inference on NVIDIA GPUs. It combines TensorRT's deep learning compiler with LLM-specific optimizations like in-flight batching, paged KV caches, and custom CUDA kernels to achieve maximum throughput.

What TensorRT-LLM Does

  • Compiles LLM models into optimized TensorRT engines
  • Supports Llama, GPT, Mistral, Qwen, DeepSeek, and 50+ model architectures
  • Implements continuous batching and paged attention for high concurrency
  • Provides quantization (INT8, FP8, AWQ, GPTQ) for reduced memory usage
  • Runs on single GPUs through multi-node tensor-parallel deployments

Architecture Overview

TensorRT-LLM consists of a Python model definition layer, a graph compiler that lowers models to TensorRT engines, and a C++ runtime that handles scheduling, memory management, and execution. The runtime implements an inflight batching scheduler that dynamically inserts and removes requests, maximizing GPU utilization without waiting for the longest sequence in a batch.

Self-Hosting & Configuration

  • Requires NVIDIA GPUs with compute capability 8.0+ (Ampere, Hopper, Blackwell)
  • Install via pip or use the official NVIDIA Docker containers
  • Convert model checkpoints, then build engines with trtllm-build CLI
  • Configure tensor parallelism for multi-GPU inference via MPI
  • Supports Triton Inference Server integration for production serving

Key Features

  • FP8 quantization on Hopper/Blackwell GPUs for 2x throughput gains
  • Speculative decoding and Medusa heads for reduced latency
  • KV cache reuse across requests with paged memory management
  • Multi-node inference with NVLink and InfiniBand interconnects
  • OpenAI-compatible API server included for quick deployment

Comparison with Similar Tools

  • vLLM — pure Python, broader hardware support; TensorRT-LLM offers peak NVIDIA performance
  • SGLang — RadixAttention for prefix caching; TensorRT-LLM uses compiled graphs for throughput
  • llama.cpp — CPU and consumer GPU focus; TensorRT-LLM targets datacenter GPUs
  • DeepSpeed-FastGen — research-focused; TensorRT-LLM is NVIDIA's production path

FAQ

Q: Which GPUs are supported? A: Ampere (A100, A10G), Hopper (H100, H200), and Blackwell (B100, B200) series. Consumer GPUs like RTX 4090 work for smaller models.

Q: Can I use models from Hugging Face directly? A: Yes. Conversion scripts exist for most popular architectures. Convert checkpoints then build engines.

Q: How does it compare to vLLM performance? A: On NVIDIA GPUs, TensorRT-LLM typically achieves higher throughput due to compiled execution and hardware-specific kernels, especially with FP8.

Q: Is it suitable for real-time applications? A: Yes. The C++ runtime is designed for low-latency serving with continuous batching and streaming token output.

Sources

Fil de discussion

Connectez-vous pour rejoindre la discussion.
Aucun commentaire pour l'instant. Soyez le premier à partager votre avis.

Actifs similaires