Esta página se muestra en inglés. Una traducción al español está en curso.
ConfigsMay 24, 2026·3 min de lectura

TensorRT-LLM — High-Performance LLM Inference on NVIDIA GPUs

NVIDIA's open-source library for optimizing and deploying large language models with state-of-the-art inference performance on NVIDIA hardware.

Listo para agents

Este activo puede ser leído e instalado directamente por agents

TokRepo expone un comando CLI universal, contrato de instalación, metadata JSON, plan según adaptador y contenido raw para que los agents evalúen compatibilidad, riesgo y próximos pasos.

Native · 98/100Política: permitir
Superficie agent
Cualquier agent MCP/CLI
Tipo
Skill
Instalación
Single
Confianza
Confianza: Established
Entrada
TensorRT-LLM
Comando CLI universal
npx tokrepo install 92079e30-57ad-11f1-9bc6-00163e2b0d79

Introduction

TensorRT-LLM is NVIDIA's open-source Python library that provides an easy-to-use API for defining, optimizing, and running LLM inference on NVIDIA GPUs. It combines TensorRT's deep learning compiler with LLM-specific optimizations like in-flight batching, paged KV caches, and custom CUDA kernels to achieve maximum throughput.

What TensorRT-LLM Does

  • Compiles LLM models into optimized TensorRT engines
  • Supports Llama, GPT, Mistral, Qwen, DeepSeek, and 50+ model architectures
  • Implements continuous batching and paged attention for high concurrency
  • Provides quantization (INT8, FP8, AWQ, GPTQ) for reduced memory usage
  • Runs on single GPUs through multi-node tensor-parallel deployments

Architecture Overview

TensorRT-LLM consists of a Python model definition layer, a graph compiler that lowers models to TensorRT engines, and a C++ runtime that handles scheduling, memory management, and execution. The runtime implements an inflight batching scheduler that dynamically inserts and removes requests, maximizing GPU utilization without waiting for the longest sequence in a batch.

Self-Hosting & Configuration

  • Requires NVIDIA GPUs with compute capability 8.0+ (Ampere, Hopper, Blackwell)
  • Install via pip or use the official NVIDIA Docker containers
  • Convert model checkpoints, then build engines with trtllm-build CLI
  • Configure tensor parallelism for multi-GPU inference via MPI
  • Supports Triton Inference Server integration for production serving

Key Features

  • FP8 quantization on Hopper/Blackwell GPUs for 2x throughput gains
  • Speculative decoding and Medusa heads for reduced latency
  • KV cache reuse across requests with paged memory management
  • Multi-node inference with NVLink and InfiniBand interconnects
  • OpenAI-compatible API server included for quick deployment

Comparison with Similar Tools

  • vLLM — pure Python, broader hardware support; TensorRT-LLM offers peak NVIDIA performance
  • SGLang — RadixAttention for prefix caching; TensorRT-LLM uses compiled graphs for throughput
  • llama.cpp — CPU and consumer GPU focus; TensorRT-LLM targets datacenter GPUs
  • DeepSpeed-FastGen — research-focused; TensorRT-LLM is NVIDIA's production path

FAQ

Q: Which GPUs are supported? A: Ampere (A100, A10G), Hopper (H100, H200), and Blackwell (B100, B200) series. Consumer GPUs like RTX 4090 work for smaller models.

Q: Can I use models from Hugging Face directly? A: Yes. Conversion scripts exist for most popular architectures. Convert checkpoints then build engines.

Q: How does it compare to vLLM performance? A: On NVIDIA GPUs, TensorRT-LLM typically achieves higher throughput due to compiled execution and hardware-specific kernels, especially with FP8.

Q: Is it suitable for real-time applications? A: Yes. The C++ runtime is designed for low-latency serving with continuous batching and streaming token output.

Sources

Discusión

Inicia sesión para unirte a la discusión.
Aún no hay comentarios. Sé el primero en compartir tus ideas.

Activos relacionados