# TensorRT — High-Performance Deep Learning Inference by NVIDIA

> NVIDIA's SDK for optimizing trained deep learning models for production inference, delivering low latency and high throughput on NVIDIA GPUs through graph optimization, kernel fusion, and precision calibration.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use
```bash
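pip install tensorrt
# trtexec ships with the TensorRT tar/deb packages and NGC containers; the pip wheel may not provide it
# Convert an ONNX model to a TensorRT engine with FP16 enabled
trtexec --onnx=model.onnx --saveEngine=model.trt --fp16
# Benchmark the saved engine (batch size comes from the ONNX input shape, or from --shapes for dynamic inputs)
trtexec --loadEngine=model.trt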
```

## Introduction
TensorRT is NVIDIA's inference optimization SDK that takes trained models from any major framework, applies graph-level and kernel-level optimizations, and produces deployment-ready engines that maximize throughput and minimize latency on NVIDIA GPUs. It handles precision calibration, layer fusion, and memory management automatically.

## What TensorRT Does
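- Converts models exported to ONNX (for example from PyTorch or TensorFlow) into optimized inference engines; framework integrations such as Torch-TensorRT and TF-TRT are also available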
- Applies layer and tensor fusion to reduce memory bandwidth and kernel launch overhead
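- Runs layers in reduced precision (FP16 or INT8); INT8 uses calibration data to choose per-tensor ranges so that accuracy loss stays small, while FP16 needs no calibration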
- Performs kernel auto-tuning for the target GPU architecture
- Supports dynamic batch sizes and input shapes for flexible deployment
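
The fusion bullet above can be illustrated with a toy sketch in plain Python. This is not TensorRT code: the `unfused` and `fused` functions are hypothetical stand-ins for GPU kernels, and the point is only that a fused kernel produces the same result while touching each element once instead of writing intermediate tensors.

```python
# Toy illustration of layer fusion; not TensorRT code.

def unfused(x, scale, bias):
    """Three separate 'kernels', each reading and writing a full tensor."""
    scaled = [v * scale for v in x]        # kernel 1: scale
    biased = [v + bias for v in scaled]    # kernel 2: bias add
    return [max(v, 0.0) for v in biased]   # kernel 3: ReLU


def fused(x, scale, bias):
    """One 'kernel': each element is read once and written once."""
    return [max(v * scale + bias, 0.0) for v in x]


x = [-2.0, -0.5, 0.0, 1.5, 3.0]
assert unfused(x, 2.0, 1.0) == fused(x, 2.0, 1.0)
print(fused(x, 2.0, 1.0))  # [0.0, 0.0, 1.0, 4.0, 7.0]
```

On a GPU the saving comes from skipping the two intermediate tensors and two extra kernel launches, which is where the reduced memory bandwidth and launch overhead come from.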

## Architecture Overview
TensorRT operates in two phases. The build phase analyzes the computation graph, fuses operations, selects kernels from a library, and calibrates reduced precision where requested. The runtime phase executes the resulting plan (the serialized engine) on the GPU. During the build, the builder times multiple kernel implementations per layer and keeps the fastest for the target hardware, which is why an engine should be built on the same GPU model it will run on. Activation memory is sized at build time and allocated up front when an execution context is created, which avoids allocation overhead during inference.
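
The kernel selection step can be pictured with a small timing loop. The sketch below is an analogy in plain Python, not TensorRT's actual tuner: the three `sum_squares_*` functions are hypothetical "kernels" that compute the same result, and `pick_fastest` simply times each one on a sample input and keeps the quickest.

```python
import time


def sum_squares_loop(values):
    total = 0
    for v in values:
        total += v * v
    return total


def sum_squares_genexpr(values):
    return sum(v * v for v in values)


def sum_squares_map(values):
    return sum(map(lambda v: v * v, values))


def pick_fastest(candidates, sample, repeats=5):
    """Time every candidate on the sample and return (best_name, timings)."""
    timings = {}
    for name, fn in candidates.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn(sample)
            best = min(best, time.perf_counter() - start)
        timings[name] = best  # best-of-N reduces noise from other processes
    return min(timings, key=timings.get), timings


CANDIDATES = {
    "loop": sum_squares_loop,
    "genexpr": sum_squares_genexpr,
    "map": sum_squares_map,
}
sample = list(range(10_000))
chosen, timings = pick_fastest(CANDIDATES, sample)
print("selected implementation:", chosen)
```

The winner depends on the machine running the benchmark, which mirrors why a TensorRT engine is tied to the GPU it was built for.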

## Self-Hosting & Configuration
- Install via pip (`pip install tensorrt`) or NVIDIA's container images
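- Requires an NVIDIA GPU and a compatible CUDA toolkit; the minimum compute capability (6.0 for some releases, higher for newer ones) is listed in the support matrix for each TensorRT release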
- Export models to ONNX for the most portable conversion path
- Use the `trtexec` CLI for quick benchmarks or the Python/C++ API for integration
- Configure workspace size, precision mode, and calibration dataset in builder config
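
As a rough sketch of the Python API path, the function below builds an engine from an ONNX file with a workspace limit and optional FP16. It assumes a reasonably recent TensorRT release (one that provides `set_memory_pool_limit` and `build_serialized_network`); the function name `build_engine`, its parameters, and the 1 GiB default are illustrative choices, not values taken from the TensorRT documentation.

```python
def build_engine(onnx_path, engine_path, fp16=True, workspace_bytes=1 << 30):
    """Build a serialized TensorRT engine from an ONNX file (illustrative sketch)."""
    import tensorrt as trt  # imported lazily so this module loads without TensorRT

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)

    # Older releases require the EXPLICIT_BATCH flag; newer ones are always
    # explicit-batch, so fall back to 0 if the flag is no longer defined.
    flags = 0
    explicit_batch = getattr(trt.NetworkDefinitionCreationFlag, "EXPLICIT_BATCH", None)
    if explicit_batch is not None:
        flags = 1 << int(explicit_batch)
    network = builder.create_network(flags)

    # Parse the ONNX model into the network definition.
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed: " + "; ".join(errors))

    # Builder config: workspace limit and precision mode.
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, workspace_bytes)
    if fp16:
        config.set_flag(trt.BuilderFlag.FP16)

    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("engine build failed; see the TensorRT log output")
    with open(engine_path, "wb") as f:
        f.write(serialized)
    return engine_path
```

Calling `build_engine("model.onnx", "model.trt")` on a machine with TensorRT and a supported GPU should produce roughly what the `trtexec --onnx=... --saveEngine=... --fp16` command does. INT8 calibration and optimization profiles for dynamic shapes are left out to keep the sketch short.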

## Key Features
- Automatic kernel fusion merges convolution, bias, and activation into single kernels
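- INT8 calibration (entropy, min-max, or legacy percentile calibrators) limits accuracy loss at reduced precision
- Dynamic shape support handles variable-size inputs without rebuilding, as long as the shapes stay within the ranges declared in an optimization profile at build time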
- Plugin API allows custom layer implementations in CUDA
- Multi-stream execution for concurrent inference on the same GPU
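
To make the INT8 calibration bullet concrete, here is a simplified sketch of symmetric quantization in plain Python. It is not TensorRT's calibrator: `calibrate_scale` uses a plain max or percentile clip as a stand-in (the entropy method is more involved and is not reproduced here), and all function names are hypothetical.

```python
def calibrate_scale(samples, percentile=100.0):
    """Derive an INT8 scale from calibration samples.

    percentile=100.0 is plain max calibration; lower values clip outliers.
    """
    magnitudes = sorted(abs(v) for v in samples)
    index = min(len(magnitudes) - 1, int(len(magnitudes) * percentile / 100.0))
    amax = magnitudes[index]  # dynamic range chosen for this tensor
    return amax / 127.0


def quantize(value, scale):
    """Map a float to a signed 8-bit integer, saturating at the INT8 limits."""
    return max(-128, min(127, round(value / scale)))


def dequantize(q, scale):
    """Map an INT8 value back to an approximate float."""
    return q * scale


activations = [0.1, -0.4, 0.25, 1.27, -0.9]
scale = calibrate_scale(activations)
for v in activations + [5.0]:
    q = quantize(v, scale)
    print(f"{v:+.2f} -> {q:+4d} -> {dequantize(q, scale):+.4f}")
```

Values inside the calibrated range round-trip with an error of at most half a quantization step, while the out-of-range `5.0` saturates at 127. That trade-off between clipping outliers and resolution for typical values is what the choice of calibration method controls.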

## Comparison with Similar Tools
- **ONNX Runtime** — cross-platform inference; TensorRT is faster on NVIDIA GPUs but NVIDIA-only
- **TorchScript/torch.compile** — PyTorch-native and simpler to adopt; usually leaves more performance on the table than a TensorRT engine on NVIDIA GPUs
- **OpenVINO** — Intel's inference toolkit for CPU/iGPU; TensorRT targets NVIDIA
- **TVM** — compiler-based approach with broader hardware support but more setup
- **Triton Inference Server** — model serving layer that uses TensorRT as a backend

## FAQ
**Q: Does TensorRT require retraining the model?**
A: No. It optimizes already-trained models. INT8 mode requires a small calibration dataset but no gradient updates.

**Q: How much speedup can I expect?**
A: Typically 2-6x over unoptimized PyTorch inference, depending on model architecture and precision mode.

**Q: Can I use TensorRT on Jetson edge devices?**
A: Yes. TensorRT is included in NVIDIA JetPack for Jetson Nano, Xavier, and Orin platforms.

**Q: Is TensorRT open source?**
A: The core runtime is proprietary, but the open-source repository includes parsers, plugins, and sample code under Apache 2.0.

## Sources
- https://github.com/NVIDIA/TensorRT
- https://developer.nvidia.com/tensorrt

---
Source: https://tokrepo.com/en/workflows/tensorrt-high-performance-deep-learning-inference-nvidia-b7039131
Author: NVIDIA