# TensorRT: High-Performance Deep Learning Inference by NVIDIA

> NVIDIA's SDK for optimizing trained deep learning models for production inference, delivering low latency and high throughput on NVIDIA GPUs through graph optimization, kernel fusion, and precision calibration.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use

```bash
pip install tensorrt

# trtexec is not in the pip wheel; it ships with the TensorRT tar/deb packages and NVIDIA's containers
# Convert ONNX model to TensorRT engine
trtexec --onnx=model.onnx --saveEngine=model.trt --fp16

# Run inference (omit --batch: engines built from ONNX use explicit batch dimensions)
trtexec --loadEngine=model.trt
```

## Introduction

TensorRT is NVIDIA's inference optimization SDK. It takes trained models from the major frameworks (most commonly via ONNX export), applies graph-level and kernel-level optimizations, and produces deployment-ready engines that maximize throughput and minimize latency on NVIDIA GPUs. Layer fusion, kernel selection, and memory management are handled automatically; INT8 calibration additionally needs a small representative dataset.

## What TensorRT Does

- Converts ONNX models, including those exported from TensorFlow and PyTorch, to optimized inference engines
- Applies layer and tensor fusion to reduce memory bandwidth and kernel launch overhead
- Runs layers in FP16 or INT8 where requested, using calibration for INT8 to limit accuracy loss
- Performs kernel auto-tuning for the target GPU architecture
- Supports dynamic batch sizes and input shapes for flexible deployment

## Architecture Overview

TensorRT operates in two phases:

- **Build phase:** analyzes the computation graph, fuses operations, selects kernels from a library, and calibrates reduced precision. The builder times multiple kernel implementations per layer and keeps the fastest for the target hardware.
- **Runtime phase:** loads the resulting plan (the serialized engine) and executes it on the GPU. Activation memory is allocated up front rather than per inference call, which avoids runtime allocation overhead.

Because kernel selection is tied to the hardware the builder ran on, an engine should be built on the same GPU model it will be deployed to.
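The two phases described above map onto the two `trtexec` invocations from Quick Use: one builds and saves the engine, the other loads and runs it. The sketch below wraps them in Python. The helper names (`build_command`, `run_command`, `execute`) are hypothetical and chosen for illustration; only the `--onnx`, `--saveEngine`, `--fp16`, and `--loadEngine` flags come from this document, and the file names are placeholders.

```python
import shutil
import subprocess
from typing import List, Optional


def build_command(onnx_path: str, engine_path: str, fp16: bool = True) -> List[str]:
    """Build phase: parse the ONNX graph, optimize it, and save the engine."""
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        cmd.append("--fp16")  # allow FP16 kernels where the builder finds them faster
    return cmd


def run_command(engine_path: str) -> List[str]:
    """Runtime phase: load a previously built engine and time inference."""
    return ["trtexec", f"--loadEngine={engine_path}"]


def execute(cmd: List[str]) -> Optional[int]:
    """Run cmd and return its exit code, or None if trtexec is not on PATH."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=False).returncode


if __name__ == "__main__":
    # Placeholder paths; substitute your own model and engine files.
    for cmd in (build_command("model.onnx", "model.trt"), run_command("model.trt")):
        print(" ".join(cmd))
        execute(cmd)
```

Keeping the build and run steps separate mirrors a typical deployment: the slow build happens once per target GPU, and the saved engine is reused for every later inference run.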
## Self-Hosting & Configuration

- Install via pip (`pip install tensorrt`) or NVIDIA's container images
- Requires an NVIDIA GPU and the CUDA toolkit; compute capability 6.0+ is a common baseline, but the exact minimum depends on the TensorRT release (see NVIDIA's support matrix)
- Export models to ONNX for the most portable conversion path
- Use the `trtexec` CLI for quick benchmarks, or the Python/C++ API for integration
- Configure workspace size, precision mode, and calibration dataset in the builder config

## Key Features

- Automatic kernel fusion merges convolution, bias, and activation into single kernels
- INT8 calibration with entropy or percentile methods limits accuracy loss
- Dynamic shape support handles variable-size inputs without rebuilding, as long as they fall within the shape ranges declared at build time
- Plugin API allows custom layer implementations in CUDA
- Multi-stream execution for concurrent inference on the same GPU

## Comparison with Similar Tools

- **ONNX Runtime**: cross-platform inference; TensorRT is usually faster on NVIDIA GPUs but is NVIDIA-only
- **TorchScript/torch.compile**: PyTorch-native; generally less aggressively optimized for NVIDIA GPUs than a TensorRT engine
- **OpenVINO**: Intel's inference toolkit for CPU/iGPU; TensorRT targets NVIDIA
- **TVM**: compiler-based approach with broader hardware support but more setup
- **Triton Inference Server**: model serving layer that can use TensorRT as one of its backends

## FAQ

**Q: Does TensorRT require retraining the model?**
A: No. It optimizes already-trained models. INT8 mode requires a small calibration dataset but no gradient updates.

**Q: How much speedup can I expect?**
A: Typically 2-6x over unoptimized PyTorch inference, depending on model architecture and precision mode. Results vary, so benchmark your own model on the target GPU.

**Q: Can I use TensorRT on Jetson edge devices?**
A: Yes. TensorRT is included in NVIDIA JetPack for Jetson Nano, Xavier, and Orin platforms.

**Q: Is TensorRT open source?**
A: The core runtime is proprietary, but the open-source repository includes parsers, plugins, and sample code under Apache 2.0.
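For the speedup question in the FAQ above, one way to get a number for your own model is to time the baseline and the TensorRT path with the same harness. The following is a minimal sketch, not an official benchmarking tool: `measure_latency_ms` and `speedup` are hypothetical helper names, and the callables you pass in are assumed to block until the GPU has finished (for example by synchronizing before returning). Otherwise wall-clock timing undercounts asynchronous GPU work.

```python
import statistics
import time
from typing import Callable, List


def measure_latency_ms(fn: Callable[[], object], warmup: int = 10, iters: int = 100) -> float:
    """Return the median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are discarded (lazy initialization, caches, clocks)
    samples: List[float] = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Ratio of baseline latency to optimized latency (values above 1 mean faster)."""
    if optimized_ms <= 0:
        raise ValueError("optimized_ms must be positive")
    return baseline_ms / optimized_ms
```

In use, you would pass one callable that runs the original framework model and one that runs the TensorRT engine on the same input, then call `speedup(measure_latency_ms(run_baseline), measure_latency_ms(run_engine))`, where `run_baseline` and `run_engine` are your own functions. The median is used because it is less sensitive to occasional slow iterations than the mean.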
## Sources

- https://github.com/NVIDIA/TensorRT
- https://developer.nvidia.com/tensorrt

---

Source: https://tokrepo.com/en/workflows/tensorrt-high-performance-deep-learning-inference-nvidia-b7039131

Author: NVIDIA