# NVIDIA Triton Inference Server — Multi-Framework Model Serving at Scale

> Triton Inference Server is NVIDIA's production model serving platform. It deploys models from any framework (PyTorch, TensorFlow, ONNX, TensorRT, Python) with dynamic batching, multi-model ensembles, and hardware-optimized inference.

## Install

Save in your project root:

## Quick Use

```bash
# Run with the official NGC image
docker run --gpus all -d --name triton \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $PWD/model_repository:/models \
  nvcr.io/nvidia/tritonserver:24.07-py3 \
  tritonserver --model-repository=/models

# Health
curl http://localhost:8000/v2/health/ready

# List models (Triton's repository index extension to the v2 protocol)
curl -X POST http://localhost:8000/v2/repository/index
```

```
# Model repository layout
model_repository/
  bert_base/
    config.pbtxt      # backend, inputs/outputs, dynamic batching
    1/                # version directory
      model.onnx
```

## Introduction

Triton Inference Server is NVIDIA's open-source production server for ML models. It was the answer to the "every team uses a different framework" problem: instead of standing up a separate server per framework, Triton hosts PyTorch, TensorFlow, ONNX, TensorRT, OpenVINO, RAPIDS, custom Python, and even C++ backends side by side.

With over 10,000 GitHub stars, Triton powers production inference at NVIDIA, Microsoft Azure, Snap, Yahoo Japan, and hundreds of enterprises. It pairs with TensorRT-LLM for state-of-the-art LLM inference performance on NVIDIA GPUs.

## What Triton Does

Triton loads models from a "model repository" (filesystem, S3, Azure Blob, GCS) into matching backends. It exposes HTTP/gRPC APIs for inference and metrics, batches requests dynamically, supports model ensembles (a DAG of models), provides per-model versioning and A/B routing, and surfaces detailed Prometheus metrics for fleet operations.
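The HTTP endpoint implements the KServe v2 inference protocol, so a request is just a JSON document. As a minimal sketch, here is how a request body for the `bert_base` model from the repository layout above could be built by hand; the field names (`inputs`, `name`, `shape`, `datatype`, `data`) follow the v2 spec, while the placeholder token values and the local URL are assumptions for illustration:

```python
import json

# KServe v2 inference request body for the hypothetical "bert_base" model.
# Tensor data is sent as a flattened, row-major list of values.
seq_len = 128
payload = {
    "inputs": [
        {
            "name": "input_ids",
            "shape": [1, seq_len],
            "datatype": "INT64",
            "data": [0] * seq_len,   # placeholder token ids
        },
        {
            "name": "attention_mask",
            "shape": [1, seq_len],
            "datatype": "INT64",
            "data": [1] * seq_len,   # attend to every position
        },
    ]
}

body = json.dumps(payload)
# This body would be POSTed to:
#   http://localhost:8000/v2/models/bert_base/infer
print(len(body))
```

The response mirrors the same shape: an `outputs` list with `name`, `shape`, `datatype`, and flattened `data` for each output tensor (here, `logits`).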
## Architecture Overview

```
        Clients (HTTP / gRPC / C API)
                     |
              [Triton Server]
      request scheduler + dynamic batcher
        model versioning, ensembles
            shared memory I/O
                     |
    +--------+--------+--------+--------+
    |        |        |        |        |
TensorRT  PyTorch   ONNX      TF     Python    vLLM/TRT-LLM
backend   backend  backend  backend  backend
                     |
   GPU(s) — kernel scheduling, MIG, MPS
                     |
           [Metrics + Tracing]
        Prometheus, OpenTelemetry
                     |
  [Model Analyzer + Performance Analyzer tools]
```

## Self-Hosting & Configuration

```protobuf
# config.pbtxt — example for an ONNX text classifier
name: "bert_base"
platform: "onnxruntime_onnx"
max_batch_size: 32
input [
  { name: "input_ids", data_type: TYPE_INT64, dims: [ 128 ] },
  { name: "attention_mask", data_type: TYPE_INT64, dims: [ 128 ] }
]
output [
  { name: "logits", data_type: TYPE_FP32, dims: [ 2 ] }
]
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 5000
}
instance_group [{ count: 2, kind: KIND_GPU }]
```

```python
# Python client
import tritonclient.http as httpclient
import numpy as np

client = httpclient.InferenceServerClient(url="localhost:8000")

in_ids = httpclient.InferInput("input_ids", [1, 128], "INT64")
in_ids.set_data_from_numpy(np.zeros((1, 128), dtype=np.int64))
attn = httpclient.InferInput("attention_mask", [1, 128], "INT64")
attn.set_data_from_numpy(np.ones((1, 128), dtype=np.int64))

result = client.infer("bert_base", inputs=[in_ids, attn])
print(result.as_numpy("logits"))
```

## Key Features

- **Multi-framework backends** — PyTorch, TF, ONNX, TensorRT, OpenVINO, Python, custom
- **Dynamic batching** — requests batched on-the-fly for higher throughput
- **Ensembles** — pipeline multiple models without inter-process latency
- **Model versioning + A/B** — host multiple versions, route by policy
- **Multi-GPU / multi-instance** — dispatch across GPUs (or MIG slices)
- **HTTP/gRPC + KServe v2 protocol** — standard inference protocol
- **Performance Analyzer tool** — find optimal batch/instance counts
- **TensorRT-LLM integration** — serve LLMs
with NVIDIA's tuned engines

## Comparison with Similar Tools

| Feature | Triton | TorchServe | TensorFlow Serving | KServe | BentoML |
|---|---|---|---|---|---|
| Multi-framework | Yes (broadest) | PyTorch | TF | Pluggable (Triton inside) | Many |
| Dynamic batching | Yes | Yes | Yes | Yes | Yes |
| Ensembles | Yes (built-in DAG) | Limited | Limited | Via pipelines | Via runners |
| GPU optimization | Best (NVIDIA-native) | Good | Good | Depends | Depends |
| Best For | Multi-framework production fleets | PyTorch shops | TF shops | k8s-native serving | Python-heavy ML pipelines |

## FAQ

**Q: Triton vs TGI/vLLM for LLMs?**
A: TGI and vLLM are LLM-specific, with continuous batching tuned for autoregressive generation. Triton + TensorRT-LLM matches that performance and lets you serve other model types in the same fleet. For LLM-only stacks, vLLM/TGI are simpler.

**Q: Does Triton require NVIDIA GPUs?**
A: No — backends include CPU paths (ONNX, OpenVINO, Python). But NVIDIA features (MIG, TensorRT, MPS) are first-class and the project is NVIDIA-owned.

**Q: How does it integrate with Kubernetes?**
A: A common pattern is Triton pods behind a Service, with the model repository on S3 or a PVC. KServe wraps Triton as a backend for declarative model deployments.

**Q: Can I write a custom backend?**
A: Yes — via the Python backend (write a Python class), a C++ backend (high performance), or Business Logic Scripting (BLS) for orchestration. NeMo Guardrails and TensorRT-LLM use these mechanisms to extend Triton.

## Sources

- GitHub: https://github.com/triton-inference-server/server
- Docs: https://docs.nvidia.com/deeplearning/triton-inference-server
- Company: NVIDIA
- License: BSD-3-Clause

---

Source: https://tokrepo.com/en/workflows/e0a9738b-37db-11f1-9bc6-00163e2b0d79
Author: AI Open Source
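The dynamic batching discussed throughout can be pictured with a toy scheduler: queue incoming requests, then flush a batch either when a preferred size is reached or when the oldest queued request has waited longer than the queue-delay budget. The sketch below is a simplified illustration of that policy under those assumptions, not Triton's actual scheduler (which also weighs instance availability and multiple preferred sizes):

```python
from collections import deque

def batch_requests(arrivals, preferred=(4, 8, 16), max_delay_us=5000):
    """Toy dynamic batcher over (arrival_time_us, request_id) tuples.

    Flush when the queue reaches the largest preferred size, or when
    the oldest queued request has exceeded max_delay_us. A deliberate
    simplification of Triton's scheduler, for intuition only.
    """
    queue = deque()
    batches = []
    for t, rid in arrivals:
        # If the oldest request has blown its delay budget, flush first.
        if queue and t - queue[0][0] > max_delay_us:
            batches.append([r for _, r in queue])
            queue.clear()
        queue.append((t, rid))
        if len(queue) == max(preferred):
            batches.append([r for _, r in queue])
            queue.clear()
    if queue:  # final flush at shutdown
        batches.append([r for _, r in queue])
    return batches

# Five requests in a tight burst, one arriving ~10 ms later.
arrivals = [(0, "a"), (100, "b"), (200, "c"), (300, "d"), (400, "e"),
            (10_400, "f")]
print(batch_requests(arrivals))
# → [['a', 'b', 'c', 'd', 'e'], ['f']]
```

The burst is flushed as one batch once the 5 ms delay budget expires, so the straggler lands in a second batch — the same throughput-versus-latency trade that `max_queue_delay_microseconds` tunes in `config.pbtxt`.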