Introduction
Triton Inference Server is NVIDIA's open-source production server for ML models. It was the answer to the "every team uses a different framework" problem: instead of standing up a separate server per framework, Triton hosts PyTorch, TensorFlow, ONNX, TensorRT, OpenVINO, RAPIDS, custom Python, and even C++ backends side-by-side.
With over 10,000 GitHub stars, Triton powers production inference at NVIDIA, Microsoft Azure, Snap, Yahoo Japan, and hundreds of enterprises. It pairs with TensorRT-LLM for state-of-the-art LLM inference performance on NVIDIA GPUs.
What Triton Does
Triton loads models from a "model repository" (filesystem, S3, Azure Blob, GCS) into matching backends. It exposes HTTP/gRPC APIs for inference and metrics, batches requests dynamically, supports model ensembles (DAG of models), provides per-model versioning and A/B routing, and surfaces detailed Prometheus metrics for fleet operations.
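The model repository follows a fixed on-disk convention: one directory per model, a config.pbtxt beside numbered version directories, and the model file inside each version directory. A minimal sketch (the model name and file are illustrative, matching the ONNX example later in this article):

```shell
# Build the layout Triton expects for one model, "bert_base".
mkdir -p model_repository/bert_base/1
# config.pbtxt sits next to the numbered version directories...
touch model_repository/bert_base/config.pbtxt
# ...and the actual model file lives inside a version directory.
touch model_repository/bert_base/1/model.onnx
find model_repository -type f | sort
```

Pointing `tritonserver --model-repository` at `model_repository/` is then enough for the server to discover and load the model.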
Architecture Overview
Clients (HTTP / gRPC / C API)
|
[Triton Server]
request scheduler + dynamic batcher
model versioning, ensembles
shared memory I/O
|
+---------+---------+---------+---------+---------+
|         |         |         |         |         |
TensorRT  PyTorch    ONNX       TF      Python   vLLM/TRT-LLM
backend   backend   backend   backend   backend    backends
|
GPU(s) — kernel scheduling, MIG, MPS
|
[Metrics + Tracing]
Prometheus, OpenTelemetry
|
[Model Analyzer + Performance Analyzer tools]

Self-Hosting & Configuration
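The usual way to self-host is the NGC container. A minimal launch sketch (the image tag is a placeholder; substitute a real release, and adjust the volume path to your repository):

```shell
# Launch Triton from the NGC container; ports are HTTP (8000),
# gRPC (8001), and Prometheus metrics (8002).
docker run --rm --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v "$PWD/model_repository:/models" \
  nvcr.io/nvidia/tritonserver:<xx.yy>-py3 \
  tritonserver --model-repository=/models

# Readiness check via the KServe v2 HTTP API:
curl -sf localhost:8000/v2/health/ready && echo "server ready"
```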
# config.pbtxt — example for an ONNX text classifier
name: "bert_base"
platform: "onnxruntime_onnx"
max_batch_size: 32
input [
{ name: "input_ids", data_type: TYPE_INT64, dims: [128] },
{ name: "attention_mask", data_type: TYPE_INT64, dims: [128] }
]
output [
{ name: "logits", data_type: TYPE_FP32, dims: [2] }
]
dynamic_batching {
preferred_batch_size: [4, 8, 16]
max_queue_delay_microseconds: 5000
}
instance_group [{ count: 2, kind: KIND_GPU }]

# Python client
import tritonclient.http as httpclient
import numpy as np
client = httpclient.InferenceServerClient(url="localhost:8000")
in_ids = httpclient.InferInput("input_ids", [1, 128], "INT64")
in_ids.set_data_from_numpy(np.zeros((1, 128), dtype=np.int64))
attn = httpclient.InferInput("attention_mask", [1, 128], "INT64")
attn.set_data_from_numpy(np.ones((1, 128), dtype=np.int64))
result = client.infer("bert_base", inputs=[in_ids, attn])
print(result.as_numpy("logits"))

Key Features
- Multi-framework backends — PyTorch, TF, ONNX, TensorRT, OpenVINO, Python, custom
- Dynamic batching — requests batched on-the-fly for higher throughput
- Ensembles — pipeline multiple models without inter-process latency
- Model versioning + A/B — host multiple versions, route by policy
- Multi-GPU / multi-instance — dispatch across GPUs (or MIG slices)
- HTTP/gRPC + KServe v2 protocol — standard inference protocol
- Performance Analyzer tool — find optimal batch/instance counts
- TensorRT-LLM integration — serve LLMs with NVIDIA's tuned engines
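The dynamic batching configured earlier (preferred batch sizes plus a queue-delay budget) can be pictured as a queue that flushes on either condition. This is a toy illustration of the idea, not Triton's scheduler:

```python
import time
from collections import deque

class ToyDynamicBatcher:
    """Illustrative sketch of dynamic batching: collect requests until a
    preferred batch size fills, or until the oldest request has waited
    past the queue-delay budget, then emit one batch."""

    def __init__(self, preferred_batch_sizes=(4, 8, 16), max_queue_delay_s=0.005):
        self.preferred = sorted(preferred_batch_sizes)
        self.max_delay = max_queue_delay_s
        self.queue = deque()
        self.oldest_enqueue_time = None

    def enqueue(self, request):
        # Track when the current queue started filling.
        if not self.queue:
            self.oldest_enqueue_time = time.monotonic()
        self.queue.append(request)

    def maybe_form_batch(self, now=None):
        """Return a batch (list of requests) if one should be emitted, else None."""
        if not self.queue:
            return None
        now = time.monotonic() if now is None else now
        # Flush as soon as the largest preferred size can be filled...
        if len(self.queue) >= self.preferred[-1]:
            return [self.queue.popleft() for _ in range(self.preferred[-1])]
        # ...or when the delay budget for the oldest request expires.
        if now - self.oldest_enqueue_time >= self.max_delay:
            batch = list(self.queue)
            self.queue.clear()
            return batch
        return None
```

The trade-off is the same one the `max_queue_delay_microseconds` setting expresses: a longer delay yields fuller batches (throughput) at the cost of per-request latency.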
Comparison with Similar Tools
| Feature | Triton | TorchServe | TensorFlow Serving | KServe | BentoML |
|---|---|---|---|---|---|
| Multi-framework | Yes (broadest) | PyTorch | TF | Pluggable (Triton inside) | Many |
| Dynamic batching | Yes | Yes | Yes | Yes | Yes |
| Ensembles | Yes (built-in DAG) | Limited | Limited | Via pipelines | Via runners |
| GPU optimization | Best (NVIDIA-native) | Good | Good | Depends | Depends |
| Best For | Multi-framework production fleets | PyTorch shops | TF shops | k8s-native serving | Python-heavy ML pipelines |
FAQ
Q: Triton vs TGI/vLLM for LLMs? A: TGI/vLLM are LLM-specific, with continuous batching tuned for autoregressive generation. Triton + TensorRT-LLM matches that performance and lets you serve other model types in the same fleet. For LLM-only stacks, vLLM/TGI are simpler.
Q: Does Triton require NVIDIA GPUs? A: No — backends include CPU (ONNX, OpenVINO, Python). But NVIDIA features (MIG, TensorRT, MPS) are first-class and the project is NVIDIA-owned.
Q: How does it integrate with Kubernetes? A: Common pattern: Triton pods behind a Service, model repository on S3/PVC. KServe wraps Triton as a backend for declarative model deployments.
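The Kubernetes pattern above might look like this minimal sketch (names, image tag, bucket path, and replica count are placeholders, not from the source):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton
spec:
  replicas: 2
  selector:
    matchLabels: { app: triton }
  template:
    metadata:
      labels: { app: triton }
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:<xx.yy>-py3  # pick a real tag
          args: ["tritonserver", "--model-repository=s3://my-bucket/models"]
          ports:
            - containerPort: 8000  # HTTP
            - containerPort: 8001  # gRPC
            - containerPort: 8002  # Prometheus metrics
          resources:
            limits:
              nvidia.com/gpu: 1
```

A Service in front of these pods completes the pattern; KServe generates roughly this machinery from a declarative InferenceService instead.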
Q: Can I write a custom backend? A: Yes — Python backend (write a Python class), C++ backend (high performance), or Business Logic Scripting (BLS) for orchestration. NeMo Guardrails / TensorRT-LLM use these to extend Triton.
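The Python-backend option boils down to implementing a small class contract in a `model.py` placed inside a model's version directory. A hedged sketch (the tensor names and the doubling logic are invented for illustration; `triton_python_backend_utils` only exists inside the Triton container, so it is imported lazily):

```python
# model.py — sketch of a Triton Python backend.
# Triton imports this module and drives the TritonPythonModel hooks.

class TritonPythonModel:
    def initialize(self, args):
        # args is a dict of strings: model_config, model_name, etc.
        self.model_name = args.get("model_name", "unknown")

    def execute(self, requests):
        # Only available inside the Triton container.
        import triton_python_backend_utils as pb_utils
        responses = []
        for request in requests:
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            data = in_tensor.as_numpy()
            # Illustrative computation: double the input.
            out_tensor = pb_utils.Tensor("OUTPUT0", data * 2)
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[out_tensor])
            )
        return responses

    def finalize(self):
        # Called at model unload; release resources here.
        pass
```

Because `execute` receives a *list* of requests, dynamic batching composes naturally with this backend: Triton hands the whole batch to one call.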
Sources
- GitHub: https://github.com/triton-inference-server/server
- Docs: https://docs.nvidia.com/deeplearning/triton-inference-server
- Company: NVIDIA
- License: BSD-3-Clause