# NVIDIA Triton Inference Server — Multi-Framework Model Serving at Scale

> Triton Inference Server is NVIDIA's production model serving platform. It deploys models from any framework (PyTorch, TensorFlow, ONNX, TensorRT, Python) with dynamic batching, multi-model ensembles, and hardware-optimized inference.

## Install

Save in your project root:

## Quick Use

```bash
# Run with the official NGC image
docker run --gpus all -d --name triton \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $PWD/model_repository:/models \
  nvcr.io/nvidia/tritonserver:24.07-py3 \
  tritonserver --model-repository=/models

# Health
curl http://localhost:8000/v2/health/ready

# List models (Triton's repository index extension to the v2 protocol)
curl -X POST http://localhost:8000/v2/repository/index
```

```
# Model repository layout
model_repository/
  bert_base/
    config.pbtxt      # backend, inputs/outputs, dynamic batching
    1/                # version directory
      model.onnx
```

## Introduction

Triton Inference Server is NVIDIA's open-source production server for ML models. It was the answer to the "every team uses a different framework" problem: instead of standing up a separate server per framework, Triton hosts PyTorch, TensorFlow, ONNX, TensorRT, OpenVINO, RAPIDS, custom Python, and even C++ backends side by side.

With over 10,000 GitHub stars, Triton powers production inference at NVIDIA, Microsoft Azure, Snap, Yahoo Japan, and hundreds of enterprises. It pairs with TensorRT-LLM for state-of-the-art LLM inference performance on NVIDIA GPUs.

## What Triton Does

Triton loads models from a "model repository" (filesystem, S3, Azure Blob, GCS) into matching backends. It exposes HTTP/gRPC APIs for inference and metrics, batches requests dynamically, supports model ensembles (a DAG of models), provides per-model versioning and A/B routing, and surfaces detailed Prometheus metrics for fleet operations.
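The HTTP endpoint implements the KServe v2 inference protocol, so a request is just a JSON document. As a minimal sketch, here is how a request body for the `bert_base` model from the repository layout above could be built by hand; the field names (`inputs`, `name`, `shape`, `datatype`, `data`) follow the v2 spec, while the placeholder token values and the local URL are assumptions for illustration:

```python
import json

# KServe v2 inference request body for the hypothetical "bert_base" model.
# Tensor data is sent as a flattened, row-major list of values.
seq_len = 128
payload = {
    "inputs": [
        {
            "name": "input_ids",
            "shape": [1, seq_len],
            "datatype": "INT64",
            "data": [0] * seq_len,   # placeholder token ids
        },
        {
            "name": "attention_mask",
            "shape": [1, seq_len],
            "datatype": "INT64",
            "data": [1] * seq_len,   # attend to every position
        },
    ]
}

body = json.dumps(payload)
# This body would be POSTed to:
#   http://localhost:8000/v2/models/bert_base/infer
print(len(body))
```

The response mirrors the same shape: an `outputs` list with `name`, `shape`, `datatype`, and flattened `data` for each output tensor (here, `logits`).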
## Architecture Overview

```
        Clients (HTTP / gRPC / C API)
                     |
              [Triton Server]
      request scheduler + dynamic batcher
        model versioning, ensembles
            shared memory I/O
                     |
    +--------+--------+--------+--------+
    |        |        |        |        |
TensorRT  PyTorch   ONNX      TF     Python    vLLM/TRT-LLM
backend   backend  backend  backend  backend
                     |
   GPU(s) — kernel scheduling, MIG, MPS
                     |
           [Metrics + Tracing]
        Prometheus, OpenTelemetry
                     |
  [Model Analyzer + Performance Analyzer tools]
```

## Self-Hosting & Configuration

```protobuf
# config.pbtxt — example for an ONNX text classifier
name: "bert_base"
platform: "onnxruntime_onnx"
max_batch_size: 32
input [
  { name: "input_ids", data_type: TYPE_INT64, dims: [ 128 ] },
  { name: "attention_mask", data_type: TYPE_INT64, dims: [ 128 ] }
]
output [
  { name: "logits", data_type: TYPE_FP32, dims: [ 2 ] }
]
dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 5000
}
instance_group [{ count: 2, kind: KIND_GPU }]
```

```python
# Python client
import tritonclient.http as httpclient
import numpy as np

client = httpclient.InferenceServerClient(url="localhost:8000")

in_ids = httpclient.InferInput("input_ids", [1, 128], "INT64")
in_ids.set_data_from_numpy(np.zeros((1, 128), dtype=np.int64))
attn = httpclient.InferInput("attention_mask", [1, 128], "INT64")
attn.set_data_from_numpy(np.ones((1, 128), dtype=np.int64))

result = client.infer("bert_base", inputs=[in_ids, attn])
print(result.as_numpy("logits"))
```

## Key Features

- **Multi-framework backends** — PyTorch, TF, ONNX, TensorRT, OpenVINO, Python, custom
- **Dynamic batching** — requests batched on-the-fly for higher throughput
- **Ensembles** — pipeline multiple models without inter-process latency
- **Model versioning + A/B** — host multiple versions, route by policy
- **Multi-GPU / multi-instance** — dispatch across GPUs (or MIG slices)
- **HTTP/gRPC + KServe v2 protocol** — standard inference protocol
- **Performance Analyzer tool** — find optimal batch/instance counts
- **TensorRT-LLM integration** — serve LLMs
with NVIDIA's tuned engines

## Comparison with Similar Tools

| Feature | Triton | TorchServe | TensorFlow Serving | KServe | BentoML |
|---|---|---|---|---|---|
| Multi-framework | Yes (broadest) | PyTorch | TF | Pluggable (Triton inside) | Many |
| Dynamic batching | Yes | Yes | Yes | Yes | Yes |
| Ensembles | Yes (built-in DAG) | Limited | Limited | Via pipelines | Via runners |
| GPU optimization | Best (NVIDIA-native) | Good | Good | Depends | Depends |
| Best For | Multi-framework production fleets | PyTorch shops | TF shops | k8s-native serving | Python-heavy ML pipelines |

## FAQ

**Q: Triton vs TGI/vLLM for LLMs?**
A: TGI and vLLM are LLM-specific, with continuous batching tuned for autoregressive generation. Triton + TensorRT-LLM matches that performance and lets you serve other model types in the same fleet. For LLM-only stacks, vLLM/TGI are simpler.

**Q: Does Triton require NVIDIA GPUs?**
A: No — backends include CPU paths (ONNX, OpenVINO, Python). But NVIDIA features (MIG, TensorRT, MPS) are first-class and the project is NVIDIA-owned.

**Q: How does it integrate with Kubernetes?**
A: A common pattern is Triton pods behind a Service, with the model repository on S3 or a PVC. KServe wraps Triton as a backend for declarative model deployments.

**Q: Can I write a custom backend?**
A: Yes — via the Python backend (write a Python class), a C++ backend (high performance), or Business Logic Scripting (BLS) for orchestration. NeMo Guardrails and TensorRT-LLM use these mechanisms to extend Triton.

## Sources

- GitHub: https://github.com/triton-inference-server/server
- Docs: https://docs.nvidia.com/deeplearning/triton-inference-server
- Company: NVIDIA
- License: BSD-3-Clause

---

Source: https://tokrepo.com/en/workflows/e0a9738b-37db-11f1-9bc6-00163e2b0d79
Author: AI Open Source
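The dynamic batching discussed throughout can be pictured with a toy scheduler: queue incoming requests, then flush a batch either when a preferred size is reached or when the oldest queued request has waited longer than the queue-delay budget. The sketch below is a simplified illustration of that policy under those assumptions, not Triton's actual scheduler (which also weighs instance availability and multiple preferred sizes):

```python
from collections import deque

def batch_requests(arrivals, preferred=(4, 8, 16), max_delay_us=5000):
    """Toy dynamic batcher over (arrival_time_us, request_id) tuples.

    Flush when the queue reaches the largest preferred size, or when
    the oldest queued request has exceeded max_delay_us. A deliberate
    simplification of Triton's scheduler, for intuition only.
    """
    queue = deque()
    batches = []
    for t, rid in arrivals:
        # If the oldest request has blown its delay budget, flush first.
        if queue and t - queue[0][0] > max_delay_us:
            batches.append([r for _, r in queue])
            queue.clear()
        queue.append((t, rid))
        if len(queue) == max(preferred):
            batches.append([r for _, r in queue])
            queue.clear()
    if queue:  # final flush at shutdown
        batches.append([r for _, r in queue])
    return batches

# Five requests in a tight burst, one arriving ~10 ms later.
arrivals = [(0, "a"), (100, "b"), (200, "c"), (300, "d"), (400, "e"),
            (10_400, "f")]
print(batch_requests(arrivals))
# → [['a', 'b', 'c', 'd', 'e'], ['f']]
```

The burst is flushed as one batch once the 5 ms delay budget expires, so the straggler lands in a second batch — the same throughput-versus-latency trade that `max_queue_delay_microseconds` tunes in `config.pbtxt`.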