Introduction
Triton is an open-source programming language and compiler that makes writing custom GPU kernels accessible to machine learning researchers. Instead of wrestling with CUDA thread hierarchies, developers express parallel computations in a Python-like DSL and let the Triton compiler handle tiling, memory coalescing, and scheduling automatically.
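To give a feel for the programming model, here is a minimal vector-add kernel written in the style of the official Triton tutorials; the name add_kernel and the BLOCK_SIZE of 1024 are illustrative choices, not anything fixed by the library:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance processes one BLOCK_SIZE-wide tile of the inputs.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the final, partially-full tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# Launch directly on PyTorch CUDA tensors.
x = torch.rand(98432, device="cuda")
y = torch.rand(98432, device="cuda")
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```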
What Triton Does
- Lets you write GPU kernels using Python syntax with explicit block-level parallelism
- Compiles to optimized PTX/AMDGPU code via an MLIR-based compiler pipeline
- Auto-tunes tile sizes, number of warps, and pipeline stages for peak throughput
- Integrates directly with PyTorch tensors for seamless use in training and inference
- Supports NVIDIA and AMD GPUs through multiple backend targets
Architecture Overview
Triton programs operate on blocks of data rather than individual threads. The compiler ingests Triton IR, applies optimization passes (loop unrolling, software pipelining, shared memory allocation), and lowers through MLIR dialects to target-specific machine code. An auto-tuner explores the parameter search space at compile time to select optimal configurations per kernel and hardware combination.
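To make "blocks of data rather than individual threads" concrete, the sketch below reduces a vector using one tile per program instance; how the loads are vectorized, coalesced, and pipelined is left entirely to the compiler. The kernel name and tile size are hypothetical:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def sum_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # One program instance reduces one BLOCK_SIZE-wide tile; note there is no
    # explicit thread indexing, shared-memory management, or synchronization.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0)
    partial = tl.sum(x, axis=0)
    tl.atomic_add(out_ptr, partial)  # out_ptr must be zero-initialized

x = torch.rand(1 << 20, device="cuda")
out = torch.zeros(1, device="cuda")
grid = (triton.cdiv(x.numel(), 4096),)
sum_kernel[grid](x, out, x.numel(), BLOCK_SIZE=4096)
```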
Installation & Configuration
- Install via pip: `pip install triton` for the latest stable release
- Build from source for development: clone the repo and run `pip install -e python`
- Requires a recent NVIDIA or AMD GPU with the corresponding driver installed
- Set `TRITON_CACHE_DIR` to control where compiled kernel binaries are cached
- Use the `@triton.autotune` decorator to define config search spaces per kernel (see the sketch after this list)
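A sketch of the autotuning workflow (the kernel name and configs here are illustrative): each triton.Config pairs a dict of constexpr values with launch parameters such as num_warps, and key names the arguments whose values trigger re-tuning. An autotuned constexpr like BLOCK_SIZE is supplied by the selected config rather than at the launch site:

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 512}, num_warps=4),
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=4),
        triton.Config({"BLOCK_SIZE": 2048}, num_warps=8),
    ],
    key=["n_elements"],  # re-run the search when this argument's value changes
)
@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, scale, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * scale, mask=mask)

# The grid reads BLOCK_SIZE from whichever config the tuner selects:
# scale_kernel[lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)](x, out, n, 2.0)
```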
Key Features
- Python-native syntax eliminates the need to learn CUDA C++
- MLIR-based compiler produces code competitive with hand-tuned CUDA kernels
- Built-in autotuning finds optimal launch configurations automatically
- First-class PyTorch integration via `triton.ops` and custom autograd functions (a sketch follows this list)
- Powers the `torch.compile` Inductor backend in PyTorch 2.x
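A hedged sketch of hooking a Triton kernel into autograd: the Square op, kernel, and block size below are hypothetical, and only the forward pass uses Triton while the backward pass falls back to plain PyTorch for brevity:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def square_kernel(x_ptr, y_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(y_ptr + offsets, x * x, mask=mask)

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        x = x.contiguous()
        y = torch.empty_like(x)
        n = x.numel()
        square_kernel[(triton.cdiv(n, 1024),)](x, y, n, BLOCK_SIZE=1024)
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2.0 * x * grad_out  # d(x^2)/dx = 2x

y = Square.apply(torch.randn(4096, device="cuda", requires_grad=True))
```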
Comparison with Similar Tools
- CUDA C++ — Maximum control but steep learning curve; Triton trades some flexibility for productivity
- Numba CUDA — Python-based GPU programming but uses thread-level abstractions unlike Triton's block model
- CuPy — Wraps existing CUDA libraries; Triton lets you write custom fused kernels (see the sketch after this list)
- Taichi — Focused on spatial computing and graphics; Triton targets ML workloads
- Earlier OpenAI kernel libraries — Triton is the successor to OpenAI's earlier GPU kernel tooling efforts
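To illustrate the kind of fusion the CuPy comparison alludes to, here is a hypothetical bias-add + ReLU fused into one kernel, avoiding the intermediate tensor and extra global-memory round-trip that two separate library calls would incur:

```python
import triton
import triton.language as tl

@triton.jit
def fused_bias_relu_kernel(x_ptr, bias_ptr, y_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    b = tl.load(bias_ptr + offsets, mask=mask)
    # Both ops happen in registers; x + b is never written to global memory.
    tl.store(y_ptr + offsets, tl.maximum(x + b, 0.0), mask=mask)
```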
FAQ
Q: Does Triton replace CUDA for all use cases? A: Triton excels at data-parallel ML kernels (matmul, attention, reductions). For complex control flow or graphics workloads, CUDA C++ remains more appropriate.
Q: Which GPUs are supported? A: NVIDIA GPUs (Volta and newer) and AMD GPUs (CDNA/RDNA architectures) are supported through separate backends.
Q: How does Triton relate to PyTorch? A: PyTorch's torch.compile uses Triton as its default GPU code-generation backend via the Inductor compiler.
Q: Can Triton kernels match cuBLAS performance? A: For many matrix multiplication and attention patterns, Triton kernels achieve 90-100% of cuBLAS/cuDNN throughput with the right autotuning configuration.