Introduction
Triton bridges the gap between high-level Python and low-level GPU programming. It lets ML researchers and engineers write custom GPU kernels in a Python-embedded DSL, achieving performance close to hand-tuned CUDA without manually managing individual threads, shared memory, or memory coalescing.
What Triton Does
- Compiles Python functions decorated with @triton.jit into GPU kernels via LLVM
- Automatically handles memory coalescing, shared memory management, and thread scheduling
- Supports NVIDIA GPUs (via PTX/CUDA) and AMD GPUs (via ROCm/AMDGPU)
- Integrates with PyTorch tensors as kernel arguments
- Provides auto-tuning utilities to search over kernel configurations
Architecture Overview
Triton programs are written with the @triton.jit decorator and the triton.language module (conventionally imported as tl). The compiler parses the Python AST of a decorated function, lowers it to Triton IR, applies optimization passes (memory coalescing, software pipelining, shared memory allocation), and generates LLVM IR. The LLVM backend then produces PTX for NVIDIA GPUs or AMDGPU machine code for AMD GPUs. Block-level parallelism is explicit: each program instance finds its position in the launch grid with tl.program_id and builds per-block element offsets with tl.arange. How a block's work is distributed across threads is left to the compiler.
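The block-level model described above can be sketched with an elementwise vector addition. This is a minimal illustration, assuming PyTorch is installed and a supported GPU is available; the names add_kernel and add and the block size of 1024 are illustrative choices, not part of Triton's API.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of BLOCK_SIZE elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    # The last block may run past the end of the tensors; mask those lanes out.
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Hypothetical wrapper: assumes x and y are contiguous GPU tensors
    # with the same shape and dtype.
    out = torch.empty_like(x)
    n_elements = out.numel()
    # One program instance per block, rounded up to cover the tail.
    grid = (triton.cdiv(n_elements, 1024),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out
```

Note that the kernel never mentions threads or shared memory. It only describes what one block computes, and the result of add(a, b) on two GPU tensors should match a + b.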
Installation & Usage
- Install via pip: pip install triton (requires a compatible NVIDIA or AMD GPU driver)
- Write kernels as Python functions decorated with @triton.jit
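- Launch a kernel by indexing the jitted function with a grid and then calling it, as in kernel[grid](args); the grid is a tuple, or a callable returning a tuple, giving the number of program instances per axis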
- Use triton.autotune to benchmark a list of candidate configurations (block sizes, num_warps, num_stages) and reuse the fastest one
- Triton is installed as a dependency of PyTorch 2.x GPU builds on Linux, so a separate install is often unnecessary
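As a sketch of the launch and auto-tuning steps listed above, the example below tunes the block size of a simple scaling kernel. It assumes PyTorch and a supported GPU; scale_kernel, scale, and the three candidate configurations are hypothetical and chosen only for illustration.

```python
import torch
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 256}, num_warps=4),
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=4),
        triton.Config({"BLOCK_SIZE": 4096}, num_warps=8),
    ],
    key=["n_elements"],  # re-run the search when this argument's value changes
)
@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, alpha, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * alpha, mask=mask)


def scale(x: torch.Tensor, alpha: float) -> torch.Tensor:
    # Hypothetical wrapper: assumes x is a contiguous GPU tensor.
    out = torch.empty_like(x)
    n_elements = x.numel()
    # The grid is a callable so it can read the BLOCK_SIZE the autotuner picked.
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    # BLOCK_SIZE is not passed here; the autotuner supplies it.
    scale_kernel[grid](x, out, n_elements, alpha)
    return out
```

The first call for a given n_elements benchmarks every candidate, so it is slower than later calls. The search only covers the configurations listed, which means the result is the best of those candidates rather than a global optimum.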
Key Features
- Python-native syntax eliminates the need for separate CUDA/C++ compilation
- Compiler automatically optimizes memory access patterns and instruction scheduling
- Auto-tuning framework benchmarks user-supplied candidate configurations and selects the fastest for the target hardware
- Supports custom attention, normalization, and activation kernels used in LLM training
- Serves as the GPU code-generation target for TorchInductor, the default backend of torch.compile
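To make the custom-kernel use case above concrete, here is a minimal sketch of a row-wise softmax, a building block of attention. It assumes PyTorch, a supported GPU, and a contiguous 2D float32 input whose rows each fit in a single block; softmax_kernel and softmax are illustrative names, not library functions.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def softmax_kernel(out_ptr, in_ptr, in_row_stride, out_row_stride, n_cols,
                   BLOCK_SIZE: tl.constexpr):
    # One program instance normalizes one row.
    row = tl.program_id(axis=0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    # Padding lanes load -inf so they contribute exp(-inf) = 0 to the sum.
    x = tl.load(in_ptr + row * in_row_stride + cols, mask=mask,
                other=-float("inf"))
    # Subtract the row maximum for numerical stability.
    x = x - tl.max(x, axis=0)
    numerator = tl.exp(x)
    denominator = tl.sum(numerator, axis=0)
    tl.store(out_ptr + row * out_row_stride + cols,
             numerator / denominator, mask=mask)


def softmax(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical wrapper: assumes a contiguous 2D float32 GPU tensor.
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    # tl.arange needs a power-of-two extent, so round the row length up.
    block_size = triton.next_power_of_2(n_cols)
    softmax_kernel[(n_rows,)](out, x, x.stride(0), out.stride(0), n_cols,
                              BLOCK_SIZE=block_size)
    return out
```

The whole row is loaded once, reduced, and written once, which is the kind of fusion that is awkward to express with separate array operations. Very long rows would need a loop over column blocks, which this sketch leaves out.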
Comparison with Similar Tools
- CUDA C++ — maximum control over GPU hardware; Triton abstracts thread management for faster development
- Taichi — Python GPU programming for simulation and graphics; Triton is optimized for ML kernel patterns
- Numba CUDA — Python JIT for CUDA; Triton provides higher-level abstractions and automatic optimizations
- CuPy — NumPy-like GPU arrays; Triton enables writing custom kernels, not just array operations
- CUTLASS — NVIDIA's C++ template library for GEMM; Triton offers a simpler Python interface for similar operations
FAQ
Q: Do I need to know CUDA to write Triton kernels? A: No. Triton handles thread scheduling, memory coalescing, and shared memory automatically. You think in blocks, not threads.
Q: Does Triton work with AMD GPUs? A: Yes. Triton supports AMD GPUs via the ROCm backend.
Q: How does Triton relate to PyTorch? A: TorchInductor, the default backend of torch.compile, generates Triton kernels for GPU targets, so PyTorch uses Triton to produce optimized GPU code automatically.
Q: Can Triton kernels match hand-tuned CUDA performance? A: For many common patterns (matrix multiply, attention, elementwise ops), Triton kernels achieve within 5-15% of optimized CUDA, and sometimes match it.