May 3, 2026 · 3 min read

Triton Language — GPU Kernel Programming Made Accessible

Triton is a language and compiler for writing highly efficient GPU kernels in Python-like syntax, enabling researchers to approach the performance of expert-tuned libraries such as cuBLAS and cuDNN.

Introduction

Triton is an open-source programming language and compiler that makes writing custom GPU kernels accessible to machine learning researchers. Instead of wrestling with CUDA thread hierarchies, developers express parallel computations in a Python-like DSL and let the Triton compiler handle tiling, memory coalescing, and scheduling automatically.

What Triton Does

  • Lets you write GPU kernels using Python syntax with explicit block-level parallelism
  • Compiles to optimized PTX/AMDGPU code via an MLIR-based compiler pipeline
  • Auto-tunes tile sizes, number of warps, and pipeline stages for peak throughput
  • Integrates directly with PyTorch tensors for seamless use in training and inference
  • Supports NVIDIA and AMD GPUs through multiple backend targets
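The block-level model above is easiest to see in a minimal kernel. The sketch below, adapted from the style of Triton's vector-add tutorial, assumes a supported GPU and a recent Triton release; the helper name `add` and the `BLOCK_SIZE` choice are illustrative:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide tile of the input,
    # not a single thread -- the compiler maps the tile onto threads itself.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard the ragged final tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    # Launch one program per tile; triton.cdiv rounds the grid size up.
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

Note there is no explicit thread index or `__syncthreads()`: the per-tile mask is the only bounds handling the programmer writes.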

Architecture Overview

Triton programs operate on blocks of data rather than individual threads. The compiler ingests Triton IR, applies optimization passes (loop unrolling, software pipelining, shared memory allocation), and lowers through MLIR dialects to target-specific machine code. An auto-tuner explores the parameter search space when a kernel is first launched, selecting an optimal configuration per kernel and hardware combination and caching the result.

Self-Hosting & Configuration

  • Install via pip: pip install triton for the latest stable release
  • Build from source for development: clone the repo and run pip install -e python
  • Requires a recent NVIDIA or AMD GPU with the corresponding driver installed
  • Set TRITON_CACHE_DIR to control where compiled kernel binaries are cached
  • Use @triton.autotune decorator to define config search spaces per kernel

Key Features

  • Python-native syntax eliminates the need to learn CUDA C++
  • MLIR-based compiler produces code competitive with hand-tuned CUDA kernels
  • Built-in autotuning finds optimal launch configurations automatically
  • First-class PyTorch integration: kernels accept PyTorch tensors directly and can be wrapped in custom autograd functions
  • Powers the torch.compile inductor backend in PyTorch 2.x
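The last point means most users run Triton without writing a kernel at all. A minimal sketch, assuming a PyTorch 2.x install (the function name `fused_gelu_scale` is invented for illustration):

```python
import torch

def fused_gelu_scale(x: torch.Tensor, scale: float) -> torch.Tensor:
    # On CUDA devices, Inductor fuses these pointwise ops into a single
    # generated Triton kernel, avoiding an intermediate memory round-trip.
    return torch.nn.functional.gelu(x) * scale

compiled = fused_gelu_scale
compiled = torch.compile(fused_gelu_scale)  # Inductor is the default backend
# On a CUDA device: y = compiled(torch.randn(4096, device="cuda"), 0.5)
```

Setting the environment variable `TORCH_COMPILE_DEBUG=1` dumps the generated Triton source, which is a useful way to study idiomatic kernel code.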

Comparison with Similar Tools

  • CUDA C++ — Maximum control but steep learning curve; Triton trades some flexibility for productivity
  • Numba CUDA — Python-based GPU programming but uses thread-level abstractions unlike Triton's block model
  • CuPy — Wraps existing CUDA libraries; Triton lets you write custom fused kernels
  • Taichi — Focused on spatial computing and graphics; Triton targets ML workloads
  • Hand-written OpenAI kernels — Triton originated at OpenAI and superseded its earlier internal GPU kernel tooling

FAQ

Q: Does Triton replace CUDA for all use cases? A: Triton excels at data-parallel ML kernels (matmul, attention, reductions). For complex control flow or graphics workloads, CUDA C++ remains more appropriate.

Q: Which GPUs are supported? A: NVIDIA GPUs (Volta and newer) and AMD GPUs (CDNA/RDNA architectures) are supported through separate backends.

Q: How does Triton relate to PyTorch? A: PyTorch's torch.compile uses Triton as its default GPU code generation backend via the Inductor compiler.

Q: Can Triton kernels match cuBLAS performance? A: For many matrix multiplication and attention patterns, Triton kernels achieve 90-100% of cuBLAS/cuDNN throughput with the right autotuning configuration.
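Throughput claims like the one above are straightforward to check locally with `triton.testing.do_bench`, which returns a timing in milliseconds. A sketch, assuming a CUDA device; the shapes and the helper name `bench_matmul` are illustrative:

```python
import torch
import triton

def bench_matmul(m=4096, n=4096, k=4096):
    a = torch.randn(m, k, device="cuda", dtype=torch.float16)
    b = torch.randn(k, n, device="cuda", dtype=torch.float16)
    # do_bench handles warmup and repeated timed runs for you.
    ms = triton.testing.do_bench(lambda: torch.matmul(a, b))
    # A matmul performs 2*m*n*k floating-point operations.
    tflops = 2 * m * n * k / (ms * 1e-3) / 1e12
    print(f"{ms:.3f} ms, {tflops:.1f} TFLOP/s")
```

Swapping `torch.matmul` for a custom Triton kernel gives an apples-to-apples comparison against the cuBLAS-backed baseline.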
