# Triton Language — GPU Kernel Programming Made Accessible

> Triton is a language and compiler for writing efficient GPU kernels in a Python-like syntax. Well-tuned Triton kernels can approach the performance of vendor libraries such as cuBLAS and cuDNN on many ML workloads.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use
```bash
pip install triton
python -c "
import triton
import triton.language as tl
print(triton.__version__)
"
```

## Introduction
Triton is an open-source programming language and compiler that makes writing custom GPU kernels accessible to machine learning researchers. Instead of wrestling with CUDA thread hierarchies, developers express parallel computations over blocks of data in a Python-like DSL. The programmer chooses the block sizes, and the Triton compiler handles memory coalescing, shared-memory management, and scheduling within each block.

## What Triton Does
- Lets you write GPU kernels using Python syntax with explicit block-level parallelism
- Compiles to optimized PTX/AMDGPU code via an MLIR-based compiler pipeline
- Benchmarks candidate tile sizes, warp counts, and pipeline stages with a built-in autotuner and keeps the fastest configuration
- Integrates directly with PyTorch tensors for seamless use in training and inference
- Supports NVIDIA and AMD GPUs through multiple backend targets
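
To make the block-level model concrete, here is a minimal sketch of an elementwise add kernel, closely following the pattern used in the official Triton tutorials. The names `add_kernel` and `add` and the block size of 1024 are illustrative choices, not part of the Triton API. It assumes PyTorch and a supported GPU are available.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of BLOCK_SIZE elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    # The mask guards the last block when n_elements is not a multiple of BLOCK_SIZE.
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Launch add_kernel on two same-shaped, contiguous GPU tensors."""
    out = torch.empty_like(x)
    n_elements = out.numel()
    # One program per block; the launch grid is a 1-D tuple.
    grid = (triton.cdiv(n_elements, 1024),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out


if __name__ == "__main__" and torch.cuda.is_available():
    x = torch.rand(98432, device="cuda")
    y = torch.rand(98432, device="cuda")
    assert torch.allclose(add(x, y), x + y)
```

Note that there is no per-thread indexing anywhere in the kernel: `offsets` is a whole vector of indices, and the loads, the addition, and the store all operate on that vector at once. `BLOCK_SIZE` must be a power of two because `tl.arange` requires it.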

## Architecture Overview
Triton programs operate on blocks of data rather than individual threads. The compiler parses the decorated Python function into Triton IR, applies optimization passes (loop unrolling, software pipelining, shared memory allocation), and lowers through MLIR dialects and LLVM to target-specific machine code. The autotuner does not search at compile time: the first time a kernel is launched with a new set of key values, it benchmarks the candidate configurations you supplied and caches the fastest one for that kernel and hardware combination.
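
As a rough sketch of how the autotuner is driven, the kernel below lists three candidate configurations and lets Triton pick among them. The kernel name `scale_kernel`, the wrapper `scale`, and the specific block sizes and warp counts are assumptions chosen for illustration; sensible candidates depend on the kernel and the GPU.

```python
import torch
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 256}, num_warps=2),
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=4),
        triton.Config({"BLOCK_SIZE": 4096}, num_warps=8),
    ],
    # Candidates are re-benchmarked whenever the value of this argument changes.
    key=["n_elements"],
)
@triton.jit
def scale_kernel(x_ptr, out_ptr, n_elements, alpha, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * alpha, mask=mask)


def scale(x: torch.Tensor, alpha: float) -> torch.Tensor:
    """Multiply a contiguous GPU tensor by alpha using the autotuned kernel."""
    out = torch.empty_like(x)
    n_elements = x.numel()
    # BLOCK_SIZE comes from the selected config, so the grid is computed from it.
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    scale_kernel[grid](x, out, n_elements, alpha)
    return out


if __name__ == "__main__" and torch.cuda.is_available():
    x = torch.rand(1 << 20, device="cuda")
    assert torch.allclose(scale(x, 2.5), x * 2.5)
```

The wrapper does not pass `BLOCK_SIZE` itself, since the autotuner supplies it from whichever `triton.Config` wins. The first call for a given `n_elements` is slower because every candidate is timed.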

## Self-Hosting & Configuration
- Install via pip: `pip install triton` for the latest stable release
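- Build from source for development: clone the repo and run `pip install -e python` (recent checkouts install from the repository root with `pip install -e .`; follow the README of the version you cloned)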
- Requires a recent NVIDIA or AMD GPU with the corresponding driver installed
- Set `TRITON_CACHE_DIR` to control where compiled kernel binaries are cached
- Use the `@triton.autotune` decorator to define a configuration search space per kernel
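
The cache location can also be set from Python rather than the shell. This is a minimal sketch using only the standard library; the directory name `triton-cache-demo` is a hypothetical choice, and setting the variable before importing `triton` is a conservative assumption so that it is in effect before the first kernel compiles.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical cache location; any writable directory works.
cache_dir = Path(tempfile.gettempdir()) / "triton-cache-demo"
cache_dir.mkdir(parents=True, exist_ok=True)

# Set the variable before the first kernel is compiled
# (done here before `import triton` to be safe).
os.environ["TRITON_CACHE_DIR"] = str(cache_dir)

print(f"Triton kernels will be cached under {cache_dir}")
```

When the variable is unset, Triton falls back to a cache directory under the user's home folder (`~/.triton/cache`). Pointing it at fast local storage can help on clusters where the home directory is a network filesystem.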

## Key Features
- Python-native syntax eliminates the need to learn CUDA C++
- MLIR-based compiler produces code competitive with hand-tuned CUDA kernels
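- Built-in autotuning benchmarks the launch configurations you list and selects the fastest
- Works directly on PyTorch tensors: kernels accept `torch.Tensor` arguments and can be wrapped in custom `torch.autograd.Function` classes (the older `triton.ops` module is no longer shipped in recent releases)
- Powers the Inductor backend of `torch.compile` in PyTorch 2.x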

## Comparison with Similar Tools
- **CUDA C++** — Maximum control but steep learning curve; Triton trades some flexibility for productivity
- **Numba CUDA** — Python-based GPU programming but uses thread-level abstractions unlike Triton's block model
- **CuPy** — Wraps existing CUDA libraries; Triton lets you write custom fused kernels
- **Taichi** — Focused on spatial computing and graphics; Triton targets ML workloads
- **OpenAI Kernel** — Triton is the successor to earlier OpenAI GPU tooling efforts

## FAQ
**Q: Does Triton replace CUDA for all use cases?**
A: Triton excels at data-parallel ML kernels (matmul, attention, reductions). For complex control flow or graphics workloads, CUDA C++ remains more appropriate.

**Q: Which GPUs are supported?**
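A: NVIDIA and AMD GPUs are supported through separate backends. Historically this has meant NVIDIA Volta and newer and AMD CDNA/RDNA architectures, but the minimum supported generation changes between releases, so check the compatibility notes in the Triton README for the version you install.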

**Q: How does Triton relate to PyTorch?**
A: PyTorch's `torch.compile` uses Triton as its default GPU code generation backend via the Inductor compiler.
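
A minimal sketch of that path, assuming PyTorch 2.x and a CUDA GPU: the function `bias_gelu` is a hypothetical example, chosen because it contains two elementwise operations that Inductor can fuse into one generated kernel.

```python
import torch


def bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # Two elementwise ops that Inductor can fuse into a single kernel.
    return torch.nn.functional.gelu(x + bias)


# Wrapping is lazy: nothing is compiled until the first call.
compiled_bias_gelu = torch.compile(bias_gelu)

if __name__ == "__main__" and torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    bias = torch.randn(1024, device="cuda")
    # The first call triggers compilation; on a GPU, Inductor emits Triton code.
    torch.testing.assert_close(
        compiled_bias_gelu(x, bias), bias_gelu(x, bias), rtol=1e-4, atol=1e-4
    )
```

Running the script with the environment variable `TORCH_LOGS=output_code` should print the code Inductor generates, which is a convenient way to read real Triton kernels for your own model.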

**Q: Can Triton kernels match cuBLAS performance?**
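A: Often close. For many matrix multiplication and attention patterns, well-tuned Triton kernels are reported to reach roughly 90-100% of cuBLAS/cuDNN throughput, but the result depends on problem shape, data type, GPU generation, and the autotuning configurations tried.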
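
For reference, a percentage like this is usually derived from measured kernel times. The sketch below shows the arithmetic only; the helper names are hypothetical and the timings are made-up placeholders, not benchmark results. In practice the times would come from a timing utility such as `triton.testing.do_bench`.

```python
def matmul_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOP/s for an (m x k) @ (k x n) matmul, counted as 2*m*n*k flops."""
    return 2 * m * n * k / seconds / 1e12


def relative_throughput(kernel_seconds: float, baseline_seconds: float) -> float:
    """Fraction of the baseline's throughput that the kernel achieves."""
    return baseline_seconds / kernel_seconds


# Placeholder timings for illustration only (not measurements).
m = n = k = 4096
triton_seconds = 1.10e-3
cublas_seconds = 1.00e-3

ratio = relative_throughput(triton_seconds, cublas_seconds)
print(f"Triton: {matmul_tflops(m, n, k, triton_seconds):.1f} TFLOP/s")
print(f"cuBLAS: {matmul_tflops(m, n, k, cublas_seconds):.1f} TFLOP/s")
print(f"Triton reaches {ratio:.0%} of the baseline")
```

Because the flop count is the same for both kernels, the ratio reduces to baseline time divided by kernel time.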

## Sources
- https://github.com/triton-lang/triton
- https://triton-lang.org/

---
Source: https://tokrepo.com/en/workflows/triton-language-gpu-kernel-programming-made-accessible-f6c89f68
Author: AI Open Source