Triton — GPU Kernel Programming Language for Deep Learning

Introduction

Triton bridges the gap between high-level Python and low-level GPU programming. It lets ML researchers and engineers write custom GPU kernels in a Python-embedded DSL, achieving performance close to hand-tuned CUDA without managing thread blocks, shared memory, or memory coalescing manually.

What Triton Does

Compiles Python functions decorated with @triton.jit into GPU kernels via LLVM
Automatically handles memory coalescing, shared memory management, and thread scheduling
Supports NVIDIA GPUs (via PTX/CUDA) and AMD GPUs (via ROCm/AMDGPU)
Integrates with PyTorch tensors as kernel arguments
Provides auto-tuning utilities to search over kernel configurations

Architecture Overview

Triton programs are written with the @triton.jit decorator and the triton.language module (conventionally imported as tl). The compiler parses the Python AST and lowers it to Triton IR, then to a GPU-specific dialect (TritonGPU IR) where optimization passes such as memory coalescing, software pipelining, and shared memory allocation are applied, and finally to LLVM IR. The LLVM backend produces PTX for NVIDIA GPUs or AMDGPU ISA for AMD GPUs. Block-level parallelism is explicit: each program instance identifies its block with tl.program_id and builds element offsets with tl.arange. Thread-level details within a block are left to the compiler.
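To make the block-level model concrete, here is a minimal pure-Python emulation of how a 1D launch splits work across program instances. It is not Triton code and runs without a GPU. The helper names (arange, add_program, launch) are hypothetical stand-ins chosen for illustration, and the loop over program ids stands in for what a GPU would run in parallel.

```python
# Pure-Python emulation of a block-level programming model (no GPU required).
# Each "program" plays the role of one kernel instance identified by its id.


def arange(start, end):
    """Stand-in for a block-wide index range: offsets inside one block."""
    return list(range(start, end))


def add_program(pid, x, y, out, n_elements, block_size):
    """Work done by one program instance: load a block, add, store under a mask."""
    offsets = [pid * block_size + i for i in arange(0, block_size)]
    mask = [off < n_elements for off in offsets]  # guards the ragged last block
    for off, in_bounds in zip(offsets, mask):
        if in_bounds:
            out[off] = x[off] + y[off]


def launch(x, y, block_size=4):
    """Run one program per block, as a 1D launch grid would."""
    n_elements = len(x)
    out = [0] * n_elements
    num_programs = (n_elements + block_size - 1) // block_size  # ceiling division
    for pid in range(num_programs):  # sequential here, parallel on a GPU
        add_program(pid, x, y, out, n_elements, block_size)
    return out, num_programs


result, programs = launch(list(range(10)), [10] * 10)
print(programs, result)
```

With 10 elements and a block size of 4, the last program covers offsets 8 through 11, and the mask keeps offsets 10 and 11 from touching memory past the end of the input.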

Installation & Configuration

Install via pip: pip install triton (requires a compatible NVIDIA or AMD GPU driver)
Write kernels as Python functions decorated with @triton.jit
Launch kernels by indexing the decorated function with a grid specification and then calling it, as in kernel[grid](args)
Use triton.autotune to automatically search over block sizes and other parameters
Triton is installed as a dependency of PyTorch 2.x GPU builds on Linux, so torch.compile can use it without a separate install
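The steps above (decorate with @triton.jit, launch with a grid, optionally auto-tune) can be sketched as a vector-add kernel. This is a minimal sketch rather than a tuned implementation. The names add_kernel and add, the HAVE_GPU flag, and the two candidate configurations are assumptions made for illustration. The guard is there only so that the file still imports on a machine without PyTorch, Triton, or a GPU.

```python
try:
    import torch
    import triton
    import triton.language as tl

    HAVE_GPU = torch.cuda.is_available()
except ImportError:  # PyTorch or Triton is not installed: skip the GPU sketch
    HAVE_GPU = False

if HAVE_GPU:

    @triton.autotune(
        configs=[  # illustrative candidates, not recommended values
            triton.Config({"BLOCK_SIZE": 256}, num_warps=4),
            triton.Config({"BLOCK_SIZE": 1024}, num_warps=8),
        ],
        key=["n_elements"],  # re-tune when the problem size changes
    )
    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)  # which block this instance handles
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements  # guard the ragged last block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x, y):
        out = torch.empty_like(x)
        n_elements = out.numel()
        # The grid is a function of the chosen config, so the number of
        # program instances follows the auto-tuned BLOCK_SIZE.
        grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
        add_kernel[grid](x, y, out, n_elements)  # BLOCK_SIZE comes from autotune
        return out

    x = torch.rand(10_000, device="cuda")
    y = torch.rand(10_000, device="cuda")
    print(torch.allclose(add(x, y), x + y))
```

Because the auto-tuner supplies BLOCK_SIZE, the launch does not pass it explicitly. Without the triton.autotune decorator, the same kernel would be launched with BLOCK_SIZE given as a keyword argument.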

Key Features

Python-native syntax eliminates the need for separate CUDA/C++ compilation
Compiler automatically optimizes memory access patterns and instruction scheduling
Auto-tuning framework finds optimal kernel configurations for target hardware
Supports custom attention, normalization, and activation kernels used in LLM training
Serves as the GPU code generation target for TorchInductor, the default backend of torch.compile, for automated kernel generation

Comparison with Similar Tools

CUDA C++ — maximum control over GPU hardware; Triton abstracts thread management for faster development
Taichi — Python GPU programming for simulation and graphics; Triton is optimized for ML kernel patterns
Numba CUDA — Python JIT for CUDA; Triton provides higher-level abstractions and automatic optimizations
CuPy — NumPy-like GPU arrays; Triton enables writing custom kernels, not just array operations
CUTLASS — NVIDIA's C++ template library for GEMM; Triton offers a simpler Python interface for similar operations

FAQ

Q: Do I need to know CUDA to write Triton kernels? A: No. Triton handles thread scheduling, memory coalescing, and shared memory automatically. You think in blocks, not threads.

Q: Does Triton work with AMD GPUs? A: Yes. Triton supports AMD GPUs via the ROCm backend.

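Q: How does Triton relate to PyTorch? A: TorchInductor, the default backend of torch.compile, generates Triton kernels for GPU code. PyTorch therefore uses Triton to produce optimized GPU code automatically, and you can also call hand-written Triton kernels on PyTorch tensors directly.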

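Q: Can Triton kernels match hand-tuned CUDA performance? A: Often closely. For many common patterns (matrix multiply, attention, elementwise ops), Triton kernels land within roughly 5-15% of optimized CUDA and sometimes match it. The actual gap depends on the kernel, the GPU, and how much tuning each version received, so benchmark on your own hardware.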
