Scripts · May 19, 2026 · 1 min read

Triton — GPU Kernel Programming Language for Deep Learning

Triton is an open-source programming language and compiler for writing efficient GPU kernels, originally developed by OpenAI. It provides a Python-like syntax and compiles kernels to optimized code for NVIDIA (CUDA/PTX), AMD (ROCm), and other GPU backends, making custom kernel development accessible without deep low-level GPU expertise.

Quick Use
Universal CLI install command
npx tokrepo install b62a0112-537e-11f1-9bc6-00163e2b0d79

Introduction

Triton bridges the gap between high-level Python and low-level GPU programming. It lets ML researchers and engineers write custom GPU kernels in a Python-embedded DSL, achieving performance close to hand-tuned CUDA without manually managing individual threads, shared memory, or memory coalescing.

What Triton Does

  • Compiles Python functions decorated with @triton.jit into GPU kernels via LLVM
  • Automatically handles memory coalescing, shared memory management, and thread scheduling
  • Supports NVIDIA GPUs (via PTX/CUDA) and AMD GPUs (via ROCm/AMDGPU)
  • Accepts PyTorch tensors directly as kernel arguments (each tensor is passed as a pointer to its data)
  • Provides auto-tuning utilities to search over kernel configurations

Architecture Overview

Triton programs are written using the @triton.jit decorator and the triton.language module. The compiler parses the Python AST, lowers it to Triton IR, applies optimization passes (automatic memory coalescing, software pipelining, shared memory allocation), and generates LLVM IR. The LLVM backend then produces PTX for NVIDIA GPUs or AMDGPU ISA for AMD. Block-level parallelism is explicit (tl.program_id identifies the block, tl.arange builds the offsets of its elements), but thread-level details are handled by the compiler.
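
The block-level model described above can be sketched with an element-wise vector add. This is a minimal sketch rather than code from the Triton project: add_kernel, add, and num_programs are hypothetical names, the block size of 1024 is an arbitrary choice, and the GPU portion is assumed to run only when torch, triton, and a CUDA- or ROCm-capable device are available.

```python
# Guarded imports: the GPU portion is skipped when torch/triton or a GPU is missing.
try:
    import torch
    import triton
    import triton.language as tl
    HAS_GPU = torch.cuda.is_available()
except ImportError:
    HAS_GPU = False


def num_programs(n_elements, block_size):
    """Hypothetical helper: number of blocks needed to cover n_elements."""
    return (n_elements + block_size - 1) // block_size


if HAS_GPU:

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program instance handles one contiguous block of elements.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        # The last block may run past the end of the data, so mask it.
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x, y, block_size=1024):
        out = torch.empty_like(x)
        n_elements = out.numel()
        # One-dimensional launch grid: one program per block.
        grid = (num_programs(n_elements, block_size),)
        add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=block_size)
        return out

    x = torch.rand(98432, device="cuda")
    y = torch.rand(98432, device="cuda")
    assert torch.allclose(add(x, y), x + y)
```

The kernel body never mentions threads: it operates on a whole block of offsets at once, and the mask keeps the final, partially filled block from reading or writing out of bounds. Triton also ships triton.cdiv, which performs the same ceiling division as the helper here.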

Self-Hosting & Configuration

  • Install via pip: pip install triton (official wheels target Linux; running kernels requires a compatible NVIDIA or AMD GPU and driver)
  • Write kernels as Python functions decorated with @triton.jit
  • Launch kernels by indexing the function with a grid specification and then calling it, as in kernel[grid](args)
  • Use triton.autotune to automatically search over block sizes and other parameters
  • Triton is installed as a dependency of recent PyTorch GPU builds on Linux, so torch.compile usually works without a separate install
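
The launch and auto-tuning steps listed above might look like the following. This is a sketch under stated assumptions: scale_kernel, scale, and CANDIDATE_BLOCK_SIZES are hypothetical names, the candidate values and num_warps=4 are illustrative rather than recommended, and the GPU portion only runs when torch, triton, and a supported GPU are present.

```python
# Guarded imports: the GPU portion is skipped when torch/triton or a GPU is missing.
try:
    import torch
    import triton
    import triton.language as tl
    HAS_GPU = torch.cuda.is_available()
except ImportError:
    HAS_GPU = False

# Illustrative candidates; tl.arange requires a power-of-two length.
CANDIDATE_BLOCK_SIZES = [256, 512, 1024]

if HAS_GPU:

    @triton.autotune(
        configs=[
            triton.Config({"BLOCK_SIZE": b}, num_warps=4)
            for b in CANDIDATE_BLOCK_SIZES
        ],
        key=["n_elements"],  # re-tune when this argument's value changes
    )
    @triton.jit
    def scale_kernel(x_ptr, out_ptr, n_elements, factor, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x * factor, mask=mask)

    def scale(x, factor):
        out = torch.empty_like(x)
        n_elements = x.numel()
        # The grid is a function of the chosen config, so it adapts to BLOCK_SIZE.
        grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
        # BLOCK_SIZE is not passed here; the autotuner supplies it.
        scale_kernel[grid](x, out, n_elements, factor)
        return out

    x = torch.rand(100_000, device="cuda")
    assert torch.allclose(scale(x, 2.0), x * 2.0)
```

The autotuner only benchmarks the configurations it is given, so the quality of the result depends on how well the candidate list covers the target hardware.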

Key Features

  • Python-native syntax with just-in-time compilation on first launch, so no separate CUDA/C++ build step is needed
  • Compiler automatically optimizes memory access patterns and instruction scheduling
  • Auto-tuning framework benchmarks the candidate configurations you supply and picks the fastest for the target hardware
  • Supports custom attention, normalization, and activation kernels used in LLM training
  • Serves as the GPU code-generation target of TorchInductor, the default torch.compile backend, for automated kernel generation
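
The torch.compile integration mentioned in the last item can be tried without writing any Triton code. The sketch below assumes a PyTorch 2.x install with a supported GPU; scaled_residual is a hypothetical function chosen because its two elementwise operations are the kind of pattern a fusing compiler can merge into one kernel.

```python
# Guarded import: the compiled path is skipped when torch or a GPU is missing.
try:
    import torch
    HAS_GPU = torch.cuda.is_available()
except ImportError:
    HAS_GPU = False


def scaled_residual(x, residual, alpha):
    # Two elementwise ops (multiply, then add) written as plain Python.
    return x * alpha + residual


if HAS_GPU:
    # With GPU inputs, the default backend (TorchInductor) generates Triton kernels.
    compiled = torch.compile(scaled_residual)
    x = torch.rand(4096, device="cuda")
    r = torch.rand(4096, device="cuda")
    # The compiled function should agree with eager execution.
    assert torch.allclose(compiled(x, r, 0.5), scaled_residual(x, r, 0.5))
```

Running the script with the environment variable TORCH_LOGS=output_code set should print the code that PyTorch generated, which is a practical way to see what an auto-generated Triton kernel looks like.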

Comparison with Similar Tools

  • CUDA C++ — maximum control over GPU hardware; Triton abstracts thread management for faster development
  • Taichi — Python GPU programming for simulation and graphics; Triton is optimized for ML kernel patterns
  • Numba CUDA — Python JIT for CUDA; Triton provides higher-level abstractions and automatic optimizations
  • CuPy — NumPy-like GPU arrays; Triton enables writing custom kernels, not just array operations
  • CUTLASS — NVIDIA's C++ template library for GEMM; Triton offers a simpler Python interface for similar operations

FAQ

Q: Do I need to know CUDA to write Triton kernels? A: No. Triton handles thread scheduling, memory coalescing, and shared memory automatically. You think in blocks, not threads.

Q: Does Triton work with AMD GPUs? A: Yes. Triton supports AMD GPUs via the ROCm backend.

Q: How does Triton relate to PyTorch? A: TorchInductor, the default backend of torch.compile, generates Triton kernels for GPU targets. PyTorch therefore uses Triton to produce optimized GPU code automatically.

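Q: Can Triton kernels match hand-tuned CUDA performance? A: Often, but not always. For many common patterns (matrix multiply, attention, elementwise ops), Triton kernels are reported to come within roughly 5-15% of optimized CUDA and sometimes match it; the actual gap depends on the kernel, the GPU, and how carefully the kernel is tuned.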
