Scripts · May 19, 2026 · 3 min read

Triton — GPU Kernel Programming Language for Deep Learning

Triton is an open-source programming language and compiler for writing efficient GPU kernels, originally developed by OpenAI. It provides a Python-like syntax that compiles to optimized machine code for NVIDIA and AMD GPUs, making custom kernel development accessible without deep low-level GPU expertise.

Universal CLI command
npx tokrepo install b62a0112-537e-11f1-9bc6-00163e2b0d79

Introduction

Triton bridges the gap between high-level Python and low-level GPU programming. It lets ML researchers and engineers write custom GPU kernels in a Python-embedded DSL, achieving performance close to hand-tuned CUDA without managing thread blocks, shared memory, or memory coalescing manually.

What Triton Does

  • Compiles Python functions decorated with @triton.jit into GPU kernels via LLVM
  • Automatically handles memory coalescing, shared memory management, and thread scheduling
  • Supports NVIDIA GPUs (via PTX/CUDA) and AMD GPUs (via ROCm/AMDGPU)
  • Integrates with PyTorch tensors as kernel arguments
  • Provides auto-tuning utilities to search over kernel configurations

Architecture Overview

Triton programs are written using the @triton.jit decorator and the triton.language module (conventionally imported as tl). The compiler parses the Python AST, lowers it to Triton IR, applies optimization passes (automatic memory coalescing, software pipelining, shared memory allocation), and generates LLVM IR. The LLVM backend then produces PTX for NVIDIA GPUs or AMDGPU ISA for AMD GPUs. Each kernel instance (a "program") operates on a block of data: block-level parallelism is explicit (tl.program_id, tl.arange), while the mapping of work onto threads within a block is handled by the compiler.
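
The block-level model described above can be sketched as a vector-add kernel. This is a minimal sketch, not taken from the original page: the names add_kernel, add, and BLOCK_SIZE are illustrative, and it assumes PyTorch plus a supported GPU. The guarded import lets the file load on machines without Triton, and block_offsets is a hypothetical pure-Python helper that mirrors the index arithmetic each program instance performs.

```python
# Sketch: each program instance handles one BLOCK_SIZE-wide slice of the input.
try:
    import torch
    import triton
    import triton.language as tl
    HAVE_GPU_TRITON = torch.cuda.is_available()
except ImportError:  # lets the helper below run without Triton installed
    HAVE_GPU_TRITON = False


def block_offsets(pid, block_size, n_elements):
    """Pure-Python mirror of the kernel's index math (illustrative helper)."""
    offsets = [pid * block_size + i for i in range(block_size)]
    mask = [off < n_elements for off in offsets]  # False past the end of the data
    return offsets, mask


if HAVE_GPU_TRITON:

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)  # index of this program in the 1D grid
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements  # guard the ragged final block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x, y):
        """Launch add_kernel over a 1D grid; x and y are same-shape GPU tensors."""
        out = torch.empty_like(x)
        n_elements = out.numel()
        # One program per BLOCK_SIZE elements, rounded up.
        grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
        add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
        return out
```

Note that the kernel never mentions individual threads: it computes a block of offsets, masks the out-of-range tail, and leaves the rest to the compiler.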

Self-Hosting & Configuration

  • Install via pip: pip install triton (requires a compatible NVIDIA or AMD GPU driver)
  • Write kernels as Python functions decorated with @triton.jit
  • Launch kernels by indexing the function with a grid specification and then calling it, as in kernel[grid](args)
  • Use triton.autotune to automatically search over block sizes and other parameters
  • Triton is installed as a dependency of recent PyTorch GPU builds on Linux, so torch.compile can use it without a separate install
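
The auto-tuning step in the list above might look like the following. This is a hedged sketch under stated assumptions: scale_kernel, scale, and the candidate block sizes are hypothetical choices for illustration, and PyTorch plus a supported GPU are assumed. The cdiv helper is a plain-Python stand-in that performs the same ceiling division as triton.cdiv so the grid arithmetic can be checked without a GPU.

```python
# Sketch: let triton.autotune pick BLOCK_SIZE from a few candidate configs.
try:
    import torch
    import triton
    import triton.language as tl
    HAVE_GPU_TRITON = torch.cuda.is_available()
except ImportError:  # lets the helper below run without Triton installed
    HAVE_GPU_TRITON = False

# Candidate block sizes for the tuner to try (illustrative values, not tuned).
CANDIDATE_BLOCK_SIZES = [128, 256, 512, 1024]


def cdiv(a, b):
    """Ceiling division: how many blocks of size b are needed to cover a items."""
    return (a + b - 1) // b


if HAVE_GPU_TRITON:

    @triton.autotune(
        configs=[
            triton.Config({"BLOCK_SIZE": bs}, num_warps=4)
            for bs in CANDIDATE_BLOCK_SIZES
        ],
        key=["n_elements"],  # re-tune when this argument's value changes
    )
    @triton.jit
    def scale_kernel(x_ptr, out_ptr, n_elements, factor, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x * factor, mask=mask)

    def scale(x, factor):
        """Multiply a GPU tensor by a scalar using the autotuned kernel."""
        out = torch.empty_like(x)
        n_elements = out.numel()
        # The grid depends on whichever BLOCK_SIZE the tuner selects.
        grid = lambda meta: (cdiv(n_elements, meta["BLOCK_SIZE"]),)
        # BLOCK_SIZE is not passed here; the autotuner supplies it.
        scale_kernel[grid](x, out, n_elements, factor)
        return out
```

Because the grid is a function of the chosen configuration, the launch stays correct for every candidate: a larger BLOCK_SIZE simply yields fewer program instances.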

Key Features

  • Python-native syntax eliminates the need for separate CUDA/C++ compilation
  • Compiler automatically optimizes memory access patterns and instruction scheduling
  • Auto-tuning framework finds optimal kernel configurations for target hardware
  • Supports custom attention, normalization, and activation kernels used in LLM training
  • Used by TorchInductor, the default torch.compile backend, to generate GPU kernels automatically

Comparison with Similar Tools

  • CUDA C++ — maximum control over GPU hardware; Triton abstracts thread management for faster development
  • Taichi — Python GPU programming for simulation and graphics; Triton is optimized for ML kernel patterns
  • Numba CUDA — Python JIT for CUDA; Triton provides higher-level abstractions and automatic optimizations
  • CuPy — NumPy-like GPU arrays; Triton enables writing custom kernels, not just array operations
  • CUTLASS — NVIDIA's C++ template library for GEMM; Triton offers a simpler Python interface for similar operations

FAQ

Q: Do I need to know CUDA to write Triton kernels? A: No. Triton handles thread scheduling, memory coalescing, and shared memory automatically. You think in blocks, not threads.

Q: Does Triton work with AMD GPUs? A: Yes. Triton supports AMD GPUs via the ROCm backend.

Q: How does Triton relate to PyTorch? A: TorchInductor, the default backend of torch.compile, emits Triton kernels for GPU targets, so PyTorch uses Triton to generate optimized GPU code automatically.

Q: Can Triton kernels match hand-tuned CUDA performance? A: For many common patterns (matrix multiply, attention, elementwise ops), Triton kernels can come within 5-15% of optimized CUDA and sometimes match it; actual results depend on the kernel, the GPU, and how well the configuration is tuned.
