May 31, 2026 · 3 min read

NVIDIA CUTLASS — CUDA Templates for High-Performance Linear Algebra

A collection of CUDA C++ template abstractions for implementing high-performance matrix multiplications and convolutions on NVIDIA GPUs.

Introduction

CUTLASS (CUDA Templates for Linear Algebra Subroutines) is NVIDIA's open-source library of C++ template abstractions for writing high-performance GEMM (General Matrix Multiply) and convolution kernels on NVIDIA GPUs. It exposes, as reusable building blocks, tiling and data-movement strategies similar to those used to implement cuBLAS and cuDNN, and several deep learning frameworks use it for custom kernels.

What CUTLASS Does

  • Provides composable C++ templates for GEMM, grouped GEMM, and convolution operations
  • Supports FP64, FP32, TF32, FP16, BF16, INT8, and FP8 data types, with availability depending on the GPU architecture (FP8, for example, needs Ada or Hopper tensor cores)
  • Includes CuTe, a layout algebra DSL for expressing tensor data movement patterns
  • Generates kernels optimized for Hopper (SM90), Ampere (SM80), and earlier architectures
  • Offers a Python interface for rapid kernel prototyping and a command-line profiler (cutlass_profiler) for benchmarking
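
The layout idea behind CuTe can be illustrated without a GPU. The following is a minimal sketch in plain Python, not CuTe's actual API: it assumes a layout is just a (shape, stride) pair that maps a logical coordinate to a linear memory offset. The names layout_index and layout_table are hypothetical helpers introduced for this illustration.

```python
from itertools import product


def layout_index(coord, stride):
    """Map a logical coordinate to a linear offset: sum of coord[i] * stride[i]."""
    return sum(c * s for c, s in zip(coord, stride))


def layout_table(shape, stride):
    """Return {coordinate: offset} for every coordinate in the given shape."""
    return {
        coord: layout_index(coord, stride)
        for coord in product(*(range(n) for n in shape))
    }


# A 2x3 matrix stored column-major: stride 1 along rows, stride 2 along columns.
col_major = layout_table((2, 3), (1, 2))
# The same logical matrix stored row-major: stride 3 along rows, stride 1 along columns.
row_major = layout_table((2, 3), (3, 1))

print(col_major[(1, 1)])  # 3
print(row_major[(1, 1)])  # 4
```

The same logical element lands at a different offset depending only on the strides, which is why describing data movement in terms of layouts, rather than hand-written index arithmetic, composes well.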

Architecture Overview

CUTLASS decomposes matrix operations into a hierarchy of tiles: threadblock tiles, warp tiles, and instruction-level tiles. Each level maps to a hardware construct: thread blocks (grouped into threadblock clusters on Hopper), warps (or warp groups on Hopper), and tensor core MMA instructions. The CuTe library handles data layout transformations and copy operations between global memory, shared memory, and registers. Architecture-specific template specializations wrap the PTX instructions that drive tensor core MMA operations.
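
The tile hierarchy can be sketched as nested loops. This is a conceptual CPU illustration in plain Python, not CUTLASS code: the function names, the tile sizes, and the sequential loop order are assumptions chosen for readability. On a GPU the outer loops would run in parallel across thread blocks and warps, and the innermost tile would be a tensor core MMA instruction.

```python
def naive_gemm(A, B):
    """Reference C = A @ B for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def tiled_gemm(A, B, block=(4, 4), warp=(2, 2), k_tile=2):
    """Compute C = A @ B with nested tile loops.

    block:  threadblock-tile size (rows, cols) of C
    warp:   warp-tile size inside each threadblock tile
    k_tile: depth of each step along the K dimension
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for bm in range(0, M, block[0]):            # threadblock tiles over rows of C
        for bn in range(0, N, block[1]):        # threadblock tiles over columns of C
            m_end = min(bm + block[0], M)
            n_end = min(bn + block[1], N)
            for k0 in range(0, K, k_tile):      # main loop over K
                k_end = min(k0 + k_tile, K)
                for wm in range(bm, m_end, warp[0]):      # warp tiles (rows)
                    for wn in range(bn, n_end, warp[1]):  # warp tiles (columns)
                        # Instruction-level tile: a small multiply-accumulate.
                        for i in range(wm, min(wm + warp[0], m_end)):
                            for j in range(wn, min(wn + warp[1], n_end)):
                                for k in range(k0, k_end):
                                    C[i][j] += A[i][k] * B[k][j]
    return C


A = [[i + j for j in range(3)] for i in range(5)]      # 5x3
B = [[i * j + 1 for j in range(6)] for i in range(3)]  # 3x6
print(tiled_gemm(A, B) == naive_gemm(A, B))  # True
```

Every (i, j, k) triple is visited exactly once regardless of the tile sizes, so tiling changes only the order of the work. That freedom is what lets a real kernel stage each tile in shared memory or registers before computing on it.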

Self-Hosting & Configuration

  • Clone the repository and build with CMake targeting your GPU architecture
  • Set CUTLASS_NVCC_ARCHS to match your hardware (80 for Ampere, 90a for Hopper)
  • Use the Python interface for rapid prototyping without writing C++
  • Integrate as a header-only library into existing CUDA projects
  • Build and run the bundled profiler (cutlass_profiler) to benchmark kernel configurations
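
As a rough illustration of the configure step, the sketch below assembles a CMake command line from the architecture values listed above. The helper name cmake_configure_args and the ARCHS table are hypothetical, and the default source directory of ".." assumes you run CMake from a build directory inside the cloned repository.

```python
# Architecture values taken from the list above; other GPUs need their own entries.
ARCHS = {"ampere": "80", "hopper": "90a"}


def cmake_configure_args(gpu, source_dir=".."):
    """Build the argument list for a CMake configure step targeting one GPU family."""
    if gpu not in ARCHS:
        raise ValueError(f"unknown GPU family: {gpu!r}")
    return ["cmake", source_dir, f"-DCUTLASS_NVCC_ARCHS={ARCHS[gpu]}"]


print(" ".join(cmake_configure_args("hopper")))  # cmake .. -DCUTLASS_NVCC_ARCHS=90a
```

The resulting list can be passed to subprocess.run as-is. Restricting CUTLASS_NVCC_ARCHS to the hardware you actually have keeps compile times down, since each extra architecture multiplies the number of kernels built.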

Key Features

  • Near-peak tensor core throughput for well-tuned tile configurations on the supported NVIDIA architectures
  • CuTe layout algebra makes complex data movement patterns expressible and composable
  • Python interface allows kernel prototyping and benchmarking without hand-writing C++ templates (kernels are still compiled behind the scenes)
  • Supports asynchronous copy, warp-specialized kernels, and TMA on Hopper GPUs
  • Used by PyTorch and other deep learning projects for custom kernels, with a design that mirrors the techniques behind cuBLAS and cuDNN
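
To judge how close a kernel comes to peak, a common approach is to convert a measured runtime into achieved throughput. The sketch below assumes the usual convention of counting 2·M·N·K floating-point operations per GEMM (one multiply and one add per inner-product term). The function names are hypothetical, and the timing in the example is an invented placeholder, not a measurement.

```python
def gemm_tflops(m, n, k, seconds):
    """Achieved throughput in TFLOP/s, counting 2*m*n*k operations per GEMM."""
    return 2 * m * n * k / seconds / 1e12


def utilization(achieved_tflops, peak_tflops):
    """Fraction of the hardware's peak throughput that was achieved."""
    return achieved_tflops / peak_tflops


# Hypothetical example: a 4096 x 4096 x 4096 GEMM that took 1 ms.
achieved = gemm_tflops(4096, 4096, 4096, 1e-3)
print(round(achieved, 1))
```

Dividing the result by the peak figure from your GPU's datasheet, for the same data type, gives the utilization fraction.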

Comparison with Similar Tools

  • cuBLAS: pre-compiled, closed-source BLAS library; CUTLASS is open source and exposes comparable building blocks as templates you can customize
  • Triton: Python-based GPU kernel language with auto-tuning; CUTLASS offers lower-level C++ control
  • FBGEMM: Meta's GEMM library focused on quantized inference; CUTLASS covers broader data types and operations
  • OpenBLAS: CPU-targeted BLAS; CUTLASS is GPU-only and targets tensor cores
  • Liger Kernel: Triton-based LLM training kernels; CUTLASS operates at a lower abstraction level

FAQ

Q: Do I need CUTLASS to use NVIDIA GPUs for deep learning? A: No. Frameworks like PyTorch call cuBLAS and cuDNN for most dense linear algebra, so typical users never touch CUTLASS. Use CUTLASS directly when you need custom kernels or maximum control.

Q: Which GPU architectures are supported? A: Volta (SM70), Turing (SM75), Ampere (SM80), Ada Lovelace (SM89), and Hopper (SM90a), with the latest features targeting Hopper.

Q: Can I use CUTLASS from Python? A: Yes. The CUTLASS Python interface (the cutlass module, installable from PyPI as nvidia-cutlass in the 3.x releases) provides interfaces for defining, compiling, and running GEMM and convolution kernels.

Q: How does CUTLASS relate to Triton? A: Triton generates GPU kernels from Python with automatic optimization. CUTLASS provides C++ templates for maximum performance control, especially for GEMM operations.
