Scripts · May 31, 2026 · 3 min read

NVIDIA CUTLASS — CUDA Templates for High-Performance Linear Algebra

A collection of CUDA C++ template abstractions for implementing high-performance matrix multiplications and convolutions on NVIDIA GPUs.

Ready-to-run agent install

To install this asset, the agent chooses its runtime, checks the install plan, and then runs the matching command.

Native · 98/100 · Policy: allow

Agent surface: Any MCP/CLI agent
Kind: Skill
Install: Single
Trust: Established
Entrypoint: NVIDIA CUTLASS
Direct install command
npx -y tokrepo@latest install 7d20c843-5cea-11f1-9bc6-00163e2b0d79 --target codex

Run this command only after a dry run confirms the install plan.

Introduction

CUTLASS (CUDA Templates for Linear Algebra Subroutines) is NVIDIA's open-source library of C++ template abstractions for writing high-performance GEMM (General Matrix Multiply) and convolution kernels on NVIDIA GPUs. It provides kernel-level building blocks comparable to those used inside cuBLAS and cuDNN, and several deep learning frameworks use it for custom kernels.

What CUTLASS Does

  • Provides composable C++ templates for GEMM, grouped GEMM, and convolution operations
  • Supports FP64, FP32, TF32, FP16, BF16, INT8, and FP8 data types, with availability depending on the GPU architecture (FP8 tensor core support, for example, starts with Ada and Hopper)
  • Includes CuTe, a layout algebra DSL for expressing tensor data movement patterns
  • Generates kernels optimized for Hopper (SM90), Ampere (SM80), and earlier architectures
  • Offers a Python-based CUTLASS library and profiler for rapid kernel prototyping
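
To give a feel for what the reduced-precision types in the list above trade away, here is a minimal CPU-side sketch in plain C++. It is not CUTLASS code: the function names are hypothetical, and it truncates instead of rounding, which is a simplification (real conversions typically round to nearest). It relies only on the bit layouts of the formats: BF16 keeps the top 16 bits of an FP32 value, and TF32 keeps FP32's 8-bit exponent with a 10-bit mantissa.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == 4, "sketch assumes 32-bit IEEE-754 float");

// Hypothetical helper: FP32 -> BF16 bit pattern by truncation.
// BF16 is the upper 16 bits of FP32 (1 sign, 8 exponent, 7 mantissa bits).
inline std::uint16_t float_to_bf16_truncate(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    return static_cast<std::uint16_t>(bits >> 16);
}

// Hypothetical helper: widen a BF16 bit pattern back to FP32.
inline float bf16_to_float(std::uint16_t h) {
    std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float x;
    std::memcpy(&x, &bits, sizeof(x));
    return x;
}

// Hypothetical helper: emulate TF32 precision by truncation.
// TF32 keeps 10 of FP32's 23 mantissa bits, so the low 13 bits are cleared.
inline float truncate_to_tf32(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFFE000u;
    float y;
    std::memcpy(&y, &bits, sizeof(y));
    return y;
}
```

Passing a value such as 1 + 2^-11 through truncate_to_tf32 returns exactly 1.0, while 1 + 2^-10 survives unchanged, which is the kind of precision loss to keep in mind when choosing a data type for a kernel.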

Architecture Overview

CUTLASS decomposes matrix operations into a hierarchy of tiles: threadblock tiles, warp tiles, and instruction-level tiles. Each level maps to a GPU execution or hardware unit: thread blocks (grouped into threadblock clusters on Hopper), warps (or warp groups on Hopper), and tensor core instructions. The CuTe library handles data layout transformations and copies between global memory, shared memory, and registers. At the lowest level, architecture-specific templates wrap the PTX instructions that drive tensor core MMA operations.
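
The tile hierarchy can be sketched on the CPU as nested loops. The following is a plain C++ illustration, not CUTLASS code: the tile sizes and function names are assumptions chosen for readability, the innermost function merely stands in for a tensor core MMA, and tiling along the K dimension is omitted for brevity.

```cpp
#include <cassert>
#include <vector>

// Illustrative tile sizes; assumptions for readability, not CUTLASS defaults.
constexpr int kBlockM = 8, kBlockN = 8;  // threadblock tile
constexpr int kWarpM = 4, kWarpN = 4;    // warp tile within a threadblock tile
constexpr int kInstM = 2, kInstN = 2;    // instruction tile within a warp tile

static_assert(kBlockM % kWarpM == 0 && kBlockN % kWarpN == 0,
              "warp tiles must evenly divide threadblock tiles");
static_assert(kWarpM % kInstM == 0 && kWarpN % kInstN == 0,
              "instruction tiles must evenly divide warp tiles");

// Stand-in for one MMA instruction: updates a kInstM x kInstN patch of C.
void instruction_tile(int m0, int n0, int N, int K,
                      const std::vector<float>& A, const std::vector<float>& B,
                      std::vector<float>& C) {
    for (int i = m0; i < m0 + kInstM; ++i) {
        for (int j = n0; j < n0 + kInstN; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] += acc;
        }
    }
}

// C (M x N) += A (M x K) * B (K x N), all row-major.
// M and N must be multiples of the threadblock tile in this sketch.
void tiled_gemm(int M, int N, int K,
                const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C) {
    assert(M % kBlockM == 0 && N % kBlockN == 0);
    for (int bm = 0; bm < M; bm += kBlockM)                          // threadblock tiles
        for (int bn = 0; bn < N; bn += kBlockN)
            for (int wm = bm; wm < bm + kBlockM; wm += kWarpM)       // warp tiles
                for (int wn = bn; wn < bn + kBlockN; wn += kWarpN)
                    for (int im = wm; im < wm + kWarpM; im += kInstM)  // instruction tiles
                        for (int jn = wn; jn < wn + kWarpN; jn += kInstN)
                            instruction_tile(im, jn, N, K, A, B, C);
}

// Untiled reference used to check the tiled version.
void reference_gemm(int M, int N, int K,
                    const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C) {
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] += acc;
        }
    }
}
```

On a GPU the outer loops become parallel work: each threadblock tile is handled by a thread block, each warp tile by a warp, and the innermost patch by a tensor core instruction. The sequential loops here only show how the output matrix is partitioned.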

Self-Hosting & Configuration

  • Clone the repository and build with CMake targeting your GPU architecture
  • Set CUTLASS_NVCC_ARCHS to match your hardware (80 for Ampere, 90a for Hopper), for example by passing -DCUTLASS_NVCC_ARCHS=80 to cmake
  • Use the Python interface for rapid prototyping without writing C++
  • Integrate as a header-only library into existing CUDA projects
  • Build and run the built-in profiler (the cutlass_profiler target) to benchmark kernel configurations

Key Features

  • Well-tuned kernels can approach peak tensor core throughput on supported NVIDIA architectures
  • CuTe layout algebra makes complex data movement patterns expressible and composable
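  • Python bindings allow kernel prototyping and benchmarking without writing C++ (kernels are still compiled for the GPU behind the scenes)
  • Supports asynchronous copy (cp.async) on Ampere and later, plus warp-specialized kernels and the Tensor Memory Accelerator (TMA) on Hopper GPUs
  • Implements the same style of tiled kernels found in cuBLAS and cuDNN, and is used by frameworks such as PyTorch for some custom kernels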
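
To make the layout idea behind CuTe more concrete, the sketch below models a rank-2 layout as a (shape, stride) pair that maps a logical coordinate to a memory offset. This is a simplified plain C++ stand-in written for illustration: Layout2D and the helper functions are hypothetical names, not the CuTe API, and real CuTe layouts are hierarchical and mostly resolved at compile time.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a rank-2 layout: shape gives the extent of each
// mode, stride gives how far to step in memory per unit of that mode.
struct Layout2D {
    std::array<std::size_t, 2> shape;   // {rows, cols}
    std::array<std::size_t, 2> stride;  // {row step, column step}

    // Map a logical coordinate to a linear offset.
    std::size_t operator()(std::size_t row, std::size_t col) const {
        assert(row < shape[0] && col < shape[1]);
        return row * stride[0] + col * stride[1];
    }
};

// Row-major: consecutive columns are adjacent in memory.
inline Layout2D make_row_major(std::size_t rows, std::size_t cols) {
    return Layout2D{{rows, cols}, {cols, 1}};
}

// Column-major: consecutive rows are adjacent in memory.
inline Layout2D make_col_major(std::size_t rows, std::size_t cols) {
    return Layout2D{{rows, cols}, {1, rows}};
}

// Transposing swaps the two modes; the underlying data does not move.
inline Layout2D transpose(const Layout2D& l) {
    return Layout2D{{l.shape[1], l.shape[0]}, {l.stride[1], l.stride[0]}};
}
```

For a 4 x 8 matrix, make_row_major(4, 8)(1, 2) yields offset 10 and make_col_major(4, 8)(1, 2) yields offset 9. Because a transpose is just a new layout over the same memory, transformations like this compose without copying data, which is the property that makes a layout algebra useful for describing data movement.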

Comparison with Similar Tools

  • cuBLAS — pre-compiled, closed-source BLAS library; CUTLASS exposes comparable GEMM building blocks as open-source templates that you can customize
  • Triton — Python-based GPU kernel language with auto-tuning; CUTLASS offers lower-level C++ control
  • FBGEMM — Meta's GEMM library focused on quantized inference; CUTLASS covers broader data types and operations
  • OpenBLAS — CPU-targeted BLAS; CUTLASS is GPU-only and targets tensor cores
  • Liger Kernel — Triton-based LLM training kernels; CUTLASS operates at a lower abstraction level

FAQ

Q: Do I need CUTLASS to use NVIDIA GPUs for deep learning? A: No. Frameworks like PyTorch call cuBLAS and cuDNN, which ship pre-built kernels based on similar techniques. Use CUTLASS directly when you need custom kernels or maximum control.

Q: Which GPU architectures are supported? A: Volta (SM70), Turing (SM75), Ampere (SM80), Ada Lovelace (SM89), and Hopper (SM90, built with the 90a target to enable architecture-specific features). Recent CUTLASS releases also add Blackwell (SM100) support.

Q: Can I use CUTLASS from Python? A: Yes. The CUTLASS Python interface, including the cutlass_library package, provides interfaces for defining, compiling, and profiling GEMM and convolution kernels.

Q: How does CUTLASS relate to Triton? A: Triton generates GPU kernels from Python with automatic optimization. CUTLASS provides C++ templates for maximum performance control, especially for GEMM operations.
