NVIDIA CUTLASS — CUDA Templates for High-Performance Linear Algebra

Introduction

CUTLASS (CUDA Templates for Linear Algebra Subroutines) is NVIDIA's open-source library of C++ template abstractions for writing high-performance GEMM (general matrix multiply) and convolution kernels on NVIDIA GPUs. It exposes, as reusable source code, building blocks similar to those used in libraries such as cuBLAS and cuDNN, and deep learning frameworks draw on it for custom kernels.

What CUTLASS Does

Provides composable C++ templates for GEMM, grouped GEMM, and convolution operations
Supports FP64, FP32, TF32, FP16, BF16, INT8, and FP8 data types, with availability depending on the GPU architecture (FP8, for example, needs Ada or Hopper tensor cores)
Includes CuTe, a layout algebra DSL for expressing tensor data movement patterns
Generates kernels optimized for Hopper (SM90), Ampere (SM80), and earlier architectures
Offers a Python interface and a command-line profiler for rapid kernel prototyping and benchmarking
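
The layout idea behind CuTe can be illustrated without the library itself. The sketch below is plain Python, not the CuTe API: it assumes a layout is a (shape, stride) pair and that a coordinate maps to a linear offset through the inner product of coordinate and stride. The helper names `layout_offset` and `layout_table` are hypothetical and exist only for this illustration.

```python
def layout_offset(coord, stride):
    """Map a logical coordinate to a linear offset: sum of coord[i] * stride[i]."""
    return sum(c * s for c, s in zip(coord, stride))


def layout_table(shape, stride):
    """Return the offset of every (row, col) coordinate of a 2-D layout."""
    rows, cols = shape
    return [
        [layout_offset((i, j), stride) for j in range(cols)]
        for i in range(rows)
    ]


shape = (2, 3)

# Column-major: moving down a column advances by 1, moving across by 2 (the row count).
col_major = layout_table(shape, stride=(1, 2))

# Row-major: moving across a row advances by 1, moving down by 3 (the column count).
row_major = layout_table(shape, stride=(3, 1))

print(col_major)
print(row_major)
```

The same logical 2x3 tensor reaches different memory offsets purely by changing the stride, which is why a kernel can be written once against coordinates and reused across storage orders.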

Architecture Overview

CUTLASS decomposes a matrix operation into a hierarchy of tiles: thread-block tiles, warp tiles, and instruction-level tiles. Each level maps to a hardware construct: thread blocks (grouped into clusters on Hopper), warps (or warp groups on Hopper), and tensor core MMA instructions. The CuTe library describes data layouts and the copy operations between global memory, shared memory, and registers. Architecture-specific template specializations wrap the PTX instructions that drive tensor core MMA operations.
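
The arithmetic of this tile hierarchy can be sketched in a few lines of plain Python. This is an illustration only, not CUTLASS code: the function name `tile_hierarchy` is hypothetical, and the tile shapes (a 128x128x32 thread-block tile, a 64x64x32 warp tile, and a 16x8x16 instruction tile) are assumed example values rather than a recommended configuration.

```python
def ceil_div(a, b):
    """Integer ceiling division, for problems a tile does not divide evenly."""
    return -(-a // b)


def tile_hierarchy(problem, threadblock_tile, warp_tile, instruction_tile):
    """Count the work items at each level of a tiled GEMM with shape (M, N, K)."""
    m, n, k = problem
    tb_m, tb_n, tb_k = threadblock_tile
    w_m, w_n, w_k = warp_tile
    i_m, i_n, i_k = instruction_tile
    return {
        # Each thread block owns one tb_m x tb_n tile of the output matrix.
        "threadblocks": ceil_div(m, tb_m) * ceil_div(n, tb_n),
        # Each thread block walks the K dimension in steps of tb_k.
        "k_iterations": ceil_div(k, tb_k),
        # Warps split the thread-block tile among themselves.
        "warps_per_threadblock": (tb_m // w_m) * (tb_n // w_n),
        # Each warp covers its tile with instruction-sized MMA operations.
        "mma_per_warp_per_k_iteration": (w_m // i_m) * (w_n // i_n) * (w_k // i_k),
    }


counts = tile_hierarchy(
    problem=(1024, 1024, 512),
    threadblock_tile=(128, 128, 32),
    warp_tile=(64, 64, 32),
    instruction_tile=(16, 8, 16),
)
for name, value in counts.items():
    print(f"{name}: {value}")
```

Changing any tile shape shifts work between levels, which is the trade-off a kernel author tunes when picking a configuration.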

Self-Hosting & Configuration

Clone the repository and build with CMake targeting your GPU architecture
Set CUTLASS_NVCC_ARCHS to match your hardware (80 for Ampere, 90a for Hopper)
Use the Python interface for rapid prototyping without writing C++
Integrate as a header-only library into existing CUDA projects
Run the built-in profiler to benchmark kernel configurations
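
The build steps above can be sketched as a small Python helper that assembles the commands without running them. The function name `cutlass_build_commands`, the directory names, and the problem size passed to the profiler are assumptions made for illustration; the `CUTLASS_NVCC_ARCHS` option and the `cutlass_profiler` target come from the CUTLASS build system, but check the repository's own quick-start notes before relying on any specific flag.

```python
import shlex


def cutlass_build_commands(arch="80", jobs=16, src_dir="cutlass", build_dir="cutlass/build"):
    """Return the clone, configure, build, and profile commands as argument lists."""
    return [
        ["git", "clone", "https://github.com/NVIDIA/cutlass.git", src_dir],
        # Configure for one target architecture, e.g. 80 (Ampere) or 90a (Hopper).
        ["cmake", "-S", src_dir, "-B", build_dir, f"-DCUTLASS_NVCC_ARCHS={arch}"],
        # Build only the profiler target to keep compile time down.
        ["cmake", "--build", build_dir, "--target", "cutlass_profiler",
         "--parallel", str(jobs)],
        # Benchmark single-precision GEMM kernels on an example problem size.
        [f"{build_dir}/tools/profiler/cutlass_profiler",
         "--kernels=sgemm", "--m=4096", "--n=4096", "--k=4096"],
    ]


commands = cutlass_build_commands(arch="90a")
for cmd in commands:
    # Print shell-ready lines; nothing is executed here.
    print(shlex.join(cmd))
```

Building for a single architecture instead of the default set is the main lever for shortening compile time, since every extra architecture multiplies the number of kernels instantiated.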

Key Features

Tensor core kernels that can approach peak hardware throughput on supported NVIDIA architectures when well tuned
CuTe layout algebra makes complex data movement patterns expressible and composable
Python bindings allow kernel prototyping and benchmarking without hand-writing C++
Supports asynchronous copy, warp-specialized kernels, and TMA on Hopper GPUs
Adopted by deep learning frameworks such as PyTorch as a kernel backend, and built from primitives similar to those in cuBLAS and cuDNN
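
How close a kernel gets to peak throughput can be checked with simple arithmetic: a GEMM of shape M x N x K performs 2·M·N·K floating-point operations, and dividing by the measured runtime gives achieved throughput. The sketch below is plain Python with hypothetical helper names; the 0.5 ms runtime is an invented example measurement, and the 312 TFLOP/s peak is an assumed figure (the published dense FP16 tensor core peak of an A100) that should be replaced with the value from your own GPU's datasheet.

```python
def gemm_flops(m, n, k):
    """Operations in one GEMM: one multiply and one add per (m, n, k) triple."""
    return 2 * m * n * k


def achieved_tflops(m, n, k, runtime_ms):
    """Throughput in TFLOP/s for a GEMM that took runtime_ms milliseconds."""
    return gemm_flops(m, n, k) / (runtime_ms * 1e-3) / 1e12


def utilization(m, n, k, runtime_ms, peak_tflops):
    """Fraction of the hardware's peak throughput that the kernel reached."""
    return achieved_tflops(m, n, k, runtime_ms) / peak_tflops


# Hypothetical measurement: a 4096^3 GEMM finishing in 0.5 ms.
achieved = achieved_tflops(4096, 4096, 4096, runtime_ms=0.5)

# Assumed peak; substitute the datasheet value for your GPU and data type.
util = utilization(4096, 4096, 4096, runtime_ms=0.5, peak_tflops=312.0)

print(f"achieved: {achieved:.1f} TFLOP/s, utilization: {util:.1%}")
```

The profiler reports runtime and throughput per kernel, so the same calculation is a quick sanity check on its output.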

Comparison with Similar Tools

cuBLAS — pre-compiled, closed-source BLAS library; CUTLASS exposes comparable building blocks as open-source templates you can customize
Triton — Python-based GPU kernel language with auto-tuning; CUTLASS offers lower-level C++ control
FBGEMM — Meta's GEMM library focused on quantized inference; CUTLASS covers broader data types and operations
OpenBLAS — CPU-targeted BLAS; CUTLASS is GPU-only and targets tensor cores
Liger Kernel — Triton-based LLM training kernels; CUTLASS operates at a lower abstraction level

FAQ

Q: Do I need CUTLASS to use NVIDIA GPUs for deep learning? A: No. Frameworks like PyTorch call cuBLAS and cuDNN for you. Use CUTLASS directly when you need custom kernels or fine-grained control over tiling and data movement.

Q: Which GPU architectures are supported? A: Volta (SM70), Turing (SM75), Ampere (SM80), Ada Lovelace (SM89), and Hopper (SM90a), with the latest features targeting Hopper.

Q: Can I use CUTLASS from Python? A: Yes. The cutlass-library Python package provides interfaces for defining, compiling, and profiling GEMM and convolution kernels.

Q: How does CUTLASS relate to Triton? A: Triton generates GPU kernels from Python with automatic optimization. CUTLASS provides C++ templates for maximum performance control, especially for GEMM operations.
