Scripts · May 31, 2026 · 1 min read

NVIDIA CUTLASS — CUDA Templates for High-Performance Linear Algebra

A collection of CUDA C++ template abstractions for implementing high-performance matrix multiplications and convolutions on NVIDIA GPUs.

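Agent ready

This asset can be installed directly by an agent. The agent first selects the current runtime, checks the install plan, and then runs the matching command.

Native · 98/100 · Policy: allowed
Agent entry: any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Entry: NVIDIA CUTLASS

Direct install command:
npx -y tokrepo@latest install 7d20c843-5cea-11f1-9bc6-00163e2b0d79 --target codex

Do a dry run first to confirm the install plan, then run this command.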

Introduction

CUTLASS (CUDA Templates for Linear Algebra Subroutines) is NVIDIA's open-source library of C++ template abstractions for writing high-performance GEMM (General Matrix Multiply) and convolution kernels on NVIDIA GPUs. It exposes, as reusable building blocks, the same kind of hierarchical decomposition and data-movement strategies used to implement cuBLAS and cuDNN, and several deep learning frameworks use it for custom kernels.

What CUTLASS Does

  • Provides composable C++ templates for GEMM, grouped GEMM, and convolution operations
  • Supports FP64, FP32, TF32, FP16, BF16, INT8, and FP8 data types, with availability depending on the GPU architecture (for example, FP8 requires Ada or Hopper-class hardware)
  • Includes CuTe, a layout algebra DSL for expressing tensor data movement patterns
  • Generates kernels optimized for Hopper (SM90), Ampere (SM80), and earlier architectures
  • Offers a Python interface for rapid kernel prototyping, alongside the CUTLASS profiler for benchmarking kernels
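
To make the "layout algebra" idea concrete, here is a minimal sketch in plain Python. It is not the CuTe API: the function names layout_offset and layout_table are hypothetical, and the sketch covers only the simplest case, where a layout is a shape paired with strides and a coordinate maps to the offset sum(coord[i] * stride[i]).

```python
from itertools import product


def layout_offset(coord, stride):
    """Map a logical coordinate to a linear memory offset."""
    return sum(c * s for c, s in zip(coord, stride))


def layout_table(shape, stride):
    """Offsets of every coordinate, visiting the last mode fastest."""
    coords = product(*(range(extent) for extent in shape))
    return [layout_offset(c, stride) for c in coords]


shape = (2, 3)       # 2 rows, 3 columns
row_major = (3, 1)   # moving one column over advances 1 element
col_major = (1, 2)   # moving one row down advances 1 element

print(layout_table(shape, row_major))  # [0, 1, 2, 3, 4, 5]
print(layout_table(shape, col_major))  # [0, 2, 4, 1, 3, 5]
```

The same logical 2x3 tensor is addressed through two different stride tuples, which is why a kernel written against layouts does not need to hard-code a storage order.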

Architecture Overview

CUTLASS decomposes matrix operations into a hierarchy of tiles: thread-block tiles, warp tiles, and instruction-level tiles. Each level maps onto a hardware construct: thread blocks (grouped into thread block clusters on Hopper), warps (or warp groups on Hopper), and tensor core instructions. The CuTe library describes data layouts and the copy operations that move tiles between global memory, shared memory, and registers. Architecture-specific template specializations wrap the tensor core MMA instructions, typically through inline PTX, so the same high-level kernel structure can target different GPU generations.
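
The tile hierarchy can be illustrated with a CPU-only sketch in plain Python. This is a conceptual model, not CUTLASS code: the function tiled_gemm and its block, warp, and k_tile parameters are hypothetical names, and the loops run sequentially where a GPU would run the tiles in parallel.

```python
def naive_gemm(A, B):
    """Reference result: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def tiled_gemm(A, B, block=(4, 4), warp=(2, 2), k_tile=2):
    """Same product, computed tile by tile in three nested levels."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    bm, bn = block
    wm, wn = warp
    for bi in range(0, m, bm):                    # thread-block tile rows
        for bj in range(0, n, bn):                # thread-block tile columns
            i_end, j_end = min(bi + bm, m), min(bj + bn, n)
            for k0 in range(0, k, k_tile):        # main loop over K slices
                k_end = min(k0 + k_tile, k)
                for wi in range(bi, i_end, wm):       # warp tile rows
                    for wj in range(bj, j_end, wn):   # warp tile columns
                        # Innermost level: a small multiply-accumulate,
                        # standing in for one tensor core instruction.
                        for i in range(wi, min(wi + wm, i_end)):
                            for j in range(wj, min(wj + wn, j_end)):
                                for p in range(k0, k_end):
                                    C[i][j] += A[i][p] * B[p][j]
    return C


A = [[i + p for p in range(3)] for i in range(5)]      # 5 x 3
B = [[p * j + 1 for j in range(6)] for p in range(3)]  # 3 x 6
print(tiled_gemm(A, B) == naive_gemm(A, B))
```

The matrix sizes are deliberately not multiples of the tile sizes, so the sketch also exercises the clipped edge tiles that a real kernel must handle.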

Self-Hosting & Configuration

  • Clone the repository and build with CMake targeting your GPU architecture
  • Set CUTLASS_NVCC_ARCHS to match your hardware (80 for Ampere, 90a for Hopper)
  • Use the Python interface for rapid prototyping without writing C++
  • Integrate as a header-only library into existing CUDA projects
  • Run the built-in profiler to benchmark kernel configurations
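
The build steps above can be sketched as a small Python helper that only assembles the CMake command lines and does not run them. The helper name cutlass_build_commands, the ARCH_FLAGS table, and the default directory names are assumptions for illustration; the CUTLASS_NVCC_ARCHS values come from the list above, and cutlass_profiler is assumed to be the profiler's build target.

```python
import shlex

# Architecture values from the configuration notes above; other GPUs
# would need their own entries.
ARCH_FLAGS = {"ampere": "80", "hopper": "90a"}


def cutlass_build_commands(gpu, source_dir="cutlass", build_dir="build"):
    """Return the configure and build commands as argument lists."""
    arch = ARCH_FLAGS[gpu.lower()]
    return [
        # Configure step: restrict compiled kernels to one architecture.
        ["cmake", "-S", source_dir, "-B", build_dir,
         f"-DCUTLASS_NVCC_ARCHS={arch}"],
        # Build step: compile only the profiler target.
        ["cmake", "--build", build_dir, "--target", "cutlass_profiler"],
    ]


for cmd in cutlass_build_commands("hopper"):
    print(shlex.join(cmd))
```

Restricting the architecture list matters in practice because building kernels for every supported GPU generation takes considerably longer than building for the one you have.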

Key Features

  • Performance that can approach the hardware's peak tensor core throughput when tile shapes and kernel schedules are tuned for the target architecture
  • CuTe layout algebra makes complex data movement patterns expressible and composable
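  • Python interface allows kernel prototyping and benchmarking without writing C++ by hand (kernels are still compiled for the GPU behind the scenes)
  • Supports asynchronous copy (Ampere and later), plus warp-specialized kernels and the Tensor Memory Accelerator (TMA) on Hopper GPUs
  • Used as a kernel building block by deep learning software such as PyTorch; implements strategies similar to those used in cuBLAS and cuDNN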
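
Utilization and benchmark figures for GEMM are usually expressed as achieved throughput relative to the hardware peak. The sketch below shows that arithmetic, counting 2*M*N*K floating-point operations per GEMM (one multiply and one add per inner-product term). The function names are hypothetical, and the runtime and peak values are made-up inputs for illustration, not measurements.

```python
def gemm_flops(m, n, k):
    """An (m x k) by (k x n) product does m*n*k multiply-adds = 2*m*n*k FLOPs."""
    return 2 * m * n * k


def achieved_tflops(m, n, k, seconds):
    """Throughput in TFLOP/s for one GEMM that took `seconds` to run."""
    return gemm_flops(m, n, k) / seconds / 1e12


def utilization(m, n, k, seconds, peak_tflops):
    """Fraction of the stated hardware peak that the kernel reached."""
    return achieved_tflops(m, n, k, seconds) / peak_tflops


# Hypothetical inputs: a 4096^3 GEMM, an assumed runtime, an assumed peak.
m = n = k = 4096
runtime_s = 0.5e-3
peak = 300.0

print(f"{achieved_tflops(m, n, k, runtime_s):.1f} TFLOP/s")
print(f"{utilization(m, n, k, runtime_s, peak):.1%} of peak")
```

The peak value has to match the data type being benchmarked, since tensor core peak throughput differs between, say, FP16 and TF32.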

Comparison with Similar Tools

  • cuBLAS: pre-compiled, closed-source BLAS library; CUTLASS exposes comparable techniques as open-source templates that you can customize
  • Triton: Python-based GPU kernel language with auto-tuning; CUTLASS offers lower-level C++ control
  • FBGEMM: Meta's GEMM library focused on quantized inference; CUTLASS covers broader data types and operations
  • OpenBLAS: CPU-targeted BLAS; CUTLASS targets NVIDIA GPUs, with a focus on tensor cores
  • Liger Kernel: Triton-based LLM training kernels; CUTLASS operates at a lower abstraction level

FAQ

Q: Do I need CUTLASS to use NVIDIA GPUs for deep learning? A: No. Frameworks like PyTorch already call cuBLAS and cuDNN, which ship as pre-compiled libraries. Use CUTLASS directly when you need custom kernels or maximum control.

Q: Which GPU architectures are supported? A: Volta (SM70), Turing (SM75), Ampere (SM80), Ada Lovelace (SM89), and Hopper (SM90, built with the 90a target), with the latest features targeting Hopper.

Q: Can I use CUTLASS from Python? A: Yes. The cutlass-library Python package provides interfaces for defining, compiling, and profiling GEMM and convolution kernels.

Q: How does CUTLASS relate to Triton? A: Triton generates GPU kernels from Python with automatic optimization. CUTLASS provides C++ templates for maximum performance control, especially for GEMM operations.
