Configs · May 31, 2026 · 1 min read

Liger Kernel — Efficient Triton Kernels for LLM Training

A collection of fused Triton kernels that reduce GPU memory usage and increase throughput when training large language models with PyTorch.

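Agent Ready

Agents can install this directly

This asset is installable: the agent first selects the current runtime, checks the install plan, and then runs the matching command.

Native · 98/100 · Policy: allowed
Agent entry: any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Established
Entry: Liger Kernel
Direct install command:
npx -y tokrepo@latest install c60fc695-5cea-11f1-9bc6-00163e2b0d79 --target codex

Do a dry run first to confirm the install plan, then run this command.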

Introduction

Liger Kernel by LinkedIn provides drop-in Triton kernel replacements for common operations in LLM training. By fusing operations like RMSNorm, SwiGLU, cross-entropy loss, and RoPE embeddings into single GPU kernels, it reduces memory transfers and lowers peak GPU memory usage, enabling larger batch sizes or longer context lengths on the same hardware.

What Liger Kernel Does

  • Provides fused Triton kernels for RMSNorm, LayerNorm, SwiGLU, GeGLU, and RoPE
  • Implements a fused linear cross-entropy kernel that avoids materializing large logit tensors
  • Reduces peak GPU memory usage by roughly 20-60% during LLM training, depending on model size, sequence length, and vocabulary size
  • Increases training throughput by roughly 10-20% by reducing memory bandwidth pressure
  • Integrates with Hugging Face Transformers via one-line monkey-patching
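
To see why avoiding a full logit tensor matters, a back-of-the-envelope calculation helps. The sketch below uses illustrative numbers that are assumptions (batch size, sequence length, a 128,256-entry vocabulary, float32 logits, and a 1024-token chunk), not Liger benchmarks or Liger's actual chunk size.

```python
def logits_bytes(num_tokens: int, vocab_size: int, bytes_per_element: int) -> int:
    """Bytes needed to hold a dense [num_tokens, vocab_size] logit tensor."""
    return num_tokens * vocab_size * bytes_per_element


# Illustrative values only (assumptions, not measured results).
batch_size, seq_len = 8, 4096
vocab_size = 128_256   # a large modern tokenizer vocabulary
fp32 = 4               # bytes per element if logits are held in float32

# Every token's logits held at once, versus one 1024-token chunk at a time.
full = logits_bytes(batch_size * seq_len, vocab_size, fp32)
chunk = logits_bytes(1024, vocab_size, fp32)

print(f"full logits: {full / 2**30:.2f} GiB")   # full logits: 15.66 GiB
print(f"one chunk:   {chunk / 2**30:.2f} GiB")  # one chunk:   0.49 GiB
```

Under these assumptions the dense logit tensor alone would take about 15.66 GiB, which is why the loss computation is often the largest single memory spike in training.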

Architecture Overview

Each Liger kernel fuses multiple PyTorch operations into a single Triton GPU kernel. For example, the fused linear cross-entropy kernel combines the final linear projection with the softmax and loss computation, so the full vocabulary-sized logit tensor is never materialized in GPU memory. Kernels are written in OpenAI Triton and auto-tune their tile sizes and block configurations. A patching layer swaps standard Hugging Face Transformers modules for their Liger equivalents, which is why the patch is applied before the model is created.
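
The idea behind the fused linear cross-entropy can be illustrated without a GPU. The following is a pure-Python sketch of the chunking principle only, not Liger's Triton implementation: the function name, the list-of-lists layout, and the `chunk_size` parameter are all hypothetical, and the real kernel also computes gradients.

```python
import math


def chunked_linear_cross_entropy(hidden, weight, targets, chunk_size=2):
    """Mean cross-entropy of (hidden @ weight^T) against targets, chunk by chunk.

    hidden:  list of N rows, each a list of H floats
    weight:  list of V rows, each a list of H floats (the output projection)
    targets: list of N class indices in [0, V)
    """
    total_loss = 0.0
    for start in range(0, len(hidden), chunk_size):
        rows = hidden[start:start + chunk_size]
        chunk_targets = targets[start:start + chunk_size]
        # Logits exist only for this chunk: at most chunk_size x V values.
        logits = [
            [sum(h * w for h, w in zip(row, w_row)) for w_row in weight]
            for row in rows
        ]
        for row_logits, target in zip(logits, chunk_targets):
            m = max(row_logits)  # subtract the max for numerical stability
            log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row_logits))
            # Cross-entropy for one token: logsumexp(logits) - logits[target]
            total_loss += log_sum_exp - row_logits[target]
    return total_loss / len(hidden)
```

Calling the function with `chunk_size=1` or with `chunk_size=len(hidden)` gives the same loss; only the number of logits alive at any moment changes.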

Self-Hosting & Configuration

  • Install via pip and apply with a single function call before model loading
  • Supports Llama, Mistral, Gemma, Phi, and Qwen model families out of the box
  • Use individual kernel imports for selective optimization of custom architectures
  • Compatible with FSDP, DeepSpeed, and standard PyTorch DDP training setups
  • Requires a GPU with Triton support: NVIDIA is the primary target (Ampere or newer recommended), and the project also lists AMD ROCm as a supported backend
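
The install-and-patch step described above might look like the sketch below, assuming the package was installed with `pip install liger-kernel`. The wrapper `patch_llama_if_available` is a hypothetical helper written for this example; `apply_liger_kernel_to_llama` is the patch function the library exposes for Llama models, and other model families have their own equivalents.

```python
def patch_llama_if_available() -> bool:
    """Apply Liger's Llama patch when liger-kernel is installed.

    Returns True if the patch was applied, False if the package
    (or one of its dependencies) could not be imported.
    """
    try:
        # Provided by `pip install liger-kernel`.
        from liger_kernel.transformers import apply_liger_kernel_to_llama
    except ImportError:
        return False
    # Call this before instantiating the model, so the model is built
    # from the patched modules rather than the stock ones.
    apply_liger_kernel_to_llama()
    return True
```

In a training script the helper would be called once at startup, before the model is loaded from Hugging Face Transformers. The rest of the training loop stays unchanged.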

Key Features

  • One-line integration with Hugging Face Transformers, with no code restructuring required
  • Fused cross-entropy eliminates the largest memory spike in LLM training
  • All kernels include both forward and backward passes for end-to-end training
  • Exact computation with no approximations: outputs match the standard PyTorch implementations within floating-point tolerance
  • Actively maintained with support for new model architectures as they are released
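
Numerical equivalence between a fused and an unfused implementation is normally checked within a tolerance, because fusing can reorder floating-point operations. The pure-Python sketch below shows that kind of check for RMSNorm. All three functions are hypothetical illustrations and are not taken from Liger's test suite.

```python
import math


def rms_norm_reference(x, weight, eps=1e-6):
    """Step-by-step RMSNorm: square, mean, inverse root, normalize, scale."""
    squares = [v * v for v in x]
    mean_square = sum(squares) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    normalized = [v * inv_rms for v in x]
    return [n * w for n, w in zip(normalized, weight)]


def rms_norm_fused(x, weight, eps=1e-6):
    """Same math with normalization and scaling folded into one loop."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    # Multiplication order differs from the reference on purpose.
    return [v * (inv_rms * w) for v, w in zip(x, weight)]


def all_close(a, b, rel_tol=1e-9, abs_tol=1e-12):
    """True when both sequences have equal length and match element-wise."""
    return len(a) == len(b) and all(
        math.isclose(p, q, rel_tol=rel_tol, abs_tol=abs_tol)
        for p, q in zip(a, b)
    )
```

Comparing `rms_norm_reference(x, w)` with `rms_norm_fused(x, w)` through `all_close` passes even when the two results differ in the last bit, which is the sense in which a fused kernel is equivalent to the operations it replaces.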

Comparison with Similar Tools

  • FlashAttention — fuses attention computation; Liger Kernel fuses non-attention operations (norms, activations, loss)
  • Unsloth — provides fused kernels bundled with a training framework; Liger Kernel is a standalone kernel library
  • xformers — memory-efficient attention from Meta; Liger Kernel targets normalization, activation, and loss kernels
  • NVIDIA Apex — fused optimizers and norms in CUDA; Liger Kernel uses Triton for portability and easier customization
  • torch.compile — compiler-based fusion; Liger Kernel provides hand-tuned fusions that often outperform automatic compilation

FAQ

Q: Does Liger Kernel change model outputs? A: No. The fused kernels compute the same operations as the standard PyTorch modules they replace, without approximations, so outputs match within floating-point tolerance.

Q: Which models are supported? A: Llama, Mistral, Gemma, Phi, Qwen, and other architectures that use standard transformer building blocks.

Q: Can I use it with DeepSpeed or FSDP? A: Yes. Liger kernels operate at the module level, so they work with FSDP, DeepSpeed, and standard PyTorch DDP.

Q: How much memory does it save? A: Peak memory typically drops by 20-60%, depending on model size, sequence length, and vocabulary size.
