Liger Kernel — Efficient Triton Kernels for LLM Training

Introduction

Liger Kernel by LinkedIn provides drop-in Triton kernel replacements for common operations in LLM training. By fusing operations such as RMSNorm, SwiGLU, cross-entropy loss, and rotary position embeddings (RoPE) into single GPU kernels, it reduces memory transfers and lowers peak GPU memory usage, which allows larger batch sizes or longer context lengths on the same hardware.

What Liger Kernel Does

Provides fused Triton kernels for RMSNorm, LayerNorm, SwiGLU, GeGLU, and RoPE
Implements a fused linear cross-entropy kernel that avoids materializing large logit tensors
Reduces peak GPU memory usage by a reported 20-60% during LLM training, depending on model, sequence length, and vocabulary size
Increases training throughput by a reported 10-20% through reduced memory bandwidth pressure
Integrates with Hugging Face Transformers via one-line monkey-patching
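
A rough calculation shows why avoiding the logit tensor matters. The sketch below simply multiplies out the shape of a dense logit tensor; the batch size, sequence length, and vocabulary size are illustrative assumptions, not figures from Liger Kernel's benchmarks, and `logits_bytes` is a hypothetical helper.

```python
def logits_bytes(batch_size: int, seq_len: int, vocab_size: int, bytes_per_element: int) -> int:
    """Bytes needed to hold a dense [batch, seq, vocab] logit tensor."""
    return batch_size * seq_len * vocab_size * bytes_per_element


# Assumed training shape: 8 sequences of 4096 tokens, 128,256-entry vocabulary, fp32 logits.
fp32_bytes = logits_bytes(batch_size=8, seq_len=4096, vocab_size=128_256, bytes_per_element=4)
print(f"fp32 logits: {fp32_bytes / 2**30:.2f} GiB")  # prints "fp32 logits: 15.66 GiB"
```

Under these assumed shapes the logits alone would occupy more than 15 GiB before gradients are counted, which is the spike a fused linear cross-entropy kernel is meant to avoid.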

Architecture Overview

Each Liger kernel fuses multiple PyTorch operations into a single Triton GPU kernel. For example, the fused cross-entropy kernel computes the linear projection, softmax, and loss in a single pass without materializing the full vocabulary-sized logit tensor in GPU memory. Kernels are written in OpenAI Triton and auto-tune their tile sizes and block configurations. A patching layer replaces standard Hugging Face Transformers modules with Liger equivalents at model initialization.
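
The idea of computing the loss without holding every token's logits at once can be illustrated on the CPU. This is a pure-Python sketch, not Liger's Triton kernel: the function names are hypothetical, and processing the tokens in fixed-size chunks is an assumption chosen to make the memory behavior visible.

```python
import math


def log_sum_exp(logits):
    """Numerically stable log(sum(exp(logits)))."""
    peak = max(logits)
    return peak + math.log(sum(math.exp(z - peak) for z in logits))


def chunked_linear_cross_entropy(hiddens, weight, targets, chunk_size):
    """Mean cross-entropy over all tokens, keeping at most `chunk_size` rows of logits alive."""
    total = 0.0
    for start in range(0, len(hiddens), chunk_size):
        stop = start + chunk_size
        # Project only this chunk of hidden states onto the vocabulary.
        chunk_logits = [
            [sum(h * w for h, w in zip(hidden, w_row)) for w_row in weight]
            for hidden in hiddens[start:stop]
        ]
        # Reduce the chunk to a scalar; its logits are discarded on the next iteration.
        for logits, target in zip(chunk_logits, targets[start:stop]):
            total += log_sum_exp(logits) - logits[target]
    return total / len(hiddens)


weight = [[0.5, -1.0], [1.5, 0.25], [-0.75, 2.0]]  # vocabulary of 3, hidden size 2
hiddens = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 1.0]]
targets = [2, 0, 1, 2]

full = chunked_linear_cross_entropy(hiddens, weight, targets, chunk_size=len(hiddens))
chunked = chunked_linear_cross_entropy(hiddens, weight, targets, chunk_size=2)
print(math.isclose(full, chunked))
```

The loss does not depend on the chunk size, so the peak memory for logits scales with the chunk rather than with the full batch of tokens.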

Self-Hosting & Configuration

Install via pip and apply with a single function call before model loading
Supports Llama, Mistral, Gemma, Phi, and Qwen model families out of the box
Use individual kernel imports for selective optimization of custom architectures
Compatible with FSDP, DeepSpeed, and standard PyTorch DDP training setups
Requires a GPU with Triton support; NVIDIA is the primary target (Ampere or newer recommended), and the project also documents AMD ROCm support
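
The install-and-patch flow described above might look like the following minimal sketch. It assumes the package is installed with `pip install liger-kernel` and uses the `apply_liger_kernel_to_llama` entry point from `liger_kernel.transformers`; the wrapper function `patch_llama_with_liger` is a hypothetical helper added here so the snippet degrades gracefully when the package is missing.

```python
import importlib.util


def patch_llama_with_liger() -> bool:
    """Apply Liger's Llama patch if the package is installed; report whether it ran."""
    if importlib.util.find_spec("liger_kernel") is None:
        return False  # not installed; try `pip install liger-kernel` first
    from liger_kernel.transformers import apply_liger_kernel_to_llama

    # Call this before instantiating the model so the patched modules are picked up.
    apply_liger_kernel_to_llama()
    return True


if patch_llama_with_liger():
    print("Liger kernels patched into the Llama modeling code")
else:
    print("liger_kernel not installed; using stock Transformers modules")
```

After a successful patch, the model is loaded as usual with Hugging Face Transformers. Other model families have their own patch functions, so check the project documentation for the exact name for your architecture.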

Key Features

One-line integration with Hugging Face Transformers requires no code restructuring
Fused cross-entropy eliminates the largest memory spike in LLM training
All kernels include both forward and backward passes for end-to-end training
Computes the same operations as the standard PyTorch implementations without approximation; outputs agree within floating-point tolerance
Actively maintained with support for new model architectures as they are released
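
Equivalence between a fused and an unfused implementation is usually checked by comparing outputs within a tolerance. The sketch below does this for RMSNorm in pure Python. It is an illustration only, not Liger's code: both function names, the `eps` default, and the tolerance are assumptions.

```python
import math


def rms_norm_reference(x, weight, eps=1e-6):
    """Step-by-step RMSNorm: separate square, mean, inverse-root, and scale stages."""
    squares = [v * v for v in x]
    mean_square = sum(squares) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    normalized = [v * inv_rms for v in x]
    return [n * w for n, w in zip(normalized, weight)]


def rms_norm_fused(x, weight, eps=1e-6):
    """Same math with the intermediate lists folded away, in the spirit of a fused kernel."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


x = [0.5, -1.25, 3.0, 2.0]
weight = [1.0, 0.5, 2.0, 1.5]
matches = all(
    math.isclose(a, b, rel_tol=1e-9)
    for a, b in zip(rms_norm_reference(x, weight), rms_norm_fused(x, weight))
)
print(matches)
```

A real check would run the same comparison on GPU tensors for both the forward output and the gradients, with tolerances chosen per dtype.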

Comparison with Similar Tools

FlashAttention — fuses attention computation; Liger Kernel fuses non-attention operations (norms, activations, loss)
Unsloth — provides fused kernels bundled with a training framework; Liger Kernel is a standalone kernel library
xformers — memory-efficient attention from Meta; Liger Kernel targets normalization, activation, and loss kernels
NVIDIA Apex — fused optimizers and norms in CUDA; Liger Kernel uses Triton for portability and easier customization
torch.compile — compiler-based fusion; Liger Kernel provides hand-written fusions for specific operations, which can outperform automatic compilation in those cases

FAQ

Q: Does Liger Kernel change model outputs? A: No. The fused kernels compute the same operations as the standard PyTorch modules they replace, with no approximations, so outputs agree within floating-point tolerance.

Q: Which models are supported? A: Llama, Mistral, Gemma, Phi, Qwen, and other architectures that use standard transformer building blocks.

Q: Can I use it with DeepSpeed or FSDP? A: Yes. Liger kernels operate at the module level and are compatible with FSDP, DeepSpeed, and standard PyTorch DDP.

Q: How much memory does it save? A: Typical savings are 20-60% peak memory reduction depending on model size, sequence length, and vocabulary size.
