Introduction
MMAction2 is the next-generation video understanding toolbox from OpenMMLab. It covers action recognition, temporal action localization, and spatial-temporal action detection, providing a consistent PyTorch-based framework for researchers and practitioners working with video data.
What MMAction2 Does
- Classifies human actions in video clips using 20+ recognition models
- Localizes action segments temporally within untrimmed videos
- Detects actions in both space and time, assigning action labels to individual people in a clip, as in the AVA benchmark
- Supports skeleton-based action recognition via PoseC3D and ST-GCN
- Benchmarks on Kinetics, Something-Something, AVA, and more
Architecture Overview
MMAction2 uses MMEngine as its training backend with a registry pattern for models, datasets, and pipelines. Recognition models process fixed-length clips through backbones like ResNet3D, SlowFast, or Video Swin Transformer. Temporal detectors use proposal generation and classification stages. All components are configured via Python config files.
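To make the registry pattern concrete, here is a minimal pure-Python sketch of how a `type` string in a config dict can be resolved to a class and instantiated. It is a simplified stand-in, not MMEngine's actual implementation: the `Registry` class below is a toy, and `ToyBackbone` and `ToyRecognizer` are hypothetical names used only for illustration.

```python
from typing import Any, Callable, Dict


class Registry:
    """Toy registry mapping class names to classes (not MMEngine's real class)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._modules: Dict[str, type] = {}

    def register_module(self) -> Callable[[type], type]:
        """Return a decorator that records the decorated class under its name."""
        def decorator(cls: type) -> type:
            self._modules[cls.__name__] = cls
            return cls
        return decorator

    def build(self, cfg: Dict[str, Any]) -> Any:
        """Instantiate the class named by cfg['type'] with the remaining keys."""
        args = dict(cfg)  # shallow copy so the caller's config is not mutated
        cls_name = args.pop("type")
        if cls_name not in self._modules:
            raise KeyError(f"{cls_name} is not registered in {self.name}")
        return self._modules[cls_name](**args)


MODELS = Registry("models")


@MODELS.register_module()
class ToyBackbone:  # hypothetical component
    def __init__(self, depth: int = 50) -> None:
        self.depth = depth


@MODELS.register_module()
class ToyRecognizer:  # hypothetical component
    def __init__(self, backbone: Dict[str, Any], num_classes: int) -> None:
        # Nested configs are built recursively through the same registry.
        self.backbone = MODELS.build(backbone)
        self.num_classes = num_classes


# A config is plain Python data; swapping the backbone means changing one dict.
model_cfg = dict(
    type="ToyRecognizer",
    backbone=dict(type="ToyBackbone", depth=18),
    num_classes=400,
)
model = MODELS.build(model_cfg)
```

Because components are looked up by name, replacing a backbone or head only requires editing the config, not the code that builds the model.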
Self-Hosting & Configuration
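- Install PyTorch first, then mmengine, mmcv, and mmaction2 (pip works; OpenMMLab's mim installer helps select an mmcv build that matches your PyTorch and CUDA versions)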
- Download pre-trained checkpoints from the model zoo
- Prepare video datasets in the expected directory structure
- Modify config files for custom class labels and data paths
- Launch multi-GPU distributed training with the bundled tools/dist_train.sh script or with torchrun
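Customizing a config for your own classes and data paths usually means starting from a base config and overriding a few nested fields. The sketch below illustrates that idea with a plain recursive dict merge. It does not use the real config loader, and the field names (`cls_head`, `num_classes`, `ann_file`, `data_prefix`) and all paths are assumptions modeled on typical recognition configs, so check them against the config you actually inherit from.

```python
import copy
from typing import Any, Dict


def merge_config(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base` with `override` applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge key by key instead of replacing.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative base config; field names and paths are assumptions.
base_cfg = dict(
    model=dict(
        type="Recognizer2D",
        cls_head=dict(type="TSNHead", num_classes=400),
    ),
    train_dataloader=dict(
        batch_size=8,
        dataset=dict(
            ann_file="data/kinetics400/train.txt",
            data_prefix=dict(video="data/kinetics400/videos"),
        ),
    ),
)

# Override only what differs for a hypothetical 5-class custom dataset.
custom_cfg = merge_config(
    base_cfg,
    dict(
        model=dict(cls_head=dict(num_classes=5)),
        train_dataloader=dict(
            dataset=dict(
                ann_file="data/my_actions/train.txt",
                data_prefix=dict(video="data/my_actions/videos"),
            )
        ),
    ),
)
```

Fields that are not overridden, such as the head type and the batch size, carry over from the base config unchanged.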
Key Features
- Comprehensive coverage of action recognition paradigms (RGB, flow, skeleton)
- Includes UniFormerV2 and VideoMAE, which are among the strongest published models on the Kinetics benchmarks
- Modular design allows swapping backbones and temporal heads
- Pre-built data pipelines for common video dataset formats
- Integration with MMDeploy for production model conversion
Comparison with Similar Tools
- SlowFast (FAIR) — reference implementation of the SlowFast network; MMAction2 includes SlowFast plus many other methods
- PyTorchVideo — provides video-specific transforms and models; MMAction2 offers a broader set of methods and benchmarks
- TimeSformer — single Transformer architecture; MMAction2 supports TimeSformer alongside CNN and hybrid approaches
- Decord — video decoding library; MMAction2 uses Decord internally but adds full training and evaluation pipelines
FAQ
Q: Can I use MMAction2 for real-time action detection? A: Yes. Lightweight models like MobileNetV2-TSM can run in real time on modern GPUs.
Q: Does it support skeleton-based recognition? A: Yes. PoseC3D and ST-GCN models accept skeleton sequences extracted with MMPose.
Q: What video formats are supported? A: MMAction2 reads any format supported by Decord or OpenCV, including MP4, AVI, and MKV.
Q: Can I fine-tune on my own action classes? A: Yes. Update the label map and annotation files, then fine-tune from a Kinetics-pretrained checkpoint.
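As a rough illustration of the label map and annotation step in the fine-tuning answer, the sketch below derives integer class indices from class names and emits one annotation line per video. It assumes a plain-text annotation format of `relative/video/path label_index` per line; that format, the helper names, and the sample paths are assumptions to verify against the dataset class you use.

```python
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # (relative video path, class name)


def build_label_map(samples: List[Sample]) -> Dict[str, int]:
    """Assign a stable integer index to each class name (sorted alphabetically)."""
    names = sorted({name for _, name in samples})
    return {name: idx for idx, name in enumerate(names)}


def to_annotation_lines(samples: List[Sample], label_map: Dict[str, int]) -> List[str]:
    """Format each sample as 'path label_index' (assumed annotation format)."""
    return [f"{path} {label_map[name]}" for path, name in samples]


# Hypothetical custom dataset with two action classes.
samples = [
    ("clips/a001.mp4", "wave"),
    ("clips/a002.mp4", "clap"),
    ("clips/a003.mp4", "wave"),
]

label_map = build_label_map(samples)
lines = to_annotation_lines(samples, label_map)
num_classes = len(label_map)  # should match the classification head's class count
```

The resulting `num_classes` is the value the classification head in the config needs to match before fine-tuning from the pretrained checkpoint.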