Scripts · May 12, 2026 · 2 min read

MMAction2 — OpenMMLab Video Understanding Toolbox

MMAction2 provides a modular framework for action recognition, temporal action detection, and spatial-temporal action detection with 20+ methods and support for major video benchmarks.

Introduction

MMAction2 is the next-generation video understanding toolbox from OpenMMLab. It covers action recognition, temporal action localization, and spatial-temporal action detection, providing a consistent PyTorch-based framework for researchers and practitioners working with video data.

What MMAction2 Does

  • Classifies human actions in video clips using 20+ recognition models
  • Localizes action segments temporally within untrimmed videos
  • Detects actions in space and time with spatial-temporal models
  • Supports skeleton-based action recognition via PoseC3D and ST-GCN
  • Benchmarks on Kinetics, Something-Something, AVA, and more
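
Temporal localization results are usually scored by how well predicted segments overlap annotated ones. The following is a minimal sketch of that overlap measure (temporal intersection-over-union) in plain Python. The function names `temporal_iou` and `match_segments`, the 0.5 threshold, and the sample segments are illustrative assumptions, not MMAction2 APIs or benchmark data.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) segments, in seconds."""
    (pred_start, pred_end), (gt_start, gt_end) = pred, gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def match_segments(preds, gts, threshold=0.5):
    """Keep predictions that overlap any ground-truth segment at or above threshold."""
    return [p for p in preds if any(temporal_iou(p, g) >= threshold for g in gts)]


# Illustrative values only: one annotated action and two predicted segments.
ground_truth = [(2.0, 6.0)]
predictions = [(2.5, 6.5), (10.0, 12.0)]
hits = match_segments(predictions, ground_truth)
print(hits)
```

Only the first prediction overlaps the annotated segment enough to count as a hit; the second does not overlap it at all.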

Architecture Overview

MMAction2 uses MMEngine as its training backend with a registry pattern for models, datasets, and pipelines. Recognition models process fixed-length clips through backbones like ResNet3D, SlowFast, or Video Swin Transformer. Temporal detectors use proposal generation and classification stages. All components are configured via Python config files.
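
The registry idea described above can be sketched in a few lines of plain Python. This is a toy stand-in written for illustration, not MMEngine's actual `Registry` class: the `Registry.register` and `Registry.build` methods and the `ToyResNet3d` and `ToySwin` classes are all hypothetical names.

```python
class Registry:
    """Toy name-to-class registry (illustrative, not the MMEngine implementation)."""

    def __init__(self, name):
        self.name = name
        self._modules = {}

    def register(self, cls):
        # Used as a decorator: store the class under its own name.
        self._modules[cls.__name__] = cls
        return cls

    def build(self, cfg):
        # Copy first so the caller's config dict is left untouched.
        args = dict(cfg)
        cls = self._modules[args.pop("type")]
        return cls(**args)


BACKBONES = Registry("backbone")


@BACKBONES.register
class ToyResNet3d:
    def __init__(self, depth=50):
        self.depth = depth


@BACKBONES.register
class ToySwin:
    def __init__(self, window_size=(8, 7, 7)):
        self.window_size = window_size


# Swapping the backbone only requires changing the "type" entry in the config.
cfg_a = dict(type="ToyResNet3d", depth=18)
cfg_b = dict(type="ToySwin")
backbone_a = BACKBONES.build(cfg_a)
backbone_b = BACKBONES.build(cfg_b)
print(type(backbone_a).__name__, backbone_a.depth)
print(type(backbone_b).__name__, backbone_b.window_size)
```

Because components are looked up by name, a config file can describe a whole model as nested dictionaries without importing any model class directly.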

Self-Hosting & Configuration

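  • Install MMEngine and MMCV first (the OpenMMLab docs recommend the mim installer from the openmim package), then install mmaction2 via pip or from source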
  • Download pre-trained checkpoints from the model zoo
  • Prepare video datasets in the expected directory structure
  • Modify config files for custom class labels and data paths
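  • Launch multi-GPU distributed training with the repository's tools/dist_train.sh script, or with torchrun and the --launcher pytorch option of tools/train.py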
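
Adapting a config for custom classes and data paths usually means overriding a few fields of an inherited base config. The sketch below imitates that with plain dictionaries and a hypothetical `merge_cfg` helper, a simplified stand-in for the config inheritance that MMEngine performs. The field names are modeled on typical recognizer configs, and every path and value is a placeholder assumption.

```python
import copy


def merge_cfg(base, override):
    """Recursively merge override into a copy of base (simplified inheritance)."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_cfg(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical base config shaped like a Kinetics-400 recognizer config.
base_cfg = dict(
    model=dict(
        type="Recognizer3D",
        cls_head=dict(type="I3DHead", num_classes=400, in_channels=2048),
    ),
    data_root="data/kinetics400/videos_train",
    ann_file_train="data/kinetics400/train_list.txt",
    load_from=None,
)

# Overrides for a custom 5-class dataset; all paths are placeholders.
custom_cfg = dict(
    model=dict(cls_head=dict(num_classes=5)),
    data_root="data/my_dataset/videos",
    ann_file_train="data/my_dataset/train_list.txt",
    load_from="checkpoints/pretrained_kinetics.pth",
)

cfg = merge_cfg(base_cfg, custom_cfg)
print(cfg["model"]["cls_head"])
```

Only the overridden keys change: the head type and input channels are inherited from the base, while the class count, data paths, and checkpoint come from the custom config.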

Key Features

  • Comprehensive coverage of action recognition paradigms (RGB, flow, skeleton)
  • Includes UniFormerV2 and VideoMAE, which reported state-of-the-art or near state-of-the-art accuracy on Kinetics when they were released
  • Modular design allows swapping backbones and temporal heads
  • Pre-built data pipelines for common video dataset formats
  • Integration with MMDeploy for production model conversion

Comparison with Similar Tools

  • SlowFast (FAIR): reference implementation of the SlowFast network; MMAction2 includes SlowFast plus many other methods
  • PyTorchVideo: provides video-specific transforms and models; MMAction2 offers a broader set of methods and benchmarks
  • TimeSformer: a single Transformer architecture; MMAction2 supports TimeSformer alongside CNN and hybrid approaches
  • Decord: video decoding library; MMAction2 can use Decord as a decoding backend and adds full training and evaluation pipelines on top

FAQ

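Q: Can I use MMAction2 for real-time action detection? A: Yes, for clip-level recognition. Lightweight models such as TSM with a MobileNetV2 backbone can run in real time on modern GPUs, although actual throughput depends on input resolution, clip length, and hardware.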

Q: Does it support skeleton-based recognition? A: Yes. PoseC3D and ST-GCN models accept skeleton sequences extracted with MMPose.

Q: What video formats are supported? A: MMAction2 reads any format supported by Decord or OpenCV, including MP4, AVI, and MKV.

Q: Can I fine-tune on my own action classes? A: Yes. Update the label map and annotation files, then fine-tune from a Kinetics-pretrained checkpoint.
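
The label map and annotation files mentioned in the last answer can be generated with a short script. This is a minimal sketch that assumes the plain-text layout commonly used for video classification lists: one class name per line in the label map, and one `video_path label_index` pair per line in the annotation file. The `build_annotations` helper, the file names, and the sample clips are hypothetical.

```python
from pathlib import Path
import tempfile


def build_annotations(samples, out_dir):
    """Write a label map and a 'video_path label_index' list for a custom dataset."""
    out_dir = Path(out_dir)
    # Sort class names so the index assignment is deterministic.
    class_names = sorted({label for _, label in samples})
    class_to_idx = {name: idx for idx, name in enumerate(class_names)}

    label_map = out_dir / "label_map.txt"
    label_map.write_text("\n".join(class_names) + "\n", encoding="utf-8")

    ann_file = out_dir / "train_list.txt"
    lines = [f"{video} {class_to_idx[label]}" for video, label in samples]
    ann_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return class_to_idx, ann_file


# Placeholder clips and labels, written to a temporary directory for the demo.
samples = [("clips/a.mp4", "wave"), ("clips/b.mp4", "jump"), ("clips/c.mp4", "wave")]
with tempfile.TemporaryDirectory() as tmp:
    class_to_idx, ann_file = build_annotations(samples, tmp)
    annotation_text = ann_file.read_text(encoding="utf-8")
print(annotation_text)
```

Because the list is space-separated, video paths that contain spaces would need a different delimiter or renaming before use.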
