
Stable Baselines3 — Reliable Reinforcement Learning in PyTorch

A set of reliable implementations of reinforcement learning algorithms in PyTorch, including PPO, SAC, TD3, DQN, and more.

Introduction

Stable Baselines3 (SB3) provides clean, tested implementations of popular reinforcement learning algorithms built on PyTorch. It focuses on reproducibility and ease of use, letting researchers and practitioners train RL agents with minimal boilerplate.
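
A minimal end-to-end example, sketched against the documented API (the environment id and step counts are arbitrary):

    import gymnasium as gym
    from stable_baselines3 import PPO

    # Train a PPO agent on CartPole and save it.
    env = gym.make("CartPole-v1")
    model = PPO("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=10_000)
    model.save("ppo_cartpole")

    # Reload the model and run the trained policy.
    model = PPO.load("ppo_cartpole", env=env)
    obs, info = env.reset()
    for _ in range(200):
        action, _state = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            obs, info = env.reset()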

What Stable Baselines3 Does

  • Implements PPO, A2C, SAC, TD3, DQN, and DDPG, plus Hindsight Experience Replay (HER) for goal relabeling
  • Provides a unified API across all algorithms for training, evaluation, and saving
  • Supports custom environments through the Gymnasium interface
  • Includes a callback system for logging, early stopping, and checkpointing (see the sketch after this list)
  • Offers vectorized environments for parallel data collection
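
The callback system plugs into learn(); a minimal sketch (save paths and frequencies are illustrative):

    import gymnasium as gym
    from stable_baselines3 import PPO
    from stable_baselines3.common.callbacks import CheckpointCallback, EvalCallback

    env = gym.make("CartPole-v1")
    eval_env = gym.make("CartPole-v1")

    # Checkpoint every 5,000 steps; evaluate every 2,500 steps, keeping the best model.
    checkpoint_cb = CheckpointCallback(save_freq=5_000, save_path="./checkpoints/")
    eval_cb = EvalCallback(eval_env, eval_freq=2_500, best_model_save_path="./best_model/")

    model = PPO("MlpPolicy", env)
    model.learn(total_timesteps=20_000, callback=[checkpoint_cb, eval_cb])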

Architecture Overview

SB3 follows a modular design where each algorithm inherits from a base class that handles environment interaction, rollout collection, and logging. Policy networks are defined as PyTorch modules with configurable architecture. The training loop collects experience using vectorized environments, computes losses specific to each algorithm, and updates parameters through standard PyTorch optimizers.
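
A small sketch of what that design looks like from the outside (the printed optimizer type is whatever the algorithm configured, typically Adam):

    import torch
    from stable_baselines3 import PPO

    # Algorithms accept an environment id directly; the policy network
    # is built when the model is constructed.
    model = PPO("MlpPolicy", "CartPole-v1")

    # The policy is an ordinary PyTorch module...
    assert isinstance(model.policy, torch.nn.Module)
    for name, param in model.policy.named_parameters():
        print(name, tuple(param.shape))

    # ...whose parameters are updated by a standard torch optimizer in learn().
    print(type(model.policy.optimizer))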

Self-Hosting & Configuration

  • Install: pip install "stable-baselines3[extra]" for full dependencies, including TensorBoard (quoting the extras keeps shells like zsh from expanding the brackets)
  • Create environments with Gymnasium: gym.make('LunarLander-v3') or custom gym.Env subclasses
  • Configure hyperparameters via the constructor: PPO('MlpPolicy', env, learning_rate=3e-4, n_steps=2048) (these steps are combined in the sketch after this list)
  • Use VecEnv wrappers for parallel training: make_vec_env('CartPole-v1', n_envs=4)
  • Monitor training with TensorBoard: tensorboard --logdir ./tb_logs/
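
Tying those steps together (a sketch; the hyperparameters shown are common starting points, not tuned values):

    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_vec_env

    # Collect experience from four CartPole instances in parallel.
    vec_env = make_vec_env("CartPole-v1", n_envs=4)

    model = PPO(
        "MlpPolicy",
        vec_env,
        learning_rate=3e-4,
        n_steps=2048,
        tensorboard_log="./tb_logs/",  # view with: tensorboard --logdir ./tb_logs/
    )
    model.learn(total_timesteps=100_000)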

Key Features

  • Thoroughly tested with continuous integration and unit tests for each algorithm
  • Type-annotated codebase with comprehensive documentation
  • Built-in experiment manager (RL Zoo) for hyperparameter tuning
  • Support for Dict and image-based observation spaces
  • Hindsight Experience Replay (HER) for goal-conditioned tasks (sketched after this list)
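
HER is enabled by swapping the replay buffer on an off-policy algorithm. A sketch, assuming a goal-conditioned environment with Dict observations (the FetchReach id is an example and requires gymnasium-robotics to be installed):

    import gymnasium as gym
    from stable_baselines3 import SAC, HerReplayBuffer

    # Goal-conditioned envs expose Dict observations with "observation",
    # "achieved_goal", and "desired_goal" keys.
    env = gym.make("FetchReach-v2")

    model = SAC(
        "MultiInputPolicy",  # required for Dict observation spaces
        env,
        replay_buffer_class=HerReplayBuffer,
        replay_buffer_kwargs=dict(
            n_sampled_goal=4,                  # relabeled goals per real transition
            goal_selection_strategy="future",  # relabel with later achieved goals
        ),
    )
    model.learn(total_timesteps=50_000)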

Comparison with Similar Tools

  • RLlib (Ray) — distributed RL at scale; SB3 focuses on single-node simplicity and clarity
  • CleanRL — single-file implementations for education; SB3 is more feature-complete for production
  • Tianshou — modular PyTorch RL; SB3 has larger community and more tested algorithms
  • Gymnasium — environment interface standard; SB3 provides the algorithms that train on them

FAQ

Q: How do I use a custom neural network architecture? A: Pass policy_kwargs=dict(net_arch=[256, 256]) or define a custom features extractor class.
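
For instance (layer sizes are arbitrary):

    from stable_baselines3 import PPO

    # Separate 256-unit hidden layers for the policy (pi) and value (vf) heads.
    policy_kwargs = dict(net_arch=dict(pi=[256, 256], vf=[256, 256]))
    model = PPO("MlpPolicy", "CartPole-v1", policy_kwargs=policy_kwargs)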

Q: Can SB3 train on GPU? A: Yes. Pass device="cuda" to the algorithm constructor.

Q: Is multi-agent RL supported? A: Not natively. For simple multi-agent scenarios, PettingZoo environments can be converted into an SB3-compatible vectorized environment (e.g. with SuperSuit) so a single shared policy can be trained.

Q: How do I resume training from a checkpoint? A: Use model = PPO.load("checkpoint", env=env) then call model.learn() again.
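
A sketch, assuming a checkpoint saved earlier as checkpoint.zip:

    import gymnasium as gym
    from stable_baselines3 import PPO

    env = gym.make("CartPole-v1")
    model = PPO.load("checkpoint", env=env)  # ".zip" is appended automatically

    # reset_num_timesteps=False keeps the timestep counter (and logging curves)
    # continuous across runs instead of restarting from zero.
    model.learn(total_timesteps=50_000, reset_num_timesteps=False)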
