Introduction
Luigi is a Python package developed by Spotify for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, and visualization so you can focus on the actual data transformations rather than orchestration plumbing.
What Luigi Does
- Defines tasks as Python classes with explicit input/output dependencies
- Automatically resolves execution order across large task graphs
- Provides a built-in web dashboard for monitoring pipeline progress
- Supports targets on local disk, S3, HDFS, and databases
- Retries failed tasks and sends configurable failure notifications
Architecture Overview
Luigi models pipelines as directed acyclic graphs (DAGs) of Task objects. Each Task declares its dependencies via a requires() method and its output via an output() method, which returns one or more Target objects. A task counts as complete once its output target exists. The central scheduler tracks which tasks are still pending and coordinates the workers that run them. A lightweight web server visualizes the DAG and task states in real time.
Self-Hosting & Configuration
- Install with `pip install luigi` and optionally `pip install luigi[toml]` for TOML config
- Run the central scheduler with `luigid` for multi-worker coordination
- Configure via `luigi.cfg`, or `luigi.toml` when the TOML parser is enabled, using sections such as `[core]`, `[scheduler]`, and `[worker]`
- Pass `--workers N` on the `luigi` command line to run up to N tasks in parallel
- Point output targets to S3 or GCS by installing the client libraries those targets depend on (for example, `boto3` for S3)
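For orientation, a `luigi.cfg` file is INI-style. The sketch below embeds a small sample in a Python string and parses it with the standard-library `configparser`, purely to show the section and key layout. The option names are taken from Luigi's configuration documentation as best understood here, the values are placeholders, and `load_settings` is a hypothetical helper. Luigi reads the real file on its own, so none of this code is needed to use it.

```python
import configparser

# Illustrative luigi.cfg contents; every value here is a placeholder.
SAMPLE_LUIGI_CFG = """
[core]
# Where workers look for the central scheduler (luigid).
default_scheduler_host = localhost
default_scheduler_port = 8082

[scheduler]
# Seconds to wait before a failed task is marked pending again.
retry_delay = 60

[worker]
# Keep the worker alive while it still has pending tasks to wait for.
keep_alive = true
"""


def load_settings(text):
    """Parse INI text into a {section: {key: value}} dictionary of strings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


settings = load_settings(SAMPLE_LUIGI_CFG)
```

Parsing the sample this way is only a quick sanity check that the file is well-formed INI; it does not validate that Luigi recognizes each option.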
Key Features
- Pure Python API with no external DSL or YAML required
- Atomic file-based checkpointing prevents partial output corruption
- Built-in support for Hadoop, Spark, and BigQuery task types
- Visualization dashboard shows the full dependency graph and task status
- Extensible target system supports custom storage backends
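The atomic-output idea in the list above can be sketched with the standard library alone. This is the general write-to-temporary-then-rename technique, not Luigi's own implementation, and `atomic_write` is a hypothetical helper name. The point is that a downstream task checking for the file either sees the complete output or nothing at all.

```python
import os
import tempfile


def atomic_write(path, text):
    """Write text to path so that readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the destination directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
        # os.replace swaps the finished file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, remove the temporary file and leave path untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process dies before `os.replace`, the final path does not exist, so an existence-based completeness check would treat the task as not yet done and rerun it.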
Comparison with Similar Tools
- Apache Airflow — richer scheduling and UI but heavier operational footprint
- Prefect — modern async-first design with cloud-hosted option
- Dagster — asset-centric with strong typing and testing primitives
- Celery — general task queue without pipeline dependency resolution
- Makefiles — file-based dependencies but no Python integration or dashboard
FAQ
Q: How does Luigi differ from Airflow? A: Luigi focuses on dependency-driven batch pipelines with minimal infrastructure, while Airflow provides a full scheduling platform with its own metadata database and executor backends.
Q: Can Luigi run on a schedule? A: Luigi itself does not include a cron-like scheduler. You trigger runs externally via cron, CI, or a wrapper service, and Luigi handles dependency resolution from there.
Q: Does Luigi support distributed execution? A: Workers can run on multiple machines pointing to the same central scheduler. Each worker pulls tasks independently, enabling horizontal scaling.
Q: Is Luigi still maintained? A: Yes. Spotify continues to maintain Luigi and accepts community contributions, though the release cadence is slower than newer orchestrators.