Introduction
Apache DolphinScheduler is a distributed, cloud-native workflow orchestration platform designed for data pipeline scheduling. It provides a drag-and-drop visual DAG editor, multi-tenant isolation, and built-in support for dozens of task types including Shell, SQL, Spark, Flink, and Python, making it a strong choice for data engineering teams managing complex ETL workflows.
What Apache DolphinScheduler Does
- Orchestrates complex data workflows as directed acyclic graphs (DAGs) with visual editing
- Schedules tasks with cron expressions, manual triggers, and event-driven dependencies
- Supports 30+ task types including Shell, Python, SQL, Spark, Flink, MapReduce, and HTTP
- Provides multi-tenant resource isolation with role-based access control
- Monitors workflow execution with real-time logs, alerts, and retry mechanisms
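To make the DAG idea concrete, here is a minimal Python sketch of how a dependency graph determines execution order. It is not DolphinScheduler code: the task names and the `execution_batches` helper are hypothetical, and it uses only the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "extract_orders": set(),
    "extract_users": set(),
    "join_tables": {"extract_orders", "extract_users"},
    "load_warehouse": {"join_tables"},
}


def execution_batches(dag):
    """Group tasks into batches whose members have no unmet dependencies."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to make output stable
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished so successors unlock
    return batches


batches = execution_batches(workflow)
print(batches)
```

Tasks in the same batch do not depend on each other, so a scheduler could run them in parallel. Here the two extract tasks form the first batch, followed by `join_tables` and then `load_warehouse`.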
Architecture Overview
DolphinScheduler uses a decentralized master-worker architecture. MasterServers parse DAGs, schedule tasks, and track workflow state, coordinating with one another through a registry (ZooKeeper by default, with a database-backed option) that also provides distributed locking. In current releases the masters dispatch tasks to WorkerServers, which execute them in separate processes and report status back; very early releases had workers pull tasks from a queue instead. An API server exposes REST endpoints consumed by the web frontend. All metadata and workflow definitions are stored in a relational database (MySQL or PostgreSQL).
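Because the API server speaks plain REST, it can also be scripted. The sketch below only builds a request and does not send it. The base URL, the `/projects` path, the `pageNo`/`pageSize` parameters, and the `token` header are assumptions that may differ between versions, so check the API documentation for your release. `build_list_projects_request` is a hypothetical helper.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_list_projects_request(base_url, token, page_no=1, page_size=10):
    """Build (but do not send) a GET request for a paged project listing.

    Assumed, not verified against any specific release: the path, the
    query parameter names, and the "token" authentication header.
    """
    query = urlencode({"pageNo": page_no, "pageSize": page_size})
    url = f"{base_url.rstrip('/')}/projects?{query}"
    return Request(url, headers={"token": token}, method="GET")


# Placeholder host, port, and token; replace with values from your deployment.
req = build_list_projects_request(
    "http://localhost:12345/dolphinscheduler", "example-token"
)
print(req.get_method(), req.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would perform the call against a running API server.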
Self-Hosting & Configuration
- Requires Java 8+, a relational database (MySQL 5.7+ or PostgreSQL 12+), and, for cluster deployments, a registry service (ZooKeeper by default)
- Deploy as standalone, pseudo-cluster, or full cluster mode depending on scale
- Configure datasource connections in the web UI for Hive, Spark, PostgreSQL, and other engines
- Set worker groups to route specific task types to designated machines
- Enable alerting via email, DingTalk, Enterprise WeChat (WeCom), PagerDuty, or custom webhook plugins
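Worker groups are configured in the UI, but the routing idea can be illustrated in a few lines. The following is a toy model, not DolphinScheduler's dispatcher: the group names, host names, `WorkerGroupRouter` class, and round-robin policy are all assumptions made for the example.

```python
import itertools

# Hypothetical worker groups and hosts, for illustration only.
WORKER_GROUPS = {
    "default": ["worker-1", "worker-2"],
    "spark": ["spark-worker-1"],
}


class WorkerGroupRouter:
    """Pick a host for a task by cycling through its worker group."""

    def __init__(self, groups):
        for name, hosts in groups.items():
            if not hosts:
                raise ValueError(f"worker group {name!r} has no hosts")
        # One independent round-robin iterator per group.
        self._cycles = {
            name: itertools.cycle(hosts) for name, hosts in groups.items()
        }

    def pick(self, group="default"):
        if group not in self._cycles:
            raise KeyError(f"unknown worker group: {group}")
        return next(self._cycles[group])


router = WorkerGroupRouter(WORKER_GROUPS)
print(router.pick("spark"))    # only one host in this group
print(router.pick("default"))
print(router.pick("default"))
```

A task assigned to the `spark` group always lands on a machine with Spark installed, while tasks in `default` alternate between the general-purpose hosts.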
Key Features
- Visual drag-and-drop workflow designer with sub-workflow support and parameter passing
- Distributed architecture with horizontal scaling of master and worker nodes
- Complement (backfill) runs for re-executing a workflow over past dates, and dependent tasks for cross-workflow coordination
- Built-in resource center for managing scripts, configuration files, and UDFs
- SLA monitoring with configurable timeout alerts and failure retry policies
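As an illustration of what a complement (backfill) run expands to, the sketch below lists one schedule date per day in an inclusive range. It assumes a daily workflow and is not DolphinScheduler's implementation; `complement_dates` is a hypothetical helper.

```python
from datetime import date, timedelta


def complement_dates(start, end):
    """Return every calendar day from start to end, both inclusive."""
    if end < start:
        raise ValueError("end date must not be before start date")
    day_count = (end - start).days
    return [start + timedelta(days=offset) for offset in range(day_count + 1)]


# Example range chosen to cross a month boundary.
runs = complement_dates(date(2024, 1, 30), date(2024, 2, 2))
print([d.isoformat() for d in runs])
```

Each returned date would correspond to one workflow instance, which a scheduler could run serially or in parallel.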
Comparison with Similar Tools
- Apache Airflow: Python-centric scheduler that defines DAGs as code; DolphinScheduler offers a visual editor and multi-tenancy out of the box
- Dagster: asset-focused orchestration with strong testing support; DolphinScheduler focuses on operational scheduling at scale
- Prefect: Python-native workflow engine with a managed cloud option; DolphinScheduler provides more built-in task types
- Azkaban: LinkedIn's batch workflow scheduler; DolphinScheduler adds a distributed master-worker design with multiple masters and is more actively developed
- Luigi: lightweight Python pipeline framework; DolphinScheduler adds distributed execution and a full web UI
FAQ
Q: How does DolphinScheduler differ from Airflow? A: DolphinScheduler provides a visual DAG editor, native multi-tenancy, and a master-worker distributed architecture, while Airflow defines workflows as Python code and relies on Celery or Kubernetes for distribution.
Q: Can it integrate with cloud services? A: Yes. It supports task types for AWS EMR, Google Dataproc, and various cloud SQL services via JDBC connections.
Q: What scale can DolphinScheduler handle? A: Production deployments manage tens of thousands of concurrent tasks across hundreds of worker nodes.
Q: Is there a managed cloud version? A: DolphinScheduler is self-hosted. Some cloud providers offer it as part of their managed data platform services.