Apache DolphinScheduler — Distributed Data Workflow Orchestration Platform

Introduction

Apache DolphinScheduler is a distributed, cloud-native workflow orchestration platform designed for data pipeline scheduling. It provides a drag-and-drop visual DAG editor, multi-tenant isolation, and built-in support for dozens of task types including Shell, SQL, Spark, Flink, and Python, making it a strong choice for data engineering teams managing complex ETL workflows.

What Apache DolphinScheduler Does

Orchestrates complex data workflows as directed acyclic graphs (DAGs) with visual editing
Schedules tasks with cron expressions, manual triggers, and event-driven dependencies
Supports 30+ task types including Shell, Python, SQL, Spark, Flink, MapReduce, and HTTP
Provides multi-tenant resource isolation with role-based access control
Monitors workflow execution with real-time logs, alerts, and retry mechanisms
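
As a rough illustration of what orchestrating a workflow as a DAG means, the sketch below groups tasks into batches that can run in parallel once their upstream tasks have finished. It uses only the Python standard library (`graphlib`, Python 3.9+) and does not call any DolphinScheduler API; the task names and the `execution_batches` helper are hypothetical.

```python
from graphlib import TopologicalSorter

def execution_batches(dependencies):
    """Group tasks into batches; each batch depends only on earlier batches.

    `dependencies` maps a task name to the set of tasks it waits for.
    Hypothetical helper for illustration, not part of DolphinScheduler.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose upstreams are all done
        batches.append(ready)
        sorter.done(*ready)                 # mark the batch as finished
    return batches

# Hypothetical ETL workflow: task -> upstream tasks
workflow = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "validate": {"clean"},
    "load": {"aggregate", "validate"},
}

print(execution_batches(workflow))
# [['extract'], ['clean'], ['aggregate', 'validate'], ['load']]
```

A graph with a cycle is rejected at `prepare()`, which is why a workflow has to be acyclic before it can be scheduled.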

Architecture Overview

DolphinScheduler uses a master-worker architecture. MasterServer handles DAG parsing, task scheduling, and workflow state management using a distributed lock via ZooKeeper or a database. WorkerServers pull tasks from the queue and execute them in isolated processes. An API server exposes REST endpoints consumed by the web frontend. All metadata and workflow definitions are stored in a relational database (MySQL or PostgreSQL).
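
The division of labor described above can be sketched in a few lines. This is a toy, in-process model under stated assumptions: threads stand in for WorkerServers, a `queue.Queue` stands in for the task queue, and `run_workflow` is a hypothetical name. The real servers are separate processes that communicate over the network, and the exact dispatch mechanism may differ between releases.

```python
import queue
import threading

def run_workflow(tasks, worker_count=2):
    """Toy master: enqueue tasks, let worker threads pull and run them.

    `tasks` maps a task name to a zero-argument callable.
    Returns {task_name: (state, value_or_error)}.
    """
    task_queue = queue.Queue()
    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            item = task_queue.get()
            if item is None:              # sentinel: no more work for this worker
                task_queue.task_done()
                return
            name, func = item
            try:
                outcome = ("SUCCESS", func())
            except Exception as exc:      # a failing task must not kill the worker
                outcome = ("FAILURE", repr(exc))
            with results_lock:
                results[name] = outcome   # the "master" tracks task state here
            task_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for name, func in tasks.items():
        task_queue.put((name, func))
    for _ in threads:
        task_queue.put(None)              # one sentinel per worker
    task_queue.join()
    for thread in threads:
        thread.join()
    return results

states = run_workflow({"ok_task": lambda: 42, "bad_task": lambda: 1 / 0})
for name in sorted(states):
    print(name, states[name][0])
```

Because workers only see what arrives on the queue, adding capacity in this model amounts to starting more workers, which is the property the master-worker split is meant to provide.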

Self-Hosting & Configuration

Requires Java 8+, a relational database (MySQL 5.7+ or PostgreSQL 12+), and optionally ZooKeeper
Deploy as standalone, pseudo-cluster, or full cluster mode depending on scale
Configure datasource connections in the web UI for Hive, Spark, PostgreSQL, and other engines
Set worker groups to route specific task types to designated machines
Enable alerting via email, DingTalk, WeChat, PagerDuty, or custom webhook plugins
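
To make the worker-group idea concrete, here is a minimal sketch of routing tasks to designated machines. It is not DolphinScheduler's implementation: the `WorkerGroupRouter` class, the group names, the host names, and the round-robin choice are all assumptions made for illustration.

```python
import itertools

class WorkerGroupRouter:
    """Hypothetical router: picks a host for a task from its worker group."""

    def __init__(self, groups):
        # groups: {"group name": ["host", ...]}; every group needs at least one host
        for name, hosts in groups.items():
            if not hosts:
                raise ValueError(f"worker group {name!r} has no hosts")
        self._cycles = {name: itertools.cycle(hosts) for name, hosts in groups.items()}

    def pick_host(self, worker_group="default"):
        """Return the next host of the group, rotating round-robin."""
        if worker_group not in self._cycles:
            raise KeyError(f"unknown worker group: {worker_group!r}")
        return next(self._cycles[worker_group])

# Hypothetical cluster layout: Spark jobs go to machines with more memory.
router = WorkerGroupRouter({
    "default": ["worker-1", "worker-2"],
    "spark": ["bigmem-1"],
})

print(router.pick_host("spark"))    # bigmem-1
print(router.pick_host("default"))  # worker-1
print(router.pick_host("default"))  # worker-2
```

Rejecting an unknown group instead of silently falling back to `default` is a design choice of this sketch; it keeps a misconfigured task from landing on a machine that lacks the required runtime.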

Key Features

Visual drag-and-drop workflow designer with sub-workflow support and parameter passing
Distributed architecture with horizontal scaling of master and worker nodes
Complement and dependent scheduling modes for cross-workflow coordination
Built-in resource center for managing scripts, configuration files, and UDFs
SLA monitoring with configurable timeout alerts and failure retry policies
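
The retry and timeout-alert behavior mentioned in the list above can be sketched as follows. This is a simplified model, not the platform's code: `run_with_retries` and its parameters are hypothetical names, the timeout only triggers an alert after the task finishes (it does not interrupt the task), and `alert` is any callable that accepts a message.

```python
import time

def run_with_retries(task, max_retries=3, retry_interval=0.0,
                     timeout=None, alert=print, sleep=time.sleep):
    """Run `task`, retrying on failure; return (result, attempts_used).

    Hypothetical helper. Alerts when a run exceeds `timeout` seconds or
    when all retries are exhausted, in which case the last error is re-raised.
    """
    attempts = 0
    while True:
        attempts += 1
        started = time.monotonic()
        try:
            result = task()
        except Exception as exc:
            if attempts > max_retries:      # initial run + max_retries retries used up
                alert(f"task failed after {attempts} attempts: {exc!r}")
                raise
            sleep(retry_interval)           # wait before the next attempt
            continue
        elapsed = time.monotonic() - started
        if timeout is not None and elapsed > timeout:
            alert(f"task took {elapsed:.2f}s, over the {timeout}s timeout")
        return result, attempts

# Hypothetical flaky task: fails twice, then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient error")
    return "ok"

print(run_with_retries(flaky, max_retries=3))  # ('ok', 3)
```

Passing `sleep` and `alert` as parameters keeps the policy testable without real waiting or real notifications.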

Comparison with Similar Tools

Apache Airflow — Python-centric DAG scheduler with code-as-config; DolphinScheduler offers a visual editor and multi-tenancy out of the box
Dagster — asset-focused orchestration with strong testing; DolphinScheduler focuses on operational scheduling at scale
Prefect — Python-native workflow engine with a managed cloud option; DolphinScheduler provides more built-in task types
Azkaban — LinkedIn's batch workflow scheduler; DolphinScheduler has a more modern architecture and active development
Luigi — lightweight Python pipeline framework; DolphinScheduler adds distributed execution and a full web UI

FAQ

Q: How does DolphinScheduler differ from Airflow?
A: DolphinScheduler provides a visual DAG editor, native multi-tenancy, and a master-worker distributed architecture, while Airflow defines workflows as Python code and relies on Celery or Kubernetes for distribution.

Q: Can it integrate with cloud services?
A: Yes. It supports task types for AWS EMR, Google Dataproc, and various cloud SQL services via JDBC connections.

Q: What scale can DolphinScheduler handle?
A: Production deployments manage tens of thousands of concurrent tasks across hundreds of worker nodes.

Q: Is there a managed cloud version?
A: DolphinScheduler is self-hosted. Some cloud providers offer it as part of their managed data platform services.

Similar Assets

Apache Hive — Distributed Data Warehouse for Big Data Analytics

Apache SkyWalking — Distributed APM & Observability Platform

Apache ShardingSphere — Distributed Database Middleware Ecosystem

Apache RocketMQ — Cloud-Native Messaging and Streaming Platform