
Skills · Apr 1, 2026 · 1 min read

Dagster — Cloud-Native Data Pipeline Orchestrator

Dagster orchestrates data pipelines with declarative assets, lineage tracking, and observability. 15.2K+ stars. Python, asset-based, testable. Apache 2.0.

AI Open Source · Community

Agent-ready install

This asset can be installed after you choose the runtime, review the plan, and run the corresponding command.

Native · 98/100 · Policy: allow

Agent surface: Any MCP/CLI agent
Type: Skill
Install: Single
Trust: Established
Entry: dagster.md

Direct install command

npx -y tokrepo@latest install 9ad9a1ce-c5bf-4125-ba9a-b61ddbcad145 --target codex

Run this after confirming the plan with a dry run.

TL;DR

Dagster orchestrates data pipelines using declarative assets with lineage, testing, and observability.

§01

What it is

Dagster is an open-source data pipeline orchestrator built around the concept of software-defined assets. Instead of defining tasks and their execution order, you define the data assets your pipeline produces and Dagster figures out the dependency graph, execution plan, and lineage automatically. It is written in Python and supports both local development and cloud deployment.

Dagster targets data engineers who want testable, observable pipelines with clear data lineage. It competes with Airflow, Prefect, and Mage as a modern orchestration layer that treats data assets as first-class citizens.

§02

How it saves time or tokens

Dagster's asset-based approach eliminates the gap between what your pipeline produces and how it runs. Each asset is a Python function with typed inputs and outputs, which means you can unit test individual assets locally before deploying. The built-in asset catalog shows what data exists, when it was last materialized, and what downstream assets depend on it.

For AI/ML teams, the same model applies: training data preparation, model training, and evaluation can be orchestrated as linked assets, and Dagster publishes integration libraries for common ML tooling.

§03

How to use

1. Install Dagster: pip install dagster dagster-webserver. Create a new project with dagster project scaffold --name my_pipeline.
2. Define assets as Python functions decorated with @asset. Each function takes upstream assets as inputs and returns the produced data.
3. Launch the Dagster UI: dagster dev. View your asset graph, materialize assets on demand, and schedule recurring materializations.

§04

Example

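from dagster import asset, Definitions
import pandas as pd

# Source asset: loads the raw orders (the URL is a placeholder).
@asset
def raw_orders() -> pd.DataFrame:
    return pd.read_csv('https://example.com/orders.csv')

# Downstream asset: the parameter name raw_orders links it to the asset above.
@asset
def daily_summary(raw_orders: pd.DataFrame) -> pd.DataFrame:
    # One row per date with total revenue and number of orders.
    return raw_orders.groupby('date').agg(
        total_revenue=('amount', 'sum'),
        order_count=('order_id', 'count')
    ).reset_index()

# Register both assets so the Dagster tooling can load them.
defs = Definitions(assets=[raw_orders, daily_summary])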

Dagster infers that daily_summary depends on raw_orders because the parameter name in daily_summary's signature matches the upstream asset's name. The resulting asset graph is visualized in the web UI.
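To make the inference idea concrete, here is a toy illustration that uses only the Python standard library. It is not Dagster's implementation: build_graph and run_all are hypothetical helpers that read parameter names with inspect and order execution with graphlib, which is the general technique of deriving a dependency graph from function signatures.

```python
import inspect
from graphlib import TopologicalSorter


def raw_orders():
    return [("2024-01-01", 10.0), ("2024-01-01", 5.0)]


def daily_summary(raw_orders):
    return sum(amount for _, amount in raw_orders)


def build_graph(functions):
    # Map each function name to the names of its parameters (its upstream deps).
    return {fn.__name__: set(inspect.signature(fn).parameters) for fn in functions}


def run_all(functions):
    by_name = {fn.__name__: fn for fn in functions}
    results = {}
    # static_order() yields each node only after all of its dependencies.
    for name in TopologicalSorter(build_graph(functions)).static_order():
        fn = by_name[name]
        kwargs = {p: results[p] for p in inspect.signature(fn).parameters}
        results[name] = fn(**kwargs)
    return results
```

Calling run_all([daily_summary, raw_orders]) runs raw_orders first even though it is listed second, because the order comes from the signatures rather than from the list.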

§05

Related on TokRepo

AI tools for automation — Pipeline and workflow automation tools
AI tools for database — Data infrastructure and storage tools

§06

Common pitfalls

Migrating from Airflow means rethinking pipeline structure: Airflow models the tasks in a DAG, while Dagster models the assets those tasks produce. The shift in mental model takes time, but it tends to produce pipelines whose outputs and dependencies are easier to see.
The asset graph grows complex in large organizations. Use asset groups and code locations to organize assets by team or domain.
Dagster+ (formerly Dagster Cloud) is the managed deployment option. Self-hosting production workloads requires running the daemon, the webserver, and a database (PostgreSQL recommended).

Frequently asked questions

How does Dagster differ from Airflow?

Airflow defines pipelines as task DAGs with explicit dependencies. Dagster defines software-defined assets where dependencies are inferred from function signatures. Dagster provides built-in data lineage, asset catalog, and local testing. Airflow is more mature with a larger ecosystem of operators.

Can Dagster run on Kubernetes?

Yes. Dagster has native Kubernetes support with the dagster-k8s package. It can launch pipeline steps as individual Kubernetes jobs, providing isolation and scalability. Dagster Cloud also offers a managed Kubernetes deployment.

Does Dagster support incremental processing?

Yes. Dagster supports partitioned assets where each partition (e.g., daily, hourly) can be materialized independently. This enables incremental processing where only new partitions are computed rather than reprocessing the entire dataset.

What databases and storage does Dagster integrate with?

Dagster has integrations (called 'resources' and 'IO managers') for PostgreSQL, Snowflake, BigQuery, DuckDB, S3, GCS, and many others. IO managers handle reading and writing data between assets and storage backends automatically.

Is Dagster suitable for ML pipelines?

Yes. Dagster can orchestrate ML workflows: data ingestion, feature engineering, model training, evaluation, and deployment as linked assets. Each step is testable, versioned, and observable through the asset graph.



Source and acknowledgments

dagster-io/dagster — 15,200+ GitHub stars
