Configs · April 16, 2026 · 1 min read

Kedro — Production-Ready ML Pipeline Framework for Python

Kedro is an open-source Python framework by McKinsey QuantumBlack that applies software engineering best practices to data science and ML pipelines. It provides a standardized project structure, data catalog, and pipeline abstraction that makes experimental code production-ready.

Introduction

Kedro bridges the gap between messy notebook experiments and maintainable production pipelines. Created by QuantumBlack (a McKinsey company) and now hosted by the LF AI & Data Foundation, it enforces a consistent project template, separates configuration from code, and makes pipelines reproducible without tying you to any particular orchestrator.

What Kedro Does

  • Provides a cookiecutter-style project template that standardizes ML project layout
  • Abstracts data access through a declarative YAML-based Data Catalog
  • Defines pipelines as DAGs whose nodes are pure Python functions
  • Generates interactive pipeline visualizations with Kedro-Viz
  • Deploys to any orchestrator (Airflow, Prefect, Vertex AI, Databricks) via plugins
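The node-and-pipeline style looks roughly like this. The functions and dataset names below (`split_data`, `model_input`, `test_ratio`) are illustrative, not from a real project; the Kedro wiring is shown in comments so the plain functions remain runnable without Kedro installed:

```python
# Nodes are plain Python functions; Kedro wires them into a DAG by
# matching output dataset names to input dataset names.

def split_data(rows, test_ratio):
    """Split a list of rows into train and test partitions."""
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

def count_rows(train):
    """A second node consuming the first node's output."""
    return len(train)

# With `pip install kedro`, the wiring would look like:
#
# from kedro.pipeline import node, pipeline
# data_pipeline = pipeline([
#     node(split_data, inputs=["model_input", "params:test_ratio"],
#          outputs=["train", "test"], name="split"),
#     node(count_rows, inputs="train", outputs="train_count"),
# ])
```

Because nodes are pure functions, they can be unit-tested in isolation before being wired into a pipeline.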

Architecture Overview

A Kedro project consists of nodes (Python functions), pipelines (DAGs of nodes), and a Data Catalog (YAML that maps logical dataset names to physical storage). A runner executes the pipeline sequentially (SequentialRunner) or concurrently (ThreadRunner, ParallelRunner), and deployment plugins translate pipelines for external orchestrators. Configuration is layered by environment (e.g. base, local, prod) so credentials and parameters stay separate from code.
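A minimal toy sketch of this execution model (not Kedro's actual internals): a catalog maps dataset names to data, and a sequential runner resolves each node's inputs from the catalog and writes its outputs back:

```python
# Toy illustration of Kedro's execution model, not Kedro's real code.
# The "catalog" is just a dict from dataset name to data.

def run_sequential(nodes, catalog):
    """nodes: list of (func, input_names, output_names) in DAG order."""
    for func, inputs, outputs in nodes:
        results = func(*[catalog[name] for name in inputs])
        if len(outputs) == 1:          # normalize single outputs
            results = (results,)
        catalog.update(zip(outputs, results))
    return catalog

# Two tiny "nodes" chained purely by dataset name:
nodes = [
    (lambda xs: [x * 2 for x in xs], ["raw"], ["doubled"]),
    (sum, ["doubled"], ["total"]),
]
catalog = run_sequential(nodes, {"raw": [1, 2, 3]})
```

The real framework adds lazy I/O, hooks, and topological sorting, but the principle is the same: nodes never touch storage directly; the catalog mediates all reads and writes.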

Self-Hosting & Configuration

  • Install via pip or conda and scaffold a project with kedro new
  • Define datasets in conf/base/catalog.yml pointing to local files, S3, GCS, or databases
  • Store credentials in conf/local/credentials.yml which is gitignored by default
  • Declare run parameters in conf/base/parameters.yml so experiment settings stay out of code
  • Deploy to Airflow with kedro-airflow or to Databricks with kedro-databricks plugin
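A catalog entry and a parameters file might look like the fragment below. The dataset names, bucket, and paths are illustrative; the dataset type strings come from the kedro-datasets package (older releases spell them `pandas.CSVDataSet`):

```yaml
# conf/base/catalog.yml -- names and paths are illustrative
companies:
  type: pandas.CSVDataset
  filepath: s3://my-bucket/raw/companies.csv

model_input:
  type: pandas.ParquetDataset
  filepath: data/03_primary/model_input.parquet

# conf/base/parameters.yml
test_ratio: 0.2
```

Nodes then refer to `companies` or `params:test_ratio` by name, and swapping S3 for local disk is a one-line catalog change with no code edits.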

Key Features

  • Declarative Data Catalog decouples I/O from business logic
  • Modular pipeline design encourages reuse across projects
  • Kedro-Viz provides interactive DAG visualization with experiment tracking
  • Built-in dataset versioning for reproducibility
  • Extensive plugin ecosystem for deployment, linting, and testing
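Dataset versioning, for example, is a single catalog flag (the dataset name and path here are illustrative):

```yaml
model:
  type: pickle.PickleDataset
  filepath: data/06_models/model.pkl
  versioned: true   # each save goes to a timestamped subdirectory
```

Loads default to the latest version, and a specific timestamp can be requested at run time for reproducibility.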

Comparison with Similar Tools

  • Prefect — workflow orchestrator focused on scheduling; Kedro is a pipeline framework that feeds into orchestrators
  • DVC — data version control tool; Kedro manages pipeline structure and data access patterns
  • Metaflow — Netflix framework with strong compute abstraction; Kedro focuses on project structure and portability
  • ZenML — MLOps framework with stack abstraction; Kedro is lighter and more opinionated on project layout
  • Luigi — older pipeline library; Kedro offers modern packaging, catalog, and visualization

FAQ

Q: Is Kedro an orchestrator? A: No. Kedro defines pipelines; orchestrators like Airflow or Prefect schedule and monitor them. Kedro provides deployment plugins for popular orchestrators.

Q: Can I use Kedro with Jupyter notebooks? A: Yes. Kedro ships a Jupyter integration that loads the catalog and context so you can explore data interactively and then refactor into nodes.

Q: Does Kedro support distributed computing? A: Kedro nodes can use Spark, Dask, or Ray internally. The framework orchestrates the DAG; the compute engine handles scale.

Q: Who uses Kedro in production? A: Companies like Telus, QuantumBlack, Walmart, and NASA JPL use Kedro to standardize their ML workflows.
