Configs · Apr 16, 2026 · 3 min read

Kedro — Production-Ready ML Pipeline Framework for Python

Kedro is an open-source Python framework by McKinsey QuantumBlack that applies software engineering best practices to data science and ML pipelines. It provides a standardized project structure, data catalog, and pipeline abstraction that makes experimental code production-ready.

TL;DR
Open-source Python framework by QuantumBlack that turns messy notebook code into production-ready ML pipelines.
§01

What it is

Kedro is an open-source Python framework created by McKinsey QuantumBlack that applies software engineering best practices to data science and machine learning code. It provides a standardized project structure, a declarative data catalog, and a pipeline abstraction that transforms experimental notebook code into maintainable, testable, production-ready pipelines.

Kedro targets data scientists and ML engineers who need to bridge the gap between prototype notebooks and production systems. It works alongside existing tools like pandas, scikit-learn, and PySpark without replacing them.

§02

How it saves time or tokens

Kedro eliminates the notebook-to-production refactoring cycle. The standardized project template means new team members understand the codebase layout immediately. The data catalog decouples data access from business logic, so switching between local CSV files and cloud storage means changing a YAML config, not Python code. Pipeline visualization with Kedro-Viz provides instant, always-current documentation of the data flow without hand-maintained diagrams.
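For instance, retargeting a dataset from local disk to cloud storage is a catalog-only change. A sketch of conf/base/catalog.yml (paths are illustrative; older kedro-datasets releases spell the type pandas.CSVDataSet):

```yaml
# conf/base/catalog.yml -- node code never sees these paths
companies:
  type: pandas.CSVDataset
  filepath: data/01_raw/companies.csv

# The same logical name pointed at S3 instead (YAML forbids duplicate
# keys, so the alternative entry is shown commented out):
# companies:
#   type: pandas.ParquetDataset
#   filepath: s3://my-bucket/raw/companies.parquet
```

The nodes keep referring to the logical name `companies` either way.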

§03

How to use

  1. Install Kedro and create a new project:
pip install kedro
kedro new --starter=spaceflights-pandas
cd spaceflights-pandas
  2. Run the pipeline:
kedro run
  3. Install the visualization plugin and open the pipeline graph:
pip install kedro-viz
kedro viz run
# Opens a browser at http://localhost:4141
§04

Example

Define a pipeline node that transforms data:

# src/project/pipelines/data_processing/nodes.py
import pandas as pd

def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    # Impute missing ratings with the column mean.
    companies['company_rating'] = companies['company_rating'].fillna(
        companies['company_rating'].mean()
    )
    return companies
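Because a node is a plain Python function with no Kedro imports, it can be unit-tested directly; a minimal sketch (the test data and assertions are illustrative, and the function is restated so the snippet runs standalone):

```python
import pandas as pd

def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    # Impute missing ratings with the column mean.
    companies['company_rating'] = companies['company_rating'].fillna(
        companies['company_rating'].mean()
    )
    return companies

# A missing rating is replaced by the mean of the present ones.
df = pd.DataFrame({'company_rating': [1.0, None, 3.0]})
result = preprocess_companies(df)
assert result['company_rating'].iloc[1] == 2.0  # mean of 1.0 and 3.0
assert result['company_rating'].isna().sum() == 0
```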

Register it in the pipeline:

# src/project/pipelines/data_processing/pipeline.py
from kedro.pipeline import Pipeline, node
from .nodes import preprocess_companies

def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline([
        node(
            func=preprocess_companies,
            inputs='companies',
            outputs='preprocessed_companies',
            name='preprocess_companies_node',
        ),
    ])

The data catalog YAML maps logical names to physical storage.
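For the pipeline above, the entries might look like this (directory layout follows the Kedro starter convention; older kedro-datasets releases spell the types CSVDataSet and ParquetDataSet):

```yaml
# conf/base/catalog.yml
companies:
  type: pandas.CSVDataset
  filepath: data/01_raw/companies.csv

preprocessed_companies:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/preprocessed_companies.parquet
```

Any dataset name not listed in the catalog falls back to an in-memory dataset that lives only for the duration of the run.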

§05

Common pitfalls

  • Kedro is a pipeline framework, not an orchestrator. For scheduled execution, pair it with Airflow, Prefect, or Argo using Kedro's deployment plugins.
  • Datasets you do not register in catalog.yml default to in-memory datasets. That works within a single run, but unregistered intermediate outputs are discarded afterwards, so partial runs that start downstream fail because their inputs no longer exist.
  • Pipeline visualization with kedro viz requires installing the kedro-viz plugin separately. It is not included in the core package.
  • Kedro has had breaking changes between releases (for example, dataset classes moving from kedro.extras.datasets into the separate kedro-datasets package, and the DataSet naming becoming Dataset), so read the release notes and migration guides before upgrading a production project.
  • For team deployments, agree on conventions up front: where parameters live, how configuration environments (conf/base vs conf/local) are used, and how pipelines are named.

Frequently Asked Questions

What is the difference between Kedro and Airflow?

Kedro is a pipeline authoring framework focused on code organization, data management, and reproducibility. Airflow is a workflow orchestrator focused on scheduling and monitoring. They complement each other: you write pipelines in Kedro and deploy them to Airflow for scheduled execution.

Does Kedro work with PySpark?

Yes. Kedro has built-in support for PySpark through its data catalog. You define spark.SparkDataset entries (SparkDataSet in older releases) in catalog.yml, and your pipeline nodes receive and return Spark DataFrames. This lets you scale from pandas prototypes to Spark production without changing node logic.
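A Spark-backed catalog entry might look like this (type and option names follow recent kedro-datasets releases; the path is illustrative):

```yaml
# conf/base/catalog.yml
companies:
  type: spark.SparkDataset
  filepath: s3a://my-bucket/raw/companies.parquet
  file_format: parquet
```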

Who maintains Kedro?

Kedro was created at McKinsey QuantumBlack as an internal tool and open-sourced in 2019. In 2022 it was donated to the LF AI & Data Foundation, and it is now developed in the open by the Kedro team and community.

Can I use Kedro for non-ML data pipelines?

Yes. While Kedro was designed for ML workflows, its pipeline and data catalog abstractions work for any data processing task. ETL pipelines, reporting pipelines, and data quality checks all fit the Kedro model.

How does the Kedro data catalog work?

The data catalog is a YAML file that maps logical dataset names to physical storage locations and formats. Your pipeline code references logical names only. Switching from a local CSV to S3 Parquet requires changing the catalog entry, not your Python code.

