Kedro — Production-Ready ML Pipeline Framework for Python
Kedro is an open-source Python framework by McKinsey QuantumBlack that applies software engineering best practices to data science and ML pipelines. It provides a standardized project structure, data catalog, and pipeline abstraction that makes experimental code production-ready.
What it is
Kedro is an open-source Python framework created by McKinsey QuantumBlack that applies software engineering best practices to data science and machine learning code. It provides a standardized project structure, a declarative data catalog, and a pipeline abstraction that transforms experimental notebook code into maintainable, testable, production-ready pipelines.
Kedro targets data scientists and ML engineers who need to bridge the gap between prototype notebooks and production systems. It works alongside existing tools like pandas, scikit-learn, and PySpark without replacing them.
How it saves time or tokens
Kedro eliminates the 'notebook to production' refactoring cycle. The standardized project template means new team members understand the codebase layout immediately. The data catalog decouples data access from business logic, so switching between local CSV files and cloud storage requires changing a YAML config, not Python code. Pipeline visualization with kedro viz provides instant documentation of data flow without writing diagram code.
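For instance, the same logical dataset can point at a local CSV in development and S3 Parquet in production purely through catalog configuration. A minimal sketch, assuming a dataset named companies and a custom prod environment (paths and exact dataset type strings are illustrative and depend on your kedro-datasets version):
# conf/base/catalog.yml (local development)
companies:
  type: pandas.CSVDataset
  filepath: data/01_raw/companies.csv

# conf/prod/catalog.yml (production override of the same logical name)
companies:
  type: pandas.ParquetDataset
  filepath: s3://my-bucket/raw/companies.parquet
Running with kedro run --env prod picks up the production entries; the node code that consumes companies does not change.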
How to use
- Install Kedro and create a new project:
pip install kedro
kedro new --starter=spaceflights-pandas
cd spaceflights-pandas
- Run the pipeline:
kedro run
- Install the kedro-viz plugin and visualize the pipeline graph:
pip install kedro-viz
kedro viz run
# Opens the browser at http://localhost:4141
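The starter generates Kedro's standard layout, so the important files sit in predictable places. An abridged sketch (exact contents vary by starter and Kedro version):
conf/base/catalog.yml        # data catalog: logical names mapped to storage
conf/base/parameters.yml     # pipeline parameters
conf/local/                  # credentials and local overrides (not committed)
data/01_raw/ ... data/08_reporting/   # layered data folders
notebooks/                   # exploratory notebooks
src/<package_name>/pipeline_registry.py   # registers pipelines for kedro run
src/<package_name>/pipelines/             # node and pipeline code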
Example
Define a pipeline node that transforms data:
# src/project/pipelines/data_processing/nodes.py
import pandas as pd
def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    companies['company_rating'] = companies['company_rating'].fillna(
        companies['company_rating'].mean()
    )
    return companies
Register it in the pipeline:
# src/project/pipelines/data_processing/pipeline.py
from kedro.pipeline import Pipeline, node
from .nodes import preprocess_companies
def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline([
        node(
            func=preprocess_companies,
            inputs='companies',
            outputs='preprocessed_companies',
            name='preprocess_companies_node',
        ),
    ])
The data catalog YAML maps logical names to physical storage.
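For the example above, the entries might look like this sketch (paths are illustrative; older releases spell the types CSVDataSet and ParquetDataSet):
# conf/base/catalog.yml
companies:
  type: pandas.CSVDataset
  filepath: data/01_raw/companies.csv

preprocessed_companies:
  type: pandas.ParquetDataset
  filepath: data/02_intermediate/preprocessed_companies.parquet
Outputs that are not registered in the catalog are held in memory for the duration of the run and are not persisted.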
Related on TokRepo
- AI Tools for Coding — Development tools that complement ML pipeline frameworks
- AI Tools for DevOps — CI/CD and deployment tools for ML pipeline orchestration
Common pitfalls
- Kedro is a pipeline framework, not an orchestrator. For scheduled execution, pair it with Airflow, Prefect, or Argo using Kedro's deployment plugins (see the sketch after this list).
- The data catalog requires explicit registration of every dataset. Forgetting to add an intermediate dataset to catalog.yml causes runtime errors.
- Pipeline visualization with kedro viz requires installing the kedro-viz plugin separately. It is not included in the core package.
- Always check the official documentation for the latest version-specific changes and migration guides before upgrading in production environments.
- For team deployments, establish clear guidelines on configuration and usage patterns to ensure consistency across developers.
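As a sketch of the orchestrator pairing mentioned in the first pitfall, the kedro-airflow deployment plugin can generate an Airflow DAG from an existing project; verify the exact command against the plugin version you install:
pip install kedro-airflow
kedro airflow create    # writes a DAG file that wraps the project's pipeline for Airflow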
Frequently Asked Questions
How is Kedro different from Airflow?
Kedro is a pipeline authoring framework focused on code organization, data management, and reproducibility. Airflow is a workflow orchestrator focused on scheduling and monitoring. They complement each other: you write pipelines in Kedro and deploy them to Airflow for scheduled execution.
Does Kedro work with PySpark?
Yes. Kedro has built-in support for PySpark through its data catalog. You define SparkDataSet entries in catalog.yml, and your pipeline nodes receive and return Spark DataFrames. This lets you scale from pandas prototypes to Spark production without changing node logic.
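A sketch of such an entry, using this page's SparkDataSet naming (newer kedro-datasets releases spell it spark.SparkDataset); the dataset name and path are hypothetical:
reviews_spark:
  type: spark.SparkDataSet
  filepath: s3a://my-bucket/raw/reviews.parquet
  file_format: parquet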
Who maintains Kedro?
Kedro is maintained by McKinsey QuantumBlack, McKinsey's AI and data science division. It was originally built as an internal tool and open-sourced for the broader data science community.
Can Kedro be used for non-ML data pipelines?
Yes. While Kedro was designed for ML workflows, its pipeline and data catalog abstractions work for any data processing task. ETL pipelines, reporting pipelines, and data quality checks all fit the Kedro model.
What is the data catalog?
The data catalog is a YAML file that maps logical dataset names to physical storage locations and formats. Your pipeline code references logical names only. Switching from a local CSV to S3 Parquet requires changing the catalog entry, not your Python code.
Citations (3)
- Kedro GitHub — Kedro is an open-source Python framework by McKinsey QuantumBlack
- Kedro Documentation — Standardized project structure with data catalog and pipeline visualization
- Kedro Deployment Docs — Integration with Airflow, Prefect, and other orchestrators