Kedro — Production-Ready ML Pipeline Framework for Python
Kedro is an open-source Python framework by McKinsey QuantumBlack that applies software engineering best practices to data science and ML pipelines. It provides a standardized project structure, data catalog, and pipeline abstraction that makes experimental code production-ready.
What it is
Kedro is an open-source Python framework created by McKinsey QuantumBlack that applies software engineering best practices to data science and machine learning code. It provides a standardized project structure, a declarative data catalog, and a pipeline abstraction that transforms experimental notebook code into maintainable, testable, production-ready pipelines.
Kedro targets data scientists and ML engineers who need to bridge the gap between prototype notebooks and production systems. It works alongside existing tools like pandas, scikit-learn, and PySpark without replacing them.
How it saves time or tokens
Kedro shortens the 'notebook to production' refactoring cycle. The standardized project template means new team members can find their way around the codebase quickly. The data catalog decouples data access from business logic, so switching between local CSV files and cloud storage requires changing a YAML config, not Python code. Pipeline visualization with kedro viz documents the data flow without hand-written diagram code.
How to use
- Install Kedro and create a new project:
pip install kedro
kedro new --starter=spaceflights-pandas
cd spaceflights-pandas
- Run the pipeline:
kedro run
- Visualize the pipeline graph (requires the kedro-viz plugin, which is installed separately from the core package):
kedro viz run
# Opens browser at localhost:4141
Example
Define a pipeline node that transforms data:
# src/project/pipelines/data_processing/nodes.py
import pandas as pd


def preprocess_companies(companies: pd.DataFrame) -> pd.DataFrame:
    # Fill missing ratings with the column mean.
    companies['company_rating'] = companies['company_rating'].fillna(
        companies['company_rating'].mean()
    )
    return companies
Register it in the pipeline:
# src/project/pipelines/data_processing/pipeline.py
from kedro.pipeline import Pipeline, node

from .nodes import preprocess_companies


def create_pipeline(**kwargs) -> Pipeline:
    return Pipeline([
        node(
            func=preprocess_companies,
            inputs='companies',
            outputs='preprocessed_companies',
            name='preprocess_companies_node',
        ),
    ])
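Nodes are connected through the dataset names they declare, not through the order in which they are listed. The following is a minimal plain-Python sketch of that idea using the standard-library graphlib module. It is not Kedro's own code, and every node and dataset name other than preprocess_companies_node, companies, and preprocessed_companies is hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical node specs: (node name, input dataset names, output dataset names).
# They are deliberately listed out of order.
NODES = [
    ("train_model_node", ["model_input_table"], ["regressor"]),
    ("create_model_input_node", ["preprocessed_companies", "reviews"], ["model_input_table"]),
    ("preprocess_companies_node", ["companies"], ["preprocessed_companies"]),
]


def execution_order(nodes):
    """Return node names in an order that respects dataset dependencies."""
    # Map each dataset name to the node that produces it.
    producer = {out: name for name, _, outputs in nodes for out in outputs}
    # A node depends on the nodes that produce its inputs. Inputs with no
    # producer ("companies", "reviews") have to come from storage instead.
    graph = {
        name: {producer[i] for i in inputs if i in producer}
        for name, inputs, _ in nodes
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(NODES)
print(order)
```

Because the order is derived from the names, renaming an output without renaming the matching input silently breaks the link between two nodes, which is one reason to keep dataset names consistent across pipeline files.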
The data catalog YAML maps logical names to physical storage.
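To make the mapping concrete, here is a small plain-Python sketch of the pattern. It does not use Kedro: the catalog dictionaries, the LOADERS table, and the load and count_companies helpers are all hypothetical stand-ins. In a real project the entries would live in catalog.yml and name a dataset type and a file path.

```python
import csv
import json
import tempfile
from pathlib import Path


# Hypothetical loaders keyed by a format label (Kedro uses dataset classes instead).
def _load_csv(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def _load_json(path):
    with open(path) as f:
        return json.load(f)


LOADERS = {"csv": _load_csv, "json": _load_json}


def load(catalog, name):
    """Resolve a logical dataset name and read it from wherever the catalog points."""
    entry = catalog[name]
    return LOADERS[entry["type"]](entry["filepath"])


def count_companies(catalog):
    # Business logic refers only to the logical name "companies".
    return len(load(catalog, "companies"))


# Write the same sample rows to two physical formats.
tmp = Path(tempfile.mkdtemp())
rows = [{"id": "1", "company_rating": "0.9"}, {"id": "2", "company_rating": ""}]

csv_path = tmp / "companies.csv"
with open(csv_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "company_rating"])
    writer.writeheader()
    writer.writerows(rows)

json_path = tmp / "companies.json"
json_path.write_text(json.dumps(rows))

# Two catalogs: same logical name, different physical storage.
catalog_csv = {"companies": {"type": "csv", "filepath": csv_path}}
catalog_json = {"companies": {"type": "json", "filepath": json_path}}

n_csv = count_companies(catalog_csv)
n_json = count_companies(catalog_json)
print(n_csv, n_json)
```

Swapping catalog_csv for catalog_json changes where the data comes from while count_companies stays untouched, which is the property the catalog is meant to provide.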
Common pitfalls
- Kedro is a pipeline framework, not an orchestrator. For scheduled execution, pair it with Airflow, Prefect, or Argo using Kedro's deployment plugins.
- Datasets that a pipeline reads from or writes to storage must be registered in catalog.yml. A pipeline input that no node produces and that has no catalog entry makes the run fail, and an unregistered intermediate output is held in memory only, so it is not persisted after the run.
- Pipeline visualization with kedro viz requires installing the kedro-viz plugin separately. It is not included in the core package.
- Always check the official documentation for the latest version-specific changes and migration guides before upgrading in production environments.
- For team deployments, establish clear guidelines on configuration and usage patterns to ensure consistency across developers.
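The catalog pitfall above is easier to catch before a run than during one. As a rough sketch in plain Python (not Kedro's API; the node tuples and both helper functions are hypothetical), the check below finds dataset names that a pipeline reads but never produces and reports the ones with no catalog entry.

```python
def free_inputs(nodes):
    """Dataset names that some node reads but no node produces."""
    consumed = {i for _, inputs, _ in nodes for i in inputs}
    produced = {o for _, _, outputs in nodes for o in outputs}
    return consumed - produced


def missing_from_catalog(nodes, catalog_names):
    """Free inputs with no catalog entry, sorted for a stable message."""
    return sorted(free_inputs(nodes) - set(catalog_names))


# Hypothetical pipeline: (node name, inputs, outputs).
nodes = [
    ("preprocess_companies_node", ["companies"], ["preprocessed_companies"]),
    ("create_model_input_node", ["preprocessed_companies", "reviews"], ["model_input_table"]),
]
registered = {"companies"}  # "reviews" was forgotten

missing = missing_from_catalog(nodes, registered)
print(missing)
```

A check like this could run in a unit test or CI step so that a forgotten entry is reported with the dataset name rather than surfacing partway through a deployment.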
Frequently asked questions
How does Kedro differ from Airflow?
Kedro is a pipeline authoring framework focused on code organization, data management, and reproducibility. Airflow is a workflow orchestrator focused on scheduling and monitoring. They complement each other: you write pipelines in Kedro and deploy them to Airflow for scheduled execution.
Does Kedro work with Spark?
Yes. Kedro supports PySpark through its data catalog. You define spark.SparkDataset entries (named SparkDataSet in older releases) in catalog.yml, and your pipeline nodes receive and return Spark DataFrames. The pipeline wiring and catalog names stay the same when you move from pandas prototypes to Spark, although node code that calls pandas-specific APIs still has to be ported to the Spark DataFrame API.
Who maintains Kedro?
Kedro is maintained by McKinsey QuantumBlack, McKinsey's AI and data science division. It was originally built as an internal tool and open-sourced for the broader data science community.
Can Kedro be used for pipelines that are not ML?
Yes. While Kedro was designed for ML workflows, its pipeline and data catalog abstractions work for any data processing task. ETL pipelines, reporting pipelines, and data quality checks all fit the Kedro model.
What is the data catalog?
The data catalog is a YAML file that maps logical dataset names to physical storage locations and formats. Your pipeline code references logical names only. Switching from a local CSV to S3 Parquet requires changing the catalog entry, not your Python code.