# Metaflow — Human-Friendly ML Workflow Framework by Netflix

> Metaflow is a Python framework from Netflix for building and managing real-life data science and ML projects, handling compute, data versioning, and orchestration with minimal boilerplate.

## Install

```bash
pip install metaflow
```

## Quick Use

Save the following as a script file (e.g. `my_flow.py`):

```python
from metaflow import FlowSpec, step

class MyFlow(FlowSpec):

    @step
    def start(self):
        self.data = [1, 2, 3]
        self.next(self.end)

    @step
    def end(self):
        print(f"Result: {sum(self.data)}")

if __name__ == "__main__":
    MyFlow()
```

Then run it:

```bash
python my_flow.py run
```

## Introduction

Metaflow was built at Netflix to let data scientists write production ML pipelines using regular Python. It manages infrastructure concerns (versioning, compute scaling, dependency management) behind a simple decorator-based API, so teams can focus on modeling rather than plumbing.

## What Metaflow Does

- Structures ML projects as flows with steps connected by a DAG
- Automatically versions every run's data, code, and dependencies
- Scales individual steps to cloud compute (AWS Batch, Kubernetes) with a single decorator
- Provides a built-in client for inspecting past runs and retrieving artifacts
- Supports branching and joining for parallel workloads within a flow

## Architecture Overview

A Metaflow flow is a Python class where each method decorated with `@step` becomes a node in a DAG. When executed, the runtime snapshots code, data artifacts, and environment metadata for each step. Steps can be dispatched to local processes, AWS Batch, or Kubernetes. A metadata service tracks all runs, and a datastore (S3 or the local filesystem) persists artifacts so any past result can be retrieved programmatically.
## Self-Hosting & Configuration

- Install from PyPI for local execution with no extra infrastructure
- Configure AWS integration by running `metaflow configure aws` for S3 and Batch
- Deploy the metadata service for team-wide run tracking and artifact sharing
- Use `@conda` or `@pypi` decorators to pin per-step dependencies automatically
- Integrate with Argo Workflows or AWS Step Functions for production scheduling

## Key Features

- Decorator-based API keeps flow definitions in plain Python without YAML or config files
- Automatic data versioning lets you inspect or compare any historical run
- `@resources` decorator requests specific CPU, memory, or GPU for individual steps
- Fan-out with `foreach` enables parallel processing across data partitions
- Built-in `resume` restarts a failed run from the last successful step

## Comparison with Similar Tools

- **Prefect** — Python workflow engine; more general-purpose, less ML-specific artifact management
- **Dagster** — asset-centric orchestrator; stronger typing but heavier abstraction layer
- **Kedro** — pipeline framework for data science; more opinionated project structure
- **Airflow** — DAG scheduler for batch jobs; requires more infrastructure and is less Python-native

## FAQ

**Q: Do I need AWS to use Metaflow?**
A: No. Metaflow runs fully locally. AWS and Kubernetes integrations are optional for scaling.

**Q: How does data versioning work?**
A: Every step's output artifacts are automatically persisted and tagged with the run ID. You can retrieve any artifact from any past run via the client API.

**Q: Can I schedule flows for recurring execution?**
A: Yes. Integrate with Argo Workflows, AWS Step Functions, or any cron-based scheduler to trigger flows on a schedule.

**Q: Does Metaflow handle GPU workloads?**
A: Yes. Use the `@resources(gpu=1)` decorator to request GPU instances for specific steps.
## Sources

- https://github.com/Netflix/metaflow
- https://docs.metaflow.org/