Scripts · Apr 22, 2026 · 3 min read

Pachyderm — Data Versioning and Pipeline Orchestration

Version your data like Git, build reproducible data pipelines triggered by commits, and track lineage from raw input to model output on Kubernetes.

Introduction

Pachyderm is a data versioning and pipeline platform that brings Git-like version control to datasets and automates data transformations on Kubernetes. It lets teams track every change to their data, build reproducible pipelines, and maintain full lineage from raw input to final model output.

What Pachyderm Does

  • Versions datasets with Git-like commits, branches, and diffs stored in an object-storage backend
  • Triggers pipeline stages automatically when new data is committed to input repositories
  • Processes only changed data (incremental processing) to avoid recomputing entire datasets
  • Records complete data lineage showing which input files produced which output files
  • Runs pipeline steps as containers on Kubernetes, supporting any language or framework
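
The incremental-processing idea in the list above can be illustrated with a toy sketch. This is not Pachyderm's implementation: the function names and the two in-memory "commits" are hypothetical, and file content is compared by SHA-256 hash simply to show how changed inputs could be singled out.

```python
import hashlib


def file_hashes(files: dict[str, bytes]) -> dict[str, str]:
    """Map each path to the SHA-256 hex digest of its content."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def changed_paths(old: dict[str, bytes], new: dict[str, bytes]) -> list[str]:
    """Return paths that are new, or whose content differs, between two commits."""
    old_h, new_h = file_hashes(old), file_hashes(new)
    return sorted(p for p, h in new_h.items() if old_h.get(p) != h)


# Two hypothetical commits to the same input repo.
commit_1 = {"/a.csv": b"1,2,3", "/b.csv": b"4,5,6"}
commit_2 = {"/a.csv": b"1,2,3", "/b.csv": b"4,5,7", "/c.csv": b"8,9"}

# Only the modified and the newly added file need reprocessing.
to_process = changed_paths(commit_1, commit_2)
print(to_process)  # ['/b.csv', '/c.csv']
```

In this sketch `/a.csv` is skipped because its content hash is unchanged, which is the behavior the incremental-processing bullet describes at a much larger scale.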

Architecture Overview

Pachyderm runs on Kubernetes around pachd, the daemon that hosts its two core subsystems: the Pachyderm File System (PFS) and the Pachyderm Pipeline System (PPS). PFS stores data as content-addressed objects in S3-compatible object storage and tracks metadata (commits, branches, provenance) in etcd and PostgreSQL. PPS watches for new commits on input repos and schedules containerized jobs as Kubernetes pods. Each pipeline stage reads from one or more input repos and writes to an output repo, so the pipelines form a directed acyclic graph (DAG) that Pachyderm manages automatically.
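
To make the DAG idea concrete, here is a minimal sketch that derives a valid execution order from each pipeline's input repos. The pipeline and repo names (`images`, `edges`, `montage`) are hypothetical, and the ordering uses Python's standard `graphlib`, not anything from Pachyderm itself. It relies on the convention described above that each pipeline writes to an output repo, assumed here to share the pipeline's name.

```python
from graphlib import TopologicalSorter

# Hypothetical pipelines mapped to the repos they read from.
# "images" is a plain input repo; "edges" and "montage" are pipelines
# whose output repos are assumed to share their names.
pipeline_inputs = {
    "edges": ["images"],
    "montage": ["images", "edges"],
}


def run_order(inputs: dict[str, list[str]]) -> list[str]:
    """Return pipelines in an order where every input is produced first."""
    sorter = TopologicalSorter(inputs)  # maps node -> its predecessors
    # static_order() also yields plain repos such as "images"; keep pipelines only.
    return [node for node in sorter.static_order() if node in inputs]


print(run_order(pipeline_inputs))  # ['edges', 'montage']
```

A commit to `images` would therefore run `edges` first, and `montage` only once the new `edges` output exists.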

Self-Hosting & Configuration

  • Deploy on any Kubernetes cluster using the Helm chart with backend storage on S3, GCS, Azure Blob, or MinIO
  • Configure pachctl to connect to the cluster via port-forward or an ingress endpoint
  • Define pipelines in JSON or YAML specifying the Docker image, input repos, and transformation command
  • Set resource limits (CPU, memory, GPU) per pipeline to control Kubernetes pod scheduling
  • Enable authentication via OIDC providers and role-based access control on repos and pipelines
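
As an illustration of the pipeline definition mentioned above, the sketch below builds a JSON spec from a Python dictionary. The field names follow the commonly documented pipeline spec layout (`pipeline`, `transform`, `input.pfs`, `resource_limits`), but the image name, repo name, command, and limits are all hypothetical; check the spec reference for your Pachyderm version before relying on any field.

```python
import json

# Hypothetical pipeline: every value below is a placeholder for illustration.
pipeline_spec = {
    "pipeline": {"name": "edges"},
    "transform": {
        "image": "example/edges:1.0",     # assumed Docker image
        "cmd": ["python3", "/edges.py"],  # assumed transformation command
    },
    "input": {
        # Read from the "images" repo; the glob controls how data is split.
        "pfs": {"repo": "images", "glob": "/*"},
    },
    # Per-pipeline limits that influence Kubernetes pod scheduling.
    "resource_limits": {"cpu": 1, "memory": "1G"},
}

spec_json = json.dumps(pipeline_spec, indent=2)
print(spec_json)
```

The printed JSON could be saved to a file such as `edges.json` and, in a typical setup, submitted with `pachctl create pipeline -f edges.json`.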

Key Features

  • Automatic incremental processing detects which files changed and runs computations only on the diff
  • Global IDs link a data commit to every downstream pipeline output it triggered, enabling full reproducibility
  • Datum-level parallelism splits input data into datums (units of work defined by a glob pattern) and processes them concurrently across Kubernetes pods
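  • Flexible inputs and triggers: cron inputs run a pipeline on a schedule, cross inputs combine multiple repos, and deferred processing (via branch triggers) controls when committed data is actually processed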
  • Built-in data deduplication at the block level minimizes object storage costs across versions
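
The datum and cross-input ideas above can be approximated with a small model. This is a simplification under stated assumptions: only the two glob patterns `/` and `/*` are handled, the file paths are hypothetical, and real datum assignment in Pachyderm supports richer patterns than this sketch.

```python
from itertools import product


def datums_for_glob(paths: list[str], glob: str) -> list[list[str]]:
    """Toy datum split: "/" yields one datum, "/*" one per top-level entry."""
    if glob == "/":
        return [sorted(paths)]
    if glob == "/*":
        groups: dict[str, list[str]] = {}
        for path in sorted(paths):
            top = "/" + path.strip("/").split("/")[0]
            groups.setdefault(top, []).append(path)
        return list(groups.values())
    raise ValueError(f"glob {glob!r} is not modeled in this sketch")


# Hypothetical repo contents.
images = ["/2024/a.png", "/2024/b.png", "/2025/c.png"]
params = ["/p1.json", "/p2.json"]

image_datums = datums_for_glob(images, "/*")  # one datum per year directory
param_datums = datums_for_glob(params, "/*")  # one datum per parameter file

# A cross input pairs every datum of one repo with every datum of the other.
cross = list(product(image_datums, param_datums))
print(len(image_datums), len(param_datums), len(cross))  # 2 2 4
```

Each element of `cross` is an independent unit of work, which is what allows the pairs to be spread across worker pods.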

Comparison with Similar Tools

  • DVC — Git extension for data versioning; Pachyderm adds automatic pipeline orchestration and server-side data management
  • Apache Airflow — DAG-based workflow scheduler; Pachyderm pipelines are data-driven (triggered by commits) rather than schedule-driven
  • MLflow — experiment tracking and model registry; Pachyderm focuses on data versioning and pipeline lineage rather than model metadata
  • LakeFS — Git-like branching for data lakes; Pachyderm combines branching with built-in compute pipelines on Kubernetes
  • Dagster — software-defined asset orchestration; Pachyderm provides content-addressed data versioning at the storage layer

FAQ

Q: How does Pachyderm store data? A: Data is stored as content-addressed objects in any S3-compatible object storage. Metadata (commits, branches, provenance) is stored in PostgreSQL and etcd.
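
A toy content-addressed store shows why identical content across versions costs no extra storage. The `ContentStore` class and the two commits are hypothetical, and the sketch hashes whole files for simplicity, whereas the article describes deduplication at the block level.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: objects are keyed by the hash of their bytes."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.objects.setdefault(key, data)  # identical content is stored once
        return key


store = ContentStore()

# Two hypothetical commits, each mapping file paths to object keys.
commit_1 = {"/a.csv": store.put(b"1,2,3"), "/b.csv": store.put(b"4,5,6")}
commit_2 = {"/a.csv": store.put(b"1,2,3"), "/b.csv": store.put(b"4,5,7")}

# Four file versions across two commits, but only three distinct objects.
print(len(store.objects))  # 3
```

Here the commits play the role of metadata (paths pointing at keys), while the object map stands in for the object-storage backend.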

Q: Can Pachyderm handle large datasets? A: Yes. Pachyderm is designed for multi-terabyte datasets. Incremental processing and datum-level parallelism keep computation tractable at scale.

Q: Does Pachyderm require GPUs? A: No, but pipeline steps can request GPU resources via Kubernetes resource limits for ML training or inference workloads.

Q: What happened to the Pachyderm company? A: Pachyderm was acquired by Hewlett Packard Enterprise in 2023. The open-source project continues under the Community Edition license.
