Pachyderm — Data Versioning and Pipeline Orchestration

Introduction

Pachyderm is a data versioning and pipeline platform that brings Git-like version control to datasets and automates data transformations on Kubernetes. It lets teams track every change to their data, build reproducible pipelines, and maintain full lineage from raw input to final model output.

What Pachyderm Does

Versions datasets with Git-like commits, branches, and diffs stored in an object-storage backend
Triggers pipeline stages automatically when new data is committed to input repositories
Processes only changed data (incremental processing) to avoid recomputing entire datasets
Records complete data lineage showing which input files produced which output files
Runs pipeline steps as containers on Kubernetes, supporting any language or framework
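
The Git-like commit and diff idea above can be illustrated with a small sketch. This is a toy model, not Pachyderm's implementation: the `snapshot` and `diff_commits` helpers and the file paths are hypothetical, and a commit is simplified to a mapping from path to content hash.

```python
import hashlib


def snapshot(files: dict[str, bytes]) -> dict[str, str]:
    """Map each path to the SHA-256 hex digest of its content."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def diff_commits(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify paths as added, removed, or modified between two snapshots."""
    added = sorted(p for p in new if p not in old)
    removed = sorted(p for p in old if p not in new)
    modified = sorted(p for p in new if p in old and new[p] != old[p])
    return {"added": added, "removed": removed, "modified": modified}


# Two hypothetical commits on the same branch.
commit_1 = snapshot({"/a.csv": b"1,2,3", "/b.csv": b"4,5,6"})
commit_2 = snapshot({"/a.csv": b"1,2,3", "/b.csv": b"4,5,7", "/c.csv": b"8,9"})

changes = diff_commits(commit_1, commit_2)
print(changes)
```

Because unchanged files hash to the same digest in both commits, only `/b.csv` and `/c.csv` show up in the diff, which is the information an incremental pipeline needs.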

Architecture Overview

Pachyderm runs on Kubernetes. Its daemon, pachd, hosts two core subsystems: the Pachyderm File System (PFS) and the Pachyderm Pipeline System (PPS). PFS stores data as content-addressed objects in S3-compatible object storage while tracking metadata (commits, branches, provenance) in etcd and PostgreSQL. PPS watches for new commits on input repos and schedules containerized jobs as Kubernetes pods. Each pipeline stage reads from one or more input repos and writes to an output repo, forming a DAG that Pachyderm manages automatically.
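
To make the DAG behaviour concrete, here is a rough sketch of how a commit to one repo could determine which pipelines run and in what order. The pipeline names, the `PIPELINES` table, and the `downstream_order` helper are all hypothetical, and the sketch assumes each pipeline writes to an output repo named after itself; the real scheduler is considerably more involved.

```python
from graphlib import TopologicalSorter

# Hypothetical pipelines: pipeline name -> input repos it reads from.
# Assumption: each pipeline writes to an output repo with its own name.
PIPELINES = {
    "clean": ["raw"],
    "features": ["clean"],
    "train": ["features", "labels"],
    "report": ["clean"],
}


def downstream_order(changed_repo: str, pipelines: dict[str, list[str]]) -> list[str]:
    """Return the pipelines affected by a commit, in dependency order."""
    # Walk forward from the changed repo to find every reachable pipeline.
    affected: set[str] = set()
    frontier = [changed_repo]
    while frontier:
        repo = frontier.pop()
        for name, inputs in pipelines.items():
            if repo in inputs and name not in affected:
                affected.add(name)
                frontier.append(name)
    # Keep only edges between affected pipelines, then sort topologically.
    graph = {name: [i for i in pipelines[name] if i in affected] for name in affected}
    return list(TopologicalSorter(graph).static_order())


order = downstream_order("raw", PIPELINES)
print(order)
```

A commit to `raw` reaches all four pipelines, while a commit to `labels` reaches only `train`. The relative order of independent pipelines such as `features` and `report` is not fixed.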

Self-Hosting & Configuration

Deploy on any Kubernetes cluster using the Helm chart with backend storage on S3, GCS, Azure Blob, or MinIO
Configure pachctl to connect to the cluster via port-forward or an ingress endpoint
Define pipelines in JSON or YAML specifying the Docker image, input repos, and transformation command
Set resource limits (CPU, memory, GPU) per pipeline to control Kubernetes pod scheduling
Enable authentication via OIDC providers and role-based access control on repos and pipelines
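
As an illustration of the pipeline definition step, the sketch below builds a minimal spec as a Python dictionary and serializes it to JSON. The pipeline name, image, command, and repo are hypothetical, and the field names (`pipeline`, `transform`, `input.pfs`, `resource_limits`) follow the pipeline spec layout as commonly documented; check them against the reference for your installed version before relying on them.

```python
import json

# Hypothetical pipeline: image, command, and repo names are placeholders.
pipeline_spec = {
    "pipeline": {"name": "edges"},
    "transform": {
        "image": "example/edges:1.0",     # Docker image to run (placeholder)
        "cmd": ["python3", "/edges.py"],  # transformation command
    },
    # Read from the "images" repo; the glob "/*" treats each top-level
    # file or directory as a separate unit of work.
    "input": {"pfs": {"repo": "images", "glob": "/*"}},
    # Per-pipeline limits that influence Kubernetes pod scheduling.
    "resource_limits": {"cpu": 1, "memory": "512M"},
}

spec_json = json.dumps(pipeline_spec, indent=2)
print(spec_json)
```

The resulting JSON would typically be saved to a file and submitted with pachctl, for example `pachctl create pipeline -f edges.json`.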

Key Features

Automatic incremental processing detects which files changed and only runs computations on the diff
Global IDs link a data commit to every downstream pipeline output it triggered, enabling full reproducibility
Datum-level parallelism splits input data into chunks and processes them concurrently across Kubernetes pods
Cron inputs run pipelines on a schedule, cross inputs combine data from multiple repos, and deferred processing lets a pipeline follow a chosen branch instead of running on every commit
Built-in data deduplication at the block level minimizes object storage costs across versions
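
The interplay of incremental processing and datum-level parallelism can be sketched as follows. This is a simplified model under stated assumptions: each file is one datum, a thread pool stands in for Kubernetes worker pods, and `datum_key`, `transform`, and `run_job` are hypothetical names rather than Pachyderm APIs.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def datum_key(path: str, data: bytes) -> str:
    """Identify a datum by its path and content."""
    return hashlib.sha256(path.encode() + b"\0" + data).hexdigest()


def transform(data: bytes) -> bytes:
    """Stand-in for the user's containerized transformation."""
    return data.upper()


def run_job(files: dict[str, bytes], cache: dict[str, bytes]):
    """Process only datums not seen before; reuse cached results for the rest."""
    keys = {path: datum_key(path, data) for path, data in files.items()}
    todo = [path for path, key in keys.items() if key not in cache]
    # Threads stand in for worker pods processing datums concurrently.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(lambda p: transform(files[p]), todo))
    for path, out in zip(todo, results):
        cache[keys[path]] = out
    outputs = {path: cache[keys[path]] for path in files}
    return outputs, sorted(todo)


cache: dict[str, bytes] = {}
out1, processed1 = run_job({"/a": b"x", "/b": b"y"}, cache)
out2, processed2 = run_job({"/a": b"x", "/b": b"z", "/c": b"w"}, cache)
print(processed1, processed2)
```

In the second job `/a` is unchanged, so its cached result is reused and only `/b` and `/c` are recomputed.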

Comparison with Similar Tools

DVC — Git extension for data versioning; Pachyderm adds automatic pipeline orchestration and server-side data management
Apache Airflow — DAG-based workflow scheduler; Pachyderm pipelines are data-driven (triggered by commits) rather than schedule-driven
MLflow — experiment tracking and model registry; Pachyderm focuses on data versioning and pipeline lineage rather than model metadata
LakeFS — Git-like branching for data lakes; Pachyderm combines branching with built-in compute pipelines on Kubernetes
Dagster — software-defined asset orchestration; Pachyderm provides content-addressed data versioning at the storage layer

FAQ

Q: How does Pachyderm store data? A: Data is stored as content-addressed objects in any S3-compatible object storage. Metadata (commits, branches, provenance) is stored in PostgreSQL and etcd.

Q: Can Pachyderm handle large datasets? A: Yes. Pachyderm is designed for multi-terabyte datasets. Incremental processing and datum-level parallelism keep computation tractable at scale.

Q: Does Pachyderm require GPUs? A: No, but pipeline steps can request GPU resources via Kubernetes resource limits for ML training or inference workloads.

Q: What happened to the Pachyderm company? A: Pachyderm was acquired by Hewlett Packard Enterprise in 2023. The open-source project continues under the Community Edition license.
