Configs · May 13, 2026 · 3 min read

LakeFS — Git-Like Version Control for Data Lakes

LakeFS adds Git-like branching, committing, and merging to your data lake on S3, GCS, or Azure Blob Storage, enabling reproducible data pipelines and zero-copy experimentation.

Introduction

LakeFS brings version control semantics to object storage. Data engineers can create branches, run experimental transformations in isolation, diff the results against production, and merge — all without copying data. It acts as a gateway that intercepts S3-compatible API calls and manages versioned metadata.

What LakeFS Does

  • Provides Git-like branching, committing, merging, and reverting for data stored in object storage
  • Exposes an S3-compatible API so existing tools (Spark, Trino, dbt, Airflow) work unchanged
  • Enables zero-copy branching — branches share underlying data until changes diverge
  • Tracks lineage and enables data diffing between any two references
  • Supports pre-merge and pre-commit hooks for data quality validation
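The zero-copy behavior in the list above can be sketched with a toy model: a branch is nothing more than a table of pointers from logical paths to object IDs, so creating a branch copies pointers, never data. This is an illustrative sketch only, not the actual LakeFS implementation; all names here are invented for the example.

```python
# Toy model of zero-copy branching: a branch is a mapping from logical
# paths to object IDs in the backing store. Branching duplicates only
# the pointer table; objects themselves are shared until overwritten.

class Repo:
    def __init__(self):
        self.objects = {}             # object_id -> bytes (the backing store)
        self.branches = {"main": {}}  # branch name -> {path: object_id}

    def put(self, branch, path, data):
        oid = f"obj{len(self.objects)}"
        self.objects[oid] = data
        self.branches[branch][path] = oid

    def branch(self, name, source="main"):
        # Copy-on-write: copy pointers only, not objects.
        self.branches[name] = dict(self.branches[source])

    def get(self, branch, path):
        return self.objects[self.branches[branch][path]]

repo = Repo()
repo.put("main", "events/2026-05-13.parquet", b"v1")
repo.branch("experiment")  # instant, regardless of dataset size
repo.put("experiment", "events/2026-05-13.parquet", b"v2")

print(repo.get("main", "events/2026-05-13.parquet"))        # b'v1'
print(repo.get("experiment", "events/2026-05-13.parquet"))  # b'v2'
```

Writing to the experiment branch repoints a single path; production's view of the same path is untouched.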

Architecture Overview

LakeFS runs as a stateless Go service backed by PostgreSQL (for metadata) and your existing object store (S3, GCS, or Azure) for data. When a client writes via the S3 gateway, LakeFS records the object in a branch-specific namespace. Commits create immutable snapshots of the metadata tree. Merges perform a three-way diff on metadata pointers, not on data bytes, making them fast regardless of dataset size.
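The metadata-only merge described above can be illustrated with a small three-way diff over pointer tables: compare the merge base, the source branch, and the destination branch path by path, touching no data bytes. This is a simplified sketch, not the real LakeFS merge algorithm.

```python
# Toy three-way merge over branch metadata (path -> object_id tables).
# Cost is proportional to the number of changed entries, independent
# of the size of the objects they point to.

def three_way_merge(base, ours, theirs):
    merged, conflicts = dict(ours), []
    for path in set(base) | set(ours) | set(theirs):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t or t == b:
            continue                    # no change, or only our side changed
        if o == b:
            # Changed only on their side: take their version (or deletion).
            if t is None:
                merged.pop(path, None)
            else:
                merged[path] = t
        else:
            conflicts.append(path)      # both sides changed the same path
    return merged, conflicts

base   = {"a.parquet": "obj1", "b.parquet": "obj2"}
ours   = {"a.parquet": "obj1", "b.parquet": "obj2", "c.parquet": "obj3"}
theirs = {"a.parquet": "obj4", "b.parquet": "obj2"}

merged, conflicts = three_way_merge(base, ours, theirs)
print(merged)     # picks up obj4 for a.parquet, keeps our new c.parquet
print(conflicts)  # empty: no path was changed on both sides
```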

Self-Hosting & Configuration

  • Deploy via Docker, Kubernetes Helm chart, or native binaries
  • Requires PostgreSQL (or DynamoDB on AWS) for metadata storage
  • Configure the blockstore backend (S3, GCS, Azure, or local filesystem)
  • Set up authentication via built-in users, LDAP, or OIDC
  • Integrate with Airflow, Spark, or dbt via the S3-compatible endpoint (addressing data as s3a://&lt;repository&gt;/&lt;branch&gt;/&lt;path&gt;) or via the lakefs:// URI scheme
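A minimal server configuration ties these pieces together. The fragment below is a sketch assuming the PostgreSQL and S3 options from the list above; key names and the connection string are illustrative, so verify them against the current LakeFS configuration reference before use.

```yaml
# config.yaml -- minimal LakeFS server sketch (values are placeholders)
database:
  type: postgres
  postgres:
    connection_string: postgres://lakefs:secret@pg.internal:5432/lakefs
blockstore:
  type: s3          # or: gcs, azure, local
  s3:
    region: us-east-1
auth:
  encrypt:
    secret_key: replace-with-a-long-random-secret
```

With Docker, the same settings can be supplied as environment variables instead of a file.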

Key Features

  • Zero-copy branching — create branches instantly without duplicating data
  • S3-compatible gateway for transparent integration with any S3-aware tool
  • Pre-commit and pre-merge hooks for automated data validation
  • Web UI and CLI for browsing repositories, diffs, and commit history
  • Open source under the Apache 2.0 license with an active community
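The hooks mentioned above are declared in action files stored in the repository itself. The fragment below sketches a pre-merge webhook guarding the main branch; the action-file schema shown here is approximate and the validator URL is a placeholder, so check the LakeFS hooks documentation for the exact format.

```yaml
# _lakefs_actions/pre_merge_check.yaml -- illustrative sketch
name: pre-merge data checks
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: schema_validator
    type: webhook
    properties:
      url: http://validator.internal/check   # hypothetical endpoint
```

If the webhook returns a failure, the merge into main is rejected, keeping unvalidated data out of production.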

Comparison with Similar Tools

  • Delta Lake — table format with ACID transactions and time travel; LakeFS works at the object storage level across any file format
  • DVC — Git-based data versioning for ML experiments; LakeFS versions entire data lakes with branching semantics
  • Apache Iceberg — table format with snapshot isolation; LakeFS provides repository-level versioning independent of table format
  • Nessie — Git-like catalog for Iceberg tables; LakeFS is format-agnostic and operates at the storage layer

FAQ

Q: Does branching duplicate my data? A: No. LakeFS uses copy-on-write at the metadata level. Branches share the same underlying objects until changes are made.

Q: Can I use LakeFS with Spark? A: Yes. Point your Spark jobs at the LakeFS S3 gateway and address data as s3a://&lt;repository&gt;/&lt;branch&gt;/&lt;path&gt;, or use the LakeFS Hadoop FileSystem with lakefs:// URIs. No code changes are needed beyond updating the endpoint and credentials.
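Concretely, routing Spark's S3A traffic through the gateway is a few properties. The endpoint and credentials below are placeholders for this sketch; the fs.s3a.* keys are standard Hadoop S3A settings.

```
# spark-defaults.conf -- route S3A through the LakeFS gateway (placeholders)
spark.hadoop.fs.s3a.endpoint              https://lakefs.example.com
spark.hadoop.fs.s3a.access.key            <lakefs-access-key-id>
spark.hadoop.fs.s3a.secret.key            <lakefs-secret-access-key>
spark.hadoop.fs.s3a.path.style.access     true
```

A job can then read a branch directly, e.g. `spark.read.parquet("s3a://my-repo/experiment/events/")`, where `my-repo` and `experiment` are a hypothetical repository and branch.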

Q: What happens if LakeFS goes down? A: Data in the object store remains accessible directly. LakeFS only manages metadata; it does not move or transform your data.

Q: Does it support garbage collection? A: Yes. A built-in GC process reclaims unreferenced objects from deleted branches or old commits.
