Introduction
LakeFS brings version control semantics to object storage. Data engineers can create branches, run experimental transformations in isolation, diff the results against production, and merge — all without copying data. It acts as a gateway that intercepts S3-compatible API calls and manages versioned metadata.
What LakeFS Does
- Provides Git-like branching, committing, merging, and reverting for data stored in object storage
- Exposes an S3-compatible API so existing tools (Spark, Trino, dbt, Airflow) work unchanged
- Enables zero-copy branching — branches share underlying data until changes diverge
- Tracks lineage and enables data diffing between any two references
- Supports pre-merge and pre-commit hooks for data quality validation
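Because lakeFS exposes repositories as S3 buckets and branches (or any ref) as the first key component, existing S3 tooling addresses versioned data with ordinary object paths. A minimal sketch of that addressing convention — the repository, branch, and key names here are illustrative, not defaults:

```python
def lakefs_object_path(repo: str, ref: str, key: str) -> str:
    """Build the S3-style path an S3-aware tool would use against the
    lakeFS gateway: the repository acts as the bucket, and the branch
    (or any ref, such as a commit ID) is the first key component."""
    return f"s3://{repo}/{ref}/{key}"

# The same logical key read from two refs resolves to two independent versions.
prod = lakefs_object_path("analytics", "main", "events/2024/01/data.parquet")
expr = lakefs_object_path("analytics", "experiment", "events/2024/01/data.parquet")
print(prod)  # s3://analytics/main/events/2024/01/data.parquet
```

This is why the tools listed above work unchanged: only the endpoint and the bucket/path convention differ from plain S3.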
Architecture Overview
LakeFS runs as a stateless Go service backed by a key-value store (PostgreSQL, or DynamoDB on AWS) for metadata and by your existing object store (S3, GCS, or Azure) for data. When a client writes through the S3 gateway, LakeFS records the object in a branch-specific namespace. Commits create immutable snapshots of the metadata tree. Merges perform a three-way diff on metadata pointers, not on data bytes, so merge time is independent of dataset size.
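The copy-on-write mechanics described above can be illustrated with a toy metadata model: treat a branch as a mapping from logical paths to physical object addresses, so creating a branch copies pointers rather than bytes, and a merge compares pointers against a common base. All names below are illustrative, not the lakeFS implementation:

```python
def create_branch(source: dict) -> dict:
    # Zero-copy branch: duplicate the path -> physical-address mapping,
    # not the objects themselves. Both branches now share the same data.
    return dict(source)

def three_way_merge(base: dict, ours: dict, theirs: dict) -> dict:
    # Merge by comparing metadata pointers against the common ancestor.
    merged = dict(ours)
    for path in set(base) | set(theirs):
        base_v, their_v = base.get(path), theirs.get(path)
        if their_v != base_v:  # changed, added, or deleted on theirs
            if ours.get(path) not in (base_v, their_v):
                raise ValueError(f"conflict on {path}")  # both sides changed
            if their_v is None:
                merged.pop(path, None)  # deleted on theirs
            else:
                merged[path] = their_v  # take theirs
    return merged

main = {"events/a.parquet": "s3://bucket/obj-1", "events/b.parquet": "s3://bucket/obj-2"}
branch = create_branch(main)                      # instant: no data copied
branch["events/b.parquet"] = "s3://bucket/obj-3"  # only this pointer diverges
merged = three_way_merge(base=main, ours=main, theirs=branch)
print(merged["events/b.parquet"])  # s3://bucket/obj-3
```

Because only pointers are compared, the cost of the merge scales with the number of changed entries, not with the size of the objects they reference.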
Self-Hosting & Configuration
- Deploy via Docker, Kubernetes Helm chart, or native binaries
- Requires PostgreSQL (or DynamoDB on AWS) for metadata storage
- Configure the blockstore backend (S3, GCS, Azure, or local filesystem)
- Set up authentication via built-in users, LDAP, or OIDC
- Integrate with Airflow, Spark, or dbt through the S3-compatible endpoint (addressing data as s3a://repo/branch/path) or through the lakeFS Hadoop FileSystem client with lakefs:// URIs
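A self-hosted deployment is driven by a single configuration file covering the points above. The sketch below shows the general shape of such a file; the key names reflect my reading of the lakeFS configuration reference and the connection string, secret, and region are placeholders — verify everything against the official documentation before use:

```yaml
# Illustrative sketch only -- check key names against the lakeFS configuration reference.
database:
  type: "postgres"
  postgres:
    connection_string: "postgres://lakefs:password@pg.example.internal:5432/lakefs"
auth:
  encrypt:
    secret_key: "replace-with-a-random-secret"
blockstore:
  type: "s3"
  s3:
    region: "us-east-1"
```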
Key Features
- Zero-copy branching — create branches instantly without duplicating data
- S3-compatible gateway for transparent integration with any S3-aware tool
- Pre-commit and pre-merge hooks for automated data validation
- Web UI and CLI for browsing repositories, diffs, and commit history
- Open source under the Apache 2.0 license with an active community
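The pre-commit and pre-merge hooks listed above are declared as action files committed to the repository itself (under a `_lakefs_actions/` prefix). A hedged sketch of such an action — the hook id and webhook URL are placeholders:

```yaml
# _lakefs_actions/pre_merge_check.yaml -- illustrative; see the lakeFS hooks docs.
name: pre-merge quality gate
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: validate_schema        # placeholder id
    type: webhook
    properties:
      url: "https://hooks.example.internal/validate"  # placeholder endpoint
```

If the webhook returns a failure, the merge into main is rejected, which is how data quality validation is enforced before changes reach production.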
Comparison with Similar Tools
- Delta Lake — table format with ACID transactions and time travel; LakeFS works at the object storage level across any file format
- DVC — Git-based data versioning for ML experiments; LakeFS versions entire data lakes with branching semantics
- Apache Iceberg — table format with snapshot isolation; LakeFS provides repository-level versioning independent of table format
- Nessie — Git-like catalog for Iceberg tables; LakeFS is format-agnostic and operates at the storage layer
FAQ
Q: Does branching duplicate my data? A: No. LakeFS uses copy-on-write at the metadata level. Branches share the same underlying objects until changes are made.
Q: Can I use LakeFS with Spark? A: Yes. Point Spark's S3A endpoint at the LakeFS S3 gateway and address data as s3a://repo/branch/path, or use the lakeFS Hadoop FileSystem client with lakefs:// URIs. No code changes are needed beyond configuration.
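Under the S3A route, the Spark-side change is configuration only. A sketch in spark-defaults.conf form, using standard Hadoop S3A keys with a hypothetical endpoint and placeholder credentials:

```
spark.hadoop.fs.s3a.endpoint           https://lakefs.example.internal
spark.hadoop.fs.s3a.access.key         AKIA-EXAMPLE
spark.hadoop.fs.s3a.secret.key         example-secret
spark.hadoop.fs.s3a.path.style.access  true
# Data is then addressed as s3a://<repository>/<branch>/<path>
```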
Q: What happens if LakeFS goes down? A: Data in the object store remains accessible directly. LakeFS only manages metadata; it does not move or transform your data.
Q: Does it support garbage collection? A: Yes. LakeFS provides a garbage collection process that reclaims objects no longer referenced by any branch or retained commit, such as those left behind by deleted branches.