Introduction
LakeFS brings version control semantics to object storage. Data engineers can create branches, run experimental transformations in isolation, diff the results against production, and merge — all without copying data. It acts as a gateway that intercepts S3-compatible API calls and manages versioned metadata.
What LakeFS Does
- Provides Git-like branching, committing, merging, and reverting for data stored in object storage
- Exposes an S3-compatible API so existing tools (Spark, Trino, dbt, Airflow) work unchanged
- Enables zero-copy branching — branches share underlying data until changes diverge
- Tracks lineage and enables data diffing between any two references
- Supports pre-merge and pre-commit hooks for data quality validation
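Because lakeFS exposes repositories as S3 buckets and branches (or any ref) as the first key component, existing S3 tooling addresses versioned data with ordinary object paths. A minimal sketch of that addressing convention — the repository, branch, and key names here are illustrative, not defaults:

```python
def lakefs_object_path(repo: str, ref: str, key: str) -> str:
    """Build the S3-style path an S3-aware tool would use against the
    lakeFS gateway: the repository acts as the bucket, and the branch
    (or any ref, such as a commit ID) is the first key component."""
    return f"s3://{repo}/{ref}/{key}"

# The same logical key read from two refs resolves to two independent versions.
prod = lakefs_object_path("analytics", "main", "events/2024/01/data.parquet")
expr = lakefs_object_path("analytics", "experiment", "events/2024/01/data.parquet")
print(prod)  # s3://analytics/main/events/2024/01/data.parquet
```

This is why the tools listed above work unchanged: only the endpoint and the bucket/path convention differ from plain S3.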
Architecture Overview
LakeFS runs as a stateless Go service backed by a key-value store (PostgreSQL, or DynamoDB on AWS) for metadata and by your existing object store (S3, GCS, or Azure) for data. When a client writes through the S3 gateway, LakeFS records the object in a branch-specific namespace. Commits create immutable snapshots of the metadata tree. Merges perform a three-way diff on metadata pointers, not on data bytes, so merge time is independent of dataset size.
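The copy-on-write mechanics described above can be illustrated with a toy metadata model: treat a branch as a mapping from logical paths to physical object addresses, so creating a branch copies pointers rather than bytes, and a merge compares pointers against a common base. All names below are illustrative, not the lakeFS implementation:

```python
def create_branch(source: dict) -> dict:
    # Zero-copy branch: duplicate the path -> physical-address mapping,
    # not the objects themselves. Both branches now share the same data.
    return dict(source)

def three_way_merge(base: dict, ours: dict, theirs: dict) -> dict:
    # Merge by comparing metadata pointers against the common ancestor.
    merged = dict(ours)
    for path in set(base) | set(theirs):
        base_v, their_v = base.get(path), theirs.get(path)
        if their_v != base_v:  # changed, added, or deleted on theirs
            if ours.get(path) not in (base_v, their_v):
                raise ValueError(f"conflict on {path}")  # both sides changed
            if their_v is None:
                merged.pop(path, None)  # deleted on theirs
            else:
                merged[path] = their_v  # take theirs
    return merged

main = {"events/a.parquet": "s3://bucket/obj-1", "events/b.parquet": "s3://bucket/obj-2"}
branch = create_branch(main)                      # instant: no data copied
branch["events/b.parquet"] = "s3://bucket/obj-3"  # only this pointer diverges
merged = three_way_merge(base=main, ours=main, theirs=branch)
print(merged["events/b.parquet"])  # s3://bucket/obj-3
```

Because only pointers are compared, the cost of the merge scales with the number of changed entries, not with the size of the objects they reference.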
Self-Hosting & Configuration
- Deploy via Docker, Kubernetes Helm chart, or native binaries
- Requires PostgreSQL (or DynamoDB on AWS) for metadata storage
- Configure the blockstore backend (S3, GCS, Azure, or local filesystem)
- Set up authentication via built-in users, LDAP, or OIDC
- Integrate with Airflow, Spark, or dbt through the S3-compatible endpoint (addressing data as s3a://repo/branch/path) or through the lakeFS Hadoop FileSystem client with lakefs:// URIs
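A self-hosted deployment is driven by a single configuration file covering the points above. The sketch below shows the general shape of such a file; the key names reflect my reading of the lakeFS configuration reference and the connection string, secret, and region are placeholders — verify everything against the official documentation before use:

```yaml
# Illustrative sketch only -- check key names against the lakeFS configuration reference.
database:
  type: "postgres"
  postgres:
    connection_string: "postgres://lakefs:password@pg.example.internal:5432/lakefs"
auth:
  encrypt:
    secret_key: "replace-with-a-random-secret"
blockstore:
  type: "s3"
  s3:
    region: "us-east-1"
```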
Key Features
- Zero-copy branching — create branches instantly without duplicating data
- S3-compatible gateway for transparent integration with any S3-aware tool
- Pre-commit and pre-merge hooks for automated data validation
- Web UI and CLI for browsing repositories, diffs, and commit history
- Open source under the Apache 2.0 license with an active community
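The pre-commit and pre-merge hooks listed above are declared as action files committed to the repository itself (under a `_lakefs_actions/` prefix). A hedged sketch of such an action — the hook id and webhook URL are placeholders:

```yaml
# _lakefs_actions/pre_merge_check.yaml -- illustrative; see the lakeFS hooks docs.
name: pre-merge quality gate
on:
  pre-merge:
    branches:
      - main
hooks:
  - id: validate_schema        # placeholder id
    type: webhook
    properties:
      url: "https://hooks.example.internal/validate"  # placeholder endpoint
```

If the webhook returns a failure, the merge into main is rejected, which is how data quality validation is enforced before changes reach production.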
Comparison with Similar Tools
- Delta Lake — table format with ACID transactions and time travel; LakeFS works at the object storage level across any file format
- DVC — Git-based data versioning for ML experiments; LakeFS versions entire data lakes with branching semantics
- Apache Iceberg — table format with snapshot isolation; LakeFS provides repository-level versioning independent of table format
- Nessie — Git-like catalog for Iceberg tables; LakeFS is format-agnostic and operates at the storage layer
FAQ
Q: Does branching duplicate my data? A: No. LakeFS uses copy-on-write at the metadata level. Branches share the same underlying objects until changes are made.
Q: Can I use LakeFS with Spark? A: Yes. Point Spark's S3A endpoint at the LakeFS S3 gateway and address data as s3a://repo/branch/path, or use the lakeFS Hadoop FileSystem client with lakefs:// URIs. No code changes are needed beyond configuration.
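Under the S3A route, the Spark-side change is configuration only. A sketch in spark-defaults.conf form, using standard Hadoop S3A keys with a hypothetical endpoint and placeholder credentials:

```
spark.hadoop.fs.s3a.endpoint           https://lakefs.example.internal
spark.hadoop.fs.s3a.access.key         AKIA-EXAMPLE
spark.hadoop.fs.s3a.secret.key         example-secret
spark.hadoop.fs.s3a.path.style.access  true
# Data is then addressed as s3a://<repository>/<branch>/<path>
```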
Q: What happens if LakeFS goes down? A: Data in the object store remains accessible directly. LakeFS only manages metadata; it does not move or transform your data.
Q: Does it support garbage collection? A: Yes. LakeFS provides a garbage collection process that reclaims objects no longer referenced by any branch or retained commit, such as those left behind by deleted branches.