# LakeFS — Git-Like Version Control for Data Lakes

> LakeFS adds Git-like branching, committing, and merging to your data lake on S3, GCS, or Azure Blob Storage, enabling reproducible data pipelines and zero-copy experimentation.

## Quick Use

```bash
# Run LakeFS with Docker
docker run --pull always -p 8000:8000 treeverse/lakefs run --local-settings

# The lakectl CLI is distributed as a standalone binary (see the GitHub
# releases page) and is also bundled in the Docker image

# Create a repository backed by S3
lakectl repo create lakefs://my-repo s3://my-bucket/data

# Create a branch from main and commit to it
lakectl branch create lakefs://my-repo/experiment -s lakefs://my-repo/main
lakectl commit lakefs://my-repo/experiment -m "Add training dataset v2"
```

## Introduction

LakeFS brings version-control semantics to object storage. Data engineers can create branches, run experimental transformations in isolation, diff the results against production, and merge — all without copying data. It acts as a gateway that intercepts S3-compatible API calls and manages versioned metadata.

## What LakeFS Does

- Provides Git-like branching, committing, merging, and reverting for data stored in object storage
- Exposes an S3-compatible API so existing tools (Spark, Trino, dbt, Airflow) work unchanged
- Enables zero-copy branching — branches share underlying data until changes diverge
- Tracks lineage and enables data diffing between any two references
- Supports pre-merge and pre-commit hooks for data quality validation

## Architecture Overview

LakeFS runs as a stateless Go service backed by PostgreSQL (or DynamoDB) for metadata and your existing object store (S3, GCS, or Azure) for data. When a client writes via the S3 gateway, LakeFS records the object in a branch-specific namespace. Commits create immutable snapshots of the metadata tree. Merges perform a three-way diff on metadata pointers, not on data bytes, making them fast regardless of dataset size.
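To make the gateway addressing concrete: through the S3 gateway, the repository plays the role of the bucket and the branch (or commit) is the first component of the object key. The helper below is a minimal, hypothetical sketch of that mapping — `lakefs_s3_address` is not part of any lakeFS client library.

```python
def lakefs_s3_address(repo: str, ref: str, path: str) -> tuple[str, str]:
    """Map (repository, branch-or-commit, path) to the (bucket, key) pair
    an S3 client would use when talking to the lakeFS S3 gateway."""
    return repo, f"{ref}/{path.lstrip('/')}"

# An S3 client pointed at the gateway endpoint (e.g. http://localhost:8000)
# would then upload with put_object(Bucket=bucket, Key=key).
bucket, key = lakefs_s3_address("my-repo", "experiment", "/data/train.csv")
print(bucket, key)  # my-repo experiment/data/train.csv
```

Because the branch is just a key prefix, switching a pipeline between branches is a one-string change — no tool needs to understand lakeFS semantics.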
## Self-Hosting & Configuration

- Deploy via Docker, a Kubernetes Helm chart, or native binaries
- Requires PostgreSQL (or DynamoDB on AWS) for metadata storage
- Configure the blockstore backend (S3, GCS, Azure, or local filesystem)
- Set up authentication via built-in users, LDAP, or OIDC
- Integrate with Airflow, Spark, or dbt using the S3-compatible endpoint with `lakefs://` URIs

## Key Features

- Zero-copy branching — create branches instantly without duplicating data
- S3-compatible gateway for transparent integration with any S3-aware tool
- Pre-commit and pre-merge hooks for automated data validation
- Web UI and CLI for browsing repositories, diffs, and commit history
- Open source under the Apache 2.0 license with an active community

## Comparison with Similar Tools

- **Delta Lake** — a table format with ACID transactions and time travel; LakeFS works at the object-storage level across any file format
- **DVC** — Git-based data versioning for ML experiments; LakeFS versions entire data lakes with branching semantics
- **Apache Iceberg** — a table format with snapshot isolation; LakeFS provides repository-level versioning independent of table format
- **Nessie** — a Git-like catalog for Iceberg tables; LakeFS is format-agnostic and operates at the storage layer

## FAQ

**Q: Does branching duplicate my data?**
A: No. LakeFS uses copy-on-write at the metadata level. Branches share the same underlying objects until changes are made.

**Q: Can I use LakeFS with Spark?**
A: Yes. Point your Spark jobs at the LakeFS S3 gateway using `lakefs://` URIs. No code changes are needed beyond updating the endpoint.

**Q: What happens if LakeFS goes down?**
A: Data in the object store remains directly accessible. LakeFS only manages metadata; it does not move or transform your data.

**Q: Does it support garbage collection?**
A: Yes. A built-in GC process reclaims unreferenced objects from deleted branches or old commits.
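The copy-on-write answer above can be sketched with a toy model: a branch starts as a shallow copy of the parent's path-to-object pointer table, so every object is shared until a write diverges a single pointer. This is purely illustrative of the idea, not lakeFS's actual data model.

```python
# Toy model of metadata-level copy-on-write branching (illustrative only).
def create_branch(parent: dict) -> dict:
    # A new branch copies only the pointer table, never the object data.
    return dict(parent)

def write(branch: dict, path: str, object_id: str) -> None:
    # A write diverges exactly one pointer; everything else stays shared.
    branch[path] = object_id

main = {"data/train.csv": "obj-001", "data/test.csv": "obj-002"}
experiment = create_branch(main)        # instant, zero data copied
write(experiment, "data/train.csv", "obj-003")

print(main["data/train.csv"])           # obj-001 (main is untouched)
print(experiment["data/test.csv"])      # obj-002 (still shared with main)
```

The same picture explains why merges are fast: comparing two branches means diffing pointer tables, not reading data bytes.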
## Sources

- https://github.com/treeverse/lakeFS
- https://docs.lakefs.io