# LakeFS — Git-Like Version Control for Data Lakes

> LakeFS adds Git-like branching, committing, and merging to your data lake on S3, GCS, or Azure Blob Storage, enabling reproducible data pipelines and zero-copy experimentation.

## Quick Use

```bash
# Run LakeFS with Docker
docker run --pull always -p 8000:8000 treeverse/lakefs run --local-settings

# The lakectl CLI is distributed as a standalone binary (see the GitHub
# releases page) and is also bundled in the Docker image

# Create a repository backed by S3
lakectl repo create lakefs://my-repo s3://my-bucket/data

# Create a branch from main and commit to it
lakectl branch create lakefs://my-repo/experiment -s lakefs://my-repo/main
lakectl commit lakefs://my-repo/experiment -m "Add training dataset v2"
```

## Introduction

LakeFS brings version-control semantics to object storage. Data engineers can create branches, run experimental transformations in isolation, diff the results against production, and merge — all without copying data. It acts as a gateway that intercepts S3-compatible API calls and manages versioned metadata.

## What LakeFS Does

- Provides Git-like branching, committing, merging, and reverting for data stored in object storage
- Exposes an S3-compatible API so existing tools (Spark, Trino, dbt, Airflow) work unchanged
- Enables zero-copy branching — branches share underlying data until changes diverge
- Tracks lineage and enables data diffing between any two references
- Supports pre-merge and pre-commit hooks for data quality validation

## Architecture Overview

LakeFS runs as a stateless Go service backed by PostgreSQL (or DynamoDB) for metadata and your existing object store (S3, GCS, or Azure) for data. When a client writes via the S3 gateway, LakeFS records the object in a branch-specific namespace. Commits create immutable snapshots of the metadata tree. Merges perform a three-way diff on metadata pointers, not on data bytes, making them fast regardless of dataset size.
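To make the gateway addressing concrete: through the S3 gateway, the repository plays the role of the bucket and the branch (or commit) is the first component of the object key. The helper below is a minimal, hypothetical sketch of that mapping — `lakefs_s3_address` is not part of any lakeFS client library.

```python
def lakefs_s3_address(repo: str, ref: str, path: str) -> tuple[str, str]:
    """Map (repository, branch-or-commit, path) to the (bucket, key) pair
    an S3 client would use when talking to the lakeFS S3 gateway."""
    return repo, f"{ref}/{path.lstrip('/')}"

# An S3 client pointed at the gateway endpoint (e.g. http://localhost:8000)
# would then upload with put_object(Bucket=bucket, Key=key).
bucket, key = lakefs_s3_address("my-repo", "experiment", "/data/train.csv")
print(bucket, key)  # my-repo experiment/data/train.csv
```

Because the branch is just a key prefix, switching a pipeline between branches is a one-string change — no tool needs to understand lakeFS semantics.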
## Self-Hosting & Configuration

- Deploy via Docker, a Kubernetes Helm chart, or native binaries
- Requires PostgreSQL (or DynamoDB on AWS) for metadata storage
- Configure the blockstore backend (S3, GCS, Azure, or local filesystem)
- Set up authentication via built-in users, LDAP, or OIDC
- Integrate with Airflow, Spark, or dbt using the S3-compatible endpoint with `lakefs://` URIs

## Key Features

- Zero-copy branching — create branches instantly without duplicating data
- S3-compatible gateway for transparent integration with any S3-aware tool
- Pre-commit and pre-merge hooks for automated data validation
- Web UI and CLI for browsing repositories, diffs, and commit history
- Open source under the Apache 2.0 license with an active community

## Comparison with Similar Tools

- **Delta Lake** — a table format with ACID transactions and time travel; LakeFS works at the object-storage level across any file format
- **DVC** — Git-based data versioning for ML experiments; LakeFS versions entire data lakes with branching semantics
- **Apache Iceberg** — a table format with snapshot isolation; LakeFS provides repository-level versioning independent of table format
- **Nessie** — a Git-like catalog for Iceberg tables; LakeFS is format-agnostic and operates at the storage layer

## FAQ

**Q: Does branching duplicate my data?**
A: No. LakeFS uses copy-on-write at the metadata level. Branches share the same underlying objects until changes are made.

**Q: Can I use LakeFS with Spark?**
A: Yes. Point your Spark jobs at the LakeFS S3 gateway using `lakefs://` URIs. No code changes are needed beyond updating the endpoint.

**Q: What happens if LakeFS goes down?**
A: Data in the object store remains directly accessible. LakeFS only manages metadata; it does not move or transform your data.

**Q: Does it support garbage collection?**
A: Yes. A built-in GC process reclaims unreferenced objects from deleted branches or old commits.
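The copy-on-write answer above can be sketched with a toy model: a branch starts as a shallow copy of the parent's path-to-object pointer table, so every object is shared until a write diverges a single pointer. This is purely illustrative of the idea, not lakeFS's actual data model.

```python
# Toy model of metadata-level copy-on-write branching (illustrative only).
def create_branch(parent: dict) -> dict:
    # A new branch copies only the pointer table, never the object data.
    return dict(parent)

def write(branch: dict, path: str, object_id: str) -> None:
    # A write diverges exactly one pointer; everything else stays shared.
    branch[path] = object_id

main = {"data/train.csv": "obj-001", "data/test.csv": "obj-002"}
experiment = create_branch(main)        # instant, zero data copied
write(experiment, "data/train.csv", "obj-003")

print(main["data/train.csv"])           # obj-001 (main is untouched)
print(experiment["data/test.csv"])      # obj-002 (still shared with main)
```

The same picture explains why merges are fast: comparing two branches means diffing pointer tables, not reading data bytes.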
## Sources

- https://github.com/treeverse/lakeFS
- https://docs.lakefs.io