# Delta Lake — Reliable Open Table Format for Data Lakehouses

> Delta Lake is an open-source storage layer that brings ACID transactions, scalable metadata handling, and time travel to data lakes. Originally created by Databricks, it runs on top of Apache Spark, Flink, Trino, and standalone engines via delta-rs.

## Install

```bash
pip install delta-spark   # Delta Lake with Apache Spark
pip install deltalake     # standalone access via delta-rs (no Spark required)
```

## Quick Use

```bash
# In PySpark:
# df.write.format("delta").save("/tmp/delta-table")
# spark.read.format("delta").load("/tmp/delta-table").show()

# Or use delta-rs for standalone Python access:
python -c "from deltalake import DeltaTable; dt = DeltaTable('./my_table')"
```

## Introduction

Delta Lake turns your cloud object store (S3, GCS, ADLS) into a reliable data lakehouse. It adds ACID transactions to Parquet files, so concurrent reads and writes never produce corrupt data. With schema enforcement, time travel, and audit history, it brings warehouse-grade reliability to the data lake.

## What Delta Lake Does

- Adds ACID transactions to data lake reads and writes via a JSON transaction log
- Enforces schemas on write and supports controlled schema evolution, preventing bad data from entering tables
- Provides time travel to query any historical version of a table
- Optimizes query performance with Z-ordering, file compaction, and data skipping
- Supports streaming and batch workloads on the same table simultaneously

## Architecture Overview

Delta Lake stores data as Parquet files plus a JSON-based transaction log (`_delta_log/`) in the same directory. Each commit appends a new log entry recording which files were added or removed. Readers use the log to construct a consistent snapshot without listing the filesystem. The protocol supports optimistic concurrency control for multi-writer scenarios. Delta-rs provides a Rust-native implementation for engines outside the JVM.
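The log-replay mechanism described above can be sketched in a few lines of Python. This is a deliberately simplified model — the file names and commits are invented, and the real Delta protocol also records metadata, protocol versions, and periodic checkpoints — but it shows how replaying add/remove actions yields a consistent snapshot at any version:

```python
# Toy model of a Delta-style transaction log: each commit is a list of
# "add"/"remove" actions, and a reader replays them in order to find the
# set of live data files. Simplified sketch, not the actual protocol.
commits = [
    [{"add": "part-000.parquet"}, {"add": "part-001.parquet"}],    # version 0
    [{"remove": "part-000.parquet"}, {"add": "part-002.parquet"}], # version 1: compaction
]

def snapshot(commits, version):
    """Replay log entries up to and including `version` to get live files."""
    files = set()
    for actions in commits[: version + 1]:
        for action in actions:
            if "add" in action:
                files.add(action["add"])
            if "remove" in action:
                files.discard(action["remove"])
    return sorted(files)

print(snapshot(commits, 0))  # ['part-000.parquet', 'part-001.parquet']
print(snapshot(commits, 1))  # ['part-001.parquet', 'part-002.parquet']
```

Because old versions are reconstructed the same way (just replay fewer commits), time travel falls out of this design for free: querying version 0 simply stops the replay earlier.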
## Self-Hosting & Configuration

- Use with PySpark by adding delta-spark as a pip dependency and configuring the SparkSession
- For standalone Python access, use the deltalake (delta-rs) library with no Spark required
- Store tables on S3, GCS, ADLS, or a local filesystem with standard cloud credentials
- Run `OPTIMIZE` and `VACUUM` commands periodically to compact small files and reclaim storage
- Configure table properties such as `delta.minReaderVersion` and `delta.enableChangeDataFeed`

## Key Features

- ACID transactions with optimistic concurrency for concurrent pipelines
- Time travel queries via version number or timestamp
- Change Data Feed captures row-level insert, update, and delete changes
- Liquid clustering replaces static partitioning with adaptive data layout
- UniForm generates Iceberg and Hudi metadata for cross-engine compatibility

## Comparison with Similar Tools

- **Apache Iceberg** — similar open table format with strong Trino/Flink adoption; Delta Lake has deeper Spark and Databricks integration
- **Apache Hudi** — focuses on incremental processing and upserts; Delta Lake emphasizes transaction log simplicity and warehouse features
- **Plain Parquet** — no transactions, schema enforcement, or time travel; Delta Lake adds all of these
- **Snowflake/BigQuery** — managed warehouses with proprietary formats; Delta Lake is open and runs on your own storage
- **Apache Paimon** — newer table format from the Flink community; Delta Lake has a larger ecosystem and production track record

## FAQ

**Q: Do I need Databricks to use Delta Lake?**
A: No. Delta Lake is fully open source. You can use it with Apache Spark, delta-rs (Python/Rust), Trino, Flink, or any compatible engine.

**Q: How is Delta Lake different from just using Parquet?**
A: Delta Lake adds ACID transactions, schema enforcement, time travel, and performance optimizations on top of Parquet files.

**Q: Can Delta Lake handle streaming data?**
A: Yes.
Spark Structured Streaming can write to and read from Delta tables, enabling unified batch and streaming pipelines on the same table.

**Q: What is UniForm?**
A: UniForm automatically generates Apache Iceberg and Hudi metadata alongside Delta metadata, allowing any engine to read the same table in its preferred format.

## Sources

- https://github.com/delta-io/delta
- https://docs.delta.io/

---

Source: https://tokrepo.com/en/workflows/01a8e6d0-39ec-11f1-9bc6-00163e2b0d79
Author: AI Open Source