Introduction
Delta Lake turns your cloud object store (S3, GCS, ADLS) into a reliable data lakehouse. It layers ACID transactions over Parquet files, so concurrent readers and writers always see consistent table snapshots instead of partial or corrupt data. With schema enforcement, time travel, and a full audit history, it brings warehouse-grade reliability to the data lake.
What Delta Lake Does
- Adds ACID transactions to data lake reads and writes via a JSON transaction log
- Enforces schemas on write to keep mismatched data out of tables, with opt-in schema evolution for compatible changes
- Provides time travel to query any historical version of a table
- Optimizes query performance with Z-ordering, file compaction, and data skipping
- Supports streaming and batch workloads on the same table simultaneously
Architecture Overview
Delta Lake stores data as Parquet files plus a JSON-based transaction log (_delta_log/) in the same directory. Each commit appends a new log entry recording which files were added or removed. Readers use the log to construct a consistent snapshot without listing the filesystem. The protocol supports optimistic concurrency control for multi-writer scenarios. Delta-rs provides a Rust-native implementation for engines outside the JVM.
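The commit protocol can be illustrated with a stdlib-only sketch (not the real implementation): each commit is an atomic "create if absent" of the next numbered JSON file in _delta_log/, so exactly one writer wins each version and the loser must re-read the new snapshot and retry:

```python
import json
import os

def try_commit(table_dir: str, version: int, actions: list[dict]) -> bool:
    """Attempt to commit `actions` as the given table version.

    Returns True if this writer won the version, False if a concurrent
    writer already committed it (caller should rebase and retry at version+1).
    """
    log_dir = os.path.join(table_dir, "_delta_log")
    os.makedirs(log_dir, exist_ok=True)
    # Delta names commits as zero-padded 20-digit versions, e.g. 00000000000000000000.json
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes creation atomic: at most one writer can create this file
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # someone else committed this version first
    with os.fdopen(fd, "w") as f:
        for action in actions:  # one JSON action per line: add / remove / commitInfo ...
            f.write(json.dumps(action) + "\n")
    return True
```

On local and HDFS-like filesystems an atomic create-if-absent such as this suffices; on object stores without that primitive, Delta delegates the same put-if-absent guarantee to a LogStore implementation.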
Self-Hosting & Configuration
- Use with PySpark by adding delta-spark as a pip dependency and configuring the SparkSession
- For standalone Python access, use the deltalake (delta-rs) library with no Spark required
- Store tables on S3, GCS, ADLS, or local filesystem with standard cloud credentials
- Run OPTIMIZE and VACUUM commands periodically to compact small files and reclaim storage
- Configure table properties like delta.minReaderVersion and delta.enableChangeDataFeed
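For the PySpark route, the documented SparkSession configuration looks like this sketch (assumes `pip install delta-spark` with a matching Spark version; shown as a configuration fragment, not run here):

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-demo")
    # register Delta's SQL extensions and catalog so Spark understands Delta tables
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
# delta-spark helper that attaches the matching Delta Maven artifacts
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# after this, format("delta") works for both reads and writes:
# spark.read.format("delta").load("s3a://bucket/table")
```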
Key Features
- ACID transactions with optimistic concurrency for concurrent pipelines
- Time travel queries via version number or timestamp
- Change Data Feed captures row-level insert, update, and delete changes
- Liquid clustering replaces static partitioning with adaptive data layout
- UniForm generates Iceberg and Hudi metadata for cross-engine compatibility
Comparison with Similar Tools
- Apache Iceberg — similar open table format with strong Trino/Flink adoption; Delta Lake has deeper Spark and Databricks integration
- Apache Hudi — focuses on incremental processing and upserts; Delta Lake emphasizes transaction log simplicity and warehouse features
- Plain Parquet — no transactions, schema enforcement, or time travel; Delta Lake adds all of these
- Snowflake/BigQuery — managed warehouses with proprietary formats; Delta Lake is open and runs on your own storage
- Apache Paimon — newer table format from Flink community; Delta Lake has a larger ecosystem and production track record
FAQ
Q: Do I need Databricks to use Delta Lake? A: No. Delta Lake is fully open source. You can use it with Apache Spark, delta-rs (Python/Rust), Trino, Flink, or any compatible engine.
Q: How is Delta Lake different from just using Parquet? A: Delta Lake adds ACID transactions, schema enforcement, time travel, and performance optimizations on top of Parquet files.
Q: Can Delta Lake handle streaming data? A: Yes. Spark Structured Streaming can write to and read from Delta tables, enabling unified batch and streaming pipelines on the same table.
Q: What is UniForm? A: UniForm automatically generates Apache Iceberg and Hudi metadata alongside Delta metadata, allowing any engine to read the same table in its preferred format.