Apr 16, 2026 · 3 min read

Delta Lake — Open Storage Format for the Lakehouse

ACID transactions, time travel, and schema evolution for your data lake on top of Parquet and object storage.

Introduction

Delta Lake is an open-source storage layer that brings ACID transactions, scalable metadata, and time travel to Parquet files sitting on cloud object storage. It is the foundation of the Databricks lakehouse architecture, and also runs on Spark, Trino, Presto, Flink, Hive, and a growing set of pure-Rust/Python clients.

What Delta Lake Does

  • Adds ACID transactions to Parquet tables via a JSON + checkpointed transaction log.
  • Supports MERGE/UPDATE/DELETE, time travel, and schema evolution.
  • Handles concurrent writers with optimistic concurrency control.
  • Integrates with Spark SQL, Structured Streaming, Flink, Trino, Athena, and more.
  • Provides Z-order, Liquid Clustering, and data skipping for fast analytic queries.
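To make the MERGE bullet concrete, here is a minimal stdlib sketch of upsert semantics (matched rows updated, unmatched rows inserted), modeled on plain dicts. The names `target` and `source` are illustrative only, not a Delta API; in Spark SQL this is expressed as `MERGE INTO target USING source ...`.

```python
# Sketch of MERGE (upsert) semantics on plain dicts keyed by id.
# WHEN MATCHED THEN UPDATE, WHEN NOT MATCHED THEN INSERT.

def merge(target: dict, source: dict) -> dict:
    merged = dict(target)      # start from the current table state
    for key, row in source.items():
        merged[key] = row      # update if the key exists, insert otherwise
    return merged

target = {1: {"name": "alice", "tier": "free"},
          2: {"name": "bob",   "tier": "pro"}}
source = {2: {"name": "bob",   "tier": "free"},   # matched -> update
          3: {"name": "carol", "tier": "pro"}}    # not matched -> insert

result = merge(target, source)
```

In a real Delta table the same operation rewrites only the Parquet files that contain matched rows and records the swap in the transaction log.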

Architecture Overview

A Delta table is a directory of Parquet data files plus a _delta_log directory containing an ordered sequence of JSON commits (*.json) and periodic Parquet checkpoints. Each commit records added/removed files and metadata changes. Readers reconstruct the latest snapshot from the log; writers append commits using optimistic concurrency and file-level conflict detection.
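The replay described above can be sketched in a few lines of stdlib Python. This is a simplified model with hypothetical file names: real commits also carry metaData/protocol actions, and readers start from the latest Parquet checkpoint rather than version 0.

```python
# Sketch: reconstructing a Delta snapshot from _delta_log.
# Each numbered JSON commit holds "add"/"remove" actions; replaying them
# in order yields the set of live Parquet files.

commits = [
    # 00000000000000000000.json
    [{"add": {"path": "part-0000.parquet"}},
     {"add": {"path": "part-0001.parquet"}}],
    # 00000000000000000001.json (e.g. a DELETE rewrote part-0001)
    [{"remove": {"path": "part-0001.parquet"}},
     {"add": {"path": "part-0002.parquet"}}],
]

def snapshot(commits, version=None):
    """Replay the log up to `version` (inclusive); None = latest."""
    live = set()
    for v, actions in enumerate(commits):
        if version is not None and v > version:
            break
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live
```

Note that `snapshot(commits, version=0)` is exactly time travel: replaying the same log, just stopping earlier.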

Self-Hosting & Configuration

  • Runtime options: Spark (io.delta:delta-spark), Flink, Trino (delta connector), Presto, Hive, Python (deltalake), Rust.
  • Use Databricks, EMR, Dataproc, or self-managed Spark/Flink with S3/GCS/Azure Blob/MinIO.
  • Enable Unity Catalog or Hive Metastore/Glue to register Delta tables for multi-engine access.
  • Tune retention with delta.deletedFileRetentionDuration and delta.logRetentionDuration.
  • Optimize layout with OPTIMIZE + ZORDER BY (or Liquid Clustering) on hot query keys.
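The retention tuning above exists because removed files are not deleted immediately: VACUUM only reclaims a file once its tombstone is older than delta.deletedFileRetentionDuration, so time travel and in-flight readers keep working. A stdlib sketch of that check, with hypothetical file names and timestamps:

```python
# Sketch: the retention logic behind VACUUM.
# A file removed from the snapshot is only physically deleted once it has
# been tombstoned longer than the retention window (default 7 days).
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)   # delta.deletedFileRetentionDuration
now = datetime(2026, 4, 16, tzinfo=timezone.utc)

# (path, deletion_timestamp) tombstones taken from "remove" actions in the log
tombstones = [
    ("part-0001.parquet", now - timedelta(days=30)),  # old enough to vacuum
    ("part-0005.parquet", now - timedelta(hours=2)),  # still retained
]

vacuumable = [path for path, deleted_at in tombstones
              if now - deleted_at > RETENTION]
```

Shortening the retention window reclaims storage faster but limits how far back time travel can reach.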

Key Features

  • ACID transactions on S3/GCS/Azure Blob/HDFS — no extra database required.
  • Time travel with VERSION AS OF / TIMESTAMP AS OF for audits and rollbacks.
  • Schema evolution and enforcement via mergeSchema and constraints.
  • Change Data Feed (CDF) lets downstream consumers read row-level changes.
  • UniForm: expose a Delta table as Iceberg or Hudi metadata for cross-engine reads.
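The Change Data Feed bullet can be illustrated with a small stdlib model. With delta.enableChangeDataFeed enabled, commits also record row-level changes tagged with _change_type and a commit version, so a consumer can ask for "changes between versions". The rows and the `table_changes` helper below are hypothetical stand-ins for the real CDF reader:

```python
# Sketch: what the Change Data Feed exposes to downstream consumers.
# Each changed row carries _change_type (insert / delete / update_preimage /
# update_postimage) plus the commit version that produced it.

cdf_rows = [
    {"id": 1, "tier": "free", "_change_type": "insert",           "_commit_version": 1},
    {"id": 1, "tier": "free", "_change_type": "update_preimage",  "_commit_version": 2},
    {"id": 1, "tier": "pro",  "_change_type": "update_postimage", "_commit_version": 2},
    {"id": 2, "tier": "pro",  "_change_type": "delete",           "_commit_version": 3},
]

def table_changes(rows, starting_version, ending_version):
    """Return the row-level changes committed in the given version range."""
    return [r for r in rows
            if starting_version <= r["_commit_version"] <= ending_version]

changes = table_changes(cdf_rows, 2, 3)
```

This is what lets incremental pipelines consume only the delta between two versions instead of rescanning the table.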

Comparison with Similar Tools

  • Apache Iceberg — Similar ACID lakehouse format with stronger multi-engine catalog story.
  • Apache Hudi — Optimized for upserts and incremental pulls; Delta focuses on simple + fast analytics.
  • Hive ACID — Older, metastore-heavy; Delta is log-based, cloud-native, and vendor-neutral.
  • Parquet alone — No ACID or time travel; Delta adds them without rewriting data.
  • BigLake / Snowflake Iceberg — Managed lakehouse catalogs; Delta is OSS and engine-agnostic.

FAQ

Q: Does Delta Lake require Spark? A: No — delta-rs provides a pure Rust library with Python bindings, and Trino/Flink connectors exist too.

Q: How does concurrency work? A: Writers use optimistic concurrency: commit is a conditional put on the next log file; conflicts are re-checked against file-level overlaps.
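The "conditional put" in that answer can be sketched with the filesystem's atomic create-if-absent flag, which is a reasonable stand-in for the object-store primitive Delta relies on. The directory and payloads here are illustrative, not the real commit format:

```python
# Sketch: optimistic concurrency via a conditional put on the next log file.
# A writer commits version N by creating _delta_log/N.json only if it does
# not already exist (O_CREAT | O_EXCL). The loser re-checks its read set
# for file-level conflicts and retries at the next version.
import os
import tempfile

log_dir = tempfile.mkdtemp()

def try_commit(version: int, payload: str) -> bool:
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # Atomic create-if-absent: fails if another writer won this version.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False              # lost the race -> conflict check, then retry
    with os.fdopen(fd, "w") as f:
        f.write(payload)
    return True

first = try_commit(0, '{"add": {}}')    # wins version 0
second = try_commit(0, '{"add": {}}')   # loses; would retry as version 1
```

If the losing writer's changed files do not overlap with the winning commit, the retry succeeds without re-running the job.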

Q: Can I query Delta from Athena or BigQuery? A: Yes — Athena supports Delta reads natively, and BigLake/Trino/Presto provide connectors.

Q: What is UniForm? A: A feature that writes Iceberg (and Hudi) metadata alongside Delta, enabling multi-format readers on the same data files.
