Scripts · Apr 16, 2026 · 3 min read

Delta Lake — Open Storage Format for the Lakehouse

ACID transactions, time travel, and schema evolution for your data lake on top of Parquet and object storage.

TL;DR
Delta Lake brings ACID transactions, time travel, and schema evolution to data lakes.
§01

What it is

Delta Lake is an open storage format that adds ACID transactions, time travel, and schema evolution to data lakes built on Parquet files and object storage like S3 or ADLS. It turns your data lake into a lakehouse by providing reliability guarantees that were previously only available in traditional data warehouses.

Delta Lake targets data engineers and analytics teams running Apache Spark, Flink, or Trino who need transactional consistency on their data lake without migrating to a closed-source warehouse. It is the default storage format for Databricks and is widely adopted in the open-source data ecosystem.

§02

Why it saves time or tokens

Without Delta Lake, data lakes suffer from partial writes, schema drift, and no ability to roll back bad data loads. Delta Lake's transaction log prevents these issues automatically. Time travel lets you query data as it existed at any past point, eliminating the need to maintain manual snapshot copies. For AI data pipelines that feed training data, this means reproducible datasets without complex versioning scripts.
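
For example, a bad load can be undone in place rather than rebuilt from a snapshot copy. A minimal rollback sketch, assuming delta-spark 1.2 or later, an active SparkSession named spark, and a hypothetical table at /data/events:

from delta.tables import DeltaTable

events = DeltaTable.forPath(spark, '/data/events')

# Inspect recent commits to find the last known-good version
events.history(10).select('version', 'timestamp', 'operation').show()

# Roll the table back to that version; later reads see the restored state as a new commit
events.restoreToVersion(42)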

§03

How to use

  1. Add the Delta Lake dependency to your Spark application: spark.jars.packages=io.delta:delta-spark_2.12:3.1.0 (see the session setup sketch below)
  2. Write DataFrames in Delta format: df.write.format('delta').save('/path/to/table')
  3. Read and time-travel with spark.read.format('delta').option('versionAsOf', 5).load('/path/to/table')
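
A minimal end-to-end sketch of those three steps on open-source Spark. The paths and sample data are placeholders; the two spark.sql.* settings are the ones the Delta Lake quickstart uses to enable Delta on a plain Spark session:

from pyspark.sql import SparkSession

# Step 1: pull in the Delta package and enable the Delta SQL extension and catalog
spark = (SparkSession.builder
    .appName('delta-quickstart')
    .config('spark.jars.packages', 'io.delta:delta-spark_2.12:3.1.0')
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
    .getOrCreate())

# Step 2: write a DataFrame as a Delta table (placeholder data and path)
df = spark.createDataFrame([(1, 'alice'), (2, 'bob')], ['id', 'name'])
df.write.format('delta').mode('overwrite').save('/tmp/delta/users')

# Step 3: read it back, optionally pinning a past version
current = spark.read.format('delta').load('/tmp/delta/users')
version0 = spark.read.format('delta').option('versionAsOf', 0).load('/tmp/delta/users')
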
§04

Example

from delta.tables import DeltaTable

# Upsert (merge) new data into existing table
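# Assumes an active SparkSession named spark and a source DataFrame new_data with an id column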
delta_table = DeltaTable.forPath(spark, '/data/users')

delta_table.alias('target').merge(
    new_data.alias('source'),
    'target.id = source.id'
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

# Time travel: read version 10
old_data = spark.read.format('delta') \
    .option('versionAsOf', 10) \
    .load('/data/users')

Feature | Data Lake (Parquet) | Delta Lake
ACID transactions | No | Yes
Time travel | No | Yes (version-based)
Schema evolution | Manual | Automatic
Merge/Upsert | Custom code | Built-in
Small file compaction | Manual | OPTIMIZE command

§05

Common pitfalls

  • Delta Lake keeps the data files behind every table version; run VACUUM periodically to delete files no longer referenced by the table and keep storage costs down (see the sketch after this list)
  • Time travel is bounded by delta.logRetentionDuration (default 30 days) for the transaction log and by how aggressively you VACUUM the underlying data files; once either is gone, older versions can no longer be queried
  • Concurrent writers from separate Spark clusters on S3 need the DynamoDB-backed LogStore for mutual exclusion; other object stores and single-cluster setups handle this natively
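
A small maintenance sketch for the first point, assuming an active SparkSession named spark and a table at /data/users; 168 hours matches the default 7-day retention check:

from delta.tables import DeltaTable

users = DeltaTable.forPath(spark, '/data/users')

# Delete data files that are no longer referenced by the table and are older than 168 hours.
# Shortening this window also shortens how far back time travel can reach.
users.vacuum(retentionHours=168)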

Frequently Asked Questions

What is the difference between Delta Lake and Apache Iceberg?

Both are open table formats adding ACID transactions to data lakes. Delta Lake originated from Databricks and integrates tightly with Spark. Apache Iceberg originated from Netflix and offers broader engine compatibility (Spark, Flink, Trino, Presto). Both support time travel and schema evolution. Choose based on your primary compute engine and ecosystem.

Can Delta Lake work without Databricks?

Yes. Delta Lake is fully open source under the Apache 2.0 license. You can use it with open-source Apache Spark, Flink, or standalone Delta Rust readers. Databricks provides additional proprietary optimizations, but the core format and features work independently.
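
For instance, the standalone Rust-based reader can open a Delta table with no Spark cluster at all. A minimal sketch, assuming the deltalake Python package (the delta-rs bindings) is installed and a table already exists at /data/users:

from deltalake import DeltaTable

# Open the table directly from storage; no Spark or Databricks runtime involved
dt = DeltaTable('/data/users')

print(dt.version())   # current table version from the transaction log
df = dt.to_pandas()   # materialize the table as a pandas DataFrame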

How does time travel work in Delta Lake?

Every write to a Delta table creates a new version in the transaction log. You can query any previous version by specifying versionAsOf or timestampAsOf in your read query. This enables auditing, debugging data issues, and reproducing ML training datasets from a specific point in time.
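
A short sketch of reading by timestamp rather than version, assuming the same hypothetical /data/users table and a Spark session with the Delta SQL extension enabled:

# List recent commits (version, timestamp, operation) to pick a point to read from
spark.sql('DESCRIBE HISTORY delta.`/data/users` LIMIT 5').show()

# Read the table as it was at a wall-clock timestamp instead of a version number
snapshot = spark.read.format('delta') \
    .option('timestampAsOf', '2026-04-01 00:00:00') \
    .load('/data/users')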

What is the OPTIMIZE command?

OPTIMIZE compacts small files in a Delta table into larger files, improving read performance. Data lakes accumulate many small files from streaming writes or frequent appends. Running OPTIMIZE periodically (e.g., daily) consolidates these into optimally-sized Parquet files without changing the data.
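
Both ways to trigger compaction, sketched under the assumption of Delta Lake 2.0+ (where the Python OPTIMIZE API is available) and the same hypothetical /data/users table:

from delta.tables import DeltaTable

# SQL form (requires the Delta SQL extension on the session)
spark.sql('OPTIMIZE delta.`/data/users`')

# Python form: compact small files into larger ones without changing the data
DeltaTable.forPath(spark, '/data/users').optimize().executeCompaction()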

Does Delta Lake support streaming workloads?

Yes. Delta Lake supports both batch and streaming reads and writes with Spark Structured Streaming. You can write streaming data to a Delta table and read it as a stream. The exactly-once semantics of Delta transactions ensure no data duplication or loss in streaming pipelines.
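
A minimal Structured Streaming sketch with placeholder paths for the source, sink, and checkpoint; the checkpoint plus the Delta transaction log is what provides the exactly-once guarantee:

# Read a Delta table as a stream: each new commit arrives as a micro-batch
events = spark.readStream.format('delta').load('/data/events')

# Continuously append the stream to another Delta table
query = (events.writeStream
    .format('delta')
    .outputMode('append')
    .option('checkpointLocation', '/checkpoints/events_copy')
    .start('/data/events_copy'))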
