Scripts · Apr 16, 2026 · 3 min read

Delta Lake — Open Storage Format for the Lakehouse

ACID transactions, time travel, and schema evolution for your data lake on top of Parquet and object storage.

TL;DR
Delta Lake brings ACID transactions, time travel, and schema evolution to data lakes.
§01

What it is

Delta Lake is an open storage format that adds ACID transactions, time travel, and schema evolution to data lakes built on Parquet files and object storage like S3 or ADLS. It turns your data lake into a lakehouse by providing reliability guarantees that were previously only available in traditional data warehouses.

Delta Lake targets data engineers and analytics teams running Apache Spark, Flink, or Trino who need transactional consistency on their data lake without migrating to a closed-source warehouse. It is the default storage format for Databricks and is widely adopted in the open-source data ecosystem.

§02

Why it saves time or tokens

Without Delta Lake, data lakes suffer from partial writes, schema drift, and no ability to roll back bad data loads. Delta Lake's transaction log prevents these issues automatically. Time travel lets you query data as it existed at any past point, eliminating the need to maintain manual snapshot copies. For AI data pipelines that feed training data, this means reproducible datasets without complex versioning scripts.
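
For example, a bad load can be undone in place rather than rebuilt from a snapshot copy. A minimal rollback sketch, assuming delta-spark 1.2 or later, an active SparkSession named spark, and a hypothetical table at /data/events:

from delta.tables import DeltaTable

events = DeltaTable.forPath(spark, '/data/events')

# Inspect recent commits to find the last known-good version
events.history(10).select('version', 'timestamp', 'operation').show()

# Roll the table back to that version; later reads see the restored state as a new commit
events.restoreToVersion(42)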

§03

How to use

  1. Add the Delta Lake dependency to your Spark application: spark.jars.packages=io.delta:delta-spark_2.12:3.1.0 (see the session setup sketch below)
  2. Write DataFrames in Delta format: df.write.format('delta').save('/path/to/table')
  3. Read and time-travel with spark.read.format('delta').option('versionAsOf', 5).load('/path/to/table')
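
A minimal end-to-end sketch of those three steps on open-source Spark. The paths and sample data are placeholders; the two spark.sql.* settings are the ones the Delta Lake quickstart uses to enable Delta on a plain Spark session:

from pyspark.sql import SparkSession

# Step 1: pull in the Delta package and enable the Delta SQL extension and catalog
spark = (SparkSession.builder
    .appName('delta-quickstart')
    .config('spark.jars.packages', 'io.delta:delta-spark_2.12:3.1.0')
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
    .getOrCreate())

# Step 2: write a DataFrame as a Delta table (placeholder data and path)
df = spark.createDataFrame([(1, 'alice'), (2, 'bob')], ['id', 'name'])
df.write.format('delta').mode('overwrite').save('/tmp/delta/users')

# Step 3: read it back, optionally pinning a past version
current = spark.read.format('delta').load('/tmp/delta/users')
version0 = spark.read.format('delta').option('versionAsOf', 0).load('/tmp/delta/users')
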
§04

Example

from delta.tables import DeltaTable

# Upsert (merge) new data into existing table
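# Assumes an active SparkSession named spark and a source DataFrame new_data with an id column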
delta_table = DeltaTable.forPath(spark, '/data/users')

delta_table.alias('target').merge(
    new_data.alias('source'),
    'target.id = source.id'
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

# Time travel: read version 10
old_data = spark.read.format('delta') \
    .option('versionAsOf', 10) \
    .load('/data/users')

Feature | Data Lake (Parquet) | Delta Lake
ACID transactions | No | Yes
Time travel | No | Yes (version-based)
Schema evolution | Manual | Automatic
Merge/Upsert | Custom code | Built-in
Small file compaction | Manual | OPTIMIZE command

§05

Common pitfalls

  • Delta Lake keeps the data files behind every table version; run VACUUM periodically to delete files no longer referenced by the table and keep storage costs down (see the sketch after this list)
  • Time travel is bounded by delta.logRetentionDuration (default 30 days) for the transaction log and by how aggressively you VACUUM the underlying data files; once either is gone, older versions can no longer be queried
  • Concurrent writers from separate Spark clusters on S3 need the DynamoDB-backed LogStore for mutual exclusion; other object stores and single-cluster setups handle this natively
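
A small maintenance sketch for the first point, assuming an active SparkSession named spark and a table at /data/users; 168 hours matches the default 7-day retention check:

from delta.tables import DeltaTable

users = DeltaTable.forPath(spark, '/data/users')

# Delete data files that are no longer referenced by the table and are older than 168 hours.
# Shortening this window also shortens how far back time travel can reach.
users.vacuum(retentionHours=168)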

Frequently Asked Questions

What is the difference between Delta Lake and Apache Iceberg?

Both are open table formats adding ACID transactions to data lakes. Delta Lake originated from Databricks and integrates tightly with Spark. Apache Iceberg originated from Netflix and offers broader engine compatibility (Spark, Flink, Trino, Presto). Both support time travel and schema evolution. Choose based on your primary compute engine and ecosystem.

Can Delta Lake work without Databricks?

Yes. Delta Lake is fully open source under the Apache 2.0 license. You can use it with open-source Apache Spark, Flink, or standalone Delta Rust readers. Databricks provides additional proprietary optimizations, but the core format and features work independently.
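
For instance, the standalone Rust-based reader can open a Delta table with no Spark cluster at all. A minimal sketch, assuming the deltalake Python package (the delta-rs bindings) is installed and a table already exists at /data/users:

from deltalake import DeltaTable

# Open the table directly from storage; no Spark or Databricks runtime involved
dt = DeltaTable('/data/users')

print(dt.version())   # current table version from the transaction log
df = dt.to_pandas()   # materialize the table as a pandas DataFrame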

How does time travel work in Delta Lake?

Every write to a Delta table creates a new version in the transaction log. You can query any previous version by specifying versionAsOf or timestampAsOf in your read query. This enables auditing, debugging data issues, and reproducing ML training datasets from a specific point in time.
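
A short sketch of reading by timestamp rather than version, assuming the same hypothetical /data/users table and a Spark session with the Delta SQL extension enabled:

# List recent commits (version, timestamp, operation) to pick a point to read from
spark.sql('DESCRIBE HISTORY delta.`/data/users` LIMIT 5').show()

# Read the table as it was at a wall-clock timestamp instead of a version number
snapshot = spark.read.format('delta') \
    .option('timestampAsOf', '2026-04-01 00:00:00') \
    .load('/data/users')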

What is the OPTIMIZE command?

OPTIMIZE compacts small files in a Delta table into larger files, improving read performance. Data lakes accumulate many small files from streaming writes or frequent appends. Running OPTIMIZE periodically (e.g., daily) consolidates these into optimally-sized Parquet files without changing the data.
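
Both ways to trigger compaction, sketched under the assumption of Delta Lake 2.0+ (where the Python OPTIMIZE API is available) and the same hypothetical /data/users table:

from delta.tables import DeltaTable

# SQL form (requires the Delta SQL extension on the session)
spark.sql('OPTIMIZE delta.`/data/users`')

# Python form: compact small files into larger ones without changing the data
DeltaTable.forPath(spark, '/data/users').optimize().executeCompaction()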

Does Delta Lake support streaming workloads?

Yes. Delta Lake supports both batch and streaming reads and writes with Spark Structured Streaming. You can write streaming data to a Delta table and read it as a stream. The exactly-once semantics of Delta transactions ensure no data duplication or loss in streaming pipelines.
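
A minimal Structured Streaming sketch with placeholder paths for the source, sink, and checkpoint; the checkpoint plus the Delta transaction log is what provides the exactly-once guarantee:

# Read a Delta table as a stream: each new commit arrives as a micro-batch
events = spark.readStream.format('delta').load('/data/events')

# Continuously append the stream to another Delta table
query = (events.writeStream
    .format('delta')
    .outputMode('append')
    .option('checkpointLocation', '/checkpoints/events_copy')
    .start('/data/events_copy'))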
