Delta Lake — Open Storage Format for the Lakehouse
ACID transactions, time travel, and schema evolution for your data lake on top of Parquet and object storage.
Install with prior review
This asset requires review. The copied prompt requests a dry run, shows the planned writes, and continues only after confirmation.
Do a dry run first, confirm the writes, and then run this command:
npx -y tokrepo@latest install e4dc6f52-3931-11f1-9bc6-00163e2b0d79 --target codex
What it is
Delta Lake is an open storage format that adds ACID transactions, time travel, and schema evolution to data lakes built on Parquet files and object storage like S3 or ADLS. It turns your data lake into a lakehouse by providing reliability guarantees that were previously only available in traditional data warehouses.
Delta Lake targets data engineers and analytics teams running Apache Spark, Flink, or Trino who need transactional consistency on their data lake without migrating to a closed-source warehouse. It is the default storage format for Databricks and is widely adopted in the open-source data ecosystem.
Why it saves time or tokens
Without Delta Lake, data lakes suffer from partial writes, schema drift, and no ability to roll back bad data loads. Delta Lake's transaction log prevents these issues automatically. Time travel lets you query data as it existed at any past point, eliminating the need to maintain manual snapshot copies. For AI data pipelines that feed training data, this means reproducible datasets without complex versioning scripts.
How to use
- Add the Delta Lake dependency to your Spark application:
  spark.jars.packages = io.delta:delta-spark_2.12:3.1.0
- Write DataFrames in Delta format:
  df.write.format('delta').save('/path/to/table')
- Read and time-travel with:
  spark.read.format('delta').option('versionAsOf', 5).load('/path/to/table')
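The steps above can be combined into a single session setup. The following is a minimal sketch, not a definitive configuration: the package coordinate is the one quoted above, the two `spark.sql.*` settings are the ones the Delta Lake quick start uses to enable its SQL extension and catalog, and the helper names `delta_conf`, `build_session`, and `demo` are hypothetical. The Scala suffix (`2.12`) and Delta version are assumptions that must match your Spark build.

```python
def delta_conf(scala_version="2.12", delta_version="3.1.0"):
    """Return Spark settings that load Delta Lake (versions are assumptions)."""
    return {
        "spark.jars.packages": f"io.delta:delta-spark_{scala_version}:{delta_version}",
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def build_session(app_name="delta-demo"):
    """Create a SparkSession with the settings above (requires pyspark)."""
    # Imported lazily so delta_conf() can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in delta_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def demo(path="/path/to/table"):
    """Write a tiny table, then read back its first version (placeholder path)."""
    spark = build_session()
    spark.range(5).write.format("delta").mode("overwrite").save(path)
    return spark.read.format("delta").option("versionAsOf", 0).load(path)
```

Calling `demo()` on a machine with Spark available writes five rows and reads them back at version 0, assuming the path did not already hold a Delta table.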
Example
from delta.tables import DeltaTable

# Assumes an active SparkSession `spark` and a DataFrame `new_data` with an `id` column.

# Upsert (merge) new data into existing table
delta_table = DeltaTable.forPath(spark, '/data/users')
delta_table.alias('target').merge(
    new_data.alias('source'),
    'target.id = source.id'
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

# Time travel: read version 10
old_data = spark.read.format('delta') \
    .option('versionAsOf', 10) \
    .load('/data/users')
| Feature | Data Lake (Parquet) | Delta Lake |
|---|---|---|
| ACID transactions | No | Yes |
| Time travel | No | Yes (by version or timestamp) |
| Schema evolution | Manual | Built-in (opt-in per write) |
| Merge/Upsert | Custom code | Built-in |
| Small file compaction | Manual | OPTIMIZE command |
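The schema evolution row deserves a concrete illustration, because Delta Lake enforces the existing table schema on write by default and only adds new columns when a write opts in. The sketch below uses the `mergeSchema` writer option for that; the function names are hypothetical, and `df` is assumed to be a Spark DataFrame whose extra columns should be added to the table.

```python
def schema_evolution_options():
    """Writer options that opt a single write into schema evolution."""
    return {"mergeSchema": "true"}


def append_with_new_columns(df, path):
    """Append df to the Delta table at path, adding columns the table lacks.

    df is assumed to be a pyspark.sql.DataFrame; without the mergeSchema
    option, a write with unknown columns is rejected by schema enforcement.
    """
    (
        df.write.format("delta")
        .mode("append")
        .options(**schema_evolution_options())
        .save(path)
    )
```

Keeping the option per write, rather than enabling it globally, means an unexpected column in one pipeline fails loudly instead of silently widening the table.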
Common pitfalls
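- Every write adds a commit to Delta Lake's transaction log, and updates, deletes, and compactions leave behind data files that the current version no longer references; run VACUUM periodically to delete those unreferenced files and reduce storage costs. Versions whose files have been vacuumed can no longer be read with time travel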
- Time travel retention is controlled by the delta.logRetentionDuration setting; the default is 30 days, after which old versions are garbage collected
- Concurrent writers from different Spark clusters require a shared lock mechanism like DynamoDB for S3 or a Hive metastore
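The retention and cleanup pitfalls above can be handled with two SQL statements. This is a sketch under stated assumptions: the helper names are hypothetical, the table is addressed by path using the delta.`/path` form, and the 7-day (168-hour) file retention mirrors Delta Lake's documented default for `delta.deletedFileRetentionDuration`. The statements are built as strings so they can be reviewed before being passed to `spark.sql`.

```python
def set_retention_sql(path, log_retention="interval 30 days",
                      file_retention="interval 7 days"):
    """Build an ALTER TABLE statement that sets both retention properties."""
    return (
        f"ALTER TABLE delta.`{path}` SET TBLPROPERTIES ("
        f"'delta.logRetentionDuration' = '{log_retention}', "
        f"'delta.deletedFileRetentionDuration' = '{file_retention}')"
    )


def vacuum_sql(path, retain_hours=168, dry_run=True):
    """Build a VACUUM statement; DRY RUN only lists the files it would delete."""
    stmt = f"VACUUM delta.`{path}` RETAIN {retain_hours} HOURS"
    return stmt + " DRY RUN" if dry_run else stmt


def run_maintenance(spark, path):
    """Apply retention settings, then vacuum (spark is an active SparkSession)."""
    spark.sql(set_retention_sql(path))
    spark.sql(vacuum_sql(path, dry_run=False))
```

Defaulting `dry_run` to `True` is a deliberate choice here: VACUUM permanently removes files, so listing them first is the safer habit.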
Frequently asked questions
How does Delta Lake compare to Apache Iceberg?
Both are open table formats that add ACID transactions to data lakes. Delta Lake originated at Databricks and integrates tightly with Spark. Apache Iceberg originated at Netflix and offers broader engine compatibility (Spark, Flink, Trino, Presto). Both support time travel and schema evolution. Choose based on your primary compute engine and ecosystem.
Can I use Delta Lake without Databricks?
Yes. Delta Lake is fully open source under the Apache 2.0 license. You can use it with open-source Apache Spark, Flink, or standalone Delta Rust readers. Databricks provides additional proprietary optimizations, but the core format and features work independently.
How does time travel work?
Every write to a Delta table creates a new version in the transaction log. You can query any previous version by specifying versionAsOf or timestampAsOf in your read query. This enables auditing, debugging data issues, and reproducing ML training datasets from a specific point in time.
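To make the versioning idea concrete, here is a toy in-memory model of an ordered commit log. It is a deliberately simplified illustration, not Delta Lake's actual implementation or file format: the class `ToyDeltaLog` and its methods are hypothetical, and each commit is reduced to a set of data files added and a set removed. Reading "as of" a version then amounts to replaying commits up to that version.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDeltaLog:
    """In-memory stand-in for an ordered commit log (illustrative only)."""
    commits: list = field(default_factory=list)  # commits[v] = (added, removed)

    def commit(self, add=(), remove=()):
        """Append one commit atomically and return its version number."""
        self.commits.append((frozenset(add), frozenset(remove)))
        return len(self.commits) - 1

    def files_as_of(self, version):
        """Replay commits 0..version to find the files visible at that version."""
        if not 0 <= version < len(self.commits):
            raise ValueError(f"version {version} does not exist")
        live = set()
        for added, removed in self.commits[: version + 1]:
            live -= removed
            live |= added
        return live


log = ToyDeltaLog()
v0 = log.commit(add={"part-0.parquet"})
v1 = log.commit(add={"part-1.parquet"})
# A compaction-style commit: one new file replaces the two earlier ones.
v2 = log.commit(add={"part-2.parquet"},
                remove={"part-0.parquet", "part-1.parquet"})

print(sorted(log.files_as_of(v1)))  # files a reader at version 1 would scan
print(sorted(log.files_as_of(v2)))  # files a reader at the latest version scans
```

The model also shows why cleanup limits time travel: version 1 can only be read while `part-0.parquet` and `part-1.parquet` still physically exist, even though the latest version no longer references them.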
What does the OPTIMIZE command do?
OPTIMIZE compacts small files in a Delta table into larger files, improving read performance. Data lakes accumulate many small files from streaming writes or frequent appends. Running OPTIMIZE periodically (e.g., daily) consolidates these into optimally-sized Parquet files without changing the data.
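A periodic compaction job can be as small as the sketch below. It assumes a path-addressed table and an active SparkSession; the helper names are hypothetical, and the optional `where` predicate is assumed to reference partition columns only, since OPTIMIZE restricts its filter to those.

```python
def optimize_sql(path, where=None):
    """Build an OPTIMIZE statement, optionally limited to some partitions."""
    stmt = f"OPTIMIZE delta.`{path}`"
    if where:
        stmt += f" WHERE {where}"
    return stmt


def run_optimize(spark, path, where=None):
    """Run compaction through Spark SQL (spark is an active SparkSession)."""
    return spark.sql(optimize_sql(path, where))
```

The same operation is available from Python as `DeltaTable.forPath(spark, path).optimize().executeCompaction()`; the SQL form is used here only because it is easy to log and review.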
Does Delta Lake support streaming?
Yes. Delta Lake supports both batch and streaming reads and writes with Spark Structured Streaming. You can write streaming data to a Delta table and read it as a stream. The exactly-once semantics of Delta transactions ensure no data duplication or loss in streaming pipelines.
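A streaming write and read might look like the following sketch. The function names and paths are hypothetical; `stream_df` is assumed to be a streaming DataFrame and `spark` an active SparkSession. The checkpoint location is what lets Structured Streaming resume after a failure without rewriting data, so each sink should get its own directory.

```python
def delta_sink_options(checkpoint_dir):
    """Options for a Delta streaming sink; each sink needs its own checkpoint."""
    return {"checkpointLocation": checkpoint_dir}


def start_delta_sink(stream_df, table_path, checkpoint_dir):
    """Start appending a streaming DataFrame to the Delta table at table_path."""
    return (
        stream_df.writeStream.format("delta")
        .outputMode("append")
        .options(**delta_sink_options(checkpoint_dir))
        .start(table_path)
    )


def read_delta_stream(spark, table_path):
    """Read the Delta table at table_path as a stream of newly committed rows."""
    return spark.readStream.format("delta").load(table_path)
```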