
Skills · Apr 16, 2026 · 3 min read

Delta Lake — Open Storage Format for the Lakehouse

ACID transactions, time travel, and schema evolution for your data lake on top of Parquet and object storage.

Script Depot · Community

Agent-ready

Install with prior review

This asset requires review. The copied prompt asks for a dry run, shows the writes it would make, and continues only after confirmation.

Needs Confirmation · 64/100 · Policy: confirm

Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry: Delta Lake Lakehouse

Command with prior review

npx -y tokrepo@latest install e4dc6f52-3931-11f1-9bc6-00163e2b0d79 --target codex

Run a dry run first, confirm the writes, and then execute this command.

TL;DR

Delta Lake brings ACID transactions, time travel, and schema evolution to data lakes.

§01

What it is

Delta Lake is an open storage format that adds ACID transactions, time travel, and schema evolution to data lakes built on Parquet files and object storage like S3 or ADLS. It turns your data lake into a lakehouse by providing reliability guarantees that were previously only available in traditional data warehouses.

Delta Lake targets data engineers and analytics teams running Apache Spark, Flink, or Trino who need transactional consistency on their data lake without migrating to a closed-source warehouse. It is the default table format on Databricks and is widely adopted in the open-source data ecosystem.

§02

Why it saves time or tokens

Without Delta Lake, data lakes suffer from partial writes, schema drift, and no way to roll back bad data loads. Delta Lake addresses each of these: the transaction log makes every write atomic, schema enforcement rejects writes that do not match the table schema, and earlier versions stay available for rollback. Time travel lets you query data as it existed at any earlier version or timestamp within the retention window, removing the need to maintain manual snapshot copies. For AI data pipelines that feed training data, this means reproducible datasets without complex versioning scripts.

§03

How to use

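Add the Delta Lake dependency to your Spark application (this build targets Spark 3.5 and Scala 2.12) and enable its SQL extension and catalog: spark.jars.packages = io.delta:delta-spark_2.12:3.1.0, spark.sql.extensions = io.delta.sql.DeltaSparkSessionExtension, spark.sql.catalog.spark_catalog = org.apache.spark.sql.delta.catalog.DeltaCatalog
Write DataFrames in Delta format: df.write.format('delta').save('/path/to/table')
Read the current table with spark.read.format('delta').load('/path/to/table'), or time-travel to an earlier version with spark.read.format('delta').option('versionAsOf', 5).load('/path/to/table')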
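
The setup step above can be sketched in PySpark as follows. This is a minimal sketch, assuming pyspark is installed and Spark can download the Delta package at startup; the names DELTA_SPARK_CONF and build_delta_session are hypothetical helpers, not part of Delta Lake.

```python
# Spark settings for the Delta Lake 3.1.0 / Scala 2.12 coordinates used in this guide.
DELTA_SPARK_CONF = {
    "spark.jars.packages": "io.delta:delta-spark_2.12:3.1.0",
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def build_delta_session(app_name="delta-quickstart"):
    """Create a SparkSession with Delta Lake enabled (hypothetical helper)."""
    # Imported lazily so the settings above can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in DELTA_SPARK_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling spark = build_delta_session() should then give a session on which the write and read calls from the steps above work against Delta tables.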

§04

Example

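from delta.tables import DeltaTable

# Assumes `spark` is a Delta-enabled SparkSession and `new_data` is a
# DataFrame whose columns match the table stored at /data/users.

# Upsert (merge) new data into the existing table
delta_table = DeltaTable.forPath(spark, '/data/users')

(
    delta_table.alias('target')
    .merge(new_data.alias('source'), 'target.id = source.id')
    .whenMatchedUpdateAll()      # matched ids: overwrite all columns from source
    .whenNotMatchedInsertAll()   # unmatched ids: insert as new rows
    .execute()
)

# Time travel: read version 10
old_data = (
    spark.read.format('delta')
    .option('versionAsOf', 10)
    .load('/data/users')
)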
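
To make the row-level effect of the merge concrete, here is a toy model in plain Python. It is only an illustration of upsert semantics under the assumption that source keys are unique and both sides share the same columns; merge_upsert is a hypothetical function and says nothing about how Delta Lake executes the merge internally.

```python
def merge_upsert(target_rows, source_rows, key="id"):
    """Toy upsert: update matched rows, insert unmatched ones, keep the rest."""
    merged = {row[key]: dict(row) for row in target_rows}
    for row in source_rows:
        # A key already in the target is overwritten with the source row
        # (like whenMatchedUpdateAll); a new key is added as a new row
        # (like whenNotMatchedInsertAll).
        merged[row[key]] = dict(row)
    return sorted(merged.values(), key=lambda r: r[key])


target = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}]
source = [{"id": 2, "name": "Robert"}, {"id": 3, "name": "Cy"}]
result = merge_upsert(target, source)
```

Row 1 is untouched because no source row matches it, row 2 takes the source values, and row 3 is inserted.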

Feature	Data Lake (Parquet)	Delta Lake
ACID transactions	No	Yes
Time travel	No	Yes (by version or timestamp)
Schema evolution	Manual	Built-in (opt-in, e.g. mergeSchema)
Merge/Upsert	Custom code	Built-in
Small file compaction	Manual	OPTIMIZE command

§05

Related on TokRepo

AI tools for database — data storage and database tools on TokRepo
AI tools for automation — data pipeline automation tools

§06

Common pitfalls

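Every write adds a commit to the transaction log and can leave behind data files that the current version no longer references; run VACUUM periodically to delete those unreferenced data files and reduce storage costs (VACUUM does not shrink the log itself, which Delta Lake cleans up automatically)
Time travel retention depends on two table properties: delta.logRetentionDuration (default 30 days) controls how long commit history is kept, and delta.deletedFileRetentionDuration (default 7 days) controls how long VACUUM keeps unreferenced data files; once VACUUM has deleted a version's files, that version can no longer be read
Concurrent writers from different Spark clusters to the same table on S3 need an external coordination mechanism, such as the DynamoDB-backed LogStore, because the default S3 setup only guarantees safe concurrent writes from a single cluster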
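
As a sketch of how the retention settings and VACUUM fit together, the helper below only builds Spark SQL strings for a path-based table. Alongside delta.logRetentionDuration, it sets delta.deletedFileRetentionDuration (default 7 days), the table property that governs how long unreferenced data files are kept. The function name and its parameters are hypothetical; pass each statement to spark.sql(...) on a Delta-enabled session to run it.

```python
def retention_statements(table_path, log_days=30, deleted_file_days=7):
    """Return SQL that sets retention properties and then vacuums the table."""
    table = f"delta.`{table_path}`"
    set_props = (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'delta.logRetentionDuration' = 'interval {log_days} days', "
        f"'delta.deletedFileRetentionDuration' = 'interval {deleted_file_days} days')"
    )
    # VACUUM expresses its retention threshold in hours.
    vacuum = f"VACUUM {table} RETAIN {deleted_file_days * 24} HOURS"
    return [set_props, vacuum]


statements = retention_statements("/data/users")
```

Appending DRY RUN to the VACUUM statement lists the files that would be deleted without removing them, which is a safer first step on a production table.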

Frequently asked questions

What is the difference between Delta Lake and Apache Iceberg?

Both are open table formats that add ACID transactions to data lakes. Delta Lake originated at Databricks and integrates most tightly with Spark, with connectors for other engines such as Flink and Trino. Apache Iceberg originated at Netflix and was designed to be engine-agnostic (Spark, Flink, Trino, Presto). Both support time travel and schema evolution. Choose based on your primary compute engine and ecosystem.

Can Delta Lake work without Databricks?

Yes. Delta Lake is fully open source under the Apache 2.0 license. You can use it with open-source Apache Spark, Flink, or without Spark at all through delta-rs, the Rust implementation that also ships Python bindings. Databricks provides additional proprietary optimizations, but the core format and features work independently.

How does time travel work in Delta Lake?

Every write to a Delta table creates a new version in the transaction log. You can query any previous version that is still retained by specifying versionAsOf or timestampAsOf in your read query. This enables auditing, debugging data issues, and reproducing ML training datasets from a specific point in time.
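
The idea can be sketched with a toy log in plain Python. In Delta Lake each commit records which data files were added and which were removed; the model below keeps only that much and replays commits up to the requested version. The snapshot function, the dictionary layout, and the file names are illustrative assumptions, not the actual log format or reader.

```python
def snapshot(log, version):
    """Replay commits 0..version and return the set of live data files."""
    if not 0 <= version < len(log):
        raise ValueError(f"version {version} is not in the log")
    files = set()
    for commit in log[: version + 1]:
        files -= set(commit.get("remove", []))
        files |= set(commit.get("add", []))
    return files


log = [
    {"add": ["part-0.parquet"]},                                # version 0: initial write
    {"add": ["part-1.parquet"]},                                # version 1: append
    {"add": ["part-2.parquet"], "remove": ["part-0.parquet"]},  # version 2: rewrite of part-0
]

as_of_v1 = snapshot(log, 1)
latest = snapshot(log, 2)
```

Reading version 1 still resolves to part-0.parquet even though version 2 removed it from the table, which is why deleting old data files ends time travel for the versions that depend on them.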

What is the OPTIMIZE command?

OPTIMIZE compacts small files in a Delta table into larger files, improving read performance. Data lakes accumulate many small files from streaming writes or frequent appends. Running OPTIMIZE periodically (e.g., daily) consolidates these into fewer, larger Parquet files without changing the data. The replaced small files remain on storage until VACUUM removes them.
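
To illustrate what compaction means, the sketch below greedily groups small files into bins up to a target size. It is a simplified planning model written for this page, not the algorithm Delta Lake uses; plan_compaction and target_mb are hypothetical names.

```python
def plan_compaction(file_sizes_mb, target_mb=1024):
    """Greedily group small files into bins of at most target_mb each."""
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        # Close the current bin when adding this file would exceed the target.
        if current and current_size + size > target_mb:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins


plan = plan_compaction([500, 100, 400, 200, 300], target_mb=1000)
```

Each bin stands for one output file, so five small files become two larger ones in this example.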

Does Delta Lake support streaming workloads?

Yes. Delta Lake supports both batch and streaming reads and writes with Spark Structured Streaming. You can write streaming data to a Delta table and read it as a stream. Combined with Structured Streaming checkpoints, Delta's atomic commits give exactly-once processing, so a restarted pipeline neither duplicates nor loses data.
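
A minimal sketch of a Delta-to-Delta streaming job, assuming spark is a Delta-enabled SparkSession and that the three paths are placeholders you supply. The function name is hypothetical; the calls it chains are the standard Structured Streaming reader and writer methods.

```python
def start_delta_to_delta_stream(spark, source_path, sink_path, checkpoint_path):
    """Stream new rows from one Delta table into another and return the query."""
    stream = spark.readStream.format("delta").load(source_path)
    return (
        stream.writeStream.format("delta")
        .outputMode("append")
        # The checkpoint tracks what has already been processed, so a
        # restarted query resumes where it stopped instead of rewriting data.
        .option("checkpointLocation", checkpoint_path)
        .start(sink_path)
    )
```

The returned query object can be kept to call awaitTermination() or stop() on it; each source-to-sink pair should use its own checkpoint location.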

