
Skills · Apr 16, 2026 · 3 min read

Apache Iceberg — Open Table Format for Huge Analytical Datasets

High-performance, engine-agnostic table format that brings ACID transactions, schema evolution, and time travel to Parquet data lakes.

Apache Software Foundation · Community

Agent-ready

Install with prior review

This asset requires review. The copied prompt asks for a dry run, shows the writes, and continues only after confirmation.

Needs Confirmation · 64/100 · Policy: confirm

Agent surface: Any MCP/CLI agent
Type: Skill
Install: Single
Trust: Community
Entry: Iceberg Table Format

Command with prior review

npx -y tokrepo@latest install fba4cec0-3931-11f1-9bc6-00163e2b0d79 --target codex

Do a dry run first, confirm the writes, and then run this command.

TL;DR

Iceberg is an open table format that adds ACID transactions and schema evolution to Parquet data lakes.

§01

What it is

Apache Iceberg is a high-performance, engine-agnostic open table format designed for huge analytical datasets. It sits between your compute engine (Spark, Trino, Flink, Dremio) and your object storage (S3, GCS, HDFS), providing ACID transactions, schema evolution, partition evolution, and time travel on top of Parquet or ORC files.

Iceberg is for data engineers and platform teams who manage petabyte-scale data lakes and need reliability guarantees that raw Parquet directories cannot provide.

Iceberg is an Apache Software Foundation project released under the Apache License 2.0, so you can inspect the source code, contribute fixes, and adapt it to your specific requirements.

§02

How it saves time or tokens

On a plain Parquet or Hive-style table, changing a column often means rewriting data files, changing the partition layout means migrating data, and concurrent writers can leave readers with a partial or inconsistent view. Iceberg addresses all three in its metadata layer: each commit atomically produces an immutable snapshot, and readers always see a consistent view. In practice this means fewer pipeline failures, fewer manual repair jobs, and schema changes that do not require taking the table offline.
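The snapshot idea can be illustrated with a toy model. This is a minimal sketch in plain Python, not Iceberg's implementation: `ToyTable`, `Snapshot`, and `CommitConflict` are hypothetical names, and a real catalog performs the pointer swap atomically against shared storage rather than in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: an id plus the data files it contains."""
    snapshot_id: int
    files: tuple[str, ...]


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyTable:
    """Hypothetical stand-in for a catalog entry that points at the current snapshot."""

    def __init__(self) -> None:
        self.current = Snapshot(0, ())
        self.history = [self.current]

    def commit_append(self, base: Snapshot, new_files: list[str]) -> Snapshot:
        # Optimistic concurrency: the commit succeeds only if the table still
        # points at the snapshot the writer started from.
        if base.snapshot_id != self.current.snapshot_id:
            raise CommitConflict(f"table changed since snapshot {base.snapshot_id}")
        snap = Snapshot(base.snapshot_id + 1, base.files + tuple(new_files))
        self.history.append(snap)
        self.current = snap  # one pointer swap; readers never see a partial write
        return snap


table = ToyTable()
reader_view = table.current        # a reader pins snapshot 0
writer_a_base = table.current      # two writers start from the same snapshot
writer_b_base = table.current

table.commit_append(writer_a_base, ["a.parquet"])
try:
    table.commit_append(writer_b_base, ["b.parquet"])
except CommitConflict:
    # The losing writer refreshes and retries against the new snapshot.
    table.commit_append(table.current, ["b.parquet"])
```

The reader that pinned snapshot 0 keeps seeing an empty table while both writers commit, and the second writer's stale commit is rejected instead of silently overwriting the first.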

§03

How to use

1. Add the Iceberg runtime dependency to your Spark session, or configure an Iceberg catalog in Trino.
2. Create an Iceberg table using standard SQL DDL with the Iceberg format specified (USING iceberg in Spark SQL).
3. Write data using INSERT, MERGE INTO, or the DataFrame APIs. Iceberg handles snapshots and metadata automatically.
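For the Spark setup step, the configuration usually comes down to a handful of properties. The sketch below assembles them in plain Python, assuming a catalog named `catalog` (to match the example that follows), a Hadoop-type catalog, and a local warehouse path; all three are placeholders. The helper names `iceberg_spark_conf` and `to_cli_flags` are hypothetical, and the Iceberg Spark runtime jar matching your Spark and Scala versions still has to be on the classpath.

```python
def iceberg_spark_conf(catalog: str, catalog_type: str, warehouse: str) -> dict[str, str]:
    """Return Spark properties that register an Iceberg catalog under `catalog`."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enables Iceberg's SQL extensions (MERGE INTO, CALL procedures, ...).
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": catalog_type,   # e.g. "hadoop", "hive", or "rest"
        f"{prefix}.warehouse": warehouse,
    }


def to_cli_flags(conf: dict[str, str]) -> list[str]:
    """Render the properties as `--conf key=value` arguments for spark-sql or spark-submit."""
    flags: list[str] = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


# Placeholder values: a Hadoop-type catalog on a local path is only suitable for testing.
conf = iceberg_spark_conf("catalog", "hadoop", "/tmp/iceberg-warehouse")
flags = to_cli_flags(conf)
print(" ".join(flags))
```

The same dictionary can be applied key by key through a SparkSession builder's `config` method. A Hive Metastore or REST catalog needs additional properties such as a `uri`, which this sketch leaves out.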

§04

Example

-- Create an Iceberg table in Spark SQL
CREATE TABLE catalog.db.events (
  event_id BIGINT,
  event_type STRING,
  ts TIMESTAMP,
  payload STRING
) USING iceberg
PARTITIONED BY (days(ts));

-- Insert data
INSERT INTO catalog.db.events VALUES
  (1, 'click', TIMESTAMP '2026-04-15 10:00:00', '{"page": "/home"}');

-- Time travel query (Spark 3.3+); fails if no snapshot exists at or before the timestamp
SELECT * FROM catalog.db.events
  FOR SYSTEM_TIME AS OF TIMESTAMP '2026-04-14 00:00:00';

-- Schema evolution (no rewrite)
ALTER TABLE catalog.db.events ADD COLUMN region STRING;

§05

Related on TokRepo

AI Tools for Database -- Explore database tools and utilities curated on TokRepo
AI Tools for DevOps -- Infrastructure and pipeline tooling for data platform teams

§06

Common pitfalls

Forgetting to configure a metadata catalog (Hive Metastore, AWS Glue, or a REST catalog) leads to orphaned metadata files and broken reads, because the catalog is what tracks each table's current metadata file.
Running compaction too infrequently results in thousands of small files, degrading query performance significantly.
Mixing Iceberg and non-Iceberg writes to the same directory corrupts the table state because non-Iceberg writers bypass snapshot tracking.
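To make the small-files pitfall concrete, here is a simplified sketch of how a compaction job might group small files into rewrite tasks. It uses first-fit-decreasing bin packing with an assumed 512 MB target; `plan_compaction` is a hypothetical helper, not Iceberg's planner. In Spark, the actual rewrite is typically triggered through Iceberg's `rewrite_data_files` procedure.

```python
def plan_compaction(file_sizes_mb: list[float], target_mb: float = 512.0) -> list[list[float]]:
    """Group files into bins of at most `target_mb` using first-fit decreasing."""
    bins: list[list[float]] = []
    for size in sorted(file_sizes_mb, reverse=True):
        for group in bins:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            # No existing bin has room, so start a new output file.
            bins.append([size])
    return bins


# Illustrative input: 104 small files totalling 1056 MB.
small_files = [8.0] * 100 + [64.0] * 4
groups = plan_compaction(small_files)
print(f"{len(small_files)} input files -> {len(groups)} output files")
```

Each group becomes one rewritten file, so a query planner that previously had to open 104 files opens only a few after compaction.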

Before adopting this tool, evaluate whether it fits your team's existing workflow. Read the official documentation thoroughly, and start with a small proof-of-concept rather than a full migration. Community forums, GitHub issues, and Stack Overflow are valuable resources when you encounter edge cases not covered in the documentation.

Frequently asked questions

What engines work with Apache Iceberg?

Apache Spark, Trino, Apache Flink, Dremio, Snowflake, BigQuery, and Amazon Athena all support Iceberg tables natively. The format is engine-agnostic by design, so you can write with Spark and read with Trino without data duplication.

How does Iceberg handle schema evolution?

Iceberg tracks schema changes in metadata, not in data files. Adding, dropping, renaming, or reordering columns does not require rewriting existing Parquet files. Readers automatically map old data files to the current schema using column IDs.
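A small sketch of ID-based column resolution, assuming made-up field IDs and a later schema in which `payload` was renamed to `body`, `event_type` was dropped, and `region` was added. The `project` helper is hypothetical and only illustrates the idea; real readers do this inside the file-format layer.

```python
from typing import Any

# Schema as a mapping of field ID -> column name. IDs are assigned once and never reused.
file_schema = {1: "event_id", 2: "event_type", 3: "ts", 4: "payload"}
# Later table schema: payload renamed to body, event_type dropped, region added as ID 5.
current_schema = {1: "event_id", 3: "ts", 4: "body", 5: "region"}


def project(row_by_id: dict[int, Any], schema: dict[int, str]) -> dict[str, Any]:
    """Resolve a stored row against `schema` by field ID, not by name or position."""
    return {name: row_by_id.get(field_id) for field_id, name in schema.items()}


# A row as written under the old schema, keyed by field ID.
old_row = {1: 1, 2: "click", 3: "2026-04-15 10:00:00", 4: '{"page": "/home"}'}

as_written = project(old_row, file_schema)
as_read_today = project(old_row, current_schema)
print(as_read_today)
```

Because lookup goes through the ID, the renamed column still finds its data, the dropped column is ignored, and the added column reads as null for files written before it existed. None of this touches the old data file.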

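What is time travel in Iceberg?

Every commit to an Iceberg table creates an immutable snapshot. In Spark SQL 3.3 and later, you can query a retained snapshot by timestamp with FOR SYSTEM_TIME AS OF (or TIMESTAMP AS OF) and by snapshot ID with FOR SYSTEM_VERSION AS OF (or VERSION AS OF). Snapshots that have been expired can no longer be queried. Time travel is useful for debugging, auditing, and reproducible analytics.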
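Resolving a timestamp to a snapshot amounts to picking the latest snapshot committed at or before that time. The sketch below shows that lookup over a made-up snapshot log; `snapshot_as_of` and the IDs are hypothetical and stand in for what the engine does against table metadata.

```python
from bisect import bisect_right
from datetime import datetime

# (commit time, snapshot id) pairs, oldest first. Values are invented for illustration.
snapshot_log = [
    (datetime(2026, 4, 13, 9, 0), 101),
    (datetime(2026, 4, 14, 18, 30), 102),
    (datetime(2026, 4, 15, 10, 5), 103),
]


def snapshot_as_of(log: list[tuple[datetime, int]], when: datetime) -> int:
    """Return the id of the latest snapshot committed at or before `when`."""
    times = [committed_at for committed_at, _ in log]
    idx = bisect_right(times, when)
    if idx == 0:
        # Mirrors the failure you get when time travelling before the first snapshot.
        raise ValueError(f"no snapshot at or before {when}")
    return log[idx - 1][1]


midnight_14th = snapshot_as_of(snapshot_log, datetime(2026, 4, 14, 0, 0))
latest = snapshot_as_of(snapshot_log, datetime(2026, 4, 16, 0, 0))
```

A query as of midnight on April 14 resolves to the snapshot committed on April 13, since the April 14 commit happened later that day. A timestamp earlier than the first commit has nothing to resolve to and raises an error.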

How does Iceberg compare to Delta Lake?

Both provide ACID transactions on data lakes. Iceberg is engine-agnostic and uses a catalog-based metadata layer. Delta Lake originated at Databricks and is most tightly integrated with Spark. Iceberg offers partition evolution without data rewrites, whereas changing the partition columns of a Delta Lake table generally means rewriting the table.

Does Iceberg work with object storage like S3?

Yes. Iceberg is designed for cloud object storage. It works with S3, Google Cloud Storage, Azure Blob Storage, and HDFS. Commits are made atomic by the catalog swapping the table's metadata pointer, so Iceberg does not depend on atomic renames or consistent directory listings, which object stores do not reliably provide.


