Configs · Apr 16, 2026 · 3 min read

Apache Iceberg — Open Table Format for Huge Analytical Datasets

High-performance, engine-agnostic table format that brings ACID transactions, schema evolution, and time travel to Parquet data lakes.

TL;DR
Iceberg is an open table format that adds ACID transactions and schema evolution to Parquet data lakes.
§01

What it is

Apache Iceberg is a high-performance, engine-agnostic open table format designed for huge analytical datasets. It sits between your compute engine (Spark, Trino, Flink, Dremio) and your object storage (S3, GCS, HDFS), providing ACID transactions, schema evolution, partition evolution, and time travel on top of Parquet or ORC files.

Iceberg is for data engineers and platform teams who manage petabyte-scale data lakes and need reliability guarantees that raw Parquet directories cannot provide.

Iceberg is a top-level Apache project with regular releases and a broad contributor community. The documentation covers common use cases, and because both the format specification and the implementations are open source, you can inspect the code, contribute fixes, and adapt it to your requirements.

§02

How it saves time or tokens

Without a table format, schema changes often require full table rewrites, partition-layout changes require data migration, and concurrent writes can corrupt table state. Iceberg eliminates all three problems through metadata-layer tracking: each commit produces an immutable snapshot that is swapped in atomically, so readers always see a consistent view. The result is fewer pipeline failures, no manual repair jobs, and zero downtime during schema changes.
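
You can inspect this snapshot tracking directly: Iceberg exposes metadata tables alongside each table. A minimal sketch in Spark SQL, using the events table defined in the Example section below:

-- List the snapshots Iceberg has recorded for the table
SELECT snapshot_id, committed_at, operation
FROM catalog.db.events.snapshots;

-- Show when each snapshot became the current read view
SELECT made_current_at, snapshot_id, is_current_ancestor
FROM catalog.db.events.history;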

§03

How to use

  1. Add the Iceberg dependency to your Spark session or Trino catalog configuration.
  2. Create an Iceberg table using standard SQL DDL with the Iceberg format specified.
  3. Write data using INSERT, MERGE INTO, or DataFrame APIs. Iceberg handles snapshots and metadata automatically (see the upsert sketch below).
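
A minimal upsert sketch for step 3, assuming a hypothetical staging table catalog.db.events_updates with the same schema as the events table from the Example below:

-- Upsert staged rows into the Iceberg table
-- (catalog.db.events_updates is a hypothetical staging table)
MERGE INTO catalog.db.events t
USING catalog.db.events_updates u
ON t.event_id = u.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;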
§04

Example

-- Create an Iceberg table in Spark SQL
CREATE TABLE catalog.db.events (
  event_id BIGINT,
  event_type STRING,
  ts TIMESTAMP,
  payload STRING
) USING iceberg
PARTITIONED BY (days(ts));

-- Insert data
INSERT INTO catalog.db.events VALUES
  (1, 'click', TIMESTAMP '2026-04-15 10:00:00', '{"page": "/home"}');

-- Time travel query
SELECT * FROM catalog.db.events
  FOR SYSTEM_TIME AS OF TIMESTAMP '2026-04-14 00:00:00';

-- Schema evolution (no rewrite)
ALTER TABLE catalog.db.events ADD COLUMN region STRING;
§05

Common pitfalls

  • Forgetting to configure a metadata catalog (Hive Metastore, AWS Glue, or REST catalog) leads to orphaned metadata files and broken reads.
  • Running compaction too infrequently results in thousands of small files, degrading query performance significantly (see the maintenance sketch after this list).
  • Mixing Iceberg and non-Iceberg writes to the same directory corrupts the table state because non-Iceberg writers bypass snapshot tracking.
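
If you run Spark, the small-files pitfall above is usually handled with Iceberg's built-in maintenance procedures. A sketch against the example table; scheduling and file-size thresholds are up to you:

-- Compact small data files into larger ones
CALL catalog.system.rewrite_data_files(table => 'db.events');

-- Expire snapshots older than a cutoff to reclaim metadata and storage
CALL catalog.system.expire_snapshots(
  table => 'db.events',
  older_than => TIMESTAMP '2026-04-01 00:00:00');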

Before adopting Iceberg, evaluate whether it fits your team's existing stack. Read the official documentation thoroughly, and start with a small proof of concept rather than a full migration. Community forums, GitHub issues, and Stack Overflow are valuable resources for edge cases the documentation does not cover.

Frequently Asked Questions

What engines work with Apache Iceberg?

Apache Spark, Trino, Apache Flink, Dremio, Snowflake, BigQuery, and Amazon Athena all support Iceberg tables natively. The format is engine-agnostic by design, so you can write with Spark and read with Trino without data duplication.

How does Iceberg handle schema evolution?

Iceberg tracks schema changes in metadata, not in data files. Adding, dropping, renaming, or reordering columns does not require rewriting existing Parquet files. Readers automatically map old data files to the current schema using column IDs.
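
For example, renames and drops against the table from the Example section are metadata-only as well; a sketch, assuming a Spark SQL session:

-- Neither statement rewrites any Parquet files
ALTER TABLE catalog.db.events RENAME COLUMN payload TO body;
ALTER TABLE catalog.db.events DROP COLUMN region;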

What is time travel in Iceberg?

Every write to an Iceberg table creates an immutable snapshot. You can query any historical snapshot by timestamp or snapshot ID; in Spark SQL this uses the FOR SYSTEM_TIME AS OF and FOR SYSTEM_VERSION AS OF syntax, and other engines offer equivalents. This is useful for debugging, auditing, and reproducible analytics.
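
Snapshot IDs come from the snapshots metadata table; the ID below is illustrative. A sketch in Spark SQL:

-- Pin a query to one exact snapshot
SELECT * FROM catalog.db.events
  FOR SYSTEM_VERSION AS OF 8744736658442914487;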

How does Iceberg compare to Delta Lake?

Both provide ACID transactions on data lakes. Iceberg is engine-agnostic and uses a catalog-based metadata layer. Delta Lake is tightly integrated with Spark and Databricks. Iceberg offers partition evolution without data rewrites; Delta Lake requires explicit repartitioning.
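
Partition evolution is itself a metadata change. A sketch against the example table, assuming Spark SQL with Iceberg's SQL extensions enabled; existing files keep their old layout and remain readable:

-- Switch from daily to hourly partitioning without rewriting data
ALTER TABLE catalog.db.events
  REPLACE PARTITION FIELD days(ts) WITH hours(ts);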

Does Iceberg work with object storage like S3?

Yes. Iceberg is designed for cloud object storage. It works with S3, Google Cloud Storage, Azure Blob Storage, and HDFS. The metadata layer handles consistency guarantees that object stores lack natively.
