Apache Iceberg — Open Table Format for Huge Analytical Datasets
High-performance, engine-agnostic table format that brings ACID transactions, schema evolution, and time travel to Parquet data lakes.
What it is
Apache Iceberg is a high-performance, engine-agnostic open table format designed for huge analytical datasets. It sits between your compute engine (Spark, Trino, Flink, Dremio) and your object storage (S3, GCS, HDFS), providing ACID transactions, schema evolution, partition evolution, and time travel on top of Parquet, ORC, or Avro files.
Iceberg is for data engineers and platform teams who manage petabyte-scale data lakes and need reliability guarantees that raw Parquet directories cannot provide.
The project is actively maintained with regular releases and a growing user community. Documentation covers common use cases, and the open-source nature means you can inspect the source code, contribute fixes, and adapt the tool to your specific requirements.
How it saves time or tokens
Without Iceberg, schema changes on raw Parquet tables typically force full rewrites, partition changes require migrating data into a new layout, and concurrent writes can silently corrupt table state. Iceberg eliminates all three problems through metadata-layer tracking: each commit produces an immutable snapshot, and readers always see a consistent view. The result is fewer pipeline failures, no manual repair jobs, and zero downtime during schema changes.
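That commit history is directly queryable. As a sketch, Iceberg exposes metadata tables alongside each table; in Spark SQL you can list a table's snapshots like this (the table name `catalog.db.events` is illustrative):

```sql
-- Inspect the snapshot history of an Iceberg table (Spark SQL)
SELECT snapshot_id, committed_at, operation
FROM catalog.db.events.snapshots
ORDER BY committed_at DESC;
```

Each row corresponds to one commit, which is what makes time travel and rollback possible.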
How to use
- Add the Iceberg dependency to your Spark session or Trino catalog configuration.
- Create an Iceberg table using standard SQL DDL with the Iceberg format specified.
- Write data using INSERT, MERGE INTO, or DataFrame APIs. Iceberg handles snapshots and metadata automatically.
Example
-- Create an Iceberg table in Spark SQL
CREATE TABLE catalog.db.events (
  event_id BIGINT,
  event_type STRING,
  ts TIMESTAMP,
  payload STRING
) USING iceberg
PARTITIONED BY (days(ts));
-- Insert data
INSERT INTO catalog.db.events VALUES
(1, 'click', TIMESTAMP '2026-04-15 10:00:00', '{"page": "/home"}');
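-- Upsert with MERGE INTO (sketch; assumes a hypothetical staging table
-- catalog.db.events_staging with the same schema as catalog.db.events)
MERGE INTO catalog.db.events t
USING catalog.db.events_staging s
ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;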
-- Time travel query
SELECT * FROM catalog.db.events
FOR SYSTEM_TIME AS OF TIMESTAMP '2026-04-14 00:00:00';
-- Schema evolution (no rewrite)
ALTER TABLE catalog.db.events ADD COLUMN region STRING;
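Partition evolution works the same way: the partition spec lives in metadata, so it can change without rewriting existing files. A sketch in Spark SQL (requires the Iceberg SQL extensions; field names follow the table above):

```sql
-- Partition evolution (no rewrite): switch from daily to hourly partitioning
ALTER TABLE catalog.db.events
REPLACE PARTITION FIELD days(ts) WITH hours(ts);
```

Old files keep their original daily layout; new writes use hourly partitions, and queries plan across both transparently.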
Related on TokRepo
- AI Tools for Database -- Explore database tools and utilities curated on TokRepo
- AI Tools for DevOps -- Infrastructure and pipeline tooling for data platform teams
Common pitfalls
- Forgetting to configure a metadata catalog (Hive Metastore, AWS Glue, or REST catalog) leads to orphaned metadata files and broken reads.
- Running compaction too infrequently results in thousands of small files, degrading query performance significantly.
- Mixing Iceberg and non-Iceberg writes to the same directory corrupts the table state because non-Iceberg writers bypass snapshot tracking.
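The small-files pitfall is usually addressed with Iceberg's built-in maintenance procedures. A sketch of periodic maintenance in Spark SQL (procedures are available when the Iceberg SQL extensions are enabled; the timestamp cutoff is illustrative):

```sql
-- Compact small data files into larger ones
CALL catalog.system.rewrite_data_files(table => 'db.events');

-- Expire old snapshots to reclaim storage (keeps recent history for time travel)
CALL catalog.system.expire_snapshots(
  table => 'db.events',
  older_than => TIMESTAMP '2026-04-01 00:00:00');
```

Scheduling these as a regular job keeps file counts bounded without manual intervention.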
Before adopting this tool, evaluate whether it fits your team's existing workflow. Read the official documentation thoroughly, and start with a small proof-of-concept rather than a full migration. Community forums, GitHub issues, and Stack Overflow are valuable resources when you encounter edge cases not covered in the documentation.
Frequently Asked Questions
Which engines support Apache Iceberg?
Apache Spark, Trino, Apache Flink, Dremio, Snowflake, BigQuery, and Amazon Athena all support Iceberg tables natively. The format is engine-agnostic by design, so you can write with Spark and read with Trino without data duplication.
Does schema evolution require rewriting data files?
Iceberg tracks schema changes in metadata, not in data files. Adding, dropping, renaming, or reordering columns does not require rewriting existing Parquet files. Readers automatically map old data files to the current schema using column IDs.
How does time travel work?
Every write to an Iceberg table creates an immutable snapshot. You can query any historical snapshot by timestamp or snapshot ID using FOR SYSTEM_TIME AS OF syntax. This is useful for debugging, auditing, and reproducible analytics.
How does Iceberg compare to Delta Lake?
Both provide ACID transactions on data lakes. Iceberg is engine-agnostic and uses a catalog-based metadata layer. Delta Lake is tightly integrated with Spark and Databricks. Iceberg offers partition evolution without data rewrites; Delta Lake requires explicit repartitioning.
Does Iceberg work with cloud object storage?
Yes. Iceberg is designed for cloud object storage. It works with S3, Google Cloud Storage, Azure Blob Storage, and HDFS. The metadata layer handles consistency guarantees that object stores lack natively.
Citations (3)
- Apache Iceberg GitHub -- Apache Iceberg is an open table format for huge analytic datasets
- Apache Iceberg Documentation -- Iceberg supports Spark, Flink, Trino, and other engines
- Iceberg Spec - Partition Evolution -- Partition evolution without data rewrites