# Apache Iceberg — Open Table Format for Huge Analytical Datasets

> High-performance, engine-agnostic table format that brings ACID transactions, schema evolution, and time travel to Parquet data lakes.

## Quick Use

```bash
# Using Iceberg with PySpark and a local Hadoop catalog (filesystem-based, no server needed)
pip install pyiceberg pyspark

# Create a SparkSession that loads the Iceberg runtime
python - <<'PY'
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/warehouse")
    .getOrCreate())

spark.sql("CREATE TABLE local.db.events (id bigint, ts timestamp, kind string) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1, current_timestamp(), 'login')")
spark.sql("SELECT * FROM local.db.events.snapshots").show()
PY
```

## Introduction

Apache Iceberg is a high-performance table format designed for huge analytical datasets. It delivers ACID transactions, hidden partitioning, schema evolution, and time travel on top of Parquet/ORC/Avro files — without locking you into any specific query engine. Spark, Trino, Flink, Presto, Snowflake, BigQuery, DuckDB, Dremio, Athena, and others all read Iceberg natively.

## What Iceberg Does

- Stores massive tables as immutable data files tracked by metadata snapshots.
- Enables atomic commits, concurrent writes, and isolation without locking the whole table.
- Supports schema evolution (add/drop/rename columns) and partition evolution.
- Provides time travel and rollback via snapshot-level reads.
- Powers vendor-neutral lakehouses: the same table works across many engines.

## Architecture Overview

An Iceberg table is a three-layer tree: a current metadata JSON pointer, a manifest list per snapshot, and manifests that enumerate data files with per-column statistics. Engines prune partitions and files via these stats before reading any data. A catalog (REST, Nessie, Glue, Hive, Unity, JDBC, or the newer Polaris) brokers atomic swaps of the metadata pointer to implement transactions.

## Self-Hosting & Configuration

- Pick a catalog: REST (`iceberg-rest-fixture`), Nessie (Git-like), Hive Metastore, Glue, Unity, or JDBC.
- Store table data in S3/GCS/Azure/MinIO/HDFS with lifecycle policies that align with Iceberg snapshot retention.
- Configure table properties such as `write.format.default`, `write.target-file-size-bytes`, and `write.distribution-mode`.
- Use the `rewrite_data_files` and `expire_snapshots` maintenance procedures to compact files and reclaim storage.
- Integrate with Spark, Flink, Trino, Dremio, StarRocks, ClickHouse, and DuckDB via first-class connectors.

## Key Features

- Engine-agnostic: the same table is consumed by batch, streaming, and interactive engines.
- Hidden partitioning — users query by `event_time` while Iceberg maps predicates to partition values.
- Metadata-only schema evolution: add, drop, and rename columns without rewriting files.
- Row-level updates via copy-on-write or merge-on-read delete files.
- Branches and tags (Git-style semantics) for experimentation, backfills, and GDPR deletes.

## Comparison with Similar Tools

- **Delta Lake** — ACID lakehouse format; historically tied to the Databricks ecosystem, now with UniForm interop.
- **Apache Hudi** — Focused on upserts and incremental pulls; Iceberg targets broader analytics.
- **Hive tables** — Directory-based; Iceberg replaces partition discovery with metadata-tracked files.
- **Parquet alone** — Great columnar storage, no table semantics; Iceberg layers ACID on top.
- **Snowflake / BigQuery native** — Closed formats; Iceberg keeps your data portable across engines.

## FAQ

**Q:** Which engines can read Iceberg?
**A:** Spark, Trino, Flink, Presto, Athena, Snowflake, BigQuery, DuckDB, Dremio, StarRocks, and many more.

**Q:** What catalog should I use?
**A:** REST is emerging as the de facto standard; Glue and Unity are common in the cloud; Nessie adds Git-like branching.

**Q:** How do deletes work?
**A:** Copy-on-write rewrites the affected data files, while merge-on-read writes delete files that readers merge in at query time.

**Q:** Can Iceberg handle streaming?
**A:** Yes — Flink and Spark Structured Streaming can write to Iceberg tables with exactly-once commits, and Flink additionally supports upsert writes.

## Sources

- https://github.com/apache/iceberg
- https://iceberg.apache.org/docs
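## Appendix: File Pruning in a Nutshell

The architecture overview describes how manifests record per-column min/max statistics that let engines skip data files before reading them. A minimal sketch of that idea in plain Python — a toy model for intuition only, where `DataFile`, `Manifest`, and `prune` are illustrative names, not the real Iceberg API:

```python
# Toy model of Iceberg-style file pruning: each manifest lists data files
# together with per-column min/max statistics, so a query engine can discard
# files whose value range cannot match the predicate -- before any I/O.
from dataclasses import dataclass

@dataclass
class DataFile:
    path: str
    min_id: int  # lower bound of the `id` column in this file
    max_id: int  # upper bound of the `id` column in this file

@dataclass
class Manifest:
    files: list

def prune(manifests, lo, hi):
    """Keep only files whose [min_id, max_id] range can overlap
    the predicate `id BETWEEN lo AND hi`."""
    return [
        f
        for m in manifests
        for f in m.files
        if f.max_id >= lo and f.min_id <= hi
    ]

manifests = [
    Manifest([DataFile("a.parquet", 1, 100), DataFile("b.parquet", 101, 200)]),
    Manifest([DataFile("c.parquet", 201, 300)]),
]

# Predicate: id BETWEEN 150 AND 250 -> only b.parquet and c.parquet survive.
print([f.path for f in prune(manifests, 150, 250)])
```

In the real format the same check runs over the manifest-list and manifest levels of the metadata tree, and over partition values as well as column bounds, which is what makes scans on huge tables cheap.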