# Apache Iceberg — Open Table Format for Huge Analytical Datasets

> High-performance, engine-agnostic table format that brings ACID transactions, schema evolution, and time travel to Parquet data lakes.

## Quick Use

```bash
# Using Iceberg with PySpark and a local Hadoop catalog (filesystem-based, no server needed)
pip install pyiceberg pyspark

# Create a SparkSession that loads the Iceberg runtime
python - <<'PY'
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/warehouse")
    .getOrCreate())

spark.sql("CREATE TABLE local.db.events (id bigint, ts timestamp, kind string) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1, current_timestamp(), 'login')")
spark.sql("SELECT * FROM local.db.events.snapshots").show()
PY
```

## Introduction

Apache Iceberg is a high-performance table format designed for huge analytical datasets. It delivers ACID transactions, hidden partitioning, schema evolution, and time travel on top of Parquet/ORC/Avro files — without locking you into any specific query engine. Spark, Trino, Flink, Presto, Snowflake, BigQuery, DuckDB, Dremio, Athena, and others all read Iceberg natively.

## What Iceberg Does

- Stores massive tables as immutable data files tracked by metadata snapshots.
- Enables atomic commits, concurrent writes, and isolation without locking the whole table.
- Supports schema evolution (add/drop/rename columns) and partition evolution.
- Provides time travel and rollback via snapshot-level reads.
- Powers vendor-neutral lakehouses: the same table works across many engines.

## Architecture Overview

An Iceberg table is a three-layer tree: a current metadata JSON pointer, a manifest list per snapshot, and manifests that enumerate data files with per-column statistics. Engines prune partitions and files via these stats before reading any data. A catalog (REST, Nessie, Glue, Hive, Unity, JDBC, or the newer Polaris) brokers atomic swaps of the metadata pointer to implement transactions.

## Self-Hosting & Configuration

- Pick a catalog: REST (`iceberg-rest-fixture`), Nessie (Git-like), Hive Metastore, Glue, Unity, or JDBC.
- Store table data in S3/GCS/Azure/MinIO/HDFS with lifecycle policies that align with Iceberg snapshot retention.
- Configure table properties such as `write.format.default`, `write.target-file-size-bytes`, and `write.distribution-mode`.
- Use the `rewrite_data_files` and `expire_snapshots` maintenance procedures to compact files and reclaim storage.
- Integrate with Spark, Flink, Trino, Dremio, StarRocks, ClickHouse, and DuckDB via first-class connectors.

## Key Features

- Engine-agnostic: the same table is consumed by batch, streaming, and interactive engines.
- Hidden partitioning — users query by `event_time` while Iceberg maps predicates to partition values.
- Metadata-only schema evolution: add, drop, and rename columns without rewriting files.
- Row-level updates via copy-on-write or merge-on-read delete files.
- Branches and tags (Git-style semantics) for experimentation, backfills, and GDPR deletes.

## Comparison with Similar Tools

- **Delta Lake** — ACID lakehouse format; historically tied to the Databricks ecosystem, now with UniForm interop.
- **Apache Hudi** — Focused on upserts and incremental pulls; Iceberg targets broader analytics.
- **Hive tables** — Directory-based; Iceberg replaces partition discovery with metadata-tracked files.
- **Parquet alone** — Great columnar storage, no table semantics; Iceberg layers ACID on top.
- **Snowflake / BigQuery native** — Closed formats; Iceberg keeps your data portable across engines.

## FAQ

**Q:** Which engines can read Iceberg?
**A:** Spark, Trino, Flink, Presto, Athena, Snowflake, BigQuery, DuckDB, Dremio, StarRocks, and many more.

**Q:** What catalog should I use?
**A:** REST is emerging as the de facto standard; Glue and Unity are common in the cloud; Nessie adds Git-like branching.

**Q:** How do deletes work?
**A:** Copy-on-write rewrites the affected data files, while merge-on-read writes delete files that readers merge in at query time.

**Q:** Can Iceberg handle streaming?
**A:** Yes — Flink and Spark Structured Streaming can write to Iceberg tables with exactly-once commits, and Flink additionally supports upsert writes.

## Sources

- https://github.com/apache/iceberg
- https://iceberg.apache.org/docs
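## Appendix: File Pruning in a Nutshell

The architecture overview describes how manifests record per-column min/max statistics that let engines skip data files before reading them. A minimal sketch of that idea in plain Python — a toy model for intuition only, where `DataFile`, `Manifest`, and `prune` are illustrative names, not the real Iceberg API:

```python
# Toy model of Iceberg-style file pruning: each manifest lists data files
# together with per-column min/max statistics, so a query engine can discard
# files whose value range cannot match the predicate -- before any I/O.
from dataclasses import dataclass

@dataclass
class DataFile:
    path: str
    min_id: int  # lower bound of the `id` column in this file
    max_id: int  # upper bound of the `id` column in this file

@dataclass
class Manifest:
    files: list

def prune(manifests, lo, hi):
    """Keep only files whose [min_id, max_id] range can overlap
    the predicate `id BETWEEN lo AND hi`."""
    return [
        f
        for m in manifests
        for f in m.files
        if f.max_id >= lo and f.min_id <= hi
    ]

manifests = [
    Manifest([DataFile("a.parquet", 1, 100), DataFile("b.parquet", 101, 200)]),
    Manifest([DataFile("c.parquet", 201, 300)]),
]

# Predicate: id BETWEEN 150 AND 250 -> only b.parquet and c.parquet survive.
print([f.path for f in prune(manifests, 150, 250)])
```

In the real format the same check runs over the manifest-list and manifest levels of the metadata tree, and over partition values as well as column bounds, which is what makes scans on huge tables cheap.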