Apache Iceberg — Open Table Format for Huge Analytical Datasets
High-performance, engine-agnostic table format that brings ACID transactions, schema evolution, and time travel to Parquet data lakes.
Install with review first
This asset requires review. The copied prompt asks for a dry run, shows the writes, and continues only after confirmation.
npx -y tokrepo@latest install fba4cec0-3931-11f1-9bc6-00163e2b0d79 --target codex
Do a dry run first, confirm the writes, and then run this command.
What it is
Apache Iceberg is a high-performance, engine-agnostic open table format designed for huge analytical datasets. It sits between your compute engine (Spark, Trino, Flink, Dremio) and your object storage (S3, GCS, HDFS), providing ACID transactions, schema evolution, partition evolution, and time travel on top of Parquet or ORC files.
Iceberg is for data engineers and platform teams who manage petabyte-scale data lakes and need reliability guarantees that raw Parquet directories cannot provide.
The project is actively maintained, with regular releases and a growing user community. Because Iceberg is open source, you can inspect the code, contribute fixes, and adapt it to your own requirements.
How it saves time or tokens
Without a table format, schema changes often force table rewrites, partition changes require data migration, and concurrent writes can leave a table in an inconsistent state. Iceberg addresses all three through metadata-layer tracking: each commit produces an immutable snapshot, and readers always see a consistent view. This means fewer pipeline failures, fewer manual repair jobs, and schema changes without downtime.
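As a rough illustration of the snapshot tracking described above, Iceberg exposes table metadata as queryable metadata tables in Spark. The query below is a sketch that assumes the `catalog.db.events` table used in this page's example and a Spark session with an Iceberg catalog named `catalog`.

```sql
-- List the snapshots recorded for the table, oldest first.
-- Each committed write (INSERT, MERGE INTO, compaction, and so on) adds one row.
SELECT committed_at, snapshot_id, parent_id, operation
FROM catalog.db.events.snapshots
ORDER BY committed_at;
```

The `snapshot_id` values returned here are the identifiers that time travel queries can target.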
How to use
- Add the Iceberg dependency to your Spark session or Trino catalog configuration.
- Create an Iceberg table using standard SQL DDL with the Iceberg format specified.
- Write data using INSERT, MERGE INTO, or DataFrame APIs. Iceberg handles snapshots and metadata automatically.
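The write step could look like the following upsert. This is a minimal sketch: `catalog.db.events_staging` is a hypothetical staging table assumed to have the same columns as `catalog.db.events`, and the Iceberg SQL extensions are assumed to be enabled in the Spark session.

```sql
-- Upsert staged rows into the Iceberg table in a single atomic commit.
-- catalog.db.events_staging is a hypothetical table used only for illustration.
MERGE INTO catalog.db.events AS t
USING catalog.db.events_staging AS s
ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

Readers querying the table while the merge runs keep seeing the previous snapshot until the commit succeeds.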
Example
-- Create an Iceberg table in Spark SQL
CREATE TABLE catalog.db.events (
    event_id BIGINT,
    event_type STRING,
    ts TIMESTAMP,
    payload STRING
) USING iceberg
PARTITIONED BY (days(ts));
-- Insert data
INSERT INTO catalog.db.events VALUES
(1, 'click', TIMESTAMP '2026-04-15 10:00:00', '{"page": "/home"}');
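-- Time travel query: reads the table as of a past commit time (not the ts column).
-- The table must have a snapshot committed at or before this timestamp.
SELECT * FROM catalog.db.events
FOR SYSTEM_TIME AS OF TIMESTAMP '2026-04-14 00:00:00';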
-- Schema evolution (no rewrite)
ALTER TABLE catalog.db.events ADD COLUMN region STRING;
Common pitfalls
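- Forgetting to configure a catalog (Hive Metastore, AWS Glue, or a REST catalog). The catalog tracks each table's current metadata file, so without one engines cannot agree on the table state, which leads to orphaned metadata files and broken reads.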
- Running compaction too infrequently results in thousands of small files, degrading query performance significantly.
- Mixing Iceberg and non-Iceberg writes to the same directory corrupts the table state because non-Iceberg writers bypass snapshot tracking.
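For the compaction pitfall, Iceberg ships Spark stored procedures for table maintenance. The calls below are a sketch that assumes the Iceberg SQL extensions are enabled and that the catalog is named `catalog`; the cutoff and retention values are illustrative assumptions, not recommendations.

```sql
-- Compact small data files into larger ones.
CALL catalog.system.rewrite_data_files(table => 'db.events');

-- Expire old snapshots so files that are no longer referenced can be removed.
-- The older_than and retain_last values here are illustrative only.
CALL catalog.system.expire_snapshots(
  table => 'db.events',
  older_than => TIMESTAMP '2026-04-01 00:00:00',
  retain_last => 10
);
```

Expired snapshots can no longer be used for time travel, so the retention window is a trade-off between storage cost and how far back you need to query.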
Before adopting Iceberg, evaluate whether it fits your team's existing workflow. Read the official documentation, and start with a small proof of concept rather than a full migration. Community forums, GitHub issues, and Stack Overflow are useful when you hit edge cases the documentation does not cover.
Frequently asked questions
Apache Spark, Trino, Apache Flink, Dremio, Snowflake, BigQuery, and Amazon Athena can all work with Iceberg tables, although the level of support (reads, writes, catalog types) varies by engine. The format is engine-agnostic by design, so you can write with Spark and read with Trino through a shared catalog without duplicating data.
Iceberg tracks schema changes in metadata, not in data files. Adding, dropping, renaming, or reordering columns does not require rewriting existing Parquet files. Readers automatically map old data files to the current schema using column IDs.
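The schema changes mentioned above might look like this in Spark SQL. This is a sketch against the `catalog.db.events` table from the example; the new column name `body` is a hypothetical choice, and `region` is the column added in the example.

```sql
-- Metadata-only schema changes; existing data files are not rewritten.
ALTER TABLE catalog.db.events RENAME COLUMN payload TO body;
-- Reorder: move event_type so that it follows ts.
ALTER TABLE catalog.db.events ALTER COLUMN event_type AFTER ts;
ALTER TABLE catalog.db.events DROP COLUMN region;
```

Because columns are tracked by ID rather than by name or position, old data files written before the rename still resolve to the renamed column.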
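Every write to an Iceberg table creates an immutable snapshot. You can query any retained snapshot by timestamp or by snapshot ID. In Spark SQL, timestamps use the FOR SYSTEM_TIME AS OF syntax and snapshot IDs use FOR SYSTEM_VERSION AS OF; other engines use slightly different keywords. This is useful for debugging, auditing, and reproducible analytics.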
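A query by snapshot ID could look like the following. This is a sketch that assumes Spark 3.3 or later; the ID shown is a placeholder, not a real snapshot.

```sql
-- Read the table as of a specific snapshot.
-- 1234567890123456789 is a placeholder; use a snapshot_id taken from the
-- catalog.db.events.snapshots metadata table.
SELECT * FROM catalog.db.events
FOR SYSTEM_VERSION AS OF 1234567890123456789;
```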
Iceberg and Delta Lake both provide ACID transactions on data lakes. Iceberg is engine-agnostic and uses a catalog-based metadata layer. Delta Lake originated at Databricks and is most tightly integrated with Spark. Iceberg offers partition evolution without data rewrites, whereas changing the partitioning of a Delta Lake table has generally meant rewriting the data.
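Partition evolution in Iceberg might be expressed as below. This is a sketch that assumes the Iceberg SQL extensions are enabled in Spark; the bucket count of 16 is an arbitrary illustrative value.

```sql
-- Change the partition layout for newly written data only.
-- Existing files keep the partition spec they were written with.
ALTER TABLE catalog.db.events ADD PARTITION FIELD bucket(16, event_id);
```

Queries do not need to change after this statement, because Iceberg plans each file against the partition spec it was written with.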
Iceberg is designed for cloud object storage. It works with S3, Google Cloud Storage, and Azure Blob Storage, as well as HDFS. Commits are made atomically through the catalog, so Iceberg does not depend on consistency guarantees, such as atomic renames or consistent listings, that object stores may lack.