Apache Hudi — Incremental Data Processing for Data Lakehouses
Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data lakehouse platform that provides record-level insert, update, and delete capabilities on data lakes. It powers incremental pipelines, CDC ingestion, and near-real-time analytics on S3, GCS, and HDFS.
Review-first install path
This asset needs a review step. The copied prompt tells the agent to dry-run, show the writes, then proceed only after confirmation.
npx -y tokrepo@latest install 2db0b23f-39ec-11f1-9bc6-00163e2b0d79 --target codex
Dry-run first, confirm the writes, then run this command.
What it is
Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data lakehouse platform. It provides record-level insert, update, and delete capabilities on top of data lakes stored in S3, GCS, HDFS, or any Hadoop-compatible filesystem. Hudi turns your object storage into a mutable, transactional data store with ACID guarantees.
Hudi targets data engineers building incremental ETL pipelines, teams migrating from traditional data warehouses to lakehouse architectures, and organizations that need change data capture (CDC) ingestion from operational databases into their analytics layer.
How it saves time or tokens
Without Hudi, updating or deleting records in a data lake typically means rewriting entire partitions. Hudi's record-level operations touch only the files that contain changed records, which cuts compute cost and pipeline runtime. Incremental queries let downstream consumers read only the data that is new or changed since their last checkpoint, so they avoid full-table scans.
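To make the record-level idea concrete, here is a toy sketch in plain Python. It is not Hudi's implementation: the `upsert` function, the dict-based "table", and the sample rows are all hypothetical. It assumes the ordering ("precombine") field decides which version of a record wins; in Hudi the exact merge behavior depends on the configured payload or merger class.

```python
# Toy model of a record-level upsert; this is NOT Hudi's implementation.
# `upsert`, the dict-based "table", and the sample rows are hypothetical.

def upsert(table, batch, key="event_id", precombine="updated_at"):
    """Merge `batch` into `table` (a dict keyed by record key).

    When two versions of a record share a key, keep the one with the
    larger precombine value; ties go to the incoming record.
    Returns the set of keys that were inserted or updated.
    """
    touched = set()
    for row in batch:
        k = row[key]
        current = table.get(k)
        if current is None or row[precombine] >= current[precombine]:
            table[k] = row
            touched.add(k)
    return touched


table = {}
upsert(table, [
    {"event_id": "e1", "updated_at": 1, "status": "new"},
    {"event_id": "e2", "updated_at": 1, "status": "new"},
])
changed = upsert(table, [
    {"event_id": "e1", "updated_at": 2, "status": "done"},   # newer: applied
    {"event_id": "e2", "updated_at": 0, "status": "stale"},  # older: ignored
])
print(sorted(changed))        # ['e1']
print(table["e2"]["status"])  # new
```

Only the keys in `changed` would need to flow downstream, which is the saving the paragraph above describes.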
For AI and ML workflows, Hudi's time travel feature lets you query historical snapshots of your data for reproducible training datasets without maintaining separate copies.
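A minimal sketch of a time travel read from PySpark follows. It assumes the `as.of.instant` read option shown in the Hudi Spark quick start; the helper names, the table path, and the instant value are placeholders introduced here for illustration.

```python
# Read options for a point-in-time ("time travel") query.
# "as.of.instant" is the option name used in the Hudi Spark quick start;
# the helper names, table path, and instant value are placeholders.

def time_travel_options(instant):
    """Return Hudi read options for querying the table as of `instant`."""
    return {"as.of.instant": instant}


def read_as_of(spark, base_path, instant):
    """Load the table as of `instant` (needs a Spark session with Hudi on the classpath)."""
    return (
        spark.read.format("hudi")
        .options(**time_travel_options(instant))
        .load(base_path)
    )


opts = time_travel_options("20260401000000")
# With a live session:
# training_df = read_as_of(spark, "s3://my-lake/user_events", "20260401000000")
```

Recording the instant alongside a trained model is one way to make the training dataset reproducible later.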
How to use
- Add Hudi to your Spark, Flink, or Hive environment. For Spark: include the hudi-spark-bundle JAR in your spark-submit command.
- Write data to a Hudi table using the Hudi datasource. Specify the record key, partition path, and precombine field.
- Query the table using Spark SQL, Trino, or any engine that supports Hudi's metadata. Use incremental queries to process only new records.
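For the first step, a PySpark session can pull in the bundle and apply the settings that the Hudi quick start typically lists. This is a sketch under assumptions: the bundle coordinate (Spark 3.5, Scala 2.12, Hudi 0.15.0) is a guess at a common combination and must match your cluster, and `build_session` is a hypothetical helper.

```python
# Spark settings commonly shown in the Hudi quick start. The bundle coordinate
# is an assumption: match the Spark, Scala, and Hudi versions to your cluster.
HUDI_BUNDLE = "org.apache.hudi:hudi-spark3.5-bundle_2.12:0.15.0"  # assumed versions

HUDI_SPARK_CONF = {
    "spark.jars.packages": HUDI_BUNDLE,
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
}


def build_session(app_name="hudi-example"):
    """Create a SparkSession with the Hudi settings applied (requires pyspark)."""
    # Imported lazily so the settings can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in HUDI_SPARK_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

The same keys can be passed as `--packages` and `--conf` arguments to spark-submit instead of being set in code.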
Example
# Assumes an existing SparkSession `spark` and a DataFrame `df` with the columns named below.

# Write to a Hudi table with upsert
df.write.format('hudi') \
    .option('hoodie.table.name', 'user_events') \
    .option('hoodie.datasource.write.recordkey.field', 'event_id') \
    .option('hoodie.datasource.write.partitionpath.field', 'event_date') \
    .option('hoodie.datasource.write.precombine.field', 'updated_at') \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .mode('append') \
    .save('s3://my-lake/user_events')

# Incremental read: only changes since last checkpoint
df_incremental = spark.read.format('hudi') \
    .option('hoodie.datasource.query.type', 'incremental') \
    .option('hoodie.datasource.read.begin.instanttime', '20260401000000') \
    .load('s3://my-lake/user_events')
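The begin instant in the example is a timestamp string in yyyyMMddHHmmss form. A small helper can derive it from a stored checkpoint. This is a sketch: `to_instant` and `incremental_options` are hypothetical names, and it assumes the datetime you pass is already in the timezone your table's timeline uses.

```python
from datetime import datetime


def to_instant(ts):
    """Format a datetime as a yyyyMMddHHmmss instant string, as in the example."""
    return ts.strftime("%Y%m%d%H%M%S")


def incremental_options(last_checkpoint):
    """Build the read options for an incremental query starting after `last_checkpoint`."""
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": to_instant(last_checkpoint),
    }


opts = incremental_options(datetime(2026, 4, 1))
print(opts["hoodie.datasource.read.begin.instanttime"])  # 20260401000000
```

The resulting dict can be passed to `spark.read.format('hudi').options(**opts)` in place of the two `.option(...)` calls.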
Related on TokRepo
- AI tools for database — Data infrastructure and database tools
- Automation tools — Pipeline orchestration and automation
Common pitfalls
- Hudi's write operations require a Spark or Flink runtime. Size your cluster for the write amplification of copy-on-write tables, which rewrite whole files on update, and for the compaction jobs that merge-on-read tables run later.
- Choosing between Copy-on-Write (CoW) and Merge-on-Read (MoR) table types matters. CoW is simpler and better for read-heavy workloads. MoR optimizes write performance but adds complexity to read queries.
- Metadata table and timeline management need periodic compaction and cleaning. Configure Hudi's cleaner and archiver to prevent unbounded storage growth.
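The table type and retention choices above map to write options. The sketch below uses Hudi config keys as I understand them from its configuration reference; the helper functions are hypothetical, every value is a placeholder to tune, and the ordering check encodes an assumed constraint (the archiver keeps more commits than the cleaner retains).

```python
# Illustrative write options for the pitfalls above. The option keys are Hudi
# write configs; every value here is a placeholder, not a recommendation.

TABLE_TYPES = {"cow": "COPY_ON_WRITE", "mor": "MERGE_ON_READ"}


def table_type_options(kind):
    """Pick the table type: 'cow' for read-heavy tables, 'mor' for write-heavy ones."""
    if kind not in TABLE_TYPES:
        raise ValueError(f"kind must be one of {sorted(TABLE_TYPES)}, got {kind!r}")
    return {"hoodie.datasource.write.table.type": TABLE_TYPES[kind]}


def retention_options(cleaner_commits=10, archive_min=20, archive_max=30):
    """Cleaner and archiver settings that bound storage and timeline growth."""
    # Assumed constraint: the archiver must keep more commits than the cleaner retains.
    if not cleaner_commits < archive_min < archive_max:
        raise ValueError("expected cleaner_commits < archive_min < archive_max")
    return {
        "hoodie.cleaner.commits.retained": str(cleaner_commits),
        "hoodie.keep.min.commits": str(archive_min),
        "hoodie.keep.max.commits": str(archive_max),
    }


write_options = {**table_type_options("mor"), **retention_options()}
```

These options would be merged into the writer call, for example with `df.write.format('hudi').options(**write_options)`.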
Frequently Asked Questions
How does Hudi compare to Delta Lake?
Both provide ACID transactions on data lakes. Hudi focuses on record-level upserts and incremental processing, while Delta Lake emphasizes Spark integration and simple append/merge operations. Hudi offers more table type options (CoW vs MoR) for tuning read/write tradeoffs.
Can engines other than Spark query Hudi tables?
Yes. Hudi tables can be queried by Trino, Presto, Hive, AWS Athena, Google BigQuery, Snowflake (via external tables), and other engines. Read support varies by engine and table type.
What is the difference between Copy-on-Write and Merge-on-Read?
Copy-on-Write (CoW) rewrites entire files on each update, giving fast read performance. Merge-on-Read (MoR) writes deltas to log files and merges them at read time, giving faster writes but slightly slower reads until compaction runs.
Does Hudi support schema evolution?
Yes. Hudi supports adding columns and compatible type promotions by default; renaming and deleting columns depend on its schema-on-read evolution mode, which must be enabled explicitly. Schema changes are tracked in the timeline and applied transparently to readers. Backward and forward compatibility follows Avro schema evolution rules.
How do incremental queries work?
Hudi tracks a timeline of commits. Incremental queries specify a begin timestamp and return only records that changed after that point. This enables efficient CDC pipelines where downstream consumers process only new data instead of scanning the full table.
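The timeline mechanism can be pictured with a toy model in plain Python. Everything here is hypothetical and is not Hudi's API: the `timeline` list, the `incremental_read` function, and the sample records exist only to show how a begin instant filters commits.

```python
# Toy model of a commit timeline; names and data are hypothetical, not Hudi's API.

timeline = [
    ("20260401000000", [{"event_id": "e1", "status": "new"}]),
    ("20260402000000", [{"event_id": "e2", "status": "new"}]),
    ("20260403000000", [{"event_id": "e1", "status": "done"}]),
]


def incremental_read(timeline, begin_instant):
    """Return the latest version of each record written after `begin_instant`."""
    latest = {}
    for instant, records in timeline:  # commits are ordered by instant
        # Fixed-width timestamp strings sort chronologically; the bound is exclusive.
        if instant > begin_instant:
            for rec in records:
                latest[rec["event_id"]] = rec
    return list(latest.values())


changes = incremental_read(timeline, "20260401000000")
print([r["event_id"] for r in changes])  # ['e2', 'e1']
```

A consumer that stores the last instant it processed can pass it back as `begin_instant` on the next run and see only later commits.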
Related Assets
Apache Flink — Stream Processing Framework for Real-Time Data
Apache Flink is the leading open-source framework for stateful stream processing. It processes unbounded data streams with exactly-once semantics, low latency, and high throughput — powering real-time analytics, fraud detection, and event-driven applications.
Apache Beam — Unified Batch and Stream Data Processing
Apache Beam is a unified programming model for defining both batch and streaming data-parallel processing pipelines. Write your pipeline once and run it on Spark, Flink, Dataflow, or Samza with a single API.
Apache Spark — Unified Analytics Engine for Big Data
Apache Spark is the most widely used engine for large-scale data processing. It provides in-memory computing for batch processing, SQL queries, machine learning, graph processing, and streaming — all through a unified API in Python, Scala, Java, and R.
Apache Doris — Modern MPP Analytical Database for Real-Time Reporting
Apache Doris is a high-performance real-time analytical database. It combines MySQL-compatible SQL, sub-second query latency, and support for federated queries across data lakes, Hive, Iceberg, and Hudi — the open-source answer to Snowflake and BigQuery.