Configs · Apr 16, 2026 · 3 min read

Apache Hudi — Incremental Data Processing for Data Lakehouses

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data lakehouse platform that provides record-level insert, update, and delete capabilities on data lakes. It powers incremental pipelines, CDC ingestion, and near-real-time analytics on S3, GCS, and HDFS.

TL;DR
Apache Hudi adds record-level upserts, deletes, and incremental queries to your data lake.
§01

What it is

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data lakehouse platform. It provides record-level insert, update, and delete capabilities on top of data lakes stored in S3, GCS, HDFS, or any Hadoop-compatible filesystem. Hudi turns your object storage into a mutable, transactional data store with ACID guarantees.

Hudi targets data engineers building incremental ETL pipelines, teams migrating from traditional data warehouses to lakehouse architectures, and organizations that need change data capture (CDC) ingestion from operational databases into their analytics layer.

§02

How it saves time or tokens

Without Hudi, updating or deleting records in a data lake requires rewriting entire partitions. Hudi's record-level operations mean you only process changed records, dramatically reducing compute costs and pipeline runtime. Incremental queries let downstream consumers read only new or changed data since their last checkpoint, eliminating full-table scans.

For AI and ML workflows, Hudi's time travel feature lets you query historical snapshots of your data for reproducible training datasets without maintaining separate copies.
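
As a rough sketch of what that looks like with the Spark datasource (the table path and instant value below are placeholders), a time travel read pins the query to a past commit:

# Time travel: read the table as it existed at a past commit instant
df_snapshot = spark.read.format('hudi') \
  .option('as.of.instant', '20260401000000') \
  .load('s3://my-lake/user_events')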

§03

How to use

  1. Add Hudi to your Spark, Flink, or Hive environment. For Spark: include the hudi-spark-bundle JAR in your spark-submit command (a session setup sketch follows this list).
  2. Write data to a Hudi table using the Hudi datasource. Specify the record key, partition path, and precombine field.
  3. Query the table using Spark SQL, Trino, or any engine that supports Hudi's metadata. Use incremental queries to process only new records.
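
For step 1, a minimal PySpark session sketch is shown below. The bundle coordinate is an assumption for Spark 3.5 / Scala 2.12 with Hudi 0.15.0; match the artifact and version to your cluster (with spark-submit, pass the same coordinate via --packages instead).

# Minimal session setup with the Hudi Spark bundle (versions are assumptions)
from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .appName('hudi-quickstart') \
  .config('spark.jars.packages', 'org.apache.hudi:hudi-spark3.5-bundle_2.12:0.15.0') \
  .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
  .config('spark.sql.extensions', 'org.apache.spark.sql.hudi.HoodieSparkSessionExtension') \
  .getOrCreate()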
§04

Example

# Write to a Hudi table with upsert
df.write.format('hudi') \
  .option('hoodie.table.name', 'user_events') \
  .option('hoodie.datasource.write.recordkey.field', 'event_id') \
  .option('hoodie.datasource.write.partitionpath.field', 'event_date') \
  .option('hoodie.datasource.write.precombine.field', 'updated_at') \
  .option('hoodie.datasource.write.operation', 'upsert') \
  .mode('append') \
  .save('s3://my-lake/user_events')

# Incremental read: only changes since last checkpoint
df_incremental = spark.read.format('hudi') \
  .option('hoodie.datasource.query.type', 'incremental') \
  .option('hoodie.datasource.read.begin.instanttime', '20260401000000') \
  .load('s3://my-lake/user_events')
§05

Common pitfalls

  • Hudi's write operations typically run on a Spark or Flink runtime. Ensure your cluster is sized for upsert-heavy workloads: Copy-on-Write tables rewrite whole files on each update, and Merge-on-Read tables need headroom for background compaction.
  • Choosing between Copy-on-Write (CoW) and Merge-on-Read (MoR) table types matters. CoW is simpler and better for read-heavy workloads. MoR optimizes write performance but adds complexity to read queries.
  • Metadata table and timeline management need periodic compaction and cleaning. Configure Hudi's cleaner and archiver to prevent unbounded storage growth (an option sketch follows this list).
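
As a hedged illustration of the last two points (config keys from Hudi's write options; the retention and compaction values are illustrative starting points, not recommendations), these settings would sit alongside the record key, partition path, and precombine options shown in the Example above:

# Illustrative MoR table with inline compaction, cleaning, and archival bounds
df.write.format('hudi') \
  .option('hoodie.table.name', 'user_events') \
  .option('hoodie.datasource.write.table.type', 'MERGE_ON_READ') \
  .option('hoodie.compact.inline', 'true') \
  .option('hoodie.compact.inline.max.delta.commits', '5') \
  .option('hoodie.cleaner.commits.retained', '10') \
  .option('hoodie.keep.min.commits', '20') \
  .option('hoodie.keep.max.commits', '30') \
  .mode('append') \
  .save('s3://my-lake/user_events')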

Frequently Asked Questions

How does Apache Hudi differ from Delta Lake?

Both provide ACID transactions on data lakes. Hudi focuses on record-level upserts and incremental processing, while Delta Lake emphasizes Spark integration and simple append/merge operations. Hudi offers more table type options (CoW vs MoR) for tuning read/write tradeoffs.

Can Hudi work with query engines other than Spark?

Yes. Hudi tables can be queried by Trino, Presto, Hive, AWS Athena, Google BigQuery, Snowflake (via external tables), and other engines. Read support varies by engine and table type.
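
Most of these engines discover Hudi tables through a Hive-compatible metastore or catalog (for example, AWS Glue). As a sketch, assuming a metastore is reachable and using placeholder database and table names, sync can be enabled on the write path:

# Register/refresh the table in a Hive-compatible metastore during the write
df.write.format('hudi') \
  .option('hoodie.table.name', 'user_events') \
  .option('hoodie.datasource.hive_sync.enable', 'true') \
  .option('hoodie.datasource.hive_sync.mode', 'hms') \
  .option('hoodie.datasource.hive_sync.database', 'analytics') \
  .option('hoodie.datasource.hive_sync.table', 'user_events') \
  .mode('append') \
  .save('s3://my-lake/user_events')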

What is the difference between CoW and MoR tables?

Copy-on-Write (CoW) rewrites entire files on each update, giving fast read performance. Merge-on-Read (MoR) writes deltas to log files and merges them at read time, giving faster writes but slightly slower reads until compaction runs.

Does Hudi support schema evolution?

Yes. Hudi supports adding, renaming, and deleting columns. Schema changes are tracked in the timeline and applied transparently to readers. Backward and forward compatibility follows Avro schema evolution rules.

How does incremental querying work?

Hudi tracks a timeline of commits. Incremental queries specify a begin timestamp and return only records that changed after that point. This enables efficient CDC pipelines where downstream consumers process only new data instead of scanning the full table.
