Apache Hudi — Incremental Data Processing for Data Lakehouses
Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data lakehouse platform that provides record-level insert, update, and delete capabilities on data lakes. It powers incremental pipelines, CDC ingestion, and near-real-time analytics on S3, GCS, and HDFS.
What it is
Hudi provides record-level insert, update, and delete capabilities on top of data lakes stored in S3, GCS, HDFS, or any Hadoop-compatible filesystem, turning your object storage into a mutable, transactional data store with ACID guarantees.
Hudi targets data engineers building incremental ETL pipelines, teams migrating from traditional data warehouses to lakehouse architectures, and organizations that need change data capture (CDC) ingestion from operational databases into their analytics layer.
How it saves time or tokens
Without Hudi, updating or deleting records in a data lake requires rewriting entire partitions. Hudi's record-level operations mean you only process changed records, dramatically reducing compute costs and pipeline runtime. Incremental queries let downstream consumers read only new or changed data since their last checkpoint, eliminating full-table scans.
For AI and ML workflows, Hudi's time travel feature lets you query historical snapshots of your data for reproducible training datasets without maintaining separate copies.
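A time-travel read can be sketched as a small options helper. The option name `as.of.instant` comes from Hudi's Spark datasource documentation; the instant value and the table path are illustrative, and the commented usage assumes an active SparkSession.

```python
# Sketch: time-travel read options for a Hudi table. The option key
# 'as.of.instant' is Hudi's documented time-travel option; the instant
# format is yyyyMMddHHmmss (a commit timestamp on the table's timeline).
def time_travel_options(instant: str) -> dict:
    """Options for reading a Hudi table as of a past commit instant."""
    return {"as.of.instant": instant}  # e.g. '20260101000000'

# Usage with an active SparkSession (not run here; path is illustrative):
# df_v1 = (spark.read.format("hudi")
#          .options(**time_travel_options("20260101000000"))
#          .load("s3://my-lake/user_events"))
```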
How to use
- Add Hudi to your Spark, Flink, or Hive environment. For Spark, include the hudi-spark-bundle JAR in your spark-submit command.
- Write data to a Hudi table using the Hudi datasource. Specify the record key, partition path, and precombine field.
- Query the table using Spark SQL, Trino, or any engine that supports Hudi's metadata. Use incremental queries to process only new records.
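The first step above can be sketched as a spark-submit invocation. The bundle coordinates below are an assumption for illustration; match the Spark and Scala versions to your cluster, and `my_hudi_job.py` is a hypothetical job script. The Kryo serializer setting follows Hudi's quickstart guidance.

```shell
# Sketch: launching a Spark job with the Hudi bundle pulled from Maven.
# Adjust hudi-spark3.4-bundle_2.12:0.14.1 to your Spark/Scala/Hudi versions.
spark-submit \
  --packages org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  my_hudi_job.py
```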
Example
```python
# Write to a Hudi table with upsert
df.write.format('hudi') \
    .option('hoodie.table.name', 'user_events') \
    .option('hoodie.datasource.write.recordkey.field', 'event_id') \
    .option('hoodie.datasource.write.partitionpath.field', 'event_date') \
    .option('hoodie.datasource.write.precombine.field', 'updated_at') \
    .option('hoodie.datasource.write.operation', 'upsert') \
    .mode('append') \
    .save('s3://my-lake/user_events')

# Incremental read: only changes since last checkpoint
df_incremental = spark.read.format('hudi') \
    .option('hoodie.datasource.query.type', 'incremental') \
    .option('hoodie.datasource.read.begin.instanttime', '20260401000000') \
    .load('s3://my-lake/user_events')
```
Related on TokRepo
- AI tools for database — Data infrastructure and database tools
- Automation tools — Pipeline orchestration and automation
Common pitfalls
- Hudi's write operations require a Spark or Flink runtime. Size your cluster for your table type: Copy-on-Write incurs write amplification from rewriting data files, while Merge-on-Read shifts that cost to read-time merging and background compaction.
- Choosing between Copy-on-Write (CoW) and Merge-on-Read (MoR) table types matters. CoW is simpler and better for read-heavy workloads. MoR optimizes write performance but adds complexity to read queries.
- Metadata table and timeline management need periodic compaction and cleaning. Configure Hudi's cleaner and archiver to prevent unbounded storage growth.
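The cleaner and archiver settings mentioned above can be sketched as a write-options fragment. The config keys are from Hudi's documentation; the retention numbers are illustrative, not recommendations, and should be tuned to your commit frequency.

```python
# Sketch: retention settings to bound timeline and storage growth.
# Numbers are illustrative; Hudi requires keep.min.commits to exceed
# cleaner.commits.retained so cleaning and archival do not conflict.
cleaning_options = {
    "hoodie.clean.automatic": "true",         # run the cleaner inline after commits
    "hoodie.cleaner.commits.retained": "10",  # keep old file versions for 10 commits
    "hoodie.keep.min.commits": "20",          # lower bound on the active timeline
    "hoodie.keep.max.commits": "30",          # archive timeline instants beyond this
}
```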
Frequently Asked Questions
How does Hudi compare to Delta Lake?
Both provide ACID transactions on data lakes. Hudi focuses on record-level upserts and incremental processing, while Delta Lake emphasizes Spark integration and simple append/merge operations. Hudi offers more table type options (CoW vs MoR) for tuning read/write tradeoffs.
Can engines other than Spark query Hudi tables?
Yes. Hudi tables can be queried by Trino (Presto), Hive, AWS Athena, Google BigQuery, Snowflake (via external tables), and other engines. Read support varies by engine and table type.
What is the difference between Copy-on-Write and Merge-on-Read?
Copy-on-Write (CoW) rewrites entire files on each update, giving fast read performance. Merge-on-Read (MoR) writes deltas to log files and merges them at read time, giving faster writes but slightly slower reads until compaction runs.
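The table type is chosen at write time via a single option. The key and its two valid values come from Hudi's Spark datasource documentation; the table name and record key below are illustrative.

```python
# Sketch: selecting the table type when writing. Valid values for
# 'hoodie.datasource.write.table.type' are COPY_ON_WRITE (the default)
# and MERGE_ON_READ.
write_options = {
    "hoodie.table.name": "user_events",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "event_id",
}
```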
Does Hudi support schema evolution?
Yes. Hudi supports adding, renaming, and deleting columns. Schema changes are tracked in the timeline and applied transparently to readers. Backward and forward compatibility follows Avro schema evolution rules.
How do incremental queries work?
Hudi tracks a timeline of commits. Incremental queries specify a begin timestamp and return only records that changed after that point. This enables efficient CDC pipelines where downstream consumers process only new data instead of scanning the full table.
Citations (3)
- Apache Hudi GitHub — Apache Hudi provides record-level insert, update, and delete on data lakes
- Apache Hudi Documentation — Supports S3, GCS, HDFS storage with ACID guarantees
- Hudi Concepts — Incremental queries and CDC ingestion patterns