# Apache Hudi — Incremental Data Processing for Data Lakehouses

> Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data lakehouse platform that provides record-level insert, update, and delete capabilities on data lakes. It powers incremental pipelines, CDC ingestion, and near-real-time analytics on S3, GCS, and HDFS.

## Quick Use

```bash
# Add the Hudi bundle to a Spark shell
spark-shell --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.15.0

# Write a Hudi table
# df.write.format("hudi").option("hoodie.table.name", "trips").save("/tmp/hudi_trips")

# Read it back
# spark.read.format("hudi").load("/tmp/hudi_trips").show()
```

## Introduction

Apache Hudi was created at Uber to solve the problem of efficiently updating records in data lakes. Traditional data lakes are append-only, so updates and deletes require expensive full-table rewrites. Hudi adds record-level upsert and delete semantics on top of columnar file formats, enabling near-real-time analytics on mutable datasets.

## What Apache Hudi Does

- Provides record-level upserts and deletes on data lake storage (S3, GCS, HDFS)
- Supports two table types: Copy-on-Write for read-heavy workloads and Merge-on-Read for write-heavy workloads
- Enables incremental queries that return only the records changed since a given commit
- Manages automatic file sizing, clustering, and compaction for query performance
- Integrates with Spark, Flink, Presto, Trino, and Hive for reading and writing

## Architecture Overview

Hudi maintains a timeline of commits on each table. Copy-on-Write tables rewrite entire Parquet files on each upsert, giving fast reads at the cost of write amplification. Merge-on-Read tables write deltas to row-based log files and periodically compact them into columnar base files, giving fast writes at the cost of merge work at read time. An indexing layer (Bloom filter, HBase, or bucket index) maps record keys to file groups for efficient lookups.
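The Merge-on-Read layout described above can be illustrated with a small plain-Python sketch. This is a conceptual model only, not Hudi's actual implementation; all names here are hypothetical:

```python
# Conceptual sketch of the Merge-on-Read read path (NOT Hudi's real internals):
# a columnar base snapshot plus row-based log records, merged at read time.

def read_merge_on_read(base_rows, log_records):
    """Merge a base snapshot with delta log records, replayed in commit order.

    base_rows:   {record_key: row_dict}, the last compacted base file
    log_records: list of (commit_time, record_key, row_dict_or_None),
                 where None marks a delete
    """
    merged = dict(base_rows)
    for _commit, key, row in sorted(log_records, key=lambda r: r[0]):
        if row is None:
            merged.pop(key, None)   # delete
        else:
            merged[key] = row       # upsert (insert or update)
    return merged

base = {"r1": {"fare": 10.0}, "r2": {"fare": 25.0}}
log = [
    ("c2", "r2", None),             # commit c2: delete r2
    ("c1", "r1", {"fare": 12.5}),   # commit c1: update r1
    ("c3", "r3", {"fare": 7.0}),    # commit c3: insert r3
]
snapshot = read_merge_on_read(base, log)
# r1 is updated, r2 is deleted, r3 is inserted
```

Writes only append to the log, which is why MoR write latency is low; the merge cost is paid on read until compaction folds the log back into the base file.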
The timeline server coordinates metadata and provides incremental pull semantics.

## Self-Hosting & Configuration

- Add the Hudi Spark bundle JAR to your Spark application or cluster
- Configure `hoodie.datasource.write.operation` (upsert, insert, bulk_insert, delete) per write job
- Choose a table type (COPY_ON_WRITE or MERGE_ON_READ) based on your read/write ratio
- Set up Hudi's built-in compaction and clustering services for Merge-on-Read tables
- Use the Hudi CLI or the timeline server UI to inspect table state and run admin commands

## Key Features

- Record-level upserts and deletes without full-table rewrites
- Incremental queries for efficient downstream CDC-style processing
- Two storage layouts (CoW and MoR) optimized for different workload patterns
- Built-in indexing, compaction, and clustering for automatic performance tuning
- Multi-engine support: read and write from Spark, Flink, Presto, and Trino

## Comparison with Similar Tools

- **Delta Lake** — a similar ACID lakehouse layer with stronger Databricks integration; Hudi excels at incremental processing and upsert-heavy workloads
- **Apache Iceberg** — focuses on metadata management and partition evolution; Hudi has more mature indexing and record-level operations
- **Apache Paimon** — a Flink-native table format; Hudi has broader engine support and a longer production track record
- **Plain Parquet on S3** — no upserts, deletes, or ACID guarantees; Hudi adds all of these
- **Apache Kudu** — columnar storage with fast updates, but it requires its own cluster; Hudi runs on standard object stores

## FAQ

**Q: When should I use Hudi over Delta Lake?**
A: Hudi shines when you have high-frequency upserts, CDC ingestion from databases, or need incremental query capabilities. Delta Lake may be simpler for batch-heavy Spark workloads.

**Q: Does Hudi work with Flink?**
A: Yes. Hudi has a native Flink integration for both writing streamed CDC data and reading Hudi tables with Flink SQL.
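The incremental-query model mentioned above (return only what changed after a given commit instant) can be sketched conceptually in plain Python. This is not Hudi's API; Hudi exposes the same idea through Spark read options such as `hoodie.datasource.query.type=incremental`:

```python
# Conceptual sketch of an incremental query over a commit timeline
# (plain Python illustration, not Hudi's API).

# Timeline: commits in instant-time order, each carrying the records it changed.
timeline = [
    ("20240101120000", {"r1": {"fare": 10.0}, "r2": {"fare": 25.0}}),
    ("20240101130000", {"r1": {"fare": 12.5}}),   # update r1
    ("20240101140000", {"r3": {"fare": 7.0}}),    # insert r3
]

def incremental_read(timeline, begin_instant):
    """Return records changed strictly after begin_instant; latest version wins."""
    changed = {}
    for instant, records in timeline:
        if instant > begin_instant:   # fixed-width instants compare lexicographically
            changed.update(records)
    return changed

# A downstream job last consumed the first commit; pull only what changed since.
delta = incremental_read(timeline, "20240101120000")
# delta holds only r1 (updated) and r3 (inserted), not the untouched r2
```

This is what makes CDC-style downstream pipelines cheap: each run processes only the delta since its last checkpoint instead of rescanning the whole table.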
**Q: What is the difference between Copy-on-Write and Merge-on-Read?**
A: CoW rewrites files on every write for fast reads. MoR appends to log files for fast writes and compacts asynchronously. Choose based on your read-to-write ratio.

**Q: Is Hudi production-ready?**
A: Yes. Hudi powers production data lakes at Uber, Amazon, ByteDance, Robinhood, and many other companies processing petabytes of data.

## Sources

- https://github.com/apache/hudi
- https://hudi.apache.org/docs/overview