Skills · April 16, 2026 · 1 min read

Apache Hudi — Incremental Data Processing for Data Lakehouses

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data lakehouse platform that provides record-level insert, update, and delete capabilities on data lakes. It powers incremental pipelines, CDC ingestion, and near-real-time analytics on S3, GCS, and HDFS.

Apache Software Foundation · Community

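Agent ready

Review before installing

This asset needs review first. The copied instructions ask the agent to do a dry run, list what it will write, and continue only after you confirm.

Needs Confirmation · 64/100 · Policy: needs confirmation

Agent entry point

Any MCP/CLI agent

Type

Skill

Install

Single

Trust

Trust level: Community

Entry

Apache Hudi Overview

Review the command first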

npx -y tokrepo@latest install 2db0b23f-39ec-11f1-9bc6-00163e2b0d79 --target codex

Do a dry run first, and run this command only after confirming what it will write.

TL;DR

Apache Hudi adds record-level upserts, deletes, and incremental queries to your data lake.

§01

What it is

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data lakehouse platform. It provides record-level insert, update, and delete capabilities on top of data lakes stored in S3, GCS, HDFS, or any Hadoop-compatible filesystem. Hudi turns your object storage into a mutable, transactional data store with ACID guarantees.

Hudi targets data engineers building incremental ETL pipelines, teams migrating from traditional data warehouses to lakehouse architectures, and organizations that need change data capture (CDC) ingestion from operational databases into their analytics layer.

§02

How it saves time or tokens

Without Hudi, updating or deleting records in a data lake typically means rewriting entire partitions. With Hudi's record-level operations, a pipeline processes only the changed records, which cuts compute cost and pipeline runtime. Incremental queries let downstream consumers read only the data that is new or changed since their last checkpoint, so they avoid full-table scans.

For AI and ML workflows, Hudi's time travel feature lets you query historical snapshots of your data for reproducible training datasets without maintaining separate copies.
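
The idea behind time travel can be sketched with a toy commit log: a snapshot "as of" an instant is what you get by replaying every commit up to and including that instant. This is only an illustration in plain Python. The `commits` list and the `snapshot_as_of` function are hypothetical and are not Hudi APIs.

```python
def snapshot_as_of(commits, instant):
    """Replay all commits whose time is <= instant and return {key: value}."""
    state = {}
    for commit_time, changes in sorted(commits, key=lambda c: c[0]):
        if commit_time > instant:
            break
        for key, value in changes.items():
            if value is None:
                state.pop(key, None)  # None marks a delete in this toy model
            else:
                state[key] = value
    return state


# Hypothetical commit log: (commit instant, {record key: new value}).
commits = [
    ("20260401000000", {"u1": "bronze", "u2": "silver"}),
    ("20260402000000", {"u1": "gold"}),
    ("20260403000000", {"u2": None}),
]

# A training job pinned to the second commit always sees the same data.
training_view = snapshot_as_of(commits, "20260402000000")
```

Because the instant is fixed, rerunning the training job later reads the same snapshot even though the table has moved on.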

§03

How to use

Add Hudi to your Spark, Flink, or Hive environment. For Spark, include the hudi-spark-bundle JAR that matches your Spark and Scala versions in your spark-submit command.
Write data to a Hudi table using the Hudi datasource. Specify the record key, the partition path, and the precombine field (used to pick the latest version when several rows share a key).
Query the table using Spark SQL, Trino, or any other engine that can read Hudi tables. Use incremental queries to process only new or changed records.
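
The three fields named in the write step map directly onto writer options. Here is a minimal sketch, assuming a hypothetical helper `hudi_write_options` (it is not part of Hudi) that collects the option keys used in this page's example into one dict. The list of allowed operations is a deliberately small subset chosen for illustration.

```python
def hudi_write_options(table_name, record_key, partition_path, precombine,
                       operation="upsert"):
    """Hypothetical helper: gather Hudi writer options into a single dict."""
    allowed = {"upsert", "insert", "bulk_insert", "delete"}  # illustrative subset
    if operation not in allowed:
        raise ValueError(f"unsupported operation: {operation!r}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_path,
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.datasource.write.operation": operation,
    }


opts = hudi_write_options("user_events", "event_id", "event_date", "updated_at")
```

With PySpark, a dict like this can be passed in one call, for example `df.write.format('hudi').options(**opts)`, which keeps table settings in one place instead of scattered across `.option(...)` calls.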

§04

Example

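# Assumes an active SparkSession `spark` and a DataFrame `df` that contains
# event_id, event_date, and updated_at columns.
# Write to a Hudi table with upsert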
df.write.format('hudi') \
  .option('hoodie.table.name', 'user_events') \
  .option('hoodie.datasource.write.recordkey.field', 'event_id') \
  .option('hoodie.datasource.write.partitionpath.field', 'event_date') \
  .option('hoodie.datasource.write.precombine.field', 'updated_at') \
  .option('hoodie.datasource.write.operation', 'upsert') \
  .mode('append') \
  .save('s3://my-lake/user_events')

# Incremental read: only records committed after the given instant (yyyyMMddHHmmss)
df_incremental = spark.read.format('hudi') \
  .option('hoodie.datasource.query.type', 'incremental') \
  .option('hoodie.datasource.read.begin.instanttime', '20260401000000') \
  .load('s3://my-lake/user_events')
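
To make the roles of the record key and the precombine field in the example above concrete, here is a toy, pure-Python model of an upsert. It is not Hudi's implementation: `upsert` and the in-memory `table` dict are hypothetical, and the sketch assumes ordering-based merging, where the row with the larger precombine value wins for a given key. Depending on the configured payload or merge mode, Hudi may instead let the latest write win.

```python
def upsert(table, batch, key="event_id", precombine="updated_at"):
    """Toy upsert: merge `batch` into `table`, a dict keyed by record key."""
    for row in batch:
        current = table.get(row[key])
        # Keep the incoming row unless the stored row has a strictly newer
        # precombine value.
        if current is None or row[precombine] >= current[precombine]:
            table[row[key]] = row
    return table


table = {}
upsert(table, [
    {"event_id": "e1", "updated_at": 1, "status": "created"},
    {"event_id": "e2", "updated_at": 1, "status": "created"},
])
upsert(table, [
    {"event_id": "e1", "updated_at": 3, "status": "shipped"},  # newer: replaces e1
    {"event_id": "e2", "updated_at": 0, "status": "stale"},    # older: ignored
    {"event_id": "e3", "updated_at": 2, "status": "created"},  # new key: inserted
])
```

The second batch touches three keys, yet only those three rows are examined. That is the record-level behavior that avoids rewriting a whole partition.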

§05

Related on TokRepo

AI tools for database — Data infrastructure and database tools
Automation tools — Pipeline orchestration and automation

§06

Common pitfalls

Hudi writes typically run on a Spark or Flink runtime. Size your cluster for the write amplification of copy-on-write tables, where an update rewrites whole files, and for the compaction jobs that merge-on-read tables need.
Choosing between Copy-on-Write (CoW) and Merge-on-Read (MoR) table types matters. CoW is simpler and better for read-heavy workloads. MoR optimizes write performance but adds complexity to read queries.
Old file versions, the commit timeline, and the metadata table all grow over time. Configure Hudi's cleaner and archiver to prevent unbounded storage growth, and schedule compaction for merge-on-read tables.
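
Why cleaning matters can be seen in a toy model where every rewrite of a file group leaves an older version behind. The sketch below is plain Python with a hypothetical `clean` function and made-up file group names. It keeps only the newest N versions per file group, similar in spirit to a keep-latest-versions retention policy, and does not model Hudi's actual cleaner.

```python
def clean(file_versions, versions_to_keep):
    """Toy cleaner: keep only the newest `versions_to_keep` versions per file group."""
    if versions_to_keep < 1:
        raise ValueError("versions_to_keep must be at least 1")
    return {
        file_group: sorted(versions)[-versions_to_keep:]
        for file_group, versions in file_versions.items()
    }


# Commit dates at which each (hypothetical) file group was rewritten.
file_versions = {
    "fg-1": ["20260401", "20260402", "20260403", "20260404"],
    "fg-2": ["20260402", "20260404"],
}

retained = clean(file_versions, versions_to_keep=2)
files_before = sum(len(v) for v in file_versions.values())
files_after = sum(len(v) for v in retained.values())
```

The trade-off is that retention also bounds how far back time travel and incremental reads can reach: a version that has been cleaned can no longer be queried.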

FAQ

How does Apache Hudi differ from Delta Lake?

Both provide ACID transactions on data lakes. Hudi focuses on record-level upserts and incremental processing, while Delta Lake emphasizes Spark integration and simple append/merge operations. Hudi offers more table type options (CoW vs MoR) for tuning read/write tradeoffs.

Can Hudi work with query engines other than Spark?

Yes. Hudi tables can be queried by Trino, Presto, Hive, AWS Athena, Google BigQuery, Snowflake (via external tables), and other engines. Read support varies by engine and table type.

What is the difference between CoW and MoR tables?

Copy-on-Write (CoW) rewrites each affected file on every update, which keeps reads fast. Merge-on-Read (MoR) writes deltas to log files and merges them at read time, which gives faster writes but slower reads until compaction runs.
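
The trade-off can be sketched with two toy file groups that count how many rows each strategy writes. Both classes are hypothetical simplifications in plain Python. Real Hudi files, log formats, and compaction scheduling are considerably more involved.

```python
class CopyOnWriteFile:
    """Toy CoW file group: every update rewrites the whole base file."""

    def __init__(self, rows):
        self.base = dict(rows)
        self.rows_written = len(self.base)

    def update(self, changes):
        self.base = {**self.base, **changes}
        self.rows_written += len(self.base)  # full rewrite of the base file

    def read(self):
        return dict(self.base)  # no merge needed at read time


class MergeOnReadFile:
    """Toy MoR file group: updates go to a log and are merged when read."""

    def __init__(self, rows):
        self.base = dict(rows)
        self.log = []
        self.rows_written = len(self.base)

    def update(self, changes):
        self.log.append(dict(changes))
        self.rows_written += len(changes)  # only the delta is written

    def read(self):
        merged = dict(self.base)
        for delta in self.log:  # merge cost is paid on every read
            merged.update(delta)
        return merged

    def compact(self):
        self.base = self.read()
        self.rows_written += len(self.base)  # compaction rewrites the base file
        self.log = []


rows = {f"k{i}": 0 for i in range(1000)}
cow, mor = CopyOnWriteFile(rows), MergeOnReadFile(rows)
for version in range(1, 6):
    cow.update({"k1": version})
    mor.update({"k1": version})
```

After five single-row updates both tables return the same data, but the CoW file has rewritten its 1,000 rows five times while the MoR file has appended five log entries that each read must merge until `compact()` runs.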

Does Hudi support schema evolution?

Yes. Hudi supports adding, renaming, and deleting columns. Schema changes are tracked in the timeline and applied transparently to readers. Backward and forward compatibility follows Avro schema evolution rules.

How does incremental querying work?

Hudi tracks a timeline of commits. Incremental queries specify a begin timestamp and return only records that changed after that point. This enables efficient CDC pipelines where downstream consumers process only new data instead of scanning the full table.
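
A toy version of this checkpointing loop, in plain Python, looks like the following. Hudi stamps each record with a `_hoodie_commit_time` metadata field. The `incremental_read` function, the sample records, and the checkpoint handling are hypothetical. Following the description above, the filter returns records committed strictly after the begin instant.

```python
def incremental_read(records, begin_instant):
    """Return records whose commit time is strictly after begin_instant."""
    return [r for r in records if r["_hoodie_commit_time"] > begin_instant]


records = [
    {"event_id": "e1", "_hoodie_commit_time": "20260331120000"},
    {"event_id": "e2", "_hoodie_commit_time": "20260401000000"},
    {"event_id": "e3", "_hoodie_commit_time": "20260405093000"},
]

checkpoint = "20260401000000"
changed = incremental_read(records, checkpoint)

# The consumer then advances its checkpoint to the newest commit it has seen.
next_checkpoint = (
    max(r["_hoodie_commit_time"] for r in changed) if changed else checkpoint
)
```

Fixed-width instants compare correctly as strings, which is why a simple `>` works here. Persisting `next_checkpoint` between runs is what lets each run pick up where the last one stopped.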



