Skills · April 16, 2026 · 1 min read

Apache Hudi — Incremental Data Processing for Data Lakehouses

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data lakehouse platform that provides record-level insert, update, and delete capabilities on data lakes. It powers incremental pipelines, CDC ingestion, and near-real-time analytics on S3, GCS, and HDFS.

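Agent ready

Review before installing

This asset needs review first. The copied instructions ask the agent to do a dry run, list the files it would write, and continue only after confirmation.

Needs Confirmation · 64/100 · Policy: needs confirmation
Agent entry point: any MCP/CLI agent
Type: Skill
Install: Single
Trust level: Community
Entry: Apache Hudi Overview
Review the command first:
npx -y tokrepo@latest install 2db0b23f-39ec-11f1-9bc6-00163e2b0d79 --target codex

Do a dry run first and confirm the files to be written before running this command.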

TL;DR
Apache Hudi adds record-level upserts, deletes, and incremental queries to your data lake.
§01

What it is

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data lakehouse platform. It provides record-level insert, update, and delete capabilities on top of data lakes stored in S3, GCS, HDFS, or any Hadoop-compatible filesystem. Hudi turns your object storage into a mutable, transactional data store with ACID guarantees.

Hudi targets data engineers building incremental ETL pipelines, teams migrating from traditional data warehouses to lakehouse architectures, and organizations that need change data capture (CDC) ingestion from operational databases into their analytics layer.

§02

How it saves time or tokens

Without Hudi, updating or deleting records in a data lake typically means rewriting entire partitions. With Hudi's record-level operations, a pipeline touches only the files that contain changed records, which cuts compute cost and pipeline runtime. Incremental queries let downstream consumers read only the data that is new or changed since their last checkpoint, so they avoid full-table scans.

For AI and ML workflows, Hudi's time travel feature lets you query historical snapshots of your data for reproducible training datasets without maintaining separate copies.
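
A time travel read can be sketched as below. The `as.of.instant` read option comes from Hudi's Spark datasource; the helper names, the table path, and the instant value are hypothetical, and `spark` is assumed to be an existing SparkSession with the Hudi bundle loaded.

```python
def time_travel_options(as_of_instant: str) -> dict:
    """Build Spark read options for a point-in-time (time travel) query."""
    if not as_of_instant:
        raise ValueError("as_of_instant must be a non-empty instant string")
    return {"as.of.instant": as_of_instant}


def read_snapshot_as_of(spark, base_path: str, as_of_instant: str):
    """Load the table as it looked at `as_of_instant`.

    `spark` is assumed to be a SparkSession with Hudi on its classpath.
    Nothing touches Spark until this function is called.
    """
    options = time_travel_options(as_of_instant)
    return spark.read.format("hudi").options(**options).load(base_path)


# Hypothetical usage for a reproducible training set:
# train_df = read_snapshot_as_of(spark, "s3://my-lake/user_events", "20260401000000")
```

Recording the instant alongside a trained model is one way to make the training data recoverable later, provided the cleaner has not yet removed the file versions that instant points to.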

§03

How to use

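  1. Add Hudi to your Spark, Flink, or Hive environment. For Spark, pass the hudi-spark-bundle JAR that matches your Spark and Scala versions to spark-submit (for example with --jars or --packages).
  2. Write data to a Hudi table using the Hudi datasource. Specify the record key (the unique identifier of a record), the partition path, and the precombine field (used to pick the winning version when several incoming records share a key).
  3. Query the table using Spark SQL, Trino, or any engine that supports Hudi's metadata. Use incremental queries to process only records that changed since a given commit.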
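
As a rough illustration of step 1, the session settings can be assembled in PySpark instead of on the spark-submit command line. This is a minimal sketch: the three settings are ones commonly shown in Hudi's Spark quick start, the bundle coordinate in the usage comment is an assumption that must match your Spark and Scala versions, and the helper names are hypothetical. Your Hudi version may call for additional settings.

```python
def hudi_spark_conf(bundle_coordinate: str) -> dict:
    """Return Spark settings commonly used to enable Hudi in a session."""
    return {
        # Maven coordinate of the Hudi Spark bundle (assumed; pick yours).
        "spark.jars.packages": bundle_coordinate,
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.sql.extensions": (
            "org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
        ),
    }


def build_session(app_name: str, bundle_coordinate: str):
    """Create a SparkSession with the settings above applied."""
    # Imported lazily so the config helper works without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in hudi_spark_conf(bundle_coordinate).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Hypothetical usage; the version numbers are placeholders:
# spark = build_session(
#     "hudi-demo", "org.apache.hudi:hudi-spark3.5-bundle_2.12:0.15.0"
# )
```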
§04

Example

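# Assumes an existing SparkSession `spark` with the Hudi bundle loaded, and a
# DataFrame `df` that has event_id, event_date, and updated_at columns.
# Write to a Hudi table with upsert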
df.write.format('hudi') \
  .option('hoodie.table.name', 'user_events') \
  .option('hoodie.datasource.write.recordkey.field', 'event_id') \
  .option('hoodie.datasource.write.partitionpath.field', 'event_date') \
  .option('hoodie.datasource.write.precombine.field', 'updated_at') \
  .option('hoodie.datasource.write.operation', 'upsert') \
  .mode('append') \
  .save('s3://my-lake/user_events')

# Incremental read: only changes since last checkpoint
df_incremental = spark.read.format('hudi') \
  .option('hoodie.datasource.query.type', 'incremental') \
  .option('hoodie.datasource.read.begin.instanttime', '20260401000000') \
  .load('s3://my-lake/user_events')
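
The same write path can also express the record-level deletes mentioned earlier. The sketch below reuses the option keys from the example and switches the operation to `delete`; the helper names are hypothetical, the allowed-operation set is only a subset of what Hudi accepts, and `df_to_delete` is assumed to be a DataFrame holding the rows (with their keys and partition values) that should be removed.

```python
def hudi_write_options(table_name, record_key, partition_field,
                       precombine_field, operation="upsert"):
    """Build Hudi datasource write options for one table."""
    # Subset of Hudi write operations, kept small for this sketch.
    allowed = {"insert", "upsert", "bulk_insert", "delete"}
    if operation not in allowed:
        raise ValueError(f"unsupported operation: {operation!r}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.operation": operation,
    }


def delete_records(df_to_delete, base_path, **table_fields):
    """Issue a record-level delete for every row in `df_to_delete`."""
    options = hudi_write_options(operation="delete", **table_fields)
    (df_to_delete.write.format("hudi")
        .options(**options)
        .mode("append")  # append mode: modify the table, do not replace it
        .save(base_path))


# Hypothetical usage:
# delete_records(df_to_delete, "s3://my-lake/user_events",
#                table_name="user_events", record_key="event_id",
#                partition_field="event_date", precombine_field="updated_at")
```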
§05

Common pitfalls

  • Hudi writes typically run on a Spark or Flink runtime. Size your cluster for the write amplification of copy-on-write tables, which rewrite whole files on update, or for the compaction jobs that merge-on-read tables need.
  • Choosing between Copy-on-Write (CoW) and Merge-on-Read (MoR) table types matters. CoW is simpler and better for read-heavy workloads. MoR optimizes write performance but adds complexity to read queries.
  • Metadata table and timeline management need periodic compaction and cleaning. Configure Hudi's cleaner and archiver to prevent unbounded storage growth.
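
The table type and housekeeping settings behind these pitfalls can be collected in one place. This is a sketch under stated assumptions: the option keys are Hudi write configs as I understand them, the numeric defaults are illustrative rather than recommendations, and the ordering check reflects my understanding that archival must retain more commits than the cleaner does. Verify both against the configuration reference for your Hudi version.

```python
def table_service_options(table_type="COPY_ON_WRITE", commits_retained=10,
                          min_commits_to_keep=20, max_commits_to_keep=30,
                          compact_every_n_delta_commits=5):
    """Build table type, cleaner, archiver, and compaction options."""
    if table_type not in {"COPY_ON_WRITE", "MERGE_ON_READ"}:
        raise ValueError(f"unknown table type: {table_type!r}")
    # Assumed constraint: cleaner retention < archive min < archive max.
    if not commits_retained < min_commits_to_keep < max_commits_to_keep:
        raise ValueError(
            "expected commits_retained < min_commits_to_keep "
            "< max_commits_to_keep"
        )
    options = {
        "hoodie.datasource.write.table.type": table_type,
        # Cleaner: how many commits of old file versions to keep.
        "hoodie.cleaner.commits.retained": str(commits_retained),
        # Archiver: bounds on the active timeline length.
        "hoodie.keep.min.commits": str(min_commits_to_keep),
        "hoodie.keep.max.commits": str(max_commits_to_keep),
    }
    if table_type == "MERGE_ON_READ":
        # Inline compaction merges log files into base files periodically.
        options["hoodie.compact.inline"] = "true"
        options["hoodie.compact.inline.max.delta.commits"] = str(
            compact_every_n_delta_commits
        )
    return options


# Hypothetical usage: df.write.format("hudi").options(**table_service_options(
#     table_type="MERGE_ON_READ")) ... combined with the table's key options.
```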

FAQ

How does Apache Hudi differ from Delta Lake?

Both provide ACID transactions on data lakes. Hudi focuses on record-level upserts and incremental processing, while Delta Lake emphasizes Spark integration and simple append/merge operations. Hudi offers more table type options (CoW vs MoR) for tuning read/write tradeoffs.

Can Hudi work with query engines other than Spark?

Yes. Hudi tables can be queried by Trino, Presto, Hive, AWS Athena, Google BigQuery, Snowflake (via external tables), and other engines. Read support varies by engine and table type, so check each engine's documentation before relying on it.

What is the difference between CoW and MoR tables?

Copy-on-Write (CoW) rewrites entire files on each update, giving fast read performance. Merge-on-Read (MoR) writes deltas to log files and merges them at read time, giving faster writes but slightly slower reads until compaction runs.
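
The tradeoff can be made concrete with a toy model. The classes below are not Hudi's implementation or API; they are hypothetical stand-ins for a single file group that count how many records each strategy writes.

```python
class CopyOnWriteFile:
    """Toy base file: every upsert rewrites all records in the file."""

    def __init__(self, records):
        self.records = dict(records)
        self.records_written = len(self.records)

    def upsert(self, key, value):
        rewritten = dict(self.records)  # copy the whole file
        rewritten[key] = value
        self.records = rewritten
        self.records_written += len(rewritten)

    def read(self):
        return dict(self.records)  # no merge work at read time


class MergeOnReadFile:
    """Toy base file plus a delta log that is merged when reading."""

    def __init__(self, records):
        self.base = dict(records)
        self.log = []
        self.records_written = len(self.base)

    def upsert(self, key, value):
        self.log.append((key, value))  # write only the delta
        self.records_written += 1

    def read(self):
        merged = dict(self.base)
        for key, value in self.log:  # merge cost grows with the log
            merged[key] = value
        return merged

    def compact(self):
        self.base = self.read()  # fold the log into a new base file
        self.log = []
        self.records_written += len(self.base)


base = {i: "v0" for i in range(100)}
cow, mor = CopyOnWriteFile(base), MergeOnReadFile(base)
for table in (cow, mor):
    table.upsert(1, "v1")
    table.upsert(2, "v1")
print(cow.records_written, mor.records_written)  # 300 102
```

Both toy tables return the same data on read; the difference is where the work happens. Compaction moves the merge-on-read cost back to write time, after which reads no longer need to replay the log.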

Does Hudi support schema evolution?

Yes. Hudi supports adding, renaming, and deleting columns. Schema changes are tracked in the timeline and applied transparently to readers. Backward and forward compatibility follows Avro schema evolution rules.

How does incremental querying work?

Hudi tracks a timeline of commits. Incremental queries specify a begin timestamp and return only records that changed after that point. This enables efficient CDC pipelines where downstream consumers process only new data instead of scanning the full table.
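
A consumer that tracks its own checkpoint might look like the sketch below. The option keys and the `_hoodie_commit_time` metadata column come from Hudi's Spark datasource; the helper names are hypothetical, and the checkpoint comparison assumes all instants use the same fixed-width timestamp format so that string ordering matches time ordering.

```python
def incremental_read_options(begin_instant, end_instant=None):
    """Build read options for an incremental query window."""
    options = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        # Optional upper bound, e.g. for replaying a fixed window.
        options["hoodie.datasource.read.end.instanttime"] = end_instant
    return options


def next_checkpoint(previous_checkpoint, commit_times):
    """Return the newest commit time seen, or the old checkpoint if none."""
    return max([previous_checkpoint, *commit_times])


def read_changes(spark, base_path, checkpoint):
    """Read records changed after `checkpoint` and compute the next one.

    `spark` is assumed to be a SparkSession with the Hudi bundle loaded.
    """
    changes = (spark.read.format("hudi")
               .options(**incremental_read_options(checkpoint))
               .load(base_path))
    latest = changes.agg({"_hoodie_commit_time": "max"}).collect()[0][0]
    seen = [latest] if latest is not None else []
    return changes, next_checkpoint(checkpoint, seen)


# Hypothetical usage; persist `checkpoint` only after processing succeeds:
# changes, checkpoint = read_changes(
#     spark, "s3://my-lake/user_events", "20260401000000")
```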
