Skills2026年4月16日·1 分钟阅读

Apache Beam — Unified Batch and Stream Data Processing

Apache Beam is a unified programming model for defining both batch and streaming data-parallel processing pipelines. Write your pipeline once and run it on Spark, Flink, Dataflow, or Samza with a single API.

Apache Software Foundation · Community

Agent 就绪

先审查再安装

这个资产需要先审查。复制的指令会要求 Agent dry-run、列出写入项，确认后再继续。

Needs Confirmation · 64/100策略：需确认

Agent 入口

任意 MCP/CLI Agent

类型

Skill

安装

Single

信任

信任等级：Community

入口

Apache Beam Overview

先审查命令

npx -y tokrepo@latest install be423f59-39eb-11f1-9bc6-00163e2b0d79 --target codex

先 dry-run，确认写入项后再运行此命令。

TL;DR

Apache Beam lets you write data pipelines once and run them on Spark, Flink, Dataflow, or Samza with the same code.

§01

What it is

Apache Beam is a unified programming model for defining both batch and streaming data-parallel processing pipelines. You write your pipeline once using the Beam SDK (Python, Java, or Go) and run it on multiple execution engines (runners) including Apache Flink, Apache Spark, Google Cloud Dataflow, and Apache Samza. The model provides a consistent API for windowing, triggering, and watermark handling across batch and streaming modes.

Beam is for data engineers who build ETL pipelines, real-time analytics, or event processing systems. If you want pipeline portability across execution engines or need to run the same logic in both batch and streaming modes, Beam provides the abstraction layer.

§02

How it saves time or tokens

Beam eliminates vendor lock-in for data pipelines. Write your pipeline in Beam's SDK and switch runners without code changes -- start on Spark, move to Dataflow for managed infrastructure, or switch to Flink for advanced streaming. The unified batch/streaming model means one pipeline handles both historical backfill and real-time processing. The Python SDK with its concise transform API generates compact pipeline code that AI assistants can produce accurately.

§03

How to use

Install the SDK: pip install apache-beam[gcp] (Python) or add the Maven dependency (Java).
Define your pipeline: read from a source, apply transforms, write to a sink.
Run locally with DirectRunner or on a cluster with your chosen runner.

§04

Example

import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'Read' >> beam.io.ReadFromText('input.txt')
        | 'Split' >> beam.FlatMap(lambda line: line.split())
        | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
        | 'GroupAndSum' >> beam.CombinePerKey(sum)
        | 'Format' >> beam.Map(lambda kv: f'{kv[0]}: {kv[1]}')
        | 'Write' >> beam.io.WriteToText('output')
    )

§05

Related on TokRepo

AI tools for automation -- data pipeline automation tools
Database AI tools -- data processing and storage

§06

Common pitfalls

Forgetting that Beam pipelines are DAGs, not imperative code. Transforms are lazy -- they define the computation graph but do not execute until the pipeline runs. Do not mix imperative Python logic with Beam transforms.
Not understanding windowing for streaming pipelines. Without proper windowing and triggering, streaming pipelines either hold data indefinitely or emit incomplete results. Study fixed, sliding, and session windows.
Using the DirectRunner in production. DirectRunner is for local testing only. It runs on a single machine without parallelism. Use Flink, Spark, or Dataflow runners for production workloads.

常见问题

What runners does Apache Beam support?+

Beam supports Apache Flink, Apache Spark, Google Cloud Dataflow, Apache Samza, and the DirectRunner for local testing. Each runner executes the same pipeline definition, so you can switch runners without changing your pipeline code.

Can I use Beam for both batch and streaming?+

Yes. Beam's unified model handles both. The same pipeline code processes bounded (batch) and unbounded (streaming) data. Windowing and triggering configurations control how streaming data is grouped and emitted.

What languages does Beam support?+

Beam has official SDKs for Python, Java, and Go. The Python SDK is the most concise for data processing tasks. Java provides the most mature feature set. Go support is newer but growing.

How does Beam compare to Apache Flink?+

Flink is both a programming model and a runtime. Beam is a programming model that can run on Flink as one of its runners. Using Beam on Flink gives you pipeline portability -- if you later want to switch to Dataflow or Spark, your code stays the same.

Is Beam suitable for small-scale projects?+

Beam adds abstraction overhead that is most valuable at scale or when runner portability matters. For small, single-machine pipelines, plain Python with pandas or PySpark may be simpler. Beam shines when pipelines need to scale or move between execution environments.

引用来源 (3)

Apache Beam GitHub— Apache Beam is a unified batch and streaming data processing model
Beam Runner Capabilities— Supports Flink, Spark, Dataflow, and Samza runners
Beam Python SDK Docs— Python, Java, and Go SDKs for pipeline development

讨论

登录后参与讨论。

还没有评论，来写第一条吧。

Apache Beam — Unified Batch and Stream Data Processing

先审查再安装

What it is

How it saves time or tokens

How to use

Example

Related on TokRepo

Common pitfalls

常见问题

引用来源 (3)

TokRepo 相关

讨论

相关资产

Apache Spark — Unified Analytics Engine for Big Data

Apache Flink — Stream Processing Framework for Real-Time Data

Apache Hudi — Incremental Data Processing for Data Lakehouses

Apache Hive — Distributed Data Warehouse for Big Data Analytics