# Apache Beam — Unified Batch and Stream Data Processing

> Apache Beam is a unified programming model for defining both batch and streaming data-parallel processing pipelines. Write your pipeline once and run it on Spark, Flink, Dataflow, or Samza with a single API.

## Quick Use

```bash
pip install 'apache-beam[gcp]'
python -m apache_beam.examples.wordcount \
  --input gs://dataflow-samples/shakespeare/kinglear.txt \
  --output /tmp/counts
# Runs locally with the DirectRunner by default
```

## Introduction

Apache Beam provides a single API to express data processing logic that can execute on multiple distributed backends. Born from Google's internal Dataflow model, it lets teams write pipelines once and deploy them on Apache Flink, Spark, Google Cloud Dataflow, or other runners without rewriting code.

## What Apache Beam Does

- Provides a unified SDK for batch and streaming pipelines in Python, Java, Go, and TypeScript
- Abstracts the execution engine so pipelines are portable across runners
- Handles windowing, triggers, and watermarks for complex event-time processing
- Offers built-in I/O connectors for Kafka, BigQuery, Parquet, JDBC, and 30+ other systems
- Supports advanced patterns like stateful processing, timers, and side inputs

## Architecture Overview

A Beam pipeline is a DAG of PTransforms that operate on PCollections (bounded or unbounded datasets). The SDK constructs this DAG, then a runner translates it into the execution engine's native plan. The DirectRunner executes locally for testing; production runners (FlinkRunner, DataflowRunner, SparkRunner) distribute work across clusters. Beam's model separates what to compute from where to compute it.
## Self-Hosting & Configuration

- Install the Python SDK with `pip install apache-beam` and choose extras for your runner
- Use the DirectRunner for local development and unit testing
- Deploy on Flink clusters by packaging as a JAR or using the portable runner
- Configure Google Cloud Dataflow via `--runner=DataflowRunner` with project and region flags
- Set pipeline options (parallelism, temp location, autoscaling) in code or as command-line args

## Key Features

- True portability: same code runs on Flink, Spark, Dataflow, and Samza
- First-class streaming with event-time semantics, late data handling, and exactly-once guarantees
- Multi-language pipelines that mix Python and Java transforms
- Beam SQL and DataFrames API for declarative processing
- Extensible I/O framework with 30+ built-in connectors

## Comparison with Similar Tools

- **Apache Flink** — a runner/engine; Beam is the portable API layer that can target Flink
- **Apache Spark Structured Streaming** — engine-specific API; Beam provides runner-agnostic code
- **dbt** — SQL-first transformation in the warehouse; Beam handles raw stream and batch processing outside the warehouse
- **Kafka Streams** — Kafka-only stream library; Beam supports multiple sources and runners
- **Prefect/Airflow** — workflow orchestrators that schedule jobs; Beam defines the data processing logic inside those jobs

## FAQ

**Q: When should I use Beam over writing native Flink code?**
A: Use Beam when you want runner portability or your team prefers the Beam SDK. Use native Flink when you need engine-specific low-level control.

**Q: Is Beam only for big data?**
A: No. The DirectRunner works for small local jobs. Beam scales from laptop to multi-terabyte cluster pipelines.

**Q: Can Beam handle real-time streaming?**
A: Yes. Beam's model natively supports unbounded streaming with windowing, watermarks, and triggers for low-latency processing.
**Q: What is the relationship between Beam and Google Dataflow?**
A: Dataflow is a managed runner on Google Cloud; Beam is the open-source SDK and model. You can run Beam pipelines without Dataflow on Flink or Spark.

## Sources

- https://github.com/apache/beam
- https://beam.apache.org/documentation/

---

Source: https://tokrepo.com/en/workflows/be423f59-39eb-11f1-9bc6-00163e2b0d79
Author: Script Depot