# Apache Flink — Stream Processing Framework for Real-Time Data

> Apache Flink is the leading open-source framework for stateful stream processing. It processes unbounded data streams with exactly-once semantics, low latency, and high throughput — powering real-time analytics, fraud detection, and event-driven applications.

## Quick Use

```bash
# Install Flink (local mode)
wget https://dlcdn.apache.org/flink/flink-1.19.0/flink-1.19.0-bin-scala_2.12.tgz
tar xzf flink-1.19.0-bin-scala_2.12.tgz
cd flink-1.19.0

# Start local cluster
./bin/start-cluster.sh
# Web UI at http://localhost:8081

# Run SQL client
./bin/sql-client.sh
# > SELECT name, COUNT(*) FROM orders GROUP BY name;

# PyFlink
pip install apache-flink
```

## Introduction

Apache Flink is purpose-built for stream processing. While Spark added streaming as an afterthought (micro-batching), Flink was designed from the ground up for continuous, stateful computation over unbounded data streams. It provides true event-time processing, exactly-once state consistency, and millisecond latency.

With over 26,000 GitHub stars, Flink powers real-time systems at Alibaba (processing billions of events per day), Netflix, Uber, Apple, and thousands of other companies. It is the de facto standard for applications requiring low-latency, high-throughput stream processing.

## What Flink Does

Flink processes continuous streams of data with rich transformations: windowing, joins, aggregations, pattern detection, and complex event processing. It maintains state across events (counters, ML models, session data) with exactly-once guarantees, and can recover that state after failures via checkpointing.
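The two core ideas above — event-time windowing and per-key state — can be sketched without a cluster. The toy aggregator below is plain Python, not Flink: it buckets events into one-minute tumbling windows by their event-time timestamp and keeps a per-(key, window) count, loosely mirroring a `GROUP BY product, TUMBLE(...)` query. All names are illustrative.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute tumbling windows

def window_start(event_time_ms: int) -> int:
    """Align an event-time timestamp to the start of its tumbling window."""
    return event_time_ms - (event_time_ms % WINDOW_MS)

def aggregate(events):
    """Count events per (product, window) — a toy model of
    GROUP BY product, TUMBLE(order_time, INTERVAL '1' MINUTE)."""
    counts = defaultdict(int)  # state: (product, window_start) -> count
    for product, event_time_ms in events:
        counts[(product, window_start(event_time_ms))] += 1
    return dict(counts)

# Out-of-order events still land in the correct window, because
# grouping uses event time rather than arrival order.
events = [
    ("widget", 5_000),   # window [0, 60000)
    ("gadget", 61_000),  # window [60000, 120000)
    ("widget", 59_999),  # arrives late, still window [0, 60000)
]
print(aggregate(events))
# {('widget', 0): 2, ('gadget', 60000): 1}
```

What this sketch leaves out is exactly what Flink provides in production: watermarks to decide when a window is complete, fault-tolerant state backends, and exactly-once recovery.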
## Architecture Overview

```
            [Data Sources]
  Kafka, Kinesis, Files, Databases (CDC), Sockets
                  |
          [Flink Application]
          DataStream API or SQL
                  |
          [Stream Processing]
           Event-time windows
           State management
         Exactly-once semantics
                  |
          +-------+-------+
          |               |
     [Flink SQL]   [DataStream API]
     Declarative    Programmatic
     SQL queries    Java/Python
     on streams     transformations
                  |
           [Checkpointing]
  Periodic state snapshots for fault tolerance
                  |
              [Sinks]
  Kafka, Elasticsearch, Databases, Files, S3
```

## Self-Hosting & Configuration

```python
# PyFlink example: real-time aggregation
from pyflink.table import EnvironmentSettings, TableEnvironment

env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Define a Kafka source with an event-time watermark
env.execute_sql("""
    CREATE TABLE orders (
        order_id STRING,
        product STRING,
        amount DECIMAL(10, 2),
        order_time TIMESTAMP(3),
        WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# Tumbling window aggregation: per-product counts and totals per minute
env.execute_sql("""
    SELECT
        product,
        TUMBLE_START(order_time, INTERVAL '1' MINUTE) AS window_start,
        COUNT(*) AS order_count,
        SUM(amount) AS total_amount
    FROM orders
    GROUP BY product, TUMBLE(order_time, INTERVAL '1' MINUTE)
""").print()
```

## Key Features

- **True Streaming** — processes events one-by-one, not micro-batches
- **Exactly-Once** — guaranteed state consistency across failures
- **Event-Time Processing** — handle out-of-order events correctly
- **Stateful Computation** — maintain and query application state
- **Flink SQL** — SQL on streams for analytics and ETL
- **Savepoints** — snapshot and restore application state for upgrades
- **Windowing** — tumbling, sliding, session, and custom windows
- **CDC** — capture database changes as streams (Debezium integration)

## Comparison with Similar Tools

| Feature | Flink | Spark Streaming | Kafka Streams | Pulsar Functions |
|---|---|---|---|---|
| Processing Model | True streaming | Micro-batch | Per-record | Per-record |
| Latency | Milliseconds | Seconds | Milliseconds | Milliseconds |
| State Management | Built-in (RocksDB) | Limited | Built-in | Limited |
| SQL Support | Flink SQL | Spark SQL | ksqlDB | Pulsar SQL |
| Exactly-Once | Yes | Yes | Yes | Yes |
| Standalone Mode | Yes | Part of Spark | Embedded (no cluster) | Part of Pulsar |
| Best For | Complex streaming | Batch + streaming | Simple streaming | Pulsar-native |
| GitHub Stars | 26K | Part of Spark (43K) | Part of Kafka | Part of Pulsar |

## FAQ

**Q: Flink vs Spark Streaming — which should I choose?**
A: Flink for low-latency requirements, complex event processing, and true event-time semantics. Spark Structured Streaming for teams already using Spark and when second-level latency is acceptable.

**Q: Does Flink only do streaming?**
A: No. Flink handles both streaming and batch processing with the same API. Batch is treated as a special case of streaming (bounded streams).

**Q: How does Flink handle failures?**
A: Flink periodically checkpoints application state to durable storage (S3, HDFS). On failure, it restores state from the latest checkpoint and replays events from the source (e.g., Kafka offsets).

**Q: Is there a managed Flink service?**
A: Yes. Amazon Managed Service for Apache Flink, Confluent Cloud (Flink), and Ververica Platform provide managed Flink clusters.

## Sources

- GitHub: https://github.com/apache/flink
- Documentation: https://flink.apache.org
- Created at TU Berlin; Apache Top-Level Project
- License: Apache-2.0

---

Source: https://tokrepo.com/en/workflows/8cf8efc6-3734-11f1-9bc6-00163e2b0d79
Author: Script Depot