Apr 13, 2026 · 3 min read

Apache Flink — Stream Processing Framework for Real-Time Data

Apache Flink is the leading open-source framework for stateful stream processing. It processes unbounded data streams with exactly-once semantics, low latency, and high throughput — powering real-time analytics, fraud detection, and event-driven applications.

Script Depot · Community
Quick Use

Use it first, then decide how deep to go

Copy the commands below to install Flink locally, start a cluster, and try the SQL client before committing to a deeper setup.

# Install Flink (local mode)
wget https://dlcdn.apache.org/flink/flink-1.19.0/flink-1.19.0-bin-scala_2.12.tgz
tar xzf flink-1.19.0-bin-scala_2.12.tgz
cd flink-1.19.0

# Start local cluster
./bin/start-cluster.sh
# Web UI at http://localhost:8081

# Run SQL client
./bin/sql-client.sh
# > SELECT name, COUNT(*) FROM orders GROUP BY name;

# PyFlink
pip install apache-flink

Introduction

Apache Flink is purpose-built for stream processing. While Spark added streaming as an afterthought (micro-batching), Flink was designed from the ground up for continuous, stateful computation over unbounded data streams. It provides true event-time processing, exactly-once state consistency, and millisecond latency.

With over 26,000 GitHub stars, Flink powers real-time systems at Alibaba (processing billions of events per day), Netflix, Uber, Apple, and thousands of companies. It is the de facto standard for applications requiring low-latency, high-throughput stream processing.

What Flink Does

Flink processes continuous streams of data with rich transformations: windowing, joins, aggregations, pattern detection, and complex event processing. It maintains state across events (counters, ML models, session data) with exactly-once guarantees, and can recover state from failures via checkpointing.
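As a toy illustration of that keyed-state pattern (plain Python, not the Flink API): each key carries its own running aggregate, updated one event at a time, which is exactly the per-key state that Flink persists and checkpoints.

```python
from collections import defaultdict

# Toy model of Flink keyed state: each key gets its own state slot,
# updated one event at a time. Event shape (key, amount) is assumed.
state = defaultdict(int)  # per-key running total

def process_element(event):
    key, amount = event
    state[key] += amount       # update this key's state
    return (key, state[key])   # emit the new running total downstream

out = [process_element(e) for e in [("alice", 5), ("bob", 3), ("alice", 2)]]
# out == [("alice", 5), ("bob", 3), ("alice", 7)]
```

In real Flink, this state lives in a state backend (e.g. RocksDB) and survives failures via checkpoints; the toy dictionary above would not.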

Architecture Overview

[Data Sources]
Kafka, Kinesis, Files,
Databases (CDC), Sockets
        |
   [Flink Application]
   DataStream API or SQL
        |
   [Stream Processing]
   Event-time windows
   State management
   Exactly-once semantics
        |
+-------+-------+
|       |       |
[Flink SQL] [DataStream API]
Declarative  Programmatic
SQL queries  Java/Python
on streams   transformations
        |
   [Checkpointing]
   Periodic state snapshots
   for fault tolerance
        |
   [Sinks]
   Kafka, Elasticsearch,
   Databases, Files, S3

Self-Hosting & Configuration

# PyFlink example: real-time aggregation
# (requires the flink-sql-connector-kafka JAR on Flink's classpath)
from pyflink.table import EnvironmentSettings, TableEnvironment

env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Define Kafka source
env.execute_sql("""
    CREATE TABLE orders (
        order_id STRING,
        product STRING,
        amount DECIMAL(10,2),
        order_time TIMESTAMP(3),
        WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# Tumbling window aggregation
env.execute_sql("""
    SELECT
        product,
        TUMBLE_START(order_time, INTERVAL '1' MINUTE) as window_start,
        COUNT(*) as order_count,
        SUM(amount) as total_amount
    FROM orders
    GROUP BY product, TUMBLE(order_time, INTERVAL '1' MINUTE)
""").print()

Key Features

  • True Streaming — processes events one-by-one, not micro-batches
  • Exactly-Once — guaranteed state consistency across failures
  • Event-Time Processing — handle out-of-order events correctly
  • Stateful Computation — maintain and query application state
  • Flink SQL — SQL on streams for analytics and ETL
  • Savepoints — snapshot and restore application state for upgrades
  • Windowing — tumbling, sliding, session, and custom windows
  • CDC — capture database changes as streams (Debezium integration)
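As a minimal sketch of the windowing item above (plain Python, mirroring the fixed-bucket arithmetic tumbling windows use): an event's window is fully determined by its timestamp, with no overlap between buckets.

```python
WINDOW_MS = 60_000  # 1-minute tumbling windows, as in the SQL example above

def window_start(ts_ms):
    # Tumbling windows partition time into fixed, non-overlapping buckets;
    # round the timestamp down to the nearest window boundary.
    return ts_ms - (ts_ms % WINDOW_MS)

starts = [window_start(t) for t in [5_000, 59_999, 60_000, 125_000]]
# starts == [0, 0, 60000, 120000]
```

Sliding windows generalize this by letting an event fall into several overlapping buckets; session windows are instead bounded by gaps of inactivity.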

Comparison with Similar Tools

| Feature          | Flink              | Spark Streaming      | Kafka Streams          | Pulsar Functions |
|------------------|--------------------|----------------------|------------------------|------------------|
| Processing model | True streaming     | Micro-batch          | Per-record             | Per-record       |
| Latency          | Milliseconds       | Seconds              | Milliseconds           | Milliseconds     |
| State management | Built-in (RocksDB) | Limited              | Built-in               | Limited          |
| SQL support      | Flink SQL          | Spark SQL            | ksqlDB                 | Pulsar SQL       |
| Exactly-once     | Yes                | Yes                  | Yes                    | Yes              |
| Standalone mode  | Yes                | Part of Spark        | Embedded (no cluster)  | Part of Pulsar   |
| Best for         | Complex streaming  | Batch + streaming    | Simple streaming       | Pulsar-native    |
| GitHub stars     | 26K                | Part of Spark (43K)  | Part of Kafka          | Part of Pulsar   |

FAQ

Q: Flink vs Spark Streaming — which should I choose? A: Flink for low-latency requirements, complex event processing, and true event-time semantics. Spark Structured Streaming for teams already using Spark and when second-level latency is acceptable.

Q: Does Flink only do streaming? A: No. Flink handles both streaming and batch processing with the same API. Batch is treated as a special case of streaming (bounded streams).

Q: How does Flink handle failures? A: Flink periodically checkpoints application state to durable storage (S3, HDFS). On failure, it restores state from the latest checkpoint and replays events from the source (e.g., Kafka offsets).
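The checkpoint-and-replay recovery described above can be sketched as a toy model (plain Python, not the Flink API): a checkpoint pairs operator state with the source offset, so after a failure the job restores the snapshot and replays only the events after it.

```python
# Toy model of checkpoint-and-replay recovery (hypothetical event data):
events = [3, 1, 4, 1, 5, 9]

state, offset = 0, 0
checkpoint = None
for i, v in enumerate(events):
    state += v
    offset = i + 1
    if offset == 3:              # periodic snapshot: (state, source offset)
        checkpoint = (state, offset)
    if offset == 5:              # simulated crash after the fifth event
        break

state, offset = checkpoint       # restore the latest checkpoint
for v in events[offset:]:        # replay from the saved source offset
    state += v

assert state == sum(events)      # totals come out exactly-once correct
```

This is why Flink's exactly-once guarantee depends on a replayable source such as Kafka: the source must be able to rewind to the checkpointed offset.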

Q: Is there a managed Flink service? A: Yes. AWS Managed Flink, Confluent Cloud (Flink), and Ververica Platform provide managed Flink clusters.
