Introduction
Apache Flink is purpose-built for stream processing. While Spark added streaming as an afterthought (micro-batching), Flink was designed from the ground up for continuous, stateful computation over unbounded data streams. It provides true event-time processing, exactly-once state consistency, and millisecond latency.
With over 26,000 GitHub stars, Flink powers real-time systems at Alibaba (processing billions of events per day), Netflix, Uber, Apple, and thousands of companies. It is the de facto standard for applications requiring low-latency, high-throughput stream processing.
What Flink Does
Flink processes continuous streams of data with rich transformations: windowing, joins, aggregations, pattern detection, and complex event processing. It maintains state across events (counters, ML models, session data) with exactly-once guarantees, and can recover state from failures via checkpointing.
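The windowing and event-time ideas can be illustrated with a simplified pure-Python sketch (a toy model, not the Flink API): events are assigned to one-minute tumbling windows by their own timestamps, so the arrival order does not change the result.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute tumbling windows

def window_start(event_time_ms):
    # Tumbling windows partition time into fixed, non-overlapping buckets.
    return event_time_ms - (event_time_ms % WINDOW_MS)

def aggregate(events):
    # events: (product, amount, event_time_ms) tuples; arrival order is
    # irrelevant because assignment uses the event's own timestamp.
    totals = defaultdict(float)
    for product, amount, ts in events:
        totals[(product, window_start(ts))] += amount
    return dict(totals)

events = [
    ("widget", 10.0, 65_000),  # falls in the window starting at 60_000
    ("widget", 5.0, 10_000),   # late-arriving, still lands in window 0
    ("widget", 2.5, 30_000),   # window starting at 0
]
print(aggregate(events))  # {('widget', 60000): 10.0, ('widget', 0): 7.5}
```

This is the same grouping the Flink SQL `TUMBLE` example later in this article performs, minus state backends, watermarks, and fault tolerance.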
Architecture Overview
```
[Data Sources]
  Kafka, Kinesis, Files,
  Databases (CDC), Sockets
        |
[Flink Application]
  DataStream API or SQL
        |
[Stream Processing]
  Event-time windows
  State management
  Exactly-once semantics
        |
   +----+----+
   |         |
[Flink SQL]       [DataStream API]
  Declarative       Programmatic
  SQL queries       Java/Python
  on streams        transformations
        |
[Checkpointing]
  Periodic state snapshots
  for fault tolerance
        |
[Sinks]
  Kafka, Elasticsearch,
  Databases, Files, S3
```

Self-Hosting & Configuration
```python
# PyFlink example: real-time aggregation
from pyflink.table import EnvironmentSettings, TableEnvironment

env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Define a Kafka source with an event-time watermark
env.execute_sql("""
    CREATE TABLE orders (
        order_id STRING,
        product STRING,
        amount DECIMAL(10, 2),
        order_time TIMESTAMP(3),
        WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# Tumbling window aggregation: per-product counts and totals per minute
env.execute_sql("""
    SELECT
        product,
        TUMBLE_START(order_time, INTERVAL '1' MINUTE) AS window_start,
        COUNT(*) AS order_count,
        SUM(amount) AS total_amount
    FROM orders
    GROUP BY product, TUMBLE(order_time, INTERVAL '1' MINUTE)
""").print()
```

Key Features
- True Streaming — processes events one-by-one, not micro-batches
- Exactly-Once — guaranteed state consistency across failures
- Event-Time Processing — handle out-of-order events correctly
- Stateful Computation — maintain and query application state
- Flink SQL — SQL on streams for analytics and ETL
- Savepoints — snapshot and restore application state for upgrades
- Windowing — tumbling, sliding, session, and custom windows
- CDC — capture database changes as streams (Debezium integration)
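How event-time processing tolerates out-of-order events can be sketched in plain Python (a simplified model, not Flink's watermark API): the watermark trails the maximum timestamp seen so far by a bounded delay, and a window may fire only once the watermark passes its end.

```python
class BoundedOutOfOrdernessWatermark:
    """Toy model of a bounded-out-of-orderness watermark generator."""

    def __init__(self, max_delay_ms):
        self.max_delay_ms = max_delay_ms
        self.max_seen = float("-inf")

    def on_event(self, event_time_ms):
        # The watermark asserts: no event with a smaller timestamp
        # is expected to arrive anymore.
        self.max_seen = max(self.max_seen, event_time_ms)
        return self.max_seen - self.max_delay_ms

wm = BoundedOutOfOrdernessWatermark(max_delay_ms=5_000)
print(wm.on_event(12_000))  # 7000
print(wm.on_event(9_000))   # 7000 — out-of-order event does not move it back
print(wm.on_event(20_000))  # 15000 — windows ending at or before 15_000 may fire
```

This mirrors the `WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND` clause in the SQL example above: events up to five seconds late are still assigned to their correct windows.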
Comparison with Similar Tools
| Feature | Flink | Spark Streaming | Kafka Streams | Pulsar Functions |
|---|---|---|---|---|
| Processing Model | True streaming | Micro-batch | Per-record | Per-record |
| Latency | Milliseconds | Seconds | Milliseconds | Milliseconds |
| State Management | Built-in (RocksDB) | Limited | Built-in | Limited |
| SQL Support | Flink SQL | Spark SQL | ksqlDB | Pulsar SQL |
| Exactly-Once | Yes | Yes | Yes | Yes |
| Standalone Mode | Yes | Part of Spark | Embedded (no cluster) | Part of Pulsar |
| Best For | Complex streaming | Batch + streaming | Simple streaming | Pulsar-native |
| GitHub Stars | 26K | Part of Spark (43K) | Part of Kafka | Part of Pulsar |
FAQ
Q: Flink vs Spark Streaming — which should I choose? A: Flink for low-latency requirements, complex event processing, and true event-time semantics. Spark Structured Streaming for teams already using Spark and when second-level latency is acceptable.
Q: Does Flink only do streaming? A: No. Flink handles both streaming and batch processing with the same API. Batch is treated as a special case of streaming (bounded streams).
Q: How does Flink handle failures? A: Flink periodically checkpoints application state to durable storage (S3, HDFS). On failure, it restores state from the latest checkpoint and replays events from the source (e.g., Kafka offsets).
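The recovery loop can be mimicked in a few lines of plain Python (a toy model, not Flink's checkpointing API): snapshot the running state together with the source offset, and on failure restore both and replay only the events after that offset.

```python
# Toy model of checkpoint-and-replay recovery.
stream = [3, 1, 4, 1, 5, 9, 2, 6]  # stands in for a replayable source (e.g. Kafka)

def run(events, checkpoint_every=3):
    state, checkpoint = 0, (0, 0)  # running sum; (state, offset) snapshot
    for i, value in enumerate(events):
        state += value
        if (i + 1) % checkpoint_every == 0:
            # Snapshot state and source offset together, atomically:
            # this pairing is what makes recovery exactly-once.
            checkpoint = (state, i + 1)
    return checkpoint

def recover(events, checkpoint):
    # Restore the snapshot, then replay only events after the saved offset.
    state, offset = checkpoint
    for value in events[offset:]:
        state += value
    return state

cp = run(stream)                      # (23, 6): sum of first 6 events, offset 6
print(recover(stream, cp))            # 31 — identical to an uninterrupted run
print(recover(stream, cp) == sum(stream))
```

Events between the checkpointed offset and the failure point are processed twice internally, but because state is rolled back to the snapshot, their effect is counted exactly once.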
Q: Is there a managed Flink service? A: Yes. Amazon Managed Service for Apache Flink, Confluent Cloud for Apache Flink, and Ververica Platform all provide managed Flink clusters.
Sources
- GitHub: https://github.com/apache/flink
- Documentation: https://flink.apache.org
- Created at TU Berlin, Apache Top-Level Project
- License: Apache-2.0