Scripts · April 13, 2026 · 1 min read

Apache Flink — Stream Processing Framework for Real-Time Data

Apache Flink is the leading open-source framework for stateful stream processing. It processes unbounded data streams with exactly-once semantics, low latency, and high throughput — powering real-time analytics, fraud detection, and event-driven applications.

Script Depot · Community
Quick Start

Use it first, then decide whether to dig deeper.

This section should let both users and agents know what to copy first, what to install, and where it lands.

# Install Flink (local mode)
wget https://dlcdn.apache.org/flink/flink-1.19.0/flink-1.19.0-bin-scala_2.12.tgz
tar xzf flink-1.19.0-bin-scala_2.12.tgz
cd flink-1.19.0

# Start local cluster
./bin/start-cluster.sh
# Web UI at http://localhost:8081

# Run SQL client
./bin/sql-client.sh
# > SELECT name, COUNT(*) FROM orders GROUP BY name;

# PyFlink
pip install apache-flink

Introduction

Apache Flink is purpose-built for stream processing. While Spark added streaming as an afterthought (micro-batching), Flink was designed from the ground up for continuous, stateful computation over unbounded data streams. It provides true event-time processing, exactly-once state consistency, and millisecond latency.

With over 26,000 GitHub stars, Flink powers real-time systems at Alibaba (processing billions of events per day), Netflix, Uber, Apple, and thousands of companies. It is the de facto standard for applications requiring low-latency, high-throughput stream processing.

What Flink Does

Flink processes continuous streams of data with rich transformations: windowing, joins, aggregations, pattern detection, and complex event processing. It maintains state across events (counters, ML models, session data) with exactly-once guarantees, and can recover state from failures via checkpointing.
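
The idea of per-key state updated one event at a time can be sketched in plain Python. This is a conceptual model, not the PyFlink API: each key gets its own state slot, updated the way a KeyedProcessFunction would update a ValueState counter.

```python
from collections import defaultdict

def process_stream(events):
    """Toy model of Flink keyed state: one running total per key."""
    state = defaultdict(int)              # per-key state, analogous to ValueState
    output = []
    for key, amount in events:            # events arrive one-by-one, not in batches
        state[key] += amount              # update state for this key only
        output.append((key, state[key]))  # emit the running total downstream
    return output

totals = process_stream([("a", 10), ("b", 5), ("a", 7)])
# running totals: a=10, b=5, a=17
```

In real Flink, this state is partitioned across the cluster by key and snapshotted by the checkpointing mechanism described below.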

Architecture Overview

[Data Sources]
Kafka, Kinesis, Files,
Databases (CDC), Sockets
        |
   [Flink Application]
   DataStream API or SQL
        |
   [Stream Processing]
   Event-time windows
   State management
   Exactly-once semantics
        |
+-------+-------+
|       |       |
[Flink SQL] [DataStream API]
Declarative  Programmatic
SQL queries  Java/Python
on streams   transformations
        |
   [Checkpointing]
   Periodic state snapshots
   for fault tolerance
        |
   [Sinks]
   Kafka, Elasticsearch,
   Databases, Files, S3

Self-Hosting & Configuration

# PyFlink example: real-time aggregation
from pyflink.table import EnvironmentSettings, TableEnvironment

env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Define Kafka source
env.execute_sql("""
    CREATE TABLE orders (
        order_id STRING,
        product STRING,
        amount DECIMAL(10,2),
        order_time TIMESTAMP(3),
        WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# Tumbling window aggregation
env.execute_sql("""
    SELECT
        product,
        TUMBLE_START(order_time, INTERVAL '1' MINUTE) as window_start,
        COUNT(*) as order_count,
        SUM(amount) as total_amount
    FROM orders
    GROUP BY product, TUMBLE(order_time, INTERVAL '1' MINUTE)
""").print()

Key Features

  • True Streaming — processes events one-by-one, not micro-batches
  • Exactly-Once — guaranteed state consistency across failures
  • Event-Time Processing — handle out-of-order events correctly
  • Stateful Computation — maintain and query application state
  • Flink SQL — SQL on streams for analytics and ETL
  • Savepoints — snapshot and restore application state for upgrades
  • Windowing — tumbling, sliding, session, and custom windows
  • CDC — capture database changes as streams (Debezium integration)
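
Event-time processing hinges on watermarks. The `WATERMARK ... - INTERVAL '5' SECOND` clause in the earlier SQL example declares bounded out-of-orderness; a minimal pure-Python sketch (not the Flink API) of that strategy:

```python
def watermarks(event_times_ms, max_delay_ms=5000):
    """Bounded-out-of-orderness watermarking: the watermark trails the
    highest event timestamp seen so far by the allowed delay, so events
    up to max_delay_ms late can still join their window."""
    max_ts = float("-inf")
    out = []
    for ts in event_times_ms:
        max_ts = max(max_ts, ts)          # watermarks never move backwards
        out.append(max_ts - max_delay_ms)
    return out

wm = watermarks([10_000, 8_000, 15_000])  # the 8 s event is late but within bounds
```

A window fires once the watermark passes its end; events later than the bound are dropped or routed to a side output.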

Comparison with Similar Tools

| Feature | Flink | Spark Streaming | Kafka Streams | Pulsar Functions |
|---|---|---|---|---|
| Processing Model | True streaming | Micro-batch | Per-record | Per-record |
| Latency | Milliseconds | Seconds | Milliseconds | Milliseconds |
| State Management | Built-in (RocksDB) | Limited | Built-in | Limited |
| SQL Support | Flink SQL | Spark SQL | ksqlDB | Pulsar SQL |
| Exactly-Once | Yes | Yes | Yes | Yes |
| Standalone Mode | Yes | Part of Spark | Embedded (no cluster) | Part of Pulsar |
| Best For | Complex streaming | Batch + streaming | Simple streaming | Pulsar-native |
| GitHub Stars | 26K | Part of Spark (43K) | Part of Kafka | Part of Pulsar |

FAQ

Q: Flink vs Spark Streaming — which should I choose? A: Flink for low-latency requirements, complex event processing, and true event-time semantics. Spark Structured Streaming for teams already using Spark and when second-level latency is acceptable.

Q: Does Flink only do streaming? A: No. Flink handles both streaming and batch processing with the same API. Batch is treated as a special case of streaming (bounded streams).

Q: How does Flink handle failures? A: Flink periodically checkpoints application state to durable storage (S3, HDFS). On failure, it restores state from the latest checkpoint and replays events from the source (e.g., Kafka offsets).
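
The checkpoint-and-replay cycle can be illustrated with a toy model (plain Python, not the Flink API): state and the source offset are snapshotted together, so after a crash the job restores the last snapshot and re-reads the source from that offset, counting each event exactly once in the final result.

```python
def run_with_recovery(events, fail_at=None):
    """Toy model of checkpointing: snapshot (offset, total) every 2 events;
    on failure, restore the snapshot and replay from the saved offset."""
    checkpoint = {"offset": 0, "total": 0}
    total, offset = 0, 0
    while offset < len(events):
        if fail_at is not None and offset == fail_at:
            total = checkpoint["total"]       # restore state from snapshot
            offset = checkpoint["offset"]     # rewind the source (e.g. Kafka offset)
            fail_at = None                    # recovered; don't fail again
            continue
        total += events[offset]
        offset += 1
        if offset % 2 == 0:                   # periodic checkpoint
            checkpoint = {"offset": offset, "total": total}
    return total

# With or without a mid-stream failure, the result is the same: exactly-once.
clean = run_with_recovery([1, 2, 3, 4])
recovered = run_with_recovery([1, 2, 3, 4], fail_at=3)
```

Real Flink checkpoints are asynchronous distributed snapshots written to durable storage, but the contract is the same: state and source position are restored as one consistent unit.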

Q: Is there a managed Flink service? A: Yes. AWS Managed Flink, Confluent Cloud (Flink), and Ververica Platform provide managed Flink clusters.
