# Apache Flink — Stream Processing Framework for Real-Time Data

> Apache Flink is the leading open-source framework for stateful stream processing. It processes unbounded data streams with exactly-once semantics, low latency, and high throughput — powering real-time analytics, fraud detection, and event-driven applications.

## Quick Use

```bash
# Install Flink (local mode)
wget https://dlcdn.apache.org/flink/flink-1.19.0/flink-1.19.0-bin-scala_2.12.tgz
tar xzf flink-1.19.0-bin-scala_2.12.tgz
cd flink-1.19.0

# Start local cluster
./bin/start-cluster.sh
# Web UI at http://localhost:8081

# Run SQL client
./bin/sql-client.sh
# > SELECT name, COUNT(*) FROM orders GROUP BY name;

# PyFlink
pip install apache-flink
```

## Introduction

Apache Flink is purpose-built for stream processing. While Spark added streaming as an afterthought (micro-batching), Flink was designed from the ground up for continuous, stateful computation over unbounded data streams. It provides true event-time processing, exactly-once state consistency, and millisecond latency.

With over 26,000 GitHub stars, Flink powers real-time systems at Alibaba (processing billions of events per day), Netflix, Uber, Apple, and thousands of other companies. It is the de facto standard for applications requiring low-latency, high-throughput stream processing.

## What Flink Does

Flink processes continuous streams of data with rich transformations: windowing, joins, aggregations, pattern detection, and complex event processing. It maintains state across events (counters, ML models, session data) with exactly-once guarantees, and can recover that state after failures via checkpointing.
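The two core ideas above — event-time windowing and per-key state — can be sketched without a cluster. The toy aggregator below is plain Python, not Flink: it buckets events into one-minute tumbling windows by their event-time timestamp and keeps a per-(key, window) count, loosely mirroring a `GROUP BY product, TUMBLE(...)` query. All names are illustrative.

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute tumbling windows

def window_start(event_time_ms: int) -> int:
    """Align an event-time timestamp to the start of its tumbling window."""
    return event_time_ms - (event_time_ms % WINDOW_MS)

def aggregate(events):
    """Count events per (product, window) — a toy model of
    GROUP BY product, TUMBLE(order_time, INTERVAL '1' MINUTE)."""
    counts = defaultdict(int)  # state: (product, window_start) -> count
    for product, event_time_ms in events:
        counts[(product, window_start(event_time_ms))] += 1
    return dict(counts)

# Out-of-order events still land in the correct window, because
# grouping uses event time rather than arrival order.
events = [
    ("widget", 5_000),   # window [0, 60000)
    ("gadget", 61_000),  # window [60000, 120000)
    ("widget", 59_999),  # arrives late, still window [0, 60000)
]
print(aggregate(events))
# {('widget', 0): 2, ('gadget', 60000): 1}
```

What this sketch leaves out is exactly what Flink provides in production: watermarks to decide when a window is complete, fault-tolerant state backends, and exactly-once recovery.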
## Architecture Overview

```
            [Data Sources]
  Kafka, Kinesis, Files, Databases (CDC), Sockets
                  |
          [Flink Application]
          DataStream API or SQL
                  |
          [Stream Processing]
           Event-time windows
           State management
         Exactly-once semantics
                  |
          +-------+-------+
          |               |
     [Flink SQL]   [DataStream API]
     Declarative    Programmatic
     SQL queries    Java/Python
     on streams     transformations
                  |
           [Checkpointing]
  Periodic state snapshots for fault tolerance
                  |
              [Sinks]
  Kafka, Elasticsearch, Databases, Files, S3
```

## Self-Hosting & Configuration

```python
# PyFlink example: real-time aggregation
from pyflink.table import EnvironmentSettings, TableEnvironment

env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Define a Kafka source with an event-time watermark
env.execute_sql("""
    CREATE TABLE orders (
        order_id STRING,
        product STRING,
        amount DECIMAL(10, 2),
        order_time TIMESTAMP(3),
        WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'localhost:9092',
        'format' = 'json'
    )
""")

# Tumbling window aggregation: per-product counts and totals per minute
env.execute_sql("""
    SELECT
        product,
        TUMBLE_START(order_time, INTERVAL '1' MINUTE) AS window_start,
        COUNT(*) AS order_count,
        SUM(amount) AS total_amount
    FROM orders
    GROUP BY product, TUMBLE(order_time, INTERVAL '1' MINUTE)
""").print()
```

## Key Features

- **True Streaming** — processes events one-by-one, not micro-batches
- **Exactly-Once** — guaranteed state consistency across failures
- **Event-Time Processing** — handle out-of-order events correctly
- **Stateful Computation** — maintain and query application state
- **Flink SQL** — SQL on streams for analytics and ETL
- **Savepoints** — snapshot and restore application state for upgrades
- **Windowing** — tumbling, sliding, session, and custom windows
- **CDC** — capture database changes as streams (Debezium integration)

## Comparison with Similar Tools

| Feature | Flink | Spark Streaming | Kafka Streams | Pulsar Functions |
|---|---|---|---|---|
| Processing Model | True streaming | Micro-batch | Per-record | Per-record |
| Latency | Milliseconds | Seconds | Milliseconds | Milliseconds |
| State Management | Built-in (RocksDB) | Limited | Built-in | Limited |
| SQL Support | Flink SQL | Spark SQL | ksqlDB | Pulsar SQL |
| Exactly-Once | Yes | Yes | Yes | Yes |
| Standalone Mode | Yes | Part of Spark | Embedded (no cluster) | Part of Pulsar |
| Best For | Complex streaming | Batch + streaming | Simple streaming | Pulsar-native |
| GitHub Stars | 26K | Part of Spark (43K) | Part of Kafka | Part of Pulsar |

## FAQ

**Q: Flink vs Spark Streaming — which should I choose?**
A: Flink for low-latency requirements, complex event processing, and true event-time semantics. Spark Structured Streaming for teams already using Spark and when second-level latency is acceptable.

**Q: Does Flink only do streaming?**
A: No. Flink handles both streaming and batch processing with the same API. Batch is treated as a special case of streaming (bounded streams).

**Q: How does Flink handle failures?**
A: Flink periodically checkpoints application state to durable storage (S3, HDFS). On failure, it restores state from the latest checkpoint and replays events from the source (e.g., Kafka offsets).

**Q: Is there a managed Flink service?**
A: Yes. Amazon Managed Service for Apache Flink, Confluent Cloud (Flink), and Ververica Platform provide managed Flink clusters.

## Sources

- GitHub: https://github.com/apache/flink
- Documentation: https://flink.apache.org
- Created at TU Berlin; Apache Top-Level Project
- License: Apache-2.0

---

Source: https://tokrepo.com/en/workflows/8cf8efc6-3734-11f1-9bc6-00163e2b0d79
Author: Script Depot