# Flink CDC — Real-Time Change Data Capture for Apache Flink > Flink CDC is a streaming data integration framework built on Apache Flink. It captures row-level changes from databases like MySQL, PostgreSQL, and MongoDB in real time and delivers them as Flink DataStreams for processing, transformation, and synchronization. ## Install Save in your project root: # Flink CDC — Real-Time Change Data Capture for Apache Flink ## Quick Use ```bash # Download Flink CDC connectors wget https://repo1.maven.org/maven2/org/apache/flink/flink-cdc-pipeline-connector-mysql/3.3.0/flink-cdc-pipeline-connector-mysql-3.3.0.jar # Place in Flink lib/ and submit a CDC pipeline via YAML bin/flink-cdc.sh mysql-to-doris.yaml ``` ## Introduction Flink CDC brings change-data-capture into the Apache Flink ecosystem. It reads database transaction logs (binlog, WAL, oplog) and converts them into Flink streams, enabling real-time ETL, data lake ingestion, and cross-database synchronization without custom glue code. ## What Flink CDC Does - Reads binlog/WAL/oplog from MySQL, PostgreSQL, MongoDB, Oracle, SQL Server, and more - Delivers insert, update, and delete events as structured Flink DataStreams - Supports full snapshot followed by continuous incremental capture in a single job - Provides a YAML-based pipeline definition for codeless database-to-database sync - Handles schema evolution by propagating DDL changes downstream automatically ## Architecture Overview Flink CDC connectors embed Debezium engines within Flink source operators. On startup a snapshot reader performs a parallel chunked scan of existing data, then hands off to a binlog reader for ongoing changes. Events are checkpointed using Flink exactly-once semantics so no data is lost or duplicated, even across restarts. ## Self-Hosting & Configuration - Deploy Apache Flink 1.18+ and add the appropriate CDC connector JARs to the lib directory - Configure source database credentials and binlog/WAL access permissions - Define a pipeline in YAML or write a Flink job in Java specifying source tables and sink targets - Tune parallelism and checkpoint intervals for throughput and latency requirements - Monitor via the Flink Web UI or integrate with Prometheus metrics ## Key Features - Exactly-once processing semantics for reliable data delivery - Parallel snapshot reading using table chunk splitting for fast initial loads - Schema evolution support propagates ALTER TABLE changes to downstream sinks - Codeless YAML pipeline mode for common sync scenarios - Compatible with the full Apache Flink ecosystem including SQL, Table API, and DataStream API ## Comparison with Similar Tools - **Debezium** — Standalone CDC platform using Kafka Connect; Flink CDC embeds Debezium inside Flink for tighter integration - **Airbyte** — General ELT platform with CDC connectors, but batch-oriented rather than continuous streaming - **AWS DMS** — Managed CDC service locked to AWS, not open source - **Canal** — Alibaba MySQL binlog reader focused on MySQL-only use cases - **Maxwell** — Lightweight MySQL-only binlog reader that writes to Kafka; no built-in transformation ## FAQ **Q: Do I need Kafka to use Flink CDC?** A: No. Flink CDC reads database logs directly without requiring an intermediate message queue. **Q: Which databases are supported?** A: MySQL, PostgreSQL, MongoDB, Oracle, SQL Server, Db2, OceanBase, TiDB, and Vitess, with the list growing. **Q: Can Flink CDC handle schema changes automatically?** A: Yes. The pipeline mode can propagate DDL changes like column additions to supported sinks. **Q: What is the difference between Flink CDC and Debezium?** A: Flink CDC uses Debezium internally but runs inside the Flink runtime, giving you access to Flink SQL, exactly-once checkpointing, and the full Flink ecosystem. ## Sources - https://github.com/apache/flink-cdc - https://nightlies.apache.org/flink/flink-cdc-docs-stable/ --- Source: https://tokrepo.com/en/workflows/asset-9ef19846 Author: AI Open Source