# Flink CDC — Real-Time Change Data Capture for Apache Flink

> Flink CDC is a streaming data integration framework built on Apache Flink. It captures row-level changes from databases like MySQL, PostgreSQL, and MongoDB in real time and delivers them as Flink DataStreams for processing, transformation, and synchronization.

## Quick Use
```bash
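# Download the MySQL pipeline connector (the sink needs its own connector JAR too)
wget https://repo1.maven.org/maven2/org/apache/flink/flink-cdc-pipeline-connector-mysql/3.3.0/flink-cdc-pipeline-connector-mysql-3.3.0.jar
# Place connector JARs in the Flink CDC distribution's lib/ directory,
# then submit the pipeline defined in YAML (FLINK_HOME must point at your Flink install)
bin/flink-cdc.sh mysql-to-doris.yaml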
```
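
The `mysql-to-doris.yaml` file passed to `flink-cdc.sh` is not shown here. A minimal sketch of what it might contain, written out with a shell heredoc, is below. The hostnames, credentials, the `app_db` database name, and the server-id range are placeholder assumptions. The option names follow the MySQL and Doris pipeline connector settings as commonly documented, so verify them against the release you run.

```bash
# Write a minimal pipeline definition (all values are placeholders, not defaults)
cat > mysql-to-doris.yaml <<'EOF'
source:
  type: mysql
  hostname: localhost
  port: 3306
  username: root
  password: "change-me"
  # Regex-style table selector: every table in the assumed database app_db
  tables: app_db.\.*
  server-id: 5400-5404
  server-time-zone: UTC

sink:
  type: doris
  fenodes: 127.0.0.1:8030
  username: root
  password: ""

pipeline:
  name: Sync MySQL Database to Doris
  parallelism: 2
EOF
```

The `source` and `sink` blocks each select a connector by `type`, so the matching connector JAR for both sides has to be available before the job is submitted.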

## Introduction
Flink CDC brings change-data-capture into the Apache Flink ecosystem. It reads database transaction logs (binlog, WAL, oplog) and converts them into Flink streams, enabling real-time ETL, data lake ingestion, and cross-database synchronization without custom glue code.

## What Flink CDC Does
- Reads binlog/WAL/oplog from MySQL, PostgreSQL, MongoDB, Oracle, SQL Server, and more
- Delivers insert, update, and delete events as structured Flink DataStreams
- Supports full snapshot followed by continuous incremental capture in a single job
- Provides a YAML-based pipeline definition for codeless database-to-database sync
- Handles schema evolution by propagating DDL changes to downstream sinks that support them
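
As an illustration of the schema evolution point, pipeline mode exposes a setting that controls how upstream DDL is handled. The sketch below writes only the `pipeline:` section of a definition. The file name is hypothetical, and the `schema.change.behavior` option and its values are given as I understand them from the pipeline documentation, so treat them as assumptions to check against your release.

```bash
# Hypothetical file name; in practice this is the `pipeline:` section of your pipeline YAML
cat > pipeline-section.yaml <<'EOF'
pipeline:
  name: Sync with schema evolution
  parallelism: 2
  # Assumed option: how upstream schema changes are applied downstream.
  # Values described in the docs include exception, evolve, try_evolve, lenient, ignore.
  schema.change.behavior: evolve
EOF
```

A stricter value such as `exception` would make the job fail on any schema change instead of applying it, which can be preferable when downstream tables are managed by hand.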

## Architecture Overview
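Most Flink CDC connectors embed a Debezium engine inside a Flink source operator. On startup, a snapshot reader scans existing data in parallel, chunk by chunk, and then hands off to a log reader (the binlog reader for MySQL) that streams ongoing changes. Source offsets are stored in Flink checkpoints, so after a restart the job resumes from the last checkpointed position with exactly-once state consistency. End-to-end exactly-once delivery additionally depends on the sink being idempotent or transactional.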
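
To make the chunked snapshot scan concrete, here is a simplified sketch of splitting an integer primary-key range into fixed-size chunks, each of which could be read by a separate parallel reader. The `split_chunks` helper and the `id` column are hypothetical and this is not the connector's actual algorithm, which also handles non-numeric keys, uneven key distributions, and open-ended first and last chunks. In the MySQL connector the chunk size is governed by an option such as `scan.incremental.snapshot.chunk.size`.

```bash
# Hypothetical helper: print one WHERE predicate per chunk for keys in [min, max]
split_chunks() {
  local min=$1 max=$2 size=$3
  local lo=$min hi
  while [ "$lo" -le "$max" ]; do
    hi=$((lo + size))
    if [ "$hi" -gt "$max" ]; then
      # Last chunk: close the range at max
      echo "id >= $lo AND id <= $max"
    else
      echo "id >= $lo AND id < $hi"
    fi
    lo=$hi
  done
}

split_chunks 1 10 4
# id >= 1 AND id < 5
# id >= 5 AND id < 9
# id >= 9 AND id <= 10
```

Because the chunks do not overlap, each one can be scanned independently, and a failed chunk can be retried without rereading the whole table.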

## Self-Hosting & Configuration
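- Deploy Apache Flink 1.18+ (check the compatibility table for your Flink CDC release) and add the connector JARs: pipeline connectors go in the Flink CDC `lib/` directory, SQL and DataStream source connectors in Flink's `lib/` directory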
- Configure source database credentials and binlog/WAL access permissions
- Define a pipeline in YAML or write a Flink job in Java specifying source tables and sink targets
- Tune parallelism and checkpoint intervals for throughput and latency requirements
- Monitor via the Flink Web UI or integrate with Prometheus metrics
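
Two of the steps above, database permissions and checkpoint tuning, can be sketched in shell for a MySQL source. The user name `flink_cdc`, its password, the `./flink` fallback path, and the 3-second interval are placeholder assumptions. The configuration file is `flink-conf.yaml` in older Flink releases and `config.yaml` in newer ones, so adjust `CONF_FILE` to match your installation.

```bash
set -eu

# Assumption: FLINK_HOME points at your Flink installation (falls back to ./flink here)
FLINK_HOME="${FLINK_HOME:-./flink}"
CONF_FILE="$FLINK_HOME/conf/flink-conf.yaml"
mkdir -p "$FLINK_HOME/conf"

# 1. SQL that creates a dedicated MySQL user with the privileges needed to
#    read table data and the binlog (names and password are placeholders)
cat > cdc-user.sql <<'EOF'
CREATE USER 'flink_cdc'@'%' IDENTIFIED BY 'change-me';
GRANT SELECT, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'flink_cdc'@'%';
FLUSH PRIVILEGES;
EOF
# Apply it yourself, for example: mysql -u root -p < cdc-user.sql

# 2. Enable periodic checkpoints unless an interval is already configured
grep -q '^execution.checkpointing.interval' "$CONF_FILE" 2>/dev/null \
  || echo "execution.checkpointing.interval: 3s" >> "$CONF_FILE"
```

A shorter checkpoint interval reduces how much of the log is replayed after a failure, at the cost of more frequent state snapshots. Parallelism is set per job, for example through the `parallelism` key of a pipeline definition.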

## Key Features
- Exactly-once processing semantics for reliable data delivery
- Parallel snapshot reading using table chunk splitting for fast initial loads
- Schema evolution support that propagates ALTER TABLE changes to downstream sinks
- Codeless YAML pipeline mode for common sync scenarios
- Compatible with the full Apache Flink ecosystem including SQL, Table API, and DataStream API

## Comparison with Similar Tools
- **Debezium** — Standalone CDC platform using Kafka Connect; Flink CDC embeds Debezium inside Flink for tighter integration
- **Airbyte** — General ELT platform with CDC connectors, but batch-oriented rather than continuous streaming
- **AWS DMS** — Managed CDC service locked to AWS, not open source
- **Canal** — Alibaba MySQL binlog reader focused on MySQL-only use cases
- **Maxwell** — Lightweight MySQL-only binlog reader that writes to Kafka; no built-in transformation

## FAQ
**Q: Do I need Kafka to use Flink CDC?**
A: No. Flink CDC reads database logs directly without requiring an intermediate message queue.

**Q: Which databases are supported?**
A: MySQL, PostgreSQL, MongoDB, Oracle, SQL Server, Db2, OceanBase, TiDB, and Vitess, with the list growing.

**Q: Can Flink CDC handle schema changes automatically?**
A: Yes. The pipeline mode can propagate DDL changes like column additions to supported sinks.

**Q: What is the difference between Flink CDC and Debezium?**
A: Flink CDC uses Debezium internally but runs inside the Flink runtime, giving you access to Flink SQL, exactly-once checkpointing, and the full Flink ecosystem.

## Sources
- https://github.com/apache/flink-cdc
- https://nightlies.apache.org/flink/flink-cdc-docs-stable/

---
Source: https://tokrepo.com/en/workflows/asset-9ef19846
Author: AI Open Source