How do I install Flink CDC — Real-Time Change Data Capture for Apache Flink?

Visit the asset page on TokRepo and click "Copy for agent" to get the installation instructions. Most assets can be installed with a single command.

Flink CDC — Real-Time Change Data Capture for Apache Flink

Introduction

Flink CDC brings change-data-capture into the Apache Flink ecosystem. It reads database transaction logs (binlog, WAL, oplog) and converts them into Flink streams, enabling real-time ETL, data lake ingestion, and cross-database synchronization without custom glue code.

What Flink CDC Does

Reads binlog/WAL/oplog from MySQL, PostgreSQL, MongoDB, Oracle, SQL Server, and more
Delivers insert, update, and delete events as structured Flink DataStreams
Supports full snapshot followed by continuous incremental capture in a single job
Provides a YAML-based pipeline definition for codeless database-to-database sync
Handles schema evolution by propagating DDL changes downstream automatically

Architecture Overview

Flink CDC connectors embed Debezium engines within Flink source operators. On startup a snapshot reader performs a parallel chunked scan of existing data, then hands off to a binlog reader for ongoing changes. Events are checkpointed using Flink exactly-once semantics so no data is lost or duplicated, even across restarts.

Self-Hosting & Configuration

Deploy Apache Flink 1.18+ and add the appropriate CDC connector JARs to the lib directory
Configure source database credentials and binlog/WAL access permissions
Define a pipeline in YAML or write a Flink job in Java specifying source tables and sink targets
Tune parallelism and checkpoint intervals for throughput and latency requirements
Monitor via the Flink Web UI or integrate with Prometheus metrics

Key Features

Exactly-once processing semantics for reliable data delivery
Parallel snapshot reading using table chunk splitting for fast initial loads
Schema evolution support propagates ALTER TABLE changes to downstream sinks
Codeless YAML pipeline mode for common sync scenarios
Compatible with the full Apache Flink ecosystem including SQL, Table API, and DataStream API

Comparison with Similar Tools

Debezium — Standalone CDC platform using Kafka Connect; Flink CDC embeds Debezium inside Flink for tighter integration
Airbyte — General ELT platform with CDC connectors, but batch-oriented rather than continuous streaming
AWS DMS — Managed CDC service locked to AWS, not open source
Canal — Alibaba MySQL binlog reader focused on MySQL-only use cases
Maxwell — Lightweight MySQL-only binlog reader that writes to Kafka; no built-in transformation

FAQ

Q: Do I need Kafka to use Flink CDC? A: No. Flink CDC reads database logs directly without requiring an intermediate message queue.

Q: Which databases are supported? A: MySQL, PostgreSQL, MongoDB, Oracle, SQL Server, Db2, OceanBase, TiDB, and Vitess, with the list growing.

Q: Can Flink CDC handle schema changes automatically? A: Yes. The pipeline mode can propagate DDL changes like column additions to supported sinks.

Q: What is the difference between Flink CDC and Debezium? A: Flink CDC uses Debezium internally but runs inside the Flink runtime, giving you access to Flink SQL, exactly-once checkpointing, and the full Flink ecosystem.

Flink CDC — Real-Time Change Data Capture for Apache Flink

This asset can be read and installed directly by agents

Introduction

What Flink CDC Does

Architecture Overview

Self-Hosting & Configuration

Key Features

Comparison with Similar Tools

FAQ

Sources

Discussion

Related Assets

Apache Hudi — Incremental Data Processing for Data Lakehouses

Apache Pinot — Real-Time Distributed OLAP Datastore

Apache Doris — Modern MPP Analytical Database for Real-Time Reporting

Moshi — Real-Time AI Voice Conversation Engine