Debezium — Real-Time Change Data Capture Platform
A distributed platform for streaming database changes into event logs, capturing row-level inserts, updates, and deletes from MySQL, PostgreSQL, MongoDB, and more.
What it is
Debezium is a distributed platform for change data capture (CDC). It monitors database transaction logs and streams row-level inserts, updates, and deletes into Apache Kafka topics. Debezium supports MySQL, PostgreSQL, MongoDB, SQL Server, Oracle, Cassandra, and Db2.
Debezium targets data engineers and platform teams building real-time data pipelines, event-driven architectures, cache invalidation systems, and data warehouse synchronization.
How it saves time or tokens
Debezium replaces polling-based data synchronization. Instead of querying databases on an interval to detect changes, Debezium reads the transaction log and emits each change as it is committed. This reduces query load on the database, captures intermediate updates and deletes that a poller would miss between intervals, and typically delivers changes with sub-second latency. Because Debezium runs on Kafka Connect, you configure connectors declaratively in JSON rather than writing capture code.
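The difference between polling and log-based capture can be illustrated with a small simulation. This is only a sketch: it involves no real database or Debezium connector, and the function names (apply_changes, poll_diff) and sample rows are hypothetical.

```python
# Illustrative simulation only: no real database or Debezium involved.

def apply_changes(changes):
    """Replay (op, key, value) changes; return the ordered change log and final table state."""
    table, log = {}, []
    for op, key, value in changes:
        if op == "delete":
            table.pop(key, None)
        else:  # "insert" or "update"
            table[key] = value
        log.append((op, key, value))
    return log, table


def poll_diff(previous_snapshot, current_snapshot):
    """What an interval poller can infer by comparing two table snapshots."""
    seen = []
    for key, value in current_snapshot.items():
        if key not in previous_snapshot:
            seen.append(("insert", key, value))
        elif previous_snapshot[key] != value:
            seen.append(("update", key, value))
    for key in previous_snapshot:
        if key not in current_snapshot:
            seen.append(("delete", key, None))
    return seen


# Five changes happen between two polls.
changes = [
    ("insert", 1, "alice@old.com"),
    ("update", 1, "alice@mid.com"),
    ("update", 1, "alice@new.com"),
    ("insert", 2, "bob@example.com"),
    ("delete", 2, None),
]

log_events, final_state = apply_changes(changes)
polled_events = poll_diff({}, final_state)

print(len(log_events))  # 5: a log reader sees every change in order
print(polled_events)    # [('insert', 1, 'alice@new.com')]: the poller sees only the net result
```

The poller never learns that row 2 existed or that row 1 passed through two earlier email values, which is the gap that reading the transaction log closes.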
How to use
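- Start the required infrastructure. The Connect container needs a group ID and storage topic names (the topic names below are arbitrary), and the connector in the next step expects a MySQL host named mysql, here provided by Debezium's example MySQL image:
docker run -d --name zookeeper -p 2181:2181 quay.io/debezium/zookeeper
docker run -d --name kafka -p 9092:9092 \
--link zookeeper quay.io/debezium/kafka
docker run -d --name mysql -p 3306:3306 \
-e MYSQL_ROOT_PASSWORD=debezium quay.io/debezium/example-mysql
docker run -d --name connect -p 8083:8083 \
-e GROUP_ID=1 -e CONFIG_STORAGE_TOPIC=connect_configs \
-e OFFSET_STORAGE_TOPIC=connect_offsets -e STATUS_STORAGE_TOPIC=connect_statuses \
--link kafka --link zookeeper --link mysql quay.io/debezium/connect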
- Register a MySQL connector:
curl -X POST http://localhost:8083/connectors -H 'Content-Type: application/json' -d '{
"name": "mysql-connector",
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"database.hostname": "mysql",
"database.port": "3306",
"database.user": "debezium",
"database.password": "dbz",
"database.server.id": "1",
"topic.prefix": "dbserver1",
"schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
"schema.history.internal.kafka.topic": "schema-changes"
}
}'
- Consume change events from Kafka topics named dbserver1.<database>.<table>.
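The registration and topic-naming steps above can also be sketched in Python. This is a minimal sketch using only the standard library; the helper names (build_connector_payload, topic_for, register_connector) are hypothetical, and the inventory/customers names are borrowed from the example event on this page. register_connector assumes Kafka Connect is reachable at http://localhost:8083 and is defined but not called here.

```python
import json
import urllib.request


def build_connector_payload(name, hostname, user, password, server_id,
                            topic_prefix, bootstrap_servers, history_topic,
                            port=3306):
    """Build the JSON body for POST /connectors (Kafka Connect config values are strings)."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": hostname,
            "database.port": str(port),
            "database.user": user,
            "database.password": password,
            "database.server.id": str(server_id),
            "topic.prefix": topic_prefix,
            "schema.history.internal.kafka.bootstrap.servers": bootstrap_servers,
            "schema.history.internal.kafka.topic": history_topic,
        },
    }


def topic_for(topic_prefix, database, table):
    """Topic name for one table, following the <prefix>.<database>.<table> pattern."""
    return f"{topic_prefix}.{database}.{table}"


def register_connector(payload, connect_url="http://localhost:8083/connectors"):
    """POST the payload to Kafka Connect; requires the Connect container to be running."""
    request = urllib.request.Request(
        connect_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_connector_payload(
    name="mysql-connector", hostname="mysql", user="debezium", password="dbz",
    server_id=1, topic_prefix="dbserver1",
    bootstrap_servers="kafka:9092", history_topic="schema-changes",
)
topic = topic_for("dbserver1", "inventory", "customers")
print(topic)  # dbserver1.inventory.customers
```

Building the payload as a dictionary keeps credentials and hostnames out of hand-edited JSON strings and makes the topic name derivable from the same topic.prefix value.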
Example
A Debezium change event JSON structure:
{
"before": {"id": 1, "name": "Alice", "email": "alice@old.com"},
"after": {"id": 1, "name": "Alice", "email": "alice@new.com"},
"source": {"db": "inventory", "table": "customers"},
"op": "u",
"ts_ms": 1713000000000
}
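A consumer typically branches on the op code and compares before with after. The following is a minimal sketch, assuming the event has already been parsed into a dictionary shaped like the example above (if the JSON converter includes schemas, the same fields sit under a "payload" key). The helper names changed_fields and summarize are hypothetical.

```python
# Debezium op codes: c = create, u = update, d = delete, r = read (snapshot).
OP_NAMES = {"c": "create", "u": "update", "d": "delete", "r": "read (snapshot)"}


def changed_fields(event):
    """Return {field: (old, new)} for every field whose value differs."""
    before = event.get("before") or {}  # null for creates and snapshot reads
    after = event.get("after") or {}    # null for deletes
    return {
        field: (before.get(field), after.get(field))
        for field in sorted(set(before) | set(after))
        if before.get(field) != after.get(field)
    }


def summarize(event):
    """One-line description of a change event."""
    source = event["source"]
    op = OP_NAMES.get(event["op"], event["op"])
    return f'{op} on {source["db"]}.{source["table"]}: {changed_fields(event)}'


event = {
    "before": {"id": 1, "name": "Alice", "email": "alice@old.com"},
    "after": {"id": 1, "name": "Alice", "email": "alice@new.com"},
    "source": {"db": "inventory", "table": "customers"},
    "op": "u",
    "ts_ms": 1713000000000,
}

print(summarize(event))
# update on inventory.customers: {'email': ('alice@old.com', 'alice@new.com')}
```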
Common pitfalls
- MySQL requires binlog_format=ROW and binlog_row_image=FULL. Without these settings, Debezium cannot capture complete change events.
- Initial snapshots of large tables can take hours and put load on the source database. Schedule initial snapshots during low-traffic periods.
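- Kafka topic retention must outlast your downstream consumer lag. If consumers fall further behind than the retention window, Kafka deletes the old log segments and those events are lost. On compacted topics, older records for a key are removed as well, so a lagging consumer sees only the latest value per key.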
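The MySQL settings from the first pitfall can be checked before registering a connector. This is a sketch only: it assumes the server variables have already been collected into a dictionary, for example from the rows returned by SHOW VARIABLES LIKE 'binlog%', and the function name check_binlog_settings is hypothetical.

```python
# Settings named in the pitfalls list above.
REQUIRED_MYSQL_SETTINGS = {"binlog_format": "ROW", "binlog_row_image": "FULL"}


def check_binlog_settings(variables):
    """Return a list of problems; an empty list means the settings match."""
    problems = []
    for name, expected in REQUIRED_MYSQL_SETTINGS.items():
        actual = variables.get(name)
        if actual is None:
            problems.append(f"{name} is not set (expected {expected})")
        elif actual.upper() != expected:
            problems.append(f"{name}={actual} (expected {expected})")
    return problems


good = check_binlog_settings({"binlog_format": "ROW", "binlog_row_image": "FULL"})
bad = check_binlog_settings({"binlog_format": "MIXED", "binlog_row_image": "minimal"})

print(good)  # []
print(bad)   # ['binlog_format=MIXED (expected ROW)', 'binlog_row_image=minimal (expected FULL)']
```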
Frequently Asked Questions
Does Debezium require Kafka?
The primary deployment uses Kafka Connect. However, Debezium Server provides a standalone runtime that can send events to Amazon Kinesis, Google Pub/Sub, Apache Pulsar, and other messaging systems without Kafka.
Which databases does Debezium support?
Debezium supports MySQL, PostgreSQL, MongoDB, SQL Server, Oracle, Db2, Cassandra, and Vitess. Each database has a dedicated connector that reads its specific transaction log format.
How does CDC differ from polling?
CDC reads the database transaction log to capture every change in order with sub-second latency. Polling queries the database on an interval, missing changes between polls and adding query load to the database.
Does Debezium handle schema changes?
Yes. Debezium tracks schema changes through the transaction log and records them in a schema history topic. Downstream consumers can detect when columns are added, removed, or modified.
What happens when a connector restarts?
Debezium stores its position in the transaction log in Kafka Connect offsets. When the connector restarts, it resumes from the last committed offset, so no changes are skipped. Delivery is at-least-once: after an unclean shutdown, events since the last committed offset may be emitted again, so downstream consumers should handle duplicates.