Introduction
Debezium is an open-source distributed platform for change data capture (CDC). It monitors database transaction logs and streams every row-level change as an event to Apache Kafka or other messaging systems, enabling real-time data pipelines without modifying application code.
What Debezium Does
- Captures row-level INSERT, UPDATE, and DELETE events from database transaction logs
- Supports MySQL, PostgreSQL, MongoDB, SQL Server, Oracle, Cassandra, and Db2
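- Streams change events to Kafka topics with at-least-once delivery by default; exactly-once delivery is possible when Kafka Connect's exactly-once source support is enabled
- Includes the before and after state of the changed row in each event, where the source database logs it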
- Handles initial snapshots of existing data before switching to streaming mode
Architecture Overview
Debezium's connectors run as Kafka Connect source connectors. Each connector reads the database's transaction log (the write-ahead log in PostgreSQL, the binlog in MySQL) and converts changes into structured events with a consistent envelope format. Kafka Connect stores each connector's offsets in Kafka, so the connector can resume from its last recorded log position after a failure. For connectors that need it, such as MySQL, a schema history topic stores DDL changes so that row data is interpreted correctly as schemas evolve.
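To make the envelope format concrete, here is a minimal Python sketch that summarizes one change event. The sample event is hand-written for illustration (table, column names, and values are assumptions), and `describe_change` is a hypothetical helper, not part of Debezium. The field names `before`, `after`, `source`, `op`, and `ts_ms` follow the documented envelope.

```python
import json

# A subset of Debezium's documented operation codes in the envelope's "op" field.
OP_NAMES = {"c": "create", "u": "update", "d": "delete", "r": "read (snapshot)"}

# Hand-written sample event value (illustrative data, not captured output).
SAMPLE_EVENT = json.dumps({
    "payload": {
        "before": {"id": 42, "email": "old@example.com"},
        "after": {"id": 42, "email": "new@example.com"},
        "source": {"connector": "postgresql", "db": "inventory", "table": "customers"},
        "op": "u",
        "ts_ms": 1700000000000,
    }
})


def describe_change(raw_value: str) -> dict:
    """Summarize one change event: operation, table, and columns that changed."""
    event = json.loads(raw_value)
    # With JSON schemas enabled the envelope sits under "payload";
    # otherwise the envelope fields are at the top level.
    envelope = event.get("payload", event)
    before = envelope.get("before") or {}
    after = envelope.get("after") or {}
    changed = sorted(
        key for key in set(before) | set(after) if before.get(key) != after.get(key)
    )
    return {
        "operation": OP_NAMES.get(envelope["op"], "unknown"),
        "table": envelope["source"]["table"],
        "changed_columns": changed,
    }


print(describe_change(SAMPLE_EVENT))
```

A real consumer would read `raw_value` from a Kafka topic and would likely also handle tombstone records (null values), which this sketch does not cover.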
Self-Hosting & Configuration
- Deploy as connectors on a Kafka Connect cluster backed by an existing Kafka cluster
- Use Debezium Server for standalone operation without Kafka Connect infrastructure
- Configure database connection, topic routing, and snapshot mode per connector
- Set up a schema registry (Confluent or Apicurio) for Avro or Protobuf serialization
- Use signal tables and incremental snapshots for re-snapshotting without downtime
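As a rough illustration of per-connector configuration, the sketch below builds the JSON body that could be sent to Kafka Connect's `POST /connectors` REST endpoint for a MySQL connector. The property names follow the Debezium 2.x MySQL connector documentation as far as I know; older releases use different names, so check them against the version in use. The host names, credentials, table names, and the helper `build_mysql_connector_request` are all placeholders invented for this example.

```python
import json


def build_mysql_connector_request(
    name: str, host: str, topic_prefix: str, tables: list[str]
) -> dict:
    """Build a request body for Kafka Connect's POST /connectors endpoint."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            # Database connection (placeholder values)
            "database.hostname": host,
            "database.port": "3306",
            "database.user": "debezium",
            "database.password": "change-me",
            "database.server.id": "184054",  # must be unique within the MySQL cluster
            # Topic routing: the prefix becomes the first part of each topic name
            "topic.prefix": topic_prefix,
            "table.include.list": ",".join(tables),
            # Snapshot existing rows first, then switch to streaming
            "snapshot.mode": "initial",
            # Where the connector keeps its DDL history (placeholder broker address)
            "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
            "schema.history.internal.kafka.topic": f"schema-history.{topic_prefix}",
        },
    }


request_body = build_mysql_connector_request(
    name="inventory-connector",
    host="mysql.internal",
    topic_prefix="shop",
    tables=["inventory.orders", "inventory.customers"],
)
print(json.dumps(request_body, indent=2))
```

The sketch stops at building the payload; sending it is a plain HTTP POST with a JSON content type to the Connect worker's REST port.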
Key Features
- Log-based CDC with no polling, no triggers, and no application code changes
- Exactly-once delivery when Kafka Connect's exactly-once source support (built on Kafka transactions) is enabled; at-least-once delivery otherwise
- Schema evolution tracking with automatic topic schema updates
- Single Message Transforms (SMTs) for filtering, routing, and reshaping events
- Debezium UI for visual connector management and monitoring
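To show what reshaping events with an SMT can look like, the sketch below pairs the connector properties that enable Debezium's `ExtractNewRecordState` transform with a rough Python approximation of its effect. The alias `unwrap`, the sample envelopes, and the `flatten` helper are assumptions made for illustration; the real transform has configurable delete handling that is not modeled here.

```python
from typing import Optional

# Connector properties that switch on Debezium's event-flattening SMT.
# "unwrap" is an arbitrary alias chosen for this sketch.
UNWRAP_SMT_CONFIG = {
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
}


def flatten(envelope: dict) -> Optional[dict]:
    """Approximate the flattening: keep only the row state after the change.

    Returns None for deletes, where the envelope carries no "after" state.
    """
    return envelope.get("after")


# Hand-written envelopes for one row: created, updated, then deleted.
envelopes = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "paid"}},
    {"op": "d", "before": {"id": 1, "status": "paid"}, "after": None},
]

# Downstream consumers then see plain rows instead of full envelopes.
flat_rows = [row for row in map(flatten, envelopes) if row is not None]
print(flat_rows)
```

Flattening trades the before state and source metadata for simpler records, which suits sinks that expect one plain row per message.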
Comparison with Similar Tools
- Maxwell — MySQL-only CDC; Debezium supports 8+ database types
- Canal — Alibaba MySQL binlog parser; Debezium provides a broader connector ecosystem
- AWS DMS — managed service with CDC; Debezium is self-hosted and open source
- Airbyte — batch-first ELT platform; Debezium is real-time stream-first
- Fivetran — managed SaaS CDC; Debezium gives full control over infrastructure
FAQ
Q: Does Debezium require Kafka? A: Not necessarily. Debezium Server can send events directly to Redis, Pulsar, Kinesis, or HTTP endpoints without Kafka.
Q: How does CDC differ from triggers or polling? A: Log-based CDC reads the transaction log directly, so it adds little overhead to the database because no extra queries or triggers run against the tables. Triggers add write latency, and polling misses intermediate states between intervals.
Q: Can Debezium handle schema changes? A: Yes. For connectors that rely on it, such as MySQL, Debezium records DDL changes in a schema history topic and uses them to serialize events correctly as tables evolve.
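Q: What happens if the connector falls behind? A: Debezium records its position as offsets and catches up by reading the log from that point. If the required part of the log has already been purged, the connector cannot resume from its offset; a new snapshot (for example, an incremental snapshot) can re-capture the current table state, but changes that existed only in the purged log are lost.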
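An incremental snapshot like the one mentioned in the last answer is typically requested by inserting a row into a signal table. The sketch below builds such a row, using SQLite as a stand-in for the source database so it runs anywhere. The table name `debezium_signal`, the collection name `inventory.orders`, and the helper `incremental_snapshot_signal` are assumptions for illustration; in a real deployment the table lives in the captured database and the connector is pointed at it through its `signal.data.collection` property.

```python
import json
import sqlite3
import uuid


def incremental_snapshot_signal(tables: list[str]) -> tuple[str, str, str]:
    """Return (id, type, data) for an ad hoc incremental snapshot signal row."""
    data = json.dumps({"data-collections": tables, "type": "incremental"})
    # The id is an arbitrary unique string that identifies this signal.
    return (str(uuid.uuid4()), "execute-snapshot", data)


# SQLite stands in for the source database in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE debezium_signal ("
    "id VARCHAR(42) PRIMARY KEY, type VARCHAR(32) NOT NULL, data VARCHAR(2048))"
)

signal_row = incremental_snapshot_signal(["inventory.orders"])
conn.execute(
    "INSERT INTO debezium_signal (id, type, data) VALUES (?, ?, ?)", signal_row
)
conn.commit()

stored_type, stored_data = conn.execute(
    "SELECT type, data FROM debezium_signal"
).fetchone()
print(stored_type, stored_data)
```

Because the signal arrives through the same log the connector is already reading, the snapshot can run alongside streaming instead of requiring a connector restart.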