
Change Data Capture (CDC) with Debezium


Change Data Capture (CDC) is a set of software design patterns for detecting and tracking data changes so that downstream systems can act on them. Debezium is the gold standard for open-source CDC.


🏗️ The Traditional Problem: Dual Writes

In a microservices environment, you often need to update a database AND send a message to Kafka.

  • Problem: If the database update succeeds but the message send fails, your system is inconsistent.
  • Problem: If you use a “Polling” approach (checking for updated_at columns), you put significant load on the production DB and might miss intermediate changes.
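The dual-write failure mode is easy to demonstrate with in-memory stand-ins for the database and the broker (the class and function names here are illustrative, not a real client API):

```python
# Illustrative sketch of the dual-write problem: the DB commit and the
# Kafka publish are two separate, non-atomic operations.
class FlakyBroker:
    """Stand-in for a Kafka producer that fails intermittently."""
    def __init__(self, fail=False):
        self.messages = []
        self.fail = fail

    def send(self, topic, payload):
        if self.fail:
            raise ConnectionError("broker unreachable")
        self.messages.append((topic, payload))

def place_order(db, broker, order):
    db.append(order)                # step 1: DB write succeeds
    broker.send("orders", order)    # step 2: publish may fail -> inconsistency

db, broker = [], FlakyBroker(fail=True)
try:
    place_order(db, broker, {"id": 1, "sku": "A-100"})
except ConnectionError:
    pass

# The order is committed in the DB but never reached Kafka:
print(len(db), len(broker.messages))  # -> 1 0
```

No amount of retry logic fully closes this gap; log-based CDC sidesteps it by making the database commit itself the single source of events.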

🚀 The Debezium Solution: Log-Based CDC

Debezium sits on top of the database’s Transaction Log (e.g., MySQL Binlog, PostgreSQL WAL).

  1. Non-Invasive: It reads the transaction log (via the database’s replication protocol) instead of querying tables, so it adds minimal load to the source.
  2. Low Latency: Changes are captured in milliseconds.
  3. Captures ALL changes: Unlike polling, it captures DELETE operations and every intermediate UPDATE.

🛠️ Architecture: Debezium on Kafka Connect

Debezium is usually deployed as a set of connectors running on Kafka Connect.

  • Source: PostgreSQL, MySQL, SQL Server, MongoDB, Oracle.
  • Transport: Kafka Connect sends “Change Events” to Kafka topics.
  • Sink: Downstream systems (Elasticsearch, Snowflake, S3) consume the change events.

Example Change Event (JSON)

… (existing json) …


🛠️ Implementation: Kafka Connect Configuration

To deploy Debezium, you POST a JSON configuration to the Kafka Connect REST API.

{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.dbname": "inventory",
    "topic.prefix": "dbserver1",
    "table.include.list": "public.orders,public.customers",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot"
  }
}

Note: This configuration tells Debezium to watch the orders and customers tables and stream every change to Kafka topics named dbserver1.public.orders and dbserver1.public.customers.
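As a sketch of what a downstream consumer might do with these events — the envelope fields below (`op`, `before`, `after`, `source`) follow Debezium’s documented event structure, but the payload values are invented for illustration:

```python
import json

# Illustrative Debezium change event (abbreviated). "op" is "c"=create,
# "u"=update, "d"=delete, "r"=snapshot read; "before"/"after" hold row state.
event_json = """
{
  "payload": {
    "op": "u",
    "before": {"id": 42, "status": "PENDING"},
    "after":  {"id": 42, "status": "SHIPPED"},
    "source": {"table": "orders"}
  }
}
"""

def describe(event: dict) -> str:
    """Turn a change event into a human-readable summary."""
    p = event["payload"]
    actions = {"c": "inserted", "u": "updated", "d": "deleted", "r": "snapshot"}
    # Deletes carry only "before"; all other ops carry "after".
    row = p["after"] if p["after"] is not None else p["before"]
    return f'{p["source"]["table"]} row {row["id"]} {actions[p["op"]]}'

print(describe(json.loads(event_json)))  # -> orders row 42 updated
```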


💡 Key Use Cases

  1. Microservice Synchronization: Keep a read-optimized view (e.g., in Elasticsearch) in sync with a write-optimized DB.
  2. Zero-Downtime Migrations: Stream data from an old database to a new one in real-time.
  3. Audit Logs: Automatically create a perfect history of every change ever made to a table.
  4. Cache Invalidation: Automatically clear Redis caches when the source of truth changes.
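Use case 4 can be sketched with a plain dict standing in for Redis — the event shape mirrors Debezium’s envelope, while the `table:id` cache-key scheme is an assumption for this example:

```python
# In-memory stand-in for a Redis cache, keyed by "table:primary_key".
cache = {"customers:7": {"id": 7, "name": "Alice (stale)"}}

def invalidate(event: dict, cache: dict) -> None:
    """Drop the cached row whenever the source of truth changes."""
    p = event["payload"]
    row = p["after"] or p["before"]          # deletes carry only "before"
    key = f'{p["source"]["table"]}:{row["id"]}'
    cache.pop(key, None)                     # no-op if the key is absent

invalidate(
    {"payload": {"op": "u",
                 "before": {"id": 7, "name": "Alice"},
                 "after": {"id": 7, "name": "Alicia"},
                 "source": {"table": "customers"}}},
    cache,
)
print("customers:7" in cache)  # -> False
```

The next read misses the cache and repopulates it from the database, so the cache can never serve data older than the last committed change.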

⚠️ Challenges & Best Practices

  • Snapshotting: Before tailing the log, Debezium takes an initial snapshot of existing rows; for large tables, tune snapshot.mode (or use incremental snapshots) to avoid long initial loads.
  • Schema Evolution: ALTER TABLE commands change the event schema; downstream consumers must tolerate added columns, and a Schema Registry helps enforce compatibility.
  • Message Size: Transaction logs can be verbose; use Avro or Protobuf with Schema Registry to reduce payload size.
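The snapshot and serialization settings above map to connector properties like the following — the property names are standard Debezium/Kafka Connect settings, but the hostnames and values are illustrative:

```json
{
  "snapshot.mode": "initial",
  "key.converter": "io.confluent.connect.avro.AvroConverter",
  "key.converter.schema.registry.url": "http://schema-registry:8081",
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "http://schema-registry:8081"
}
```

These would be merged into the `config` block of the connector JSON shown earlier.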