Unlock the Power of Data with Change Data Capture and Debezium

From Traditional Data Syncing Pipelines to Real-time Data Streaming and Integration for Modern Applications

IndiaMART Blogs · Jan 15, 2025

By Saloni Batra and Yeshveer Yadav

Header image: Yeshveer Yadav, generated using ChatGPT

In today’s data-driven world, information is more than just an asset — it’s the lifeblood that fuels engineering, analytics, and data modelling efforts across industries. To harness the true power of data, capturing it in a timely and accurate manner is crucial. This is where Change Data Capture (CDC) becomes a game-changer.
Among the various CDC tools available, Debezium stands out as an open-source solution that efficiently captures data changes from source databases, making it indispensable for modern data architectures.

Understanding the Core Concepts

Before we dive into setting up Debezium, let’s explore some key concepts that will help you better understand its operation:

Debezium Source Connector: We’re leveraging the Debezium Project for a critical part of our design. This widely adopted open-source solution, used by many tech giants, efficiently detects changes to rows in Postgres and generates an event for each modification — a process known as Change Data Capture (CDC). In our case, we’re specifically using the Debezium Postgres Source Connector, though it also supports other databases like MySQL and MongoDB.

The Postgres connector utilises a concept called logical decoding to read from the Postgres write-ahead log (WAL), also known as the transaction log, and generates events for every change to a row. The records produced by the connector can be quite verbose, so we employ a Single Message Transform to simplify the event to just the new values in the row. To ensure the proper ordering of events, we maintain a single partition for each Kafka topic and include the Postgres log sequence number (LSN) with each event. The LSN serves as the offset in the Postgres write-ahead log.
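
As a rough sketch of how this looks in practice, a Postgres source connector registered with Kafka Connect might be configured as below. The connector name, hosts, and table list are placeholders rather than our production values, and the property names follow Debezium 2.x (older releases use database.server.name instead of topic.prefix):

{
  "name": "pg-source-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "<source_pg_host>",
    "database.port": "5432",
    "database.user": "<replication_user>",
    "database.password": "<password>",
    "database.dbname": "<source_db>",
    "topic.prefix": "<topic_prefix>",
    "table.include.list": "public.orders",
    "slot.name": "debezium_slot",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"
  }
}

The unwrap transform here is the Single Message Transform mentioned above: it flattens Debezium's verbose change envelope down to just the new row values.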

I want to highlight a critical issue that could lead to significant problems if not managed properly: while Postgres’ logical decoding offers an efficient way to stream database changes to external consumers, it has its drawbacks. Most importantly, Postgres will not delete any part of the write-ahead log until all existing logical replication slots have consumed the data. This means that if your consumer (in this case, the Debezium connector) stops consuming for any reason, Postgres will continue to append to the write-ahead log until the disk is full, potentially rendering the database unresponsive. Using logical decoding necessitates sufficient disk space to buffer against temporary outages and effective monitoring to alert you to any issues early on.
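
A simple way to keep an eye on this is to watch how much WAL each replication slot is retaining. A minimal sketch, assuming Postgres 10 or later (older versions use the pg_xlog_* equivalents):

SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

Alerting when retained_wal keeps growing, or when active turns false, gives you an early warning before the disk fills up.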

Pgoutput: When creating a logical decoding slot in Postgres, you must specify an output plugin for the slot. Logical decoding output plugins decode the contents of the write-ahead log and convert them into a format suitable for consumption. We use the pgoutput plugin, the built-in output plugin that Postgres provides for logical replication, which offers efficient data streaming capabilities. Unlike wal2json, which produces JSON-formatted output, pgoutput emits Postgres' native logical replication protocol and works across a wide range of replication scenarios, making it a versatile choice for many applications.
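
For illustration, this is roughly what the underlying Postgres objects look like when using pgoutput. The publication and slot names are illustrative; in practice Debezium can create both automatically if the connector user has sufficient privileges:

-- publication listing the tables to stream
CREATE PUBLICATION debezium_pub FOR TABLE public.orders;

-- logical replication slot backed by the built-in pgoutput plugin
SELECT * FROM pg_create_logical_replication_slot('debezium_slot', 'pgoutput');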

Apache Kafka: Central to our new ELT system is a Kafka cluster that maintains one topic for each Postgres table we’re replicating. This cluster is dedicated exclusively to ELT events, facilitating the seamless integration of data changes captured by Debezium. As Debezium streams changes from the Postgres database, it publishes them to the corresponding Kafka topics, ensuring that each event is efficiently processed and available for downstream applications. By leveraging Kafka’s distributed architecture, we gain scalability and reliability, allowing us to handle high volumes of data with minimal latency.
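
Because ordering is only guaranteed within a partition, each of these topics carries a single partition. A sketch of how such a topic could be created up front (the topic name follows Debezium's <topic_prefix>.<schema>.<table> convention; in practice topics may also be auto-created):

bin/kafka-topics.sh --bootstrap-server <broker_host>:9092 \
  --create --topic <topic_prefix>.public.orders \
  --partitions 1 --replication-factor 3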

Kafka Connect: Kafka Connect is an open-source framework that facilitates the integration of Kafka with various existing systems, such as databases and filesystems, through pre-built components known as connectors. In the context of Debezium, we leverage Kafka Connect to stream change data from our databases into Kafka. Debezium provides Source connectors specifically designed for this purpose, allowing us to efficiently capture and transmit database changes in real-time. This integration enables seamless data movement from the database to Kafka, where it can be processed and consumed by other applications within our ecosystem.
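
Connectors are managed through Kafka Connect's REST API (port 8083 by default). For example, to register the source connector sketched earlier and to list the connectors running on the cluster (the file name and host are placeholders):

# register a connector from a JSON config file
curl -s -X POST -H "Content-Type: application/json" \
  --data @postgres-source.json http://<connect_host>:8083/connectors

# list connectors currently deployed on the cluster
curl -s http://<connect_host>:8083/connectors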

Debezium Overview

Setting Up Debezium with Kafka

To ensure continuous availability of the data pipeline, we’ve configured Debezium as a Kafka Connect plugin running in distributed mode. The setup involves using two virtual machines (VMs) with Java and Kafka installed.

1. Kafka Installation: First, download and install Kafka.
2. Debezium PostgreSQL Installation: Follow the Debezium Installation Guide to get started.

Set Up Debezium

Create the plugin directory: Begin by creating the `/opt/debezium` directory on each server that will run Kafka Connect.

Place the Debezium plugin: Extract or move the Debezium PostgreSQL connector into `/opt/debezium` so that it matches the `plugin.path` configured below.
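
For example, assuming the Debezium PostgreSQL connector archive has already been downloaded (the version shown is illustrative):

mkdir -p /opt/debezium
tar -xzf debezium-connector-postgres-2.5.0.Final-plugin.tar.gz -C /opt/debezium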

Configure Kafka:

Update the connect-distributed.properties file in Kafka’s config folder for distributed mode:

bootstrap.servers=<list_of_kafka_cluster_IPs>
plugin.path=/opt/debezium
group.id=<your_kafka_connect_cluster_group_id> # same for all nodes in distributed mode
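
In distributed mode the workers also need shared internal topics and converters. A minimal sketch of the remaining settings (the topic names are the conventional defaults, not necessarily ours), followed by the command used to start a worker on each VM:

offset.storage.topic=connect-offsets
config.storage.topic=connect-configs
status.storage.topic=connect-status
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter

bin/connect-distributed.sh config/connect-distributed.properties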

For high availability, we are running Kafka along with Zookeeper in distributed mode, with three Kafka brokers.

Syncing Data into Secondary Databases
To move data into the destination database, you'll need an appropriate sink connector. We use the open-source JDBC sink connector to route data into the destination PostgreSQL database.
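
As a sketch, a sink connector for one replicated table could look like the configuration below. We are assuming the Debezium JDBC sink connector here (a Confluent JDBC sink is configured along similar lines); the names, hosts, and topic are placeholders:

{
  "name": "pg-jdbc-sink",
  "config": {
    "connector.class": "io.debezium.connector.jdbc.JdbcSinkConnector",
    "topics": "<topic_prefix>.public.orders",
    "connection.url": "jdbc:postgresql://<destination_host>:5432/<destination_db>",
    "connection.username": "<user>",
    "connection.password": "<password>",
    "insert.mode": "upsert",
    "primary.key.mode": "record_key",
    "tasks.max": "1"
  }
}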

Database-Level Configuration
Further configurations are required at the database level to fetch data. The Debezium PostgreSQL Connector Guide provides detailed instructions on how to do this.
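
At a high level, the source database needs logical decoding enabled and a user the connector can log in with. A minimal sketch (the role name and schema are illustrative; the guide above covers the full set of required privileges):

# postgresql.conf (requires a restart)
wal_level = logical
max_replication_slots = 10
max_wal_senders = 10

-- run as a superuser: dedicated connector role
CREATE ROLE debezium WITH LOGIN REPLICATION PASSWORD '<password>';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO debezium;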

Monitoring and Alerting
To ensure smooth operation, we monitor the entire pipeline using Prometheus and Grafana: metrics are exported to Prometheus and visualized in Grafana for monitoring and alerting. This setup enables real-time tracking of the health and performance of our CDC pipeline, and Grafana can additionally be configured to send alerts for critical conditions.
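
As an example of how the metrics can reach Prometheus, the standard Prometheus JMX exporter Java agent can be attached to each Kafka Connect worker and scraped; the jar path, port, and targets below are placeholders:

# set before starting the Connect worker
export KAFKA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent.jar=7071:/opt/kafka-connect-jmx.yml"

# prometheus.yml scrape job for the Connect workers
scrape_configs:
  - job_name: kafka-connect
    static_configs:
      - targets: ['<connect_vm_1>:7071', '<connect_vm_2>:7071']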

Debezium at IndiaMART

At IndiaMART, we leverage Debezium for real-time data streaming into secondary databases. Two primary use cases are described below.

The key requirements that guided the design were:
1. Zero-downtime database upgrades
2. Single-write responsibility
3. Easy adoption

Database Version Upgrade: The process of updating or migrating a database without interrupting the availability of the applications that depend on it. This approach is essential for businesses that require continuous access to their services and cannot afford any downtime during maintenance or upgrades.

We utilized Debezium to upgrade our PostgreSQL databases with zero downtime, ensuring that data remained current throughout the process. To date, we have successfully upgraded five databases, including the critical Auth PG database used for login at IndiaMART, which manages approximately 3 crore (30 million) records per day.

Single-write Responsibility in the context of APIs and databases refers to the principle that each piece of data should have a single source of truth for its creation or modification. Here’s how it applies:

  1. APIs: In a microservices architecture, each service should be responsible for writing to its own data store. This prevents conflicts and ensures that changes to data are centralized within a single service, making it easier to manage data integrity and consistency.
  2. Databases: Each database should be the sole authority for the data it contains. This means that updates, inserts, and deletes should occur through a defined interface (such as an API), ensuring that all data changes are controlled and validated. This approach helps prevent issues like data duplication and inconsistency, as there are no multiple points of entry for writing data.

By adhering to the Single Write Responsibility principle, organizations can streamline data management, improve performance, and reduce the risk of errors that can arise from concurrent writes to multiple databases or services.

IndiaMART manages large volumes of data across multiple databases. To ensure high data flow and availability, replication is crucial. By implementing Debezium in our primary database, we currently synchronize data to 15 destination databases. This setup has minimized the need for additional consumer instances and RabbitMQ, effectively reducing infrastructure costs. Currently, we stream 56 tables in our production pipeline, with an overall data rate of around 1 crore (10 million) records per day.

Debezium Architecture at IndiaMART

Challenges Faced While Running Debezium

Even with a well-configured setup, challenges can arise. Here are some of the issues we encountered and how we addressed them:

Streaming Stops Due to Bulk WAL Generation: When a large amount of Write-Ahead Logging (WAL) is generated, the buffer may overflow, temporarily stopping Debezium’s streaming. This can be mitigated by avoiding bulk WAL generation processes on the source database.
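
Debezium also exposes standard knobs on the source connector for absorbing short bursts of changes (its internal queue and batch sizes). The values below are illustrative, not necessarily the ones we changed:

"max.batch.size": "4096",
"max.queue.size": "16384"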

Latency Between Source and Destination Database: During periods of high traffic, we experienced latency of 1–2 minutes between the source and destination databases. Adjusting the polling interval of the source connector helped reduce this latency.

Adjusting Polling Time:

Property Name:`poll.interval.ms`

Default Value: 500ms

A lower polling time reduces latency but increases processing frequency. Adjust this setting according to your needs to balance data transmission speed and system load.
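
For example, the interval is set in the source connector configuration (the value shown is illustrative):

"poll.interval.ms": "100"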

By fine-tuning these settings, you can create a robust and responsive data pipeline that delivers near-real-time data for your applications.

Key Considerations for Production

In a production environment, it’s essential to be mindful of certain factors:

Consumer Name: The consumer name associated with the Kafka consumer group is what tracks the log sequence number (LSN) up to which data has been committed to Kafka. Changing the consumer name in a running pipeline breaks this offset tracking and could disrupt data flow or cause data loss.

Snapshot Tuning: While taking incremental snapshots, the publishing rate into Kafka is lower than with a normal snapshot. This can be adjusted using the `incremental.snapshot.chunk.size` property. However, beyond a certain chunk size you will also need to tune producer parameters such as `producer.override.batch.size` and `producer.override.buffer.memory` to optimize throughput.
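
For illustration, these are all set on the source connector (the values are placeholders, not our production settings). Note that producer.override.* properties take effect only if the Connect worker permits client overrides via connector.client.config.override.policy=All:

"incremental.snapshot.chunk.size": "8192",
"producer.override.batch.size": "327680",
"producer.override.buffer.memory": "67108864"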

Developed By: Saloni Batra and Yeshveer Singh Yadav

Designed By: Alok Kumar

Written By: Yeshveer Singh Yadav

Edited & Reviewed By: Paras Lehana
