PostgreSQL CDC Connector

Overview

PostgreSQL offers a few different ways of reading in CDC fashion. A commonly used pattern is to read from the Write-Ahead Logs (WAL). The WAL contains changes at the storage level, so this needs to be converted into application specific data through a process called logical decoding. Here, data is read from a replication slot which represents a stream of ordered changes that can be replayed for a client. At Datacoral, we offer a couple of different ways of retrieving changes from the replication slot.

  1. Through SQL interface to read the changes and after reading, deleting the changes from the replication slot. The sql interface uses file-based shipping of WAL which means every time we issue a peek/get command, the WAL sender process reads the WAL files on the server and serves them to the client.
  2. Through streaming replication which provides a way to publish logs for only the tables that we are interested in replicating. The logical replication slot can be subscribed to the publisher and get the changes in a streaming way. The streaming replication is much faster and in most cases the delay between a transaction being written onto the primary and it being sent to the subscriber is 1 second. This is supported for PostgreSQL version 10 and above.

Features and Capabilities

  • Historical sync :
    • Full historical sync of your entire data
    • When new tables and columns are detected at source and added to the destination, historical syncs start automatically
    • Manual triggers of historical sync
  • Data Extraction Modes:
    • Data is extracted in an incremental way from the WAL
  • Data Load Modes:
    • Data can be loaded in append and merge modes
  • Soft deletes :
    • Soft deletes can be applied in merge mode
  • Schema Change Handling :
    • All schema changes at source are automatically applied at destination
    • Deleted columns at source can be renamed in the destination
  • Tables and Columns selection:
    • Individual schemas can be selected along with tables and columns for replication
    • Syncing multiple schemas is supported
  • Data Quality Checks:
    • Supports automated and customized data quality checks between source and destination
  • Scheduling:
    • Highly flexible scheduling system with support for near real-time data replication
  • Monitoring and Alerting :
    • Additional monitoring at slot-management

Steady State

Once PostgreSQL CDC slice has been set up, data will start flowing from Write Ahead Logs to the data warehouse -

  1. Change data (Insert / Update / Delete) from the WAL will flow into S3 on a real-time basis
  2. On the schedule specified by the user, data will be stored in the datacoral_slot_table in Athena
  3. Final changes will be applied to corresponding tables in the destination

Monitoring and Failure Handling

Datacoral monitors for slot size and data freshness

  • When slot size starts to grow and it is consuming too much disk space in the PG database, Datacoral can automatically recover using a long-running process in AWS Batch
  • When slot size becomes too large or irrecoverable, we delete the slot, perform historical sync from replica, and then start capturing changes in a CDC manner again.

As this is an action on Master database, any reduction in free space can have an adverse impact and hence a close monitoring and failure recovery is critical. Customers can also set up alerts around slot sizes so that when there is an increase in slot size, it can be managed appropriately

Connector output

Amazon S3: Data stored in Amazon S3 is partitioned by date and time in the following bucket s3//:customer_installation.datacoral/<connector-name>.

Slot table - datacoral_<customer-installation-name>_<connector-name>.datacoral_slot_table

Individual tables - datacoral_<customer-installation-name>_<connector-name>.<loadunit-name>

Data Warehouse: Schema name will be same as the connector name. Tables produced by the connector will have the same name as the source database.

Supported Versions

Questions? Interested?

If you have questions or feedback, feel free to reach out at hello@datacoral.co or Request a demo.