PostgreSQL CDC Connector

Overview

PostgreSQL offers a few different ways of reading changes in a CDC fashion. A commonly used pattern is to read from the Write-Ahead Log (WAL). The WAL contains changes at the storage level, so it needs to be converted into application-specific data through a process called logical decoding. Here, data is read from a replication slot, which represents a stream of ordered changes that can be replayed for a client. Datacoral then uses SQL to read the changes from the replication slot and loads them into the data warehouse.
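As a minimal sketch of what logical decoding looks like at the SQL level, the statements below create and read from a replication slot. The slot name datacoral_slot and the wal2json output plugin are placeholders for this example; the connector's actual slot name and output plugin may differ, and creating a slot requires wal_level = logical on the source database.

    -- Create a logical replication slot (one-time setup; requires wal_level = logical)
    SELECT * FROM pg_create_logical_replication_slot('datacoral_slot', 'wal2json');

    -- Peek at pending changes without consuming them
    SELECT lsn, xid, data
    FROM pg_logical_slot_peek_changes('datacoral_slot', NULL, NULL);

    -- Read and consume changes, advancing the slot so the retained WAL can be recycled
    SELECT lsn, xid, data
    FROM pg_logical_slot_get_changes('datacoral_slot', NULL, NULL);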

Features and Capabilities

  • Historical sync:
    • Full historical sync of all of your data
    • When new tables and columns are detected at source and added to the destination, historical syncs start automatically
    • Manual triggers of historical sync
  • Data Extraction Modes:
    • Data is extracted in an incremental way from the WAL
  • Data Load Modes:
    • Data can be loaded in append and merge modes
  • Soft deletes:
    • Soft deletes can be applied in merge mode (see the sketch after this list)
  • Schema Change Handling:
    • All schema changes at source are automatically applied at destination
    • Deleted columns at source can be renamed in the destination
  • Tables and Columns Selection:
    • Individual schemas can be selected along with tables and columns for replication
    • Syncing multiple schemas is supported
  • Data Quality Checks:
    • Supports automated and customized data quality checks between source and destination
  • Scheduling:
    • Highly flexible scheduling system with support for near real-time data replication
  • Monitoring and Alerting:
    • Additional monitoring of replication slot size and slot management
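As a rough illustration of merge-mode loading with soft deletes, the sketch below applies a batch of staged change records to a destination table. The table names, columns, op values, and is_deleted flag are hypothetical; the actual SQL generated by the connector depends on the destination warehouse and may differ.

    -- Apply staged CDC changes to a destination table (hypothetical names and columns)
    MERGE INTO analytics.orders AS dst
    USING staging.orders_changes AS chg
      ON dst.order_id = chg.order_id
    WHEN MATCHED AND chg.op = 'DELETE' THEN
      UPDATE SET is_deleted = TRUE               -- soft delete: flag the row instead of removing it
    WHEN MATCHED THEN
      UPDATE SET amount = chg.amount,
                 status = chg.status,
                 is_deleted = FALSE
    WHEN NOT MATCHED AND chg.op <> 'DELETE' THEN
      INSERT (order_id, amount, status, is_deleted)
      VALUES (chg.order_id, chg.amount, chg.status, FALSE);

In append mode, the same change records would simply be inserted into the destination table without the matching step.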

Steady State

Once the PostgreSQL CDC connector has been set up, data starts flowing from the Write-Ahead Log to the data warehouse:

  1. Change data (inserts, updates, and deletes) from the WAL flows into S3 in real time
  2. On the schedule specified by the user, the data is written to the datacoral_slot_table in Athena (see the query sketch after this list)
  3. The final changes are applied to the corresponding tables in the destination
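To illustrate the intermediate step, the query below inspects the staged change records in Athena for a single partition. The installation and connector names (acme, pgcdc), partition columns (dt, hr), and change columns (op, lsn, record) are assumptions for this sketch, not the connector's actual schema.

    -- Hypothetical Athena query over staged change records for one date/hour partition
    SELECT op, lsn, record
    FROM datacoral_acme_pgcdc.datacoral_slot_table
    WHERE dt = '2024-01-01'
      AND hr = '00'
    ORDER BY lsn;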

Monitoring and Failure Handling

Datacoral monitors slot size and data freshness:

  • When the slot starts to grow and consumes too much disk space in the PostgreSQL database, Datacoral can automatically recover using a long-running process in AWS Batch
  • When the slot becomes too large or cannot be recovered, Datacoral deletes the slot, performs a historical sync from a replica, and then resumes capturing changes in CDC fashion.

Because the replication slot lives on the master database, any reduction in free disk space can have an adverse impact, so close monitoring and failure recovery are critical. Customers can also set up alerts on slot size so that growth can be detected and managed appropriately, for example using a query like the one below.
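The amount of WAL retained by a replication slot can be checked directly in PostgreSQL (version 10 or later) with a query along these lines; the slot name is a hypothetical placeholder.

    -- WAL retained by a replication slot (PostgreSQL 10+)
    SELECT slot_name,
           active,
           pg_size_pretty(
             pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
           ) AS retained_wal
    FROM pg_replication_slots
    WHERE slot_name = 'datacoral_slot';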

Connector output

Amazon S3: Data stored in Amazon S3 is partitioned by date and time under the following location: s3://customer_installation.datacoral/<connector-name>.
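For example, a change file for a hypothetical connector named pg_prod might be written under a key such as the following; the date/time partition layout and file name shown here are illustrative only, not the connector's exact naming scheme.

    -- illustrative only; actual partition layout and file names may differ
    s3://customer_installation.datacoral/pg_prod/dt=2024-01-01/hr=00/changes-000.json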

Slot table - datacoral_<customer-installation-name>_<connector-name>.datacoral_slot_table

Individual tables - datacoral_<customer-installation-name>_<connector-name>.<loadunit-name>

Data Warehouse: The schema name will be the same as the connector name. Tables produced by the connector will have the same names as the corresponding tables in the source database.

Supported Versions

Questions? Interested?

If you have questions or feedback, feel free to reach out at hello@datacoral.co or Request a demo.