PostgreSQL offers a few different ways of reading in CDC fashion. A commonly used pattern is to read from the Write-Ahead Logs (WAL). The WAL contains changes at the storage level, so this needs to be converted into application specific data through a process called logical decoding. Here, data is read from a replication slot which represents a stream of ordered changes that can be replayed for a client. At Datacoral, we offer a couple of different ways of retrieving changes from the replication slot.
- Through SQL interface to read the changes and after reading, deleting the changes from the replication slot. The sql interface uses file-based shipping of WAL which means every time we issue a peek/get command, the WAL sender process reads the WAL files on the server and serves them to the client.
- Through streaming replication which provides a way to publish logs for only the tables that we are interested in replicating. The logical replication slot can be subscribed to the publisher and get the changes in a streaming way. The streaming replication is much faster and in most cases the delay between a transaction being written onto the primary and it being sent to the subscriber is 1 second. This is supported for PostgreSQL version 10 and above.
Features and Capabilities
- Historical sync :
- Full historical sync of your entire data
- When new tables and columns are detected at source and added to the destination, historical syncs start automatically
- Manual triggers of historical sync
- Data Extraction Modes:
- Data is extracted in an incremental way from the WAL
- Data Load Modes:
- Data can be loaded in append and merge modes
- Soft deletes :
- Soft deletes can be applied in merge mode
- Schema Change Handling :
- All schema changes at source are automatically applied at destination
- Deleted columns at source can be renamed in the destination
- Tables and Columns selection:
- Individual schemas can be selected along with tables and columns for replication
- Syncing multiple schemas is supported
- Data Quality Checks:
- Supports automated and customized data quality checks between source and destination
- Highly flexible scheduling system with support for near real-time data replication
- Monitoring and Alerting :
- Additional monitoring at slot-management
Once PostgreSQL CDC slice has been set up, data will start flowing from Write Ahead Logs to the data warehouse -
- Change data (Insert / Update / Delete) from the WAL will flow into S3 on a real-time basis
- On the schedule specified by the user, data will be stored in the
- Final changes will be applied to corresponding tables in the destination
Monitoring and Failure Handling
Datacoral monitors for slot size and data freshness
- When slot size starts to grow and it is consuming too much disk space in the PG database, Datacoral can automatically recover using a long-running process in AWS Batch
- When slot size becomes too large or irrecoverable, we delete the slot, perform historical sync from replica, and then start capturing changes in a CDC manner again.
As this is an action on Master database, any reduction in free space can have an adverse impact and hence a close monitoring and failure recovery is critical. Customers can also set up alerts around slot sizes so that when there is an increase in slot size, it can be managed appropriately
Data stored in Amazon S3 is partitioned by date and time in the following bucket
Slot table -
Individual tables -
Data Warehouse: Schema name will be same as the connector name. Tables produced by the connector will have the same name as the source database.
- PostgreSQL Logical Decoding 9.6.10 and above
- Hosted flavors