Monitoring and Alerting of Slot
When working with complex systems such as Change Data Capture, it is important to be able to monitor them closely and to get alerted when there are errors or when a human needs to take action. As an example, with PostgreSQL CDC, the replication slot size can grow when there is a large number of transactions in the database.
Datacoral sets up automatic alerting for when the replication slot reaches certain thresholds.
Datacoral monitors the logical replication slot size every 5 minutes and emits an alert if it crosses the lowest threshold.
If the slot continues to exceed the lowest threshold (i.e., it remains in the ALARM state), a notification is sent every hour.
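The cadence described above (a check every 5 minutes, a reminder every hour while the slot stays in ALARM) can be sketched as a small throttling rule. This is only an illustration, not Datacoral's implementation: the helper name `should_notify` and the idea of tracking the last notification time are assumptions, and the sketch ignores any extra alerts sent when the slot crosses a higher threshold or recovers.

```python
from datetime import datetime, timedelta
from typing import Optional

# Intervals taken from the text above.
POLL_INTERVAL = timedelta(minutes=5)    # how often the slot size is checked
REMINDER_INTERVAL = timedelta(hours=1)  # how often to re-notify while in ALARM


def should_notify(in_alarm: bool, last_notified: Optional[datetime], now: datetime) -> bool:
    """Hypothetical helper: decide whether this poll should send a notification."""
    if not in_alarm:
        return False
    if last_notified is None:
        # First poll that observes the breach: alert immediately.
        return True
    # Still in ALARM: remind only once the reminder interval has elapsed.
    return now - last_notified >= REMINDER_INTERVAL
```

With a 5-minute poll, this rule sends one alert when the breach is first seen and then one reminder on every twelfth poll for as long as the slot stays above the lowest threshold.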
Here is what a Slack alert looks like.
- SliceName - Name of the CDC connector
- SlotName - Name of the replication slot
We have three statuses:
- ALARM - This state is reached when the slot size exceeds one or more thresholds.
- RECOVERING - When the slot size comes down from a higher threshold to a lower threshold, the slot is in the RECOVERING state.
- OK - Once the slot size drops below the lowest threshold, the status is set to OK.
- SlotSizeInMB - The size of the slot in the PG server (measured from restart_lsn)
- DataLeftToConsumeInMB - The size of the slot left to be consumed by our connector (measured from confirmed_flush_lsn)
- ThresholdBreachedInGB - The threshold, in GB, that the slot size has crossed (either going above it or coming back down below it).
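To make the two size fields concrete, here is a rough sketch of how they could be derived from PostgreSQL itself. The SQL uses the standard `pg_replication_slots` view and the `pg_wal_lsn_diff` / `pg_current_wal_lsn` functions (PostgreSQL 10 and later). The Python helpers show the same arithmetic on LSN strings. The helper names, and the choice of 1 MB = 1024 * 1024 bytes, are assumptions made for illustration and may differ from how Datacoral computes the reported values.

```python
# Query a logical slot's retained WAL, measured from both LSNs named above.
# Run it with any PostgreSQL client; it is shown here only as a string.
SLOT_SIZE_SQL = """
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) / 1024 / 1024
           AS slot_size_mb,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) / 1024 / 1024
           AS data_left_to_consume_mb
FROM pg_replication_slots
WHERE slot_type = 'logical';
"""


def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '16/B374D848' to a byte position.

    An LSN is printed as two hexadecimal numbers: the high and low 32 bits.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def mb_between(current_lsn: str, older_lsn: str) -> float:
    """Hypothetical helper: WAL volume in MB between two LSNs."""
    return (lsn_to_int(current_lsn) - lsn_to_int(older_lsn)) / (1024 * 1024)
```

Because `confirmed_flush_lsn` is never behind `restart_lsn`, the value measured from `confirmed_flush_lsn` is always less than or equal to the value measured from `restart_lsn`.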
The default thresholds are 10, 25 and 50 GB for now.
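The three statuses and the default thresholds can be combined into a small classification rule. The sketch below is one plausible reading of the descriptions above, not the actual Datacoral logic: the function names are hypothetical, and it assumes "exceeds" means strictly greater than and that a slot which stays at the same threshold after recovering is reported as ALARM again.

```python
from typing import Optional, Tuple

# Default thresholds from the text, in GB.
THRESHOLDS_GB = (10, 25, 50)


def highest_breached(size_gb: float, thresholds=THRESHOLDS_GB) -> Optional[int]:
    """Return the highest threshold the slot size exceeds, or None if there is none."""
    breached = [t for t in thresholds if size_gb > t]
    return max(breached) if breached else None


def classify(prev_breached: Optional[int], size_gb: float) -> Tuple[str, Optional[int]]:
    """Hypothetical helper: map the previous breach and current size to a status.

    Returns (status, threshold currently breached in GB).
    """
    current = highest_breached(size_gb)
    if current is None:
        return "OK", None              # below the lowest threshold
    if prev_breached is not None and current < prev_breached:
        return "RECOVERING", current   # came down from a higher threshold
    return "ALARM", current            # at or above the previous threshold


# Example: a slot that was above 50 GB and has shrunk to 30 GB.
status, threshold = classify(prev_breached=50, size_gb=30)
```

In the example, the slot still exceeds the 25 GB threshold but has dropped below 50 GB, so the rule reports RECOVERING with a breached threshold of 25.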
Please click here to set up Slack notification
Here is the SQS alert format.
Please click here to set up SQS notification
These events can also help in understanding the data volume pattern in the source.
The slot can be in the ALARM state for several reasons:
- Connector is disabled - The user can check this visually in the UI and enable the connector.
- Uncaught exceptions in slot table - The datacoral_slot_table in the loadunits page should be in a successful state. In case of an error, the user can view the logs to find the root cause.
- Issue at source - If the connector is running and reading changes, the root cause is likely at the source. For instance, a significant difference between slotSizeInMB and dataLeftToConsumeInMB (i.e., dataLeftToConsumeInMB is much lower than slotSizeInMB) means that the source is not cleaning up the WAL even though the connector has read the changes. The most likely cause is long-running transactions in the source.
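The "issue at source" check can be expressed as a simple comparison of the two alert fields, followed by a look at open transactions on the source. The sketch below is only an illustration: the function name and the 0.5 ratio cutoff are assumptions (the text only says "much lower"), and the query uses the standard `pg_stat_activity` view to list the oldest open transactions.

```python
# Lists the oldest open transactions on the source database.
# Run it with any PostgreSQL client; it is shown here only as a string.
LONG_RUNNING_TXN_SQL = """
SELECT pid, state, xact_start, now() - xact_start AS xact_age,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;
"""


def source_is_retaining_wal(slot_size_mb: float, data_left_mb: float,
                            ratio: float = 0.5) -> bool:
    """Hypothetical helper: True when most retained WAL was already consumed.

    `ratio` is an illustrative cutoff, not a documented Datacoral value.
    """
    if slot_size_mb <= 0:
        return False
    return data_left_mb / slot_size_mb < ratio
```

For example, a slot reporting 20000 MB retained with only 500 MB left to consume would be flagged, which suggests running the query above on the source to look for long-running transactions.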
Please contact Datacoral's Support Team. We'd be more than happy to answer any of your questions.