PostgreSQL CDC Slot Monitoring

Slot Monitoring and Alerting

When working with complex systems such as Change Data Capture (CDC), it is important to be able to monitor them closely and to be alerted when errors occur or when a human needs to take action. For example, with PostgreSQL CDC, the replication slot can grow when there is a large number of transactions in the database.

Datacoral sets up automatic alerting for when the replication slot reaches certain thresholds.

Alert Frequency

Datacoral checks the logical replication slot size every 5 minutes. If the size crosses the lowest threshold, we emit an alert in the ALARM state.

If the slot size continues to exceed the lowest threshold (i.e., the slot remains in the ALARM state), a notification is sent every hour.

Slack Alert

Here is what a Slack alert looks like.

Terminology:

  • SliceName - Name of the CDC connector
  • SlotName - Name of the replication slot
  • SlotSizeInMB - The size of the slot on the PG server (measured from restart_lsn)
  • DataLeftToConsumeInMB - The amount of slot data still to be consumed by our connector (measured from confirmed_flush_lsn)
  • ThresholdBreachedInGB - The threshold, in GB, that the slot size has crossed, either by going above it or by coming back down below it

Statuses

There are three statuses:

  • ALARM - The slot size exceeds one or more thresholds
  • RECOVERING - The slot size has come down from a higher threshold to a lower threshold
  • OK - The slot size has fallen below the lowest threshold
Note: The default thresholds are currently 10, 25, and 50 GB.
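
The status rules above can be illustrated with a small sketch. This is not Datacoral's implementation; the function names, the 1 GB = 1024 MB conversion, and the idea of comparing against the previously breached threshold are assumptions made only for illustration.

```python
# Hypothetical sketch of the ALARM / RECOVERING / OK rules described above.
MB_PER_GB = 1024  # assumption: the document does not say how MB map to GB
DEFAULT_THRESHOLDS_GB = (10, 25, 50)  # default thresholds from this page


def highest_breached_gb(slot_size_mb, thresholds_gb=DEFAULT_THRESHOLDS_GB):
    """Return the largest threshold (in GB) the slot size exceeds, or None."""
    breached = [t for t in sorted(thresholds_gb) if slot_size_mb > t * MB_PER_GB]
    return breached[-1] if breached else None


def classify(previous_breached_gb, slot_size_mb, thresholds_gb=DEFAULT_THRESHOLDS_GB):
    """Derive a status from the previous breached threshold and the current size."""
    current = highest_breached_gb(slot_size_mb, thresholds_gb)
    if current is None:
        return "OK"  # below the lowest threshold
    if previous_breached_gb is not None and current < previous_breached_gb:
        return "RECOVERING"  # dropped from a higher threshold to a lower one
    return "ALARM"  # still above at least one threshold
```

For example, `classify(50, 30 * 1024)` describes a slot that previously breached 50 GB and now sits at 30 GB, which this sketch reports as RECOVERING.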

Please click here to set up Slack notification


SQS Alert

Here is the SQS alert format.

{
  "category": "collect",
  "eventType": "system",
  "eventSubtype": "slot_management",
  "executionContext": {
    "slotSizeInMB": 1000,
    "dataLeftToConsumeInMB": 1000,
    "slotName": "datacoral_slot_table",
    "thresholdBreachedInGB": 10
  },
  "timestamp": "2021-01-01T00:00:00+00:00",
  "status": "ALARM/OK/RECOVERING",
  "sliceName": "connector-name"
}

Please click here to set up SQS notification

These events can also help in understanding the data volume pattern in the source.
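
A consumer of these events might parse the message body along the following lines. This is a minimal sketch that assumes the body is exactly the JSON shown above and contains a single concrete status; the function name and the flattened output keys are hypothetical, and fetching the message from SQS is left out.

```python
import json
from datetime import datetime

# Sample body in the format shown above, with one concrete status value.
SAMPLE_BODY = """
{
  "category": "collect",
  "eventType": "system",
  "eventSubtype": "slot_management",
  "executionContext": {
    "slotSizeInMB": 1000,
    "dataLeftToConsumeInMB": 1000,
    "slotName": "datacoral_slot_table",
    "thresholdBreachedInGB": 10
  },
  "timestamp": "2021-01-01T00:00:00+00:00",
  "status": "ALARM",
  "sliceName": "connector-name"
}
"""


def parse_slot_alert(body):
    """Flatten a slot_management event into a dict (hypothetical helper)."""
    event = json.loads(body)
    if event.get("eventSubtype") != "slot_management":
        raise ValueError("not a slot_management event")
    ctx = event["executionContext"]
    return {
        "slice_name": event["sliceName"],
        "slot_name": ctx["slotName"],
        "status": event["status"],
        "slot_size_mb": ctx["slotSizeInMB"],
        "data_left_mb": ctx["dataLeftToConsumeInMB"],
        "threshold_gb": ctx["thresholdBreachedInGB"],
        # ISO 8601 timestamp with a UTC offset, as in the sample
        "timestamp": datetime.fromisoformat(event["timestamp"]),
    }


alert = parse_slot_alert(SAMPLE_BODY)
```

Collecting the parsed `slot_size_mb` values over time is one way to look at the data volume pattern in the source.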

Troubleshooting

The slot can be in the ALARM state for several reasons:

  1. Connector is disabled - Check the connector's state in the UI and enable it if it is disabled.

  2. Uncaught exceptions in the slot table - The datacoral_slot_table on the loadunits page should be in a successful state. If it shows an error, view the logs to find the root cause.

  3. Issue at the source - If the connector is running and reading changes, the root cause is likely at the source. For instance, a significant difference between slotSizeInMB and dataLeftToConsumeInMB (i.e., dataLeftToConsumeInMB is much lower than slotSizeInMB) means that the source is not cleaning up the WAL even though the connector has read the changes. The most likely cause is long-running transactions in the source.
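
The third check can be sketched as follows. The two SQL statements use standard PostgreSQL catalog views and functions (PostgreSQL 10 or later for `pg_current_wal_lsn` and `pg_wal_lsn_diff`) and are meant to be run manually against the source; whether Datacoral computes its metrics exactly this way is an assumption. The `likely_cause` helper and its 50% cutoff for a "significant difference" are hypothetical.

```python
# Query to run on the source: slot size and unconsumed data per logical slot.
SLOT_SIZE_SQL = """
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) / 1024 / 1024
           AS slot_size_mb,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) / 1024 / 1024
           AS data_left_mb
FROM pg_replication_slots
WHERE slot_type = 'logical';
"""

# Query to run on the source: oldest open transactions first.
LONG_TRANSACTIONS_SQL = """
SELECT pid, state, xact_start, now() - xact_start AS xact_age
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;
"""


def likely_cause(slot_size_mb, data_left_mb, gap_ratio=0.5):
    """Give a rough hint about where to look (gap_ratio is an assumed cutoff)."""
    if slot_size_mb <= 0:
        return "none"  # nothing is retained in the slot
    # WAL that the connector has already read but the source still retains
    read_but_retained_mb = slot_size_mb - data_left_mb
    if read_but_retained_mb / slot_size_mb >= gap_ratio:
        return "source"  # e.g. long-running transactions holding back cleanup
    return "connector"  # most of the slot has not been consumed yet
```

With the values from an alert, `likely_cause(1000, 100)` points at the source, while `likely_cause(1000, 950)` suggests the connector is behind.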

Questions?

Please contact Datacoral's Support Team. We'd be more than happy to answer any of your questions.