Data Quality Checks - Troubleshooting Guide
At times, a data quality (DQ) check fails, pointing to a discrepancy between the source and the destination. This document explains how to identify the discrepancy and how to fix it. The first step is to use our helper command to generate the debug queries that are used in the subsequent steps.
DQ Debug Queries
The helper CLI command generates the debug queries referenced throughout this guide. See its help output for the most common invocation, a generation example, and a runtime example.
DQ Errors
Here is the list of DQ errors; troubleshooting guidelines for each follow.

DQ Error | Error Message |
---|---|
Duplicate Count Check | ({dst_pk_distinct_count} == {dst_total_count}) and ({src_total_count} == {src_pk_distinct_count}) # This check fails when there are duplicate records of a primary key either in the source or in the destination |
Checksum Mismatch | ({checksum_mismatch_count} is None or {checksum_mismatch_count} == 0) # This check fails when there is a checksum mismatch |
Update not propagated | ({dst_older_ts_count} is None or {dst_older_ts_count} == 0) # This check fails when updates are not propagated; ({src_older_ts_count} is None or {src_older_ts_count} == 0) # This check fails when updates are propagated but the synopsis took time to pull them |
Delete not propagated | ({dst_only_pk_count} is None or {dst_only_pk_count} == 0) # This check fails when there are extra primary keys at the destination, i.e. deletes are not propagated |
Insert not propagated | ({src_only_pk_count} is None or {src_only_pk_count} == 0) # This check fails when there are missing primary keys at the destination, i.e. inserts are not propagated |
Duplicate records at source or destination
Why does this happen?
- This check fails when there are duplicate records of the primary key either in the source or destination.
- This could be an issue where multiple updates of the same id have not been merged at the destination.
- Total count and distinct count of primary keys at the source and destination are checked (independently), and an error is thrown if there is a mismatch (see the sketch below)
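For intuition, the comparison behind this check boils down to counting total rows versus distinct primary keys on each side. A minimal sketch in SQL, assuming a placeholder table my_schema.my_table with primary key id:

```sql
-- Run independently on the source and on the destination.
-- Hypothetical table/column names; substitute your loadunit's table and primary key.
SELECT COUNT(*)           AS total_count,
       COUNT(DISTINCT id) AS pk_distinct_count
FROM my_schema.my_table;
-- The duplicate count check fails if total_count <> pk_distinct_count on either side.
```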
Where to check?
- Duplicates at the source will need to be identified by the customer and de-duped accordingly.
- Duplicates in the destination: run the query below in Redshift to see the ids with a count greater than 1.
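A minimal sketch of such a query, assuming a placeholder table my_schema.my_table with primary key id:

```sql
-- List the ids that appear more than once in the destination (Redshift).
SELECT id, COUNT(*) AS record_count
FROM my_schema.my_table
GROUP BY id
HAVING COUNT(*) > 1
ORDER BY record_count DESC;
```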
How can it be fixed?
- If duplicates at the source have crept into the destination (this is usually not possible because Datacoral merges records in the destination), a clean full backfill is ideal once the source system has been cleaned up. Note that the change (de-dupe activity) at the source will be captured and will propagate appropriately to the destination.
- If duplicates are identified at the destination, records with an older updated_at can be deleted (see the sketch below).
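A hedged sketch of such a cleanup in Redshift, assuming placeholder names my_schema.my_table, id, and updated_at; review the rows that would be removed before running the delete:

```sql
-- Step 1: for every duplicated id, capture the latest updated_at.
CREATE TEMP TABLE latest_records AS
SELECT id, MAX(updated_at) AS max_updated_at
FROM my_schema.my_table
GROUP BY id
HAVING COUNT(*) > 1;

-- Step 2: delete the older copies, keeping the most recent record per id.
DELETE FROM my_schema.my_table
USING latest_records
WHERE my_schema.my_table.id = latest_records.id
  AND my_schema.my_table.updated_at < latest_records.max_updated_at;
```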
Checksum mismatch
Why does this happen?
- For an absolute check of records, Datacoral compares the checksum of the entire record between the source and the destination.
- This check fails when there is a checksum mismatch, i.e. when any column differs between the source and the destination for a given primary key and updated_at.
- This can happen due to:
  - Certain nuances of the source or destination that can lead to a mismatch (e.g. a precision-based float data type in Postgres does not add padded zeroes while Redshift does; the data is the same, but the checksum differs, as sketched below)
  - The parser may incorrectly handle certain data types
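To illustrate the first nuance, the row checksum is typically an MD5 over a text rendering of the columns, so any formatting difference changes the hash even when the data is logically identical. A rough sketch with placeholder names (this is not the exact expression Datacoral uses):

```sql
-- Hypothetical row-level checksum over the text form of each column.
-- A float rendered as '1.5' on the source but '1.50' on the destination
-- produces different hashes even though the value is the same.
SELECT id,
       MD5(COALESCE(CAST(col_a AS VARCHAR), '') || '|' ||
           COALESCE(CAST(col_b AS VARCHAR), '') || '|' ||
           COALESCE(CAST(updated_at AS VARCHAR), '')) AS row_checksum
FROM my_schema.my_table
WHERE id IN (101, 102, 103);  -- pks identified by the 'mismatch' query
```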
Where to check?
- Generate all the dq-debug-queries by executing the CLI command described under DQ Debug Queries above
- Execute the generated 'mismatch' query against the destination to identify the pks having a mismatch
- Now we will use those pks in all other queries to identify the root cause of the issue (Tip - You can use a subset of the pks for a faster turnaround)
- Execute the 'source' query to get column level MD5 of the identified pks in source
- Execute the 'destination' query to get column level MD5 of the identified pks in destination
- Compare the results of the source and destination queries to identify the column whose MD5 does not match (a sketch of such a per-column comparison follows this list)
- Execute the generated loadunit filter 'athena' query to get the updates that made it to the corresponding table in S3 and to confirm that the timestamp columns are getting updated
- Execute the generated slot 'athena' query to get the updates that made it to the slot
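If you want to spot the offending column manually, a per-column MD5 over the same pks can be run on both sides and diffed. A sketch with placeholder names:

```sql
-- Run the equivalent statement on both the source and the destination,
-- then diff the output column by column to find the mismatching MD5.
SELECT id,
       MD5(COALESCE(CAST(col_a AS VARCHAR), '')) AS col_a_md5,
       MD5(COALESCE(CAST(col_b AS VARCHAR), '')) AS col_b_md5
FROM my_schema.my_table
WHERE id IN (101, 102, 103)   -- subset of pks from the 'mismatch' query
ORDER BY id;
```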
How can it be fixed?
- The most common cause of the failure is that the timestamp column is not getting updated correctly at the source. In that case, the customer needs to ensure that updates to their timestamp column are reliable.
- If the timestamp column is getting updated correctly, then please open a support ticket with the results of the queries that you executed.
Updates are not propagated
Why does this happen?
- This check fails when updates have not propagated. The failure may occur when:
  - The update did not make it to the slot table - AWS issues
  - The update did not make it to S3 (individual tables) - infrastructure issues of data not propagating from the slot table to individual tables
  - The update did not make it to the destination - infrastructure issues of data not propagating from S3 to the destination
  - Replica lag caused a discrepancy, which in turn can result in DQ failures (src_older_ts_count). Note: replica lag will not result in failures for ingest_mysqlcdc
Where to check?
- Generate all the dq-debug-queries by executing the CLI command described under DQ Debug Queries above
- Execute the generated 'mismatch' query against the destination to identify the pks resulting in dq-failures
- Now we will use those pks in all other queries to identify the root cause of the issue (Tip - You can use a subset of the pks for a faster turnaround)
- Execute the 'source' query to get timestamps of the records in source
- Execute the 'destination' query to get timestamps of the records in destination
- This check fails when the timestamp columns at the source and destination do not match; if the timestamps do match, the failure is likely due to replica lag (see the sketch after this list)
- Execute the generated loadunit filter 'athena' query to get the updates that made it to the corresponding table in S3
- Execute the generated slot 'athena' query to get the updates that made it to the slot
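As a manual cross-check of the timestamps, run the same statement on both the source and the destination for the flagged pks and compare; a sketch with placeholder names:

```sql
-- Destination rows whose updated_at is older than the source's for the same id
-- indicate that the update did not propagate; matching timestamps point to replica lag.
SELECT id, updated_at
FROM my_schema.my_table
WHERE id IN (101, 102, 103)   -- subset of pks from the 'mismatch' query
ORDER BY id;
```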
How can it be fixed?
- If your CDC connector is not ingest_mysqlcdc and the timestamps at the source and destination match, then reprocess the synopsis loadunit to take care of the replica lag and see if the issue is resolved
- If the previous step did not resolve the issue, then carry out a partial backfill for the missing update time window OR a partial backfill of individual primary keys
- Also, please open a support ticket with the results of the queries that you executed
Deletes are not propagated
Why does this happen?
- This check can fail if the warehouse load of the CDC connector had not completed by the time the dq-checks ran (not a true data error, but a DQ check error)
- This check fails when there was an error while loading deleted rows into the deleted-rows table in Redshift (not a true data error, but a DQ check error)
- Actual deletes were not applied at the destination (a true data error)
Where to check?
- Generate all the dq-debug-queries by executing the CLI command described under DQ Debug Queries above
- Execute the generated 'mismatch' query against the destination to identify the pks that were not deleted in destination
- Now we will use those pks in all other queries to identify the root cause of the issue (Tip - You can use a subset of the pks for a faster turnaround)
- Execute the 'source' query to get column level MD5 of the identified pks in source - this should return zero records
- Execute the 'destination' query to get column level MD5 of the identified pks in destination
- Execute the 'deleted' query to check if the deletes were correctly captured (a verification sketch follows this list)
- Execute the generated loadunit filter 'athena' query to get the updates that made it to the corresponding table in S3
- Execute the generated slot 'athena' query to get the updates that made it to the slot
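To confirm whether the deletes actually reached the destination, probe the destination for the flagged pks; any rows that still come back are deletes that have not been applied. A sketch with placeholder names:

```sql
-- Any rows returned here are primary keys that were deleted at the source
-- but are still present in the destination table.
SELECT id, updated_at
FROM my_schema.my_table
WHERE id IN (101, 102, 103);  -- pks flagged by the 'mismatch' query
```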
How can it be fixed?
- Most common: if the failure is because the Redshift load had not happened when the dq-checks ran, then a reprocess of mismatchMV will resolve the issue
- If the previous CLI command does not resolve the issue, then it means that deletes have not propagated and a full backfill will be needed, as a partial backfill will not be sufficient
- Also, please open a support ticket with the results of the queries that you executed
Inserts are not propagated
Why does this happen?
Records that are present at the source but missing in Redshift can occur for reasons similar to updates not being propagated.
Where to check?
- Generate all the dq-debug-queries by executing the CLI command described under DQ Debug Queries above
- Execute the generated 'mismatch' query against the destination to identify the pks resulting in dq-failures
- Now we will use those pks in all other queries to identify the root cause of the issue (Tip - You can use a subset of the pks for a faster turnaround)
- Execute the 'source' query to get timestamps of the records in source
- Execute the 'destination' query to get timestamps of the records in the destination - this should return zero records
- This check fails when certain records are found in the source but not in the destination (a sketch for listing them follows this list)
- Execute the generated loadunit filter 'athena' query to get the updates that made it to the corresponding table in S3
- Execute the generated slot 'athena' query to get the updates that made it to the slot
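If you have exported the affected pks from the source, a quick way to list the ones missing at the destination is an anti-join against a staging table of those pks; a sketch with placeholder names:

```sql
-- Stage the ids pulled from the source for the affected time window
-- (load them via COPY or INSERT statements).
CREATE TEMP TABLE source_pks (id BIGINT);

-- Ids present at the source but absent from the destination are inserts
-- that have not propagated.
SELECT s.id
FROM source_pks s
LEFT JOIN my_schema.my_table d ON d.id = s.id
WHERE d.id IS NULL;
```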
How can it be fixed?
- Carry out a partial backfill for the time window of the missing records OR a partial backfill of individual primary keys
- Also, please open a support ticket with the results of the queries that you executed
How to get DQ to succeed?
Once the data is fixed, DQ can be re-triggered by reprocessing <loadunit_name>_pks for the corresponding day.