Data Quality Checks - Troubleshooting guide

At times, a data quality (DQ) check fails, which points to a discrepancy between the source and the destination. This document explains how to identify the discrepancy and how to fix it. The first step is to use our helper command to generate the debug queries that are used in the subsequent steps.

DQ DEBUG QUERIES

Most common invocation

datacoral generate dq-debug-queries --warehouse-name <warehouse-name> \
                                    --cdc-slice-name <cdc-slice-name> \
                                    --loadunits <loadunits> \
                                    --timelabel <timelabel>

help

datacoral generate dq-debug-queries -h
Usage: dq-debug-queries [options]
generate dq-debug-queries for failed dq-checks
Options:
--warehouse-name <warehouse-name> Warehouse for which we need debug queries
--cdc-slice-name <cdc-slice-name> Name of the active cdc slice
--loadunits <loadunits> Loadunits for which debug queries should be generated
--timelabel <timelabel> Timelabel for which debug queries should be generated
--full-check If you want to generate queries for full check synopsis connector
--timestamp-column [timestamp-column] Default timestamp column which is used to track updates. If not found, one of the available timestamp columns will be used
--synopsis-slice-name [synopsis-slice-name] If you have non-standard synopsis slice name
--synopsis-slice-mv-schema [synopsis-slice-mv-schema] If it is not same as the synopsis-slice-name
--sub-query [sub-query] The sub-query you want to generate: mismatch, summary, source, destination, deleted, athenaSlot, athenaLoadunitFilter
--skip-checksum If you want to generate debug queries without checksum comparison
-h, --help output usage information

Generation Example

datacoral generate dq-debug-queries --cdc-slice-name mysql_connector1 --warehouse-name redshift --timelabel 20210402234500 --loadunits a_table_multi_pk
2021-04-02T17:23:55-07:00 - info:
--FYI: summary 'redshift' query
SELECT dst_only_pk_count, src_only_pk_count, both_pk_count, mismatch_pk_count, checksum_mismatch_count, checksum_verified_count, checksum_invalid_count, both_same_ts_count, dst_older_ts_count, src_older_ts_count, both_combined_count, dst_total_count, src_total_count, dst_pk_distinct_count, src_pk_distinct_count, __dc_timelabel_col FROM mysql_connector1_synopsis.mv_dq_summary_a_table_multi_pk WHERE __dc_timelabel_col = '2021-04-02T23:45:00';
--mismatch 'redshift' query
SELECT pk1, pk1, src_ts, dst_ts, ts_comparison, pk_comparison, checksum_comparison, __dc_timelabel_col FROM mysql_connector1_synopsis.mv_dq_mismatch_a_table_multi_pk WHERE checksum_comparison <> 'checksum_verified' AND __dc_timelabel_col = '2021-04-02T23:45:00' ORDER BY src_ts DESC;
--source 'mysql' query
SELECT `id2`, `id`, `updated_at`, md5(md5(COALESCE(LEFT(CAST(`val1` AS CHAR), 65535), ''))) AS `val1` FROM `a_table_multi_pk` WHERE (`id2`, `id`) IN (<pks from mismatchQuery>);
--slot 'athena' query
SELECT json_extract_scalar(row, '$.id2') as "id2", json_extract_scalar(row, '$.id') as "id", json_extract_scalar(row, '$.updated_at') as "updated_at", cast(json_extract(row, '$.__dc_cdc_deleted_at') as bigint) AS "__dc_cdc_deleted_at", cast(json_extract(row, '$.__dc_cdc_modified_at') as bigint) AS "__dc_cdc_modified_at", md5(md5(to_utf8(COALESCE(SUBSTR(CAST(json_extract_scalar(row, '$.val1') AS VARCHAR), 1, 65535), '')))) AS "val1" FROM datacoral_secure_mysql_connector1.datacoral_slot_table WHERE y = '2021' AND m = '04' AND d = '02' AND tablename = 'a_table_multi_pk' AND json_extract_scalar(row, '$.id2') IN (<pks from mismatchQuery>) AND json_extract_scalar(row, '$.id') IN (<pks from mismatchQuery>);
--loadunit filter 'athena' query
SELECT "id2", "id", "updated_at", "__dc_cdc_deleted_at", md5(md5(to_utf8(COALESCE(SUBSTR(CAST("val1" AS VARCHAR), 1, 65535), '')))) AS "val1" FROM datacoral_secure_mysql_connector1.a_table_multi_pk WHERE y = '2021' AND m = '04' AND d = '02' AND "id2" IN (<pks from mismatchQuery>) AND "id" IN (<pks from mismatchQuery>);
--deleted 'redshift' query
SELECT * FROM (SELECT json_extract_path_text("ROW", 'id2') AS "id2", json_extract_path_text("ROW", 'id') AS "id", MAX(TIMESTAMP 'epoch' + __dc_cdc_deleted_at/1000000000.0 * interval '1 second') AS __dc_cdc_deleted_at_timestamp FROM mysql_connector1.cdc_deleted_rows WHERE tablename = 'a_table_multi_pk' GROUP BY "id2", "id") WHERE ("id2", "id") IN (<pks from mismatchQuery>);
--destination 'redshift' query
SELECT "id2", "id", "updated_at", __dc_cdc_modified_at, __dc_cdc_deleted_at, md5(md5(COALESCE(LEFT(CAST("val1" AS VARCHAR), 65535), ''))) AS "val1" FROM "mysql_connector1"."a_table_multi_pk" WHERE ("id2", "id") IN (<pks from mismatchQuery>);

Runtime Example

--summary
SELECT dst_only_pk_count, src_only_pk_count, both_pk_count, mismatch_pk_count, checksum_mismatch_count, checksum_verified_count, checksum_invalid_count, both_same_ts_count, dst_older_ts_count, src_older_ts_count, both_combined_count, dst_total_count, src_total_count, dst_pk_distinct_count, src_pk_distinct_count, __dc_timelabel_col FROM mysql_connector1_synopsis.mv_dq_summary_a_table_multi_pk WHERE __dc_timelabel_col = '2021-04-02T23:45:00';
--mismatch
SELECT pk1, pk1, src_ts, dst_ts, ts_comparison, pk_comparison, checksum_comparison, __dc_timelabel_col FROM mysql_connector1_synopsis.mv_dq_mismatch_a_table_multi_pk WHERE checksum_comparison <> 'checksum_verified' AND __dc_timelabel_col = '2021-04-02T23:45:00' ORDER BY src_ts DESC;
--source
SELECT `id2`, `id`, `updated_at`, md5(md5(COALESCE(LEFT(CAST(`val1` AS CHAR), 65535), ''))) AS `val1` FROM `a_table_multi_pk` WHERE (`id2`, `id`) IN (('24',1500023));
--slot
SELECT json_extract_scalar(row, '$.id2') as "id2", json_extract_scalar(row, '$.id') as "id", json_extract_scalar(row, '$.updated_at') as "updated_at", cast(json_extract(row, '$.__dc_cdc_deleted_at') as bigint) AS "__dc_cdc_deleted_at", cast(json_extract(row, '$.__dc_cdc_modified_at') as bigint) AS "__dc_cdc_modified_at", md5(md5(to_utf8(COALESCE(SUBSTR(CAST(json_extract_scalar(row, '$.val1') AS VARCHAR), 1, 65535), '')))) AS "val1" FROM datacoral_secure_mysql_connector1.datacoral_slot_table WHERE y = '2021' AND m = '04' AND d = '02' AND tablename = 'a_table_multi_pk' AND json_extract_scalar(row, '$.id2') IN ('24') AND json_extract_scalar(row, '$.id') IN ('1500023');
--loadunit filter
SELECT "id2", "id", "updated_at", "__dc_cdc_deleted_at", md5(md5(to_utf8(COALESCE(SUBSTR(CAST("val1" AS VARCHAR), 1, 65535), '')))) AS "val1" FROM datacoral_secure_mysql_connector1.a_table_multi_pk WHERE y = '2021' AND m = '04' AND d = '02' AND "id2" IN ('24') AND "id" IN (1500023);
--deleted
SELECT * FROM (SELECT json_extract_path_text("ROW", 'id2') AS "id2", json_extract_path_text("ROW", 'id') AS "id", MAX(TIMESTAMP 'epoch' + __dc_cdc_deleted_at/1000000000.0 * interval '1 second') AS __dc_cdc_deleted_at_timestamp FROM mysql_connector1.cdc_deleted_rows WHERE tablename = 'a_table_multi_pk' GROUP BY "id2", "id") WHERE ("id2", "id") IN (('24', 1500022));
--destination
SELECT "id2", "id", "updated_at", __dc_cdc_modified_at, __dc_cdc_deleted_at, md5(md5(COALESCE(LEFT(CAST("val1" AS VARCHAR), 65535), ''))) AS "val1" FROM "mysql_connector1"."a_table_multi_pk" WHERE ("id2", "id") IN (('24',1500023));

DQ Errors

Here is the list of DQ errors, the check condition (error message) for each, and what a failure means:

• Duplicate Count Check: ({dst_pk_distinct_count} == {dst_total_count}) and ({src_total_count} == {src_pk_distinct_count})
  # This check fails when there are duplicate records of a primary key either in the source or the destination
• Checksum Mismatch: ({checksum_mismatch_count} is None or {checksum_mismatch_count} == 0)
  # This check fails when the checksum of a record does not match between source and destination
• Update not propagated:
  • ({dst_older_ts_count} is None or {dst_older_ts_count} == 0)
    # This check fails when updates are not propagated to the destination
  • ({src_older_ts_count} is None or {src_older_ts_count} == 0)
    # This check fails when updates are propagated but the synopsis took time to pull them
• Delete not propagated: ({dst_only_pk_count} is None or {dst_only_pk_count} == 0)
  # This check fails when there are extra primary keys at the destination, i.e. deletes are not propagated
• Insert not propagated: ({src_only_pk_count} is None or {src_only_pk_count} == 0)
  # This check fails when primary keys are missing at the destination, i.e. inserts are not propagated
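
Before diving into a specific error, it can help to pull the raw counters behind these checks from the summary materialized view to see which condition tripped. The sketch below follows the <slice>_synopsis.mv_dq_summary_<loadunit> naming used in the examples above; substitute your own schema, loadunit, and timelabel.

-- Minimal sketch: inspect the DQ summary counters for one timelabel
SELECT
    dst_total_count,         -- total rows at the destination
    dst_pk_distinct_count,   -- distinct primary keys at the destination (should equal dst_total_count)
    src_total_count,         -- total rows at the source
    src_pk_distinct_count,   -- distinct primary keys at the source (should equal src_total_count)
    checksum_mismatch_count, -- Checksum Mismatch
    dst_older_ts_count,      -- Update not propagated
    src_older_ts_count,      -- update propagated, but the synopsis lagged
    dst_only_pk_count,       -- Delete not propagated
    src_only_pk_count        -- Insert not propagated
FROM mysql_connector1_synopsis.mv_dq_summary_a_table_multi_pk
WHERE __dc_timelabel_col = '2021-04-02T23:45:00';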

Duplicate records at source or destination

Why does this happen?

• This check fails when there are duplicate records of a primary key either in the source or the destination.
• This could be an issue where multiple updates of the same id have not been merged at the destination.
• The total count and the distinct count of primary keys are compared at the source and at the destination (independently), and an error is thrown if either pair does not match (see the sketch below).
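
As a quick sanity check, you can compute the two counts yourself at the destination. A minimal sketch, assuming a single-column primary key named id (for composite keys, count the distinct combination of all primary-key columns):

SELECT
    COUNT(*) AS total_count,                 -- corresponds to dst_total_count
    COUNT(DISTINCT id) AS pk_distinct_count  -- corresponds to dst_pk_distinct_count
FROM <schema_name>.<table_name>;
-- The check expects total_count == pk_distinct_count; run the equivalent query at the source as well.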

Where to check?

• Duplicates at the source will need to be identified by the customer and de-duplicated accordingly.
• Duplicates at the destination: below is the query to find the duplicate ids in Redshift.

Run this query in Redshift to see the ids with a count greater than 1 (for composite primary keys, group by all primary-key columns):

SELECT
    id,
    COUNT(*) AS duplicate_count
FROM <schema_name>.<table_name>
GROUP BY id
HAVING COUNT(*) > 1;

How can it be fixed?

• If duplicates at the source have crept into the destination (this is usually not possible, as Datacoral merges records in the destination), a clean full backfill is ideal once the source system has been cleaned up.

  note

  Note that the change (de-dupe activity) at the source will be captured and will propagate appropriately to the destination.

• If duplicates are identified at the destination, the records with the older updated_at can be deleted (a sketch follows below).
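
If you do decide to remove destination-side duplicates, the sketch below keeps only the row with the latest updated_at per primary key. It assumes a single-column primary key named id and an updated_at column (placeholders here, not a prescribed schema); adapt the key list for composite keys, and take a snapshot or backup before deleting. Exact duplicates that share the same updated_at are not removed by this query.

-- Minimal sketch (Redshift): delete all but the most recent version of each id
DELETE FROM <schema_name>.<table_name>
USING (
    SELECT id, MAX(updated_at) AS max_updated_at
    FROM <schema_name>.<table_name>
    GROUP BY id
) latest
WHERE <schema_name>.<table_name>.id = latest.id
  AND <schema_name>.<table_name>.updated_at < latest.max_updated_at;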

Checksum mismatch

Why does this happen?

• For an absolute check of records, Datacoral compares the checksum of the entire record between the source and the destination.
• This check fails when there is a checksum mismatch, i.e. when any column differs between the source and the destination for a given primary key and updated_at.
• This can happen due to:
  • Certain nuances of the source or destination that can lead to a mismatch (e.g. a precision-based float data type in Postgres does not add padded zeroes, while Redshift does; the data is the same, but the checksum check fails, as illustrated below)
  • The parser may incorrectly handle certain data types
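
To see why formatting differences alone can break the comparison, note that the checksum is computed over the text form of each value, so two textual renderings of the same number hash differently. A small generic illustration (not the exact expression Datacoral generates):

SELECT
    md5('1.5')  AS md5_without_padded_zero,
    md5('1.50') AS md5_with_padded_zero;  -- different hashes for the same numeric value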

Where to check?

• Generate all the dq-debug-queries by executing the CLI command described in DQ DEBUG QUERIES above
• Execute the generated 'mismatch' query against the destination to identify the pks having a mismatch
• Use those pks in all of the other queries to identify the root cause of the issue (tip: you can use a subset of the pks for a faster turnaround)
• Execute the 'source' query to get the column-level MD5s of the identified pks at the source
• Execute the 'destination' query to get the column-level MD5s of the identified pks at the destination
• Compare the results of the source and destination queries to identify the column whose MD5 does not match (see the sketch after this list)
• Execute the generated loadunit filter 'athena' query to get the updates that made it to the corresponding table in S3 and to confirm that the timestamp columns are getting updated
• Execute the generated slot 'athena' query to get the updates that made it to the slot
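
Once a particular column is suspect, it can help to look at the raw value next to its computed checksum at the destination so that formatting differences become visible. A hedged sketch that mirrors the generated 'destination' query for the example loadunit above (substitute your own schema, table, column, and pks):

SELECT "id2", "id", "val1" AS raw_val1,
       md5(md5(COALESCE(LEFT(CAST("val1" AS VARCHAR), 65535), ''))) AS val1_md5
FROM "mysql_connector1"."a_table_multi_pk"
WHERE ("id2", "id") IN (('24', 1500023));  -- pks taken from the mismatch query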

How can it be fixed?

• The most common cause of this failure is that the timestamp column is not getting updated correctly at the source. In that case, the customer needs to ensure that updates to their timestamp column are reliable.
• If the timestamp column is getting updated correctly, please open a support ticket with the results of the queries that you executed.

Updates are not propagated

Why does this happen?

• This check fails when updates have not propagated. The failure may occur when:
  • The update did not make it to the slot table (AWS issues)
  • The update did not make it to S3 (individual tables): infrastructure issues with data not propagating from the slot table to the individual tables
  • The update did not make it to the destination: infrastructure issues with data not propagating from S3 to the destination
  • Replica lag caused a discrepancy, which in turn can result in DQ failures (src_older_ts_count). Note: replica lag will not result in failures for ingest_mysqlcdc

Where to check?

• Generate all the dq-debug-queries by executing the CLI command described in DQ DEBUG QUERIES above
• Execute the generated 'mismatch' query against the destination to identify the pks resulting in dq-failures
• Use those pks in all of the other queries to identify the root cause of the issue (tip: you can use a subset of the pks for a faster turnaround)
• Execute the 'source' query to get the timestamps of the records at the source
• Execute the 'destination' query to get the timestamps of the records at the destination
• The check fails where the timestamp columns at the source and destination do not match; if the timestamps do match, the failure might be due to replica lag (see the sketch after this list)
• Execute the generated loadunit filter 'athena' query to get the updates that made it to the corresponding table in S3
• Execute the generated slot 'athena' query to get the updates that made it to the slot
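
The mismatch materialized view already carries both timestamps, so a quick way to separate genuine propagation gaps from replica lag is to compare src_ts and dst_ts for the affected keys. A minimal sketch for the example loadunit above (rows present on only one side will have a NULL on the other):

SELECT pk1, src_ts, dst_ts, ts_comparison
FROM mysql_connector1_synopsis.mv_dq_mismatch_a_table_multi_pk
WHERE __dc_timelabel_col = '2021-04-02T23:45:00'
  AND (src_ts <> dst_ts OR src_ts IS NULL OR dst_ts IS NULL)  -- keys where the two sides disagree
ORDER BY src_ts DESC;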

How can it be fixed?

• If your CDC connector is not ingest_mysqlcdc and the timestamps at the source and destination match, reprocess the synopsis loadunit to take care of the replica lag and see if that addresses the issue:

  datacoral collect reprocess --slice-name <slice-name> \
                              --loadunits <loadunit_name>_pks \
                              --start-timelabel <YYYYMMDDHHmmss> \
                              --force

• If the previous CLI did not resolve the issue, carry out a partial backfill for the missing update time window OR a partial backfill of the individual primary keys
• Also, please open a support ticket with the results of the queries that you executed

Deletes are not propagated

Why does this happen?

• This check can fail if the warehouse load of the CDC connector had not completed by the time the dq-checks ran (not a true data error but a DQ check error)
• This check fails when there was an error while loading deleted rows into the deleted-rows table in Redshift (not a true data error but a DQ check error)
• Actual deletes are not applied on the destination (data error)

Where to check?

• Generate all the dq-debug-queries by executing the CLI command described in DQ DEBUG QUERIES above
• Execute the generated 'mismatch' query against the destination to identify the pks that were not deleted in the destination
• Use those pks in all of the other queries to identify the root cause of the issue (tip: you can use a subset of the pks for a faster turnaround)
• Execute the 'source' query to get the column-level MD5s of the identified pks at the source; this should return zero records
• Execute the 'destination' query to get the column-level MD5s of the identified pks at the destination
• Execute the 'deleted' query to check if the deletes were correctly captured (see the sketch after this list)
• Execute the generated loadunit filter 'athena' query to get the updates that made it to the corresponding table in S3
• Execute the generated slot 'athena' query to get the updates that made it to the slot
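
To confirm whether a delete was captured but never applied, you can check which keys still exist in the destination table while also appearing in the deleted-rows table. A hedged sketch for the example loadunit above, built from the generated 'deleted' and 'destination' queries (the CAST assumes id is numeric at the destination; adjust for your column types):

SELECT t."id2", t."id", d.__dc_cdc_deleted_at_timestamp
FROM "mysql_connector1"."a_table_multi_pk" t
JOIN (
    SELECT json_extract_path_text("ROW", 'id2') AS "id2",
           json_extract_path_text("ROW", 'id') AS "id",
           MAX(TIMESTAMP 'epoch' + __dc_cdc_deleted_at/1000000000.0 * interval '1 second') AS __dc_cdc_deleted_at_timestamp
    FROM mysql_connector1.cdc_deleted_rows
    WHERE tablename = 'a_table_multi_pk'
    GROUP BY "id2", "id"
) d ON t."id2" = d."id2" AND t."id" = CAST(d."id" AS BIGINT);
-- Any rows returned were captured as deletes but are still present at the destination.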

How can it be fixed?

• Most common: if the failure is because the Redshift load had not happened when the dq-checks ran, then reprocessing the mismatch MV will resolve the issue:

  datacoral organize matview-reprocess --warehouse-name <warehouse-name> \
                                       --matview-schema <schema_name>_synopsis \
                                       --matview-name dq_mismatch_<loadunit_name> \
                                       --timelabel <timelabel> --force

• If the previous CLI does not resolve the issue, it means that deletes have not propagated and a full backfill will be needed; a partial backfill will not be sufficient
• Also, please open a support ticket with the results of the queries that you executed

Insert not propagated

Why does this happen?

Records that are present at the source but missing in Redshift can occur for the same reasons as updates not propagating.

Where to check?

• Generate all the dq-debug-queries by executing the CLI command described in DQ DEBUG QUERIES above
• Execute the generated 'mismatch' query against the destination to identify the pks resulting in dq-failures
• Use those pks in all of the other queries to identify the root cause of the issue (tip: you can use a subset of the pks for a faster turnaround)
• Execute the 'source' query to get the timestamps of the records at the source
• Execute the 'destination' query to get the timestamps of the records at the destination; this should return zero records
• The check fails when certain records are found at the source but not at the destination
• Execute the generated loadunit filter 'athena' query to get the updates that made it to the corresponding table in S3
• Execute the generated slot 'athena' query to get the updates that made it to the slot

How can it be fixed?

• Carry out a partial backfill for the time window of the missing records OR a partial backfill of the individual primary keys
• Also, please open a support ticket with the results of the queries that you executed

How to get DQ to succeed?

Once the data is fixed, DQ can be re-triggered by reprocessing <loadunit_name>_pks for the corresponding day.
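
For example, using the reprocess command shown earlier (substitute your slice name, loadunit, and a timelabel from the corresponding day):

datacoral collect reprocess --slice-name <slice-name> \
                            --loadunits <loadunit_name>_pks \
                            --start-timelabel <YYYYMMDDHHmmss> \
                            --force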