Data Quality Checks - Troubleshooting Guide

This guide helps identify the root cause of a data quality (DQ) check failure.

DQ Errors

Below is the list of DQ errors and the troubleshooting guidelines for each.

DQ Error: Duplicate Count Check
Error Message: ({dst_pk_distinct_count} == {dst_total_count}) and ({src_total_count} == {src_pk_distinct_count})
# This check fails when there are duplicate records of a primary key either in the source or the destination

DQ Error: Checksum Mismatch
Error Message: ({checksum_mismatch_count} is None or {checksum_mismatch_count} == 0)
# This check fails when the checksum of a record differs between the source and the destination

DQ Error: Update not propagated
Error Message:
  • ({dst_older_ts_count} is None or {dst_older_ts_count} == 0)
    # This check fails when updates are not propagated to the destination
  • ({src_older_ts_count} is None or {src_older_ts_count} == 0)
    # This check fails when updates are propagated but the synopsis took time to pull them

DQ Error: Delete not propagated
Error Message: ({dst_only_pk_count} is None or {dst_only_pk_count} == 0)
# This check fails when there are extra primary keys in the destination, i.e. deletes are not propagated

DQ Error: Insert not propagated
Error Message: ({src_only_pk_count} is None or {src_only_pk_count} == 0)
# This check fails when there are missing primary keys at the destination, i.e. inserts are not propagated
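
As a starting point, a quick summary of which comparison is failing can help pick the right section below. This is an optional sketch; it assumes the mv_dq_mismatch_<table_name> synopsis view that the queries in the following sections also use.

SELECT
  ts_comparison,
  pk_comparison,
  checksum_comparison,
  COUNT(*) AS mismatch_count
FROM
  <schema_name>_synopsis.mv_dq_mismatch_<table_name>
GROUP BY
  ts_comparison,
  pk_comparison,
  checksum_comparison
ORDER BY
  mismatch_count DESC;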

Duplicate records at source or destination

Why does this happen?

• This check fails when there are duplicate records of the primary key either in the source or the destination.
• This could be an issue where multiple updates of the same id have not been merged at the destination.
• The total count and distinct count of primary keys at the source and destination are checked (independently) and an error is thrown if there is a mismatch. A sketch of this comparison is shown below.
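
A minimal sketch of the count comparison in the destination warehouse, assuming the primary key column is id (the same comparison is run independently at the source):

-- If total_count is greater than pk_distinct_count, the table contains duplicate primary keys.
SELECT
  COUNT(*) AS total_count,
  COUNT(DISTINCT id) AS pk_distinct_count
FROM
  <schema_name>.<table_name>;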

Where to check?

• Duplicates at the source will need to be identified by the customer and de-duped accordingly.
• Duplicates in the destination - below is the query to find the duplicate ids in Redshift.

Run this query in Redshift to see the ids with a count greater than 1:

SELECT
  id,
  COUNT(*) AS count
FROM
  <schema_name>.<table_name>
GROUP BY
  id
HAVING
  COUNT(*) > 1;

How can it be fixed?

• If duplicates at the source have crept into the destination (this is usually not possible, as Datacoral merges records in the destination), a clean full backfill is ideal once the source system has been cleaned up.

  Note: the change (de-dupe activity) at the source will be captured and will propagate appropriately to the destination.

• If duplicates are identified at the destination, records with an older updated_at can be deleted, as sketched below.
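
A hedged sketch of that clean-up, not a Datacoral-provided script: for each duplicated id it deletes every row except the one with the latest updated_at. Verify the staged rows first and adapt the schema, table, and column names to your environment; the temp table name dupes_to_delete is just a chosen name for this example.

-- Stage the row versions to delete: for each duplicated id, all but the latest updated_at.
CREATE TEMP TABLE dupes_to_delete AS
SELECT
  id,
  updated_at
FROM (
  SELECT
    id,
    updated_at,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM <schema_name>.<table_name>
) ranked
WHERE rn > 1;

-- Delete those older versions from the destination table.
DELETE FROM <schema_name>.<table_name>
USING dupes_to_delete
WHERE <schema_name>.<table_name>.id = dupes_to_delete.id
  AND <schema_name>.<table_name>.updated_at = dupes_to_delete.updated_at;

If two duplicate rows share the same updated_at, this sketch keeps both; such rows would need to be resolved manually.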

Checksum mismatch

Why does this happen?

• For an absolute check of records, Datacoral compares the checksum of the entire record between the source and the destination (a simplified illustration of a row-level checksum is shown after this list).
• This check fails when there is a checksum mismatch, i.e. when any column differs between the source and the destination for a given primary key and updated_at.
• This can happen due to:
  • Certain nuances of the source or destination (e.g. a precision-based float data type in Postgres does not add padded zeroes, while the same gets added in Redshift; the data is the same, but the checksum check will fail).
  • The parser may incorrectly handle certain data types.
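
A simplified illustration only, not Datacoral's actual checksum implementation: a row-level checksum built by hashing the concatenated column values, assuming hypothetical columns name and amount alongside id and updated_at. Casting numeric columns to a fixed precision on both sides avoids false mismatches such as the float padding difference mentioned above.

-- name and amount are placeholder columns for this illustration; replace with the table's real columns.
SELECT
  id,
  MD5(
    COALESCE(name, '') || '|' ||
    COALESCE(CAST(CAST(amount AS DECIMAL(18, 6)) AS VARCHAR), '') || '|' ||
    COALESCE(CAST(updated_at AS VARCHAR), '')
  ) AS row_checksum
FROM
  <schema_name>.<table_name>
WHERE
  id IN (<id_list_from_first_query>);

Running the same statement against the source and the destination for the mismatched ids, and removing columns from the hash one at a time, narrows down which column causes the difference.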

Where to check?

Identify mismatched records - Query 1

SELECT
  pk,
  src_ts,
  dst_ts,
  ts_comparison,
  pk_comparison,
  checksum_comparison,
  __dc_timelabel
FROM
  <schema_name>_synopsis.mv_dq_mismatch_<table_name>
WHERE
  checksum_comparison = 'checksum_mismatch'
ORDER BY
  src_ts DESC;

• Once the records in question are identified, query the corresponding records in the source and the destination to find the mismatch.

Destination (Redshift) - Query 2

Run the below query in the destination warehouse:

SELECT
  *
FROM
  <schema_name>.<table_name>
WHERE
  id IN (<id_list_from_first_query>);

Source (Postgres) - Query 3

Run the below query at the source:

SELECT
  *
FROM
  public.<table_name>
WHERE
  id IN (<id_list_from_first_query>);

Note: public should be replaced with the corresponding schema for a multi-schema database.

Compare the results of Query 2 and Query 3 to identify the mismatch.

S3 (Athena query on Slot Table) - Query 4

In case the data had multiple updates and the right / latest update did not make it to the destination, check if there were multiple rows in the slot table and the corresponding table in S3.

SELECT
  tablename,
  json_extract_scalar(row, '$.id') AS id,
  json_extract_scalar(row, '$.updated_at') AS updated_at,
  CAST(json_extract(row, '$.__dc_cdc_deleted_at') AS bigint) AS __dc_cdc_deleted_at,
  CAST(json_extract(row, '$.__dc_cdc_modified_at') AS bigint) AS __dc_cdc_modified_at
FROM
  datacoral_<installation_name>_<schema_name>.datacoral_slot_table
WHERE
  tablename = '<table_name>'
  AND json_extract_scalar(row, '$.id') IN (<pk_list_from_first_query>)
  AND y = '<YYYY_of_mismatch>'
  AND m = '<MM_of_mismatch>'
  AND d = '<DD_of_mismatch>';
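
As an optional follow-up to Query 4, a quick sketch that counts how many versions of each mismatched id landed in the slot table for that day, which helps confirm whether multiple updates were actually captured:

-- One row per id: how many versions arrived in the slot table and the latest updated_at seen.
SELECT
  json_extract_scalar(row, '$.id') AS id,
  COUNT(*) AS version_count,
  MAX(json_extract_scalar(row, '$.updated_at')) AS latest_updated_at
FROM
  datacoral_<installation_name>_<schema_name>.datacoral_slot_table
WHERE
  tablename = '<table_name>'
  AND json_extract_scalar(row, '$.id') IN (<pk_list_from_first_query>)
  AND y = '<YYYY_of_mismatch>'
  AND m = '<MM_of_mismatch>'
  AND d = '<DD_of_mismatch>'
GROUP BY
  json_extract_scalar(row, '$.id')
ORDER BY
  version_count DESC;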

S3 (Athena query on Individual Table) - Query 5

SELECT
  id,
  updated_at
FROM
  datacoral_<installation_name>_<schema_name>.<table_name>
WHERE
  y = '<YYYY_of_mismatch>'
  AND m = '<MM_of_mismatch>'
  AND d = '<DD_of_mismatch>'
  AND id IN (<pk_list_from_first_query>);

How can it be fixed?

• Based on the comparison results, the DQ check itself may need to be adjusted to account for the source/destination nuance (this is the most common scenario).
• If the data was parsed incorrectly, a code fix may be needed.
• In either case, please open a support ticket with the query results of Query 2, Query 3, Query 4, and Query 5.

Updates are not propagated

Why does this happen?

• This check fails when updates have not propagated. The failure may occur at any of the following points:
  • The update did not make it to the slot table - AWS issues.
  • The update did not make it to S3 (individual tables) - infrastructure issues with data not propagating from the slot table to the individual tables.
  • The update did not make it to the destination - infrastructure issues with data not propagating from S3 to the destination.

Where to check?

Identify mismatched records - Query 1

SELECT
  pk,
  src_ts,
  dst_ts,
  ts_comparison,
  pk_comparison,
  checksum_comparison,
  __dc_timelabel
FROM
  <schema_name>_synopsis.mv_dq_mismatch_<table_name>
WHERE
  ts_comparison = 'older_dst'
ORDER BY
  src_ts DESC;

Destination (Redshift) - Query 2

Run the below query on Redshift to get the id and updated_at columns for the primary keys from Query 1 (to cross-verify that the ids are present in the destination with the latest timestamp):

SELECT
  id,
  updated_at
FROM
  <schema_name>.<table_name>
WHERE
  id IN (<id_list_from_first_query>);

Source (Postgres) - Query 3

SELECT
  id,
  updated_at
FROM
  public.<table_name>
WHERE
  id IN (<id_list_from_first_query>);

S3 (Athena query on Slot Table) - Query 4

• In case the data did not make it to the destination, check if the data made it to the slot table and the corresponding table in S3.

If there are ids missing in Step 2 or Step 3, query Athena to check whether those ids were pulled:

SELECT
  tablename,
  json_extract_scalar(row, '$.id') AS id,
  json_extract_scalar(row, '$.updated_at') AS updated_at,
  CAST(json_extract(row, '$.__dc_cdc_deleted_at') AS bigint) AS __dc_cdc_deleted_at,
  CAST(json_extract(row, '$.__dc_cdc_modified_at') AS bigint) AS __dc_cdc_modified_at
FROM
  datacoral_<installation_name>_<schema_name>.datacoral_slot_table
WHERE
  tablename = '<table_name>'
  AND json_extract_scalar(row, '$.id') IN (<pk_list_from_first_query>)
  AND y = '<YYYY_of_mismatch>'
  AND m = '<MM_of_mismatch>'
  AND d = '<DD_of_mismatch>';

S3 (Athena query on Individual Table) - Query 5

SELECT
  id,
  updated_at
FROM
  datacoral_<installation_name>_<schema_name>.<table_name>
WHERE
  y = '<YYYY_of_mismatch>'
  AND m = '<MM_of_mismatch>'
  AND d = '<DD_of_mismatch>'
  AND id IN (<pk_list_from_first_query>);

How can it be fixed?

• Partial backfill for the missing update time window.
• Partial backfill of the individual primary keys.
• Also, please open a support ticket with the query results of Query 2, Query 3, Query 4, and Query 5 for further analysis.

Deletes are not propagated

Why does this happen?

• This check fails when there was an error while loading deleted rows into the deleted rows table in Redshift (not a true data error, but a DQ check error).
• Actual deletes are not applied on the destination (data error).

Where to check?

Identify missing deleted records - Query 1

Run the below query in Redshift (the destination warehouse):

SELECT
  pk,
  src_ts,
  dst_ts,
  ts_comparison,
  pk_comparison,
  checksum_comparison,
  __dc_timelabel
FROM
  <schema_name>_synopsis.mv_dq_mismatch_<table_name>
WHERE
  pk_comparison = 'in_dst'
ORDER BY
  src_ts DESC;

Destination (Redshift) - Query 2

Run the below query on Redshift to verify that the records are truly present in the destination:

SELECT
  id,
  updated_at
FROM
  <schema_name>.<table_name>
WHERE
  id IN (<pk_list_from_first_query>);

Source (Postgres) - Query 3

SELECT
  id,
  updated_at
FROM
  public.<table_name>
WHERE
  id IN (<pk_list_from_first_query>);

Note: the above query should return zero records, since these rows have been deleted at the source.

Destination (Verifying in Deleted Rows Table) - Query 4

• To ensure deleted records are captured appropriately for DQ:

SELECT
  json_extract_path_text("row", 'id') AS "id",
  TIMESTAMP 'epoch' + __dc_cdc_deleted_at / 1000000000.0 * interval '1 second' AS __dc_cdc_deleted_at_timestamp -- converting from nanoseconds to a timestamp
FROM
  <schema_name>.cdc_deleted_rows
WHERE
  tablename = '<table_name>'
  AND json_extract_path_text("row", 'id') IN (<pk_list_from_first_query>);

S3 (Athena query on Slot Table) - Query 5

• This verifies that the delete did propagate to the slot table but not beyond.

SELECT
  tablename,
  json_extract_scalar(row, '$.id') AS id,
  json_extract_scalar(row, '$.updated_at') AS updated_at,
  CAST(json_extract(row, '$.__dc_cdc_deleted_at') AS bigint) AS __dc_cdc_deleted_at,
  CAST(json_extract(row, '$.__dc_cdc_modified_at') AS bigint) AS __dc_cdc_modified_at
FROM
  datacoral_<installation_name>_<schema_name>.datacoral_slot_table
WHERE
  tablename = '<table_name>'
  AND json_extract_scalar(row, '$.id') IN (<pk_list_from_first_query>)
  AND y = '<YYYY_of_mismatch>'
  AND m = '<MM_of_mismatch>'
  AND d = '<DD_of_mismatch>';

How can it be fixed?

• If deletes have not propagated, a full backfill will be needed, as a partial backfill will not be sufficient.
• Also, please open a support ticket with the query results of Query 2, Query 3, Query 4, and Query 5 for further analysis.

Insert not propagated

Why does this happen?

Records that are present at the source but missing in Redshift can occur for reasons similar to those described under "Updates are not propagated".

Where to check?

Identify missing records - Query 1

SELECT
  pk,
  src_ts,
  dst_ts,
  ts_comparison,
  pk_comparison,
  checksum_comparison,
  __dc_timelabel
FROM
  <schema_name>_synopsis.mv_dq_mismatch_<table_name>
WHERE
  pk_comparison = 'in_src'
ORDER BY
  src_ts DESC;

Destination (Redshift) - Query 2

This should return zero records, as these ids are not present at the destination:

SELECT
  id,
  updated_at
FROM
  <schema_name>.<table_name>
WHERE
  id IN (<pk_list_from_first_query>);

Source (Postgres) - Query 3

To verify that the records are truly present at the source:

SELECT
  id,
  updated_at
FROM
  public.<table_name>
WHERE
  id IN (<pk_list_from_first_query>);

S3 (Athena query on Slot Table) - Query 4

In case the data did not make it to the destination, check if the data made it to the slot table and the corresponding table in S3.

SELECT
  tablename,
  json_extract_scalar(row, '$.id') AS id,
  json_extract_scalar(row, '$.updated_at') AS updated_at,
  CAST(json_extract(row, '$.__dc_cdc_deleted_at') AS bigint) AS __dc_cdc_deleted_at,
  CAST(json_extract(row, '$.__dc_cdc_modified_at') AS bigint) AS __dc_cdc_modified_at
FROM
  datacoral_<installation_name>_<schema_name>.datacoral_slot_table
WHERE
  tablename = '<table_name>'
  AND json_extract_scalar(row, '$.id') IN (<pk_list_from_first_query>)
  AND y = '<YYYY_of_mismatch>'
  AND m = '<MM_of_mismatch>'
  AND d = '<DD_of_mismatch>';

S3 (Athena query on Individual Table) - Query 5

SELECT
  id,
  updated_at
FROM
  datacoral_<installation_name>_<schema_name>.<table_name>
WHERE
  y = '<YYYY_of_mismatch>'
  AND m = '<MM_of_mismatch>'
  AND d = '<DD_of_mismatch>'
  AND id IN (<pk_list_from_first_query>);

How can it be fixed?

• Partial backfill for the missing records' time window.
• Partial backfill of the individual primary keys.
• Also, please open a support ticket with the query results of Query 2, Query 3, Query 4, and Query 5 for further analysis.

How to get DQ to succeed?

Once the data is fixed, DQ can be re-triggered by reprocessing <table_name>_pks for the corresponding day.