Data Quality Checks - Setup guide

Introduction

One of the core values of Datacoral is Data Quality (DQ) assurance. This document describes the different aspects of data quality checks and walks customers through setting them up for themselves.

The entire setup can be divided into the following steps:

Generate dq-pipeline

Most common invocation

datacoral generate dq-pipeline --slice-name <CDC-slice-name> \
    --output-directory <output_directory> \
    --timezone US/Eastern

help

datacoral generate dq-pipeline -h
Usage: dq-pipeline [options]
generate dq-pipeline for synopsis slice
Options:
--slice-name <slice-name> Name of the active cdc slice
--output-directory <output-directory> Full directory path to write out generated config file(s)
--full-check How conservatively you want to check. Default is incremental check
--timezone [timezone] Timezone for dpl files. Defaults to UTC
--timestamp-column [timestamp-column] Default timestamp column which should be used to track updates. If not found, one of the available timestamp column will be used
--cob-interval [cob-interval] optional interval in minutes for close of books: 5,10,15,20,30,60,120,180,240,360,480,720,1440. Defaults to daily (1440 minutes)
--include-datalayout If datalayout should be included in deployParams. By default we let the slice create the datalayout based on extraction query
--loadunits [loadunits] Loadunits for which you want to generate 'dqcheck' dpl files. By default it is generated for all the loadunits
--synopsis-slice-name [synopsisSliceName] SynopsisSliceName default will be <sliceName>_synopsis
--synopsis-slice-mv-schema [synopsis-slice-mv-schema] If it is not same as the synopsis-slice-name
--sub-operation [sub-operation] The sub-operation you want to perform: synopsis, closeofbooks, deletedrows, dqchecks
--retention-in-days [retentionInDays] RetentionInDays for regular mv. Defaults to 3 days
--overwrite Overwrite the contents of the output directory
--skip-checksum If you want to skip checksum comparison. Usually used for high frequency checks
-h, --help output usage information

Commonly used options

  • timezone - Often the replica database lags behind the main database. To work around this, customers can pass 'US/Eastern' for this option. This gives a buffer of at least three hours for the replica database to catch up and ensures that replica lag does not result in DQ failures.
  • loadunits - Use this option when data quality checks need to be set up for only a subset of loadunits.
  • full-check - Use when you want to run data quality checks over all of the historical data that has been fetched. NOTE: this is a more expensive data quality check because it does a checksum comparison of all records at the source and destination, so it is critical that it does not keep running on a schedule. To ensure that, always pause the closeofbooks MV (closeofbooks_<synopsisSliceName>) once the MV is added.
  • cob-interval - By default, the generated dq-pipeline is configured to run once a day. If a business requirement calls for a higher frequency, specify the interval in minutes with this option.
  • skip-checksum - Usually used along with higher-frequency DQ checks, where the goal is to ensure that data is being propagated to the destination rather than to do a full checksum comparison of source and destination.
  • timestamp-column - Datacoral inspects the source columns to find the timestamp column that should be used to track updates. However, a source table can have multiple timestamp columns (e.g. updated_at and last_modified_at) and the one Datacoral picks might not be the right one. In that case, the user can specify the exact timestamp column that should be used to track updates and carry out the data quality checks.
  • synopsis-slice-name - The synopsis slice name is generated from the CDC slice name: <cdc_slice_name>_synopsis (or, with '--full-check', <cdc_slice_name>_synopsis_full). However, users sometimes want to set up multiple dq-checks for a given CDC slice, e.g. checksum-based DQ checks running daily alongside non-checksum dq-checks running at a higher frequency. In those cases, they can provide a custom synopsis-slice-name such as <cdc_slice_name>_syn_highfreq.
  • synopsis-slice-mv-schema - Another optional parameter whose value is the same as '--synopsis-slice-name', unless the user provides a custom value for business reasons.
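As an illustration of how these options combine, the sketch below generates a one-off full-check pipeline. It uses only flags documented in the help output above; the slice name and output directory are placeholders to be filled in by the customer.

```shell
# Sketch of a full-check invocation; <CDC-slice-name> and <output_directory>
# are placeholders. Remember to pause the resulting closeofbooks MV, since a
# full check should not keep running on a schedule.
datacoral generate dq-pipeline --slice-name <CDC-slice-name> \
    --output-directory <output_directory> \
    --timezone US/Eastern \
    --full-check
```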

Examples

  • betterment-mysql
datacoral generate dq-pipeline --slice-name <mysqlcdc-slice-name> \
    --output-directory <output_directory> \
    --timezone US/Eastern \
    --timestamp-column last_modified_datetime \
    --include-datalayout
  • betterment-pg
datacoral generate dq-pipeline --slice-name <pgcdc-slice-name> \
    --output-directory <output_directory> \
    --timezone US/Eastern
  • drchrono
datacoral generate dq-pipeline --slice-name <mysqlcdc-slice-name> \
    --output-directory tmp \
    --timestamp-column date \
    --timezone US/Eastern
  • flexport
datacoral generate dq-pipeline --slice-name <pgcdc-slice-name> \
    --output-directory <output_directory> \
    --timezone US/Eastern

High frequency dq-checks

When users want to add high-frequency dq-checks without checksums, they usually invoke the command as follows (note: an interval of 240 minutes is shown here):

datacoral generate dq-pipeline --slice-name <CDC-slice-name> \
    --output-directory <output_directory> \
    --skip-checksum \
    --cob-interval 240 \
    --synopsis-slice-name <cdc_slice_name>_syn_highfreq \
    --loadunits <comma-separated-list-of-loadunits>

Sample output

datacoral generate dq-pipeline --slice-name tl_pgcdc21 --output-directory tmp
2021-06-15T17:22:45-07:00 - info: Getting the datalayout of all the loadunits in your cdc slice
2021-06-15T17:22:47-07:00 - info: wrote: tmp/collect/tl_pgcdc21_synopsis.json
2021-06-15T17:22:47-07:00 - info: wrote: tmp/redshift/tl_pgcdc21/closeofbooks.dpl
2021-06-15T17:22:47-07:00 - info: wrote: tmp/redshift/tl_pgcdc21/cdc_deleted_rows.dpl
2021-06-15T17:22:47-07:00 - info: wrote: tmp/redshift/tl_pgcdc21_synopsis/dq_mismatch_table_113.dpl
2021-06-15T17:22:47-07:00 - info: wrote: tmp/redshift/tl_pgcdc21_synopsis/dq_summary_table_113.dpl
2021-06-15T17:22:47-07:00 - info: wrote: tmp/redshift/tl_pgcdc21_synopsis/dq_status_table_113.dpl
2021-06-15T17:22:47-07:00 - info: wrote: tmp/dag.yaml

Add synopsis slice

  • Look at the generated synopsis deploy params to get the slice-type (pg or mysql when the CDC slice is pgcdc or mysqlcdc, respectively)
  • Add the synopsis slice as follows:
datacoral collect add --slice-type <slice-type-in-generated-deploy-params> \
    --slice-name <generated-synopsis-slice-name> \
    --parameters-file <output-directory>/collect/<generated-synopsis-slice-name>.json

Add DQ MVs

  • Once the synopsis slice has been added, add the DQ MVs by running the following CLI command:
datacoral organize matview-create --dag <output-directory>/dag.yaml

Additional steps for full check

  • For the full dq-pipeline, we do not want the checks to run on a schedule but on demand.
  • As a result, we need to pause the closeofbooks_full MV and then trigger data quality checks by reprocessing the synopsis loadunits.

Pausing the closeofbooks_full MV

  • The following command will pause your MV immediately
datacoral organize matview-pause --matview-schema <synopsis-slice-mv-schema> \
    --matview-name <full-closeofbooks-MV-name>
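For example, using the sample slice tl_pgcdc21 from the output above, the synopsis slice generated with '--full-check' would be named tl_pgcdc21_synopsis_full, and its closeofbooks MV closeofbooks_tl_pgcdc21_synopsis_full, following the naming patterns described earlier. A sketch, assuming the MV schema matches the synopsis slice name:

```shell
# Sketch only: the names below follow the documented patterns
# <cdc_slice_name>_synopsis_full and closeofbooks_<synopsisSliceName>;
# verify the actual names in your deployment before running.
datacoral organize matview-pause --matview-schema tl_pgcdc21_synopsis_full \
    --matview-name closeofbooks_tl_pgcdc21_synopsis_full
```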

Triggering data quality for a particular loadunit

  • The following command can be used to trigger DQ checks for one or more loadunits
  • Keep in mind that for each loadunit in the CDC slice there are two loadunits in the synopsis slice that need to be triggered: the count loadunit and the checksum loadunit
datacoral collect reprocess --slice-name <synopsis-slice-name> \
    --loadunits <loadunit-name>,<loadunit-name>_pks \
    --start-timelabel <timelabel>
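For example, to trigger a full check for a hypothetical loadunit named orders in the tl_pgcdc21 slice from the sample output, both synopsis loadunits are passed. The timelabel is deployment-specific and is left as a placeholder:

```shell
# 'orders' is a hypothetical loadunit name used for illustration only;
# the slice name follows the <cdc_slice_name>_synopsis_full pattern.
datacoral collect reprocess --slice-name tl_pgcdc21_synopsis_full \
    --loadunits orders,orders_pks \
    --start-timelabel <timelabel>
```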