Update Transformations

Checking updates to Transformations

Because Datacoral makes it easy to build transformations by simply writing SQL, analysts can focus on the business logic of the data rather than the underlying plumbing. The hardest part of managing that plumbing comes when transformations have to be changed, modifying the transformation graph. When workflow managers are used to build pipelines, it is the analyst's responsibility to make sure that changes to one transformation do not break downstream transformations.

Datacoral automatically verifies that changes to transformations won't cause any issues downstream. There are several areas where Datacoral's guardrails stop potentially critical bugs from being pushed even to testing environments.

The most critical static check is the one for downstream transformations. Datacoral reports the exact set of transformations that might be affected by the changes being made. Analysts can then change the affected transformations as well, which might in turn affect further downstream transformations. This check by itself catches several critical errors.
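Finding the affected set is essentially graph reachability: every transformation that reads, directly or transitively, from the one being changed. A minimal sketch of that idea (the graph encoding and names here are illustrative, not Datacoral's internals):

```python
from collections import deque

def affected_downstream(graph, changed):
    """Return every transformation reachable downstream of `changed`.

    `graph` maps a transformation name to the list of transformations
    that read directly from it.
    """
    affected, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for dependent in graph.get(node, []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected

# Example dependency graph: opportunity feeds opportunity_latest,
# which in turn feeds opportunity_cohort.
graph = {
    "salesforce.opportunity": ["salesforce.opportunity_latest"],
    "salesforce.opportunity_latest": ["salesforce.opportunity_cohort"],
}
print(affected_downstream(graph, "salesforce.opportunity_latest"))
# {'salesforce.opportunity_cohort'}
```

Because the traversal is transitive, editing a transformation near the root of the graph can surface a long list of candidates to review.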

When updating a transformation, an analyst could be modifying different attributes of the transformation. Datacoral provides guardrails to make sure that the modifications don't break the pipeline.

The actual details of the CLI commands used to update transformations are here. In this document, we walk through an example and go over the different guardrails Datacoral provides.

Compile time checks

First off, Datacoral offers:

  1. Compile-time type checks for changes to transformations, catching bugs early
  2. The ability to apply all related changes to the transformation graph in one shot

This means that analysts can be confident that the changes they are making are consistent and won't break the pipeline.

Furthermore, with Datacoral's Data Quality Check feature, runtime checks can also be used to make sure that changes don't break anything.
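A runtime check is conceptually just a predicate evaluated against the refreshed table after each update. The function below is an illustrative stand-in for that idea, not the actual Data Quality Check API:

```python
def check_not_null(rows, column):
    """Pass only if no row has a NULL in `column`."""
    bad = [r for r in rows if r.get(column) is None]
    return len(bad) == 0

# Hypothetical sample of the refreshed table.
rows = [
    {"id": "006A", "amount": 1200.0},
    {"id": "006B", "amount": 830.0},
]
print(check_not_null(rows, "amount"))
# True
```

A failing check of this kind would flag the update before downstream consumers ever see the bad data.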

For example, the following command gives the analyst visibility into downstream transformations that might need modification based on the changes already made.

$ datacoral organize matview-update-stage
[2019-03-21 20:39 GMT] Materialized view update started.
[2019-03-21 20:39 GMT] The following downstream MVs might get affected because of this update.
public.mv_customer_health
If you still want to continue the update, please use --skip-downstream-checks flag
[2019-03-21 20:39 GMT] Materialized view update failed.

Steps to perform updates

In this section, we will go over an example of how Datacoral makes it easy for analysts to make consistent changes to their transformations. The details of the CLI commands are here.

Let us consider the following example, where salesforce.opportunity_cohort is downstream of salesforce.opportunity_latest, to understand how a matview update works.

  • salesforce.opportunity_latest transformation definition
/**
* @datacoral O-1.0.0
* @slice-name redshift
* @matview-schema salesforce
* @matview-name opportunity_latest
* @interval-in-minutes 1440
* @update-mode replace
* @table-type regular
* @sort-keys compound sortkey(id,systemmodstamp)
* @dist-key diststyle key distkey(id)
*/
select
id,
isdeleted,
accountid,
name,
description,
amount,
systemmodstamp,
bant__c
from
salesforce.opportunity
  • salesforce.opportunity_cohort transformation definition
/**
* @datacoral O-1.0.0
* @slice-name redshift
* @matview-schema salesforce
* @matview-name opportunity_cohort
* @description Opportunity data for cohort analysis
* @interval-in-minutes 1440
* @update-mode replace
* @table-type regular
* @sort-keys compound sortkey(id,date)
*/
select
id,
isdeleted,
accountid,
name,
description,
amount,
systemmodstamp,
bant__c
from
salesforce.mv_opportunity_latest
where
systemmodstamp >= '{YYYY}-{MM}-{DD}T00:00:00' :: TIMESTAMP
AND systemmodstamp < '{YYYY}-{MM}-{DD}T00:00:00' :: TIMESTAMP + interval '1 day';
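The `{YYYY}`, `{MM}`, and `{DD}` placeholders in the query above are filled in from the timelabel of each scheduled run, so every daily run processes exactly one day of data. A sketch of that substitution (the expansion logic here is illustrative, not Datacoral's implementation):

```python
from datetime import date

def expand_timelabel(sql, day):
    """Replace {YYYY}/{MM}/{DD} placeholders with a concrete run date."""
    return (sql
            .replace("{YYYY}", f"{day.year:04d}")
            .replace("{MM}", f"{day.month:02d}")
            .replace("{DD}", f"{day.day:02d}"))

template = "systemmodstamp >= '{YYYY}-{MM}-{DD}T00:00:00' :: TIMESTAMP"
print(expand_timelabel(template, date(2019, 3, 21)))
# systemmodstamp >= '2019-03-21T00:00:00' :: TIMESTAMP
```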

Step 1: Modify the transformation definitions

We are required to perform the following update operations:

  1. Add fiscalquarter, fiscalyear and fiscal columns to salesforce.opportunity_latest
  2. Remove isdeleted column from salesforce.opportunity_cohort
  3. Modify the dependencies of salesforce.opportunity_latest to include salesforce.account in the from clause.

The transformation definitions can be modified as follows:

  • New salesforce.opportunity_latest transformation definition
/**
* @datacoral O-1.0.0
* @slice-name redshift
* @matview-schema salesforce
* @matview-name opportunity_latest
* @interval-in-minutes 1440
* @update-mode replace
* @table-type regular
* @sort-keys compound sortkey(id,systemmodstamp)
* @dist-key diststyle key distkey(id)
*/
select
o.id,
o.isdeleted,
o.accountid,
o.name,
o.description,
o.amount,
o.fiscalquarter,
o.fiscalyear,
o.fiscal,
o.systemmodstamp,
o.bant__c
from
salesforce.opportunity o
join salesforce.account a on o.accountid = a.id
  • New salesforce.opportunity_cohort transformation definition
/**
* @datacoral O-1.0.0
* @slice-name redshift
* @matview-schema salesforce
* @matview-name opportunity_cohort
* @description Opportunity data for cohort analysis
* @interval-in-minutes 1440
* @update-mode replace
* @table-type regular
* @sort-keys compound sortkey(id,date)
*/
select
id,
accountid,
name,
description,
amount,
systemmodstamp,
bant__c
from
salesforce.mv_opportunity_latest
where
systemmodstamp >= '{YYYY}-{MM}-{DD}T00:00:00' :: TIMESTAMP
AND systemmodstamp < '{YYYY}-{MM}-{DD}T00:00:00' :: TIMESTAMP + interval '1 day';
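Removing the `isdeleted` column from salesforce.opportunity_cohort is safe only because nothing downstream still selects it. Conceptually, the static check compares the columns a change removes against the columns each downstream transformation references; a minimal sketch of that comparison (the column lists are illustrative):

```python
def removed_columns(old_cols, new_cols):
    """Columns present in the old definition but not the new one."""
    return set(old_cols) - set(new_cols)

def breaks_downstream(old_cols, new_cols, downstream_refs):
    """Return removed columns that a downstream query still references."""
    return removed_columns(old_cols, new_cols) & set(downstream_refs)

old = ["id", "isdeleted", "accountid", "amount"]
new = ["id", "accountid", "amount"]
# A hypothetical downstream query that still selects isdeleted:
print(breaks_downstream(old, new, ["id", "isdeleted"]))
# {'isdeleted'}
```

An empty result means the removal is invisible to that downstream transformation; a non-empty one is exactly the situation the staging step warns about.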

Step 2: Add the updated definitions to the local staging area

Add these dpl files to the local staging area by running the following commands. Think of this step as similar to git add <file-path> in the git development flow, where all the updated dpl files are staged in a local manifest file.

datacoral organize matview-update-local --dpl-file file://<path-to-opportunity_latest-dpl>
datacoral organize matview-update-local --dpl-file file://<path-to-opportunity_cohort-dpl>

You can also inspect the files that have been added to the local staging area by running the following command:

datacoral organize matview-list-local-updates
update <path-to-opportunity_latest-dpl>
update <path-to-opportunity_cohort-dpl>
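Conceptually, the local staging area is just a manifest file that accumulates the paths of updated dpl files, much like git's index. The bookkeeping below is a made-up illustration of that idea, not Datacoral's actual file layout:

```python
import json
import os
import tempfile

def stage_update(manifest_path, dpl_path):
    """Append a dpl file path to the local update manifest (idempotently)."""
    entries = []
    if os.path.exists(manifest_path):
        with open(manifest_path) as f:
            entries = json.load(f)
    if dpl_path not in entries:
        entries.append(dpl_path)
    with open(manifest_path, "w") as f:
        json.dump(entries, f)
    return entries

manifest = os.path.join(tempfile.mkdtemp(), "updates.json")
stage_update(manifest, "opportunity_latest.dpl")
print(stage_update(manifest, "opportunity_cohort.dpl"))
# ['opportunity_latest.dpl', 'opportunity_cohort.dpl']
```

Listing the local updates then amounts to reading that manifest back, which is what makes the staged set reviewable before anything touches production.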

Step 3: Stage the updated transformations to be pushed to production

Push the updated changes from the local staging area to a staging area in production by running the following command. Datacoral computes a safe timelabel at which these changes can be applied.

datacoral organize matview-update-stage

The matview-update-stage command has several options; refer to the CLI help for details.

datacoral organize matview-update-stage --help
Usage: matview-update-stage [options]
Stage materialized view update-set manifest
Options:
--update-set [ update-set-manifest ]   Path to the update-set manifest. If not specified, the default path to the update-set manifest will be used.
--timelabel [ timelabel ]              Timelabel at which to stage the materialized view update-set manifest. If not specified, the timelabel will be computed automatically.
--skip-downstream-checks               Skip validation checks for downstream matviews.
--overwrite-staged-update              Update the materialized view definition of the most recent version that is not yet active.
--update-current-timelabel             Update the active materialized view definition immediately.
-h, --help                             Output usage information.
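A "safe" timelabel is one at which no run has executed yet, so the old and new definitions never mix within a single interval. As an illustration of that idea (not Datacoral's actual algorithm), the next interval boundary after the last completed run could be computed like this:

```python
from datetime import datetime, timedelta

def next_safe_timelabel(last_run, interval_minutes):
    """First interval boundary strictly after the last completed run."""
    return last_run + timedelta(minutes=interval_minutes)

# With @interval-in-minutes 1440 (daily), the next safe boundary after a
# run at 2019-03-21 00:00 is the following midnight.
last = datetime(2019, 3, 21, 0, 0)
print(next_safe_timelabel(last, 1440))
# 2019-03-22 00:00:00
```

Passing --update-current-timelabel skips this safety margin and swaps in the new definition immediately.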

That's it. You don't have to worry about any of the underlying plumbing: Datacoral applies the staged changes consistently at the computed timelabel.

Restrictions on MV updates

There are several restrictions on the kinds of updates that are allowed on MVs. These restrictions are essential to protect the integrity of the data in the MV being updated as well as in all of its downstream MVs. Currently, an update-set enforces the following rules:

  • Only MVs with update-mode of replace can be updated by changing the SQL query. MVs with append and merge update modes cannot be changed.
  • Schedule changes are not allowed
  • Changing the update-mode is not allowed
  • Changing the table-type is not allowed
  • Changing the table configuration (like dist key, sort keys) is not allowed
  • Changing dependencies in a way that creates an invalid DAG is not allowed
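The last rule can be enforced with a standard cycle check on the proposed dependency graph: if a dependency change introduces a cycle, the graph is no longer a valid DAG. A sketch of that check (the graph encoding is illustrative):

```python
def has_cycle(deps):
    """Detect a cycle in a dependency graph via iterative DFS.

    `deps` maps each transformation to the tables it reads from.
    """
    WHITE, GRAY, BLACK = 0, 1, 2       # unvisited / in progress / done
    color = {node: WHITE for node in deps}
    for start in deps:
        if color[start] != WHITE:
            continue
        stack = [(start, iter(deps.get(start, [])))]
        color[start] = GRAY
        while stack:
            node, children = stack[-1]
            child = next(children, None)
            if child is None:
                color[node] = BLACK
                stack.pop()
            elif color.get(child, WHITE) == GRAY:
                return True            # back edge found: cycle
            elif color.get(child, WHITE) == WHITE:
                color[child] = GRAY
                stack.append((child, iter(deps.get(child, []))))
    return False

valid = {"opportunity_latest": ["opportunity"],
         "opportunity_cohort": ["opportunity_latest"],
         "opportunity": []}
# Making opportunity read from opportunity_cohort closes a loop.
invalid = dict(valid, opportunity=["opportunity_cohort"])
print(has_cycle(valid), has_cycle(invalid))
# False True
```

Rejecting the update at this point is what keeps the scheduler's notion of "upstream" and "downstream" well defined.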
