Update Transformations
Checking updates to Transformations
Given that Datacoral makes it very easy to build transformations by just writing SQL, analysts can just focus on business logic of the data rather than the underlying plumbing. The most complicated part of managing the plumbing is when transformations have to be changed causing the transformation graph to get modified. When workflow managers are used to build pipelines, it is the responsibility of the analyst to make sure that changes to one transformation do not break downstream transformations.
Datacoral automatically verifies that changes to transformations wont cause any issues downstream. There are several areas where Datacoral's guardrails catch potentially critical bugs from being pushed to even testing environments.
The most critical check done statically is the one for downstream transformations. Datacoral provides the exact set of transformations that might get affected by the changes being performed. Analsyts can make changes to the affected transformations as well, which might in turn affect other downstream transformations. This check in and of itself catches several critical errors.
When updating a transformation, an analyst could be modifying different attributes of the transformation. Datacoral provides guardrails to make sure that the modifications dont break the pipeline.
The actual details of the CLI commands used to update the transformations are here. In this document, we will consider an example and go over the different guardrails provided by Datacoral.
Compile time checks
First off, Datacoral offers
- Compile-time type checks for changes to transformations, catching bugs early
- Ability to apply all related changes to the transformation graph in one shot
- This means that analysts can be confident that the changes they are making are consistent and wont break the pipeline.
Furthermore, with Datacoral's Data Quality Check feature, runtime checks can also be used to make sure that changes dont break anything.
For example the following command will provide visibility to analyst about potential downstream transformations that might need modifications based on the changes already made.
Steps to perform updates
In this section, we will go over an example of how Datacoral makes it easy for analysts to make consistent changes to their transformations. The details of the CLI commands are here.
Let us consider the following example where salesforce.opportunity_cohort
is downstream of salesforce.opportunity_latest
to understand how matview update works.
salesforce.opportunity_latest
transformation definition
salesforce.opportunity_cohort
transformation definition
Step 1: Modify the transformation definitions
We are required to perform the following update operations
- Add
fiscalquarter
,fiscalyear
andfiscal
columns tosalesforce.opportunity_latest
- Remove
isdeleted
column fromsalesforce.opportunity_cohort
- Modify dependencies of
salesforce.opportunity_latest
to includesalesforce.account
in the from clause.
The transformation definitions can be modified as shown
- New
salesforce.opportunity_latest
transformation definition
- New
salesforce.opportunity_cohort
transformation definition
Step 2: Add the updated definitions to the local staging area
Add these dpl files to local staging area by running following commands. Think of this step as similar to git add <file-path>
in the git development flow, where are all the updated dpl files are staged in a local manifest file.
You can also inspect the files that have been added to the local staging area by running the following command
Step 3: Stage the updated tranformations to be pushed to production
Push the updated changes from local staging area to a staging area in production by running the following command. Datacoral computes a safe timelabel at which these changes can be applied.
Matview update stage has multiple options. Please refer to CLI help section.
That's it. You don't have to worry about an
Restrictions on MV updates
There are several restrictions on the kinds of updates that are allowed in MVs. These restrictions are essential to protect the integrity of data in the MV being updated as well as all of its downstream MVs. Currently update-set can be used to do following updates to MV:
- Only MVs with
update-mode
ofreplace
can be updated by changing the SQL query. MVs withappend
andmerge
update modes cannot be changed. - Schedule changes are not allowed
- Changing the
update-mode
is not allowed - Changing the
table-type
is not allowed - Changing the table configuration (like dist key, sort keys) is not allowed
- Changing dependencies which cause invalid DAGs are not allowed
There are several restrictions on the kinds of updates that are allowed in MVs. These restrictions are essential to protect the integrity of data in the MV being updated as well as all of its downstream MVs.