Organize Commands

For now, the organize slices we support are Managed Redshift, Athena, and Snowflake. Within these slices, we provide the ability to create and manipulate multiple levels of transformations.

Types of transformations

Transformations are written as SQL queries in the dialect of the underlying query engine. The results of the transformations are written into a destination table. The destination table can be of two types, defined by the table-type attribute.

  1. Regular table
  2. Partitioned table

How the results of the transformation are written into the destination table is determined by the update-mode attribute. Update mode can be one of:

  1. replace - this mode replaces all the rows of the destination table with the results of the transformation query. This is a wipe and load operation.
  2. append - in this mode, the transformation query result rows are inserted into the destination table, rows already in the destination table are not updated.
  3. merge - in this mode, the transformation query can have rows that indicate that the destination table rows have to be inserted, updated, or even deleted. This mode allows for efficient incremental updates to destination tables.

For regular tables, all update modes are allowed. But, for partitioned tables, only append mode is allowed.
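
For instance, in a DPL file (described below), a partitioned destination table would be declared with annotations like the following (a minimal sketch showing only the two relevant attributes):

@table-type partitioned
@update-mode append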

You will also need to define the frequency of the transformation, and the schema and name of the destination table.
Destination table configuration such as sort keys and distribution keys are optional parameters.

Create a transformation

Use a DPL file

Creating an MV is as easy as writing an SQL statement. Use a DPL file to create an MV with the command below.

datacoral organize matview-create --dpl-file <mv_file_name.dpl>

A DPL file has the following format:

/**
* @datacoral
* @matview-schema schemaname
* @matview-name tablename
* @table-type regular
* @update-mode merge
* @interval-in-minutes 5
* @primary-keys id
*/
select * from mysqltest.mv_regular;

Note: You can only specify an existing <schema name> as the value for @matview-schema.

You can choose to create your own schemas as well. You will need to ensure that the Datacoral platform can manage the materialized views created in that schema. This requires granting specific privileges to the datacoral user (which was created when the Datacoral platform was installed in your AWS account).
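
For example, in Redshift the grants could look like the following (a minimal sketch: the schema name analytics is a placeholder and the exact set of privileges required is an assumption, so verify against your installation):

CREATE SCHEMA IF NOT EXISTS analytics;
GRANT USAGE, CREATE ON SCHEMA analytics TO datacoral;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO datacoral;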

Once you have finished the above steps, you will be able to specify the created schema name as the value for @matview-schema.

Use CLI args

datacoral organize matview-create --help
Usage: matview-create [options]
Create a materialized view
Options:
--matview-schema <matview-schema>
Schema in which to create the materialized view
If not specified the view will get created in the public schema
--matview-name <matview-name>
Name of the materialized view.
If events is specified as the name, an mv with the name mv_events is created in the specified schema
Name should not contain uppercase letters
--table-type <table-type>
Type of table materialized view is stored in - {regular, partitioned}
--update-mode <update-mode>
Update mode of data within a materialized view - {append, replace, merge}
--frequency <frequency>
(Deprecated: Use '--interval-in-minutes' for better control of schedule)
How frequently to update the materialized view - {hourly, daily}
--interval-in-minutes <interval-in-minutes>
How frequently to update the materialized view. The interval must be a multiple of
5, between 5 (inclusive) and 1440 (inclusive). If the interval is less than or
equal to 60, then it must divide into 60 evenly. If the interval is less than or
equal to 1440, then it must divide into 1440 evenly.
--query <query-or-file>
Query that needs to be materialized. Specify a query either on the command-line or in a file.
Query should start with 'WITH' or 'SELECT'.
To specify a file containing the query, prefix file path with file://
--sort-keys <sort-key-column-list>
Optional expression to specify sort keys for mv e.g.
'interleaved sortkey(column1, column2)'
'compound sortkey(column1, column2)'
--dist-key <dist-key-column>
Optional expression to specify dist key for mv e.g.
'diststyle all distkey(column1)'
'distkey(column1)'
--primary-keys <primaryKeys>
Comma separated values of primary keys
This is only required for upsert types of materialized views - {regularupsert, timeseriesupsert}
--update-ts-col <updateTsCol>
Timestamp column being updated
This is only required for upsert types of materialized views - {regularupsert, timeseriesupsert}
Data from this column will be used for the latest version of a key
--delete-ts-col <deleteTsCol>
Soft delete column for applying upsert logic
This is only required for upsert types of materialized views - {regularupsert, timeseriesupsert}
This is the column used when marking rows as soft delete. Not null implies row was deleted
--compute-engine <batch compute slice>
Name of the batch compute slice used to execute the python/nodejs user defined functions
--slice-name <name of the organize slice where the materialized view exists>
This will be a required field for Athena or Snowflake materialized views.
If not provided, then by default materialized view will be created in redshift
--storage-format <storageFormat>
Output storage format of materialized view
This is only applicable for Athena materialized view
If not provided, then default will be csv format
--input-warehouse <inputWarehouse>
Input Warehouse Name for a UDF based materialized view
This is only applicable for UDF materialized views
--output-warehouse <outputWarehouse>
Output Warehouse Name for a UDF based materialized view
This is only applicable for UDF materialized views
--function-name <functionName>
UDF Function Name
This is only applicable for UDF materialized views
--sequential-refresh
Specify if mv is sequential refresh
--dpl-file <data-programming-language file>
-h, --help output usage information
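
For illustration, the following invocation creates a simple replace-mode MV from a query file (the schema, MV name, column names, and file path are placeholder values):

datacoral organize matview-create \
    --matview-schema analytics \
    --matview-name daily_orders \
    --table-type regular \
    --update-mode replace \
    --interval-in-minutes 60 \
    --sort-keys 'compound sortkey(order_date)' \
    --dist-key 'distkey(customer_id)' \
    --query file://daily_orders.sql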

Updating MVs

The file-based matview management flow now allows users to update one or more matview definitions in one shot using an update-set. An update-set is created using a matview update-set manifest file. An update-set manifest file is a text file stored on local disk. It contains the following information for each matview being updated:

  • action type to update the matview: “update” is the only action type currently supported; more action types will be added in future releases;
  • path to DPL file (DPL stands for data programming language) describing updates which need to be done to each matview.

Create an update set locally

datacoral organize matview-update-local [--update-set <update-set-manifest>] --dpl-file <dpl-file>

NOTES:

  • update-set - Path to the update-set manifest. If this option is not specified, the following default update-set manifest file will be used: $HOME/.datacoral/matview-update-set-manifest.
  • dpl-file - Path to DPL file describing matview update.
  • each update-set manifest can contain only one update action for each MV.
  • each update-set manifest can contain multiple update actions for different MVs.
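
For example, to build an update-set covering two MVs, the command can be run once per DPL file (the file names are placeholders; this assumes repeated invocations add entries to the same default manifest):

datacoral organize matview-update-local --dpl-file updated_mv_one.dpl
datacoral organize matview-update-local --dpl-file updated_mv_two.dpl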

See Updating Transformations for a detailed description of the different types of updates that are allowed and the types that are not.

Stage updates

$ datacoral organize matview-update-stage --help
Usage: matview-update-stage [options]
Stage materialized view update-set manifest
Options:
--update-set [ update-set-manifest ] Path to update-set manifest. If not specified, the default path to update-set manifest will be used.
--timelabel [ timelabel ] Timelabel to stage materialized view update-set manifest. If not specified, the timelabel will be automatically computed.
--skip-downstream-checks Skip validation checks for downstream matviews.
--overwrite-staged-update Update the materialized view definition of the most recent version that is not yet active.
--update-current-timelabel Update the active materialized view definition immediately.
-h, --help output usage information
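
For example, once the update-set manifest is ready, it can be staged with either of the following (the explicit path simply repeats the default mentioned above):

datacoral organize matview-update-stage
datacoral organize matview-update-stage --update-set $HOME/.datacoral/matview-update-set-manifest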

Describe MVs

Describe one MV

datacoral organize matview-describe --matview-schema <matview-schema> --matview-name <matview-name>

List all MVs

Get the list of all the materialized views in redshift

datacoral organize matview-list

Download all MVs

datacoral organize matview-download --download-dir <download-dir> [--overwrite]

NOTES:

  • download-dir - Path to the matview download directory. This directory should be empty unless the --overwrite option has been specified.
  • overwrite - Overwrite the contents of the matview download directory if it is not empty.
  • Each MV will be downloaded into an individual DPL file.
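
For example, to download all MV definitions into a local directory, overwriting any existing files (the directory name is a placeholder):

datacoral organize matview-download --download-dir ./matview-backup --overwrite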

Delete MV

datacoral organize matview-delete --matview-schema <matview-schema> --matview-name <matview-name>

Understanding MV dependencies

List upstream dependencies

datacoral organize matview-describe --matview-schema <matview-schema> --matview-name <matview-name> --dependencies

List downstream dependents

datacoral organize matview-describe --matview-schema <matview-schema> --matview-name <matview-name> --dependents

List dependents in graph form

datacoral organize matview-describe --matview-schema <matview-schema> --matview-name <matview-name> --dependents-graph

Describe a version of MV

datacoral organize matview-describe --matview-schema <matview-schema> --matview-name <matview-name> --version <version>

Pause and resume

Get Status of MV

Get the status of the refreshes on the MVs over a given time period.

datacoral organize matview-stats --matview-schema <matview-schema> --matview-name <matview-name> [--from <start-time>] [--to <end-time>]
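
For illustration, assuming the --from and --to values use the same YYYYMMDDHHmmss timelabel format as the pause and resume commands below (an assumption; the schema, MV name, and times are placeholders):

datacoral organize matview-stats --matview-schema mysqltest --matview-name leveltwo --from 20200612000000 --to 20200612060000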

Pause an MV

datacoral organize matview-pause
Usage: matview-pause [options]
Pauses refreshes of materialized view
Options:
--matview-schema <matview-schema> Schema in which the materialized view lives. If not specified the schema is assumed to be 'public'
--matview-name <matview-name> Name of the materialized view
--from <start-timelabel> Set a future pause timelabel YYYYMMDDHHmmss
--slice-name [sliceName] Name of the warehouse where this materialized view exists.
This will be a required field for Athena materialized view
If not provided, then by default we assume sliceName to be redshift.
-h, --help output usage information

Resume an MV

datacoral organize matview-resume
Usage: matview-resume [options]
Resumes refreshes of a materialized view from the timelabel at which it was paused. Consider using --from to resume refreshes from the current timelabel
Options:
--matview-schema <matview-schema> Schema in which the materialized view lives. If not specified the schema is assumed to be 'public'
--matview-name <matview-name> Name of the materialized view
--from <start-timelabel> Set a future resume timelabel YYYYMMDDHHmmss
--slice-name [sliceName] Name of the warehouse where this materialized view exists
This will be a required field for Athena materialized view
If not provided, then by default we assume sliceName to be redshift.
-h, --help output usage information
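
For example, to pause refreshes of a single MV and later resume it (the schema and MV name are placeholders, matching the mysqltest example used later in this document):

datacoral organize matview-pause --matview-schema mysqltest --matview-name leveltwo
datacoral organize matview-resume --matview-schema mysqltest --matview-name leveltwo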

Pause refreshes of all MVs

A good brute force tool to use if your Redshift cluster is in imminent danger of being overloaded.

datacoral organize matview-pause-all

Resume refreshes of all MVs

Go back to normal operations.

datacoral organize matview-resume-all

Pause and resume downstream MVs

There are cases where you are trying to debug or update an MV that is an internal node in your DAG of MVs. In these cases, the safest way to apply the updates is to first pause refreshes of all the downstream MVs and publishers of the MV being updated. With this command, a given MV or a list of MVs can first be brought to a good state before resuming their downstream MVs or publishers.

Note: For redshift MVs, unlike in the other commands, you will need to include the mv_ prefix in the MV name.

Pause all downstream MVs and publishers of a given list of tables

The given list of tables can be either loadunits or MVs themselves. This command pauses all MVs and publishers that are downstream to the list of tables provided.

datacoral organize pause-downstream
Usage: pause-downstream [options]
Pauses the downstream MVs for the specified table(s) in a warehouse
Options:
--warehouse-name [warehouse name] Warehouse the specified tables reside in
--table-names [table list] Name(s) of the tables separated by comma (ex. schemaA.tableA,schemaB.tableB)
--timelabel [timelabel] Set a future pause timelabel YYYYMMDDHHmmss
-h, --help output usage information

Resume all downstream MVs and publishers of a given list of tables

The given list of tables can be either loadunits or MVs themselves. This command resumes all MVs and publishers that are downstream to the list of tables provided.

Note: When resume is done on a future timelabel, the missing timelabels will not get reprocessed.

datacoral organize resume-downstream
Usage: resume-downstream [options]
Resumes the downstream MVs for the specified table(s) in a warehouse
Options:
--warehouse-name [warehouse name] Warehouse the specified tables reside in
--table-names [table list] Name(s) of the tables separated by comma (ex. schemaA.tableA,schemaB.tableB)
--timelabel [timelabel] Set a future resume timelabel YYYYMMDDHHmmss
-h, --help output usage information

For example, when an MV mysqltest.mv_leveltwo depends on an MV mysqltest.mv_levelone, pausing the downstream of mysqltest.mv_levelone does the following:

datacoral organize pause-downstream --warehouse-name redshift --table-names mysqltest.mv_levelone
Successfully paused downstream materialized views
{
"mysqltest.mv_leveltwo": "SUCCESS"
}

This can be verified by checking the currentStatus of the downstream MV, as shown below. pauseTimelabel indicates the timelabel from which the MV is paused.

datacoral organize matview-describe --matview-schema mysqltest --matview-name leveltwo
{
"definition-view-name": "def_mv_5_regular_leveltwo",
"interval-in-minutes": 5,
"slice-name": "redshift",
"table-type": "regular",
"update-mode": "replace",
"matview-schema": "mysqltest",
"matview-name": "leveltwo",
"matview-definition": "SELECT * FROM mysqltest.mv_levelone",
"sequential-refresh": false,
"pauseTimelabel": "20200612193000",
"currentStatus": "Paused",
"redshift-table-name": "mysqltest.mv_leveltwo"
}

When you resume the downstreams of an MV, the status can be similarly verified.

datacoral organize resume-downstream --warehouse-name redshift --table-names mysqltest.mv_levelone
Successfully resumed downstream materialized views
{
"mysqltest.mv_leveltwo": "SUCCESS"
}
datacoral organize matview-describe --matview-schema mysqltest --matview-name leveltwo
{
"definition-view-name": "def_mv_5_regular_leveltwo",
"interval-in-minutes": 5,
"slice-name": "redshift",
"table-type": "regular",
"update-mode": "replace",
"matview-schema": "mysqltest",
"matview-name": "leveltwo",
"matview-definition": "SELECT * FROM mysqltest.mv_levelone",
"sequential-refresh": false,
"pauseTimelabel": "20200612193000",
"resumeTimelabel": "20200612193500",
"currentStatus": "Running",
"redshift-table-name": "mysqltest.mv_leveltwo"
}

Reprocess MVs

Reprocess MVs using --parameters-file

datacoral organize matview-reprocess --parameters-file file://reprocess.json

This can be used to reprocess a timelabel of one or more MVs. Here the reprocess.json file looks like the following:

[
{
"materializedView1": "<schema-name1>.<matview-name1>",
"timelabel": "<timelabel1>"
},
{
"materializedView2": "<schema-name2>.<matview-name2>",
"timelabel": "<timelabel2>"
},
...
]
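
As an illustration, a hypothetical parameters file that reprocesses a single timelabel of one MV might look like the following (the key names simply mirror the template above; the schema, MV name, and timelabel are placeholders):

[
{
"materializedView1": "mysqltest.leveltwo",
"timelabel": "20200612193000"
}
]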

Reprocess MVs without --parameters-file

  1. The command expects either --timelabel OR both --start-timelabel and --end-timelabel.
  2. The combination of --start-timelabel and --end-timelabel can be used to reprocess multiple timelabels. Providing the same value for --start-timelabel and --end-timelabel means that only one timelabel will be reprocessed.
  3. Providing --timelabel is equivalent to providing the same value for both --start-timelabel and --end-timelabel (--timelabel will be deprecated in the future).
  4. The command defaults --slice-name to redshift unless otherwise specified. (Other possible values: athena, snowflake, etc.)

Generic command

datacoral organize matview-reprocess --slice-name <warehouseName> --matview-schema <schema-name> --matview-name <matview-name> --start-timelabel <startTimelabel> --end-timelabel <endTimelabel>

Simplified Command

datacoral organize matview-reprocess --matview-schema <schema-name> --matview-name <matview-name> --timelabel <timelabel>

  1. Not providing --slice-name defaults it to redshift.
  2. Providing --timelabel is the same as providing the same value to both --start-timelabel and --end-timelabel.

The above command is equivalent to:

datacoral organize matview-reprocess --slice-name redshift --matview-schema <schema-name> --matview-name <matview-name> --start-timelabel <timelabel> --end-timelabel <timelabel>

Version Management

List versions of a MV

datacoral organize matview-list-versions --matview-schema <matview-schema> --matview-name <matview-name>
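
For example, to inspect an earlier definition, first list the versions and then describe a specific one (the schema and MV name are placeholders; substitute a version value from the list output):

datacoral organize matview-list-versions --matview-schema mysqltest --matview-name leveltwo
datacoral organize matview-describe --matview-schema mysqltest --matview-name leveltwo --version <version>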

List update set versions

datacoral organize matview-list-update-sets