Batch Compute Quick Start Guide

Datacoral's Batch Compute UDFs allow you to perform complex operations within your transformations. As a motivating example, assume there is a table in Amazon Redshift containing Amazon product reviews.

redshift> SELECT * FROM amazon_product_reviews.reviews;
+----------------------+-------------------+---------------------------------------------------------------------------------------------------------+
| product_id | product_name | review_text |
+----------------------+-------------------+---------------------------------------------------------------------------------------------------------+
| AVpe7AsMilAPnD_xQ78G | Kindle Paperwhite | My previous kindle was a DX, this is my second kindle in years. Love the form factor and all but I d... |
| AVpe7AsMilAPnD_xQ78G | Kindle Paperwhite | Just got mine right now. Looks the same as the previous generation except for the Kindle logo (it's ... |
| AVpe7AsMilAPnD_xQ78G | Kindle Paperwhite | I am enjoying it so far. Great for reading. Had the original Fire since 2012. The Fire used to make ... |
| AVpe7AsMilAPnD_xQ78G | Kindle Paperwhite | I initially had trouble deciding between the paperwhite and the voyage because reviews more or less ... |
| AVpe7AsMilAPnD_xQ78G | Kindle Paperwhite | I am enjoying it so far. Great for reading. Had the original Fire since 2012. The Fire used to make ... |
| AVpe7AsMilAPnD_xQ78G | Kindle Paperwhite | I have to say upfront - I don't like coroporate, hermetically closed stuff like anything by Apple or... |
| AVpe7AsMilAPnD_xQ78G | Kindle Paperwhite | Allow me to preface this with a little history. I am (was) a casual reader who owned a Nook Simple T... |
| AVpe7AsMilAPnD_xQ78G | Kindle Paperwhite | I initially had trouble deciding between the paperwhite and the voyage because reviews more or less ... |
+----------------------+-------------------+---------------------------------------------------------------------------------------------------------+

Suppose you want to compute the sentiment of each review. One way to do that is with the Amazon Comprehend service. You can use a Datacoral Batch Compute UDF to call Comprehend by following the steps below.

Prerequisites

To work with Datacoral Batch Compute, you currently need both the Datacoral CLI and the AWS CLI.

  • Datacoral CLI - Click on Get Datacoral CLI in the Datacoral Webapp to get the CLI.
  • AWS CLI - See AWS CLI Installation Instructions. Please configure the CLI to use the AWS Account where your Datacoral platform is deployed.
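
For example, you can point a named AWS CLI profile at that account and pass it to the Datacoral CLI later via AWS_PROFILE, as in Step 3 (the profile name datacoral is just an illustration):

aws configure --profile datacoral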

The rest of the steps require that your Datacoral installation has the Batch Compute feature enabled. Email support@datacoral.co if you don't have it enabled already.

Step 1: Download the UDF template code

Contact us at support@datacoral.co to get access to the template code!

Step 2: Implement the UDF

Step 2.1: Implement the transform method

All you need to do is implement the transform method, which takes a pandas DataFrame as input and is expected to return a pandas DataFrame.
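
For the sentiment example, a minimal sketch of the transform method is shown below. The boto3 Comprehend client and its detect_sentiment call are the real AWS API (boto3 ships preinstalled, per Step 2.2), but the entry-point name and the column handling are assumptions based on this guide's datalayouts, so follow the template you receive for the exact signature.

# udf.py -- illustrative sketch only; follow the entry point defined
# in the template you receive from Datacoral.
import boto3
import pandas as pd

comprehend = boto3.client("comprehend")

def transform(df: pd.DataFrame) -> pd.DataFrame:
    def detect(text: str) -> str:
        # DetectSentiment accepts at most 5000 bytes of UTF-8 text,
        # so truncate defensively before calling the API.
        response = comprehend.detect_sentiment(Text=text[:4500], LanguageCode="en")
        return response["Sentiment"]  # POSITIVE, NEGATIVE, NEUTRAL, or MIXED

    result = df.copy()
    result["sentiment"] = result["review_text"].map(detect)
    # Return the columns declared in output_datalayout.json (Step 2.3)
    return result[["product_id", "product_name", "sentiment", "review_text"]]

You can smoke-test the function locally with a small pandas DataFrame before packaging it.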

Note

The UDF above uses Amazon Comprehend actions. To allow these actions under the BatchRole in your installation, follow the steps below:

Step 1: Go to the AWS CloudFormation console, search for BatchRole, and click Update.


Step 2: Click Update Nested Stack, then Update Stack.


Step 3: Click Edit Template in Designer, then View in Designer.


Step 4: In the text window at the bottom left, search for BatchJobRole.

  • The comprehend:DetectSentiment action needs to be added to the batchjobpolicy under this role (see the example statement after this list).
  • Refer to the AWS documentation on attaching permissions policies to IAM identities for how to grant permissions to perform Amazon Comprehend actions.
  • Once you make the update, click Validate Template to check the template, then Create Stack to update the BatchJobRole.
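
For reference, the statement added to the batchjobpolicy might look like the sketch below. DetectSentiment does not operate on a specific resource, hence the wildcard:

{
  "Effect": "Allow",
  "Action": ["comprehend:DetectSentiment"],
  "Resource": "*"
}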

Step 5: Update the CloudFormation stack. For guidelines, see the AWS documentation on updating stacks directly.

Step 2.2: Include additional packages

You can add arbitrary libraries to the requirements.txt file, which is already populated with the most commonly used packages.

### These dependencies with the given versions will already be available ####
# Note: If you require a different version of these core packages, you can change the version
git+https://github.com/datacoral/aws-data-wrangler
boto3==1.15.18
botocore==1.18.18
dask==0.18.2
distributed==1.22.1
numpy==1.18.2
pandas==1.0.3
psycopg2==2.7.5
s3fs==0.4.2
scipy==1.1.0
scikit-learn==0.21.2
snowflake-connector-python[pandas]
snowflake-sqlalchemy
uuid==1.30
##### Add any other Python packages below ####

For the sentiment example, no additions are needed, since boto3 is already included. If you have other Python libraries you wish to use in your UDF code, you can add wheel files for them under a dist directory in the same folder. We will install those and make them available to the UDF.

Step 2.3: Add input and output datalayout files

The template code has a datalayout with two files:

  1. input_datalayout.json - contains the schema of the input data frame passed to the transform method. In each entry, position gives the column order and pkPosition marks the column's place in the primary key.
{
  "product_id": {
    "type": "string",
    "size": 256,
    "position": 1,
    "pkPosition": 1
  },
  "product_name": {
    "type": "string",
    "size": 1024,
    "position": 2
  },
  "review_text": {
    "type": "string",
    "size": 65535,
    "position": 3
  }
}
  2. output_datalayout.json - contains the schema of the output of the UDF. When a transformation is created with this UDF, the destination table will have the schema specified in this file.
{
  "product_id": {
    "type": "string",
    "size": 256,
    "position": 1,
    "pkPosition": 1
  },
  "product_name": {
    "type": "string",
    "size": 1024,
    "position": 2
  },
  "sentiment": {
    "type": "string",
    "size": 256,
    "position": 3
  },
  "review_text": {
    "type": "string",
    "size": 65535,
    "position": 4
  }
}
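
Putting Steps 2.1 through 2.3 together, the UDF folder might look like the layout below. The udf.py and mylib names are illustrative; the template you receive defines the exact file names:

udf/
  udf.py                       # transform implementation (Step 2.1)
  requirements.txt             # additional pip packages (Step 2.2)
  dist/                        # optional wheel files for private libraries
    mylib-0.1.0-py3-none-any.whl
  datalayout/
    input_datalayout.json
    output_datalayout.json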

Step 2.4: Create a zip file

Zip the contents of the UDF and all of the libraries it depends on. Run the command from inside the UDF folder so that the files sit at the top level of the archive:

zip -qq -r udf.zip *

Step 3: Register the UDF

Create a file with the compute and memory requirements for the UDF. For example, create a file compute-resources.json with the content below:

{
  "vcpus": 4,
  "memory": 4096
}

Register the UDF. Note that AWS_PROFILE is needed because the UDF zip file is uploaded to S3.

AWS_PROFILE=<your_aws_profile> datacoral organize udf-create \
  --udf-name sentiment \
  --module-path ./udf.zip \
  --language Python \
  --resources ./compute-resources.json

where:

  • --udf-name - name of the UDF, used to refer to it in SQL transformations
  • --module-path - path to the udf.zip file
  • --language - language of the UDF
  • --resources - path to the resource specification for the UDF

Step 4: Create a materialized view with the UDF

Create the DPL file:

-- review_sentiment.dpl
/**
* @datacoral O-1.0.0
* @slice-name redshift
* @matview-schema amazon_product_reviews
* @matview-name review_sentiment
* @interval-in-minutes 5
* @update-mode replace
* @table-type regular
* @compute-engine batchcompute
* @function-name sentiment
* @input-warehouse redshift
* @output-warehouse redshift
*/
SELECT
product_id,
product_name,
review_text
FROM
amazon_product_reviews.reviews

Then create the materialized view from the DPL file:

datacoral organize matview-create --dpl-file review_sentiment.dpl

The same materialized view can also be created through the Datacoral webapp. Note that @function-name in the DPL (sentiment) must match the --udf-name used when registering the UDF in Step 3.

This results in a table amazon_product_reviews.review_sentiment in Redshift with a sentiment value assigned to each review, along with Datacoral's _dc_timelabel column recording the time label of the processing interval.

redshift> SELECT * FROM amazon_product_reviews.review_sentiment;
+--------------------+-------------------+-------------------------------------------------------------------------------------------------------+-----------+--------------+
| product_id | product_name | review_text | sentiment | _dc_timelabel|
+--------------------+-------------------+-------------------------------------------------------------------------------------------------------+-----------+--------------+
|AVpe7AsMilAPnD_xQ78G| Kindle Paperwhite |Just got mine right now. Looks the same as the previous generation except for the Kindle logo (it's ...| NEUTRAL |20201215102500|
|AVpe7AsMilAPnD_xQ78G| Kindle Paperwhite |I initially had trouble deciding between the paperwhite and the voyage because reviews more or less ...| MIXED |20201215102500|
|AVpe7AsMilAPnD_xQ78G| Kindle Paperwhite |I have to say upfront - I don't like coroporate, hermetically closed stuff like anything by Apple or...| NEGATIVE |20201215102500|
|AVpe7AsMilAPnD_xQ78G| Kindle Paperwhite |I initially had trouble deciding between the paperwhite and the voyage because reviews more or less ...| MIXED |20201215102500|
|AVpe7AsMilAPnD_xQ78G| Kindle Paperwhite |My previous kindle was a DX, this is my second kindle in years. Love the form factor and all but I d...| POSITIVE |20201215102500|
|AVpe7AsMilAPnD_xQ78G| Kindle Paperwhite |I am enjoying it so far. Great for reading. Had the original Fire since 2012. The Fire used to make ...| POSITIVE |20201215102500|
|AVpe7AsMilAPnD_xQ78G| Kindle Paperwhite |I am enjoying it so far. Great for reading. Had the original Fire since 2012. The Fire used to make ...| POSITIVE |20201215102500|
|AVpe7AsMilAPnD_xQ78G| Kindle Paperwhite |Allow me to preface this with a little history. I am (was) a casual reader who owned a Nook Simple T...| NEUTRAL |20201215102500|
+--------------------+-------------------+-------------------------------------------------------------------------------------------------------+-----------+--------------+