Datacoral's Batch Compute UDFs allow you to perform complex operations within your transformations. As a motivating example, assume there is a table in Amazon Redshift containing Amazon product reviews.
You want to compute the sentiment of each of the reviews. One way to do that is with Amazon's Comprehend service. You can use a Datacoral Batch Compute UDF to call Comprehend by following the steps below.
To work with Datacoral Batch Compute, you currently need both the Datacoral CLI and the AWS CLI:
- Datacoral CLI - Click on Get Datacoral CLI in the Datacoral Webapp to get the CLI.
- AWS CLI - See the AWS CLI Installation Instructions. Please configure the CLI to use the AWS account where your Datacoral platform is deployed.
The rest of the steps require that your Datacoral installation has the Batch Compute feature enabled. Email email@example.com if you don't have it enabled already.
Step 1: Download the UDF template code
Contact us at firstname.lastname@example.org to get access to the template code!
Step 2: Implement the UDF
Step 2.1: Implement the transform method
All you need to do is implement the transform method, which takes a Pandas DataFrame as input and is expected to return a Pandas DataFrame.
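What follows is a minimal sketch of such a transform for the sentiment use case, a hedged illustration rather than the template's actual code. It assumes boto3 is available in the UDF environment and that the input datalayout exposes review_id and review_text columns (hypothetical names; yours will differ):

```python
import boto3
import pandas as pd

# Assumption: the Batch Compute environment has AWS credentials that allow
# comprehend:DetectSentiment (see the IAM steps below).
comprehend = boto3.client('comprehend')

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Take the input DataFrame and return one with a sentiment column.

    Assumes the input datalayout includes 'review_id' and 'review_text'
    columns; adjust to match your own datalayouts.
    """
    def detect(text: str) -> str:
        # DetectSentiment accepts up to 5,000 bytes of UTF-8 text, so the
        # sketch truncates roughly before calling it.
        response = comprehend.detect_sentiment(Text=text[:4500],
                                               LanguageCode='en')
        return response['Sentiment']  # POSITIVE | NEGATIVE | NEUTRAL | MIXED

    out = df[['review_id', 'review_text']].copy()
    out['sentiment'] = out['review_text'].apply(detect)
    return out
```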
A UDF like the sketch above calls AWS Comprehend actions. To add the DetectSentiment action to the BatchRole in your installation, follow the steps below:
Step 1: Go to CloudFormation, search for BatchRole, and click Update.
Step 2: Click on Update Nested Stack, then on Update Stack.
Step 3: Click on Edit Template in Designer, then on View in Designer.
Step 4: In the text window at the bottom-left, search for the batchjobpolicy. A comprehend:DetectSentiment policy needs to be added to the batchjobpolicy under this Role (a sketch of the statement follows this list).
- Refer to the AWS documentation on attaching permissions policies to IAM identities for details on granting permissions to perform Amazon Comprehend actions.
- Once you make the update, click the "Validate Template" button to check the template, then "Create Stack" to update the BatchJobRole.
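The statement added to the batchjobpolicy would look something like the minimal sketch below (a generic IAM policy statement, not copied from Datacoral's template; adjust the Resource scope to your security requirements):

```json
{
  "Effect": "Allow",
  "Action": ["comprehend:DetectSentiment"],
  "Resource": "*"
}
```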
Step 5: Update the CloudFormation stack, following the AWS guidelines for updating stacks.
Step 2.2: Include additional packages
You can add any arbitrary libraries to a requirements.txt file. It is already populated with the most commonly used packages.
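For the sentiment example, the file might contain something like the following (illustrative only; the template's pre-populated list may differ, and you may want to pin versions):

```
pandas
boto3
```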
If you have any Python libraries you wish to use in your UDF code, you can add wheel files for them under a dist directory in the same folder. We will install those and make them available to the UDF.
Step 2.3: Add input and output datalayout files
The template code has a datalayout with two files:
- input_datalayout.json - contains the schema of the input data frame passed to the transform method.
- output_datalayout.json - contains the schema of the output of the UDF. When a transformation is created with this UDF, the destination table will have the schema specified in this file. Note: if the output datalayout changes, please drop the destination table and recreate it manually. (An illustrative sketch of an output datalayout follows this list.)
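Datacoral's actual datalayout format is defined by the template code you receive; the sketch below is purely hypothetical, with invented field names, just to illustrate the idea that an output datalayout maps column names to Redshift types for the destination table:

```json
{
  "columns": [
    { "name": "review_id", "type": "varchar(64)" },
    { "name": "sentiment", "type": "varchar(16)" }
  ]
}
```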
Step 2.4: Create a zip file
Zip the contents of the UDF and all of the libraries it depends on.
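Any tool that produces a flat zip of the folder works. For example, a one-liner using Python's standard library (assuming your UDF code, requirements.txt, and dist/ live in a udf/ directory, which is an assumption about your layout):

```python
import shutil

# Create udf.zip from the contents of the udf/ directory, including
# requirements.txt and any wheel files under dist/.
shutil.make_archive('udf', 'zip', root_dir='udf')
```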
Step 3: Register the UDF
Create a file with the compute and memory requirements for the UDF. For example, create a file compute-resources.json with content like the sketch below.
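The exact fields expected here are defined by the Datacoral CLI; the example below is a hypothetical illustration of what compute and memory requirements might look like (field names assumed, not confirmed against Datacoral's schema):

```json
{
  "vcpus": 2,
  "memory": 4096
}
```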
Register the UDF. Note that the AWS_PROFILE is needed since the UDF zip file is being uploaded to S3.
Step 4: Create a materialized view with the UDF
Create the DPL file
The same Materialized View can also be created through the Datacoral webapp.
This results in a table amazon_product_reviews.review_sentiment in Redshift, with a sentiment value assigned to each review.