Batchcompute - Tips, Tricks and Troubleshooting

Using secret tokens in Python transformations

When writing UDF code with Datacoral’s Batchcompute feature that uses secret tokens (such as API keys) you'd prefer not to bundle into the code, we recommend saving these secret tokens in AWS Secrets Manager. They can then be fetched at runtime, during Batchcompute MV execution.

There are three steps, as seen below.

Step 1: Add secret to Secrets Manager

First, add the secret token to AWS Secrets Manager using the AWS CLI:

> aws secretsmanager create-secret --name MyTestSecret --secret-string '{"api_key": "test_api_key"}'
{
    "ARN": "arn:aws:secretsmanager:us-west-2:000000000000:secret:MyTestSecret-yLWMkY",
    "Name": "MyTestSecret",
    "VersionId": "e80594f1-3239-420d-a5e0-2e230b339be6"
}
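
If you want to confirm the secret was stored correctly, you can read it back with the AWS CLI before wiring it into your UDF:

> aws secretsmanager get-secret-value --secret-id MyTestSecret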

Step 2: Write Python code to fetch secret from Secrets Manager

Next, write the Python code inside your UDF to fetch the secret token from Secrets Manager:

import boto3

client = boto3.client('secretsmanager')
secret_token = client.get_secret_value(SecretId='MyTestSecret')['SecretString']
# secret_token now holds '{"api_key": "test_api_key"}'
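
Since the secret string in this example is a JSON document, you will typically want to parse it to pull out individual keys. Below is a minimal sketch; the get_api_key helper is purely illustrative, and MyTestSecret is just the example secret name from Step 1:

import json
import boto3

def get_api_key(secret_id='MyTestSecret'):
    # Fetch the secret string from Secrets Manager and parse out the api_key field
    client = boto3.client('secretsmanager')
    secret_string = client.get_secret_value(SecretId=secret_id)['SecretString']
    return json.loads(secret_string)['api_key']

api_key = get_api_key()  # 'test_api_key'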

Step 3: Update permissions for AWS Batch to read from Secrets Manager

Finally, update the BatchRole in your Datacoral installation. As described in our security architecture document, Datacoral grants privileges based on the Principle of Least Privilege. For the Secrets Manager calls above to succeed inside AWS Batch, the role assumed by AWS Batch needs permission to read the secret, which means updating the CloudFormation stack that created the Batch Role.

Go to the CloudFormation AWS console, and search for stacks that contain the string "BatchRole". When you find the appropriate stack, click on "Update".

Now, click on "Update Nested Stack".

Click on "Edit template in Designer" and then click on "View in Designer" to open up the CoundFormation designer.

Now, find the BatchJobRole resource in the CF Template, and add the following object to the list of existing Policy Statements:

{
    "Effect": "Allow",
    "Action": [
        "secretsmanager:GetResourcePolicy",
        "secretsmanager:GetSecretValue",
        "secretsmanager:DescribeSecret",
        "secretsmanager:ListSecretVersionIds"
    ],
    "Resource": ["*"]
}
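
If you prefer to grant access to just this one secret rather than all secrets, you can replace the wildcard Resource with the ARN returned in Step 1, for example:

"Resource": ["arn:aws:secretsmanager:us-west-2:000000000000:secret:MyTestSecret-yLWMkY"]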

Using a Dummy Table to Send Triggers to Batchcompute MVs

Typically, Datacoral MVs get triggered when something upstream of them finishes successfully. For Batchcompute MVs, however, one might want to run them on a schedule (say, every 5 minutes) where there is nothing upstream of them. In this case, we recommend setting up a non-datacoral connector with a dummy loadunit configured for the desired schedule. The table representing the dummy loadunit does need to exist in the input warehouse, though it can be empty.

On the specified schedule, the non-datacoral connector will emit a SUCCESS event, which triggers the downstream Batchcompute MV, and voila, we now have a Batchcompute MV running on our desired schedule.

Here are the steps to accomplish this:

Step 1: Create dummy table in input warehouse (say, Redshift)

Run the following SQL commands to create the appropriate table in Redshift.

create schema if not exists triggers;
create table if not exists triggers.five_mins (i int);

This would create an empty table with a solitary column in a new schema called triggers.

Step 2: Set up a non-datacoral connector

Follow the instructions here to set up a non-datacoral connector. If you want the dummy loadunit (and therefore the downstream Batchcompute MV) to run every 5 minutes, you can use the following input params when creating the connector from the CLI:

{
    "sliceName": "triggers",
    "datasource": {
        "type": "nondatacoral",
        "id": "triggers",
        "inputParams": {
            "loadunits": {
                "five_mins": {
                    "datalayout": {
                        "i": {
                            "type": "integer"
                        }
                    }
                }
            },
            "schedule": "*/5 * * * *"
        },
        "loaderConfig": {
            "DataFormat": "JSON",
            "loadunits": {
                "five_mins": {}
            }
        }
    }
}

This will set up the non-datacoral connector appropriately.

Step 3: Use the dummy table in the input query to the Batchcompute MV

When creating a Batchcompute MV, you need to specify an input query that reads data from the input warehouse. Datacoral uses this query to infer upstream dependencies. In the example above, you can now use a query such as the following:

select * from triggers.five_mins limit 1;

This will ensure that the Batchcompute MV runs every five minutes.

You can always add more loadunits (and correspondingly, more dummy tables in the warehouse) to specify triggers at other schedules. For example, adding a new dummy loadunit that sends a trigger every 10 minutes involves running SQL commands as in Step 1, followed by updating the connector with the following deploy parameters:

{
    "sliceName": "triggers",
    "datasource": {
        "type": "nondatacoral",
        "id": "triggers",
        "inputParams": {
            "schedule": "*/5 * * * *",
            "loadunits": {
                "five_mins": {
                    "schedule": "*/5 * * * *",
                    "datalayout": {
                        "i": {
                            "type": "integer"
                        }
                    }
                },
                "ten_mins": {
                    "schedule": "*/10 * * * *",
                    "datalayout": {
                        "i": {
                            "type": "integer"
                        }
                    }
                }
            }
        },
        "loaderConfig": {
            "DataFormat": "JSON",
            "loadunits": {
                "ten_mins": {},
                "five_mins": {}
            }
        }
    }
}

Installing Java inside the UDF

This section describes how to make Java available inside the Docker container when the User Defined Function is created. This is needed, for example, when communicating with a database over JDBC (using the JayDeBeApi library). The version of Java installed is OpenJDK 8. To enable it, add one additional option when calling the udf-create CLI command: the --base-image option, set to datacoral/python-base-java, as in the example below:

AWS_PROFILE=<your_aws_profile> datacoral organize udf-create \
--udf-name <udf_name> \
--module-path /path/to/udf.zip \
--language Python \
--resources /path/to/compute-resources.json \
--base-image datacoral/python-base-java

All Python dependencies that are specified in requirements.txt will be installed as usual.
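
As a rough sketch of what the Java base image enables, the snippet below opens a JDBC connection from Python using JayDeBeApi (which would need to be listed in your requirements.txt). The driver class, JDBC URL, credentials, and jar path are placeholders to replace with your own:

import jaydebeapi

# All connection details below are placeholders for illustration only
conn = jaydebeapi.connect(
    'org.postgresql.Driver',                       # JDBC driver class
    'jdbc:postgresql://db.example.com:5432/mydb',  # JDBC URL
    ['db_user', 'db_password'],                    # credentials
    '/path/to/postgresql-driver.jar'               # driver jar shipped with the UDF
)

cursor = conn.cursor()
cursor.execute('select 1')
print(cursor.fetchall())
cursor.close()
conn.close()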

Reach out to us at support@datacoral.co if you have any questions!