Amazon S3 Connector UI Setup Guide

Prerequisites

Before launching your connector, complete the following steps:

1. Configure your S3 bucket permissions

2. Configure access to encrypted data in S3

Configure your S3 bucket permissions

There are two ways to configure your S3 permissions - using the AWS Console or using the AWS CLI.

Using the AWS Console

  1. Log in to AWS and open your S3 bucket
  2. Click on Permissions, then Bucket Policy
  3. Modify the policy template below with your bucket name and your Datacoral installation's AWS account number, then click Save
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowListBucketToDatacoralAwsAccount",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::ACCOUNT_NUMBER:root"
      },
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::BUCKET_NAME"
      ]
    },
    {
      "Sid": "AllowGetObjectToDatacoralAwsAccount",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::ACCOUNT_NUMBER:root"
      },
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::BUCKET_NAME/*"
      ]
    }
  ]
}

Using the AWS CLI

  • Create an S3 policy file based on the template above. A convenient file name is 's3policy.json'.
  • Modify the policy template with your BUCKET_NAME and ACCOUNT_NUMBER:
    • BUCKET_NAME is the name of the S3 bucket. (Example: prod-events)
    • ACCOUNT_NUMBER is the AWS account number where Datacoral is installed. (Your AWS account number can be found in the top-right menu of the console.)
  • At the command line, enter
    aws s3api put-bucket-policy --bucket BUCKET_NAME --policy file://s3policy.json
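
To verify that the policy was applied, you can read it back with get-bucket-policy (an optional check, not part of the setup itself):

    aws s3api get-bucket-policy --bucket BUCKET_NAME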

Configure access to encrypted data in S3

If your data in S3 resides in a different AWS account and is encrypted using KMS keys, you must allow the Datacoral connector code to decrypt it. The following steps enable this access.

  1. Log in to the AWS account where Datacoral is installed. Also, log in to the AWS account that owns the encrypted S3 bucket.

    You may need two browser windows, one for each AWS console, to navigate these steps effectively.

  2. Go to the Key Management Service (KMS) dashboard in the account that owns the S3 bucket. Click on the Customer managed keys section, which is where you should find the KMS keys used to encrypt your S3 data.

  3. Select your KMS key. Note down and copy its ARN. You will return to this page later to edit the "Key policy" section.
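
    If you prefer the CLI, the key ARN can also be fetched with describe-key; alias/YOUR_KEY_ALIAS is a placeholder for your key's alias or key ID:

        aws kms describe-key --key-id alias/YOUR_KEY_ALIAS --query 'KeyMetadata.Arn' --output text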

  4. Switch to another browser window and go to the AWS account that has Datacoral installed.

  • Open the Identity and Access Management (IAM) service page
  • Click on Roles and search for LambdaExecRole.
  • You will see the Datacoral Lambda-exec role in the format datacoral-{INSTALLATION}-datacoralRoles-LambdaExecRole-{ID}

    Ignore the other Datacoral role containing "API", which has the format datacoral-{INSTALLATION}-datacoralRoles-APILambdaExecRole-{ID}

  5. Click on the Datacoral Lambda-exec role.
  • Note down and copy the role ARN
  • Click on Add inline policy
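
    As an optional CLI alternative, you can list matching roles and their ARNs; the JMESPath query below simply filters on the role name:

        aws iam list-roles --query "Roles[?contains(RoleName, 'LambdaExecRole')].[RoleName, Arn]" --output text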
  6. Add the following policy. This is required because the Datacoral Lambdas have minimal permissions out of the box. Replace YOUR_KMS_KEY_ARN below with the key ARN from step 3 above.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Action": [
            "kms:Decrypt",
            "kms:DescribeKey",
            "kms:Encrypt",
            "kms:GenerateDataKey",
            "kms:GenerateDataKeyWithoutPlaintext",
            "kms:ReEncryptFrom",
            "kms:ReEncryptTo"
          ],
          "Resource": "YOUR_KMS_KEY_ARN",
          "Effect": "Allow"
        }
      ]
    }
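
    If you prefer the CLI, the same inline policy can be attached with put-role-policy; the policy name datacoral-kms-access and the file name kms-policy.json are placeholder choices, assuming you saved the policy above to that file:

        aws iam put-role-policy --role-name DATACORAL_LAMBDA_EXEC_ROLE_NAME --policy-name datacoral-kms-access --policy-document file://kms-policy.json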
  7. Go back to your AWS KMS browser tab. Click on your key and scroll down to the "Key policy" section. Add or edit the policy to include the following statement. Replace DATACORAL_EXEC_LAMBDA_ROLE_ARN below with the role ARN from step 5 above.

    {
      "Version": "2012-10-17",
      "Id": "sse-master-key",
      "Statement": [
        {
          "Sid": "Enable IAM User Permissions",
          "Effect": "Allow",
          "Principal": {
            "AWS": "DATACORAL_EXEC_LAMBDA_ROLE_ARN"
          },
          "Action": [
            "kms:Decrypt",
            "kms:DescribeKey",
            "kms:Encrypt",
            "kms:GenerateDataKey",
            "kms:GenerateDataKeyWithoutPlaintext",
            "kms:ReEncryptFrom",
            "kms:ReEncryptTo"
          ],
          "Resource": "*"
        }
      ]
    }
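
    The key policy can also be updated from the CLI, but note that put-key-policy replaces the entire key policy: the file (key-policy.json is a placeholder name) must contain your full merged policy, including any statements that preserve your administrators' access:

        aws kms put-key-policy --key-id YOUR_KEY_ID --policy-name default --policy file://key-policy.json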
  8. Save all your changes in both accounts. The connector should now be able to read the encrypted data in S3.

Step 1: Select S3 Ingest connector

  • Go to the Datacoral Webapp
  • From the main menu, click on Add connector
  • In the drop-down list, find and select the S3 Ingest connector

Step 2: Configure connection parameters

  • Enter the name of the connector, select the destination warehouse, and click Next
  • Enter the source bucket name, click on Check Connection, and click Next (an optional sanity check follows below)
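
Before clicking Check Connection, you can optionally confirm the bucket name from any credentials that already have access to it; a minimal sanity check:

    aws s3 ls s3://BUCKET_NAME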

Step 3: Configure source information

  • Interval - set the frequency of data extraction
  • Click on Fetch Source Metadata and then click Next

Step 4: Configure loadunit information

  • Add a loadunit by clicking on "Add Loadunit"; the same fields apply when editing a loadunit
  • Loadunit name - set the name of the loadunit
  • Interval - set the extraction frequency for the loadunit
  • S3 Source Path Prefix List - enter the S3 source path in regex format (see the note below). This applies only to the loadunit being added. Ensure that the path contains files in a valid format. Dynamic variable substitution can be used to load date-partitioned folders.
Note

S3 Source Path Prefix: The expected input for the S3 Source Path Prefix parameter is: <S3 Prefix Path>/<Supplantable Parts>/<filename|filefilter|filesuffix>

This format should be converted to a regex and provided as input in the S3 Ingest loadunit UI. For example, if the source path prefix provided is test\/somefile\.csv, the input is internally parsed into two separate parameters:

  • S3 Prefix Path: test/
  • File Filter: somefile\.csv

A general rule of thumb: the S3 Prefix Path cannot contain wildcard characters (special characters such as +, *, ., and so on), while the file filter can. This rule is applied while parsing the segments of the S3 Source Path Prefix input parameter.

Invalid Input                            Valid Input
test/anotherfolder/thirdfolder/*.csv     test\/anotherfolder\/thirdfolder\/.*\.csv
*.csv                                    .*\.csv

The following are examples of the S3 Source Path Prefix input parameter:

Desired input                                       S3 Source Path Prefix (input to be provided)
*.csv                                               .*\.csv
+/nested2/+.csv                                     .+\/nested2\/.+\.csv
s3ingest/s3ingesttest/*csv                          s3ingest\/s3ingesttest\/.*csv
s3ingest/supplantable/{YYYY}-{MM}-{DD}/test/*csv    s3ingest\/supplantable\/{YYYY}-{MM}-{DD}\/test\/.*csv
abc/pqrs/*-abc.csv                                  abc\/pqrs\/.*-abc\.csv
abc-*/pqr.csv                                       abc-.*\/pqr\.csv
abc/pqrs/{YYYY}/{MM}/*-abc.csv                      abc\/pqrs\/{YYYY}\/{MM}\/.*-abc\.csv
*/nested2/+.csv                                     .*\/nested2\/.+\.csv
+/nested2/*.csv                                     .+\/nested2\/.*\.csv
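
A quick way to sanity-check a pattern against your bucket's contents is to test the unescaped form of the regex locally; the escaped slashes (\/) are only required in the UI input. A sketch, using the abc/pqrs example above and a placeholder bucket name:

    aws s3 ls s3://BUCKET_NAME --recursive | grep -E 'abc/pqrs/.*-abc\.csv'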
  • Paginate - check this option if the data volume is high
  • Extraction mode - can be set to incremental or snapshot
  • Uncompressed file - set to 'true' by default; uncheck it if you want to ingest gzip files

    Note: Only the gzip compression format is supported (see the file checks after this list).

  • Data Format - select one of the supported formats ("JSON", "CSV", "AVRO", or "PARQUET")
Note: Formats can vary across loadunits of the same bucket; make sure the right format is selected for each loadunit.

Parameters for dataformat as CSV:

  • CSV delimiter - define the delimiter for the CSV file. Default is pipe (|).
  • CSV record delimiter - define the record delimiter for the CSV file. Default is newline (\n).
  • Number of lines to skip - the number of lines to skip as part of the data load. The header line of a CSV file usually contains the column names.

    Note: The default value for this field is 0. Set it to 1 for CSV files with a header line, as the S3 connector will fail to load without column names (see the file checks after this list).



  • Filtering options
    • Include columnList - add the list of columns (one by one) that need to be included
    • Exclude columnList - add the list of columns that need to be excluded

      Note: These fields accept both regular expressions and exact column names.
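
As mentioned in the notes above, it helps to inspect a sample file before configuring these fields. A minimal sketch, assuming placeholder object paths:

    # Check whether the first line of a plain CSV is a header
    aws s3 cp s3://BUCKET_NAME/path/to/somefile.csv - | head -n 2

    # Decompress a gzip file on the fly and inspect its first lines
    aws s3 cp s3://BUCKET_NAME/path/to/somefile.csv.gz - | gunzip | head -n 2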

Step 5: Edit data layouts

Update data types as needed and click on Next to add the connector.

Step 6: Configure warehouse

  • Load Mode - select replace, append, or merge based on the use case
    • The merge mode requires a Primary key as a mandatory input; it is highly recommended to provide the Update timestamp column as well
    • To record soft deletes, provide the Delete Timestamp Column name.

When done with the configuration changes, click Update and then Next at the top right.

Step 7: Confirm the configuration

A confirmation dialog will appear; click Next to confirm the addition of the connector.

Connector Added

Once you land on the confirmation page, you have successfully added the connector.

Questions?

Please contact Datacoral's Support Team; we'd be more than happy to answer any of your questions.