PostgreSQL Collect Slice

Overview

PostgreSQL is a popular database used in applications and analytics. The Datacoral PostgreSQL slice collects data from a PostgreSQL database and flows it into a data lake or data warehouse, such as Amazon Athena or Amazon Redshift.

Capabilities

Please refer to the Database Collect Slices Overview to understand the overall capabilities of the database collect slices.

In addition to the default capabilities of a standard database slice, the Postgres slice supports:

  • Replication of JSON columns including extraction of embedded attributes from a JSON object.
  • Extraction of JSON arrays into their child tables for easier analysis
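
Conceptually, extracting a JSON array into a child table looks like the following Python sketch. This is only an illustration of the idea, not the slice's actual implementation; the function and field names (parent_id, items) are hypothetical.

```python
import json

def extract_child_rows(parent_id, record, array_field):
    """Flatten a JSON array column into rows for a hypothetical child table."""
    rows = []
    for index, item in enumerate(json.loads(record[array_field])):
        # Each array element becomes one child row, keyed back to its parent.
        rows.append({"parent_id": parent_id, "idx": index, **item})
    return rows

order = {"items": json.dumps([{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}])}
child_rows = extract_child_rows(42, order, "items")
for row in child_rows:
    print(row)
```

Each element of the array becomes its own row in a child table, joined back to the parent through the parent's key, which is what makes the array contents easy to analyze with plain SQL.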

Steps to add this slice to your installation

The steps to launch your slice are:

  1. Identify your database credentials
  2. Make sure your database is addressable from your Datacoral VPC
  3. Specify the slice config
  4. Add the PostgreSQL collect slice

1. Identify your database credentials

  • Use an existing database user or create a new user for Datacoral. If creating a new user, run the following statement on your PostgreSQL database, replacing datacoral-password with a password of your choice.

Create user

CREATE USER datacoral PASSWORD 'datacoral-password';

Encrypt password (recommended)

All customer data is encrypted at rest and in motion using customer-managed KMS keys. Datacoral's cross-account role does not have decrypt permissions on the KMS keys. We nevertheless recommend using pre-encrypted credentials: they let you source-control configuration files to automate installations for Data Ops and share configurations without compromising security. Please refer to the instructions on encrypting your password before adding it to the slice configuration.

Grant permissions

  • Then grant the datacoral user read-only access to all tables you need to replicate. If the tables you want to replicate are in a schema other than public, change the commands below accordingly. The last command ensures that any future tables will also be accessible to datacoral. If you have several schemas and wish to grant access to all of them, repeat these three commands for each schema.
GRANT USAGE ON SCHEMA "public" TO datacoral;
GRANT SELECT ON ALL TABLES IN SCHEMA "public" TO datacoral;
ALTER DEFAULT PRIVILEGES IN SCHEMA "public" GRANT SELECT ON TABLES TO datacoral;
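
If you manage many schemas, the three statements above can be generated programmatically rather than typed out per schema. A minimal sketch (the grant_statements helper is hypothetical, not part of any Datacoral tooling):

```python
def grant_statements(schemas, user="datacoral"):
    """Generate the three per-schema grant statements from the docs above."""
    statements = []
    for schema in schemas:
        statements += [
            f'GRANT USAGE ON SCHEMA "{schema}" TO {user};',
            f'GRANT SELECT ON ALL TABLES IN SCHEMA "{schema}" TO {user};',
            f'ALTER DEFAULT PRIVILEGES IN SCHEMA "{schema}" '
            f'GRANT SELECT ON TABLES TO {user};',
        ]
    return statements

# Print the statements for two example schemas; run the output via psql.
for stmt in grant_statements(["public", "reporting"]):
    print(stmt)
```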

Restricted table permissions (optional)

To grant access to only a subset of tables in a schema for security reasons, follow the instructions below:

  1. Revoke table-level access while ensuring the datacoral user still has access to the schema itself:
GRANT USAGE ON SCHEMA "schema_name" TO datacoral;
ALTER DEFAULT PRIVILEGES IN SCHEMA "schema_name" REVOKE SELECT ON TABLES FROM datacoral;
REVOKE SELECT ON ALL TABLES IN SCHEMA "schema_name" FROM datacoral;
  2. Repeat this command for each table you wish to include:
GRANT SELECT ON "schema_name"."table_name" TO datacoral;

Restricted column-level permissions (optional)

To grant access to only some columns in a table, first make sure you have revoked any previously granted permission on all columns of that table:

REVOKE SELECT ON "schema_name"."table_name" FROM datacoral;

Then grant permission on the specific columns, e.g. column_one and column_two:

GRANT SELECT ("column_one", "column_two") ON "schema_name"."table_name" TO datacoral;

2. Make sure your database is addressable from your Datacoral VPC

Database accessible via the internet

If your database is hosted on AWS and is not publicly accessible, you need to authorize ingress from Datacoral's elastic IP in the database's security group. You can do this in the AWS RDS console or via the AWS CLI:

aws ec2 authorize-security-group-ingress \
--group-id <security group associated with rds instance> \
--protocol tcp \
--port <db port> \
--cidr <elastic ip>/32
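
Before adding the slice, you may want a quick check that the database host and port accept TCP connections from wherever you are testing. A standard-library Python sketch (the is_reachable helper is illustrative, not a Datacoral tool):

```python
import socket

def is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Note that reachability from your laptop does not prove reachability from the Datacoral VPC; the security-group rule above governs the latter.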

Database accessible via VPC Peering

If you don't want any traffic from your database to go over the internet, you can set up VPC Peering between the Datacoral Installation's VPC and your database's VPC. See VPC peering for how to set it up.

If your Postgres Database is in AWS, you would have to modify the subnet group and security groups so that network connections are allowed to the Datacoral VPC.

  • For the RDS subnet group, the Network ACL should allow inbound and outbound connection to the Datacoral VPC.
  • For the RDS security group, we would need an inbound and outbound rule to/from the Datacoral VPC.

Other

If your database is hosted outside of AWS, please follow the appropriate networking setup to allow for ingress traffic for Datacoral to access your database.

3. Specify the slice config

To get a template for the PostgreSQL slice configuration, save the output of the describe --input-parameters command as follows:

datacoral collect describe --slice-type pg \
--input-parameters > pg_parameters_file.json

Necessary input parameters:

  • database - name of your PostgreSQL database
  • host - database host (or IP address)
  • port - database port
  • username - database user name
  • secretKey - database password

Optional input parameters:

At the slice level, these optional parameters are available:

  • tableWhiteList - a comma-separated list of tables you want to pull. You can also set it to ["public.%"] for all tables in the public schema or ["%"] for all tables in all schemas. The default is ["public.%"].
  • tableBlackList - list of tables that you do not want to pull.
  • schedule - in cron format
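
The schedule parameter uses the standard five-field cron format (minute, hour, day-of-month, month, day-of-week). As a quick illustration, a sketch that names the fields of an expression such as the 0 0 * * * used in the example template below (the parse_cron helper is hypothetical):

```python
def parse_cron(expression):
    """Split a five-field cron expression into named fields."""
    fields = expression.split()
    if len(fields) != 5:
        raise ValueError("expected 5 fields: minute hour day-of-month month day-of-week")
    names = ["minute", "hour", "day_of_month", "month", "day_of_week"]
    return dict(zip(names, fields))

# "0 0 * * *" means minute 0 of hour 0, every day: a daily run at midnight.
print(parse_cron("0 0 * * *"))
```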

Advanced configuration parameters:

After a slice is added and bootstrapped, additional options become available once schema discovery has run. You can describe the full slice along with the advanced options using:

datacoral describe --slice-name name_of_your_slice

These advanced configurations can be set at the individual table level for fine-grained optimization to match your use case.

Parameters

  • columnList - List of columns to be replicated from the PostgreSQL table. Helpful for column-level SQL-based transformations at the source in Snapshot & Incremental modes. Example:

    "columnList" : [
      "timezone('UTC', timezone('Europe/Paris', created_at)) as created_at",
      "timezone('UTC', timezone('Europe/Paris', updated_at)) as updated_at",
      "company_id",
      "content_type",
      "id",
      "path",
      "size"
    ]
  • columnBlacklist - List of columns to exclude from the PostgreSQL table. Columns in this list will not be part of the data warehouse table. Example:

    "columnBlacklist" : [
      "password"
    ]
  • extractionQuery - Custom data extraction query.

Important Note: Only one of the columnList, columnBlacklist, and extractionQuery parameters can be provided.

  • paginate - Set to "true" to fetch data in individual pages. Helpful for ingesting large tables in either "Snapshot" or "Incremental" mode. When enabled, pageSortCol & pageSize are required.
  • pageSortCol - Column to use for paging. Ex. id
  • pageSize - Number of records to fetch per page; choose a value that minimizes the load on the source database while still fetching records efficiently into the warehouse. Ex. 100000
  • pageStartIndex - To fetch large tables that have sequential ids but no timestamp columns, pageStartIndex serves as a marker for the last id fetched. This attribute is set automatically by the slice the first time and after each extract, to allow subsequent incremental fetches.
  • orderBy - Columns to be used in the order by clause.
  • filterExpression - Filter records from the source table before extraction. Additional date criteria can be specified through dynamic variable substitution. Supported variables are: {YYYY} - Year, {MM} - Month, {DD} - Day, {HH} - Hour, {mm} - Minute, {ss} - Second
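
The paginated modes can be pictured as repeated keyset fetches driven by pageSortCol, pageSize, and pageStartIndex. A simplified in-memory Python sketch of the idea (not the slice's actual implementation):

```python
def fetch_page(rows, sort_col, page_size, start_index):
    """Simulate one paginated extract: rows with sort_col > start_index, capped at page_size."""
    page = sorted((r for r in rows if r[sort_col] > start_index),
                  key=lambda r: r[sort_col])[:page_size]
    # The last id fetched becomes the marker for the next extract.
    next_index = page[-1][sort_col] if page else start_index
    return page, next_index

table = [{"id": i} for i in range(1, 8)]  # a toy table with ids 1..7
start = 0                                 # pageStartIndex before the first fetch
pages = []
while True:
    page, start = fetch_page(table, "id", 3, start)
    if not page:
        break
    pages.append([r["id"] for r in page])
print(pages)  # [[1, 2, 3], [4, 5, 6], [7]]
```

Against a real database, each iteration would correspond to one `SELECT ... WHERE id > :start ORDER BY id LIMIT :page_size` query, which keeps the load on the source small and steady.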

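The dynamic variable substitution supported by filterExpression can be pictured as plain string replacement against the extract time. A hypothetical Python sketch:

```python
from datetime import datetime

def substitute_variables(expression, moment):
    """Replace the documented {YYYY}/{MM}/{DD}/{HH}/{mm}/{ss} markers with values from moment."""
    mapping = {
        "{YYYY}": f"{moment.year:04d}",
        "{MM}": f"{moment.month:02d}",
        "{DD}": f"{moment.day:02d}",
        "{HH}": f"{moment.hour:02d}",
        "{mm}": f"{moment.minute:02d}",
        "{ss}": f"{moment.second:02d}",
    }
    for marker, value in mapping.items():
        expression = expression.replace(marker, value)
    return expression

print(substitute_variables("updated_at >= '{YYYY}-{MM}-{DD} {HH}:{mm}:{ss}'",
                           datetime(2020, 3, 7, 4, 5, 6)))
# updated_at >= '2020-03-07 04:05:06'
```
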
Example template:

{
  "database": "my_pg_database",
  "host": "rds_...aws.com",
  "port": 5432,
  "username": "martha",
  "secretKey": "marthas_password_123",
  "schedule": "0 0 * * *",
  "tableWhiteList": ["table1", "table2"],
  "tableBlackList": ["table3", "table4"]
}
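
Before adding the slice, you can sanity-check that your parameters file contains all of the necessary input parameters listed above. A small hypothetical Python sketch:

```python
import json

# The necessary input parameters from this page.
REQUIRED = ["database", "host", "port", "username", "secretKey"]

def missing_parameters(config):
    """Return the required slice parameters absent from a config dict."""
    return [key for key in REQUIRED if key not in config]

config = json.loads("""{
  "database": "my_pg_database",
  "host": "rds_...aws.com",
  "port": 5432,
  "username": "martha",
  "secretKey": "marthas_password_123"
}""")
print(missing_parameters(config))  # []
```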

4. Add the Slice

datacoral collect add --slice-type pg --slice-name <slice-name> --parameters-file <params-file>
  • slice-name - Name of your slice. A schema named after your slice-name is automatically created in your warehouse.
  • params-file - File path to your input parameters file. Ex. pg_parameters_file.json

5. Advanced configuration

Advanced configuration can be enabled by describing and then updating a slice.

6. Execution modes

The Pg connector supports snapshot, snapshot paginate, incremental, and incremental paginate execution modes. By default, every loadunit within the Pg connector is added in snapshot or snapshot paginate mode based on the table size and structure. To change the execution mode of a loadunit:

Step 1:

In the Datacoral UI, navigate to the Pg connector that you would like to update.

Step 2:

Click Update Connector; you will be redirected to an update page with the complete configuration. Scroll to the desired loadunit. For incremental mode, add/replace the configuration below:

"executionMode": "incremental"

and for incremental paginate mode add/replace below configuration :

"executionMode": "incremental"
"paginate": "true"
"pageSize": "<PAGE_SIZE>"

Then click Update Connector at the bottom.
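
The loadunit update amounts to merging a few keys into the loadunit's configuration object. A hypothetical Python sketch of the edit (the set_execution_mode helper and tableName key are illustrative, not the actual connector schema):

```python
def set_execution_mode(loadunit_config, mode, page_size=None):
    """Return a copy of a loadunit config updated for the chosen execution mode."""
    updated = dict(loadunit_config)
    updated["executionMode"] = mode
    if page_size is not None:
        # Incremental paginate mode additionally needs paginate and pageSize.
        updated["paginate"] = "true"
        updated["pageSize"] = str(page_size)
    return updated

config = set_execution_mode({"tableName": "public.orders"}, "incremental", page_size=100000)
print(config)
```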

Slice output

Amazon S3: Data stored in Amazon S3 is partitioned by date and time under the following bucket: s3://datacoral-data-bucket/<sliceName>

Amazon Redshift: The schema name will be the same as the slice-name. Tables produced by the slice will have the same names as the tables in the source database.

Supported Versions

  • Snapshot & Incremental: PostgreSQL 8.0 and above
  • CDC: PostgreSQL Logical Decoding 9.6.10 and above
  • Hosted flavors
    • Generic PostgreSQL
    • Amazon Aurora
    • Amazon RDS
    • Google Cloud SQL for PostgreSQL and Azure Database for PostgreSQL, addressable from your AWS VPC

Questions? Interested?

If you have questions or feedback, feel free to reach out at hello@datacoral.co or Request a demo