PostgreSQL Collect Slice
Overview
PostgreSQL is a popular database used in applications and analytics. The Datacoral PostgreSQL slice collects data from a PostgreSQL database and loads it into a data lake or data warehouse such as Amazon Athena or Amazon Redshift.
Capabilities
Please refer to the Database Collect Slices Overview to understand the overall capabilities of the database collect slices.
In addition to the default capabilities of a standard database slice, the Postgres slice supports:
- Replication of JSON columns including extraction of embedded attributes from a JSON object.
- Extraction of JSON arrays into their child tables for easier analysis
Steps to add this slice to your installation
The steps to launch your slice are:
- Set up your database credentials
- Make sure your database is addressable from your Datacoral VPC
- Specify the slice config
- Add the PostgreSQL collect slice
1. Identify your database credentials
- Use an existing database user OR create a new user for Datacoral. If creating a new user, please run the following statement on your PostgreSQL database, replacing `datacoral-password` with a password of your choice.
Create user
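A minimal sketch of the user-creation statement; the user name `datacoral` matches the grant examples later on this page, and `datacoral-password` is a placeholder for the password you choose:

```sql
-- Create a dedicated user for Datacoral.
-- Replace 'datacoral-password' with a password of your choice.
CREATE USER datacoral WITH PASSWORD 'datacoral-password';
```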
Encrypt password (recommended)
All customer data is encrypted at rest and in motion using customer-managed KMS keys. Datacoral's cross-account role does not have decrypt permissions on the KMS keys. However, we do recommend using pre-encrypted credentials, as that allows you to source-control configuration files to automate installations for DataOps and to share configurations without compromising security. Please refer to the instructions on encrypting your password before adding it to your configuration.
Grant permissions
- Then grant the `datacoral` user read-only access to all tables you need to replicate. If the tables you want to replicate are in a schema other than `public`, please change the commands below appropriately. The last command makes sure that any future tables will also be accessible to `datacoral`. If you have several schemas and wish to grant access to all of them, repeat these three commands for each schema.
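The three grant commands, sketched for the `public` schema (swap in your schema name as needed):

```sql
-- Allow the datacoral user to see the schema
GRANT USAGE ON SCHEMA public TO datacoral;
-- Read-only access to all existing tables in the schema
GRANT SELECT ON ALL TABLES IN SCHEMA public TO datacoral;
-- Make sure tables created in the future are also readable by datacoral
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO datacoral;
```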
Restricted table permissions (optional)
To grant access only to a subset of tables in a schema for security reasons, follow the instructions below.
- Revoke any blanket table access in the schema, while ensuring the `datacoral` user retains access to the schema itself:
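A sketch of the revoke step for the `public` schema (adjust the schema name to match yours):

```sql
-- Remove any previously granted table-wide access in the schema
REVOKE SELECT ON ALL TABLES IN SCHEMA public FROM datacoral;
-- Keep access to the schema itself
GRANT USAGE ON SCHEMA public TO datacoral;
```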
- Repeat this command for each table you wish to include:
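The per-table grant might look like the following; `public.my_table` is a hypothetical table name:

```sql
-- Grant read-only access to one specific table; repeat for each table to include
GRANT SELECT ON public.my_table TO datacoral;
```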
Restricted column-level permissions (optional)
To grant access only to some columns in a table, first make sure you have removed any previously granted permission on all columns of that table:
Then give permission to the specific columns, e.g. columnA and columnB:
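The two statements might look like the following sketch; `public.my_table` is a hypothetical table name:

```sql
-- First remove any previously granted table-wide SELECT
REVOKE SELECT ON public.my_table FROM datacoral;
-- Then grant SELECT only on the specific columns
GRANT SELECT (columnA, columnB) ON public.my_table TO datacoral;
```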
2. Make sure your database is addressable from your Datacoral VPC
Database accessible via the internet
If your database is hosted on AWS and is not publicly accessible, you need to authorize ingress from Datacoral's Elastic IP in the database's security group. You can do this in the AWS RDS console or via the AWS CLI:
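A sketch of the AWS CLI call; the security-group id, Elastic IP, and port below are placeholders you would replace with your own values:

```shell
# Authorize ingress from Datacoral's Elastic IP to the database security group
# (sg-0123456789abcdef0 and 203.0.113.10 are placeholder values)
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 5432 \
  --cidr 203.0.113.10/32
```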
Database accessible via VPC Peering
If you don't want any traffic from your database to go over the internet, you can set up VPC Peering between the Datacoral Installation's VPC and your database's VPC. See VPC peering for how to set it up.
If your Postgres Database is in AWS, you would have to modify the subnet group and security groups so that network connections are allowed to the Datacoral VPC.
- For the RDS subnet group, the Network ACL should allow inbound and outbound connection to the Datacoral VPC.
- For the RDS security group, we would need an inbound and outbound rule to/from the Datacoral VPC.
Other
If your database is hosted outside of AWS, please follow the appropriate networking setup to allow for ingress traffic for Datacoral to access your database.
3. Specify the slice config
To get a template for the PostgreSQL slice configuration, save the output of the `describe --input-parameters` command as follows:
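The command shape below is an assumption based on the flags mentioned on this page; check `datacoral collect describe --help` for the exact syntax of your CLI version:

```shell
# Save the input-parameter template to a file for editing
datacoral collect describe --slice-type postgres --input-parameters > pg_parameters_file.json
```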
Necessary input parameters:
- `database` - name of your PostgreSQL database
- `host` - database host (or IP address)
- `port` - database port
- `username` - database user name
- `secretKey` - database password
Optional input parameters:
At the slice level, these optional parameters are available:
- `tableWhiteList` - a comma-separated list of tables you want to pull. You can also set it to ["public.%"] for all tables in the public schema or ["%"] for all tables in all schemas. The default is ["public.%"].
- `tableBlackList` - list of tables that you do not want to pull.
- `schedule` - in cron format
Advanced configuration parameters:
After a slice is added and bootstrapped, additional options become available after schema discovery. You can describe the full slice along with the advanced options using the slice describe command.
These advanced configurations can be set at the individual table level for fine-grained optimization to match your use case.
Parameters
- `columnList` - list of columns to be replicated from the PostgreSQL table. Useful for column-level, SQL-based transformation at the source in Snapshot and Incremental modes.
- `columnBlacklist` - list of columns to be excluded from the PostgreSQL table. Columns in this list will not be part of the data warehouse table.
- `extractionQuery` - custom data extraction query.
Important note: only one of `columnList`, `columnBlacklist`, and `extractionQuery` can be provided.
- `paginate` - set to "true" to fetch data in individual pages. Helpful for ingesting large tables in either Snapshot or Incremental mode. When enabled, `pageSortCol` and `pageSize` are required.
- `pageSortCol` - columns to use for paging, e.g. `id`
- `pageSize` - number of records to fetch per page; choose a value that minimizes the load on the source database while efficiently fetching records into the warehouse, e.g. 100000
- `pageStartIndex` - to fetch large tables that have sequential ids but no timestamp columns, use `pageStartIndex` as a marker for the last id fetched. This attribute is set automatically by the slice the first time and after each extract, to allow subsequent incremental fetches.
- `orderBy` - columns to be used in the ORDER BY clause.
- `filterExpression` - filter records from the source table before extraction. You can specify additional date criteria through dynamic variable substitution. Supported variables: {YYYY} - year, {MM} - month, {DD} - day, {HH} - hour, {mm} - minute, {ss} - second
Example template:
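An illustrative parameters file assembled from the parameters listed above; all values are placeholders, and the optional keys can be omitted to accept their defaults:

```json
{
  "database": "mydb",
  "host": "mydb.example.com",
  "port": 5432,
  "username": "datacoral",
  "secretKey": "<encrypted-or-plain-password>",
  "tableWhiteList": ["public.%"],
  "schedule": "0 */4 * * *"
}
```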
4. Add the Slice
- `slice-name` - name of your slice. A schema with your slice-name is automatically created in your warehouse.
- `params-file` - file path to your input parameters file, e.g. pg_parameters_file.json
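A sketch of the add command using the two parameters above; the exact subcommand and flag names are assumptions, so confirm them against your CLI's help output:

```shell
# Add the PostgreSQL collect slice (pg_slice is a placeholder slice name)
datacoral collect add --slice-type postgres \
  --slice-name pg_slice \
  --params-file pg_parameters_file.json
```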
5. Advanced configuration
Advanced configuration can be enabled by describing and then updating a slice.
6. Execution modes
The Pg connector supports snapshot, snapshot paginate, incremental, and incremental paginate execution modes. By default, every loadunit within the Pg connector is added in snapshot or snapshot paginate mode based on the table size and structure. To change the execution mode of any connector:
Step 1: Go to the Datacoral UI and navigate to the Pg connector that you would like to update.
Step 2: Click on Update Connector; it redirects to the update page with the complete configuration. Scroll to the desired loadunit. For incremental mode, add/replace the incremental configuration; for incremental paginate mode, add/replace the incremental paginate configuration. Then click Update Connector at the bottom.
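The two loadunit fragments might look like the sketch below. The `extractionMode` and `incrementalColumn` key names are placeholders inferred from the modes described above, not confirmed configuration keys; `paginate`, `pageSortCol`, and `pageSize` are the parameters documented earlier on this page:

```json
{
  "incremental": {
    "extractionMode": "incremental",
    "incrementalColumn": "updated_at"
  },
  "incremental_paginate": {
    "extractionMode": "incremental",
    "incrementalColumn": "updated_at",
    "paginate": "true",
    "pageSortCol": "id",
    "pageSize": 100000
  }
}
```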
Slice output
Amazon S3:
Data stored in Amazon S3 is partitioned by date and time under the following bucket prefix: s3://datacoral-data-bucket/<sliceName>
Amazon Redshift: Schema - the schema name will be the same as the slice-name. Tables produced by the slice will have the same names as the tables in the source database.
Supported Versions
- Snapshot & Incremental: PostgreSQL 8.0 and above
- CDC: PostgreSQL Logical Decoding 9.6.10 and above
- Hosted flavors
- Generic PostgreSQL
- Amazon Aurora
- Amazon RDS
- Google Cloud SQL and Azure databases addressable from your AWS VPC
Questions? Interested?
If you have questions or feedback, feel free to reach out at hello@datacoral.co or Request a demo