PostgreSQL CDC Connector Prerequisites

Prerequisites

Set up database credentials and access

Use an existing database user OR create a new user for Datacoral. If creating a new user, please run the following statement on your PostgreSQL database, replacing datacoral-password with a password of your choice.

Create user

CREATE USER datacoral PASSWORD 'datacoral-password';

Encrypt password (recommended)

All customer data is encrypted at rest as well as in motion using customer managed KMS keys. Datacoral’s cross-account role does not have decrypt permissions on the KMS keys. However, we do recommend using pre-encrypted credentials, as it allows for you to source control configuration files to automate installations for Data Ops and share configurations without compromising security. Please refer to the instructions on encrypting your password before adding them to encrypt your passwords.

Grant permissions

Grant the datacoral user read-only access to all tables you need to replicate. If the tables you want to replicate are in a separate schema other than public, please change the commands below appropriately. The last command makes sure that any future tables will be accessible to datacoral. If you have several schemas and you wish to grant access to all of them, you will need to repeat these three commands for each schema.

GRANT USAGE ON SCHEMA "public" TO datacoral;
GRANT SELECT ON ALL TABLES IN SCHEMA "public" TO datacoral;
ALTER DEFAULT PRIVILEGES IN SCHEMA "public" GRANT SELECT ON TABLES TO datacoral;

Restricted table permissions (optional)

To grant access only to a subset of tables in a schema for security reasons, follow the below instructions

  1. Revoke access to the schema and ensure datacoral user has access to the schema itself:
GRANT USAGE ON SCHEMA "schema_name" TO datacoral;
ALTER DEFAULT PRIVILEGES IN SCHEMA "schema_name" REVOKE SELECT ON TABLES FROM datacoral;
REVOKE SELECT ON ALL TABLES IN SCHEMA "schema_name" FROM datacoral;
  1. Repeat this command for each table you wish to include:
GRANT SELECT ON "schema_name"."table_name" TO datacoral;

Restricted column-level permissions (optional)

To grant access only to some columns in a table, first make sure you have removed any previously given permission to all columns in that table:

REVOKE SELECT ON "schema_name"."table_name" FROM datacoral;

Then give permission to the specific columns, e.g. columnA and columnB:

GRANT SELECT ("column_one", "column_two") ON "schema_name"."table_name" TO datacoral;

Enable logical decoding at source

Amazon RDS/Aurora PostgreSQL instructions

a. Set logical replication parameters

  • Edit parameter group to modify parameters (such as rds.logical_replication) directly in RDS Console (Assuming Postgres is an AWS RDS Instance as well).
    • set rds.logical_replication to 1
    • set statement_timeout to 0 in order to prevent SQL queries from timing out.
    • set idle_in_transaction_session_timeout to less than an hour in order to prevent runaway/hanging transactions.
    • set wal_sender_timeout to 0 in order to prevent the WAL sender process on the server from disconnecting to the PostgreSQL CDC connector in case of a lag.
  • Note that default parameter group cannot be modified. So, you will need to create a new parameter group for your Postgres version, and then edit the parameter group to set the parameters listed above
  • Modify Postgres instance to use this new parameter group, and check “apply immediately” box.
  • Reboot the instance to have the parameter take effect.

Note : As part of applying rds.logical_replication parameter, RDS internally sets the wal_level, max_wal_senders, max_replication_slots and max_connections parameters.

b. Provide rds_replication privileges to Datacoral user

The database user (e.g. “datacoral”) who will access the data should have access to rds_replication.

grant rds_replication to datacoral;

c. Check Datacoral user privilege

  • Login to source database using same user credentials that is used by slice e.g, datacoral
  • Execute the following query and verify that datacoral is a member of rds_replication
SELECT
r.rolname,
r.rolsuper,
r.rolinherit,
r.rolcreaterole,
r.rolcreatedb,
r.rolcanlogin,
r.rolconnlimit,
r.rolvaliduntil,
ARRAY
( SELECT b.rolname
FROM pg_catalog.pg_auth_members m
JOIN pg_catalog.pg_roles b
ON (m.roleid = b.oid)
WHERE m.member = r.oid
)as memberof
,r.rolreplication
,r.rolbypassrls
FROM pg_catalog.pg_roles r
WHERE r.rolname = 'datacoral'
ORDER BY 1

d. Make sure your database is addressable from your Datacoral VPC

i. Database accessible via the internet

First, you need to locate Datacoral's Elastic IP for your installation. Please follow the steps listed here to get the Elastic IP.

If your database is hosted on AWS and is not publicly accessible, need to authorize ingress for Datacoral's Elastic IP to the db security group. You can do this in the AWS RDS console or via AWS CLI:

aws ec2 authorize-security-group-ingress \
--group-id <security group associated with rds instance> \
--protocol tcp \
--port <db port> \
--cidr <elastic ip>

Here are the steps to add Datacoral's elastic IP to the db security group through the AWS console

  1. Click on this link to access AWS RDS home and select the appropriate AWS region click on the source PostgreSQL database
  2. You will land on the below page Scroll down to find the Security group rules and click on the Security Group
  3. On Security group page, go to Inbound rules tab and click on Edit inbound rules
  4. On the Edit inbound rule page, click on Add rule
  • Add the Type as PostgreSQL or Custom TCP
  • Add the port as 5432
  • Source as custom
  • Copy paste the elastic IP
    note

    A CIDR block is expected here, so please add '/32' at the end of Datacoral elastic IP before pasting. For example, if the elastic IP address is 35.165.137.20, then please paste 35.165.137.20/32

and click on Save rules

ii. Database accessible via VPC Peering

If you don't want any traffic from your database to go over the internet, you can setup VPC Peering between the Datacoral Installation's VPC and your database's VPC. See VPC peering for how to set it up.

note

If you are using PostgreSQL from Heroku, please refer to the instructions here on how to set up VPC peering.

If your Postgres Database is in AWS, you would have to modify the subnet group and security groups so that network connections are allowed to the Datacoral VPC.

  • For the RDS subnet group, the Network ACL should allow inbound and outbound connection to the Datacoral VPC.
  • For the RDS security group, we would need an inbound and outbound rule to/from the Datacoral VPC.

Generic PostgreSQL instructions

a. Enable logical decoding on the server

  • Log into your PostgreSQL server as a superuser.
  • Edit postgresql.conf file by adding the parameters below
    wal_level=logical
    wal_sender_timeout = 0
    max_replication_slots = <max_replication_slots>
    max_wal_senders = <max_wal_senders>
    • Set <max_replication_slots> value to be equal to or higher than the number of PostgreSQL connectors that use WAL plus the number of other replication slots your database uses.
    • Set <max_wal_senders> value to be equal to be at least twice the number of logical replication slots
  • Add the following to the pg_hba.conf file
    host replication datacoral <datacoral_elastic_ip> md5
  • Restart the PostgreSQL service
    • After editing postgresql.conf and pg_hba.conf files, restart PostgreSQL service to ensure the changes take effect.
  • Verify logical replication configuration on the Server
    • To check whether the above changes have been applied, run the following command on a PostgreSQL terminal:
      SHOW wal_level;

b. Make sure your database is addressable from your Datacoral VPC

If your database is hosted outside of AWS, please follow the appropriate networking setup to allow for ingress traffic for Datacoral to access your database. Please note that you will have to do this for both master as well as the replica.

New feature: Streaming replication using publication

PostgreSQL introduced publish/subscribe model-based streaming replication starting version 10 onwards. The new model provides a way to publish logs for only the tables that we are interested in replicating. The logical replication slot can be subscribed to the publisher and get the changes in a streaming way. Streaming replication is much faster, and in most cases, the delay between a transaction being written onto the primary and it being sent to the subscriber is 1 second.

If you would like to use publication/subscription way to replicate your data, please execute the steps below

Create a publication by the name datacoral_publication for all tables by executing the query below

create publication datacoral_publication for all tables;
  • Please note that the above query can only be executed by a superuser. If you do not have superuser credentials, we will only be able to use wal2json decoder which will be less efficient than the publication model.

  • If you'd like to create publication for the selected list of tables that the you want to replicate, you can use the command below to create the publication

create publication datacoral_publication for tables schema1.table_1, schema2.table_2, schema3.table_3;
  • If the user doesn’t have superuser, we will be use wal2json decoder, which can’t use publication and will be less efficient that the publication model.

  • Also, when you delete PostgreSQL CDC connector for any reason, please make sure to delete the subscription by running the command below:

drop publication datacoral_publication;

Set the replica identity for tables with/without Primary Keys and indices

The replica identity of a table determines what gets written to the WAL for any change (INSERT/UPDATE/DELETE) happening to a table. If we have NOTHING as replica identity, the WAL won’t have any details on it, so the logical decoding will not be able to get the relevant changes.

  • Make sure the replica identity is set correctly for the tables to be replicated. Replica identity should be as below:
    - For tables with primary key - DEFAULT, USING INDEX, FULL
    - For tables without PK but an index - DEFAULT, USING INDEX, FULL
    - For tables without PK or an index - FULL
    Note: The connector filters out the tables not having the required replica identity set. For example: A table without primary key or index having a Default replica identity will not be included in the list of tables to be replicated.

Make sure that SSL has not been enforced

  • Make sure force ssl has been turned off as there is no support for it at the moment. So we need to ensure that our connection to the source is not dropped because of it. We are working on supporting SSL and it will be made available soon.

Handling TOAST columns (Optional)

TOAST is a mechanism in PostgreSQL that allows database administrators to keep physical data rows from exceeding the size of a data block. Having tables with "TOASTed" columns does have implications on data replication using Change Data Capture (as has been noted before).

If the actual value of "TOASTed" columns is important for replication into the data warehouse, then the "replica identity" for the table has to be set to FULL. This can be done by running:

ALTER TABLE <schema_name.table_name> REPLICA IDENTITY FULL;

Alternatively, if the "TOASTed" column is part of an index (or can be used in an index), the data can be replicated successfully by setting the "replica identity" to INDEX. Commands for this:

ALTER TABLE <schema_name.table_name> ALTER COLUMN <toasted_column> SET NOT NULL;
CREATE UNIQUE INDEX <toast_col_idx_name> ON <schema_name.table_name>(<toasted_column>);
ALTER TABLE <schema_name.table_name> REPLICA IDENTITY USING INDEX <toast_col_idx_name>;

If this is not performed, we will have NULL or placeholder values show up in the column instead. As a warning, setting "replica identity" to FULL might increase the disk space consumed, since all the columns are being written to WAL all the time.

note

If you are using publication for CDC, please note that all the JSON column values that are huge integers in the PostgreSQL DB will be made NULL. It is not a problem in wal2json decoder.

Next steps