PostgreSQL CDC is Datacoral’s PostgresSQL Change Data Capture (CDC) slice that leverages the Write-Ahead-Logs (WAL) records and derive the changes happening at the source. This is particularly useful to capture changes including Hard Deletes of rows from tables.
The slice fetches inserted and updated rows as well as IDs of hard deleted rows. The slice DELETES all rows with the IDs in the changed row set and copies all the changed rows into the table (Note that while IDs of Deleted rows are present in the changed rows set, they do not get inserted back).
- Identify your database credentials
- Make sure your database is addressable from your Datacoral VPC
- Enable logical decoding at the source
- Provide rds_replication privileges to Datacoral user
- Connect to Master
- Connect to Read-Replica
1. Identify your database credentials
- Use an existing database user OR create a new user for Datacoral. If creating a new user, please run the following statement on your PostgreSQL database, replacing
datacoral-passwordwith a password of your choice.
CREATE USER datacoral PASSWORD 'datacoral-password';
Encrypt password (recommended)
All customer data is encrypted at rest as well as in motion using customer managed KMS keys. Datacoral’s cross-account role does not have decrypt permissions on the KMS keys. However, we do recommend using pre-encrypted credentials, as it allows for you to source control configuration files to automate installations for Data Ops and share configurations without compromising security. Please refer to the instructions on encrypting your password before adding them to encrypt your passwords.
- Then grant the
datacoraluser read-only access to all tables you need to replicate. If the tables you want to replicate are in a separate schema other than
public, please change the commands below appropriately. The last command makes sure that any future tables will be accessible to
datacoral. If you have several schemas and you wish to grant access to all of them, you will need to repeat these three commands for each schema.
GRANT USAGE ON SCHEMA "public" TO datacoral; GRANT SELECT ON ALL TABLES IN SCHEMA "public" TO datacoral; ALTER DEFAULT PRIVILEGES IN SCHEMA "public" GRANT SELECT ON TABLES TO datacoral;
Restricted table permissions (optional)
To grant access only to a subset of tables in a schema for security reasons, follow the below instructions
- Revoke access to the schema and ensure
datacoraluser has access to the schema itself:
GRANT USAGE ON SCHEMA "schema_name" TO datacoral; ALTER DEFAULT PRIVILEGES IN SCHEMA "schema_name" REVOKE SELECT ON TABLES FROM datacoral; REVOKE SELECT ON ALL TABLES IN SCHEMA "schema_name" FROM datacoral;
- Repeat this command for each table you wish to include:
GRANT SELECT ON "schema_name"."table_name" TO datacoral;
Restricted column-level permissions (optional)
To grant access only to some columns in a table, first make sure you have removed any previously given permission to all columns in that table:
REVOKE SELECT ON "schema_name"."table_name" FROM datacoral;
Then give permission to the specific columns, e.g. columnA and columnB:
GRANT SELECT ("column_one", "column_two") ON "schema_name"."table_name" TO datacoral;
2. Make sure your database is addressable from your Datacoral VPC
Database accessible via the internet
If your database is hosted on AWS and is not publicly accessible, need to authorize ingress for Datacoral's elastic IP to the db security group. You can do this in the AWS RDS console or via AWS CLI:
aws ec2 authorize-security-group-ingress \ --group-id <security group associated with rds instance> \ --protocol tcp \ --port <db port> \ --cidr <elastic ip>
Database accessible via VPC Peering
If you don't want any traffic from your database to go over the internet, you can setup VPC Peering between the Datacoral Installation's VPC and your database's VPC. See VPC peering for how to set it up.
If your Postgres Database is in AWS, you would have to modify the subnet group and security groups so that network connections are allowed to the Datacoral VPC.
- For the RDS subnet group, the Network ACL should allow inbound and outbound connection to the Datacoral VPC.
- For the RDS security group, we would need an inbound and outbound rule to/from the Datacoral VPC.
If your database is hosted outside of AWS, please follow the appropriate networking setup to allow for ingress traffic for Datacoral to access your database. Please note that you will have to do this for both master as well as the replica.
3. Enable logical decoding at source
- Edit parameter group to modify parameters (such as
rds.logical_replication) directly in RDS Console (Assuming Postgres is AWS RDS Instance).
- Default parameter group cannot be modified, you will need to create a new parameter group for your Postgres version, and then edit the parameter group (rds.logical_replication)
- Modify Postgres instance to use this new parameter group, and check “apply immediately” box.
- Reboot the instance to have the parameter take effect.
Note : As part of applying rds.logical_replication parameter, RDS internally sets the
4. Provide rds_replication privileges to Datacoral user
The database user (e.g. “datacoral”) who will access the data should have access to rds_replication.
grant rds_replication to datacoral;
Check Datacoral user privilege
- Login to source database using same user credentials that is used by slice e.g,
- Execute the following query and verify that
datacoralis a member of
SELECT r.rolname, r.rolsuper, r.rolinherit, r.rolcreaterole, r.rolcreatedb, r.rolcanlogin, r.rolconnlimit, r.rolvaliduntil, ARRAY ( SELECT b.rolname FROM pg_catalog.pg_auth_members m JOIN pg_catalog.pg_roles b ON (m.roleid = b.oid) WHERE m.member = r.oid) as memberof , r.rolreplication , r.rolbypassrls FROM pg_catalog.pg_roles r WHERE r.rolname = 'datacoral' ORDER BY 1
5. Connect to Master
Below details will be needed to connect to Master Database:
- Encrypted Password
- List of tables to be synced
6. Connect to Replica
Below details will be needed to connect to the Read-Replica:
- Encrypted Password
- List of tables to be synced
Add the Connector
This connector can be added from the UI.
Advanced configuration can be enabled through describing and updating a slice
Once PG CDC slice has been set up, data will start flowing from Write Ahead Logs to the data warehouse-
- Change data (Insert / Update / Delete) will first flow into S3 on a continuous basis
- On the schedule specified by the user, data will be stored in the datacoral_slot_table in Athena
- On completion of storing in slot table, final changes will be applied to corresponding tables in the destination data warehouse
In parallel, we will do One Time Historical sync from the read-replica for all historical data. This action is performed on Replica so that there is no additional load on Master.
Monitoring and Failure Handling
Datacoral monitors for slot size and data freshness-
- When slot size starts to grow and it is consuming too much disk space in the PG database, Datacoral can automatically recover using a long-running process in AWS Batch (instead of using AWS Lambda, which is what our connectors use)
- When slot size becomes too large, we delete slot, perform historical sync from replica, and then start capturing changes in a CDC fashion again.
Customers can also set up alerts around slot sizes so that when there is an increase in slot sizes, you are alerted appropriately too. As this is an action on Master database, a close monitoring and failure recovery will be key.
Other reference queries:
List all replication slots:
SELECT * FROM pg_replication_slots
How to drop a replication slot:
Data stored in Amazon S3 is partitioned by date and time in the following bucket
Data Warehouse: Schema name will be same as the connector name. Tables produced by the connector will have the same name as the source database.
- PostgreSQL Logical Decoding 9.6.10 and above
- Hosted flavors