MySQL Change Data Capture

Overview

Change Data Capture (CDC) is a technique that captures database inserts, updates, and deletes (along with DDL changes) and replays them in the target data warehouse.

Datacoral's MySQL Change Data Capture (CDC) Slice reads MySQL's row-based replication log, allowing you to track data changes within MySQL and store them in a data warehouse. CDC can be used for tasks such as auditing, copying data to another system, or processing events.

The primary goal of MySQL CDC is to reliably capture data changes in MySQL and store them in a data warehouse. We utilize MySQL binary logs, which are the most efficient way to track data changes.

MySQL data changes can be captured in near-real time. Once the data is captured, it is passed to another process that ensures reliable storage in a data warehouse and availability for further data analysis. MySQL CDC also recovers from failures by picking up from the point where the failure happened. Our paging mechanism allows quick recovery of each failed page, so the whole binary log does not have to be re-read when a failure is encountered. The Datacoral TimeLabel Clock lets us reprocess the exact set of changes that were lost, ensuring the accuracy of data in the warehouse.

Since a MySQL binary log contains change records from hundreds of tables, we use a mix of streams and fan-out processing that lets us upload multiple tables into the warehouse in parallel. This gives you the ability to capture and store multiple schemas with hundreds of tables and millions of records immediately after you install MySQL CDC.

Supported Versions

The following flavors of MySQL are supported:

Flavor of MySQL        Supported versions
MySQL                  5.5, 5.6, 5.7, 8.0
Amazon Aurora MySQL    5.6, 5.7
MySQL on Amazon RDS    5.5, 5.6, 5.7, 8.0

Capabilities

Please refer to the Database Collect Slices Overview to understand the overall capabilities of the MySQL Collect slice.

Steps to add this slice to your installation

The steps to launch your slice are:

  1. Enable binary logging on each MySQL slave
  2. Set up your database credentials
  3. Make sure your database is addressable from your Datacoral VPC
  4. Specify the slice config
  5. Add the MySQL collect slice

1. Enable binary logging on each MySQL slave

You will need to enable the following on your MySQL slaves. [We recommend that you set up a read replica and then perform the actions below on the read replica itself.] Note: You will need privileges sufficient to set global system variables.

  • Ensure you have a read replica set up
  • On the MySQL read replicas and the master where CDC needs to be enabled, set the binlog format to ROW
mysql> set GLOBAL binlog_format = 'ROW';
  • Set the binlog row image setting to FULL; this ensures all columns in the before and after images are logged
mysql> set GLOBAL binlog_row_image = 'FULL';

Note: If you're using Amazon RDS and the default RDS DB parameter group, create a new custom DB parameter group. In the custom parameter group, change the value of the binlog_format parameter to ROW and the binlog_row_image parameter to FULL on both the master and the replica.
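
For example, a custom parameter group can be created and modified with the AWS CLI. This is a sketch: the group name mysql-cdc-params and the family mysql5.7 are illustrative, so substitute values that match your instance:

aws rds create-db-parameter-group \
--db-parameter-group-name mysql-cdc-params \
--db-parameter-group-family mysql5.7 \
--description "Custom parameter group for Datacoral MySQL CDC"

aws rds modify-db-parameter-group \
--db-parameter-group-name mysql-cdc-params \
--parameters "ParameterName=binlog_format,ParameterValue=ROW,ApplyMethod=immediate" \
"ParameterName=binlog_row_image,ParameterValue=FULL,ApplyMethod=immediate"

Remember to associate the custom parameter group with your DB instance for the settings to take effect.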

  • Set the binlog retention period to 48 hours. This ensures that binlogs are retained on the replica for 48 hours, which helps with recovery in case of any downtime of the replica. Ensure you have enough disk space for 48 hours' worth of binlogs
mysql> call mysql.rds_set_configuration('binlog retention hours', 48);
  • Grant the MySQL user that will be used in the slice input parameters the permissions needed to read binlogs
mysql> grant select, replication client, replication slave on *.* TO 'username'@'%';

Note: You may have to restart the slave after making the above changes.
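
To confirm the settings took effect, you can run a quick check from a MySQL client (the rds_show_configuration call applies only to RDS):

mysql> show global variables like 'binlog_format';
mysql> show global variables like 'binlog_row_image';
mysql> call mysql.rds_show_configuration;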

2. Set up your database credentials

  • Use an existing database user OR create a new user for Datacoral. If creating a new user, please run the following statements on your MySQL database, replacing datacoral-password with a password of your choice.

Create user

CREATE USER datacoral@'%' IDENTIFIED BY 'datacoral-password';
GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO datacoral@'%';
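
You can verify that the grants are in place by running:

SHOW GRANTS FOR datacoral@'%';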

Encrypt password (recommended)

All customer data is encrypted at rest as well as in motion using customer-managed KMS keys. Datacoral's cross-account role does not have decrypt permissions on the KMS keys. However, we do recommend using pre-encrypted credentials, as this allows you to source-control configuration files to automate installations for Data Ops and to share configurations without compromising security. Please refer to the instructions on encrypting your password before adding it to your configuration.

3. Make sure your database is addressable from your Datacoral VPC

Database accessible via the internet

If your database is hosted on AWS and is not publicly accessible, you need to authorize ingress for Datacoral's Elastic IP in the database's security group. You can do this in the AWS RDS console or via the AWS CLI:

aws ec2 authorize-security-group-ingress \
--group-id <security group associated with rds instance> \
--protocol tcp \
--port <db port> \
--cidr <elastic ip>/32

Database accessible via VPC Peering

If you don't want any traffic from your database to go over the internet, you can set up VPC peering between the Datacoral installation's VPC and your database's VPC. See VPC peering for how to set it up.

If your MySQL database is in AWS, you will have to modify the subnet group and security groups so that network connections to the Datacoral VPC are allowed.

  • For the RDS subnet group, the Network ACL should allow inbound and outbound connections to the Datacoral VPC.
  • For the RDS security group, add inbound and outbound rules to/from the Datacoral VPC (see the example below).
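
For instance, an ingress rule for the Datacoral VPC's CIDR block can be added to the RDS security group with the AWS CLI; the CIDR placeholder below is whatever range your Datacoral VPC uses:

aws ec2 authorize-security-group-ingress \
--group-id <security group associated with rds instance> \
--protocol tcp \
--port <db port> \
--cidr <datacoral vpc cidr>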

Other

If your database is hosted outside of AWS, please follow the appropriate networking setup to allow ingress traffic so that Datacoral can access your database. Please note that you will have to do this for both the master and the replica databases.

4. Specify the slice config

To get a template for the MySQL CDC slice configuration, save the output of the describe --input-parameters command as follows:

datacoral collect describe --slice-type mysql \
--input-parameters > mysql_parameters_file.json

Necessary input parameters:

  • database - name of your MySQL database

  • host - database host (or IP address)

  • port - database port

  • username - database user name

  • secretKey - database password

  • enableCDC - Set this to true to enable CDC

    If you need to connect to the MySQL server through an SSH tunnel, please specify the input parameters below

  • sshTunnelHost - SSH server host name

  • sshUser - SSH server user name

  • privateKey - SSH server private key (We support only RSA keys)

  • passphrase - SSH server passphrase

    To generate the SSH privateKey and passphrase, run the following command in a terminal

    ssh-keygen -t rsa -m PEM -b 4096

    When prompted for a passphrase, be sure to set one rather than leaving it empty.

Optional input parameters:

  • tableWhiteList - a comma-separated list of tables you want to pull. You can also set it to ["public.%"] for all tables in the public schema or ["%"] for all tables in all schemas. The default is ["public.%"].

  • tableBlackList - a list of tables that you do not want to pull.

  • columnList - a list of columns to be replicated from the MySQL table. Helpful for column-level SQL-based transformations at the source in Snapshot & Incremental modes, e.g. converting timestamp columns to a standard format.

  • schedule - in cron format. Note that for CDC, any schedule setting is ignored; the schedule is pre-set to run hourly. Even though Datacoral CDC streams changes from binlogs, it only applies the changes to the warehouse on an hourly schedule.

Example template:

{
  "database": "my_mysql_database",
  "host": "rds_...aws.com",
  "port": 3306,
  "username": "martha",
  "secretKey": "marthas_password_123",
  "enableCDC": "true",
  "tableWhiteList": ["table1", "table2"],
  "tableBlackList": ["table3", "table4"]
}
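
If you are connecting through an SSH tunnel, the same file would also carry the SSH parameters described above. A hypothetical sketch (the host names, user name, and key contents below are placeholders):

{
  "database": "my_mysql_database",
  "host": "10.0.1.25",
  "port": 3306,
  "username": "martha",
  "secretKey": "marthas_password_123",
  "enableCDC": "true",
  "sshTunnelHost": "bastion.example.com",
  "sshUser": "ec2-user",
  "privateKey": "-----BEGIN RSA PRIVATE KEY-----\n...",
  "passphrase": "your-passphrase"
}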

5. Add the Slice

datacoral collect add --slice-type mysql --slice-name <slice-name> --parameters-file <params-file>

  • slice-name - name of your slice. A schema with your slice-name is automatically created in your warehouse
  • params-file - file path to your input parameters file, e.g. mysql_parameters_file.json
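
For example, with the parameters file generated above and a hypothetical slice name of mysqlcdc:

datacoral collect add --slice-type mysql --slice-name mysqlcdc --parameters-file mysql_parameters_file.json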

Datacoral generated column names

MySQL CDC adds a couple of extra columns to the tables in your warehouse so that you can effectively track the changes to your data. The two columns are:

  • __dc_cdc_modified_at : when a row is read for the first time from the source or is modified in the source, we update this column with the UTC timestamp of when that row was synced
  • __dc_cdc_deleted_at : when a row gets deleted at the source, we update this column with the UTC timestamp of the deletion

The above two columns are of type BIGINT in the warehouse and are populated with the UTC timestamp, with nanosecond precision, at the time of sync.
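
Assuming these BIGINT values represent nanoseconds since the Unix epoch (an assumption made here for illustration), they can be converted to readable timestamps in Redshift along these lines; the schema and table names are placeholders:

SELECT *,
TIMESTAMP 'epoch' + (__dc_cdc_modified_at / 1000000000) * INTERVAL '1 second' AS modified_ts
FROM myslice.mytable;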

Tables with primary keys

Any changes to source tables with primary keys will be merged into the warehouse as per the following rules:

  • A new row in the source table will result in a new row in the warehouse. The __dc_cdc_modified_at column will be set as well
  • An update to an existing row in the source table will result in an update to that row in the warehouse. The __dc_cdc_modified_at column will be set as well
  • Deletion of an existing row in the source table will result in deletion of that row in the warehouse. Note that since we first store your source changes in S3 before loading them into your warehouse, the row will not be deleted in S3. Instead, the row in S3 will be marked for deletion, i.e. we will store it with the __dc_cdc_deleted_at column set to the current UTC timestamp.

You can also change the above behavior and switch to an append mode, i.e. tables with primary keys will behave like tables without primary keys (see below). This can be achieved by setting MergeTargetTable: false in the loader config and updating the slice.

Tables without primary keys

Any changes to source tables without primary keys will be appended into the warehouse as per the following rules:

  • A new row in the source table will result in a new row in the warehouse. The __dc_cdc_modified_at column will be set as well
  • An update to an existing row in the source table will result in a new row in the warehouse. The __dc_cdc_modified_at column will be set as well
  • Deletion of an existing row in the source table will result in the __dc_cdc_deleted_at column being set to the current UTC timestamp.
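
For append-mode tables that do have a logical key, the current state can be reconstructed in the warehouse with a window query along these lines; the table name myslice.mytable and the key column id are hypothetical:

SELECT *
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY __dc_cdc_modified_at DESC) AS rn
  FROM myslice.mytable t
) latest
WHERE rn = 1 AND __dc_cdc_deleted_at IS NULL;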

Datacoral Slot Table

As part of the normal operation of the MySQL CDC connector, Datacoral creates a table in Athena and in the warehouse (such as Redshift) to which you are replicating data. This table contains all the records (inserts, updates, or deletes) that the connector has read from the MySQL binlogs. It is useful for maintaining an audit log, or for analyzing deletes that get replicated to the warehouse. In Athena, this is a partitioned table, partitioned by the timelabel for which data was read from the binlogs.

By default, this table is not synced to the warehouse. This can be changed if you'd also like to have the data from the slot table in your warehouse.

Feel free to create an Athena Materialized View to analyze the data in this table!
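
As a sketch, you could also query the slot table directly in Athena; the table name mysqlcdc_slot and the partition column timelabel below are assumptions for illustration, so check your installation for the actual names:

SELECT *
FROM mysqlcdc.mysqlcdc_slot
WHERE timelabel = '<timelabel>'
LIMIT 100;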

Handling deletes

Regular deletes are synced to the warehouse since those deletes are recorded in the binlogs. However, the operations below are not recorded in the binlogs, so the affected rows won't be deleted in the warehouse. We recommend using the regular MySQL Collect slice for databases and tables where these operations occur regularly.

  • TRUNCATE
  • CASCADING DELETES
  • DROP and recreate table using CREATE
  • LOAD
  • RENAME

One-time historical data sync

When setting up a MySQL CDC connector, Datacoral needs to perform a one-time historical sync of the data. The historical sync is performed using a regular MySQL connector, where a full sync of all tables from the source is performed (through queries against the source database). The order of operations is:

  1. Start fetching change logs from MySQL binlogs.
  2. Perform the historical sync of data for all tables from source to warehouse.
  3. When historical sync is complete, start applying the changes fetched from binlogs to tables in the destination warehouse.

This order of operations ensures that the data in the warehouse is always correct and complete, while ensuring that you get the benefits of Change Data Capture into the future.

Questions? Interested?

If you have questions or feedback, feel free to reach out at hello@datacoral.co or Request a demo