Schema Change Handling

Datacoral provides no-code connectors to fetch data from databases such as MySQL and PostgreSQL and load it into data warehouses such as Redshift and Snowflake. The data can be fetched either by making database queries or using Change Data Capture (CDC).

These databases back critical applications, store important data and help our customers run their business more effectively. This means that having reliable connectors to accurately replicate the data into data warehouses for analytics purposes is a necessity for our customers. Aside from having a high volume of transactions in the source databases, a common pattern that arises is that of schema for different tables being updated as use cases change. For example, a company might decide to separate the data for different customers into separate tables for security purposes and now there are more tables in the source database. Similarly, a company may decide that they don’t need to collect a particular field about their users anymore and delete a column in their users table. It is important that the schema changes get replicated alongside the changes in the data itself.

Schema changes can take many forms, and Datacoral handles each of them appropriately. At a high-level:

  1. Datacoral has separate metadata sensors for detecting when table-level or column-level schemas change. Once detected at the source, such a change gets applied at the destination warehouse and the data syncs respect the new schema at the destination.
  2. Datacoral never performs a destructive action in the destination data warehouse, even if the same action has been performed on the source database. This is to ensure that we never delete data that might still be useful to a user.

Different schema changes are described in the sections below.

Table-level Changes

Adding a new table at source

If a new table gets added at source, then Datacoral will automatically detect it and create its replica in the new warehouse and start syncing data to it.

Deleting a table at source:

If a table gets deleted at source, then Datacoral will stop syncing the table to the warehouse, but the table itself will not get deleted in the data warehouse.

Renaming a table at source:

Our sensors do not capture rename commands run at the source. Instead, table renames are detected as one table getting deleted and another table getting added. Historical syncs will be triggered for the table with the new name.

Column-level Changes

Adding a new column to a table at source:

If a new column gets added to a table in the source database, then Datacoral will automatically detect it and add it to the table at the destination as well. Data will start being replicated into the new column from that point on. Please note that the ordinal position of the column could be different in the destination compared to the source.

Deleting a column in a table at source:

If a column gets deleted from a source table, Datacoral will stop replicating data into this new table, but won’t delete the column from the destination table. They will simply appear as NULLs from that point on.

Modifying the data type of a column in a source table:

Typically source databases are a lot more permissive than destination data warehouses wrt the restrictions on the data type changes allowed.

Redshift specifies restrictions for an alter table statement such as ALTER COLUMN column_name TYPE new_data_type;. When changing a VARCHAR column, there are the following limitations:

  • You can't alter a column with compression encodings BYTEDICT, RUNLENGTH, TEXT255, or TEXT32K.
  • You can't decrease the size less than the maximum size of existing data.
  • You can't alter columns with default values.
  • You can't alter columns with UNIQUE, PRIMARY KEY, or FOREIGN KEY.

Snowflake specifies the following restrictions:

  • You can't change a column data type to a different type (e.g. STRING to NUMBER).
  • You can't decrease the length of a text/string column (e.g. VARCHAR(50) to VARCHAR(25)).
  • You can't change the scale of a number column (e.g. NUMBER(10,2) to NUMBER(10,4)).
  • You can't add a default or change the default for a column, unless the default is a sequence.

Athena also has restrictions on the ALTER TABLE statements allowed, which might depend on the data storage format.

In the cases where the underlying data warehouse supports the type change, Datacoral will detect the change and apply it to the destination warehouse. Data will then successfully sync from the source table to the destination table.

In case the underlying data warehouse does not support the type change, after detecting the change, Datacoral will not apply it at the destination and instead, notify our operations team and the customer for further triage.

Renaming a column in a source table:

Our sensors do not capture actual rename commands run at the source. Column renames are detected as one column getting deleted and another column getting added. Historical syncs of new columns with the values from the older column will be performed.