Concepts

Introduction

Datacoral is a cloud-based data pipeline platform. It provides an infrastructure for ingesting and integrating data from a variety of data sources as the data is generated by different operational systems. Transformations can be defined that combine or aggregate the data from the different sources and publish it to target systems dedicated to Analytics, Machine Learning, or Data Warehousing.

A data pipeline typically has three major stages: ingestion, transformation, and publishing.


Ingestion

During ingestion, data is extracted from the data sources in order to populate the pipeline. These sources could be anything from databases like MySQL or Aurora to SaaS services like Salesforce or Netsuite. The pipeline must understand the mechanisms for extracting the data from the sources, including how to reconcile different data formats. This capability requires connectors for the individual data sources, as well as mechanisms for scheduling extractions and for adequately identifying and tracking the data that has changed, known as change data capture (CDC). Ingestion also includes loading the extracted data into a staging area on S3.


Transformation

Transformations act on the ingested data and perform operations like filtering, joins, and aggregation. Datacoral uses database engines to allow transformations to be defined declaratively in SQL. Currently supported engines are AWS Athena, AWS Redshift, and Snowflake. Defining transformations in SQL and executing them on established database engines offers several advantages (a sketch of such a transformation follows the list below):

  1. SQL is a powerful declarative language that is relatively easy to learn and use even for people who are not hardcore software engineers. It has a very large skill pool and there are numerous tools that can generate SQL for a variety of purposes, from simple query tools to data visualization products.

  2. SQL statements can be used as data sources for other SQL statements allowing for the composition of multiple levels of complex transformations. Moreover, the output of a commonly used transformation can easily be materialized into a table so that the transformation doesn't have to be recomputed for each consumer of its data.

  3. Database vendors have invested large amounts of R&D in their products in areas like functionality, performance, and scalability. By utilizing established database technology, Datacoral can leverage the effort that has gone into creating it.
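
As a sketch of the first two points, the hypothetical transformation below joins two ingested tables, aggregates the result, and is then used as the input to a second transformation. The schema and table names (raw.orders, raw.customers, analytics.*) are illustrative only and not taken from an actual Datacoral deployment.

  -- A hypothetical transformation that joins two ingested tables and
  -- aggregates the result. Names are illustrative only.
  CREATE VIEW analytics.orders_by_region AS
  SELECT c.region,
         DATE_TRUNC('day', o.ordered_at) AS order_day,
         COUNT(*)                        AS order_count,
         SUM(o.amount)                   AS total_amount
  FROM raw.orders o
  JOIN raw.customers c ON c.customer_id = o.customer_id
  GROUP BY c.region, DATE_TRUNC('day', o.ordered_at);

  -- Because a view is itself a valid data source, another transformation can
  -- be composed on top of it (point 2); materializing orders_by_region would
  -- avoid recomputing it for every consumer.
  CREATE VIEW analytics.top_regions AS
  SELECT region, SUM(total_amount) AS total_amount
  FROM analytics.orders_by_region
  GROUP BY region
  ORDER BY total_amount DESC
  LIMIT 10;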


Publishing

After the data has gone through the relevant transformations, the publishing phase makes the data available to target systems that are used to derive value from the data through Analytics, Machine Learning, etc. Eventual target systems could be databases optimized for analytic workloads, Jupyter Notebooks, or SaaS products.


Important data pipeline functionality

There are a number of different types of functionality that are crucial to data pipelines.

Flexible scheduling functionality

New or modified data from the source systems must be propagated through the pipeline, either in batches or through streams. Flexible scheduling functionality is needed to ensure that the propagation happens at the desired intervals, whether daily, hourly, or every 5 minutes. The data pipeline must also be able to handle the possibility that some sources suffer occasional delays in providing the data that is supposed to go into a batch for a specific time interval. If the delayed data is supposed to be combined with data that has already arrived in a timely manner, policies are needed for how to deal with such batches. Datacoral provides such functionality in the context of timelabels.

In addition to operations that are scheduled to occur at certain points in time, there is a need to trigger operations based on events. For instance, if a transformation depends on input data from other transformations, the completion of those input transformations can trigger the execution of the transformation that consumes their data.

Timelabels

Timelabels are tags (format: YYYYMMDDHHmm00) associated with operations that result in data changes, like loading data, refreshing materialized views, or changing Glue catalog data. Associated functionality includes the ability to define a "close of books" that determines when a batch is considered complete.
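
As a sketch of the format only (timelabels are assigned by Datacoral itself), the following Redshift-style expression derives the timelabel for a given timestamp:

  -- Deriving the timelabel for 2021-03-15 14:05:37 UTC; the timestamp is
  -- truncated to the minute and the seconds are fixed at 00.
  SELECT TO_CHAR(DATE_TRUNC('minute', TIMESTAMP '2021-03-15 14:05:37'),
                 'YYYYMMDDHH24MI') || '00' AS timelabel;
  -- => 20210315140500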

Dependency graphs

As data is propagated through the pipeline, possibly from multiple sources, through multiple transformations, and to multiple targets, it's important to understand where the data is supposed to go and to know all the inputs to an operation so that it can be verified that they have completed. Additionally, for the result of a complex transformation, it is often desirable to understand what original source data went into the calculation, the notion of "data lineage." For that purpose, a data pipeline needs dependency graphs that represent how different operations depend on getting input data from other operations. Datacoral has that functionality and is able to recognize and track dependencies even through transformations that are defined by complex SQL statements.
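
As a sketch of how dependencies can be derived from the SQL itself, the hypothetical transformation below references raw.orders directly and raw.refunds inside a WITH clause; a dependency graph would record edges from both raw tables to analytics.net_orders, and the lineage of analytics.net_orders traces back to those sources. All names are illustrative.

  -- Hypothetical transformation using a CTE. Even though raw.refunds appears
  -- only inside the WITH clause, it is still an input to analytics.net_orders.
  CREATE VIEW analytics.net_orders AS
  WITH refund_totals AS (
    SELECT order_id, SUM(amount) AS refunded
    FROM raw.refunds
    GROUP BY order_id
  )
  SELECT o.order_id,
         o.amount - COALESCE(rt.refunded, 0) AS net_amount
  FROM raw.orders o
  LEFT JOIN refund_totals rt ON rt.order_id = o.order_id;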

Monitoring and alerting

Monitoring the progress of data flowing through a pipeline is a crucial capability: it shows what data is available at the various stages and helps ensure that SLAs can be met. Datacoral's webapp provides monitoring of the pipelines.

Serverless

While using a serverless architecture is not an absolute requirement for a data pipeline, it is a highly desirable feature in cloud computing for tasks that are transient in nature. Maintaining a server has administrative costs, and whenever a server isn't fully utilized, computing capacity is wasted. AWS introduced Lambda as a platform for event-driven, serverless computing, which is a good fit for the ingestion phase of a pipeline. Datacoral takes advantage of this functionality to minimize the cost and administrative overhead of maintaining a pipeline.


Materialized views

As data flows through the pipeline, it is often materialized into tables, either as intermediate results or at the target of the pipeline. SQL has long had the concept of a view, which is essentially a stored query and as such can be used to express a transformation. If the view is materialized, the result of the query is stored in a table; otherwise, the query is reevaluated every time the view is accessed. Materialization has the benefit of avoiding that recomputation, which can be substantial if the view is accessed by a large number of queries. Datacoral uses materialized views as a mechanism to store the result of a transformation in a table where it can be used as input to other transformations.

On the other hand, if the data in the underlying tables changes, a materialized view must be updated to reflect those changes, i.e., the materialized view needs to be refreshed. There are two major types of refresh: full and incremental (a sketch follows the list below).

  • A full refresh recomputes the materialized view from scratch based on the current data in the underlying tables.

  • In an incremental refresh, the materialized view is only partially modified based on changes to the underlying data. Whether an incremental refresh is possible may depend on the materialized view definition, but when it is possible, it can be far more efficient than a full one.
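
As a sketch in Redshift-style syntax (names are hypothetical, and whether a given refresh runs incrementally or as a full recomputation depends on the engine and the view definition):

  -- Materialize a commonly used aggregation so that its consumers do not have
  -- to recompute it.
  CREATE MATERIALIZED VIEW analytics.order_counts_mv AS
  SELECT customer_id, COUNT(*) AS order_count
  FROM raw.orders
  GROUP BY customer_id;

  -- Bring the materialized view up to date after the underlying table changes;
  -- the engine applies the changes incrementally when the view definition
  -- allows it, and otherwise recomputes the view from scratch.
  REFRESH MATERIALIZED VIEW analytics.order_counts_mv;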


Modes for modifying data

As data flows through the pipeline, it is often materialized in tables, either as intermediate results or in the final targets. As new data comes in, existing data can be affected in different ways depending on the mode that is used to update the table (a sketch of the three modes follows the list below).

  • REPLACE mode replaces all the rows of a table with a new set of rows.

  • APPEND mode adds additional rows to a table. Once a row has been inserted, it is never updated or deleted.

  • MERGE mode applies updates, inserts, and deletes to the table, based on changes captured via change data capture (CDC) from the source systems.
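
The sketch below shows roughly what each mode amounts to at the warehouse level; the table names, the staging schema, and the op column used to mark CDC deletes are all hypothetical.

  -- REPLACE: swap out the entire contents of the table.
  TRUNCATE TABLE analytics.dim_plan;
  INSERT INTO analytics.dim_plan SELECT * FROM staging.dim_plan;

  -- APPEND: add new rows only; existing rows are never updated or deleted.
  INSERT INTO analytics.events SELECT * FROM staging.events_batch;

  -- MERGE: apply CDC changes. Rows present in the CDC batch are removed from
  -- the target and then re-inserted unless the change was a delete.
  DELETE FROM analytics.customers
  USING staging.customers_cdc
  WHERE analytics.customers.customer_id = staging.customers_cdc.customer_id;

  INSERT INTO analytics.customers (customer_id, name, plan, updated_at)
  SELECT customer_id, name, plan, updated_at
  FROM staging.customers_cdc
  WHERE op <> 'delete';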


Different types of tables

Related to the different modes for modifying data, Datacoral supports two types of tables with different characteristics: regular tables and partitioned tables.

  • Regular Tables are what most users are familiar with in a warehouse or a query engine. They are a good fit for small and medium-sized data sets that need efficient REPLACE and MERGE semantics in addition to APPEND semantics, and that don't need periodic pruning of old rows via retention rules.

  • Partitioned Tables differ from regular tables and are a good fit for large tables, tables that use APPEND semantics, and tables that require efficient, periodic pruning of old rows via retention rules. A single partitioned table can contain multiple partitions, each of which corresponds to a particular value of the partition column (a sketch follows below).
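
As a sketch in Athena-style SQL (the table definition, columns, and S3 locations are hypothetical), each distinct value of the partition column maps to its own partition, which makes appending a new batch and pruning old data cheap:

  -- Hypothetical table over the S3 staging area, partitioned by timelabel.
  CREATE EXTERNAL TABLE events (
    event_id   string,
    user_id    string,
    event_type string,
    amount     double
  )
  PARTITIONED BY (timelabel string)
  STORED AS PARQUET
  LOCATION 's3://example-bucket/events/';

  -- Appending a batch registers a new partition rather than rewriting rows.
  ALTER TABLE events ADD PARTITION (timelabel = '20210315140500')
  LOCATION 's3://example-bucket/events/timelabel=20210315140500/';

  -- A retention rule can prune old data by dropping whole partitions.
  ALTER TABLE events DROP PARTITION (timelabel = '20201201000000');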