Datacoral provides integrations to extract data from different data sources like databases, file systems, event streams, and APIs, and load them into data warehouses like Snowflake, Amazon Redshift, and Amazon Athena. But once the raw data is available in the destinations, the typical thing to do is transform the raw data into tables that are easy to analyze.
These transformations could be doing several things including
- filtering out malformed data and removing duplicates
- fixing up null values
- standardizing column types and formats
- combining data from different sources into denormalized tables
- summarization and aggregation
In fact, in the ELT (Extract-Load-Transform) model, even actual analytics and model building operations can be treated as transformations performed on data once it has landed in the data warehouse.
In most cases, the best way to build these transformations is in multiple "steps" in a modular way so that it is easier
to understand the logic of transformations as well as debug issues. So, one transformation may depend on the
output of more than one transformation. One can then imagine that a directed graph of dependencies between transformations
(transformation graph) where each
node is a transformation and a directed edge points to a dependency, i.e., if
transform3 --> transform1
transform3 -> transform1,
transform3 depends on the outputs of
These transformations are typically representable fairly easily in SQL in the underlying data warehouse (we will also talk about how to handle transformations that are not representable in SQL via Python UDFs).
The most simplistic way to implement these transformations is via SQL Logical Views. One can imagine creating
views, one on top of another by using view names in the
FROM clause of other views.
So, view is a single node in the transformation graph and the edges point to views (or tables) that are in
FROM clause of the view. And in such a graph, the "leaf nodes" or nodes with no dependencies are
typically tables ingested from different sources.
However, in a live pipeline, the hardest problem is to figure out when to run these transformations in order so that the input data is consistent and fresh before the transformations are run.
SQL has made some really interesting choices around abstraction that we find really compelling.
- SQL is declarative. So, it needs the analyst to only specify what computation needs to be performed, not how.
- SQL needs the data dependencies to be explicit, think
FROMclause. It needs the analyst to be very clear on what input data is needed to perform the operation.
The combination of these two choices has allowed multiple generations of systems to be built that are getting smarter and smarter at optimization, and can compute the answers more quickly and efficiently.
Data pipelines written with workflow managers have made the exact opposite choice of abstraction:
- Workflow managers need the analyst to be imperative, as in, very prescriptive, about the the steps that need to be taken to transform data.
- Workflow managers allow each step to be written as scripts, which then means that there is no way a system can automatically figure out what the data dependencies are.
The above two choices means that it is impossible for a system to be built that can optimize the data pipeline. A system cannot reliably optimize away steps coded up in a data pipeline, neither can it automatically detect data dependencies. So, it is up to the pipeline author to plumb the dependencies between the different steps based on their knowledge about the data dependencies between the different steps - this is typically a technically taxing ask. In addition, is is next to impossible to build any static checks on data pipelines. So, the only option to check whether a data pipeline is built properly is to actually run it and see the results.
At Datacoral, we have tackled the problem of building data pipelines head on by dramatically simplifying how they are built, checked, tested, deployed, and monitored. Datacoral has chosen SQL's abstraction model and empowered its users to build data pipelines with complex transformations that can be automatically orchestrated and optimized.
Users build transformations in SQL using the dialect of the underlying query engine. These transformations can be named (just like
views) and can be used in the
FROM clause of other SQL transformations. Datacoral compiles all transformations and automatically
infers all the dependencies.
In addition, given that all transformations ultimately need data either ingested from a Datacoral connector or some other table already in the warehouse, Datacoral automatically orchestrates these transformations as well.
Orchestrated Transformations Examples
For example, consider the case of a B2B company trying to compute the health of their customers by looking at signals from different sources like
- their PostgreSQL production database for billing information
- Zendesk for the complaints and support tickets they have filed
- Usage events from the company website and application
Datacoral provides connectors for the company to integrate data from the following into a data warehouse like Amazon Redshift.
- Event streams
Datacoral's UI can be used to create the SQL transformation to compute the customer health.
The SQL transformation can include the interval of data that is needed to compute the answer and a few other powerful features that give an analyst full control of how the output of the transformation is written to the warehouse. This transformation may actually be dependent on other transformations until ultimately the transformations depend on the source raw data being ingested into the warehouse by Datacoral connectors. Datacoral automatically computes the transformation dependency graph. These graphs can some times have 100s of nodes and 10s of levels.
In addition, depending on how big of a time window of data is needed for the computation, Datacoral automatally figures out how many data intervals to wait for before triggering the each of the transformations.
Proactive change verification
Given that all Datacoral transformations are written declaratively as SQL, Datacoral offers a compile step to catch problems quickly. Analysts can be assured that the changes they are making to the transformations will not break any downstream transformations.
Datacoral provides an analyst friendly way to build out pipelines using SQL. In addition when there might be some transformations that are not representable in SQL, analysts can write them in Python as User Defined Functions. These UDFs can be incorporated into SQL transformations for seamless integration of Python-based transformations into the the transformation graph.