SQL is awesome for transformations in pipelines. It is a declarative language familiar to many practitioners. It abstracts away the complexity of the actual processing while letting users specify data dependencies explicitly (the FROM clause). This combination makes it possible for a system (the database) to automatically optimize how the data is processed to produce answers. In addition, SQL is extensible through support for user-defined functions (UDFs).
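For example, SQLite's standard Python bindings show how a scalar UDF written in an ordinary language plugs into declarative SQL. This is a generic illustration of UDF extensibility, not anything Datacoral-specific:

```python
import sqlite3

# A scalar UDF written in Python and called from declarative SQL.
def domain(email):
    return email.split("@")[-1]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")
conn.executemany("INSERT INTO users VALUES (?)",
                 [("a@example.com",), ("b@test.org",)])
conn.create_function("domain", 1, domain)

for (d,) in conn.execute("SELECT domain(email) FROM users"):
    print(d)  # example.com, test.org
```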
However, SQL has several limitations. It is not fully expressive: it cannot easily represent transformations that require iteration or other complex processing (see the example below). It is also not modular: queries grow complex quickly and can become unreadable. These limitations have contributed to the growing popularity of Python and R in data science.
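To make the expressiveness gap concrete, here is a transformation that is awkward in plain SQL but trivial in a general-purpose language: an exponentially weighted moving average, where each output value depends on the previous output (i.e., iteration):

```python
# Each smoothed value depends on the previously smoothed value,
# a recurrence that plain SQL cannot easily express.
def ewma(values, alpha=0.5):
    smoothed, prev = [], None
    for v in values:
        prev = v if prev is None else alpha * v + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed

print(ewma([10, 12, 11, 15]))  # [10, 11.0, 11.0, 13.0]
```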
Datacoral offers a solution via its Batch Compute UDFs.
Datacoral Batch Compute UDFs
Datacoral provides a way to run UDFs outside the database runtime, in containers (using AWS Batch). These UDFs can be implemented as functions in any language (we are starting with Python) using arbitrary libraries. Datacoral provides a simple way to package these UDFs into containers and register them so that they become available for transformations. Users can specify the compute environment they would like their UDF to have (number of vCPUs, memory, etc.) and can provide SQL as the input to the transformation that runs inside the container. When the transformation runs, Datacoral's runtime goes through the following steps (a sketch of such a UDF follows the list):
- push down the input query to the input warehouse
- transform the query results using the UDF by spinning up a container
- write the results of the UDF into the output warehouse
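As a rough illustration, a Batch Compute UDF could look like an ordinary Python function that consumes the rows produced by the pushed-down input query and yields rows to be written to the output warehouse. The function shape and row format below are assumptions for illustration, not Datacoral's documented interface. The example sessionizes click events, a classic transformation that is cumbersome in pure SQL:

```python
# Hypothetical sketch of a Batch Compute UDF: a generator that takes
# the input query's result rows and yields output rows.
def sessionize(rows, gap_seconds=1800):
    """rows: dicts with user_id and epoch-second ts, ordered by (user_id, ts)."""
    last_ts, session_id = {}, {}
    for row in rows:
        user, ts = row["user_id"], row["ts"]
        # Start a new session after 30 minutes of inactivity.
        if user not in last_ts or ts - last_ts[user] > gap_seconds:
            session_id[user] = session_id.get(user, 0) + 1
        last_ts[user] = ts
        yield {"user_id": user, "ts": ts, "session": session_id[user]}

# Local smoke test with sample data standing in for query results:
events = [
    {"user_id": "a", "ts": 0},
    {"user_id": "a", "ts": 600},   # 10 minutes later: same session
    {"user_id": "a", "ts": 4000},  # > 30 minutes later: new session
]
print(list(sessionize(events)))
```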
Users can mix and match input and output warehouses; Datacoral currently supports AWS Athena and Amazon Redshift as either the input or the output warehouse.
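To give a feel for mixing warehouses, a transformation that reads from Athena and writes to Redshift might be described by something like the following. The keys, values, and table names here are purely hypothetical, not Datacoral's actual configuration schema:

```python
# Hypothetical transformation spec: Athena in, Redshift out.
transformation = {
    "input": {"warehouse": "athena",
              "sql": "SELECT user_id, ts FROM events ORDER BY user_id, ts"},
    "udf": {"name": "sessionize", "vcpus": 2, "memory_mb": 4096},
    "output": {"warehouse": "redshift", "table": "analytics.sessions"},
}
```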
Check out the Batch Compute Quick Start Guide to try it out.