Batch Compute Overview

SQL is a great fit for transformations in pipelines. It is a declarative language known to many practitioners: it abstracts away the complexity of the actual processing while letting users specify data dependencies explicitly (the FROM clause). This combination makes it possible for a system (the database) to automatically optimize how the data is processed to produce answers. SQL is also extensible through support for user-defined functions (UDFs).
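This combination of declarative queries and UDF extensibility can be seen in miniature with Python's built-in sqlite3 module (an illustrative sketch, not Datacoral-specific; the table and function names are invented for the example):

```python
import sqlite3

# Build a tiny in-memory table to query declaratively.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 10.0), ("a", 2.5), ("b", 4.0)])

# A user-defined function: the database calls back into our code per row.
def dollars_to_cents(amount):
    return round(amount * 100)

conn.create_function("to_cents", 1, dollars_to_cents)

# The FROM clause states the data dependency; the engine plans execution.
rows = conn.execute(
    "SELECT user, SUM(to_cents(amount)) FROM events "
    "GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('a', 1250), ('b', 400)]
```

The query stays declarative; the UDF extends what each row-level expression can compute.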

However, SQL has several limitations. It is not fully expressive: it cannot easily represent transformations that require iteration or other complex processing. It is also not modular; queries can grow complex quickly and become unreadable. These limitations have contributed to the growing popularity of Python and R in data science.
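As one example of the expressiveness gap, a fixed-point computation such as the transitive closure of an edge list is a few lines of imperative Python, while plain SQL struggles to express the iteration (recursive CTEs help, but support and readability vary). A small sketch:

```python
# Transitive closure of a tiny graph: iterate until no new pairs appear.
edges = {("a", "b"), ("b", "c"), ("c", "d")}

closure = set(edges)
while True:
    # Join the current closure with itself to find two-hop paths.
    new_pairs = {(x, w) for (x, y) in closure for (z, w) in closure if y == z}
    if new_pairs <= closure:
        break  # fixed point reached
    closure |= new_pairs

print(sorted(closure))
# [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
```

The loop's stopping condition depends on data produced by earlier iterations, which is exactly the pattern a single declarative query cannot easily capture.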

Some databases do offer the ability to build complex transformations via UDFs, but those UDFs can only be written in a few blessed languages, typically Python and JavaScript. Even within these languages, there are severe limitations on which and how many libraries are allowed, because the UDFs run inside the database runtime itself. For similar reasons, UDFs cannot use specialized compute resources (such as high-memory containers or GPUs) unless the database provides them.

Datacoral offers a solution via its Batch Compute UDFs.

Datacoral Batch Compute UDFs

Datacoral provides a way to run UDFs outside the database runtime, in containers (using AWS Batch). These UDFs can be implemented as functions in any language (we are starting with Python) using arbitrary libraries. Datacoral provides a simple way to package these UDFs into containers and register them so that they become available for transformations. Users can specify the compute environment they would like their UDF to have (number of vCPUs, memory, etc.) and provide SQL as the input to the transformation done inside the container. When the transformation runs, Datacoral's runtime goes through the following steps:

  1. Push the input query down to the input warehouse.
  2. Transform the query results with the UDF by spinning up a container.
  3. Write the UDF's results to the output warehouse.
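The three steps above can be sketched in Python. Everything here is hypothetical: the function name and CSV wiring are invented for illustration and are not Datacoral's actual interface. The shape is what matters: the container receives the pushed-down query's result rows, applies a UDF that may use any library, and emits rows destined for the output warehouse.

```python
import csv
import io

# Hypothetical Batch Compute UDF: takes result rows of the input query,
# yields transformed rows. Arbitrary libraries could be used in here.
def my_udf(rows):
    for row in rows:
        row["amount_cents"] = str(round(float(row["amount"]) * 100))
        yield row

# Step 1 (simulated): results of the pushed-down input query, as CSV.
input_csv = "user,amount\na,10.0\nb,4.0\n"

# Step 2: run the UDF over the rows inside the container.
reader = csv.DictReader(io.StringIO(input_csv))
out_rows = list(my_udf(reader))

# Step 3: serialize the results for loading into the output warehouse.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["user", "amount", "amount_cents"])
writer.writeheader()
writer.writerows(out_rows)
print(buf.getvalue())
```

Because the UDF only sees rows in and rows out, the same function works regardless of which warehouse the query was pushed down to or where the results are written.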

Users can mix and match the input and output warehouses. Datacoral currently supports AWS Athena and Redshift as either the input or output warehouses.

Check out the Batch Compute Quick Start Guide to try it out.