Athena Overview

The AWS Glue Data Catalog is a managed Apache Hive Metastore service provided by AWS that lets you store, annotate, and share metadata.

Each AWS account has one AWS Glue Data Catalog. It provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos, and use that metadata to query and transform the data.

Having metadata in AWS Glue Data Catalog allows you to query data in S3 using SQL via Amazon Athena, Amazon EMR (Hive and Spark SQL), and any Apache Hive Metastore-compatible application.

When an installation has the Managed Glue Organize Slice, Datacoral's Collect Slices automatically create databases and tables in the Glue Data Catalog (analogous to schemas and tables in Redshift) for each of their loadunits. These Glue tables are partitioned on time (timelabel), and new partitions are created automatically as soon as the data for a timelabel is completely synced to S3.
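As an illustrative sketch of the layout described above, the mapping from an hourly timelabel to its S3 prefix and Glue partition values could look like the following. The helper name and the timelabel format (YYYYMMDDHH) are assumptions for illustration; the exact naming scheme is Datacoral-internal.

```python
from datetime import datetime

def partition_for_timelabel(installation, env, slice_name, loadunit, timelabel):
    """Map an hourly timelabel (assumed YYYYMMDDHH) to the S3 prefix and
    Glue partition values for an events collect slice loadunit.
    Hypothetical helper mirroring the layout described in the docs."""
    ts = datetime.strptime(timelabel, "%Y%m%d%H")
    s3_prefix = (
        f"s3://{installation}.datacoral/{slice_name}_{env}/{loadunit}/"
        f"{ts:%Y}/{ts:%m}/{ts:%d}/{ts:%H}/"
    )
    # Partition keys match those registered in the Glue Data Catalog
    partition = {"y": f"{ts:%Y}", "m": f"{ts:%m}", "d": f"{ts:%d}", "h": f"{ts:%H}"}
    return s3_prefix, partition

prefix, part = partition_for_timelabel(
    "myinstallation", "prod", "events", "events", "2018032300"
)
```

For the first hour of 2018-03-23, this yields the `s3://myinstallation.datacoral/events_prod/events/2018/03/23/00/` prefix and the `y=2018,m=03,d=23,h=00` partition values shown below.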

Partitions for Events Collect Slice

Let's take the example of an events collect slice (name: events, env: prod) in an installation named myinstallation. As data is ingested on 2018-03-23, files will be created under S3 locations as below:

s3://myinstallation.datacoral/events_prod/events/2018/03/23/00/
s3://myinstallation.datacoral/events_prod/events/2018/03/23/01/
s3://myinstallation.datacoral/events_prod/events/2018/03/23/02/
s3://myinstallation.datacoral/events_prod/events/2018/03/23/03/
s3://myinstallation.datacoral/events_prod/events/2018/03/23/04/

This data will be loaded into the Redshift table events_prod.events.

In addition, after all files for the first hour have arrived in S3, a new partition, y=2018,m=03,d=23,h=00, will be created for the events table in the Glue Data Catalog (GDC) database datacoral_myinstallation_events_prod. At the end of the second hour, the next partition, y=2018,m=03,d=23,h=01, will be created, and so on.
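Datacoral registers these partitions automatically, but the equivalent manual step would be an Athena/Hive `ALTER TABLE ... ADD PARTITION` statement. A hedged sketch of building such a statement (the helper itself is hypothetical):

```python
def add_partition_ddl(database, table, partition, location):
    """Build an Athena/Hive ALTER TABLE ... ADD PARTITION statement for a
    partition like y=2018,m=03,d=23,h=00. Hypothetical helper; in a Managed
    Glue installation, Datacoral registers these partitions automatically."""
    spec = ", ".join(f"{k}='{v}'" for k, v in partition.items())
    return (
        f"ALTER TABLE {database}.{table} "
        f"ADD IF NOT EXISTS PARTITION ({spec}) "
        f"LOCATION '{location}'"
    )

ddl = add_partition_ddl(
    "datacoral_myinstallation_events_prod",
    "events",
    {"y": "2018", "m": "03", "d": "23", "h": "00"},
    "s3://myinstallation.datacoral/events_prod/events/2018/03/23/00/",
)
```

Running the generated DDL in Athena would register the first hour's partition against its S3 location.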

Partitions for Datasource Collect Slice

In the case of datasource collect slices, there is an additional partition key for minutes (n=*). So the S3 locations for a loadunit named orders with an hourly schedule in a slice named mydb will look like below:

s3://myinstallation.datacoral/mydb/orders/y=2018/m=03/d=23/h=00/n=00/
s3://myinstallation.datacoral/mydb/orders/y=2018/m=03/d=23/h=01/n=00/
s3://myinstallation.datacoral/mydb/orders/y=2018/m=03/d=23/h=02/n=00/
s3://myinstallation.datacoral/mydb/orders/y=2018/m=03/d=23/h=03/n=00/
s3://myinstallation.datacoral/mydb/orders/y=2018/m=03/d=23/h=04/n=00/

In GDC, the following metadata will be created:

  • database - datacoral_myinstallation_mydb
  • table - orders
  • partitions - y=2018,m=03,d=23,h=00,n=00; y=2018,m=03,d=23,h=01,n=00; y=2018,m=03,d=23,h=02,n=00; and so on
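The datasource layout can be sketched the same way as the events case, extended with the minute key n. Note that these prefixes use Hive-style key=value path segments. The helper and the timelabel format (YYYYMMDDHHMM) are assumptions for illustration:

```python
from datetime import datetime

def datasource_partition(installation, slice_name, loadunit, timelabel):
    """Map a timelabel (assumed YYYYMMDDHHMM) for a datasource collect slice
    loadunit to its Hive-style S3 prefix, including the minute key n.
    Hypothetical helper mirroring the layout described in the docs."""
    ts = datetime.strptime(timelabel, "%Y%m%d%H%M")
    partition = {"y": f"{ts:%Y}", "m": f"{ts:%m}", "d": f"{ts:%d}",
                 "h": f"{ts:%H}", "n": f"{ts:%M}"}
    # Hive-style key=value path segments, unlike the events slice layout
    path = "/".join(f"{k}={v}" for k, v in partition.items())
    s3_prefix = f"s3://{installation}.datacoral/{slice_name}/{loadunit}/{path}/"
    return s3_prefix, partition

prefix, part = datasource_partition("myinstallation", "mydb", "orders", "201803230100")
```

For the second hour of 2018-03-23 this produces the `.../y=2018/m=03/d=23/h=01/n=00/` prefix listed above.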

Querying Collect Slice data using Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you can start analyzing data immediately. Amazon Athena can be accessed via the AWS Management Console, an API, or a JDBC driver.

Below is a sample command to query the above data using the AWS CLI:

aws athena start-query-execution --region us-west-2 \
--query-string "select count(*) from datacoral_myinstallation_events_prod.events where h='03'" \
--result-configuration OutputLocation=s3://myinstallation.datacoral/athena/

This command returns a query-execution-id that can be used to check the status of the query as well as to retrieve its results:

aws athena get-query-results --query-execution-id <output of above command> --region us-west-2
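Programmatically, the same query can be started via Athena's StartQueryExecution API (for example through boto3's `athena.start_query_execution(**params)`). A sketch of building the request parameters that mirror the CLI command above; the helper itself is hypothetical:

```python
def athena_query_request(query, output_location, database=None):
    """Build the parameter dict for Athena's StartQueryExecution API,
    mirroring the `aws athena start-query-execution` CLI call above.
    Hypothetical helper; pass the result to boto3 as
    athena.start_query_execution(**params)."""
    params = {
        "QueryString": query,
        # Athena writes query results to this S3 location
        "ResultConfiguration": {"OutputLocation": output_location},
    }
    if database:
        # Optional default database for unqualified table names
        params["QueryExecutionContext"] = {"Database": database}
    return params

params = athena_query_request(
    "select count(*) from datacoral_myinstallation_events_prod.events where h='03'",
    "s3://myinstallation.datacoral/athena/",
)
```

The QueryExecutionId returned by the API plays the same role as the query-execution-id returned by the CLI: poll GetQueryExecution for the status, then fetch rows with GetQueryResults.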

Here are a few links on other ways to query using GDC: