The AWS Glue Data Catalog is a managed Apache Hive Metastore service provided by AWS that lets you store, annotate, and share metadata.
Each AWS account has one AWS Glue Data Catalog. It provides a uniform repository where disparate systems can store and find metadata to keep track of data in data silos, and use that metadata to query and transform the data.
Having metadata in AWS Glue Data Catalog allows you to query data in S3 using SQL via Amazon Athena, Amazon EMR (Hive and Spark SQL), and any Apache Hive Metastore-compatible application.
When an installation has the Managed Glue Organize Slice, Datacoral's Collect Slices automatically create a database and tables in the Glue Data Catalog (similar to a schema and tables in Redshift) for each of their loadunits. These tables in Glue are partitioned on time (timelabel), and new partitions are created automatically as soon as the data for a timelabel is completely synced to S3.
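As a concrete illustration of this layout (not Datacoral's actual implementation), the mapping from an hourly timelabel to an S3 prefix and Glue partition values might be sketched as follows; the `events_partition` helper name and the prefix format are assumptions based on the example paths below:

```python
from datetime import datetime

def events_partition(installation: str, slice_name: str, env: str, timelabel: datetime):
    """Derive the S3 prefix and Glue partition values for an hourly timelabel.

    Illustrative only: the real prefix layout is managed by Datacoral.
    """
    prefix = (f"s3://{installation}.datacoral/{slice_name}_{env}/{slice_name}/"
              f"{timelabel:%Y/%m/%d/%H}/")
    partition = {"y": f"{timelabel:%Y}", "m": f"{timelabel:%m}",
                 "d": f"{timelabel:%d}", "h": f"{timelabel:%H}"}
    return prefix, partition

prefix, partition = events_partition("myinstallation", "events", "prod",
                                     datetime(2018, 3, 23, 0))
print(prefix)     # s3://myinstallation.datacoral/events_prod/events/2018/03/23/00/
print(partition)  # {'y': '2018', 'm': '03', 'd': '23', 'h': '00'}
```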
Partitions for Events Collect Slice
Let's take the example of an events collect slice (name: events, env: prod) in an installation named myinstallation. As data is ingested on 2018-03-23, files will be created under S3 locations like the following:
s3://myinstallation.datacoral/events_prod/events/2018/03/23/00/
s3://myinstallation.datacoral/events_prod/events/2018/03/23/01/
s3://myinstallation.datacoral/events_prod/events/2018/03/23/02/
s3://myinstallation.datacoral/events_prod/events/2018/03/23/03/
s3://myinstallation.datacoral/events_prod/events/2018/03/23/04/
This data will be loaded into the corresponding Redshift table.
In addition, after all files for the first hour have arrived in S3, a new partition, y=2018,m=03,d=23,h=00, will be created in the GDC events table in the datacoral_myinstallation_events_prod database. At the end of the second hour, the partition y=2018,m=03,d=23,h=01 will be created, and so on.
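Registering such a partition yourself via the AWS Glue API would look roughly like the sketch below. The `partition_input` helper is hypothetical; it only builds the `PartitionInput` structure that boto3's `glue.create_partition` expects, and the live call is commented out because it requires AWS credentials:

```python
def partition_input(y: str, m: str, d: str, h: str) -> dict:
    """Build a PartitionInput for glue.create_partition (illustrative values)."""
    location = f"s3://myinstallation.datacoral/events_prod/events/{y}/{m}/{d}/{h}/"
    return {
        "Values": [y, m, d, h],  # must match the table's partition keys, in order
        "StorageDescriptor": {"Location": location},
    }

# import boto3
# glue = boto3.client("glue", region_name="us-west-2")
# glue.create_partition(
#     DatabaseName="datacoral_myinstallation_events_prod",
#     TableName="events",
#     PartitionInput=partition_input("2018", "03", "23", "01"),
# )
```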
Partitions for Datasource Collect Slice
In the case of datasource collect slices, there is an additional partition key corresponding to minutes (n=*). So the S3 locations for a loadunit named orders with an hourly schedule in a slice named mydb will look like the following:
s3://myinstallation.datacoral/mydb/orders/y=2018/m=03/d=23/h=00/n=00/
s3://myinstallation.datacoral/mydb/orders/y=2018/m=03/d=23/h=01/n=00/
s3://myinstallation.datacoral/mydb/orders/y=2018/m=03/d=23/h=02/n=00/
s3://myinstallation.datacoral/mydb/orders/y=2018/m=03/d=23/h=03/n=00/
s3://myinstallation.datacoral/mydb/orders/y=2018/m=03/d=23/h=04/n=00/
In GDC, metadata will be created as follows:
- database - datacoral_myinstallation_mydb
- table - orders
- partitions - y=2018,m=03,d=23,h=00,n=00, y=2018,m=03,d=23,h=01,n=00, y=2018,m=03,d=23,h=02,n=00, and so on.
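The only difference from the events layout is the extra minute key; a small sketch of the Hive-style path for a datasource loadunit (again, an illustration of the layout rather than Datacoral's code):

```python
from datetime import datetime

def datasource_prefix(installation: str, slice_name: str, loadunit: str,
                      timelabel: datetime) -> str:
    # Hive-style partition path with an additional minute key, n
    return (f"s3://{installation}.datacoral/{slice_name}/{loadunit}/"
            f"y={timelabel:%Y}/m={timelabel:%m}/d={timelabel:%d}/"
            f"h={timelabel:%H}/n={timelabel:%M}/")

print(datasource_prefix("myinstallation", "mydb", "orders",
                        datetime(2018, 3, 23, 2, 0)))
# s3://myinstallation.datacoral/mydb/orders/y=2018/m=03/d=23/h=02/n=00/
```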
Querying Collect Slice data using Amazon Athena
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to set up or manage, and you can start analyzing data immediately. Amazon Athena can be accessed via the AWS Management Console, an API, or a JDBC driver.
Below is a sample command to query the above data using the AWS CLI:
aws athena start-query-execution --region us-west-2 \
  --query-string "select count(*) from datacoral_myinstallation_events_prod.events where h='03'" \
  --result-configuration OutputLocation=s3://myinstallation.datacoral/athena/
This command returns a query-execution-id that can be used to check the status of the query as well as retrieve its results:
aws athena get-query-results --query-execution-id <output of above command> --region us-west-2
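The same two CLI steps can also be driven from boto3. The sketch below assumes the query and output bucket from the example above; the `athena_params` helper is hypothetical and only assembles the arguments for `athena.start_query_execution`, with the live calls commented out since they require AWS credentials:

```python
def athena_params(query: str, output_location: str) -> dict:
    """Arguments for athena.start_query_execution (boto3)."""
    return {
        "QueryString": query,
        "ResultConfiguration": {"OutputLocation": output_location},
    }

params = athena_params(
    "select count(*) from datacoral_myinstallation_events_prod.events where h='03'",
    "s3://myinstallation.datacoral/athena/",
)

# import boto3
# athena = boto3.client("athena", region_name="us-west-2")
# qid = athena.start_query_execution(**params)["QueryExecutionId"]
# results = athena.get_query_results(QueryExecutionId=qid)
```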
Here are a few links on other ways to query using GDC: