Timelabels

What is timelabel?

Datacoral does micro-batch processing and each micro-batch has a label that is associated approximately with the data time within that batch. There are several caveats to this statement and will be discussed separately.

Timelabel is a tag (format: YYYYMMDDHHmm00) associated with operations that would result in data changes. These changes could be one of the following:

  • collect data sync to s3
  • collect data load to warehouse
  • collect data partition creation in glue
  • rotation of a timeseries table
  • materialized view refresh
  • harness (publisher) connector data sync to external systems

Timelabel does NOT reflect the wall clock time of the data change operation - instead, it represents the state of underlying data at a given time (barring the caveats). Specifically, timelabel is represented as the top of the schedule of repeated operations.

Following scenarios would provide more clarity…

  • A collect datasource loadunit with a schedule of 10 * * * * (10 mins past every hour) which get triggered at 2018-03-05 09:10 +00:00, would be represented by timelabel 20180305080000 instead of 20180305091000. So, data with timelabel 20180305080000 has data that corresponds to the hour from 2018-03-05 08:00 +00:00 UTC through 2018-03-05 08:59 +59:99 UTC. And the process starts AFTER the hour ends, i.e., 2018-03-05 09:00 +00:00 UTC.

  • A collect datasource loadunit with schedule of 10 0 * * * when backfilled from beginning of year will have timelabel corresponding to each day with values 20180101000000 for data from 2018-01-01 00:00 +00.00 UTC through 2018-01-01 23:59 +59.99 UTC. Process will be kicked off at 2018-01-02 00:10 +00.00 UTC with timelabel 20180102000000 for data from 2018-01-02 00:00 +00.00 UTC through 2018-01-02 23:59 +59.99 UTC. Process will be kicked off at 2018-01-03 00:10 +00.00 UTC and so on.

Table types and timelabels

warehouse table types and timelabels

Types of tables supported in warehouse

  • Regular - A normal table in warehouse
  • Timeseries - A collection table in warehouse created over a set of partition tables, one partition per timelabel. The collection is represented as a UNION ALL view over the set of partition tables.
    • Partition table is named _<timeseries-table>_<timelabel>
    • UNION ALL view is named <timeseries-table>_view

Timeseries tables have different conventions depending on which part of the system is creating them.

  • Events timeseries tables
    • schemaname.tablename - table with latest data that firehose writes to, which will then get archived into a partition table
    • schemaname._tablename_<timelabel> - the partition table of historical data
    • schemaname.tablename_view - the union all view.
  • Timeseries materialized views
    • schemaname.mv_<viewname> - always empty
    • schemaname.mv_<viewname>_view - UNION ALL view
    • schemaname._mv_<viewname>_<timelabel> - partition table
  • (Future) Loader timeseries tables
    • schemaname.loadunitname - always empty
    • schemaname.<loadunitname>_view - the UNION ALL view
    • schemaname._<loadunitname>_<timelabel> - partition table
Operations on different table types
  • Timeseries tables
    • ADD PARTITION - add partitions to a timeseries table (this is the same as append)
    • DROP PARTITION - drop partitions in a timeseries table
  • Regular tables support multiple operations
    • REPLACE - this is the most straight forward. You replace all the contents of the table with new content. (This is the same as snapshot)
    • INSERT - insert new rows into the same table (this is the same as incremental append)
    • UPSERT - insert new rows or update existing rows (this is the same as incremental upsert)

Glue Data Catalog table types and timelabels

All Glue Data Catalog (GDC) tables are partitionedi, i.e., they behave just like warehouse timeseries tables. Each table has the following timebased partition columns

  • y - year
  • m - month
  • d - date
  • h - hour
  • n - minute

Timelabels map to partition names in a natural way. For example, a timelabel 20180101000000 corresponds to the partition

(y = '2018', m = '01', d = '01', h = '00', n = '00')

which in turn maps to the S3 path like below:

s3://<customer-s3-bucket>/<schemaname>/<tablename>/y=YYYY/m=MM/d=DD/h=HH/n=NN/

where

  • <customer-s3-bucket> is the data bucket where all customer data gets written
  • <schemaname> is either the warehouse schema or the GDC database name
  • <tablename> is either the warehouse table name or the GDC table name

Ingest and publish connectors timelabels

As indicated in the examples above, connectors mostly try to match the data in a timelabel batch to have timestamps that correspond to the time window of the timelabel. There are several caveats that cause this to be a best effort rather than a hard constraint. Making it a hard constraint is possible via materialized views (See Close of Books) but will result in a lot more expensive queries on the database and reduced data freshness (i.e., data batches will be made available with a significant lag after waiting for all data in that batch to show up) and potentially lost data.

  1. Events connectors
    • Events may come in out-of-order
    • Batches of events are tagged with timelabels based on arrival time of events rather than on the timestamps in the events. More specifically
      • GDC table partition value is derived from the timelabel which in turn is derived from the S3 paths like yyyy/mm/dd/hh that AWS Firehose uses to stage data in S3.
      • warehouse tables that Firehose writes to are rotated periodically to include timelabel in the name. So, Datacoral renames the events table (schemaname.tablename) to a partition table (schemaname._tablename_<timelabel>) and creates a new empty events table all in a single transaction. Firehose continues to write to the events table.
    • GDC partitions dont correspond to timestamps in the events
    • warehouse partition tables are not exactly the same as the GDC table partitions
  2. API connectors
    • Most APIs allow for data to be extracted based on timestamps. In those cases, GDC partitions as well as warehouse partition tables (for timeseries tables) are equivalent.
    • In cases where warehouse tables are regular tables, timelabels are only tags on the batch processing jobs rather than tags on specific batches of data in warehouse.
    • GDC partitions are created as append-only.
  3. Database connectors
    • Database connectors can be configured to do one of REPLACE, APPEND, and MERGE operations on the final tables in the warehouse.
    • If the final tables in warehouse are regular tables, timelabels are just tags for processing steps
    • If final tables are made Timeseries Tables, then, timelabels correspond to the partition tables within the final tables
  4. Change Data Capture connectors
    • CDC connectors are configured to be MERGE operations on to the final tables in the warehouse.