MongoDB Collect Slice

+-----------------------------------------+
| Redshift |
| +-------------+ |
| +----------------+---> matview1 | |
| | | +-------------+ |
| | Table1 | +-------------+ |
| | +---> matview2 | |
+---------------+ | +----------------+ +-------------+ |
| | | |
| MongoDB +-----> +-------------+ |
| | | +----------------+---> matview3 | |
+---------------+ | | | +-------------+ |
| | Table2 | +-------------+ |
| | +---> matview4 | |
| +----------------+ +-------------+ |
| |
+-----------------------------------------+

Description

This datasource slice collects data from MongoDB database and writes it to S3 and Redshift. One Redshift table is created for each MongoDB collection with objects data in JSON format. On top of this table, materialized views are created to reflect object data to flattened form, one materialized view for each entity. All materialized views will contain data which correspond to last non-deleted (most recent) version of object in MongoDB, so after object update in MongoDB materialized views will contain data for updated object.

Prerequisites

Datasource reads periodic data updates from MongoDB oplog, so it is required to have a replica MongoDB node with oplog.rs collection and this replica connection should be specified in deploy params as a data source. Deploy params should specify source collection name, database name, host, port, auth source and user name and password for connection.

SUPPORTED CONFIGURATIONS

  • MongoDB in replica set mode
  • MongoDB Version 3.1.0 and above
  • Non-sharded MongoDB
  • MongoDB Documents < 64K

LOADUNITS

To map tree-like objects data to flattened form (which is supported by Redshift database) it is required to specify mapping using loadunits. Each MongoDB collection should correspond to one of root loadunits (those which do not have parents). Object properties which are lists should be present as child loadunits. Below is a sample data for collection food and sample loadunits configuration:

sample data:

{
"_id": "5beae02570dca22b2ce844f9",
"po_id": 1,
"id": 1,
"po": null,
"rows": [
{
"pack_size": 1,
"product_id": "5a89a30ec8f39fjs9ebd7267",
"id": 1,
"units": "infusion",
"states": [
{
"date_created": "2018-05-03T15:24:35.152516",
"amount_left": 23551,
"status": "open"
},
{
"date_created": "2018-05-03T15:25:04.412877",
"amount_left": 43093,
"status": "fulfilled"
}
],
"units_requested": 123876,
"unit_cost_price": 34548,
"unit_price": 415888,
"name": "Refined Frozen Hat",
"packs_requested": 219908,
"part_id": 6019205
}
],
"states": [
{
"date_created": "2018-05-03T15:24:35.149572",
"reason": "Some reason 1",
"status": "closed"
},
{
"date_created": "2018-05-03T15:25:04.413038",
"reason": "Some reason 2",
"status": "fulfilled"
}
],
"date_created": "2018-05-03T15:24:35.152377",
"invoice_id": "112233",
"created_by": {
"id": 1,
"fullname": "John Doe",
"username": "john@doe.com"
}
}

sample loadunits configuration:

{
"loadunits": {
"food": {
"columns": {
"_id": {
"property_path": "._id",
"redshift_type": "VARCHAR(256)"
},
"date_created": {
"property_path": ".date_created",
"transformers": {
"redshift_timestamp": {
"source_format": "YYYY-MM-DD"
}
},
"redshift_type": "TIMESTAMP"
},
"created_by_fullname": {
"property_path": ".created_by.fullname",
"redshift_type": "VARCHAR(max)",
"transformers": {
"truncate": {
"max": 65535
}
}
}
},
"children": {
"food_rows": {
"data_property_path": ".rows",
"inheritedProperties": [
{
"name": "food_id",
"value": "_id"
}
],
"columns": {
"id": {
"property_path": ".id",
"redshift_type": "VARCHAR(64)"
},
"name": {
"property_path": ".name",
"redshift_type": "VARCHAR(256)"
}
},
"children": {
"food_rows_states": {
"data_property_path": ".states",
"inheritedProperties": [
{
"name": "food_id",
"value": "food_id"
},
{
"name": "row_id",
"value": "id"
}
],
"columns": {
"id": {
"generator": "uuid4",
"redshift_type": "VARCHAR(36)"
},
"status": {
"property_path": ".status",
"redshift_type": "VARCHAR(32)"
}
}
}
}
},
"food_states": {
"data_property_path": ".states",
"inheritedProperties": [
{
"name": "food_id",
"value": "_id"
}
],
"columns": {
"status": {
"property_path": ".status",
"redshift_type": "VARCHAR(32)"
}
}
}
}
}
}
}

In this sample loadunit food is root loadunit, it have children food_rows and food_states, loadunit food_rows have child loadunit food_rows_states.

Columns

Each loadunit configuration should have columns section. It is where data accessors for each loadunit are specified. Example:

"columns": {
"_id": {
"property_path": "._id",
"redshift_type": "VARCHAR(256)"
},
"id": {
"generator": "uuid4",
"redshift_type": "VARCHAR(36)"
},
"date_created": {
"property_path": ".date_created",
"transformers": {
"redshift_timestamp": {
"source_format": "YYYY-MM-DD"
}
},
"redshift_type": "TIMESTAMP"
},
"created_by_fullname": {
"property_path": ".created_by.fullname",
"redshift_type": "VARCHAR(max)",
"transformers": {
"truncate": {
"max": 65535
}
}
}
}

For each column property_path specifies which property of the data item will be used to populate Redshift table. For above data example path .rows will return array of items which will be processed, paths can be tested using jqplay.org, just put data into JSON field and path into Filter, result will be present on a right side.

redshift_type is optional and may be used to specify table column type. If not specified it will be auto-discovered from MongoDB collection data.

Columns may have transformers property which specifies how to modify data before store it. Available transformers:

  • redshift_timestamp: this transformer allows to convert date strings to format supported by Redshift TIMESTAMP, property source_format should be used to specify source data string format using moment.js-compatible string format
  • truncate: this transformer should be used to truncate data to specified length, use property max to set max data length

Column values may be generated, for example to make unique ids for data which do not have id, this may be used to keep parent-child relationship for nested values. Available generators:

  • uuid4: this generator set value to random UUID v4 value.

To keep parent-child relations in data parent loadunit can pass any own properties to children using inheritedProperties, for example food loadunit passes its property _id as property named food_id. data_property_path parameter is used to specify which part of the data should be processed by children loadunit.

Materialized views in Redshift

Several columns will be present in all Redshift materialized views, they are:

  • txnid - this property will be same for all objects which belong to same version of the MongoDB document; this is actually timestamp in microseconds of object processing (it doesn't match with MongoDB timestamp because MongoDB timestamp uses different format) so can be used to check when document was processed at AWS
  • timelabel - correspond to datasource trigger event timestamp
  • mongo_doc_id - contain MongoDB doc _id value
  • modified_at - will be set in materialized views for objects which was updated in MongoDB to value of last txnid Unix timestamp

For above sample loadunit food configuration materialized view columns will be:

  • txnid INT8
  • timelabel TIMESTAMP
  • mongo_doc_id VARCHAR(24)
  • modified_at TIMESTAMP
  • _id VARCHAR(256)
  • date_created TIMESTAMP
  • created_by_fullname VARCHAR(max)

loadunit food_rows columns:

  • txnid INT8
  • timelabel TIMESTAMP
  • mongo_doc_id VARCHAR(24)
  • modified_at TIMESTAMP
  • food_id VARCHAR(256) - passed from food._id as food_id
  • id VARCHAR(64)
  • name VARCHAR(256)

loadunit food_rows_states columns:

  • txnid INT8
  • timelabel TIMESTAMP
  • mongo_doc_id VARCHAR(24)
  • modified_at TIMESTAMP
  • food_id VARCHAR(256) - passed from food_rows.food_id as food_id
  • row_id - passed from food_rows.id as row_id
  • id VARCHAR(36)
  • status VARCHAR(32)

loadunit food_states columns:

  • txnid INT8
  • timelabel TIMESTAMP
  • mongo_doc_id VARCHAR(24)
  • modified_at TIMESTAMP
  • food_id VARCHAR(256) - passed from food._id as food_id
  • status VARCHAR(32)

Input Parameters

To get a template for the MongoDB slice configuration save the output of the describe --input-parameters command as follows:

datacoral collect describe --slice-type mongodb \ --input-parameters > mongodb_parameters_file.json

Necessary input parameters:

  • pageSize: each lambda run datasource will fetch up to pageSize items from MongoDB and then should be able to finish processing within a lambda timeout timewindow. So depending of MongoDB object complexity speed may be different. For simple objects it is OK to set it to 1000, for more complex object this number should be reduced.
  • loadunits: specifies data processing parameters, described above
  • username: MongoDB user name credentials
  • password: MongoDB password
  • host: MongoDB host
  • authSource: MongoDB auth database (admin by default)
  • database: MongoDB database