MongoDB Collect Slice
Description
This datasource slice collects data from a MongoDB database and writes it to S3 and Redshift. One Redshift table is created for each MongoDB collection, with the object data stored in JSON format. On top of this table, materialized views are created to present the object data in flattened form, one materialized view per entity. All materialized views contain data corresponding to the last non-deleted (most recent) version of each object in MongoDB, so after an object is updated in MongoDB the materialized views will contain the data for the updated object.
Prerequisites
The datasource reads periodic data updates from the MongoDB oplog, so a replica MongoDB node with an `oplog.rs` collection is required, and this replica's connection should be specified in the deploy params as the data source. The deploy params should specify the source collection name, database name, host, port, auth source, and the username and password for the connection.
SUPPORTED CONFIGURATIONS
- MongoDB in replica set mode
- MongoDB Version 3.1.0 and above
- Non-sharded MongoDB
- MongoDB Documents < 64K
LOADUNITS
To map tree-like object data to the flattened form supported by Redshift, the mapping must be specified using loadunits. Each MongoDB collection should correspond to one root loadunit (a loadunit without a parent). Object properties that are lists should be represented as child loadunits. Below is sample data for a collection `food` and a sample loadunits configuration:
sample data:
sample loadunits configuration:
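The original sample blocks are not reproduced in this copy of the page. As a stand-in, here is a minimal sketch that is consistent with the column listings later in this document. The parameter names `columns`, `property_path`, `redshift_type`, `inheritedProperties`, and `data_property_path` come from this document; the `name` and `children` keys and the exact nesting of a loadunit definition are assumptions for illustration.

Sample data:

```json
{
  "_id": "5c0f1e2d3b4a596877665544",
  "date_created": "2019-01-15T10:30:00Z",
  "created_by_fullname": "Jane Doe",
  "rows": [
    {
      "id": "row-1",
      "name": "apples",
      "states": [
        { "id": "a1b2c3", "status": "active" }
      ]
    }
  ],
  "states": [
    { "status": "active" }
  ]
}
```

Sample loadunits configuration:

```json
{
  "food": {
    "inheritedProperties": { "food_id": "._id" },
    "columns": [
      { "name": "_id", "property_path": "._id", "redshift_type": "VARCHAR(256)" },
      { "name": "date_created", "property_path": ".date_created", "redshift_type": "TIMESTAMP" },
      { "name": "created_by_fullname", "property_path": ".created_by_fullname" }
    ],
    "children": {
      "food_rows": {
        "data_property_path": ".rows",
        "inheritedProperties": { "row_id": ".id" },
        "columns": [
          { "name": "id", "property_path": ".id", "redshift_type": "VARCHAR(64)" },
          { "name": "name", "property_path": ".name", "redshift_type": "VARCHAR(256)" }
        ],
        "children": {
          "food_rows_states": {
            "data_property_path": ".states",
            "columns": [
              { "name": "id", "property_path": ".id", "redshift_type": "VARCHAR(36)" },
              { "name": "status", "property_path": ".status", "redshift_type": "VARCHAR(32)" }
            ]
          }
        }
      },
      "food_states": {
        "data_property_path": ".states",
        "columns": [
          { "name": "status", "property_path": ".status", "redshift_type": "VARCHAR(32)" }
        ]
      }
    }
  }
}
```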
In this sample, the loadunit `food` is the root loadunit; it has children `food_rows` and `food_states`, and the loadunit `food_rows` has a child loadunit `food_rows_states`.
Columns
Each loadunit configuration should have a `columns` section, where the data accessors for the loadunit are specified. Example:
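The example block is missing in this copy of the page; below is a minimal sketch of a `columns` section. Only `property_path` and `redshift_type` are parameter names confirmed by this document; the `name` key for the target column is an assumption.

```json
"columns": [
  { "name": "_id", "property_path": "._id", "redshift_type": "VARCHAR(256)" },
  { "name": "date_created", "property_path": ".date_created", "redshift_type": "TIMESTAMP" }
]
```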
For each column, `property_path` specifies which property of the data item is used to populate the Redshift table. For the data example above, the path `.rows` returns the array of items to be processed. Paths can be tested using jqplay.org: put the data into the JSON field and the path into the Filter field, and the result appears on the right side.
`redshift_type` is optional and may be used to specify the table column type. If not specified, it is auto-discovered from the MongoDB collection data.
Columns may have a `transformers` property which specifies how to modify the data before storing it. Available transformers:
- `redshift_timestamp`: converts date strings to a format supported by Redshift TIMESTAMP; the property `source_format` should be used to specify the source data string format using a moment.js-compatible format string
- `truncate`: truncates the data to a specified length; use the property `max` to set the maximum data length
Column values may also be generated, for example to create unique ids for data which does not have an id; this can be used to keep the parent-child relationship for nested values. Available generators:
- `uuid4`: sets the value to a random UUID v4 value.
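Putting these pieces together, a column using transformers and one using a generator might look like the sketch below. The shape of a `transformers` entry and the `generator` key are assumptions; `redshift_timestamp`, `source_format`, `truncate`, `max`, and `uuid4` are the names described above, and the `source_format` value shown is an illustrative moment.js format string.

```json
"columns": [
  {
    "name": "date_created",
    "property_path": ".date_created",
    "redshift_type": "TIMESTAMP",
    "transformers": [
      { "name": "redshift_timestamp", "source_format": "YYYY-MM-DDTHH:mm:ssZ" }
    ]
  },
  {
    "name": "created_by_fullname",
    "property_path": ".created_by_fullname",
    "transformers": [
      { "name": "truncate", "max": 256 }
    ]
  },
  {
    "name": "id",
    "generator": "uuid4"
  }
]
```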
To keep parent-child relations in the data, a parent loadunit can pass any of its own properties to its children using `inheritedProperties`; for example, the `food` loadunit passes its property `_id` as a property named `food_id`.
The `data_property_path` parameter specifies which part of the data should be processed by the child loadunit.
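A minimal sketch of this parent-child wiring, assuming `inheritedProperties` maps a child-side column name to a parent property path and that children are listed under a `children` key (neither shape is confirmed by this document):

```json
"food": {
  "inheritedProperties": { "food_id": "._id" },
  "children": {
    "food_rows": {
      "data_property_path": ".rows"
    }
  }
}
```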
Materialized views in Redshift
Several columns are present in all Redshift materialized views:
- `txnid` - this value is the same for all objects which belong to the same version of a MongoDB document; it is actually the timestamp in microseconds of object processing (it does not match the MongoDB timestamp because MongoDB timestamps use a different format), so it can be used to check when the document was processed at AWS
- `timelabel` - corresponds to the datasource trigger event timestamp
- `mongo_doc_id` - contains the MongoDB `doc_id` value
- `modified_at` - set in materialized views, for objects which were updated in MongoDB, to the value of the last `txnid` Unix timestamp
For the sample loadunit `food` configuration above, the materialized view columns will be:
- `txnid` INT8
- `timelabel` TIMESTAMP
- `mongo_doc_id` VARCHAR(24)
- `modified_at` TIMESTAMP
- `_id` VARCHAR(256)
- `date_created` TIMESTAMP
- `created_by_fullname` VARCHAR(max)
loadunit `food_rows` columns:
- `txnid` INT8
- `timelabel` TIMESTAMP
- `mongo_doc_id` VARCHAR(24)
- `modified_at` TIMESTAMP
- `food_id` VARCHAR(256) - passed from `food._id` as `food_id`
- `id` VARCHAR(64)
- `name` VARCHAR(256)
loadunit `food_rows_states` columns:
- `txnid` INT8
- `timelabel` TIMESTAMP
- `mongo_doc_id` VARCHAR(24)
- `modified_at` TIMESTAMP
- `food_id` VARCHAR(256) - passed from `food_rows.food_id` as `food_id`
- `row_id` - passed from `food_rows.id` as `row_id`
- `id` VARCHAR(36)
- `status` VARCHAR(32)
loadunit `food_states` columns:
- `txnid` INT8
- `timelabel` TIMESTAMP
- `mongo_doc_id` VARCHAR(24)
- `modified_at` TIMESTAMP
- `food_id` VARCHAR(256) - passed from `food._id` as `food_id`
- `status` VARCHAR(32)
Input Parameters
To get a template for the MongoDB slice configuration, save the output of the `describe --input-parameters` command as follows:

```shell
datacoral collect describe --slice-type mongodb \
    --input-parameters > mongodb_parameters_file.json
```
Necessary input parameters:
- `pageSize`: on each lambda run the datasource will fetch up to `pageSize` items from MongoDB and must be able to finish processing them within the lambda timeout window. Processing speed therefore depends on the complexity of the MongoDB objects: for simple objects it is OK to set it to `1000`; for more complex objects this number should be reduced.
- `loadunits`: specifies the data processing parameters, described above
- `username`: MongoDB user name credentials
- `password`: MongoDB password
- `host`: MongoDB host
- `authSource`: MongoDB auth database (`admin` by default)
- `database`: MongoDB database
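For illustration, a filled-in parameters file might look like the sketch below. All values are placeholders, and the exact layout of the generated template is an assumption; the `loadunits` value takes the shape described in the LOADUNITS section above.

```json
{
  "pageSize": 1000,
  "username": "mongo_reader",
  "password": "<password>",
  "host": "mongo-replica.example.com",
  "authSource": "admin",
  "database": "mydb",
  "loadunits": {}
}
```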