MongoDB Collect Slice
Description
This datasource slice collects data from a MongoDB database and writes it to S3 and Redshift. One Redshift table is created for each MongoDB collection, with the object data stored in JSON format. On top of this table, materialized views are created to present the object data in flattened form, one materialized view per entity. All materialized views contain data corresponding to the last non-deleted (most recent) version of each object in MongoDB, so after an object is updated in MongoDB the materialized views will contain the data for the updated object.
Prerequisites
The datasource reads periodic data updates from the MongoDB oplog, so a replica MongoDB node with an `oplog.rs` collection is required, and this replica's connection should be specified in the deploy params as the data source. The deploy params should specify the source collection name, database name, host, port, auth source, and the username and password for the connection.
SUPPORTED CONFIGURATIONS
- MongoDB in replica set mode
- MongoDB Version 3.1.0 and above
- Non-sharded MongoDB
- MongoDB Documents < 64K
LOADUNITS
To map tree-like object data to the flattened form supported by Redshift, the mapping must be specified using loadunits. Each MongoDB collection should correspond to one root loadunit (a loadunit without a parent). Object properties that are lists should be represented as child loadunits. Below is sample data for a collection `food` and a sample loadunits configuration:
sample data:
sample loadunits configuration:
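The original sample blocks are not reproduced in this copy of the page. As a stand-in, here is a minimal sketch that is consistent with the column listings later in this document. The parameter names `columns`, `property_path`, `redshift_type`, `inheritedProperties`, and `data_property_path` come from this document; the `name` and `children` keys and the exact nesting of a loadunit definition are assumptions for illustration.

Sample data:

```json
{
  "_id": "5c0f1e2d3b4a596877665544",
  "date_created": "2019-01-15T10:30:00Z",
  "created_by_fullname": "Jane Doe",
  "rows": [
    {
      "id": "row-1",
      "name": "apples",
      "states": [
        { "id": "a1b2c3", "status": "active" }
      ]
    }
  ],
  "states": [
    { "status": "active" }
  ]
}
```

Sample loadunits configuration:

```json
{
  "food": {
    "inheritedProperties": { "food_id": "._id" },
    "columns": [
      { "name": "_id", "property_path": "._id", "redshift_type": "VARCHAR(256)" },
      { "name": "date_created", "property_path": ".date_created", "redshift_type": "TIMESTAMP" },
      { "name": "created_by_fullname", "property_path": ".created_by_fullname" }
    ],
    "children": {
      "food_rows": {
        "data_property_path": ".rows",
        "inheritedProperties": { "row_id": ".id" },
        "columns": [
          { "name": "id", "property_path": ".id", "redshift_type": "VARCHAR(64)" },
          { "name": "name", "property_path": ".name", "redshift_type": "VARCHAR(256)" }
        ],
        "children": {
          "food_rows_states": {
            "data_property_path": ".states",
            "columns": [
              { "name": "id", "property_path": ".id", "redshift_type": "VARCHAR(36)" },
              { "name": "status", "property_path": ".status", "redshift_type": "VARCHAR(32)" }
            ]
          }
        }
      },
      "food_states": {
        "data_property_path": ".states",
        "columns": [
          { "name": "status", "property_path": ".status", "redshift_type": "VARCHAR(32)" }
        ]
      }
    }
  }
}
```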
In this sample, the loadunit `food` is the root loadunit; it has children `food_rows` and `food_states`, and the loadunit `food_rows` has a child loadunit `food_rows_states`.
Columns
Each loadunit configuration should have a `columns` section, where the data accessors for the loadunit are specified. Example:
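The example block is missing in this copy of the page; below is a minimal sketch of a `columns` section. Only `property_path` and `redshift_type` are parameter names confirmed by this document; the `name` key for the target column is an assumption.

```json
"columns": [
  { "name": "_id", "property_path": "._id", "redshift_type": "VARCHAR(256)" },
  { "name": "date_created", "property_path": ".date_created", "redshift_type": "TIMESTAMP" }
]
```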
For each column, `property_path` specifies which property of the data item is used to populate the Redshift table. For the data example above, the path `.rows` returns the array of items to be processed. Paths can be tested using jqplay.org: put the data into the JSON field and the path into the Filter field, and the result appears on the right side.
`redshift_type` is optional and may be used to specify the table column type. If not specified, it is auto-discovered from the MongoDB collection data.
Columns may have a `transformers` property which specifies how to modify the data before storing it. Available transformers:
- `redshift_timestamp`: converts date strings to a format supported by Redshift TIMESTAMP; the property `source_format` should be used to specify the source data string format using a moment.js-compatible format string
- `truncate`: truncates the data to a specified length; use the property `max` to set the maximum data length
Column values may also be generated, for example to create unique ids for data which does not have an id; this can be used to keep the parent-child relationship for nested values. Available generators:
- `uuid4`: sets the value to a random UUID v4 value.
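Putting these pieces together, a column using transformers and one using a generator might look like the sketch below. The shape of a `transformers` entry and the `generator` key are assumptions; `redshift_timestamp`, `source_format`, `truncate`, `max`, and `uuid4` are the names described above, and the `source_format` value shown is an illustrative moment.js format string.

```json
"columns": [
  {
    "name": "date_created",
    "property_path": ".date_created",
    "redshift_type": "TIMESTAMP",
    "transformers": [
      { "name": "redshift_timestamp", "source_format": "YYYY-MM-DDTHH:mm:ssZ" }
    ]
  },
  {
    "name": "created_by_fullname",
    "property_path": ".created_by_fullname",
    "transformers": [
      { "name": "truncate", "max": 256 }
    ]
  },
  {
    "name": "id",
    "generator": "uuid4"
  }
]
```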
To keep parent-child relations in the data, a parent loadunit can pass any of its own properties to its children using `inheritedProperties`; for example, the `food` loadunit passes its property `_id` as a property named `food_id`.
The `data_property_path` parameter specifies which part of the data should be processed by the child loadunit.
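A minimal sketch of this parent-child wiring, assuming `inheritedProperties` maps a child-side column name to a parent property path and that children are listed under a `children` key (neither shape is confirmed by this document):

```json
"food": {
  "inheritedProperties": { "food_id": "._id" },
  "children": {
    "food_rows": {
      "data_property_path": ".rows"
    }
  }
}
```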
Materialized views in Redshift
Several columns are present in all Redshift materialized views:
- `txnid` - this value is the same for all objects which belong to the same version of a MongoDB document; it is actually the timestamp in microseconds of object processing (it does not match the MongoDB timestamp because MongoDB timestamps use a different format), so it can be used to check when the document was processed at AWS
- `timelabel` - corresponds to the datasource trigger event timestamp
- `mongo_doc_id` - contains the MongoDB `doc_id` value
- `modified_at` - set in materialized views, for objects which were updated in MongoDB, to the value of the last `txnid` Unix timestamp
For the sample loadunit `food` configuration above, the materialized view columns will be:
- `txnid` INT8
- `timelabel` TIMESTAMP
- `mongo_doc_id` VARCHAR(24)
- `modified_at` TIMESTAMP
- `_id` VARCHAR(256)
- `date_created` TIMESTAMP
- `created_by_fullname` VARCHAR(max)
loadunit `food_rows` columns:
- `txnid` INT8
- `timelabel` TIMESTAMP
- `mongo_doc_id` VARCHAR(24)
- `modified_at` TIMESTAMP
- `food_id` VARCHAR(256) - passed from `food._id` as `food_id`
- `id` VARCHAR(64)
- `name` VARCHAR(256)
loadunit `food_rows_states` columns:
- `txnid` INT8
- `timelabel` TIMESTAMP
- `mongo_doc_id` VARCHAR(24)
- `modified_at` TIMESTAMP
- `food_id` VARCHAR(256) - passed from `food_rows.food_id` as `food_id`
- `row_id` - passed from `food_rows.id` as `row_id`
- `id` VARCHAR(36)
- `status` VARCHAR(32)
loadunit `food_states` columns:
- `txnid` INT8
- `timelabel` TIMESTAMP
- `mongo_doc_id` VARCHAR(24)
- `modified_at` TIMESTAMP
- `food_id` VARCHAR(256) - passed from `food._id` as `food_id`
- `status` VARCHAR(32)
Input Parameters
To get a template for the MongoDB slice configuration, save the output of the `describe --input-parameters` command as follows:

```shell
datacoral collect describe --slice-type mongodb \
    --input-parameters > mongodb_parameters_file.json
```
Necessary input parameters:
- `pageSize`: on each lambda run the datasource will fetch up to `pageSize` items from MongoDB and must be able to finish processing them within the lambda timeout window. Processing speed therefore depends on the complexity of the MongoDB objects: for simple objects it is OK to set it to `1000`; for more complex objects this number should be reduced.
- `loadunits`: specifies the data processing parameters, described above
- `username`: MongoDB user name credentials
- `password`: MongoDB password
- `host`: MongoDB host
- `authSource`: MongoDB auth database (`admin` by default)
- `database`: MongoDB database
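For illustration, a filled-in parameters file might look like the sketch below. All values are placeholders, and the exact layout of the generated template is an assumption; the `loadunits` value takes the shape described in the LOADUNITS section above.

```json
{
  "pageSize": 1000,
  "username": "mongo_reader",
  "password": "<password>",
  "host": "mongo-replica.example.com",
  "authSource": "admin",
  "database": "mydb",
  "loadunits": {}
}
```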