GitHub Connector Overview

GitHub is a web-based version-control and collaboration platform for software developers. Delivered through a software-as-a-service (SaaS) business model, GitHub was started in 2008 and is built on Git, an open-source version-control system created to make software builds faster.

The Datacoral GitHub slice collects data from a GitHub account and loads repository statistics into a data warehouse, such as Redshift or Snowflake.

Features & Capabilities

  • Backfill: Full historical sync of all your data
  • Data Extraction Modes: snapshot, incremental with pagination
  • Data Load Modes: replace, append and merge
  • Tables and Columns selection: Select individual schemas, tables and columns for replication in the Datacoral UI
  • Data layout: Change the data type of your columns
  • Customizations: Update the configurations easily using the UI
  • Scheduling: Highly flexible scheduling system
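
The three load modes listed above can be illustrated with a small in-memory sketch. This is not Datacoral's implementation: the helper name `apply_load`, the representation of rows as dictionaries, and the `id` key column are all assumptions made for illustration. It assumes the usual meaning of each mode: replace overwrites the table, append adds the new rows, and merge upserts rows by key.

```python
from typing import Dict, List

Row = Dict[str, object]


def apply_load(existing: List[Row], incoming: List[Row], mode: str, key: str = "id") -> List[Row]:
    """Return the table contents after loading `incoming` with the given mode.

    Hypothetical helper: it only illustrates the usual meaning of each mode.
    """
    if mode == "replace":
        # The new batch overwrites whatever was in the table.
        return list(incoming)
    if mode == "append":
        # The new batch is added after the existing rows, duplicates included.
        return list(existing) + list(incoming)
    if mode == "merge":
        # Upsert: rows with a matching key are updated, new keys are inserted.
        merged = {row[key]: row for row in existing}
        for row in incoming:
            merged[row[key]] = row
        return list(merged.values())
    raise ValueError(f"unknown load mode: {mode}")


table = [{"id": 1, "state": "open"}, {"id": 2, "state": "open"}]
batch = [{"id": 2, "state": "closed"}, {"id": 3, "state": "open"}]

replaced = apply_load(table, batch, "replace")  # 2 rows: ids 2 and 3
appended = apply_load(table, batch, "append")   # 4 rows: id 2 appears twice
merged = apply_load(table, batch, "merge")      # 3 rows: id 2 is now closed
```

Merge is typically the mode to pair with incremental extraction, since re-fetched rows update in place instead of piling up as duplicates.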

Supported Loadunits

The GitHub connector automatically collects the following loadunits from the GitHub API and makes them available in your warehouse for analysis.

Loadunit | Default extract mode | Description
--- | --- | ---
clones | snapshot | Captures all the attributes for Clones which are associated with Repositories (NOTE: The auth_token should have push permission for the Repository to get this data)
collaborators | snapshot paginate | Captures all the attributes for Collaborators which are associated with Repositories (NOTE: The auth_token should have push permission for the Repository to get this data)
commits | snapshot paginate | Captures all the attributes for Commits which are associated with Repositories
contributors | snapshot paginate | Captures all the attributes for Contributors which are associated with Repositories
issues | snapshot paginate | Captures all the attributes for Issues which are associated with all the repositories associated with your account
members | snapshot paginate | Captures all the attributes for Members which are associated with your organizations
milestones | snapshot paginate | Captures all the attributes for Milestones which are associated with Repositories
organizations | snapshot paginate | Captures all the attributes for Organizations which are associated with your account
pulls | snapshot paginate | Captures all the attributes for Pulls which are associated with Repositories. Open as well as closed pulls are fetched by this loadunit
repositories | snapshot paginate | Captures all the attributes for Repositories which are associated with your account
views | snapshot | Captures all the attributes for Views which are associated with Repositories (NOTE: The auth_token should have push permission for the Repository to get this data)

Note that the repositories loadunit accepts two parameters:

  • allowedRepositories: Accepts a list of repositories to include (regex strings)
  • blockedRepositories: Accepts a list of repositories to exclude (regex strings)
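
As a rough illustration of how the two parameters could interact, the sketch below filters a list of repository names against allow and block patterns. The source does not specify the matching rules, so the following are assumptions: an empty allow list permits everything, a pattern must match the whole repository name, and blocked patterns take precedence over allowed ones. The function name `filter_repositories` and the repository names are hypothetical.

```python
import re
from typing import Iterable, List


def filter_repositories(names: Iterable[str], allowed: List[str], blocked: List[str]) -> List[str]:
    """Keep names that match an allowed pattern and no blocked pattern.

    Assumptions (not confirmed by the connector documentation): an empty
    `allowed` list allows everything, patterns must match the whole name,
    and blocked patterns take precedence.
    """
    allow = [re.compile(p) for p in allowed]
    block = [re.compile(p) for p in blocked]
    kept = []
    for name in names:
        is_allowed = not allow or any(p.fullmatch(name) for p in allow)
        is_blocked = any(p.fullmatch(name) for p in block)
        if is_allowed and not is_blocked:
            kept.append(name)
    return kept


repos = ["api-server", "api-client", "docs", "api-legacy"]
selected = filter_repositories(repos, allowed=[r"api-.*"], blocked=[r".*-legacy"])
# selected == ["api-server", "api-client"]
```

Testing patterns locally like this before saving them in the connector configuration can help catch a regex that is broader than intended.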

Connector output

Output of this connector is stored in S3 and the data warehouse you chose.

AWS S3: Data stored in AWS S3 is partitioned by date and time under s3://customer_installation.datacoral/<connector-name>

Data Warehouse: The schema name is the same as <connector-name>. The connector produces the following tables:

- schema.repositories
- schema.collaborators
- schema.contributors
- schema.milestones
- schema.commits
- schema.pulls
- schema.views
- schema.clones
- schema.issues
- schema.organizations
- schema.members
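
Because the schema name follows the connector name, the fully qualified table names can be derived mechanically. The short sketch below does this for the tables listed above; the helper name `qualified_tables` and the connector name `github_main` are hypothetical and used only for illustration.

```python
# Loadunits produced by the connector, in the order listed above.
LOADUNITS = [
    "repositories", "collaborators", "contributors", "milestones", "commits",
    "pulls", "views", "clones", "issues", "organizations", "members",
]


def qualified_tables(connector_name: str) -> list:
    """Return schema-qualified table names, assuming schema == connector name."""
    return [f"{connector_name}.{loadunit}" for loadunit in LOADUNITS]


tables = qualified_tables("github_main")  # hypothetical connector name
# tables[0] == "github_main.repositories"
```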

Got a question?

Please contact Datacoral's Support Team. We'd be more than happy to answer any of your questions.