SFTP Connector

Overview

The Datacoral SFTP Connector enables data flow from a SFTP server into a data warehouse, such as Redshift. You can periodically scan the SFTP server for new files and load files into your warehouse. No programming, no plumbing, no additional logic, and no scripting required.

Scenario

"There is a customer of ours who drops thier log files for us in our SFTP server. We were looking for an easy way to ingest and process that data coming in and landing in multiple folders. Thanks to Datacoral's SFTP connector, we started loading all of that data into Redshift. We could start the analysis right away."

Capabilities

  • Periodically connect to a SFTP server and fetch data in files and load them into a Redshift warehouse.
  • Use one or more "loadunits" to sync data to the warehouse, where a loadunit corresponds to a table in the warehouse
  • Replace contents of a table(s) with new data from the file(s).
  • Use these file formats:
    • Delimited, such as CSV
    • JSON
  • Automatically detect the data layout and columns from the SFTP files (except for CSV format files).

Steps to add this slice to your installation

The steps to launch your slice are:

  1. Enable SFTP access to your server. You will then add the required credentials in step 2 below when you add the SFTP connector.
  2. Specify the SFTP connector config and add the connector.

1. Enable SFTP access to your server.

Work with your system administrator and make sure the server you need data pulled from is accessible from Datacoral. Make sure SFTP access is configured correctly. We recommend you create a new username/password for access. Datacoral SFTP connector only supports password based access. SSH key based access is not supported at this moment.

2. Specify the SFTP connector config and add the connector

Login to the Datacoral webapp and click on the "Collect" section on the left. On the "Collect" page, click on the "Add Slice" button on the right-hand side of the page as shown below:

In the "Add A Collect Slice" page, you will see a search box and a list of connectors. Type SFTP in the search box to pull up the SFTP connector icon in the UI.

Click on the SFTP connector and this will pull up a page with the configuration parameters to be filled in by the user as shown below:

Most of the config parameters are straight forward, but they are all explained below:

Input Parameters for the SFTP connector

Here's the list of config parameters and what they mean. These parameters are defined at the connector level and some of them can be overridden at a loadunit level as well. We will look at the loadunits in the later part of this section:

  • Slice Name - (Required) Enter a name for this instance of the SFTP connector. E.g., mySFTPslice
  • Host - (Required) This is the SFTP server hostname. Please make sure this hostname is reachable from the Datacoral VPC.
  • Port - (Required) Please enter the SFTP port. Default is port 21.
  • User - (Required) Enter an username for the SFTP server access.
  • Password - (Required) Enter the password for the SFTP server access.
  • Uncompressed Files - (Required) Set it to false if you want to ingest gzip files. Default is true.
  • Data Format - (Required) Enter the format of the data on the server to be loaded. Supported formats are "JSON" or "CSV". Here, it applies to all loadunits, but you can also define it within a loadunit. Different loadunits can have input data in different formats. However, each loadunit can correspond to only one data format. You cannot mix different formats in the ingest folder for one loadunit. Default is CSV. If the input format is CSV, the connector requires a few additional parameters. You will see additional input params appear on the page as soon as you pick the CSV option for the Data Format.

Please provide the necessary additional inputs if the Data Format is CSV:

  • Number of lines to skip - (Optional) Number of lines to skip as part of the data load in CSV. The header line usually has the column names in a CSV file. Default value is 0.

    NOTE: Without the header line, there is no way for the SFTP connector to know the column names. So, a value of 0 here for CSV data does not work and will fail. Users have to make sure the CSV files does have the header line with the column names and set the value here ideally to 1.

  • CSV Column Delimiter - (Optional) Define the delimiter for the CSV file. Default is pipe - "|".

  • CSV Record Delimiter - (Optional) Define the record delimiter for the CSV file. Default is newline (\n).

  • Schedule - (Optional) In CRON format. Here it applies to all loadunits, but you can also define it within a loadunit. That way, different loadunits can be loaded at different time. E.g., "10 * * * ". Default is "0 0 * *" (daily)

  • Redshift COPY command options - (Optional) The connector uses the Redshift COPY command to load data into the Redshift table. Using this parameter, you can define custom options for the Redshift COPY command. Please refer to the Redshift documentation for more details on the COPY command. If you don't specify this parameter, Datacoral SFTP connector will use the default COPY options. E.g., "format as avro 'auto' DATEFORMAT 'auto' TIMEFORMAT 'auto' ACCEPTINVCHARS ACCEPTANYDATE"

  • Truncate Target Table - (Optional) This parameter lets the user define if they want to truncate the target table in the warehouse before the load. You can enter true or false. If true, the table will be truncated, and all existing data dropped before loading the new files from the SFTP server. Truncating the table makes a lot of sense in snapshot (full) loads where the entire dataset is loaded every time. So, the previously loaded dataset can be safely dropped using true for this option. The Default is false.

  • Loadunits - List of loadunits. Each loadunit corresponds to a table in the warehouse and will typically correspond to a particular dataset on the server. Datasets are typically organized in folders and is made up of one file type. There could be multiple datasets on the server to be loaded and hence multiple loadunits are defined in the configuration. These loadunits define how we load the datasets into one warehouse table per loadunit. The parameters below are defined at the loadunit level only. However, some of the connector level parameters like Data Format explained above can also be overridden at the loadunit level. You can add the loadunits one at a time from the UI. Here's how the page for adding a loadunit looks:

    • Loadunit Name - (Required) User defined name for the loadunit. This will also be the table name in the warehouse. E.g., employees.
    • File path - (Required) Full path to the file on the SFTP server. The connector currently support only one file per loadunit. E.g., /data/my_file.csv
    • Data Format - (Required) We have covered this parameter above under the connector level configuration. As explained above, if the format is CSV, you are required to add a few more parameters for the connector. You will see additional input params appear as soon as you pick the CSV option for the Data Format field as shown below:

Once you have defined all the parameters you need, click on "Add a loadunit" button in the page. You can repeat the process to add more loadunits if needed. Once all the loadunits are defined, click on the "Add Slice" button at the bottom of the page to add and deploy this instance of the SFTP connector slice.

JSON Key Mapping

You might notice a JSON Editor tab in your slice add screen. You don't need to review or edit the JSON, but it can come in handy sometimes. Here are the JSON keys corresponding to the configuration parameters for your reference:

Input ParameterJSON Key
Hosthost
Portport
Useruser
Passwordpassword
Uncompressed Filesuncompressed
Data FormatDataFormat
Scheduleschedule
Redshift COPY command optionsCopyOptions
Truncate Target TableTruncateTargetTable
Loadunitsloadunits
Loadunit NameloadunitName
File Pathpath
Number of lines to skipIgnoreHeader
CSV Column DelimiterCSVDelimiter
CSV Record DelimiterCSVRecordDelimiter

Limitations

  • SFTP connector supports only snapshot (full) loads of a single file (per loadunit) from the SFTP server. Neither incremental loads nor loading from a folder with multiple files is supported currently.
  • For JSON format files, filtering is supported for first-level attributes. Nested elements support is coming soon.

Smile

Job well done

Questions? Interested?

If you have questions or feedback, feel free to reach out at hello@datacoral.co or Request a demo