REST API Connector for Data Prep

User Persona: Data Prep User, Data Prep Admin, or Data Source Admin

Note

This document covers all configuration fields available during connector setup. Some fields may have already been filled out by your Administrator at an earlier step of configuration and may not be visible to you. For more information on Data Prep's connector framework, see Data Prep Connector setup. Also, your Admin may have named this connector something else in the list of Data Sources.

Configure Data Prep

This connector allows you to connect to a REST API to import a REST Resource. The following is information on the parameters used to create the connector.

General

  • Name: Name of the data source as it will appear to users in the UI.

  • Description: Description of the data source as it will appear to users in the UI.

Tip

You can use the REST API Connector to connect Data Prep to multiple sources and potentially multiple instances of the same source. Using a descriptive name can be a big help to users in identifying the appropriate data source.

Web Proxy

If you connect to your REST API source through a proxy server, these fields define the proxy details.

  • Web Proxy: 'None' if no proxy is required or 'Proxied' if connection to the REST Endpoint should be made via a proxy server. If a web proxy server is required, the following fields are required to enable a proxied connection.

  • Proxy host: The host name or IP address of the web proxy server.

  • Proxy port: The port on the proxy server to use for the connection.

  • Proxy username: The username for the proxy server.

  • Proxy password: The password for the proxy server. Leave the username and password blank for an unauthenticated proxy connection.

REST API Configuration

In this section, provide the information used to locate the REST API resource.

For examples of how to set this up, see REST API Authentication Configuration.

  • Base URL: Base URL of the REST API. The base URL must include the protocol (http/https), hostname (port number is optional) and context path.
    • Example: http(s)://api.domain.com(:port)/rest/v1
  • Resources: Multiple REST resources to be imported. Each line should contain a single REST resource configuration in name:path?query format.

    • The name is the user-visible name for the resource to be imported and is required for a REST resource configuration. This name will be presented in the Browse user-interface, for example: Account Details.

    • The path is the path to the resource and is required for a REST resource configuration. This path should start with a slash (/) and optionally has multiple segments separated by a slash (/), for example: /resource/sub-category

    • The query is an optional filtering criterion to use while retrieving the resource. The query syntax must be key=value pairs delimited by '&', for example: criteria=active&order=desc or jql=status=done.
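The name:path?query format described above can be illustrated with a short Python sketch. This is not the connector's own parser: the helper names parse_resource and build_url are hypothetical, and the way the Base URL and path are joined is an assumption made for illustration.

```python
from typing import NamedTuple, Optional


class Resource(NamedTuple):
    name: str             # user-visible name shown when browsing
    path: str             # resource path, starting with "/"
    query: Optional[str]  # optional key=value pairs joined by "&"


def parse_resource(line: str) -> Resource:
    """Split one 'name:path?query' line into its parts (hypothetical helper)."""
    name, sep, rest = line.partition(":")
    if not sep or not name.strip():
        raise ValueError("a resource needs a name followed by ':'")
    # Only the first "?" separates path and query, so "jql=status=done" stays intact.
    path, _, query = rest.partition("?")
    if not path.startswith("/"):
        raise ValueError("the path must start with '/'")
    return Resource(name.strip(), path, query or None)


def build_url(base_url: str, resource: Resource) -> str:
    """Join a Base URL and a parsed resource (assumed joining rule)."""
    url = base_url.rstrip("/") + resource.path
    return f"{url}?{resource.query}" if resource.query else url


res = parse_resource("Account Details:/resource/sub-category?criteria=active&order=desc")
print(build_url("https://api.domain.com/rest/v1", res))
```

With the example values from this section, the sketch prints https://api.domain.com/rest/v1/resource/sub-category?criteria=active&order=desc.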

REST API Authentication Configuration

In this section, provide the information used to authenticate to the REST API service endpoint.

  • Authentication Type: Select one of the options based on your requirement.

    • No Auth: if the REST API doesn't require any authentication.

    • Basic Auth: if the REST API allows authentication with Username and Password.

    • Bearer Token: if the REST API allows authentication through a Bearer Token. Each web service may provide or generate tokens differently; the web service’s documentation should explain how to obtain one.

  • Username and Password: If Authentication Type is set to Basic Auth, these fields are used for authentication. Most web services require both fields, but some require only one, so the configuration page allows either or both to be blank. Leaving a required field blank may cause an error while authenticating to the data source, but will not cause form validation errors when saving the Data Source.

  • Bearer Token: If Authentication Type is set to Bearer Token, this field must be provided for authentication. You must know how to obtain this token, as every system handles this differently. Obtaining the token may also require Administrator help.
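For readers unfamiliar with how these authentication types appear on the wire, the following Python sketch builds the standard HTTP Authorization header for each option. The function name auth_headers is hypothetical, and the sketch shows the general HTTP conventions rather than the connector's internal code.

```python
import base64
from typing import Dict


def auth_headers(auth_type: str, username: str = "", password: str = "",
                 token: str = "") -> Dict[str, str]:
    """Return the request headers for one Authentication Type (hypothetical helper)."""
    if auth_type == "No Auth":
        return {}
    if auth_type == "Basic Auth":
        # HTTP Basic: base64 of "username:password"; either part may be empty.
        raw = f"{username}:{password}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    if auth_type == "Bearer Token":
        return {"Authorization": f"Bearer {token}"}
    raise ValueError(f"unknown authentication type: {auth_type}")


print(auth_headers("Basic Auth", username="user", password="pass"))
print(auth_headers("Bearer Token", token="example-token"))
```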

REST API Test Connection & Operation Configuration

  • Test Connection & Operation Method: The HTTP method used in a request to determine if the Data Prep connector can connect to the REST API service and what method will be used when the Connector requests a resource. Selecting "Automatic" will try HEAD, GET and POST to test the connection and is the best option if you're unsure which method to select.

  • The method that succeeds in the test is also used for the actual import: if HEAD or GET succeeds, GET is used for the import; if POST succeeds, POST is used.
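The "Automatic" behavior described above can be sketched as follows. This is a minimal illustration, not the connector's implementation: choose_import_method is a hypothetical name, and probe stands in for whatever HTTP request the connector actually sends.

```python
from typing import Callable, Optional


def choose_import_method(probe: Callable[[str], bool]) -> Optional[str]:
    """Try HEAD, GET, then POST and return the method to use for import.

    `probe` is a hypothetical callable that returns True when a test
    request with the given HTTP method succeeds.
    """
    for method in ("HEAD", "GET", "POST"):
        if probe(method):
            # A successful HEAD or GET means GET is used for the import.
            return "POST" if method == "POST" else "GET"
    return None  # no method succeeded, so the connection test fails


# Example: a service that only answers POST requests.
print(choose_import_method(lambda method: method == "POST"))
```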

  • Connection Timeout: Timeout (in milliseconds) for connecting to REST API.

Data Import Information

Via Browsing

The connector presents the Resources in the import workflow as importable datasets, using the Resource Name as defined in the Resource list.

Via SQL Query

Not Supported.

Technical Specs

Pagination

  • This Connector supports RFC 5988 pagination of REST datasets: https://tools.ietf.org/html/rfc5988

  • For paginated REST responses, each paginated response contains HTTP Headers that identify the URL for the next page of results.

  • When a paginated dataset is requested, the REST Connector will automatically identify that the dataset is paginated and follow data links.

    • Automatically extract the HTTP link for the next page of data.

    • Return the results from the current page of results.

    • Execute a call to obtain the next page of results.

  • During Import using the Data Prep UI, the Preview presents only 1 page of data values. This allows the Preview to load quickly and reduces hits against rate-limited APIs.

  • During Import, the Connector will:

    • Automatically extract the HTTP link for the next page of data

    • Return the results from the current page of results

    • Execute a call to obtain the next page of results.
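The three steps above (extract the next link, return the current page, request the next page) can be sketched in Python. This is an illustration under stated assumptions, not the connector's code: next_link and iter_records are hypothetical names, the Link header parsing is simplified, and FAKE_PAGES stands in for real HTTP calls to a made-up api.example.com service.

```python
import re
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Page = Tuple[List[dict], Dict[str, str]]  # (records, response headers)


def next_link(link_header: Optional[str]) -> Optional[str]:
    """Return the URL marked rel="next" in an RFC 5988 Link header, if any."""
    if not link_header:
        return None
    # Simplified parsing: assumes no commas inside the link parameters.
    for url, params in re.findall(r"<([^>]+)>([^,]*)", link_header):
        rel = re.search(r'rel\s*=\s*"?([^";]+)"?', params)
        if rel and "next" in rel.group(1).split():
            return url
    return None


def iter_records(first_url: str, fetch: Callable[[str], Page]) -> Iterator[dict]:
    """Yield records page by page, following rel="next" links."""
    url: Optional[str] = first_url
    while url:
        records, headers = fetch(url)
        following = next_link(headers.get("Link"))  # extract the next-page link
        yield from records                          # return the current page
        url = following                             # then request the next page


# Stand-in for real HTTP responses (hypothetical service and data).
FAKE_PAGES: Dict[str, Page] = {
    "https://api.example.com/items": (
        [{"id": 1}, {"id": 2}],
        {"Link": '<https://api.example.com/items?page=2>; rel="next", '
                 '<https://api.example.com/items?page=2>; rel="last"'},
    ),
    "https://api.example.com/items?page=2": ([{"id": 3}], {}),
}

all_records = list(iter_records("https://api.example.com/items", FAKE_PAGES.__getitem__))
print(all_records)
```

The loop stops when a response carries no rel="next" link, which is how the last page is recognized in this style of pagination.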

Performance

The performance of the REST Connector depends heavily on the implementation of the REST API that it leverages.

  • Best performance is found for REST APIs that support returning an entire dataset per REST API invocation. This is typical of APIs that leverage chunked transfer encoding. In this scenario, the REST Connector executes a single API call to obtain a full dataset.

  • REST APIs that leverage pagination reduce performance by requiring additional REST API calls.

    • Pagination style: RFC 5988

    • Each response contains N records and an HTTP Header containing a URL that points to the next batch.

  • Review your REST API documentation to identify the maximum page size that you can configure in order to reduce the number of API calls.

  • Example: GitHub REST API

  • APIs can have rate limitations. When importing large datasets that are paginated, it is not uncommon to run into limitations on the number of REST calls made within a window of time. For example, GitHub allows for 5000 requests per hour and Google Drive allows for 1000 requests per 100 seconds.
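To see why page size matters under a rate limit, the following Python sketch estimates the number of calls and the minimum elapsed time for a paginated import. The helper names are hypothetical and the dataset size of 2,000,000 records is an illustrative figure; the 5000 requests per hour limit is the GitHub figure quoted above.

```python
import math


def calls_needed(total_records: int, page_size: int) -> int:
    """Number of API calls needed to page through a dataset."""
    return math.ceil(total_records / page_size)


def hours_at_limit(calls: int, calls_per_hour: int) -> float:
    """Lower bound on elapsed hours when the API caps calls per hour."""
    return calls / calls_per_hour


TOTAL_RECORDS = 2_000_000  # illustrative dataset size, not a measured value
for page_size in (30, 100):
    calls = calls_needed(TOTAL_RECORDS, page_size)
    hours = hours_at_limit(calls, 5000)
    print(f"page size {page_size}: {calls} calls, at least {hours:.1f} hours")
```

Under these assumptions, raising the page size from 30 to 100 cuts the import from 66667 calls (at least 13.3 hours) to 20000 calls (at least 4.0 hours).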

FAQ/Troubleshooting/Common Issues

Is OAuth authentication supported?

Not at this time. Currently, only username/password and token authentication methods are supported. Many data sources allow only OAuth authentication, and those sources are unsupported at this time. Please contact your Data Prep Client Success representative if you find that you are unable to connect to a data source for this reason.

What do the “Test Connection” messages mean?

  • Test Connection verifies that each entry in the Resource List matches the expected format.

  • Failure of an entry to match the expected format results in an error indicating the identified format issue and the entry number.

  • Failure to use a unique Dataset Name for each Resource entry results in a format validation failure.

  • After testing confirms proper formatting of resources, only the first entry on the Resource List is used to verify the connectivity is configured correctly.
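The format checks described above can be approximated with the Python sketch below. The function name validate_resources and the error messages are hypothetical; the connector's actual messages and validation rules may differ.

```python
from typing import List


def validate_resources(lines: List[str]) -> List[str]:
    """Return one message per format problem, citing the entry number."""
    errors: List[str] = []
    seen = set()
    for number, line in enumerate(lines, start=1):
        name, sep, rest = line.partition(":")
        name = name.strip()
        if not sep or not name:
            errors.append(f"entry {number}: missing name")
            continue
        if not rest.startswith("/"):
            errors.append(f"entry {number}: path must start with '/'")
        if name in seen:
            errors.append(f"entry {number}: duplicate name '{name}'")
        seen.add(name)
    return errors


entries = [
    "Mozilla Repos:/search/repositories?q=mozilla",
    "Square Repos:orgs/square/repos",   # path lacks the leading slash
    "Mozilla Repos:/organizations",     # name repeats entry 1
]
for message in validate_resources(entries):
    print(message)
```

Only when a list like this produces no errors would the first entry be used for the connectivity check.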

Example Configurations

The following are real-world examples of how the REST API Connector has been used. Please feel free to use any of these in your account, but please be advised that companies may change their APIs at any time, sometimes without notice, and these are not fully supported data sources. This means that Data Prep will be unable to help you troubleshoot any issues you may have with these configurations and they may become out of date.

Simple Learning Example

There are many simple, unauthenticated REST API resources on the web that were created for the express purpose of learning, testing, and prototyping. One of those sources is JSONPlaceholder. This example may be oversimplified, but it is intended to demonstrate the building blocks of connecting to a RESTful web service using the REST API Connector.

There are no rate limits posted and no pagination as the datasets are small.

CONFIGURATION

After clicking “Test Data Source” to confirm your setup is working and clicking “Save”, you can use this Data Source to import data into Data Prep.

GitHub Example

GitHub is a cloud-based software source code repository that provides a rich, but rate-limited REST API. GitHub REST API Documentation. (Log in to GitHub before clicking this link.)

RATE LIMITS
  • GitHub API rate limit reference: https://developer.github.com/v3/rate_limit/ (Log in to GitHub before clicking this link.)

  • GitHub allows for 5000 requests per hour; specific limits vary per service.

  • Rate limits vary for unauthenticated and authenticated users.

PAGINATION
  • GitHub supports RFC 5988 pagination of REST datasets.

  • For paginated REST responses, the user receives one page of data (30 entries for /search API) per API call.

  • Users can override the number of results per page up to 100 using the "per_page" API parameter.

  • Setting per-page result count to the maximum allowable setting will enable higher throughput of data import by reducing the number of REST API calls.

CONFIGURATION
  • Base URL: https://api.github.com (Log in to GitHub before clicking this link.)

  • Resources:

    • Mozilla Repos: retrieve a list of software repositories that match a search for "mozilla".
      • Note: This query will exhaust a user's quota for the /search API when using the default 30 records per call.
      • Expected result count > 6600
      • Mozilla Repos:/search/repositories?q=mozilla
    • Mozilla Repos Page 33+: example of performing a search starting at page 33 of results.
      • Mozilla Repos Page 33:/search/repositories?q=mozilla&page=33
    • Square Repos: retrieve a list of the repositories that belong to the Square organization.
      • Square Repos:/orgs/square/repos
    • Organizations: retrieve the paginated list of all GitHub organizations using 100 records per request. WARNING: This will run for a long time to pull 2+ million entries.
      • Organizations:/organizations?per_page=100

Jira Example

Jira is a Project and Issue tracking software typically used by software development teams. Jira REST API Documentation.

RATE LIMIT, PAGINATION, AND SETUP
  • Rate limits will vary by subscription level.

  • Jira REST API is limited to return at most 100 results per page of data.

  • Jira does not support RFC 5988 pagination.

  • Jira Cloud instances may require users to create a Jira REST API token.

    • Create a token: https://confluence.atlassian.com/cloud/api-tokens-938839638.html

    • Authentication type: Basic Authentication.

    • Username in username field.

    • API Token in password field.

CONFIGURATION
  • Base URL:

    • On-Premise: https://(hostname):(port)/rest/api/

    • Jira Cloud: https://(your-domain).atlassian.net/rest/api/

  • Resources:

    • All Project List:
      • All Projects:/project
    • Example JQL resource: run a Jira JQL query to retrieve 200 To Do task items.
      • The JQL query must be URL encoded before pasting it into the configuration.
      • Connector To Do Tasks:/search?jql=Project%3D_yourProject_%20and%20statusCategory%3D%22To%20Do%22&maxResults=200
  • Authentication: Basic (username/password)
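The URL encoding of the JQL query in the example above can be reproduced with Python's standard library. This is a small sketch for preparing the resource entry by hand; _yourProject_ is the placeholder from the example and should be replaced with a real project key.

```python
from urllib.parse import quote

# The JQL query in readable form; _yourProject_ is a placeholder.
jql = 'Project=_yourProject_ and statusCategory="To Do"'

# safe="" percent-encodes every reserved character, including "=" and spaces.
encoded = quote(jql, safe="")

resource = f"Connector To Do Tasks:/search?jql={encoded}&maxResults=200"
print(resource)
```

The printed line matches the Connector To Do Tasks entry shown in the configuration above.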

PAGINATION FOR JIRA

Jira does not support RFC 5988 pagination. To page through Jira results:

  • Define Data Source Resource entries that specify pages of data:

    • Use "maxResults=100" to maximize the number of entries per REST call.

    • Use "startAt=N" to specify the starting point. N starts at 0.

    • Example: 4 pages of 100 search results

      • All Issues 0:/search?jql=&startAt=0&maxResults=100
      • All Issues 1:/search?jql=&startAt=100&maxResults=100
      • All Issues 2:/search?jql=&startAt=200&maxResults=100
      • All Issues 3:/search?jql=&startAt=300&maxResults=100
  • Use Data Prep's Wildcard feature to select all pages of data:

    • Wildcard pattern = "All Issues*"

  • After JSON parsing, which flattened the JSON and duplicated some rows to account for subtasks, we obtained 557 rows × 298 columns of data.
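Writing one Resource entry per page by hand becomes tedious for larger result sets. The Python sketch below generates the entries in the pattern shown above so they can be pasted into the Resources field; the function name jira_page_entries is hypothetical.

```python
from typing import List


def jira_page_entries(name: str, pages: int, page_size: int = 100,
                      jql: str = "") -> List[str]:
    """Build one 'name N:/search?...' Resource entry per page of results."""
    return [
        f"{name} {page}:/search?jql={jql}"
        f"&startAt={page * page_size}&maxResults={page_size}"
        for page in range(pages)
    ]


entries = jira_page_entries("All Issues", pages=4)
print("\n".join(entries))
```

Each generated name begins with "All Issues", so the "All Issues*" wildcard pattern selects every page.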

US Census Data Example

The Census data website is not a REST API, but we can use our REST API Connector to retrieve data over HTTP.

ACS_2002_Midwest:/acs2002/2007_prod_release1/BaseTablesSubjectTables/Region/MidwestRegionBaseTables02000US2.csv

ACS_2002_US_OH_Franklin:/acs2002/2007_prod_release1/BaseTablesSubjectTables/States/Ohio/StateCounty/FranklinCountyOhio/BaseTables05000US39049.csv

ACS_2002_Base_California:/acs2002/2007_prod_release1/BaseTablesSubjectTables/States/California/CaliforniaBaseTables04000US06.csv

  • Authentication: None
  • Use Data Prep's Wildcard feature to select all pages of "ACS 2002 Base" data. Wildcard patterns:

    • "ACS 2002 Base A*": Import the Alabama, Alaska, Arkansas, and Arizona files as 1 dataset.

    • "ACS 2002 Base*": Import all matching "ACS 2002 Base" files as 1 dataset.

Updated October 28, 2021