
Build data connections

Databases are widely used to store valuable business data. To enable integration with a variety of enterprise databases, DataRobot provides a self-service JDBC product for database connectivity setup. Once configured, you can read data from production databases for model building and predictions. This allows you to quickly train and retrain models on that data, and avoids the unnecessary step of exporting data from your enterprise database to a CSV for ingestion into DataRobot. With access to more diverse data, you can build more accurate models.

Database connection terminology

Database connection configuration uses the following terminology:

  • Data store: A configured connection to a database. It has a name, a specified driver, and a JDBC URL. You can register data stores with DataRobot for ease of re-use. A data store has one connector but can have many data sources.
  • Data source: A configured connection to the backing data store (the location of data within a given endpoint). A data source specifies, via a SQL query or a selected table and schema data, which data to extract from the data store to use for modeling or predictions. A data source has one data store and one connector but can have many datasets.
  • Data driver: The software that allows the application to interact with a database; each data store is associated with either a driver or a connector (created by the administrator). The driver configuration saves the storage location in the application of the JAR file and any additional dependency files associated with the driver.
  • Connector: Like a data driver, a connector allows the application to interact with a database; each data store is associated with either a driver or a connector (created by the administrator). The connector configuration saves the storage location in the application of the JAR file and any additional dependency files associated with the connector.
  • Dataset: Data, a file or the content of a data source, at a particular point in time. A data source can produce multiple datasets; a dataset has exactly one data source.
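The terminology above implies a strict ownership hierarchy: one driver or connector per data store, one data store per data source, and one data source per dataset. A toy sketch with plain dataclasses (stand-ins only, not the client's actual classes) makes the cardinality concrete:

```python
from dataclasses import dataclass, field

# Toy stand-ins for the concepts above -- NOT the datarobot client's classes.

@dataclass
class DataStore:
    canonical_name: str
    driver_id: str                       # exactly one driver (or connector)
    data_sources: list = field(default_factory=list)  # but many data sources

@dataclass
class DataSource:
    canonical_name: str
    data_store: DataStore                # exactly one data store
    datasets: list = field(default_factory=list)      # but many datasets

store = DataStore("Demo DB", driver_id="5a6af02eb15372000117c040")
source = DataSource("airlines stats", data_store=store)
store.data_sources.append(source)
source.datasets.append("snapshot-2024-01-01")  # a dataset captures one point in time
```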

The workflow for setting up projects or prediction datasets is as follows:

  1. An administrator sets up datarobot.DataDriver to access a particular database. For any particular driver, this setup is performed once for the entire system and the resulting driver is used by all users.
  2. Users create a datarobot.DataStore which represents an interface to a particular database using that driver.
  3. Users create a datarobot.DataSource representing a particular set of data to be extracted from the data store.
  4. Users create projects and prediction datasets from a data source.

Users can manage their data stores and data sources, while administrators can manage drivers by listing, retrieving, updating, and deleting existing instances of them.

Create a driver

To create a driver, administrators must specify the following:

  • class_name: The Java class name for the driver if the type is JDBC; otherwise None.
  • canonical_name: A user-friendly name for the resulting driver, displayed in the API and the GUI.
  • files: A list of local files which contain the driver if the type is JDBC; otherwise omitted.
  • typ: The enum for the type of driver. Defaults to dr.enums.DataDriverTypes.JDBC and can also be dr.enums.DataDriverTypes.DR_DATABASE_V1.
  • database_driver: The type of native database to use for non-JDBC. For example, dr.enums.DrDatabaseV1Types.BIGQUERY.
>>> import datarobot as dr
>>> driver = dr.DataDriver.create(
...     class_name='org.postgresql.Driver',
...     canonical_name='PostgreSQL',
...     files=['/tmp/postgresql-42.2.2.jar']
... )
>>> driver
DataDriver('PostgreSQL')

Use the code below to create a non-JDBC driver:

driver = dr.DataDriver.create(
    class_name=None,
    canonical_name="BigQuery Native",
    typ=dr.enums.DataDriverTypes.DR_DATABASE_V1,
    database_driver=dr.enums.DrDatabaseV1Types.BIGQUERY,
)

To retrieve information about existing drivers, such as the driver ID for data store creation, you can use dr.DataDriver.list().

Create a data store

After an administrator has created drivers, any user can use them to create a DataStore. A data store represents either a JDBC or a non-JDBC database. When creating one, specify the following:

  • type: The type must be either dr.enums.DataStoreTypes.DR_DATABASE_V1 or dr.enums.DataStoreTypes.JDBC.
  • canonical_name: A user-friendly name to display in the API and GUI for the data store.
  • driver_id: The ID of the driver to use to connect to the database.
  • jdbc_url: The full URL specifying the database connection settings such as the database type, server address, port, and database name if the type is JDBC.
  • fields: The fields used if the type is dr.enums.DataStoreTypes.DR_DATABASE_V1. A list of dictionary entries, where each entry has an ID, a name, and a value field.
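The jdbc_url format is vendor-specific, so consult your driver's documentation for the exact shape. As a rough sketch (assuming the common host/port/database layout used by PostgreSQL-style URLs), a small helper can assemble one:

```python
def build_jdbc_url(vendor: str, host: str, port: int, database: str) -> str:
    """Assemble a JDBC URL in the common vendor://host:port/database layout.

    This layout is an assumption -- some vendors (e.g. SQL Server) use a
    different parameter syntax, so check your driver's documentation.
    """
    return f"jdbc:{vendor}://{host}:{port}/{database}"

url = build_jdbc_url("postgresql", "my.db.address.org", 5432, "perftest")
# 'jdbc:postgresql://my.db.address.org:5432/perftest'
```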

Note

You can only create data stores with drivers when using the Python client. Drivers and connectors are not interchangeable for this method. To create a data store with a connector, instead use the REST API.

>>> import datarobot as dr
>>> data_store = dr.DataStore.create(
...     data_store_type='jdbc',
...     canonical_name='Demo DB',
...     driver_id='5a6af02eb15372000117c040',
...     jdbc_url='jdbc:postgresql://my.db.address.org:5432/perftest'
... )
>>> data_store
DataStore('Demo DB')
>>> data_store.test(username='username', password='password')
{'message': 'Connection successful'}

You can create a non-JDBC data store using fields instead:

>>> fields = [
...     {
...         "id": "bq.project_id",
...         "name": "Project Id",
...         "value": "mldata-358421",
...     }
... ]
>>> data_store = dr.DataStore.create(
...     data_store_type=dr.enums.DataStoreTypes.DR_DATABASE_V1,
...     canonical_name='BigQuery Native Connection',
...     driver_id=driver_id,
...     fields=fields
... )
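Each entry in fields must carry id, name, and value keys. A quick local check (a convenience sketch, not part of the client -- the DataRobot API performs its own validation) can catch malformed entries before the API call:

```python
def check_fields(fields):
    """Verify each non-JDBC data store field entry has the expected keys.

    A local convenience sketch -- the DataRobot API performs its own validation.
    """
    required = {"id", "name", "value"}
    for entry in fields:
        missing = required - set(entry)
        if missing:
            raise ValueError(f"field entry missing keys: {sorted(missing)}")
    return fields

check_fields([{"id": "bq.project_id", "name": "Project Id", "value": "mldata-358421"}])
```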

View data stores

To view existing data stores, use the code below. Note that dr.DataStore.list() returns only JDBC connections by default; to see all data stores, explicitly pass typ=dr.enums.DataStoreListTypes.ALL.

List all JDBC data stores (default):

data_stores = dr.DataStore.list()

print(f"Found {len(data_stores)} DataStore(s):")
for ds in data_stores:
    print(f"  - {ds.canonical_name} (ID: {ds.id}, Type: {ds.data_store_type})")

List all data stores:

all_stores = dr.DataStore.list(typ=dr.enums.DataStoreListTypes.ALL)
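When both JDBC and native connections are listed, it can help to bucket them by type. A minimal sketch follows, where SimpleNamespace objects stand in for dr.DataStore instances and the type strings are illustrative:

```python
from collections import defaultdict
from types import SimpleNamespace

def group_by_type(stores):
    """Bucket data store records by their data_store_type attribute."""
    grouped = defaultdict(list)
    for store in stores:
        grouped[store.data_store_type].append(store.canonical_name)
    return dict(grouped)

# SimpleNamespace objects stand in for dr.DataStore instances here;
# the type strings are illustrative placeholders.
demo = [
    SimpleNamespace(data_store_type="jdbc", canonical_name="Demo DB"),
    SimpleNamespace(data_store_type="dr-database-v1",
                    canonical_name="BigQuery Native Connection"),
]
print(group_by_type(demo))
# {'jdbc': ['Demo DB'], 'dr-database-v1': ['BigQuery Native Connection']}
```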

Create a data source

Once you have a data store, you can query datasets via the data source. When creating a data source, first create a datarobot.DataSourceParameters object from a data store's ID and a query. Then, create the data source with the following:

  • type: The type must be either dr.enums.DataStoreTypes.DR_DATABASE_V1 or dr.enums.DataStoreTypes.JDBC.
  • canonical_name: A user-friendly name to display in the API and GUI.
  • params: The DataSourceParameters object.
>>> import datarobot as dr
>>> params = dr.DataSourceParameters(
...     data_store_id='5a8ac90b07a57a0001be501e',
...     query='SELECT * FROM airlines10mb WHERE "Year" >= 1995;'
... )
>>> data_source = dr.DataSource.create(
...     data_source_type='jdbc',
...     canonical_name='airlines stats after 1995',
...     params=params
... )
>>> data_source
DataSource('airlines stats after 1995')

You can create a non-JDBC data source by specifying a catalog, schema, and table instead of a query:

>>> params = dr.DataSourceParameters(
...     data_store_id=data_store_id,
...     catalog=catalog,
...     schema=schema,
...     table=table,
... )
>>> data_source = dr.DataSource.create(
...     data_source_type=dr.enums.DataStoreTypes.DR_DATABASE_V1,
...     canonical_name='BigQuery Data Source',
...     params=params
... )

Create projects

You can create new projects from a data source, demonstrated below.

>>> import datarobot as dr
>>> project = dr.Project.create_from_data_source(
...     data_source_id='5ae6eee9962d740dd7b86886',
...     username='username',
...     password='password'
... )

As of v3.0 of the Python API client, you can alternatively pass in the credential_id of an existing Credential object.

>>> import datarobot as dr
>>> project = dr.Project.create_from_data_source(
...     data_source_id='5ae6eee9962d740dd7b86886',
...     credential_id='9963d544d5ce3se783r12190'
... )

Alternatively, pass in credential_data, which conforms to CredentialDataSchema.

>>> import datarobot as dr
>>> s3_credential_data = {"credentialType": "s3", "awsAccessKeyId": "key123", "awsSecretAccessKey": "secret123"}
>>> project = dr.Project.create_from_data_source(
...     data_source_id='5ae6eee9962d740dd7b86886',
...     credential_data=s3_credential_data
... )
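The credential_data dictionary must conform to CredentialDataSchema. For the S3 case shown above, a small helper (a sketch using only the keys from the example, not an official client utility) keeps the payload consistent:

```python
def make_s3_credential_data(access_key_id: str, secret_access_key: str) -> dict:
    """Build an S3 credential_data payload.

    A convenience sketch using only the keys from the example above;
    CredentialDataSchema defines the full set of accepted shapes.
    """
    return {
        "credentialType": "s3",
        "awsAccessKeyId": access_key_id,
        "awsSecretAccessKey": secret_access_key,
    }

payload = make_s3_credential_data("key123", "secret123")
```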

Create prediction datasets

Given a data source, new prediction datasets can be created for any project.

>>> import datarobot as dr
>>> project = dr.Project.get('5ae6f296962d740dd7b86887')
>>> prediction_dataset = project.upload_dataset_from_data_source(
...     data_source_id='5ae6eee9962d740dd7b86886',
...     username='username',
...     password='password'
... )