Schedule predictions with a JDBC database¶
Making predictions manually on a daily or monthly basis is a time-consuming and cumbersome process. Batch predictions are commonly used when you need to score new records on a recurring time frame (weekly, monthly, etc.). For example, you can use batch predictions to score new leads each month to predict who will churn, or to predict each day which products a customer is likely to purchase.
This notebook outlines how to use DataRobot's Python client to schedule batch prediction jobs and write them to a JDBC database. Specifically, you will:
- Retrieve existing data stores and credential information.
- Configure prediction job specifications.
- Set up a prediction job schedule.
- Run a test prediction job and enable an automated schedule for scoring.
Before proceeding, note that this workflow requires a deployed DataRobot model object to use for scoring and an established data connection to read data and host prediction writeback. For more information about the Python client, reference the documentation.
Import libraries¶
import datarobot as dr
import pandas as pd
Connect to DataRobot¶
Read more about different options for connecting to DataRobot from the client.
# If the config file is not in the default location described in the API Quickstart guide, '~/.config/datarobot/drconfig.yaml', then you will need to call
# dr.Client(config_path='path-to-drconfig.yaml')
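If you prefer not to use a config file, you can also pass the API endpoint and token directly to the client. This is a minimal sketch with placeholder values; substitute your own endpoint and API token.
# Alternatively, connect by passing credentials directly (placeholder values shown below):
# dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")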
List data stores¶
To enable integration with a variety of enterprise databases, DataRobot provides a “self-service” JDBC product for database connectivity setup. Once configured, you can read data from production databases for model building and predictions. This allows you to quickly train and retrain models on that data while avoiding the unnecessary step of exporting data from your enterprise database to a CSV for ingestion into DataRobot. It also provides access to more diverse data, which can result in more accurate models.
Use the cell below to list all data stores tied to your DataRobot account. The alphanumeric string printed for each data store is the data store ID.
for d in dr.DataStore.list():
    print(d.id, d.canonical_name, d.params)
Retrieve credentials list¶
You can reference the DataRobot documentation for more information about managing credentials.
dr.Credential.list()
The output above returns multiple sets of credentials. The alphanumeric string included in each item of the list is the credentials ID. You can use that ID to access credentials through the API.
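For a more compact view, you can iterate over the credential list and print only the fields you need. The sketch below assumes the standard attributes exposed by the Credential object (credential_id, name, and credential_type).
# Print a compact listing of stored credentials
for c in dr.Credential.list():
    print(c.credential_id, c.name, c.credential_type)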
Specify the deployment and data connection¶
Use the snippet below to indicate the deployment you want to use (by providing the deployment ID, found on the deployment's Overview tab) and the data store to which you want to write predictions (by providing the data store ID and the corresponding credentials ID).
deployment_id = "620219bb18f7f84dec6cec59"
datastore_id = "614ca745c7fab1f23da7a632"
data_store = dr.DataStore.get(datastore_id)
credential_id = "63865454a351b56ce3cb78b3"
cred = dr.Credential.get(credential_id)
Configure intake settings¶
Use the snippet below to configure the intake settings for JDBC scoring. For more information, reference the batch predictions documentation in the Python client and the intake options documentation.
intake_settings = {
    'type': 'jdbc',
    'table': 'LENDING_CLUB_10K',
    'schema': 'TRAINING',  # optional, if supported by database
    'catalog': 'DEMO',  # optional, if supported by database
    'data_store_id': data_store.id,
    'credential_id': cred.credential_id,
}
print(intake_settings)
Configure output settings¶
Use the snippet below to configure the output settings for JDBC scoring. For more information, reference the output options documentation.
output_settings = {
"type": "jdbc",
"table": "LENDING_CLUB_10K_AA_Temp",
"schema": "SCORING", # optional, if supported by database
"catalog": "SANDBOX", # optional, if supported by database schema
"statement_type": "insert",
"create_table_if_not_exists": True,
"data_store_id": data_store.id,
"credential_id": cred.credential_id,
}
print(output_settings)
# Uncomment and use the following lines for local file export:
# output_settings={
# 'type': 'localFile',
# 'path': './predicted.csv',
# }
# print(output_settings)
Use the code below to retrieve the name of the deployment.
deployment = dr.Deployment.get(deployment_id)
deployment.label
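Set up a prediction job schedule¶
Next, define the schedule on which the prediction job will run. Each field accepts a list of values or a wildcard; with minute 59, hour 7, and day of month 1 as in the example below, the job runs at 7:59 AM on the first day of every month.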
schedule = {
"minute": [59],
"hour": [7],
"month": ["*"],
"dayOfWeek": ["*"],
"dayOfMonth": [1],
}
schedule
Configure a prediction job¶
job = {
"deployment_id": deployment_id,
"num_concurrent": 4,
"intake_settings": intake_settings,
"output_settings": output_settings,
"passthroughColumnsSet": "all",
}
Create a prediction job¶
After configuring a prediction job, use the BatchPredictionJobDefinition.create method to create a prediction job definition based on the job and schedule you configured above.
definition = dr.BatchPredictionJobDefinition.create(
enabled=True, batch_prediction_job=job, name="Monthly Prediction Job JDBC", schedule=schedule
)
definition
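Before enabling the automated schedule, you can optionally trigger a one-off run of the definition to confirm that intake, scoring, and writeback work end to end. The sketch below uses the definition's run_once method and waits for the resulting batch prediction job to finish; adjust the wait behavior to suit the size of your scoring table.
# Optional: submit a single test run of the job definition
test_job = definition.run_once()
test_job.wait_for_completion()  # blocks until the batch prediction job finishes
print(test_job.get_status())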
Lastly, the snippet below enables the prediction job to run automatically on the configured schedule.
job_run_automatically = definition.run_on_schedule(schedule)
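If you later want to review the job definitions registered for your account (for example, to confirm that a schedule is enabled), you can list them through the client. The attribute names below are a sketch based on the fields the API returns for a job definition.
# Review existing batch prediction job definitions
for d in dr.BatchPredictionJobDefinition.list():
    print(d.id, d.name, d.enabled)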