Make batch predictions with Azure Blob storage¶
The DataRobot Batch Prediction API allows you to take in large datasets and score them against deployed models running on a prediction server. The API also provides flexible options for the intake and output of these files.
In this tutorial, you will learn how to use the DataRobot Python Client package (which calls the Batch Prediction API) to set up a batch prediction job. The job reads an input file for scoring from Azure Blob storage and then writes the results back to Azure. This approach also works for Azure Data Lake Storage Gen2 accounts because the underlying storage is the same.
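The snippets that follow assume the DataRobot Python client is already connected to your DataRobot instance. A minimal connection sketch, with a placeholder endpoint and API token (replace both with your own values):

import datarobot as dr

# Connect the client to DataRobot; the endpoint and token below are placeholders.
dr.Client(
    endpoint="https://app.datarobot.com/api/v2",
    token="YOUR API TOKEN"
)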
Requirements¶
In order to use the code provided in this tutorial, make sure you have the following:
- Python 2.7 or 3.4+
- The DataRobot Python client package, version 2.21.0 or later (available from PyPI or conda)
- A DataRobot deployment
- An Azure storage account
- An Azure storage container
- A scoring dataset in the storage container to use with your DataRobot deployment
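If the scoring dataset is not yet in the container, one way to upload it is with the azure-storage-blob package. The sketch below is an illustration only; the local file name scoring_data.csv, the container name, and the connection string values are placeholders:

from azure.storage.blob import BlobServiceClient

# Placeholder connection string; substitute your storage account name and access key.
connection_string = (
    "DefaultEndpointsProtocol=https;"
    "AccountName=YOUR_STORAGE_ACCOUNT;"
    "AccountKey=YOUR_ACCESS_KEY;"
    "EndpointSuffix=core.windows.net"
)

service = BlobServiceClient.from_connection_string(connection_string)
blob = service.get_blob_client(container="YOUR_CONTAINER", blob="scoring_data.csv")

# Upload the local scoring file, overwriting any blob that already has this name.
with open("scoring_data.csv", "rb") as data:
    blob.upload_blob(data, overwrite=True)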
Create stored credentials¶
Running batch prediction jobs requires the appropriate credentials to read and write to Azure Blob storage. You must provide the name of the Azure storage account and an access key.
- To retrieve these credentials, select the Access keys menu in the Azure portal.
- Click Show keys to retrieve an access key. You can use either of the keys shown (key1 or key2).
- Use the following code to create a new credential object within DataRobot that the batch prediction job can use to connect to your Azure storage account.
AZURE_STORAGE_ACCOUNT = "YOUR AZURE STORAGE ACCOUNT NAME" AZURE_STORAGE_ACCESS_KEY = "AZURE STORAGE ACCOUNT ACCESS KEY" DR_CREDENTIAL_NAME = "Azure_{}".format(AZURE_STORAGE_ACCOUNT) # Create Azure-specific credentials # You can also copy the connection string, which is found below the access key in Azure. credential = dr.Credential.create_azure( name=DR_CREDENTIAL_NAME, azure_connection_string="DefaultEndpointsProtocol=https;AccountName={};AccountKey={};".format(AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_ACCESS_KEY) ) # Use this code to look up the ID of the credential object created. credential_id = None for cred in dr.Credential.list(): if cred.name == DR_CREDENTIAL_NAME: credential_id = cred.credential_id break print(credential_id)
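Before registering the credential, you can optionally confirm that the connection string works by listing a few blobs in your container with the azure-storage-blob package. This is only a sanity-check sketch; the container name is a placeholder and the package is not required by the rest of the tutorial:

from azure.storage.blob import BlobServiceClient

# Reuse the same connection string to list blobs in the container as a quick check.
service = BlobServiceClient.from_connection_string(
    "DefaultEndpointsProtocol=https;AccountName={};AccountKey={};".format(
        AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_ACCESS_KEY
    )
)
container_client = service.get_container_client("YOUR CONTAINER NAME")
for blob in container_client.list_blobs():
    print(blob.name)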
Run the prediction job¶
With a credential object created, you can now configure the batch prediction job as shown in the code sample below:
- Set intake_settings and output_settings to the azure type.
- For intake_settings and output_settings, set url to the files in Blob storage that you want to read from and write to (the output file does not need to exist already).
- Provide the ID of the credential object that was created above.
The code sample creates and runs the batch prediction job. Once finished, it provides the status of the job. This code also demonstrates how to configure the job to return both Prediction Explanations and passthrough columns for the scoring data.
Note
You can find the deployment ID in the sample code output of the Deployments > Predictions > Prediction API tab (with Interface set to "API Client").
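If you prefer to look up the deployment ID programmatically, you can list deployments with the Python client and match on the label. A short sketch, assuming a hypothetical deployment label of "My Deployment":

import datarobot as dr

# Find a deployment ID by its label; "My Deployment" is a placeholder label.
DEPLOYMENT_LABEL = "My Deployment"
deployment_id = None
for deployment in dr.Deployment.list():
    if deployment.label == DEPLOYMENT_LABEL:
        deployment_id = deployment.id
        break

print(deployment_id)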
DEPLOYMENT_ID = 'YOUR DEPLOYMENT ID'
AZURE_STORAGE_ACCOUNT = "YOUR AZURE STORAGE ACCOUNT NAME"
AZURE_STORAGE_CONTAINER = "YOUR AZURE STORAGE ACCOUNT CONTAINER"
AZURE_INPUT_SCORING_FILE = "YOUR INPUT SCORING FILE NAME"
AZURE_OUTPUT_RESULTS_FILE = "YOUR OUTPUT RESULTS FILE NAME"
# Set up our batch prediction job
# Input: Azure Blob Storage
# Output: Azure Blob Storage
job = dr.BatchPredictionJob.score(
    deployment=DEPLOYMENT_ID,
    intake_settings={
        'type': 'azure',
        'url': "https://{}.blob.core.windows.net/{}/{}".format(
            AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_CONTAINER, AZURE_INPUT_SCORING_FILE),
        'credential_id': credential_id
    },
    output_settings={
        'type': 'azure',
        'url': "https://{}.blob.core.windows.net/{}/{}".format(
            AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_CONTAINER, AZURE_OUTPUT_RESULTS_FILE),
        'credential_id': credential_id
    },
    # Optional: request Prediction Explanations; remove this argument if not needed
    max_explanations=5,
    # Optional: include passthrough columns from the scoring data
    passthrough_columns=['column1', 'column2']
)
job.wait_for_completion()
job.get_status()
When the job completes successfully, you should see the output file in your Azure Blob storage container.
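If you want to inspect the results without leaving Python, one option is to download the output file and load it into pandas. A minimal sketch, assuming the azure-storage-blob and pandas packages are installed and reusing the placeholder names defined above:

import io

import pandas as pd
from azure.storage.blob import BlobServiceClient

# Download the scored results from Blob storage and read them into a DataFrame.
service = BlobServiceClient.from_connection_string(
    "DefaultEndpointsProtocol=https;AccountName={};AccountKey={};".format(
        AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_ACCESS_KEY
    )
)
blob = service.get_blob_client(
    container=AZURE_STORAGE_CONTAINER, blob=AZURE_OUTPUT_RESULTS_FILE
)
results = pd.read_csv(io.BytesIO(blob.download_blob().readall()))
print(results.head())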