Make batch predictions with Azure Blob storage

The DataRobot Batch Prediction API lets you score large datasets against deployed models running on a prediction server. The API also provides flexible options for file intake and output.

In this tutorial, you will learn how to use the DataRobot Python Client package (which calls the Batch Prediction API) to set up a batch prediction job. The job reads an input file for scoring from Azure Blob storage and then writes the results back to Azure. This approach also works for Azure Data Lake Storage Gen2 accounts because the underlying storage is the same.

Requirements

In order to use the code provided in this tutorial, make sure you have the following:

Create stored credentials

Running batch prediction jobs requires credentials that can read from and write to Azure Blob storage. You must provide the name of the Azure storage account and an access key.

  1. To retrieve these credentials, select the Access keys menu in the Azure portal.

  2. Click Show keys to retrieve an access key. You can use either of the keys shown (key1 or key2).

  3. Use the following code to create a new credential object within DataRobot that can be used in the batch prediction job to connect to your Azure storage account.

    import datarobot as dr

    # Assumes the DataRobot client is already configured (for example, through
    # dr.Client(...) or a drconfig.yaml file).

    AZURE_STORAGE_ACCOUNT = "YOUR AZURE STORAGE ACCOUNT NAME"
    AZURE_STORAGE_ACCESS_KEY = "AZURE STORAGE ACCOUNT ACCESS KEY"

    DR_CREDENTIAL_NAME = "Azure_{}".format(AZURE_STORAGE_ACCOUNT)

    # Create Azure-specific credentials.
    # Instead of building the connection string here, you can copy the one
    # shown below the access key in the Azure portal.
    credential = dr.Credential.create_azure(
        name=DR_CREDENTIAL_NAME,
        azure_connection_string="DefaultEndpointsProtocol=https;AccountName={};AccountKey={};".format(
            AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_ACCESS_KEY
        ),
    )

    # Look up the ID of the credential object by name.
    credential_id = None
    for cred in dr.Credential.list():
        if cred.name == DR_CREDENTIAL_NAME:
            credential_id = cred.credential_id
            break  # Stop at the first matching credential

    print(credential_id)
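
The connection string passed to the credential is plain `key=value;` text, so it can be assembled and sanity-checked before it is sent to DataRobot. The following is a minimal sketch that reuses the format from the code above; the helper names `build_connection_string` and `parse_connection_string` and the sample values are hypothetical and are not part of the DataRobot client.

```python
def build_connection_string(account_name: str, access_key: str) -> str:
    """Assemble a connection string in the format used in the credential step."""
    if not account_name or not access_key:
        raise ValueError("Both the storage account name and access key are required.")
    return (
        "DefaultEndpointsProtocol=https;"
        "AccountName={};AccountKey={};".format(account_name, access_key)
    )


def parse_connection_string(connection_string: str) -> dict:
    """Split a connection string into key/value parts for a quick sanity check."""
    parts = {}
    for segment in connection_string.split(";"):
        if not segment:
            continue  # Skip the empty piece after the trailing semicolon
        # Split on the first "=" only, because access keys often end in "=".
        key, _, value = segment.partition("=")
        parts[key] = value
    return parts


# Hypothetical account name and key, for illustration only.
example = build_connection_string("mystorageaccount", "abc123==")
print(parse_connection_string(example)["AccountName"])
```

Parsing a connection string copied from the Azure portal the same way is a quick check that the account name matches the one you expect before creating the credential.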
    

Run the prediction job

With a credential object created, you can now configure the batch prediction job as shown in the code sample below:

  • Set intake_settings and output_settings to the azure type.

  • For intake_settings and output_settings, set url to the files in Blob storage that you want to read from and write to (the output file does not need to exist already).

  • Provide the ID of the credential object that was created above.
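
The three settings above follow a fixed pattern, so they can be built by a small helper. This is a sketch only: `build_blob_url` and `azure_settings` are hypothetical names, and the account, container, file, and credential values are placeholders. The dictionary keys and the URL pattern are the ones used in this tutorial.

```python
def build_blob_url(account: str, container: str, blob_name: str) -> str:
    """Return the https URL of a blob, following the pattern used in this tutorial."""
    for label, value in (("account", account), ("container", container), ("blob_name", blob_name)):
        if not value:
            raise ValueError("{} must not be empty".format(label))
    # Drop leading slashes so the URL does not contain an empty path segment.
    return "https://{}.blob.core.windows.net/{}/{}".format(
        account, container, blob_name.lstrip("/")
    )


def azure_settings(account: str, container: str, blob_name: str, credential_id: str) -> dict:
    """Build a settings dictionary usable as intake_settings or output_settings."""
    return {
        "type": "azure",
        "url": build_blob_url(account, container, blob_name),
        "credential_id": credential_id,
    }


# Placeholder values, for illustration only.
intake = azure_settings("mystorageaccount", "scoring", "input.csv", "CREDENTIAL_ID")
output = azure_settings("mystorageaccount", "scoring", "results.csv", "CREDENTIAL_ID")
print(intake["url"])
print(output["url"])
```

Building both dictionaries from one helper keeps the intake and output URLs consistent and avoids repeating the format string.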

The code sample creates and runs the batch prediction job. Once finished, it provides the status of the job. This code also demonstrates how to configure the job to return both Prediction Explanations and passthrough columns for the scoring data.

Note

You can find the deployment ID in the sample code output of the Deployments > Predictions > Prediction API tab (with Interface set to "API Client").

DEPLOYMENT_ID = 'YOUR DEPLOYMENT ID'
AZURE_STORAGE_ACCOUNT = "YOUR AZURE STORAGE ACCOUNT NAME"
AZURE_STORAGE_CONTAINER = "YOUR AZURE STORAGE ACCOUNT CONTAINER"
AZURE_INPUT_SCORING_FILE = "YOUR INPUT SCORING FILE NAME"
AZURE_OUTPUT_RESULTS_FILE = "YOUR OUTPUT RESULTS FILE NAME"

# Set up our batch prediction job
# Input: Azure Blob Storage
# Output: Azure Blob Storage

job = dr.BatchPredictionJob.score(
   deployment=DEPLOYMENT_ID,
   intake_settings={
       'type': 'azure',
       'url': "https://{}.blob.core.windows.net/{}/{}".format(AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_CONTAINER,AZURE_INPUT_SCORING_FILE),
       "credential_id": credential_id
   },
   output_settings={
       'type': 'azure',
       'url': "https://{}.blob.core.windows.net/{}/{}".format(AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_CONTAINER,AZURE_OUTPUT_RESULTS_FILE),
       "credential_id": credential_id
   },
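   # Return up to five Prediction Explanations per row.
   # Remove this line if explanations are not required.
   max_explanations=5,

   # Copy these columns from the scoring data into the output.
   # Remove this line if passthrough columns are not required.
   passthrough_columns=['column1', 'column2']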
)

job.wait_for_completion()
job.get_status()

When the job completes successfully, you should see the output file in your Azure Blob storage container.
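
In a script, it can help to check the returned status explicitly rather than only displaying it. The sketch below assumes that `job.get_status()` returns a dictionary with a `"status"` key whose value is `"COMPLETED"` on success; that key and value are assumptions to confirm against the actual output for your DataRobot version, and `job_succeeded` is a hypothetical helper.

```python
def job_succeeded(status: dict) -> bool:
    """Return True when a status dictionary reports a completed job.

    Assumption: the dictionary has a "status" key whose value is "COMPLETED"
    on success. Confirm this against the output of job.get_status().
    """
    return str(status.get("status", "")).upper() == "COMPLETED"


# Hypothetical status dictionaries, for illustration only.
print(job_succeeded({"status": "COMPLETED"}))
print(job_succeeded({"status": "ABORTED"}))

# With a real job from the step above:
# if not job_succeeded(job.get_status()):
#     raise RuntimeError("Batch prediction job did not complete successfully")
```

Raising an error on any other status lets a scheduled script fail visibly instead of continuing without an output file.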

Updated December 16, 2024