The DataRobot Batch Prediction API allows you to take in large datasets and score them against deployed models running on a prediction server. The API also provides flexible options for the intake and output of these files.
In this tutorial, you will learn how to use the DataRobot Python Client package (which calls the Batch Prediction API) to set up a batch prediction job. The job reads an input file for scoring from Azure Blob storage and then writes the results back to Azure. This approach also works for Azure Data Lake Storage Gen2 accounts because the underlying storage is the same.
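Before you create credentials or prediction jobs, connect the Python client to DataRobot. The snippet below is a minimal sketch; the endpoint URL and API token shown are placeholders that you must replace with values for your own DataRobot instance.

import datarobot as dr

# Connect the client to your DataRobot instance.
# Replace the endpoint and token with your own values.
dr.Client(
    endpoint="https://app.datarobot.com/api/v2",
    token="YOUR API TOKEN",
)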
Running batch prediction jobs requires the appropriate credentials to read and write to Azure Blob storage. You must provide the name of the Azure storage account and an access key.
To retrieve these credentials, open your storage account in the Azure portal and select the Access keys menu.
Click Show keys to retrieve an access key. You can use either of the keys shown (key1 or key2).
Use the following code to create a new credential object within DataRobot that can be used in the batch prediction job to connect to your Azure storage account.
AZURE_STORAGE_ACCOUNT="YOUR AZURE STORAGE ACCOUNT NAME"AZURE_STORAGE_ACCESS_KEY="AZURE STORAGE ACCOUNT ACCESS KEY"DR_CREDENTIAL_NAME="Azure_{}".format(AZURE_STORAGE_ACCOUNT)# Create Azure-specific credentials# You can also copy the connection string, which is found below the access key in Azure.credential=dr.Credential.create_azure(name=DR_CREDENTIAL_NAME,azure_connection_string="DefaultEndpointsProtocol=https;AccountName={};AccountKey={};".format(AZURE_STORAGE_ACCOUNT,AZURE_STORAGE_ACCESS_KEY))# Use this code to look up the ID of the credential object created.credential_id=Noneforcredindr.Credential.list():ifcred.name==DR_CREDENTIAL_NAME:credential_id=cred.credential_idbreakprint(credential_id)
With a credential object created, you can now configure the batch prediction job as shown in the code sample below:
Set intake_settings and output_settings to the azure type.
For intake_settings and output_settings, set url to the Blob storage files you want to read from and write to (the output file does not need to exist in advance).
Provide the ID of the credential object that was created above.
The code sample creates and runs the batch prediction job. Once finished, it provides the status of the job. This code also demonstrates how to configure the job to return both Prediction Explanations and passthrough columns for the scoring data.
DEPLOYMENT_ID = 'YOUR DEPLOYMENT ID'
AZURE_STORAGE_ACCOUNT = "YOUR AZURE STORAGE ACCOUNT NAME"
AZURE_STORAGE_CONTAINER = "YOUR AZURE STORAGE ACCOUNT CONTAINER"
AZURE_INPUT_SCORING_FILE = "YOUR INPUT SCORING FILE NAME"
AZURE_OUTPUT_RESULTS_FILE = "YOUR OUTPUT RESULTS FILE NAME"

# Set up our batch prediction job
# Input: Azure Blob Storage
# Output: Azure Blob Storage
job = dr.BatchPredictionJob.score(
    deployment=DEPLOYMENT_ID,
    intake_settings={
        'type': 'azure',
        'url': "https://{}.blob.core.windows.net/{}/{}".format(
            AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_CONTAINER, AZURE_INPUT_SCORING_FILE
        ),
        'credential_id': credential_id,
    },
    output_settings={
        'type': 'azure',
        'url': "https://{}.blob.core.windows.net/{}/{}".format(
            AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_CONTAINER, AZURE_OUTPUT_RESULTS_FILE
        ),
        'credential_id': credential_id,
    },
    # If Prediction Explanations are required, set max_explanations
    # (remove this argument if explanations are not needed)
    max_explanations=5,
    # If passthrough columns are required, list them here
    passthrough_columns=['column1', 'column2'],
)

job.wait_for_completion()
job.get_status()
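If you want the script to react to the final job state rather than simply print it, you can inspect the returned status. This is a minimal sketch that assumes get_status() returns a dictionary containing a status field, as described in the DataRobot Python client documentation:

# Hedged sketch: inspect the final job state and report anything unexpected.
status = job.get_status()
if status.get("status") != "COMPLETED":
    print("Batch prediction job did not complete successfully:", status)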
When the job completes successfully, you should see the output file in your Azure Blob storage container.
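If you prefer to confirm this programmatically instead of through the Azure portal, the sketch below uses the azure-storage-blob package (an additional dependency not required by this tutorial) to check that the output blob exists:

from azure.storage.blob import BlobServiceClient

# Reuse the same connection string format as the credential created earlier.
service = BlobServiceClient.from_connection_string(
    "DefaultEndpointsProtocol=https;AccountName={};AccountKey={};".format(
        AZURE_STORAGE_ACCOUNT, AZURE_STORAGE_ACCESS_KEY
    )
)
blob = service.get_blob_client(
    container=AZURE_STORAGE_CONTAINER, blob=AZURE_OUTPUT_RESULTS_FILE
)
print("Output file present:", blob.exists())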