The DataRobot Batch Prediction API allows you to score large datasets against deployed models running on a prediction server. The API also provides flexible options for the intake and output of these files.
In this tutorial, you will learn how to use the DataRobot Python Client package (which calls the Batch Prediction API) to set up a batch prediction job. The job reads an input file for scoring from Google Cloud Storage (GCS) and then writes the results back to GCS.
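The examples below assume the DataRobot Python client is installed and connected. A minimal sketch of that setup follows; the endpoint shown is the default for DataRobot's managed cloud, and the API token is a placeholder you must replace with your own.
# Install with: pip install datarobot
import datarobot as dr

# Connect the client to your DataRobot instance.
# Replace the endpoint and token with your own values.
dr.Client(
    endpoint="https://app.datarobot.com/api/v2",
    token="YOUR_API_TOKEN",
)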
Running batch prediction jobs requires the appropriate credentials to read and write to GCS. You must create a service account within the Google Cloud Platform that has access to the GCS bucket, then download a key for the account to use in the batch prediction job.
To retrieve these credentials, log into the Google Cloud Platform console and select IAM & Admin > Service Accounts from the sidebar.
Click Create Service Account. Provide a name and description for the account, then click Create > Done.
On the Service Account page, find the account that you just created, navigate to the Details page, and click Keys.
Go to the Add Key menu and click Create new key. Select JSON for the key type and click Create to generate a key and download a JSON file with the information required for the batch prediction job.
Return to your GCS bucket and navigate to the Permissions tab. Click Add, enter the email address for the service account user you created, and give the account the “Storage Admin” role. Click Save to confirm the changes. This grants your GCP service account access to the GCS bucket.
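Before wiring these credentials into DataRobot, it can be worth confirming that the downloaded key actually grants access to the bucket. The following is a minimal sketch that assumes the google-cloud-storage package is installed; the key path, bucket name, and file name are placeholders.
from google.cloud import storage

# Authenticate with the service account key downloaded above
# (the file, bucket, and object names here are placeholders).
client = storage.Client.from_service_account_json("your-service-account-key.json")
bucket = client.bucket("YOUR GCS BUCKET NAME")

# Confirm the input scoring file is visible to the service account
print(bucket.blob("YOUR INPUT SCORING FILE NAME").exists())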
After downloading the JSON key, use the following code to create a new credential object within DataRobot. The credentials will be used in the batch prediction job to connect to the GCS bucket. Open the JSON key file and copy its contents into the key variable. The DataRobot Python client reads the JSON data as a dictionary and parses it accordingly.
# Set name for GCP credential in DataRobot
DR_CREDENTIAL_NAME = "YOUR GCP DATAROBOT CREDENTIAL NAME"

# Create a GCP-specific Credential
# NOTE: This cannot be done from the UI
# This can be generated and downloaded ready to drop in from within GCP
# 1. Go to IAM & Admin -> Service Accounts
# 2. Search for the Service Account you want to use (or create a new one)
# 3. Go to Keys
# 4. Click Add Key -> Create Key
# 5. Select the JSON key type
# 6. Copy the contents of the JSON file into the gcp_key section of the credential code below
key = {
    "type": "service_account",
    "project_id": "**********",
    "private_key_id": "***************",
    "private_key": "-----BEGIN PRIVATE KEY-----\n********\n-----END PRIVATE KEY-----\n",
    "client_email": "********",
    "client_id": "********",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/*********"
}

credential = dr.Credential.create_gcp(
    name=DR_CREDENTIAL_NAME,
    gcp_key=key
)

# Use this code to look up the ID of the credential object created.
credential_id = None
for cred in dr.Credential.list():
    if cred.name == DR_CREDENTIAL_NAME:
        credential_id = cred.credential_id
        break

print(credential_id)
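As an alternative to pasting the key contents inline, you can load the downloaded file with Python's json module and pass the resulting dictionary as gcp_key. A small sketch, assuming the key was saved as your-service-account-key.json (a placeholder path):
import json

# Alternative to pasting the key inline: load the downloaded file directly
# ("your-service-account-key.json" is a placeholder path).
with open("your-service-account-key.json") as f:
    key = json.load(f)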
With a credential object created, you can now configure the batch prediction job. Set the intake_settings and output_settings to the gcp type, and give each the gs:// URL of the GCS file you want to read from or write to (the output file does not need to exist already), along with the ID of the credential object created above. The code below creates and runs the batch prediction job and reports its status when it finishes. It also shows how to configure the job to return Prediction Explanations and passthrough columns alongside the scoring results.
DEPLOYMENT_ID = 'YOUR DEPLOYMENT ID'

# Set GCP Info
GCP_BUCKET_NAME = "YOUR GCS BUCKET NAME"
GCP_INPUT_SCORING_FILE = "YOUR INPUT SCORING FILE NAME"
GCP_OUTPUT_RESULTS_FILE = "YOUR OUTPUT RESULTS FILE NAME"

# Set up the batch prediction job
# Input: Google Cloud Storage
# Output: Google Cloud Storage
job = dr.BatchPredictionJob.score(
    deployment=DEPLOYMENT_ID,
    intake_settings={
        'type': 'gcp',
        'url': "gs://{}/{}".format(GCP_BUCKET_NAME, GCP_INPUT_SCORING_FILE),
        'credential_id': credential_id
    },
    output_settings={
        'type': 'gcp',
        'url': "gs://{}/{}".format(GCP_BUCKET_NAME, GCP_OUTPUT_RESULTS_FILE),
        'credential_id': credential_id
    },
    # Optional: return up to 5 Prediction Explanations per prediction
    max_explanations=5,
    # Optional: pass input columns through to the output file
    passthrough_columns=['column1', 'column2']
)

job.wait_for_completion()
job.get_status()
When the job completes successfully, you will see the output file in the GCS bucket.
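If you want to inspect the results without leaving Python, you can download the output file and load it into a DataFrame. A short sketch, assuming google-cloud-storage and pandas are installed and reusing the placeholder names from above:
import pandas as pd
from google.cloud import storage

# Download the scored results from GCS (names reuse the placeholders above)
client = storage.Client.from_service_account_json("your-service-account-key.json")
bucket = client.bucket(GCP_BUCKET_NAME)
bucket.blob(GCP_OUTPUT_RESULTS_FILE).download_to_filename("results.csv")

# Inspect the predictions
results = pd.read_csv("results.csv")
print(results.head())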