ESG score predictions with Python¶
This notebook walks through Python code from an example application that uses DataRobot to predict Environmental, Social, and Governance (ESG) scores for stocks. After completing this lab, you will be able to:
- Use the DataRobot Python client to build a model from a training dataset and deploy the model.
- Use the DataRobot predictions REST API to calculate predicted values.
Goals¶
This example workflow creates a model to predict a company's ESG score.
ESG is a rating of a corporation's Environmental, Social, and Governance risks. A lower ESG score means a particular company (and its stock) is less exposed to risk in those areas. For example, a company that deals with oil extraction might have a very high environmental impact rating, which in turn increases its overall ESG score.
Calculating ESG scores is an extensive process that involves in-depth analysis of a company's publicly available information, as well as data from news sources. Because not every company has an ESG score calculated, you will use DataRobot's ML technology to score a large number of companies across several different stock exchanges that don't have existing ESG scores.
This example is part of a larger project — a full demo application called "Harv the Finance Finder." This lab only covers the portions of the application specific to DataRobot AutoML. For an example of how these predictions could be included in an application, see the full application source in GitHub.
Setup¶
Prerequisites¶
In order to complete this workflow, you'll need:
- A DataRobot account
- Basic knowledge of Python
- Familiarity with data science concepts and terminology
- Python 3 and the DataRobot Python client installed.
Explore the training data¶
Start by downloading the sample data file: stock_quotes_esg_train.csv
Review the CSV file and note the features:
- `symbol` is the stock ticker symbol — MSFT for Microsoft, V for Visa, GM for General Motors, and so on, as seen in `companyName`.
- `open`, `close`, `high`, `low`, `week52Low`, and `week52High` indicate how the stock price has moved, either today or in the last year. All numbers are in USD.
- `marketCap` tells us the total valuation of the company.
- `sector` is the primary sector that the company operates in, for example, Electronic Technology, Health Services, Transportation, etc.
- `esg_category` is the target feature you'll be training the model on. The companies are lumped into four ESG categories: `1` being the lowest ESG risk (best) and `4` being the highest (worst).
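To get a feel for the schema before uploading, you can inspect the file with pandas. The snippet below builds a tiny in-memory sample with the same columns described above (the rows are made up for illustration); with the real file you would call `pd.read_csv("./stock_quotes_esg_train.csv")` instead.

```python
import io

import pandas as pd

# Made-up rows mirroring the columns described above
sample_csv = io.StringIO(
    "symbol,companyName,open,close,high,low,week52Low,week52High,marketCap,sector,esg_category\n"
    "MSFT,Microsoft Corp.,182.5,184.7,185.0,181.9,119.0,190.7,1400000000000,Technology Services,2\n"
    "GM,General Motors Co.,25.1,25.6,25.9,24.8,14.3,41.9,36000000000,Consumer Durables,4\n"
)
df = pd.read_csv(sample_csv)

print(df.dtypes)                          # check that the price columns are numeric
print(df["esg_category"].value_counts())  # target distribution across categories 1-4
```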
Data source¶
The application uses data from the IEX API. The stock dataset was created by merging stock data from various industry sectors into a single dataset. The data was collected on 25 May 2020.
ESG scores are provided by various agencies, but not in a publicly accessible API. For this showcase application, DataRobot generated synthetic data to train the model, based on sustainability ratings available in Yahoo Finance. The script used in the project is available on GitHub.
Connect to DataRobot¶
To read more about the options for connecting to DataRobot from the Python client, review the API Quickstart guide.
# If the config file is not in the default location described in the API Quickstart guide, '~/.config/datarobot/drconfig.yaml', then you will need to call
# dr.Client(config_path='path-to-drconfig.yaml')
Configure the Python client¶
The most important package to import is the DataRobot Python client package, which provides the API to connect your client application to DataRobot.
import datarobot as dr
Then instantiate the DataRobot client.
dr.Client()
Upload data¶
Create a project with `dr.Project.create`, passing in the path to the dataset you downloaded above. Specify a project name of your choice.
project = dr.Project.create(
sourcedata="./stock_quotes_esg_train.csv", project_name="your project name"
)
Modeling¶
Define the target¶
As a best practice, this application would typically use `esg_category`, a numerical property, as the target feature for a regression model. However, for learning purposes only, use multiclass classification to predict ESG scores as one of four categories, represented by integer values 1 to 4, rather than as a numeric value. Do this by transforming the `esg_category` numeric feature into a categorical feature called `esg_category_categorical` using `project.create_type_transform_feature`.
# Transform esg_category into a categorical variable type
# Note: not best practice, included for learning purposes only
project.create_type_transform_feature(
"esg_category_categorical", # new feature name
"esg_category", # parent name
dr.enums.VARIABLE_TYPE_TRANSFORM.CATEGORICAL_INT,
)
Start Autopilot¶
To start training models on this data, call `project.set_target()`, passing in the target name (`esg_category_categorical`) that you created in the previous step. You can also pass the `mode` option, telling DataRobot to do a quick modeling run that builds a limited set of models.
This can be a long-running process. Call `project.wait_for_autopilot()`, which prints informative output and blocks the script until the modeling job is finished.
# This kicks off modeling using Quick Autopilot mode
project.set_target(target="esg_category_categorical", mode=dr.enums.AUTOPILOT_MODE.QUICK)
# Time for a cup of tea or a walk - this might take ~15 minutes
project.wait_for_autopilot()
Get the recommended model¶
After Autopilot has finished, you can get a list of all models it created in your project, ranked by accuracy. Get DataRobot's recommendation by calling `dr.ModelRecommendation.get(project.id)`, then get the model from that recommendation using `get_model()`.
recommendation = dr.ModelRecommendation.get(project.id)
recommended_model = recommendation.get_model()
print(f"Recommended model is {recommended_model}")
Deploy the model¶
Now that you have the recommended model, you can deploy it to a production environment to make predictions with new data.
Models are not deployed to the same server used to train models; they are deployed to one or more prediction servers. How you identify the prediction server to use depends on your account type.
- Trial accounts use shared prediction servers so you do not need to specify a server.
- DataRobot Managed AI Cloud accounts and on-premises installations provide dedicated prediction servers. For those account types, you first need to get the ID of one of the prediction servers using `dr.PredictionServer.list()[0].id`. Uncomment the noted lines in the code below.
After determining the prediction server (if required for your account type), use `dr.Deployment.create_from_learning_model` to deploy the model.
# Uncomment for Managed AI Cloud accounts
# prediction_server_id = dr.PredictionServer.list()[0].id
deployment = dr.Deployment.create_from_learning_model(
model_id=recommended_model.id,
label="Financial ESG model",
description="Model for scoring financial quote data",
# Uncomment for Managed AI Cloud accounts
# default_prediction_server_id=prediction_server_id
)
print(f"Deployment created: {deployment}, deployment id: {deployment.id}")
Calculate ESG scores for a dataset¶
Download prediction data¶
After your model has been deployed, you can start making predictions using DataRobot's REST API. You can use any language to call the API; this notebook uses Python.
Start by downloading the dataset you want to calculate ESG scores for: stock_quotes_all.csv. Be sure to save the file in the same location where you saved the training data previously.
Configure application code¶
In the real world, predictions are likely to happen in a separate application from the one that creates and deploys the model. This notebook assumes you are working in an interactive environment such as a Python shell or notebook, but it walks through the same process you'd follow to add the code to an application.
Make sure the `DATAROBOT_API_TOKEN` and `DATAROBOT_ENDPOINT` environment variables are set as described in the first step of this lab.
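In application code, it helps to fail fast when those variables are missing rather than get an opaque error later. A minimal sketch (the helper name `require_env` is just for illustration):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, failing loudly if it is unset."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set {name} before running this notebook.")
    return value


# These are the variable names described above:
# api_token = require_env("DATAROBOT_API_TOKEN")
# endpoint = require_env("DATAROBOT_ENDPOINT")
```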
Then set up your application in Python, which is similar to what you did when setting up for the model building steps:
import csv
import json
import os
import sys
import datarobot as dr
import requests
DR_API_KEY = os.environ["DATAROBOT_API_TOKEN"]
dr.Client()
You might have taken note of the deployment ID when you deployed your model earlier. If you have that ID available, you can use it to find the prediction server for the deployed model. But that value may not be readily accessible in a real application, so use the following code to find the correct deployment using the label you set when deploying the model.
Refer back to the earlier steps where you deployed the model to find the label specified and use it here to find the deployment and its associated prediction server.
deployment = None
for d in dr.Deployment.list():
    if d.label == "your deployment label":
        deployment = d
        break  # stop at the first matching deployment

prediction_server_url = deployment.default_prediction_server["url"]
Make a prediction request¶
Now you have everything you need to make your request. DataRobot's Prediction API doesn't come with an SDK, so you need to "handcraft" your API requests using Python's requests library.
You are sending the following header values:
- `Content-Type` in this case is `text/plain` as you're sending a CSV file. Alternatively, the API also accepts `application/json` for JSON payloads.
- `Authorization` takes the same API key you used with the modeling API in the DataRobot Python SDK.
- `datarobot-key` is the key specific to the prediction server. Note that this value is not used for trial or pay-as-you-go DataRobot accounts.
To predict, send a POST request to the prediction server with the data from the file you downloaded above as the payload.
The Prediction API responds in JSON format, and your predictions will be in the `data` field.
headers = {
"Content-Type": "text/plain; charset=UTF-8",
"Authorization": f"Bearer {DR_API_KEY}",
# comment out line below if using a trial or pay-as-you-go account.
"datarobot-key": deployment.default_prediction_server["datarobot-key"],
}
url = f"{prediction_server_url}/predApi/v1.0/deployments/{deployment.id}/predictions?passthroughColumns=symbol"
with open("./stock_quotes_all.csv", "rb") as data_file:
    data = data_file.read()
predictions_response = requests.post(url, data=data, headers=headers)
predictions = predictions_response.json()["data"]
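For reference, each element of the returned `data` list looks roughly like this — a simplified, made-up example (real responses carry additional fields, such as per-class probability details):

```python
# A made-up, simplified element of the "data" list returned by the Prediction API
example_prediction = {
    "passthroughValues": {"symbol": "MSFT"},  # echoed back via passthroughColumns
    "prediction": "2",                        # the predicted ESG category
}

# Extract the fields the application cares about
symbol = example_prediction["passthroughValues"]["symbol"]
category = int(example_prediction["prediction"])
print(symbol, category)  # MSFT 2
```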
Parse and save the prediction response¶
The prediction data payload is a JSON array of objects for all predicted fields, in the same order that you sent them. The actual predicted value is in the `prediction` field. In this example, you create a new list of symbol/category pairs, filling it by iterating through the returned predictions.
# Transform predictions into a CSV file of the format:
# symbol, esg_category
# where symbol is a value passed through from the prediction request
esg_categories = [["symbol", "esg_category"]]
for prediction in predictions:
symbol = prediction["passthroughValues"]["symbol"]
value = int(prediction["prediction"])
esg_entry = [symbol, value]
esg_categories.append(esg_entry)
# Write the data as a CSV file
with open("stocks_esg_scores.csv", mode="w", newline="") as out_csv:
csv_writer = csv.writer(out_csv)
csv_writer.writerows(esg_categories)
Review the contents of the output file, `stocks_esg_scores.csv`, to confirm that the `esg_category` column contains ESG categories (i.e., integers 1 through 4).
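You can also spot-check the output programmatically by reading the CSV back and validating the category values. The snippet below uses a made-up in-memory sample in the same format rather than the real file:

```python
import csv
import io

# Made-up rows in the same format as stocks_esg_scores.csv
sample = "symbol,esg_category\nMSFT,2\nGM,4\n"

categories = [int(row["esg_category"]) for row in csv.DictReader(io.StringIO(sample))]

# Every value should fall into one of the four ESG buckets
assert all(c in {1, 2, 3, 4} for c in categories)
print(categories)  # [2, 4]
```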
Recap¶
In this lab, you walked through Python application code to:
- Connect a client application to DataRobot.
- Upload a training dataset to DataRobot AutoML.
- Identify a target value.
- Run Autopilot to generate a set of models.
- Deploy the recommended model to a prediction server.
- Request predictions from DataRobot for a dataset.