Advanced feature selection with Python

This notebook shows how you can use DataRobot's Python client to accomplish feature selection by aggregating Feature Impact scores from the models created during Autopilot. For more information about the allowed feature transformations, reference the Python client documentation.

Requirements

  • Python version 3.7.3.
  • DataRobot API version 2.14.0.
  • A DataRobot Project object.
  • A DataRobot Model object.

Small adjustments may be needed depending on the Python version and DataRobot API version you are using.
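Before you begin, you can confirm the versions in your environment. A minimal check, assuming the standard datarobot package, which exposes its version as dr.__version__:

import sys
import datarobot as dr

# Print the Python interpreter and DataRobot client versions in use
print(sys.version)
print(dr.__version__)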

Import libraries

import datarobot as dr
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('ticks')
sns.set_context('poster')

Select models

For this workflow, select the five top-performing models from the project. get_models() returns models in Leaderboard order, so the first five entries are the top performers.

project = dr.Project.get(project_id='<project-id>')
models = project.get_models()
models = models[:5]
print(models)
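To sanity-check the selection, you can print each model's type along with its validation score on the project metric. A brief sketch; it assumes Autopilot has finished and validation scores are populated:

# Inspect the selected models and their validation scores
for model in models:
    print(model.model_type, model.metrics[project.metric]['validation'])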

Create a dataframe

Create a dataframe containing each feature's relative rank for each of the top five models.

all_impact = pd.DataFrame()
for model in models:

    # Feature Impact computation can take about one minute per model
    feature_impact = model.get_or_request_feature_impact(max_wait=600)

    # Convert the list of Feature Impact records to a dataframe
    df = pd.DataFrame(feature_impact)
    # Track model names and IDs for auditing purposes
    df['model_type'] = model.model_type
    df['model_id'] = model.id
    # By sorting and re-indexing, the new index becomes the 'ranking'
    df = df.sort_values(by='impactUnnormalized', ascending=False)
    df = df.reset_index(drop=True)
    df['rank'] = df.index.values

    # Add to the master list of all models' feature ranks
    all_impact = pd.concat([all_impact, df], ignore_index=True)
all_impact.head()
   featureName               impactNormalized  impactUnnormalized  redundantWith  model_type                                          model_id                  rank
0  number_inpatient          1.000000          0.031445            None           eXtreme Gradient Boosted Trees Classifier with...  5e620be2d7c7a80c003d16a2  0
1  discharge_disposition_id  0.950723          0.029896            None           eXtreme Gradient Boosted Trees Classifier with...  5e620be2d7c7a80c003d16a2  1
2  medical_specialty         0.828289          0.026046            None           eXtreme Gradient Boosted Trees Classifier with...  5e620be2d7c7a80c003d16a2  2
3  number_diagnoses          0.609419          0.019163            None           eXtreme Gradient Boosted Trees Classifier with...  5e620be2d7c7a80c003d16a2  3
4  num_lab_procedures        0.543238          0.017082            None           eXtreme Gradient Boosted Trees Classifier with...  5e620be2d7c7a80c003d16a2  4
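Feature Impact jobs run server-side, so for longer model lists you can queue all of the requests first and collect the results afterwards, rather than waiting on each model in turn. A sketch using the client's job-based API; request_feature_impact raises JobAlreadyRequested when the computation already exists:

from datarobot.errors import JobAlreadyRequested

# Queue Feature Impact computation for every model up front
for model in models:
    try:
        model.request_feature_impact()
    except JobAlreadyRequested:
        pass  # the computation already exists or is in progress

# Collect the results; by now most of the jobs have finished
impacts = {model.id: model.get_or_request_feature_impact(max_wait=600)
           for model in models}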

View rankings and distribution

You can find the N features with the best (lowest) median rank across models and visualize the distributions:

# Silence a noisy matplotlib color warning triggered by some seaborn versions
from matplotlib.axes._axes import _log as matplotlib_axes_logger
matplotlib_axes_logger.setLevel('ERROR')

n_feats = 20
# Median rank of each feature across the selected models; lowest median = most impactful
top_feats = list(all_impact
                 .groupby('featureName')['rank']
                 .median()
                 .sort_values()
                 .head(n_feats)
                 .index
                 .values)

top_feat_impact = all_impact.query('featureName in @top_feats').copy()

fig, ax = plt.subplots(figsize=(20, 25))
sns.boxenplot(y='featureName', x='rank',
              data=top_feat_impact, order=top_feats,
              ax=ax, orient='h')
plt.title("Features with highest Feature Impact rating")
_ = ax.set_ylabel('Feature Name')
_ = ax.set_xlabel('Rank')
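Before committing to a feature list, it can also be worth checking which features DataRobot flagged as redundant with another feature; where supported, Feature Impact records this in the redundantWith column. A quick sketch against the all_impact dataframe built above:

# Show any features flagged as redundant with another feature
redundant = all_impact[all_impact['redundantWith'].notna()]
print(redundant[['featureName', 'redundantWith', 'model_type']].drop_duplicates())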

Create a new feature list

After analysis, you can create a new feature list from the top features and rerun Autopilot on it. Note that a feature list can also be created at the dataset level, which makes it available to every future project built from that dataset.

# Create a new feature list from the top features and rerun Autopilot
featurelist = project.create_featurelist("consensus-top-features", list(top_feats))
featurelist_id = featurelist.id

project.start_autopilot(featurelist_id=featurelist_id)
project.wait_for_autopilot()
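Once Autopilot finishes, you can compare how models built on the reduced feature list perform against the rest of the Leaderboard. A sketch that filters models by the new feature list's ID; it assumes validation scores on the project metric are populated:

# Filter the refreshed Leaderboard to models built on the new feature list
new_models = [m for m in project.get_models()
              if m.featurelist_id == featurelist_id]
for m in new_models:
    print(m.model_type, m.metrics[project.metric]['validation'])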

Updated June 10, 2022