Analyze features using histograms

DataRobot generates a histogram for each numeric feature so that you can analyze the distribution of the feature's values and view outlier values. In this tutorial, you'll learn how to analyze numeric features using histograms.

Takeaways

This tutorial explains how to use a histogram to:

  • View the distribution of values for a feature
  • Investigate outliers

Visualize feature distribution

For numeric features, use the histogram to view a rough distribution of values.

  1. Import your dataset.

    The sample dataset featured in this tutorial contains patient data.

    The goal is to predict the likelihood of patient readmission to the hospital. The target feature is readmitted.

    Tip

    See the Assess data quality during EDA tutorial to learn how to use the Data Quality Assessment tool.

  2. When the import completes, navigate to the Project Data list and select a feature.

    For numeric features, a histogram displays equal-sized ranges called bins. The height of each bar represents the number of rows with values in that range.

  3. Hover over a bin to view the range of the bin and the number of rows that fall within the range.

    The time_in_hospital feature is the number of days spent in the hospital. The histogram indicates that a visit of one to three days is most common.

  4. Click the Showing dropdown menu on the bottom left to change the number of bins.

    With the additional bins, you can now see that a visit of two to three days is most common.
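The effect of changing the bin count can be reproduced outside the application. The following is a minimal sketch of equal-width binning in plain Python. It is not DataRobot's implementation; the `histogram` helper and the sample values are hypothetical and stand in for the `time_in_hospital` feature.

```python
def histogram(values, n_bins):
    """Count values in n_bins equal-width bins spanning [min, max].

    Assumes the values contain at least two distinct numbers,
    so the bin width is non-zero.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Hypothetical lengths of stay in days (not the tutorial's dataset).
time_in_hospital = [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 5, 6, 8, 10, 13]

# Few bins: only a rough picture of where most rows fall.
coarse_edges, coarse_counts = histogram(time_in_hospital, 4)

# More bins: the same rows, split into narrower ranges.
fine_edges, fine_counts = histogram(time_in_hospital, 12)

print(coarse_counts)
print(fine_counts)
```

With four bins, the first bin lumps together stays of one to three days. With twelve bins, each day gets its own bar, so the most common length of stay becomes visible.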

Visualize outliers

Use the histogram to investigate a feature that has outlier values.

  1. If your feature list contains a feature with outliers, select it.

    Tip

    Use the Data Quality Assessment tool to locate features with outliers. If a feature has outliers, a warning icon displays in the Data Quality column. The warning tip indicates the type of issue.

  2. In the histogram that displays, toggle Show outliers on.

    The red dots at the top of the histogram are the outlier values. The gold box plot shows the middle quartiles for the data to help you determine whether the distribution is skewed.

  3. Hover over a red dot to view the value of the outlier.

    In this example, the outlier shown for the num_medications feature is 74.1, far from the median of 14.
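To see how quartiles and outliers relate, the sketch below computes the values a box plot summarizes and flags points that fall far outside the middle quartiles. This tutorial does not state how DataRobot identifies outliers, so the 1.5 × IQR rule used here is an assumption chosen for illustration. The `iqr_outliers` helper and the sample values are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (q1, median, q3) and the values outside the IQR fences.

    Uses the common k * IQR rule; this is an illustrative stand-in,
    not necessarily the rule DataRobot applies.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return (q1, median, q3), outliers


# Hypothetical medication counts (not the tutorial's dataset).
num_medications = [8, 10, 11, 12, 13, 14, 14, 15, 16, 18, 20, 74]

quartiles, outliers = iqr_outliers(num_medications)
print("quartiles:", quartiles)
print("outliers:", outliers)
```

A median that sits off-center between the first and third quartiles suggests a skewed distribution, which is what the box plot helps you judge visually.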

View average target values

After you kick off EDA2, you can also view the average target values for features.

Notice the orange circles that overlay the histogram.

The circles indicate the average target value for a bin. In this example, hospital visits of 8 days have the highest average target value: for 8-day visits, 46.12% of rows have readmitted = 1.
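For a binary target such as readmitted, the average target value of a bin is the fraction of rows in that bin where the target equals 1. The sketch below shows the calculation in plain Python under that assumption. The `average_target_by_bin` helper and the data are hypothetical and are not DataRobot's implementation.

```python
def average_target_by_bin(feature, target, n_bins):
    """Average the target over equal-width bins of the feature.

    Returns one value per bin, or None for a bin with no rows.
    Assumes the feature has at least two distinct values.
    """
    lo, hi = min(feature), max(feature)
    width = (hi - lo) / n_bins
    totals = [0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(feature, target):
        # The maximum feature value is placed in the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        totals[idx] += y
        counts[idx] += 1
    return [t / c if c else None for t, c in zip(totals, counts)]


# Hypothetical rows: length of stay in days and a 0/1 readmission flag.
time_in_hospital = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9]
readmitted = [0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0]

averages = average_target_by_bin(time_in_hospital, readmitted, 4)
print(averages)
```

Each returned value corresponds to the height at which a circle would be drawn over that bin.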

Updated August 26, 2022