Analyze features using histograms¶
DataRobot generates a histogram for each numeric feature so that you can analyze the distribution of the feature's values and view outlier values. In this tutorial, you'll learn how to analyze numeric features using histograms.
This tutorial explains how to use a histogram to:
- View the distribution of values for a feature
- Investigate outliers
Visualize feature distribution¶
For numeric features, use the histogram to view a rough distribution of values.
The sample dataset featured in this tutorial contains patient data.
The goal is to predict the likelihood of patient readmission to the hospital. The target feature is readmitted.
See the Assess data quality during EDA tutorial to learn how to use the Data Quality Assessment tool.
When the import completes, navigate to the Project Data list and select a feature.
For numeric features, a histogram displays equal-sized ranges called bins. The height of each bar represents the number of rows with values in that range.
Hover over a bin to view the range of the bin and the number of rows that fall within the range.
The time_in_hospital feature is the number of days spent in the hospital. The histogram indicates that a visit of one to three days is most common.
Click the Showing dropdown menu on the bottom left to change the number of bins.
With the additional bins, you can now see that a visit of two to three days is most common.
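The effect of the bin count can be reproduced outside the product. The following is a minimal sketch in plain Python, not DataRobot's implementation: the helpers `histogram` and `busiest_bin` are hypothetical names, and the lengths of stay are made-up values chosen only to show how a coarse histogram can hide detail that a finer one reveals.

```python
def histogram(values, n_bins):
    """Count values in n_bins equal-width bins covering min(values)..max(values)."""
    lo, hi = min(values), max(values)
    # Guard against a zero width when every value is identical.
    width = (hi - lo) / n_bins or 1.0
    counts = [0] * n_bins
    for v in values:
        # Clamp so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


def busiest_bin(edges, counts):
    """Return the (start, end) range of the bin holding the most rows."""
    i = counts.index(max(counts))
    return edges[i], edges[i + 1]


# Made-up lengths of stay in days (value: number of rows), for illustration only.
rows_per_stay = {1: 3, 2: 7, 3: 6, 4: 4, 5: 3, 6: 2, 7: 2, 8: 1, 9: 1, 10: 1, 13: 1}
stays = [days for days, n in rows_per_stay.items() for _ in range(n)]

coarse_edges, coarse_counts = histogram(stays, 4)
fine_edges, fine_counts = histogram(stays, 12)

print("4 bins, busiest range:", busiest_bin(coarse_edges, coarse_counts))
print("12 bins, busiest range:", busiest_bin(fine_edges, fine_counts))
```

With 4 bins, the tallest bar spans the first three days. With 12 bins, each bar covers a single day, so the peak narrows to a specific length of stay.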
Investigate outliers¶
Use the histogram to investigate a feature that has outlier values.
Select a feature that has outliers if one exists in your feature list.
Use the Data Quality Assessment tool to locate features with outliers. If a feature has outliers, a warning icon displays in the Data Quality column. The warning tip indicates the type of issue.
In the histogram that displays, toggle Show outliers on.
The red dots at the top of the histogram are the outlier values. The gold box plot shows the middle quartiles for the data to help you determine whether the distribution is skewed.
Hover over a red dot to view the value of the outlier.
In this example, the outlier shown for the num_medications feature is 74.1, far from the median of 14.
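To see how a box plot's quartiles relate to outlier values, consider the sketch below. This tutorial does not describe how DataRobot flags outliers, so the code substitutes Tukey's 1.5 × IQR rule as a common stand-in. The function name `tukey_outliers` is hypothetical, and every medication count except the 74.1 from the example above is made up.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Flag values more than k * IQR outside the middle quartiles.

    Assumption: Tukey's rule is a stand-in here, not DataRobot's method.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Made-up medication counts; only 74.1 comes from the tutorial's example.
num_medications = [8, 10, 11, 12, 13, 14, 14, 15, 16, 17, 19, 74.1]

summary = tukey_outliers(num_medications)
print("Middle quartiles:", summary["q1"], "to", summary["q3"])
print("Median:", summary["median"])
print("Outliers:", summary["outliers"])
```

The range from `q1` to `q3` corresponds to the box in a box plot. A value far beyond that range, like 74.1 against a median of 14, is the kind of point worth inspecting before modeling.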
View average target values¶
After you kick off EDA2, you can also view the average target values for features.
In the histogram, notice the orange circles that overlay the histogram.
The circles indicate the average target value for a bin. In this example, hospital visits of 8 days result in the highest average target value: for 8-day visits, 46.12% of rows have readmitted = 1.
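Because the target in this example is binary, the average target value for a bin is the fraction of rows in that bin where the target equals 1. The sketch below illustrates the calculation under stated assumptions: `mean_target_by_bin` is a hypothetical helper, the bins are one day wide, and the rows are made up rather than taken from the sample dataset.

```python
from collections import defaultdict


def mean_target_by_bin(values, targets, bin_width):
    """Average the target over rows whose value falls in each bin."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, target in zip(values, targets):
        # Identify each bin by the start of its range.
        bin_start = (value // bin_width) * bin_width
        sums[bin_start] += target
        counts[bin_start] += 1
    return {start: sums[start] / counts[start] for start in sorted(counts)}


# Made-up rows: (days in hospital, readmitted flag), for illustration only.
rows = [(2, 0), (2, 1), (2, 0), (2, 0), (3, 0), (3, 1),
        (8, 1), (8, 0), (8, 1), (8, 1)]
days = [d for d, _ in rows]
readmitted = [r for _, r in rows]

means = mean_target_by_bin(days, readmitted, bin_width=1)
highest = max(means, key=means.get)
print("Average target per bin:", means)
print("Bin with the highest average target starts at", highest, "days")
```

For a 0/1 target, multiplying a bin's average by 100 gives the percentage of rows with the positive class, which is how a figure such as 46.12% should be read.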