Assess data quality during EDA
In this tutorial, you'll learn how DataRobot performs Exploratory Data Analysis (EDA) and how to assess the quality of your data at each of its two stages, EDA1 and EDA2.
Preparing your data is an iterative process. Even if you clean and prep your training data prior to uploading it to DataRobot, you can still improve its quality by assessing features during EDA.
This tutorial explains:
- Exploratory Data Analysis, including EDA1 and EDA2
- How to add your data to DataRobot
- How to use the Data Quality Assessment tool
- How to evaluate feature importance
Stages of EDA
During EDA, DataRobot performs a Data Quality Assessment. The assessment reports data quality issues that are relevant to the stage of model building you are performing. EDA has two stages:
EDA1 (data ingest) occurs after you upload your data. EDA1 assesses the All Features list and detects data quality issues in those features.
EDA2 begins once you click Start on the Data page, when DataRobot performs another round of EDA. During this stage, DataRobot detects target leakage and non-linear correlations between the features and the target, which helps you analyze feature importance. EDA2 reports on the selected feature list. If a feature list is not selected, EDA2 reports on the default All Features list.
Load and view your dataset
As soon as you load your dataset, DataRobot performs EDA1. In this phase, DataRobot generates summary statistics based on a sample of your data.
Import your dataset.
To do so, drag a local file to the Begin a project page, browse for a Local file, or import from an external data source or URL.
DataRobot uploads the dataset, creates a new project, and performs an initial EDA. View the progress in the Worker Queue on the right.
To learn how DataRobot handles larger datasets, see Fast EDA.
Once you import your data, click Explore the data or scroll down to see the features in your dataset.
DataRobot displays the features and provides summary information and statistics.
- Var Type: The data type DataRobot identifies for the feature during EDA, for example, Numeric, Categorical, Boolean, Image, or Text, and special feature types like Date.
- Unique: The number of unique values for the feature.
- Missing: The number of missing values for the feature.
- Mean, Std Dev, Median, Min, Max: Statistics that DataRobot calculates for numeric features.
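To make these columns concrete, the following is a minimal local sketch of the same kind of per-feature summary, written with the Python standard library only. It is not DataRobot's implementation: the `summarize` function, the choice to exclude missing values from the unique count, and the use of the sample standard deviation are all assumptions made for illustration.

```python
import math
import statistics


def summarize(values):
    """Summarize one feature column. None and NaN are treated as missing."""
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    summary = {
        # Assumption: missing values do not count toward the unique total.
        "unique": len(set(present)),
        "missing": len(values) - len(present),
    }
    # Only compute numeric statistics when every present value is a number.
    # bool is a subclass of int in Python, so it is excluded explicitly.
    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if is_numeric:
        summary.update(
            mean=statistics.fmean(present),
            # Assumption: sample standard deviation; 0.0 for a single value.
            std_dev=statistics.stdev(present) if len(present) > 1 else 0.0,
            median=statistics.median(present),
            min=min(present),
            max=max(present),
        )
    return summary


# Toy columns, invented for illustration only.
print(summarize([34, 51, None, 51, 78, 62]))
print(summarize(["a", "b", None, "a"]))
```

Running a check like this on your own file before uploading can help you anticipate which features are likely to show a high missing count or an unexpected number of unique values.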
The sample dataset featured in this tutorial contains patient data.
The goal is to predict the likelihood of patient readmission to the hospital. The target feature is
Assess data quality after EDA1
EDA1 helps you catch data issues before you start modeling.
Above your feature list and to the right, click View info.
The Data Quality Assessment dropdown menu displays.
The Data Quality Assessment provides the following issue status flags:
- Warning: Attention or action required.
- Informational: No action required.
- No issue.
Optionally, click Filter affected features by type of issue detected and select particular issues to search for.
Scroll down to locate the features with issues.
If a feature has an issue, the issue flag displays in the Data Quality column. Hover over the flag to view the type of issue.
Click a feature that displays an issue flag, then use tools such as the Histogram, Frequent Values, and Feature Associations to explore further.
See Learn more for tutorials that show how to use these tools.
Assess data quality after EDA2
EDA2 kicks off after you set your target and start the modeling process.
Under What would you like to predict, enter your target feature.
You can keep the mode set to the default, Quick autopilot, or you can select a different modeling mode. You can also customize your modeling settings.
DataRobot performs a number of processing steps. Monitor the steps in the Worker Queue.
As soon as DataRobot finishes analyzing features, you can take a look at feature importance. DataRobot continues with blueprint generation.
Investigate feature importance
The importance bars show the degree to which a feature is correlated with the target. Importance is calculated using an algorithm that measures the information content of the variable. This calculation is done independently for each feature in the dataset.
Investigate feature importance to determine which features are most useful for building accurate models and which features you can remove from your training data.
In the Data tab, scroll down to the feature list.
Take a look at the Importance column.
The green bars indicate how closely a feature is related to the target.
You might want to remove features that are unrelated to the target.
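As an illustration of scoring each feature independently against the target by information content, here is a minimal sketch that uses mutual information between a categorical feature and the target. This tutorial does not specify the algorithm DataRobot uses, so mutual information is a stand-in technique chosen for illustration. The function name, feature names, and toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature, target):
    """Mutual information, in bits, between two equal-length categorical sequences."""
    n = len(feature)
    joint = Counter(zip(feature, target))
    feature_counts = Counter(feature)
    target_counts = Counter(target)
    mi = 0.0
    for (f, t), count in joint.items():
        # p(f, t) * log2( p(f, t) / (p(f) * p(t)) ), written with raw counts.
        ratio = count * n / (feature_counts[f] * target_counts[t])
        mi += (count / n) * math.log2(ratio)
    return mi


# Toy data, invented for illustration only.
target = [1, 1, 0, 0, 1, 1, 0, 0]
features = {
    "feature_a": ["hi", "hi", "lo", "lo", "hi", "hi", "lo", "lo"],  # tracks the target
    "feature_b": ["x", "y", "x", "y", "x", "y", "x", "y"],          # unrelated to it
}

# Score every feature on its own, then rank from most to least informative.
scores = {name: mutual_information(col, target) for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
for name in ranked:
    print(f"{name}: {scores[name]:.3f} bits")
```

In this toy data, `feature_a` determines the target completely and `feature_b` carries no information about it. A feature that scores near zero is a candidate for removal, while a feature that predicts the target almost perfectly is worth checking for target leakage.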
Learn more
- Analyze features using histograms
- Analyze frequent values
- Analyze feature associations
- Work with feature lists