Exploratory Data Analysis (EDA)

Exploratory Data Analysis, or EDA, is DataRobot's approach to analyzing datasets and summarizing their main characteristics. Generally speaking, there are two stages of EDA: EDA1 and EDA2. EDA1 provides summary statistics based on a sample of your data. EDA2 is the step used for model building and uses the entire dataset, based on the options selected (see the EDA2 section below).

The following describes, in general terms, the DataRobot model building process for datasets under 1GB:

Import a dataset.
DataRobot launches EDA1 (and automatically creates feature transformations if date features are detected).
Upon completion of EDA1, select a target and click Start.
- For Feature Discovery projects, DataRobot:
  - Loads secondary datasets.
  - Discovers features from secondary datasets.
  - Generates new features from the discovery.
- For time series projects, DataRobot applies the feature derivation process to create final features.
DataRobot partitions the data.
DataRobot launches EDA2 and starts model building when it completes.

The table below lists the components of EDA:

Analysis type	Analyzes...
Automatic data schema and data type	Numeric (numerical statistics: mean, standard deviation, median, min, max); Categorical; Boolean; Text; Special feature types (Date, Currency, Percentage, Length); Image; Geospatial points; Geospatial lines or polygons
Data visualization	Histogram; Frequency distribution for top 50 items; Over time; Column validity for modeling (non-empty, non-duplicate); Average value; Outliers; Feature correlation to the target
Data quality checks	Inliers; Outliers; Disguised missing values; Excess zeros; Target leakage; Missing images; Duplicate images
Feature association matrix	Supports numerical and categorical data with these metrics: Mutual information; Cramér's V; Pearson; Spearman
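
As an illustration of one metric named in the table, the following is a minimal sketch of Cramér's V for two categorical columns, using the textbook formula without bias correction. The function name `cramers_v` is hypothetical, and this is not necessarily how DataRobot computes the feature association matrix.

```python
import math
from collections import Counter

def cramers_v(xs, ys):
    """Cramér's V between two equal-length categorical sequences (no bias correction)."""
    n = len(xs)
    pair_counts = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    k = min(len(x_counts), len(y_counts)) - 1
    if n == 0 or k == 0:
        # The statistic is undefined for empty input or a single category;
        # returning 0.0 here is a convention chosen for this sketch.
        return 0.0
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n              # count expected under independence
            observed = pair_counts.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))
```

A value near 1 indicates that one column almost determines the other; a value near 0 indicates little association.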

EDA1

DataRobot calculates EDA1 on up to 500MB of your dataset, after any applicable conversion or expansion. If the expanded dataset is under 500MB, EDA1 uses the entire dataset; otherwise, it uses a 500MB random sample.
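
The sampling rule above can be sketched as follows. This is an illustration only, not DataRobot's implementation: the function names are hypothetical, 1MB is assumed to be 10^6 bytes, and rows are assumed to be roughly equal in size so that a row fraction approximates a byte fraction.

```python
import random

# Assumption: 1MB = 10**6 bytes. The source states only "500MB".
EDA1_LIMIT_BYTES = 500 * 1000 * 1000

def eda1_sample_fraction(expanded_size_bytes, limit_bytes=EDA1_LIMIT_BYTES):
    """Fraction of the expanded dataset that EDA1 would analyze."""
    if expanded_size_bytes <= limit_bytes:
        return 1.0  # small enough: use the entire dataset
    return limit_bytes / expanded_size_bytes

def eda1_sample_rows(rows, expanded_size_bytes, seed=0):
    """Return all rows, or a random subset sized to roughly the limit."""
    rows = list(rows)
    fraction = eda1_sample_fraction(expanded_size_bytes)
    if fraction == 1.0:
        return rows
    k = int(len(rows) * fraction)
    # Sampling without replacement; the seed is fixed only for reproducibility.
    return random.Random(seed).sample(rows, k)
```

For example, a dataset that expands to 1GB would have about half of its rows sampled under these assumptions.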

Note

For larger datasets, Fast EDA runs during EDA1 and calculates early target selection using only a percentage of the input dataset. A message identifies the approximate percentage of data used. For more information, see the documentation on early target selection for large datasets.

EDA1 returns:

Feature type
- Numeric
- Categorical
- Boolean
- Image
- Text
Special feature type
- Date
- Currency
- Percentage
- Length
Numerical statistics (for numeric features)
- Mean
- Standard deviation
- Median
- Min
- Max
Frequency distribution for top 50 items
Column validity for modeling (non-empty, non-duplicate)
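
To make the numeric statistics and the top-50 frequency distribution concrete, here is a minimal standard-library sketch. The function names are hypothetical, missing values are assumed to be represented as `None`, and the use of the sample standard deviation is an assumption; the source does not say which variant DataRobot reports.

```python
import statistics
from collections import Counter

def numeric_summary(values):
    """Mean, standard deviation, median, min, and max of the non-missing values."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(present),
        # Sample standard deviation (an assumption); needs at least two values.
        "std": statistics.stdev(present) if len(present) > 1 else 0.0,
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
    }

def top_frequencies(values, limit=50):
    """Counts for the most frequent non-missing values, at most `limit` items."""
    return Counter(v for v in values if v is not None).most_common(limit)
```

Calling `top_frequencies(column)` with the default `limit=50` mirrors the "top 50 items" cutoff listed above.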

EDA2

DataRobot calculates EDA2 on the portion of the data used for EDA1, excluding rows that are also in the holdout data (if there is a holdout) and rows where the target is N/A. DataRobot also does additional calculations on the target column using the entire dataset.
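
The row selection described above can be sketched as a simple filter. The row layout (dictionaries with a `row_id` key), the function name, and the use of `None` to represent N/A are assumptions made for illustration.

```python
def eda2_rows(sample_rows, target, holdout_ids=frozenset(), id_key="row_id"):
    """Keep sampled rows that are outside the holdout and have a target value."""
    return [
        row
        for row in sample_rows
        if row[id_key] not in holdout_ids   # drop rows assigned to the holdout
        and row.get(target) is not None     # drop rows where the target is N/A
    ]
```

If the project has no holdout, leaving `holdout_ids` empty means only rows with a missing target are dropped.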

EDA2 returns:

Recalculation of the numerical statistics originally calculated in EDA1.
Feature correlation to the target (initial feature importance calculation). For this calculation, the target values come from the same sampled portion used for all the other columns.

Note that the following column types are flagged as "invalid/non-informative," cannot be transformed, and are not used in modeling:

Duplicate column(s).
Empty columns and columns lacking enough data to model.
Columns consisting of only unique identifiers (reference ID columns).
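
The three categories above can be approximated with a simple check. This is a rough heuristic sketch, not DataRobot's actual rules: the function name, the `min_present` threshold, and the definition of a reference ID (every value present, distinct, and a string or integer) are all assumptions.

```python
def flag_invalid_columns(columns, min_present=2):
    """Map each flagged column name to a reason; unflagged columns are omitted.

    `columns` maps a column name to its list of values, with None for missing.
    """
    flags = {}
    seen = {}  # tuple of values -> name of the first column with those values
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        key = tuple(values)
        if len(present) < min_present:
            flags[name] = "empty or too little data"
        elif key in seen:
            flags[name] = f"duplicate of {seen[key]}"
        elif len(set(present)) == len(values) and all(
            isinstance(v, (str, int)) and not isinstance(v, bool) for v in present
        ):
            # Crude ID heuristic: no missing values, all distinct, non-float.
            flags[name] = "unique identifier"
        seen.setdefault(key, name)
    return flags
```

Floating-point columns are deliberately excluded from the identifier check, because continuous measurements are often all distinct without being IDs.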

Updated March 19, 2025
