# Exploratory Data Analysis (EDA)

> Exploratory Data Analysis (EDA) - EDA is a two-stage process that DataRobot employs to first analyze
> datasets and summarize their main characteristics and then build models.

This Markdown file sits beside the HTML page at the same path (with a `.md` suffix). It summarizes the topic and lists links for tools and LLM context.

Companion generated at `2026-05-01T23:10:48.100555+00:00` (UTC).

## Primary page

- [Exploratory Data Analysis (EDA)](https://docs.datarobot.com/en/docs/reference/data-ref/eda-explained.html): Full documentation for this topic (HTML).

## Sections on this page

- [EDA1](https://docs.datarobot.com/en/docs/reference/data-ref/eda-explained.html#eda1): In-page section heading.
- [EDA2](https://docs.datarobot.com/en/docs/reference/data-ref/eda-explained.html#eda2): In-page section heading.

## Related documentation

- [Reference documentation](https://docs.datarobot.com/en/docs/reference/index.html): Linked from this page.
- [Data reference](https://docs.datarobot.com/en/docs/reference/data-ref/index.html): Linked from this page.
- [feature transformations](https://docs.datarobot.com/en/docs/classic-ui/data/transform-data/feature-transforms.html): Linked from this page.
- [Feature Discovery](https://docs.datarobot.com/en/docs/classic-ui/data/transform-data/feature-discovery/index.html): Linked from this page.
- [feature derivation process](https://docs.datarobot.com/en/docs/reference/pred-ai-ref/ts-reference/feature-eng.html): Linked from this page.
- [more information](https://docs.datarobot.com/en/docs/classic-ui/data/import-data/large-data/fast-eda.html#fast-eda-and-early-target-selection): Linked from this page.

## Documentation content

Exploratory Data Analysis (EDA) is DataRobot's approach to analyzing datasets and summarizing their main characteristics. It has two stages, EDA1 and EDA2. EDA1 provides summary statistics based on a sample of your data. EDA2 is the step used for model building and uses the entire dataset, based on the options selected (see the EDA2 section).

The following describes, in general terms, the DataRobot model building process for datasets under 1GB:

1. Import a dataset.
2. DataRobot launches EDA1 (and automatically creates feature transformations if date features are detected).
3. Upon completion of EDA1, select a target and click Start.
4. DataRobot partitions the data.
5. DataRobot launches EDA2 and starts model building when it completes.

The table below lists the components of EDA:

| Analysis type | Analyzes... |
| --- | --- |
| Automatic data schema and data type | Numeric (numerical statistics: mean, standard deviation, median, min, max); categorical; Boolean; text; special feature types: date, currency, percentage, length, image, geospatial points, geospatial lines or polygons |
| Data visualization | Histogram; frequency distribution for top 50 items; over time; column validity for modeling (non-empty, non-duplicate); average value; outliers; feature correlation to the target |
| Data quality checks | Inliers; outliers; disguised missing values; excess zeros; target leakage; missing images; duplicate images |
| Feature association matrix | Supports numerical and categorical data with metrics: mutual information, Cramér's V, Pearson, Spearman |
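
For readers unfamiliar with the four association metrics named in the table, the following is a minimal pure-Python sketch of their textbook definitions. It is not DataRobot's implementation; the function names are illustrative, and how DataRobot pairs each metric with feature types is not covered here.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    denom = math.sqrt(vx * vy)
    return cov / denom if denom else 0.0  # constant column: report 0.0


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def cramers_v(a, b):
    """Cramer's V for two categorical sequences (no bias correction)."""
    n = len(a)
    joint, ca, cb = Counter(zip(a, b)), Counter(a), Counter(b)
    chi2 = 0.0
    for x in ca:
        for y in cb:
            expected = ca[x] * cb[y] / n
            chi2 += (joint.get((x, y), 0) - expected) ** 2 / expected
    k = min(len(ca), len(cb)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


def mutual_information(a, b):
    """Mutual information (in nats) of two categorical sequences."""
    n = len(a)
    joint, ca, cb = Counter(zip(a, b)), Counter(a), Counter(b)
    return sum((c / n) * math.log(c * n / (ca[x] * cb[y]))
               for (x, y), c in joint.items())
```

Pearson and Spearman apply to numeric pairs, while Cramér's V and mutual information work on categorical values (numeric features would need binning first, which this sketch leaves out).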

## EDA1

DataRobot calculates EDA1 on up to 500MB of your dataset, after any applicable conversion or expansion. If the expanded dataset is under 500MB, it uses the entire dataset; otherwise, it uses a 500MB random sample.
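
The 500MB rule can be illustrated with a short sketch. Everything here is an assumption for illustration only: the helper names are hypothetical, 500MB is treated as 500 × 1024 × 1024 bytes, and rows are assumed to be roughly equal in size so that a row fraction approximates a byte fraction. DataRobot's actual sampling mechanics are not described in this page.

```python
import random

# Assumption: "500MB" interpreted as 500 MiB.
EDA1_LIMIT_BYTES = 500 * 1024 * 1024


def eda1_sample_fraction(dataset_bytes, limit_bytes=EDA1_LIMIT_BYTES):
    """Fraction of the (expanded) dataset that EDA1 would look at."""
    if dataset_bytes <= limit_bytes:
        return 1.0  # small enough: use the entire dataset
    return limit_bytes / dataset_bytes


def eda1_sample(rows, dataset_bytes, seed=0):
    """Return all rows, or a random subset sized to about the limit.

    `rows` must be a list; rows are assumed to be similar in size.
    """
    fraction = eda1_sample_fraction(dataset_bytes)
    if fraction >= 1.0:
        return list(rows)
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    return rng.sample(rows, int(len(rows) * fraction))
```

For example, a 1000 MiB dataset yields a fraction of 0.5, so about half the rows are sampled, while a 100 MiB dataset is used in full.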

> [!NOTE] Note
> For larger datasets, Fast EDA runs during EDA1 and calculates early target selection using only a percentage of the input dataset. A message identifies the approximate percentage of data used. See [more information](https://docs.datarobot.com/en/docs/classic-ui/data/import-data/large-data/fast-eda.html#fast-eda-and-early-target-selection) on early target selection for large datasets.

EDA1 returns:

- Feature type
- Special feature type
- For numerics, numerical statistics
- Frequency distribution for top 50 items
- Column validity for modeling (non-empty, non-duplicate)
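
As a rough illustration of the kind of per-column summary listed above, the sketch below computes numeric statistics and a top-50 frequency distribution with the Python standard library. The function name, the use of `None` for missing values, and the simplistic two-way type inference are assumptions; DataRobot's type detection (including special feature types) is considerably richer.

```python
import statistics
from collections import Counter


def summarize_column(values, top_n=50):
    """Summarize one column; None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        # Most frequent values, capped at top_n items.
        "top_values": Counter(present).most_common(top_n),
    }
    if not present:
        summary["type"] = "empty"
    elif all(isinstance(v, (int, float)) and not isinstance(v, bool)
             for v in present):
        summary["type"] = "numeric"
        summary["mean"] = statistics.mean(present)
        summary["median"] = statistics.median(present)
        # Sample standard deviation needs at least two points.
        summary["std_dev"] = (statistics.stdev(present)
                              if len(present) > 1 else 0.0)
        summary["min"] = min(present)
        summary["max"] = max(present)
    else:
        summary["type"] = "categorical"
    return summary
```

Calling `summarize_column([1, 2, 3, 4, None])` reports one missing value and a mean and median of 2.5.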

## EDA2

DataRobot calculates EDA2 on the portion of the data used for EDA1, excluding rows that are also in the holdout data (if there is a holdout) and rows where the target is `N/A`. DataRobot also does additional calculations on the target column using the entire dataset.
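
The row selection described above can be sketched as a simple filter. The row layout (dictionaries with an `"id"` key), the representation of a missing target as `None`, and the function name are all assumptions made for this example.

```python
def eda2_rows(eda1_sample, target, holdout_ids=frozenset()):
    """Keep EDA1-sample rows that are outside the holdout
    and have a non-missing target value."""
    return [
        row for row in eda1_sample
        if row["id"] not in holdout_ids and row.get(target) is not None
    ]


sample = [
    {"id": 1, "churn": 0},
    {"id": 2, "churn": None},  # target missing: excluded
    {"id": 3, "churn": 1},     # in holdout: excluded
    {"id": 4, "churn": 1},
]
kept = eda2_rows(sample, target="churn", holdout_ids={3})
```

Here `kept` contains the rows with `id` 1 and 4. If the project has no holdout, `holdout_ids` stays empty and only rows with a missing target are dropped.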

EDA2 returns:

- Recalculation of the numerical statistics originally calculated in EDA1.
- Feature correlation to the target (initial feature importance calculation). The target values used come from the same sampled portion that is used for all other columns.

Note that the following column types are flagged as "invalid/non-informative," cannot be transformed, and are not used in modeling:

- Duplicate column(s).
- Empty columns and columns lacking enough data to model.
- Columns consisting of only unique identifiers (reference ID columns).
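
The three categories above can be approximated with simple checks. This is a crude heuristic sketch, not DataRobot's logic: the function name and the `min_present` threshold are hypothetical, missing values are assumed to be `None`, and "unique identifier" is approximated as "every value is distinct and none is a float".

```python
def flag_non_informative(columns, min_present=2):
    """Map column name -> reason, for columns that look non-informative.

    `columns` maps each column name to its list of values.
    """
    flags = {}
    first_seen = {}  # column contents -> first column name with them
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        signature = tuple(values)
        if len(present) < min_present:
            flags[name] = "empty or too little data"
        elif signature in first_seen:
            flags[name] = "duplicate of " + first_seen[signature]
        elif (len(set(present)) == len(values)
              and not any(isinstance(v, float) for v in present)):
            # All-distinct non-float values look like a reference ID.
            flags[name] = "unique identifier"
        first_seen.setdefault(signature, name)
    return flags


example = {
    "id": ["a1", "a2", "a3"],
    "age": [30.5, 41.0, 30.5],
    "age_copy": [30.5, 41.0, 30.5],
    "blank": [None, None, None],
    "city": ["x", "y", "x"],
}
flags = flag_non_informative(example)
```

In this example `id`, `age_copy`, and `blank` are flagged, while `age` and `city` are left for modeling.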
