Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is DataRobot's approach to analyzing datasets and summarizing their main characteristics. Generally, EDA has two stages: EDA1 and EDA2. EDA1 provides summary statistics based on a sample of your data. EDA2 is the step used for model building and uses the entire dataset, based on the options selected (see below).

The following describes, in general terms, the DataRobot model building process for datasets under 1GB:

  1. Import a dataset.
  2. DataRobot launches EDA1 (and automatically creates feature transformations if date features are detected).
  3. Upon completion of EDA1, select a target and click Start.

    • For Feature Discovery projects, DataRobot:

      • Loads secondary datasets.
      • Discovers features from secondary datasets.
      • Generates new features from the discovery.
    • For time series projects, DataRobot applies the feature derivation process to create final features.

  4. DataRobot partitions the data.

  5. DataRobot launches EDA2 and starts model building when it completes.

The table below lists the components of EDA:

Analysis type: Analyzes...

Automatic data schema and data type:
  • Numeric (numerical statistics: mean, standard deviation, median, min, max)
  • Categorical
  • Boolean
  • Text
  • Special feature types:
    • Date
    • Currency
    • Percentage
    • Length
  • Image
  • Geospatial points
  • Geospatial lines or polygons

Data visualization:
  • Histogram
  • Frequency distribution for top 50 items
  • Over time
  • Column validity for modeling (non-empty, non-duplicate)
  • Average value
  • Outliers
  • Feature correlation to the target

Data quality checks:
  • Inliers
  • Outliers
  • Disguised missing values
  • Excess zeros
  • Target leakage
  • Missing images
  • Duplicate images

Feature association matrix (supports numerical and categorical data), with metrics:
  • Mutual information
  • Cramer's V
  • Pearson
  • Spearman
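
Of the association metrics listed above, Cramer's V measures the strength of association between two categorical features. The sketch below shows the standard textbook formula (uncorrected, based on the chi-squared statistic) in plain Python. It is an illustration only: the function name `cramers_v` is hypothetical, and the source does not say which variant or correction DataRobot applies.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Uncorrected Cramer's V between two equal-length categorical sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    joint = Counter(zip(xs, ys))
    # Degrees-of-freedom term: min(rows, cols) - 1.
    k = min(len(x_counts), len(y_counts)) - 1
    if n == 0 or k <= 0:
        return 0.0  # a constant (or empty) feature carries no association
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))
```

A value near 1 indicates that one feature almost determines the other; a value near 0 indicates little or no association.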

EDA1

DataRobot calculates EDA1 on up to 500MB of your dataset, after any applicable conversion or expansion. If the expanded dataset is under 500MB, it uses the entire dataset; otherwise, it uses a 500MB random sample.
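
The size rule above can be sketched in a few lines of Python. This is a minimal illustration, not DataRobot's implementation: the function names are hypothetical, "500MB" is assumed to mean 500 * 1024**2 bytes, and rows are assumed to be roughly equal in size so that a row fraction approximates a byte fraction.

```python
import random

# Assumption: the source does not say whether "500MB" is decimal or binary.
EDA1_LIMIT_BYTES = 500 * 1024 ** 2


def eda1_sample_fraction(expanded_size_bytes, limit_bytes=EDA1_LIMIT_BYTES):
    """Fraction of the expanded dataset that falls within the EDA1 limit."""
    if expanded_size_bytes <= 0:
        raise ValueError("dataset size must be positive")
    if expanded_size_bytes <= limit_bytes:
        return 1.0  # small enough: the entire dataset is used
    return limit_bytes / expanded_size_bytes


def eda1_sample_rows(rows, expanded_size_bytes, seed=0):
    """Uniform random subset of rows sized to fit the limit (hypothetical helper)."""
    rows = list(rows)
    fraction = eda1_sample_fraction(expanded_size_bytes)
    if fraction == 1.0 or not rows:
        return rows
    k = max(1, int(len(rows) * fraction))
    return random.Random(seed).sample(rows, k)
```

Under these assumptions, a dataset that expands to 1000MB yields a fraction of 0.5, so roughly half of its rows would be sampled.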

Note

For larger datasets, Fast EDA runs during EDA1 and calculates early target selection using only a percentage of the input dataset. A message identifies the approximate percentage of data used. See more information on early target selection for large datasets.

EDA1 returns:

  • Feature type

    • Numeric
    • Categorical
    • Boolean
    • Image
    • Text
  • Special feature type

    • Date
    • Currency
    • Percentage
    • Length
  • For numerics, numerical statistics

    • Mean
    • Standard deviation
    • Median
    • Min
    • Max
  • Frequency distribution for top 50 items

  • Column validity for modeling (non-empty, non-duplicate)
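
To make the numeric statistics and the top-50 frequency distribution concrete, here is a small sketch using only the Python standard library. The function names are hypothetical, missing values are assumed to be represented as `None`, and the sample standard deviation is an assumption, since the source does not say which variant is reported.

```python
import statistics
from collections import Counter


def numeric_summary(values):
    """Mean, standard deviation, median, min, and max of a numeric column.

    Missing values (None) are skipped. Assumption: sample standard
    deviation; the source does not specify the variant.
    """
    present = [v for v in values if v is not None]
    if not present:
        return None
    return {
        "mean": statistics.fmean(present),
        "std": statistics.stdev(present) if len(present) > 1 else 0.0,
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
    }


def top_items(values, limit=50):
    """Frequency distribution restricted to the `limit` most common items."""
    counts = Counter(v for v in values if v is not None)
    return counts.most_common(limit)
```

Capping the distribution at the most common items keeps the summary compact for high-cardinality columns.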

EDA2

DataRobot calculates EDA2 on the portion of the data used for EDA1, excluding rows that are also in the holdout data (if there is a holdout) and rows where the target is N/A. DataRobot also does additional calculations on the target column using the entire dataset.
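
The row selection described above can be sketched as a simple filter. This is an illustration under stated assumptions: rows are represented as dictionaries, a missing target (N/A) is represented as `None`, and the names `eda2_rows`, `holdout_ids`, and `row_id` are hypothetical.

```python
def eda2_rows(eda1_sample, holdout_ids, target, id_key="row_id"):
    """Filter the EDA1 sample down to the rows used for EDA2.

    eda1_sample: list of dicts, one per row (hypothetical representation).
    holdout_ids: set of row identifiers in the holdout partition;
                 pass an empty set if the project has no holdout.
    target:      name of the target column.
    """
    return [
        row for row in eda1_sample
        if row[id_key] not in holdout_ids   # drop rows that are in the holdout
        and row.get(target) is not None     # drop rows where the target is N/A
    ]
```

The sketch covers only the row filter; the additional target calculations on the entire dataset are not modeled here.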

EDA2 returns:

  • Recalculation of the numerical statistics originally calculated in EDA1.
  • Feature correlation to the target (initial feature importance calculation). The target values used come from the same sampled portion as all the other columns.

Note that the following column types are flagged as "invalid/non-informative," cannot be transformed, and are not used in modeling:

  • Duplicate column(s).
  • Empty columns and columns lacking enough data to model.
  • Columns consisting of only unique identifiers (reference ID columns).
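
As a rough illustration of the three categories above, the following sketch flags columns in a small in-memory table. It is a crude heuristic, not DataRobot's actual logic: the function name, the `min_non_missing` threshold, and the rule that treats an all-distinct integer or string column as a reference ID are all assumptions made for the example.

```python
def flag_non_informative(columns, min_non_missing=2):
    """Map column name -> reason, for each column that would be flagged.

    columns: dict of column name -> list of values (None means missing).
    min_non_missing: hypothetical cutoff for "enough data to model";
                     the source does not give the real threshold.
    """
    flagged = {}
    seen = {}  # column contents -> first column name with those contents
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        key = tuple(values)
        # Assumption: only int/str columns are candidates for reference IDs.
        id_like = all(
            isinstance(v, (int, str)) and not isinstance(v, bool)
            for v in present
        )
        if not present:
            flagged[name] = "empty"
        elif len(present) < min_non_missing:
            flagged[name] = "too few values"
        elif key in seen:
            flagged[name] = f"duplicate of {seen[key]}"
        elif id_like and len(present) == len(values) == len(set(present)):
            flagged[name] = "unique identifier"
        else:
            seen[key] = name
    return flagged
```

In this sketch, the first occurrence of a set of duplicated columns is kept and later copies are flagged.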

Updated October 3, 2022