Exploratory Data Analysis (EDA)

Exploratory Data Analysis, or EDA, is DataRobot's approach to analyzing datasets and summarizing their main characteristics. There are two stages of EDA: EDA1 and EDA2. EDA1 provides summary statistics based on a sample of your data. EDA2 is the step used for model building; it uses the entire dataset, subject to the options selected (see below).

The following describes, in general terms, the DataRobot model building process for datasets under 1GB:

  1. Import a dataset.
  2. DataRobot launches EDA1 (and automatically creates feature transformations if date features are detected).
  3. Upon completion of EDA1, select a target and click Start.

    • For Feature Discovery projects, DataRobot:

      • loads secondary datasets.
      • discovers features from secondary datasets.
      • generates new features from the discovery.
    • For time series projects, DataRobot applies the feature derivation process to create final features.

  4. DataRobot partitions the data.

  5. DataRobot launches EDA2, and when it completes, starts model building.

EDA1

DataRobot calculates EDA1 on up to 500MB of your dataset, after any applicable conversion or expansion. If the expanded dataset is under 500MB, EDA1 uses the entire dataset; otherwise, it uses a 500MB random sample.
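
The sampling rule above can be sketched as follows. This is a minimal illustration, not DataRobot's implementation: the function names, the byte definition of 500MB, the uniform row-level sampling, and the `seed` parameter are all assumptions made for the example.

```python
import random

# 500MB cap described above; treating 1MB as 1024 * 1024 bytes is an assumption.
EDA1_LIMIT_BYTES = 500 * 1024 * 1024


def eda1_sample_fraction(dataset_bytes, limit_bytes=EDA1_LIMIT_BYTES):
    """Return the fraction of the (expanded) dataset that EDA1 would examine."""
    if dataset_bytes <= 0:
        raise ValueError("dataset_bytes must be positive")
    if dataset_bytes <= limit_bytes:
        return 1.0  # small dataset: use everything
    return limit_bytes / dataset_bytes  # large dataset: sample down to the cap


def eda1_sample_rows(rows, dataset_bytes, limit_bytes=EDA1_LIMIT_BYTES, seed=0):
    """Pick a uniform random subset of rows sized by the fraction above (hypothetical helper)."""
    rows = list(rows)
    fraction = eda1_sample_fraction(dataset_bytes, limit_bytes)
    if fraction == 1.0:
        return rows
    k = max(1, int(len(rows) * fraction))
    return random.Random(seed).sample(rows, k)
```

Under these assumptions, a 1000MB dataset would yield a fraction of 0.5, so roughly half of its rows would be examined in EDA1.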

Note

For larger datasets, Fast EDA runs during EDA1 and calculates early target selection using only a percentage of the input dataset. A message identifies the approximate percentage of data used. See more information on early target selection for large datasets.

EDA1 returns:

  • Feature type: Numeric, Categorical, Boolean, Image, or Text, plus the special feature types Date, Currency, Percentage, and Length
  • For numerics, numerical statistics (mean, standard deviation, median, min, max)
  • Frequency distribution for the top 50 items
  • Column validity for modeling (non-empty, non-duplicate)
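
To make the EDA1 outputs listed above concrete, here is a small sketch of the numeric statistics and the top-50 frequency distribution, using only the Python standard library. The function names are hypothetical, missing values are assumed to be represented as `None`, and the choice of sample (rather than population) standard deviation is an assumption, since the documentation does not say which one DataRobot reports.

```python
from collections import Counter
import statistics


def numeric_summary(values):
    """Mean, standard deviation, median, min, and max of the non-missing values."""
    present = [v for v in values if v is not None]
    if not present:
        return None  # nothing to summarize for an empty column
    return {
        "mean": statistics.fmean(present),
        # Sample standard deviation (assumption); 0.0 when only one value exists.
        "std": statistics.stdev(present) if len(present) > 1 else 0.0,
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
    }


def top_frequencies(values, top_n=50):
    """Counts for the top_n most frequent non-missing values."""
    return Counter(v for v in values if v is not None).most_common(top_n)
```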

EDA2

DataRobot calculates EDA2 on the portion of the data used for EDA1, excluding rows that are also in the holdout data (if there is a holdout) and rows where the target is missing (N/A). DataRobot also performs additional calculations on the target column using the entire dataset.
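
The row selection described above can be sketched as a simple filter. This is an illustration only: the row representation (a list of dicts), the `row_id` key, the `holdout_ids` set, and the exact values treated as N/A are assumptions, not DataRobot's internal data model.

```python
import math


def is_missing(value):
    """Treat None, NaN, and empty or 'N/A' strings as a missing target (an assumption)."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip().upper() in {"", "N/A", "NA"}


def eda2_rows(eda1_rows, target, holdout_ids=frozenset(), id_key="row_id"):
    """Keep EDA1 rows that are outside the holdout and have a usable target value."""
    return [
        row
        for row in eda1_rows
        if row[id_key] not in holdout_ids and not is_missing(row.get(target))
    ]
```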

EDA2 returns:

  • Recalculation of numerical statistics done in EDA1
  • Feature correlation to the target (initial feature importance calculation). This calculation uses target values from the same sampled portion used for all the other columns.
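
As a rough illustration of ranking features by their relationship to the target, the sketch below uses the absolute Pearson correlation. Pearson correlation is a stand-in chosen for simplicity; the documentation does not state which measure DataRobot uses for its initial feature importance, and the function names here are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, target):
    """Sort feature names by absolute correlation with the target, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```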

Note that the following column types are flagged as "invalid/non-informative," cannot be transformed, and are not used in modeling:

  • duplicate column(s)
  • empty columns and columns lacking enough data to model
  • columns consisting of only unique identifiers (reference ID columns)
  • non-numeric columns with a distribution of too many different values to be useful for modeling
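
The four categories above can be approximated with a few simple checks. The sketch below is illustrative only: the thresholds `min_non_empty` and `max_distinct_ratio` are invented for the example, missing values are assumed to be `None`, and the uniqueness checks are applied only to non-numeric columns as a simplification. DataRobot's actual rules and cutoffs are not given in this section.

```python
def is_numeric(value):
    """True for int/float values, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def flag_non_informative(columns, min_non_empty=2, max_distinct_ratio=0.5):
    """Map column name -> reason for each column that looks non-informative.

    columns: dict of name -> list of values. Thresholds are hypothetical.
    """
    flags = {}
    first_seen = {}  # column contents -> name of the first column with them
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if len(present) < min_non_empty:
            flags[name] = "empty or too little data"
            continue
        key = tuple(values)
        if key in first_seen:
            flags[name] = "duplicate of " + first_seen[key]
            continue
        first_seen[key] = name
        if all(is_numeric(v) for v in present):
            continue  # uniqueness checks below apply to non-numeric columns only
        distinct = len(set(present))
        if distinct == len(present):
            flags[name] = "unique identifier"
        elif distinct / len(present) > max_distinct_ratio:
            flags[name] = "too many distinct values"
    return flags
```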

Updated October 26, 2021