Exploratory Data Analysis (EDA)
Exploratory Data Analysis, or EDA, is DataRobot's approach to analyzing datasets and summarizing their main characteristics. Generally speaking, there are two stages of EDA: EDA1 and EDA2. EDA1 provides summary statistics based on a sample of your data. EDA2 is the step used for model building and uses the entire dataset, based on the options selected (see below).
The following describes, in general terms, the DataRobot model building process for datasets under 1GB:
- Import a dataset.
- DataRobot launches EDA1 (and automatically creates feature transformations if date features are detected).
- Upon completion of EDA1, select a target and click Start.
- DataRobot partitions the data.
- DataRobot launches EDA2 and starts model building when it completes.
The table below lists the components of EDA:
|Automatic data schema and data type||
|Data quality checks||
|Feature association matrix||Support numerical and categorical data with metrics:
DataRobot calculates EDA1 on up to 500MB of your dataset, after any applicable conversion or expansion. If the expanded dataset is under 500MB, it uses the entire dataset; otherwise, it uses a 500MB random sample.
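The sampling rule above can be sketched as follows. This is a minimal illustration, not DataRobot's implementation: the function name `eda1_row_indices`, the assumption that all rows are roughly the same size, the reading of 500MB as 500 × 1024 × 1024 bytes, and the fixed seed are all hypothetical choices made for the example.

```python
import random

# Assumption: "500MB" is read as 500 * 1024 * 1024 bytes for this sketch.
EDA1_LIMIT_BYTES = 500 * 1024 * 1024


def eda1_row_indices(n_rows, dataset_bytes, limit_bytes=EDA1_LIMIT_BYTES, seed=0):
    """Return the sorted row indices that an EDA1-style pass would look at.

    Hypothetical helper: assumes rows are roughly equal in size, so a
    byte limit translates into a proportional number of rows.
    """
    if dataset_bytes <= limit_bytes:
        # Small dataset: use every row.
        return list(range(n_rows))
    # Large dataset: draw a random sample sized to fit the byte limit.
    fraction = limit_bytes / dataset_bytes
    k = max(1, int(n_rows * fraction))
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), k))
```

Under these assumptions, a dataset twice the size of the limit would contribute about half of its rows to the sample.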
For larger datasets, Fast EDA runs during EDA1 and calculates early target selection using only a percentage of the input dataset. A message identifies the approximate percentage of data used. See more information on early target selection for large datasets.
- Special feature type
- For numerics, numerical statistics
  - Standard deviation
- Frequency distribution for top 50 items
- Column validity for modeling (non-empty, non-duplicate)
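Two of the per-column summaries listed above, the standard deviation for numeric columns and the frequency distribution of the top 50 items, can be approximated with the Python standard library. The sketch below is illustrative only: `summarize_column` is a hypothetical name, missing values are represented as `None`, and the use of the sample (rather than population) standard deviation is an assumption, since the text does not say which one is reported.

```python
from collections import Counter
import statistics


def summarize_column(values, top_n=50):
    """Summarize one column; None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    summary = {
        # Most frequent items first, capped at top_n entries.
        "top_items": Counter(present).most_common(top_n),
    }
    is_numeric = bool(present) and all(
        isinstance(v, (int, float)) for v in present
    )
    if is_numeric and len(present) > 1:
        # Sample standard deviation; the choice of estimator is an assumption.
        summary["std"] = statistics.stdev(present)
    return summary
```

Non-numeric columns simply get no `std` entry in this sketch, which mirrors the "for numerics" qualifier in the list.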
DataRobot calculates EDA2 on the portion of the data used for EDA1, excluding rows that are also in the holdout data (if there is a holdout) and rows where the target is N/A. DataRobot also does additional calculations on the target column using the entire dataset.
- Recalculation of the numerical statistics originally calculated in EDA1.
- Feature correlation to the target (initial feature importance calculation). The target data used is from the sampled portion used for all the other columns.
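The row selection described for EDA2 (start from the EDA1 sample, then drop holdout rows and rows with a missing target) can be sketched as a simple filter. The function name `eda2_row_indices` and the use of `None` to stand for an N/A target are assumptions for illustration; the sketch does not cover the additional target-column calculations that use the entire dataset.

```python
def eda2_row_indices(sample_indices, holdout_indices, target_values):
    """Filter the EDA1 sample down to the rows an EDA2-style pass would use.

    sample_indices:  row indices that were used for EDA1
    holdout_indices: row indices assigned to holdout (may be empty)
    target_values:   target column for the full dataset, None meaning N/A
    """
    holdout = set(holdout_indices)
    return [
        i
        for i in sample_indices
        if i not in holdout and target_values[i] is not None
    ]
```

Passing an empty `holdout_indices` covers the case where the project has no holdout partition.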
Note that the following column types are flagged as "invalid/non-informative," cannot be transformed, and are not used in modeling:
- Duplicate column(s).
- Empty columns and columns lacking enough data to model.
- Columns consisting of only unique identifiers (reference ID columns).
- Non-numeric columns with a distribution of too many different values to be useful for modeling.
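The four categories above can be illustrated with a small set of heuristics. This is a hedged sketch rather than DataRobot's actual logic: the function name `flag_columns`, the thresholds `min_non_missing` and `max_distinct_ratio`, and the rule that treats an all-distinct integer or string column as a reference ID are all hypothetical, because the text does not state the exact criteria.

```python
def flag_columns(columns, min_non_missing=2, max_distinct_ratio=0.5):
    """Return {column_name: reason} for columns judged non-informative.

    columns maps a name to a list of hashable values, with None marking a
    missing entry. Both thresholds are illustrative assumptions.
    """
    flags = {}
    seen = set()  # value tuples of columns already examined
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if len(present) < min_non_missing:
            flags[name] = "empty"  # empty or too little data to model
            continue
        key = tuple(values)
        if key in seen:
            flags[name] = "duplicate"  # identical to an earlier column
            continue
        seen.add(key)
        distinct = len(set(present))
        numeric = all(isinstance(v, (int, float)) for v in present)
        id_like = all(isinstance(v, (int, str)) for v in present)
        if distinct == len(present) and id_like:
            flags[name] = "reference_id"  # every value is unique
        elif not numeric and distinct / len(present) > max_distinct_ratio:
            flags[name] = "too_many_values"  # high-cardinality non-numeric
    return flags
```

Columns that pass every check are simply absent from the returned dictionary, so an empty result means nothing was flagged.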