EDA insights

Exploratory Data Analysis (EDA) is DataRobot's approach to analyzing datasets and summarizing their main characteristics. There are two stages of EDA: EDA1 and EDA2. DataRobot runs EDA1 prior to modeling, when a dataset is added to the Data Registry for the first time. As part of EDA1, it generates summary statistics based on a sample of your data and assesses the All Features list to detect common data quality issues.

The following describes, in general terms, the DataRobot model building process for datasets under 1GB:

  1. Import a dataset to DataRobot, registering it in the Data Registry.
  2. DataRobot launches EDA1 (and automatically creates feature transformations if date features are detected).
    • For Feature Discovery datasets, DataRobot:
      • Loads secondary datasets.
      • Discovers features from secondary datasets.
      • Generates new features from the discovery.
  3. Upon completion of EDA1, insights are displayed on the Features tab of the data explore page.

Where can I view EDA2 insights?

Because EDA2 is target-aware, its insights are generated only after you set up and run an experiment with a dataset. However, these insights are not currently supported in Workbench. The only exception is the Feature lineage insight for Feature Discovery datasets.

EDA1

DataRobot calculates EDA1 on up to 500MB of your dataset, after any applicable conversion or expansion. If the expanded dataset is under 500MB, it uses the entire dataset; otherwise, it uses a 500MB random sample.
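
The size rule above can be sketched in a few lines of Python. This is only an illustration of the "whole dataset or 500MB random sample" logic, not DataRobot's implementation: the function name `select_eda1_rows`, the fixed `row_size_bytes`, and the choice to read 500MB as 500 × 1024 × 1024 bytes are all assumptions made for the example.

```python
import random

# Assumption: "500MB" is read as 500 * 1024 * 1024 bytes for this sketch.
EDA1_LIMIT_BYTES = 500 * 1024 * 1024


def select_eda1_rows(rows, row_size_bytes, limit_bytes=EDA1_LIMIT_BYTES, seed=0):
    """Return every row if the data is under the limit, else a random sample that fits.

    `rows` is a list; `row_size_bytes` is a hypothetical fixed size per row,
    used here only to keep the arithmetic simple.
    """
    total_bytes = len(rows) * row_size_bytes
    if total_bytes < limit_bytes:
        # Under the limit: use the entire dataset.
        return list(rows)
    # At or over the limit: draw a random sample whose size equals the limit.
    max_rows = limit_bytes // row_size_bytes
    rng = random.Random(seed)
    return rng.sample(rows, max_rows)


small = select_eda1_rows(list(range(10)), row_size_bytes=10, limit_bytes=1000)
large = select_eda1_rows(list(range(10)), row_size_bytes=10, limit_bytes=50)
print(len(small), len(large))
```

In the usage lines, the first call fits under the limit and keeps all ten rows, while the second exceeds it and keeps a random five.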

EDA1 returns:

Analysis type | Analyzes
Automatic data schema and data type
  • Numeric
  • Numerical statistics:
    • Mean
    • Standard deviation
    • Median
    • Min
    • Max
  • Categorical
  • Boolean
  • Text
  • Special feature types:
    • Date
    • Currency
    • Percentage
    • Length
  • Image
  • Geospatial points
  • Geospatial lines or polygons
Data visualization
  • Histogram
  • Frequency distribution for the top 50 items
  • Average value
Data quality checks
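
To make the numerical statistics and the top-50 frequency distribution listed above concrete, here is a minimal sketch using only the Python standard library. The function names `numeric_summary` and `top_frequencies` are hypothetical, and the use of the sample standard deviation is an assumption; the source does not say which variant DataRobot reports.

```python
import statistics
from collections import Counter


def numeric_summary(values):
    """Compute the five numeric statistics for a list of numbers.

    Missing values are represented as None and skipped. At least two
    non-missing values are needed for the standard deviation.
    """
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(present),
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "std_dev": statistics.stdev(present),
        "median": statistics.median(present),
        "min": min(present),
        "max": max(present),
    }


def top_frequencies(values, top_n=50):
    """Return (value, count) pairs for the top_n most frequent values."""
    return Counter(values).most_common(top_n)


print(numeric_summary([1, 2, 3, 4, None]))
print(top_frequencies(["a", "b", "a"], top_n=1))
```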

Insights

Preparing your data is an iterative process. Even if you clean and prep your training data prior to uploading it to DataRobot, you can still improve its quality by assessing features using the insights generated as a result of EDA1. To access these insights:

  1. In your Use Case, click the Actions menu > Explore next to a registered dataset, opening the data explore page.
  2. Open the Features tab on the left.

  3. Click a feature—a panel opens displaying additional summary metrics for the feature at the top, as well as tabs for each available insight.

    The table below describes which insights are available after EDA1 based on the data type:

    Insight | Description | Supported data type
    Histogram | Buckets numeric feature values into equal-sized ranges to show a rough distribution of the variable. | numeric, summarized categorical, multicategorical
    Frequent Values | Plots the counts of each individual value for the most frequent values of a feature. If there are more than 10 categories, DataRobot displays values that account for 95% of the data; the remaining 5% of values are bucketed into a single "All Other" category. | numeric, categorical, text, boolean
    Table | Provides a table of feature values and their occurrence counts. If a displayed value contains a leading space, DataRobot adds a "leading space" tag to indicate this. The tag helps clarify why a particular value may show twice in the histogram (for example, "36 months" and " 36 months" are both represented). | numeric, categorical, text, boolean, summarized categorical, multilabel
    Illustration | Shows how summarized categorical data (features that host a collection of categories) is represented as a feature. See also the summarized categorical insight differences. | summarized categorical
    Overview | Presents the top 50 most frequent keys for your feature. | summarized categorical
    Feature lineage | Provides a visual description of how the feature was derived and the datasets that were involved in the feature derivation process. | Feature Discovery datasets only
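
The Frequent Values bucketing rule described in the table can be sketched as follows. This is an illustrative reading of the rule, not DataRobot's code: the function name `frequent_values` and its parameters are hypothetical, and the exact cutoff (keep the most frequent values until they cover at least 95% of rows) is an assumption about how the 95% threshold is applied.

```python
from collections import Counter


def frequent_values(values, max_categories=10, coverage=0.95, other_label="All Other"):
    """Return (value, count) pairs, bucketing the long tail into one category.

    With max_categories or fewer distinct values, every value is returned.
    Otherwise the most frequent values are kept until they cover at least
    `coverage` of all rows, and the rest are summed under `other_label`.
    """
    counts = Counter(values)
    ranked = counts.most_common()
    if len(ranked) <= max_categories:
        return ranked

    total = sum(counts.values())
    kept, covered = [], 0
    for value, count in ranked:
        if covered >= coverage * total:
            break
        kept.append((value, count))
        covered += count

    remainder = total - covered
    if remainder:
        kept.append((other_label, remainder))
    return kept


# 11 distinct values: one dominant value plus ten rare ones.
data = ["a"] * 190 + [f"rare_{i}" for i in range(10)]
print(frequent_values(data))
```

In the usage lines, the dominant value alone covers 95% of the 200 rows, so the ten rare values are collapsed into a single "All Other" entry.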

Data quality checks

As part of EDA1, DataRobot automatically detects and surfaces common data quality issues and often handles them with minimal or no action on your part. The assessment not only saves time finding and addressing issues, but also provides transparency into the automated processing that DataRobot has applied. Note that these checks are only run on features that don’t require date/time or target information (see the table above for a full list of data quality checks).

To check for data quality issues:

  1. After registration is complete, select the dataset to open the data explore page.
  2. Open the Features tab on the left. The Data quality column indicates if DataRobot detected a data quality issue with the feature.

  3. Hover over the icon to learn which check failed, then use the exploratory data insights to correct the issue.


Updated January 8, 2025