Data FAQ

What is the AI Catalog?

The AI Catalog is a DataRobot tool for importing, registering, and sharing data and other assets. The catalog supports browsing and searching registered assets, including definitions and relationships with other assets.
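
For scripted workflows, you can register a dataset in the AI Catalog with the DataRobot Python client. A minimal sketch (the endpoint, token, and file name below are placeholders):

```python
import datarobot as dr

# Placeholder credentials; substitute your own endpoint and API token.
dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")

# Register a local file as a dataset in the AI Catalog.
dataset = dr.Dataset.create_from_file(file_path="transactions.csv")
print(dataset.id, dataset.name)
```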

What file types can DataRobot ingest?

DataRobot can ingest text, Excel, SAS, and various compressed or archive files. Supported file formats are listed at the bottom of the project (Start) page. You can import files directly into DataRobot or you can import them into the AI Catalog.
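
As a sketch of the direct-import route using the Python client (the file name and project name are placeholders):

```python
import datarobot as dr

dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")

# sourcedata accepts a file path, a file object, or a pandas DataFrame;
# DataRobot detects the file type (CSV, Excel, archives, etc.) during ingest.
project = dr.Project.create(sourcedata="loans.csv.gz", project_name="Loan defaults")
```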

What data sources can DataRobot connect to?

DataRobot can ingest from JDBC-enabled data sources, as well as S3, Azure Blob, Google Cloud Storage, and URLs, among others.
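
For example, to register a dataset from a URL with the Python client (a minimal sketch; the URL is a placeholder):

```python
import datarobot as dr

dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")

# Register a dataset hosted at a URL, such as a public S3 object.
dataset = dr.Dataset.create_from_url(
    url="https://example-bucket.s3.amazonaws.com/churn.csv"
)
```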

What is a histogram used for?

Histograms bucket numeric feature values into equal-sized ranges to show a rough distribution of the variable (feature). Access a feature's histogram by expanding the feature in the Data tab.
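
Conceptually, the bucketing works like this pandas sketch (an illustration, not DataRobot's internal implementation):

```python
import numpy as np
import pandas as pd

# A skewed numeric feature, generated for illustration.
values = pd.Series(np.random.default_rng(0).lognormal(size=1_000))

# Split the value range into 10 equal-width buckets and count rows per bucket,
# which is the shape a histogram displays.
buckets = pd.cut(values, bins=10)
print(buckets.value_counts().sort_index())
```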

What do yellow triangles mean on the Data tab?

Upon uploading data, DataRobot automatically detects and identifies common data quality issues. The Data Quality Assessment report denotes these data quality issues with yellow triangle warnings. Hover over the triangles to see the specific quality issues, such as excess zeros or outliers.
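
The sketch below shows the flavor of such checks using simple heuristics; DataRobot's actual assessment applies its own rules and thresholds:

```python
import pandas as pd

def quality_flags(s: pd.Series) -> list[str]:
    """Illustrative checks only; not DataRobot's actual rules."""
    flags = []
    # "Excess zeros": a large share of values is exactly zero (threshold is arbitrary here).
    if (s == 0).mean() > 0.3:
        flags.append("excess zeros")
    # "Outliers": values beyond 1.5 * IQR outside the quartiles.
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    if ((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).any():
        flags.append("outliers")
    return flags

print(quality_flags(pd.Series([0, 0, 0, 0, 1, 2, 3, 500])))
```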

How can I share a dataset?

Use the AI Catalog to share a dataset with users, groups, and organizations. You can select a role for each user with whom you share the asset: owner (can view, edit, and administer), editor (can view and edit), or consumer (can view).

How does DataRobot reduce features?

DataRobot automatically implements feature reduction at multiple stages of the modeling life cycle:

  1. During EDA1: After uploading your data, DataRobot creates an informative feature list by excluding non-informative features, such as those with too many unique values.
  2. After EDA2: After clicking Start, DataRobot removes features with target leakage (i.e., features with a high correlation to the target) and features with an ACE score less than 0.0005 (i.e., features with a marginal correlation to the target).
  3. During model training and analysis: DataRobot removes redundant features and retrains the model, keeping the features that together account for a cumulative feature importance score of 0.95.
  4. A step in the model's blueprint: Some algorithms, including LASSO and ENET, offer intrinsic feature reduction by shrinking coefficients to 0 (see the sketch after this list).
  5. Automated Feature Discovery: Feature Discovery projects explore and generate features based on the secondary dataset(s), and then perform supervised feature reduction to keep only the features with an estimated cumulative feature importance score over 0.98.
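
As an example of the intrinsic reduction in step 4, LASSO's L1 penalty drives the coefficients of uninformative features to exactly 0. A minimal scikit-learn sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features drive the target; the rest are noise.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.1).fit(X, y)
# The L1 penalty shrinks the noise features' coefficients to exactly 0,
# effectively removing them from the model.
print(np.round(model.coef_, 3))
```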

For more information, see the documentation for data transformations.

What are informative features?

Informative features are those that are potentially valuable for modeling. DataRobot generates an informative features list that excludes features unlikely to be useful, for example, reference IDs, features that contain empty values, and features derived from the target. DataRobot also creates new features, such as date type features, and includes them in the informative features list if they are valuable.
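
The sketch below illustrates two of these checks with toy heuristics; DataRobot applies its own, more extensive rules when building the list:

```python
import pandas as pd

def looks_uninformative(s: pd.Series) -> bool:
    """Toy heuristics echoing the examples above, not DataRobot's actual logic."""
    all_empty = s.isna().all()
    # A distinct value on every row suggests a reference ID rather than a signal.
    looks_like_id = s.nunique(dropna=True) == len(s.dropna())
    return all_empty or looks_like_id

print(looks_uninformative(pd.Series(range(100))))  # True: unique per row
```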

What is a snapshot?

You can create a snapshot of your data in the AI Catalog, in which case DataRobot stores a copy of your data in the catalog. You can then schedule the snapshot to be refreshed periodically. If you don't create a snapshot, the data is dynamic—DataRobot samples for profile statistics but does not keep a copy of the data. Instead, the catalog stores a pointer to the data and pulls it upon request, for example, when you create a project.
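
When registering data from a database connection through the Python client, the snapshot-versus-dynamic choice appears as a flag. A hedged sketch (the data source ID is a placeholder, and a JDBC source typically also requires credentials):

```python
import datarobot as dr

dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")

# do_snapshot=True stores a copy of the data in the catalog;
# do_snapshot=False keeps the dataset dynamic, so data is pulled on request.
dataset = dr.Dataset.create_from_data_source(
    data_source_id="YOUR_DATA_SOURCE_ID",
    do_snapshot=True,
)
```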

What are the green "importance" bars on the Data tab?

The importance bars show the degree to which a feature is correlated with the target. The bars are based on Alternating Conditional Expectations (ACE) scores, which detect non-linear relationships with the target but cannot detect interaction effects between features. Importance measures the information content of the feature; the calculation is done independently for each feature in the project.
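
ACE itself is rarely packaged in common Python libraries, but mutual information (a different measure, used here as a stand-in) shares the two properties described above: it is computed per feature and detects non-linear association, yet misses interaction effects. A sketch:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 3))
# The target depends non-linearly on feature 0 only.
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

# Scored independently per feature; the non-linear relationship with
# feature 0 is detected, while the noise features score near 0.
scores = mutual_info_regression(X, y, random_state=0)
print(np.round(scores, 3))
```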

How large can my datasets be?

File size requirements vary depending on deployment type (Cloud versus on-premise) and whether you are using AutoML, time series, and/or Feature Discovery.


Updated May 17, 2024