Data integrity and quality are cornerstones for creating highly accurate predictive models. These sections describe the tools and visualizations DataRobot provides to ensure that your project doesn't suffer the "garbage in, garbage out" outcome.
|Dataset requirements, data type definitions, file formats and encodings, and special column treatments.
|Connect to data sources
|Set up database connections and manage securely stored credentials for reuse when accessing secure data sources.
|Import data into the AI Catalog, where you can transform it using SQL and create and schedule snapshots of your data. Then, create a DataRobot project from a catalog asset.
|Import data from a variety of sources.
|Transform primary datasets and perform Feature Discovery on multiple datasets.
|Investigate data using reports and visualizations created after EDA1 and EDA2.
|A list of frequently asked data preparation and management questions with brief answers and links to more complete documentation.
How to build a feature store in DataRobot
Feature stores serve as a central repository where frequently-used features are stored and organized for reuse and sharing. Using existing functionality, you can build a feature store in DataRobot.
- Feature storage: Connect to and add data from external data sources using the Data Registry and AI Catalog, as well as saved credentials in Credentials Management.
- Feature transformations: Build wrangling recipes in Workbench to apply transformations to your data.
- Perform offline serving for batch processing by using wrangler recipe SQL and scheduling it within the AI Catalog.
- Perform online serving for real-time processing using feature cache.
- Data monitoring: Monitor your data with the Workbench exploratory data insights (EDA) or jobs.
- Automation: Create custom jobs to implement automation.
The following are the data-related considerations for working in DataRobot.
For non-time series projects (see time series considerations here):
- Ingestion of XLSX files is often less reliable than ingestion of the equivalent CSV file. The XLSX format requires loading the entire file into RAM before processing can begin, which can cause RAM availability errors. Even when ingestion succeeds, performance is poorer than with CSV, which can begin processing before the entire file is loaded. As a result, XLSX file size limits are suggested. For file sizes larger than those listed below, convert your Excel file to CSV before importing. See the dataset requirements for more information.
- When using the prediction API, the request body is limited to 50MB. A prediction request of more than 50MB made using dedicated prediction workers fails with the HTTP response HTTP 413: Entity Too Large.
- Exportable Java scoring code uses extra RAM during model building; therefore, dataset size should be less than 8GB.
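One way to stay under the 50MB prediction request limit described above is to split a large CSV payload into several smaller requests. The following is a minimal sketch using only the Python standard library, not DataRobot code. The function name `split_csv_payload` is hypothetical, and the sketch assumes that 50MB means 50 × 1024 × 1024 bytes and that each request body is a UTF-8 CSV that repeats the header row.

```python
import csv
import io

# Assumption: "50MB" is interpreted as 50 MiB; lower this value to leave headroom.
MAX_BODY_BYTES = 50 * 1024 * 1024


def split_csv_payload(csv_text, max_bytes=MAX_BODY_BYTES):
    """Yield CSV strings, each starting with the header row,
    whose UTF-8 encoded size does not exceed max_bytes."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to send

    def encode(row):
        # Re-serialize a parsed row so quoting stays valid in every chunk.
        buf = io.StringIO(newline="")
        csv.writer(buf).writerow(row)
        return buf.getvalue()

    header_line = encode(header)
    header_size = len(header_line.encode("utf-8"))
    chunk, size = [header_line], header_size

    for row in reader:
        line = encode(row)
        line_size = len(line.encode("utf-8"))
        if header_size + line_size > max_bytes:
            raise ValueError("a single row exceeds the size limit")
        if size + line_size > max_bytes:
            # Current chunk is full: emit it and start a new one.
            yield "".join(chunk)
            chunk, size = [header_line], header_size
        chunk.append(line)
        size += line_size

    if len(chunk) > 1:  # emit the final chunk if it holds any data rows
        yield "".join(chunk)
```

Each yielded string can then be sent as the body of its own prediction request. How the request is sent (endpoint, headers, and authentication) depends on the deployment and is intentionally left out of this sketch.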
10GB Cloud ingest
The 10GB ingest option is only available for licensed users of the DataRobot Business Critical package and only available for AutoML (not time series) projects.
Consider the following when working with the 10GB ingest option for AutoML projects:
- Certain modeling activities may deliver less than 10GB availability, as described below.
- The capability is available for regression, binary classification, and multiclass AutoML projects.
- Project creation with datasets close to 10GB may take several hours, depending on the data structure and features enabled.
In some situations, depending on the data or the nature of the modeling activity, 10GB datasets can cause out-of-memory (OOM) errors. The following conditions have resulted in OOM errors during testing:
- Models built from the Repository; retry the model using a smaller sample size.
- Feature Impact insights; rerun the Feature Impact job using a smaller sample size.
- Using Advanced Tuning, particularly tunings that a) add more trees to XGBoost/LGBM models or b) run deep grid searches over many parameters.
- Retraining models at larger sample sizes.
- Multiclass projects with more than 5-10 classes.
- Feature Effects insight; try reducing the number of features.
- Anomaly detection models, especially for datasets > 2.5GB.
Specific areas of the application may have a limit lower than 10GB. Notably:
- Location AI (geospatial modeling) is limited to 100,000 rows and 500 numeric columns. Datasets that exceed those limits will run as regular AutoML modeling projects, but the Spatial Neighborhood Featurizer will not run (resulting in no geospatial-specific models).
- Out-of-time validation (OTV) modeling supports datasets up to 5GB.
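The limits above can be checked locally before uploading a dataset. The following is a minimal sketch, not part of any DataRobot client. The function name `check_dataset_limits` and the constant names are hypothetical, and the sketch assumes that 1GB means 1024³ bytes.

```python
# Limits quoted in the text above; the constant names are illustrative only.
LOCATION_AI_MAX_ROWS = 100_000
LOCATION_AI_MAX_NUMERIC_COLS = 500
GB = 1024 ** 3  # assumption: 1GB = 1024**3 bytes
OTV_MAX_BYTES = 5 * GB
INGEST_MAX_BYTES = 10 * GB


def check_dataset_limits(n_rows, n_numeric_cols, size_bytes):
    """Return a list of warnings for each documented limit the dataset exceeds."""
    warnings = []
    if size_bytes > INGEST_MAX_BYTES:
        warnings.append("exceeds the 10GB ingest limit")
    if size_bytes > OTV_MAX_BYTES:
        warnings.append("exceeds the 5GB limit for OTV modeling")
    if (n_rows > LOCATION_AI_MAX_ROWS
            or n_numeric_cols > LOCATION_AI_MAX_NUMERIC_COLS):
        warnings.append(
            "exceeds Location AI limits; the Spatial Neighborhood "
            "Featurizer will not run"
        )
    return warnings


# Example: a 6GB dataset with 250,000 rows and 40 numeric columns.
for message in check_dataset_limits(250_000, 40, 6 * GB):
    print(message)
```

An empty list means none of these three limits is exceeded. It does not rule out the out-of-memory conditions listed earlier, which depend on the data and the modeling activity.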