Data integrity and quality are cornerstones for creating highly accurate predictive models. These sections describe the tools and visualizations DataRobot provides to ensure that your project doesn't suffer the "garbage in, garbage out" outcome.
| Topic | Description |
|---|---|
| Fundamentals of working with data | How DataRobot data prep, management, and transformation tools support your ML workflow. |
| Connect to data sources | Set up database connections and manage securely stored credentials for reuse when accessing secure data sources. |
| AI Catalog | Import data into the AI Catalog, transform it using SQL, and create and schedule snapshots of your data. Then, create a DataRobot project from a catalog asset. |
| Import data | Import data from a variety of sources. |
| Transform data | Transform primary datasets and perform Feature Discovery on multiple datasets. |
| Analyze data | Investigate data using reports and visualizations created after EDA1 and EDA2. |
| Dataset requirements | Dataset requirements, data type definitions, file formats and encodings, and special column treatments. |
The following are the data-related considerations for working in DataRobot.
For non-time series projects (time series projects have their own, separately documented considerations):
- Ingesting XLSX files often works less well than ingesting the equivalent CSV. The XLSX format requires loading the entire file into RAM before processing can begin, which can cause RAM availability errors. Even when ingestion succeeds, performance is poorer than with CSV, which can begin processing before the entire file is loaded. For this reason, XLSX file size limits are suggested. For file sizes larger than those listed below, convert your Excel file to CSV before importing.
- When using the prediction API, the request body is limited to 50MB. A prediction request larger than 50MB made using dedicated prediction workers fails with the HTTP response HTTP 413: Entity Too Large.
- Exportable Java scoring code and DataRobot Prime use extra RAM during model building; therefore, dataset size should be less than 8GB.
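One way to stay under the 50MB request limit is to split a large scoring file into several smaller payloads before sending them. The following is a minimal sketch of that idea, not part of any DataRobot client: `chunk_csv_rows` and `MAX_BODY_BYTES` are hypothetical names, and treating 50MB as 50,000,000 bytes is a deliberately conservative assumption. The sketch only builds the payloads; how each one is submitted depends on your deployment.

```python
import csv
import io
from typing import Iterable, Iterator, List

# Assumption: interpret the 50MB body limit conservatively as 50,000,000 bytes.
MAX_BODY_BYTES = 50_000_000


def _encode_row(row: List[str]) -> bytes:
    """Serialize one row as a UTF-8 encoded CSV line."""
    buf = io.StringIO()
    csv.writer(buf).writerow(row)
    return buf.getvalue().encode("utf-8")


def chunk_csv_rows(
    header: List[str],
    rows: Iterable[List[str]],
    max_bytes: int = MAX_BODY_BYTES,
) -> Iterator[bytes]:
    """Yield CSV payloads that each repeat the header and fit in max_bytes."""
    header_bytes = _encode_row(header)
    parts = [header_bytes]
    size = len(header_bytes)
    for row in rows:
        row_bytes = _encode_row(row)
        # A row that cannot fit even alongside only the header is unsendable.
        if len(header_bytes) + len(row_bytes) > max_bytes:
            raise ValueError("a single row does not fit within the body limit")
        # Close the current payload when adding this row would exceed the limit.
        if size + len(row_bytes) > max_bytes:
            yield b"".join(parts)
            parts = [header_bytes]
            size = len(header_bytes)
        parts.append(row_bytes)
        size += len(row_bytes)
    # Emit the final payload only if it contains at least one data row.
    if len(parts) > 1:
        yield b"".join(parts)
```

Each yielded payload is a complete CSV document with its own header row, so every payload can be scored independently and the results concatenated afterward.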
10GB Cloud ingest
The 10GB ingest option is only available for licensed users of the DataRobot Business Critical package and only available for AutoML (not time series) projects.
Consider the following when working with the 10GB ingest option for AutoML projects:
- Certain modeling activities may deliver less than 10GB availability, as described below.
- The capability is available for regression, binary classification, and multiclass AutoML projects.
- Project creation with datasets close to 10GB may take several hours, depending on the data structure and features enabled.
In some situations, depending on the data or the nature of the modeling activity, 10GB datasets can cause out-of-memory (OOM) errors. The following conditions have resulted in OOM errors during testing:
- Models built from the Repository; retry the model using a smaller sample size.
- Feature Impact insights; rerun the Feature Impact job using a smaller sample size.
- Using Advanced Tuning, particularly tunings that a) add more trees to XGBoost/LightGBM models or b) run deep grid searches over many parameters.
- Retraining models at larger sample sizes.
- Multiclass projects with more than 5-10 classes.
- Feature Effects insight; try reducing the number of features.
- Anomaly detection models, especially for datasets > 2.5GB.
Specific areas of the application may have a limit lower than 10GB. Notably:
- Location AI (geospatial modeling) is limited to 100,000 rows and 500 numeric columns. Datasets that exceed those limits will run as regular AutoML modeling projects, but the Spatial Neighborhood Featurizer will not run (resulting in no geospatial-specific models).
- Out-of-time validation (OTV) modeling supports datasets up to 5GB.
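The limits above can be checked before uploading a dataset. The following is an illustrative sketch only: `DatasetShape` and `check_limits` are hypothetical helpers, not DataRobot functions, and treating 1GB as 1,000,000,000 bytes is a conservative assumption. The thresholds themselves are the ones stated in this section.

```python
from dataclasses import dataclass
from typing import List

# Assumption: treat 1GB conservatively as 10**9 bytes.
GB = 1000 ** 3

INGEST_MAX_BYTES = 10 * GB             # 10GB ingest option
OTV_MAX_BYTES = 5 * GB                 # out-of-time validation limit
LOCATION_AI_MAX_ROWS = 100_000         # Location AI row limit
LOCATION_AI_MAX_NUMERIC_COLS = 500     # Location AI numeric column limit


@dataclass
class DatasetShape:
    """Hypothetical summary of a dataset, gathered before upload."""
    size_bytes: int
    n_rows: int
    n_numeric_cols: int


def check_limits(
    shape: DatasetShape, *, location_ai: bool = False, otv: bool = False
) -> List[str]:
    """Return one message per documented limit that the dataset exceeds."""
    messages = []
    if shape.size_bytes > INGEST_MAX_BYTES:
        messages.append("Dataset exceeds the 10GB ingest limit.")
    if otv and shape.size_bytes > OTV_MAX_BYTES:
        messages.append("OTV modeling supports datasets up to 5GB.")
    if location_ai and (
        shape.n_rows > LOCATION_AI_MAX_ROWS
        or shape.n_numeric_cols > LOCATION_AI_MAX_NUMERIC_COLS
    ):
        messages.append(
            "Location AI limits exceeded: the project runs as regular AutoML "
            "and the Spatial Neighborhood Featurizer will not run."
        )
    return messages
```

For example, a 6GB dataset with 200,000 rows checked with both `location_ai=True` and `otv=True` returns two messages: one for the OTV size limit and one for the Location AI row limit.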