# Data

> Data - How to manage data for machine learning, including importing and transforming data, and
> connecting to data sources.

This Markdown file sits beside the HTML page at the same path (with a `.md` suffix). It summarizes the topic and lists links for tools and LLM context.

Companion generated at `2026-04-24T16:03:56.540290+00:00` (UTC).

## Primary page

- [Data](https://docs.datarobot.com/en/docs/classic-ui/data/index.html): Full documentation for this topic (HTML).

## Sections on this page

- [How to build a feature store in DataRobot](https://docs.datarobot.com/en/docs/classic-ui/data/index.html#how-to-build-a-feature-store-in-datarobot): In-page section heading.
- [Feature considerations](https://docs.datarobot.com/en/docs/classic-ui/data/index.html#feature-considerations): In-page section heading.
- [General considerations](https://docs.datarobot.com/en/docs/classic-ui/data/index.html#general-considerations): In-page section heading.
- [10GB Cloud ingest](https://docs.datarobot.com/en/docs/classic-ui/data/index.html#10gb-cloud-ingest): In-page section heading.

## Related documentation

- [Classic UI documentation](https://docs.datarobot.com/en/docs/classic-ui/index.html): Linked from this page.
- [dataset requirements](https://docs.datarobot.com/en/docs/reference/data-ref/file-types.html): Linked from this page.
- [Connect to data sources](https://docs.datarobot.com/en/docs/classic-ui/data/connect-data/index.html): Linked from this page.
- [AI Catalog](https://docs.datarobot.com/en/docs/classic-ui/data/ai-catalog/index.html): Linked from this page.
- [Import data](https://docs.datarobot.com/en/docs/classic-ui/data/import-data/index.html): Linked from this page.
- [Transform data](https://docs.datarobot.com/en/docs/classic-ui/data/transform-data/index.html): Linked from this page.
- [Analyze data](https://docs.datarobot.com/en/docs/classic-ui/data/analyze-data/index.html): Linked from this page.
- [Data FAQ](https://docs.datarobot.com/en/docs/classic-ui/data/data-faq.html): Linked from this page.
- [Data Registry](https://docs.datarobot.com/en/docs/api/reference/sdk/data-registry.html): Linked from this page.
- [Credentials Management](https://docs.datarobot.com/en/docs/platform/acct-settings/stored-creds.html): Linked from this page.
- [Build wrangling recipes](https://docs.datarobot.com/en/docs/workbench/nxt-workbench/dataprep/wrangle-data/build-recipe/add-operation.html): Linked from this page.
- [scheduling it within the AI Catalog](https://docs.datarobot.com/en/docs/classic-ui/data/ai-catalog/snapshot.html): Linked from this page.
- [feature cache](https://docs.datarobot.com/en/docs/classic-ui/mlops/mlops-preview/safer-ft-cache.html): Linked from this page.
- [exploratory data insights (EDA)](https://docs.datarobot.com/en/docs/workbench/nxt-workbench/dataprep/explore-data/index.html#explore-data): Linked from this page.
- [jobs workshop](https://docs.datarobot.com/en/docs/workbench/nxt-registry/nxt-jobs-workshop/index.html): Linked from this page.
- [custom jobs](https://docs.datarobot.com/en/docs/workbench/nxt-registry/nxt-jobs-workshop/nxt-create-jobs/nxt-create-custom-job.html): Linked from this page.
- [here](https://docs.datarobot.com/en/docs/reference/pred-ai-ref/ts-reference/ts-consider.html): Linked from this page.

## Documentation content

# Data

Data integrity and quality are cornerstones for creating highly accurate predictive models. These sections describe the tools and visualizations DataRobot provides to ensure that your project doesn't suffer the "garbage in, garbage out" outcome.

See the associated [considerations](https://docs.datarobot.com/en/docs/classic-ui/data/index.html#feature-considerations) for important additional information. See also the [dataset requirements](https://docs.datarobot.com/en/docs/reference/data-ref/file-types.html).

| Topic | Description |
| --- | --- |
| Dataset requirements | Dataset requirements, data type definitions, file formats and encodings, and special column treatments. |
| Connect to data sources | Set up database connections and manage securely stored credentials for reuse when accessing secure data sources. |
| AI Catalog | Import data into the AI Catalog, where you can transform it using SQL and create and schedule snapshots of your data. Then, create a DataRobot project from a catalog asset. |
| Import data | Import data from a variety of sources. |
| Transform data | Transform primary datasets and perform Feature Discovery on multiple datasets. |
| Analyze data | Investigate data using reports and visualizations created after EDA1 and EDA2. |
| Data FAQ | A list of frequently asked data preparation and management questions with brief answers and links to more complete documentation. |

## How to build a feature store in DataRobot

Feature stores serve as a central repository where frequently used features are stored and organized for reuse and sharing. Using existing functionality, you can build a feature store in DataRobot.

- Feature storage: Connect to and add data from external data sources using the Data Registry and AI Catalog, along with saved credentials in Credentials Management.
- Feature transformations: Build wrangling recipes in Workbench to apply transformations to your data.
- Offline serving: Perform batch processing by using wrangler recipe SQL and scheduling it within the AI Catalog.
- Online serving: Perform real-time processing using the feature cache.
- Data monitoring: Monitor your data with Workbench exploratory data insights (EDA) or via the jobs workshop.
- Automation: Create custom jobs to implement automation.
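As a rough illustration of the feature storage step, the sketch below prepares (but does not send) a dataset upload request against the DataRobot public API. The `/api/v2/datasets/fromFile/` endpoint path, the `file` form field, and the token placeholder are assumptions based on the REST API's general shape; verify them against the API reference before use.

```python
import io

import requests

API_BASE = "https://app.datarobot.com/api/v2"  # adjust for self-managed installs
API_TOKEN = "YOUR_API_TOKEN"  # placeholder; real tokens come from your account settings


def build_catalog_upload(csv_bytes: bytes, filename: str) -> requests.PreparedRequest:
    """Prepare, without sending, a multipart dataset upload to the AI Catalog.

    The endpoint path and field names are assumptions; consult the
    DataRobot API documentation for the authoritative contract.
    """
    req = requests.Request(
        method="POST",
        url=f"{API_BASE}/datasets/fromFile/",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        files={"file": (filename, io.BytesIO(csv_bytes), "text/csv")},
    )
    return req.prepare()


prepared = build_catalog_upload(b"feature_1,feature_2\n0.1,0.2\n", "features.csv")
# Send later with requests.Session().send(prepared) once the token is real.
```

Preparing the request separately from sending it keeps the sketch runnable offline and makes the headers and URL easy to inspect before any credentials are involved.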

## Feature considerations

The following data-related considerations apply when working in DataRobot.

### General considerations

For non-time series projects (see time series considerations [here](https://docs.datarobot.com/en/docs/reference/pred-ai-ref/ts-reference/ts-consider.html)):

- Ingestion of XLSX files often does not work as well as using the corresponding CSV format. The XLSX format requires loading the entire file into RAM before processing can begin, which can cause RAM availability errors. Even when successful, performance is poorer than CSV (which can begin processing before the entire file is loaded). As a result, XLSX file size limits are suggested. For larger file sizes than those listed below, convert your Excel file to CSV for importing. See the dataset requirements for more information.
- When using the prediction API, there is a maximum 50MB body size limitation for real-time deployment prediction requests.
- If you make a real-time deployment prediction request with a body larger than 50MB (in either Dedicated or Serverless environments), it fails with the response `HTTP 413: Entity Too Large`.
- Exportable Java scoring code uses extra RAM during model building; therefore, dataset size should be less than 8GB.
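Because oversized bodies fail with HTTP 413, a client-side guard can reject a payload before it leaves the process. The helper below is a hypothetical sketch, not part of any DataRobot client; only the 50MB threshold comes from the limit above.

```python
MAX_BODY_BYTES = 50 * 1024 * 1024  # 50MB real-time prediction request limit


def check_prediction_payload(payload: bytes) -> None:
    """Raise before sending if the request body would exceed the 50MB limit."""
    size = len(payload)
    if size > MAX_BODY_BYTES:
        raise ValueError(
            f"Payload is {size / (1024 * 1024):.1f}MB; real-time requests "
            "are capped at 50MB. Split the rows into smaller batches."
        )


# A small CSV body passes silently:
check_prediction_payload(b"feature_1,feature_2\n0.1,0.2\n")
```

Splitting rows across several sub-50MB requests (or switching to batch predictions) is the usual workaround when a scoring payload trips this check.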

### 10GB Cloud ingest

> [!NOTE] Availability information
> The 10GB ingest option is only available for licensed users of the DataRobot Business Critical package and only available for AutoML (not time series) projects.

Consider the following when working with the 10GB ingest option for AutoML projects:

- Certain modeling activities may deliver less than 10GB availability, as described below.
- The capability is available for regression, binary classification, and multiclass AutoML projects.
- Project creation with datasets close to 10GB may take several hours, depending on the data structure and features enabled.

In some situations, depending on the data or the nature of the modeling activity, 10GB datasets can cause out-of-memory (OOM) errors. The following conditions have resulted in OOM errors during testing:

- Models built from the Repository; retry the model using a smaller sample size.
- Feature Impact insights; rerun the Feature Impact job using a smaller sample size.
- Using Advanced Tuning, particularly tunings that (a) add more trees to XGBoost/LGBM models or (b) run deep grid searches over many parameters.
- Retraining models at larger sample sizes.
- Multiclass projects with more than 5-10 classes.
- Feature Effects insight; try reducing the number of features.
- Anomaly detection models, especially for datasets > 2.5GB.
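The recurring advice above is to retry with a smaller sample size. One simple way to pick retry sizes is a halving schedule, sketched below as a hypothetical helper (not a DataRobot API):

```python
def retry_sample_sizes(start_rows: int, floor_rows: int = 10_000):
    """Yield successively halved sample sizes to retry an out-of-memory job with."""
    rows = start_rows
    while rows >= floor_rows:
        yield rows
        rows //= 2


# Starting from 1M rows with a 100k floor:
sizes = list(retry_sample_sizes(1_000_000, floor_rows=100_000))
# -> [1_000_000, 500_000, 250_000, 125_000]
```

Each retry halves the rows until a floor is reached, so an OOM at full size degrades gracefully rather than failing outright.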

Specific areas of the application may have a limit lower than 10GB. Notably:

- Location AI (geospatial modeling) is limited to 10,000,000 rows and 500 numeric columns. Datasets that exceed those limits will run as regular AutoML modeling projects but the Spatial Neighborhood Featurizer will not run (resulting in no geospatial-specific models).
- Out-of-time validation (OTV) modeling supports datasets up to 5GB.
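The size limits scattered through this page can be collected into a single pre-flight check. The sketch below hard-codes the documented thresholds (10GB AutoML ingest, 5GB OTV, 8GB for Java scoring code export); it is illustrative only, since effective limits depend on your license, deployment, and the modeling activity.

```python
GB = 1024 ** 3

# Documented thresholds from this page (Business Critical 10GB ingest).
LIMITS = {
    "automl_ingest": 10 * GB,  # 10GB Cloud ingest (AutoML only)
    "otv_modeling": 5 * GB,    # out-of-time validation projects
    "scoring_code": 8 * GB,    # exportable Java scoring code
}


def preflight(dataset_bytes: int, use_case: str) -> list[str]:
    """Return a warning for each documented limit the dataset exceeds."""
    warnings = []
    for name, limit in LIMITS.items():
        if use_case in (name, "all") and dataset_bytes > limit:
            warnings.append(f"{name}: dataset exceeds {limit // GB}GB limit")
    return warnings


# A 6GB dataset is fine for AutoML ingest and scoring code, but not OTV:
print(preflight(6 * GB, "all"))
```

Running the check before upload surfaces, for example, that a 6GB dataset is over the OTV ceiling while still within the AutoML ingest and scoring code limits.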
