Import datasets

Importing data into Data Prep is the first step to getting your data ready for machine learning. During the import process, you can:

  • Select multiple datasets from a variety of data sources.
  • Combine datasets together into one dataset.
  • Choose which columns in a dataset to import.
  • Select extensionless files.
  • Import datasets from a zipped or compressed folder.
  • Change the format used to analyze and structure your data.

Using the Import page

After you select a dataset for import, the page divides into four quadrants called panes.

Following is an overview of each pane of the Import page.

Element | Description
Select Datasets pane | Select the datasets you want to import from this pane. You can:
  • Select multiple datasets from local files and connected data sources.
  • Search and query connected data sources for datasets.
  • Combine multiple datasets into one glob for import.
You selected pane | After you select a dataset, it is listed in this pane. You can:
  • See a list of datasets that have been selected for import.
  • Select a dataset to preview or update the import options.
  • Quickly identify datasets with potential import errors.
  • Change the format used to analyze and structure your data for import.
  • Import the same dataset multiple times with different import options.
Your options pane | In many cases, your data will import easily into Data Prep. In some cases, you may need or want to adjust the import options. This is where you make those adjustments.
Preview pane | This is where you preview your data. As you select datasets from the You selected pane, update the format, or change import options, the Preview pane displays how the selected dataset will look once imported. From here, you can also choose which columns to import.

Snapshot of the import process

Following is a quick snapshot of how to import your datasets into Data Prep.

  1. On the Library page, click Import.

  2. On the Import page, you can select datasets, search for datasets, or combine datasets.

  3. Check the preview of the dataset. Does your data look correct?

    • If your data is correct, you can continue adding datasets until you have selected all your datasets for import.
    • If your data is incorrect, try adjusting the import settings.
  4. Click Finish.

    Your data is imported as a dataset and ready to be prepped in a project.

Select datasets

You can import datasets from a local file or a connected data source. This section provides more detail on how to select one or more datasets for import.

Select Datasets pane

Following is an overview of the elements of the Select Datasets pane.

Element | Description
Data Source options | Maybe you need to import a dataset from Amazon S3, Hadoop, JDBC, or some other data source. Maybe you just want to import a spreadsheet you saved on your computer. Either way, this is where you do it.
  The Data Source list lets you select a configured data source. Your administrator must connect to the data source before you can access it.
  Click Upload local file to select a dataset from your computer.
Search | When you want to find a specific dataset or a group of similar datasets, enter search criteria. The Search field accepts wildcard characters, which help you find specific and similarly named datasets. See Search for datasets for details.
Datasets Lists pane | The contents of the selected data source are listed here. For example, a data source might contain six items: one comma-separated value (CSV) file and five Excel files.
Select | When you see a dataset you want to import, click Select. The dataset is listed in the You selected pane and will be imported when you click Finish.

Select datasets from local files

To select a dataset from a file on your computer or shared network drive:

  1. Click the Upload File pane and select a dataset, or drag a file to the pane.

    The dataset is added to the list in the You selected pane. Data Prep displays the Your options pane for the dataset and a preview of the dataset.

  2. To add more datasets, click any additional dataset you want to include in the import.

    The additional datasets are added to the list in the You selected pane.

Select datasets from a data source

To select a dataset from a connected data source:

  1. Click Select Data Source and choose the data source you want to use.

  2. Locate the dataset you want to import.

    To locate your dataset using search, see Search for datasets.

  3. To select a dataset, click Select.

    The dataset is added to the list in the You selected pane. Data Prep displays the Your options pane for the dataset and a preview of the dataset.

  4. To add more datasets from the currently selected data source, click any additional dataset you want to include in the import.

  5. To add more datasets from a different data source, repeat steps 1 through 3 for each data source.

The additional datasets are added to the datasets list in the You selected pane.

Search for datasets

You can search for datasets by typing the name of the dataset or entering a query string. The search is case sensitive, and only results that exactly match your search criteria are returned. You can use wildcard characters to locate a dataset when you aren’t sure of the exact name, or to locate similarly named datasets.

Search for a dataset

To search for a dataset:

  1. Select a data source and click the Search icon on the top right of the Select Datasets pane.

    The Search icon appears only when you select a data source, not when you upload a local file.

  2. In the Wildcard Search field, type your search criteria.

    Datasets that match your search criteria exactly are returned.

    See Wildcard characters for help setting search criteria.

Query a database

To query a database:

  1. Click Select Data Source and choose the data source you wish to use.
  2. Click Create Query on the bottom right of the Select Datasets pane.
  3. In the Query String field, type your search criteria.

    To search with wildcard characters, see Wildcard characters.

    Datasets that match your search criteria exactly are returned.

Wildcard characters

Following are the wildcard characters you can use to search for datasets.

Character | Matches
* | Any number of characters, including none
? | A single character
[0-9] or [a-z] | A character in the range given in the bracket
[123] or [abc] | A character listed inside the bracket

Example searches using wildcards

Here are some example searches and the results:

Search example | Returns
* | All datasets
*.csv | Datasets with a ‘.csv’ file extension
a?b.csv | Datasets named ‘a’, then any single character, then ‘b.csv’, such as aab.csv, abb.csv, and azb.csv
a*z.csv | Datasets that begin with a lowercase ‘a’ and end with ‘z.csv’, regardless of which or how many characters come between
a[0-9].csv | Datasets that are named a0.csv, a1.csv, a2.csv, …, a9.csv
a[a-z].csv | Datasets that are named aa.csv, ab.csv, …, az.csv
a[abc].csv | Datasets that are named aa.csv, ab.csv, ac.csv
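
The wildcard forms listed above are the same ones used by Unix shell-style pattern matching, so you can try them out with Python's standard fnmatch module. The following is a minimal sketch, not Data Prep's own search code: the dataset names are hypothetical, and it assumes that the whole name must match and that matching is case sensitive, as described in Search for datasets.

```python
from fnmatch import fnmatchcase

# Hypothetical dataset names, for illustration only.
names = ["a0.csv", "aa.csv", "ab.csv", "abb.csv", "a1b.csv",
         "abcz.csv", "AA.csv", "sales.xlsx"]

def search(pattern, candidates):
    """Return the names that match the whole pattern, case-sensitively."""
    return [name for name in candidates if fnmatchcase(name, pattern)]

print(search("a?b.csv", names))     # ['abb.csv', 'a1b.csv']
print(search("a*z.csv", names))     # ['abcz.csv']
print(search("a[0-9].csv", names))  # ['a0.csv']
print(search("a[abc].csv", names))  # ['aa.csv', 'ab.csv']
```

Because the match is case sensitive, AA.csv is not returned for any pattern that begins with a lowercase ‘a’.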

Combine datasets

Data Prep can combine multiple datasets into one glob to be imported. A glob is the result of appending multiple datasets into one dataset during import. This section provides more information on how to glob multiple datasets together prior to import.

Guidelines for combining datasets

Following are some guidelines that help make globbing datasets a success.

  • Datasets can only be globbed from the same data source.
  • Datasets can only be globbed through a wildcard search.
  • Datasets being globbed together should have the same structure (number of columns and type of data).
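
To make the last guideline concrete, here is a minimal Python sketch of appending datasets that share one structure. It is an illustration only and does not show how Data Prep performs the import; the function name append_datasets, the file names, and the data are all hypothetical.

```python
import csv
import io

def append_datasets(sources):
    """Append CSV texts that share one header into a single list of rows.

    `sources` maps a dataset name to its CSV text (a hypothetical input shape).
    """
    header = None
    rows = []
    for name, text in sources.items():
        reader = csv.reader(io.StringIO(text))
        file_header = next(reader)
        if header is None:
            header = file_header          # the first dataset sets the structure
        elif file_header != header:
            raise ValueError(f"{name} has a different column structure: {file_header}")
        rows.extend(reader)               # append the remaining data rows
    return header, rows

sources = {
    "sales_2020.csv": "region,amount\nEast,10\nWest,20\n",
    "sales_2021.csv": "region,amount\nEast,30\n",
}
header, rows = append_datasets(sources)
print(header)  # ['region', 'amount']
print(rows)    # [['East', '10'], ['West', '20'], ['East', '30']]
```

The sketch rejects a dataset whose header differs, which is a stricter choice than the guideline requires; it is meant only to show why matching structure matters when rows are appended.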

Data sources that support globbing

For a list of data sources and file formats that are supported for globbing, review the Platform Support matrix in the latest Data Prep Release Notes.

Note

To access the Release Notes, you will need the following:

  • Access to the internet.
  • A Data Prep Customer Account.
If you do not have a Data Prep Customer Account, contact DataRobot Support.

Create a glob

To combine multiple datasets into one glob:

  1. Click Select Data Source and select the data source.

  2. Use search to locate the datasets you want to combine.

  3. Click Combine All Results.

    The datasets are combined into one glob. The glob is added to the datasets list in the You selected pane. The name of the glob defaults to the search criteria. Data Prep displays the Your options pane for the glob and a preview of the glob.

Preview a dataset before import

To change the dataset in the preview, from the You selected pane, click the dataset you want to preview.

The Preview pane displays the selected dataset.

By default, Data Prep displays a preview of the last selected dataset.

Add a dataset again

During import, there might be times when you need to apply different import options to the same dataset. This is especially true when you need to import more than one Excel worksheet from the same Excel file.

To add a dataset with different import options:

  1. From the You selected pane, click the More button (three vertical dots) of the dataset you want to add again.

  2. Click Add Again.

    The dataset is added to the list in the You selected pane.

  3. Adjust the import settings as needed.

Adjust import settings

Once a dataset is selected, Data Prep analyzes your data to determine the right settings for the best results. But data isn’t a one-size-fits-all kind of thing. Sometimes, you need to tweak the settings to get them just right. This section provides information on how to adjust some of the more universal settings of a dataset prior to import. For specific information about a setting, hover your cursor over the help tip (question mark) button.

Following are a few of the more common, basic settings you can adjust:

Action | Steps
Add a tag. | In the Your options pane, type or select the tag from the Tags list.
Add a column to show the source file lineage. | In the Your options pane, toggle the Add column to show source file button.
  The new Source File column is added to the end of the dataset showing the path of the source file for every imported row.
Change the format of the dataset. | In the You selected pane, select the format you want to apply to the dataset from the Format menu. See Supported Formats for more information.
Change the name of a dataset. | In the Your options pane, type the new name in the Name field. The dataset name is updated in the You selected pane.
Exclude columns from import. | In the Preview pane:
  1. Click Edit columns.
  2. Deselect the columns you don’t want to import.
  3. Click Show preview.
  The deselected columns are removed from the Preview.
Import additional worksheets from the same Excel file. | For each additional worksheet:
  1. In the You selected pane, add the Excel file again.
  2. In the Your options pane, select the worksheet to import from the Worksheet menu.
Rearrange the columns. | In the Preview pane:
  1. Click Edit columns.
  2. Click the up arrow or down arrow until the column is in the position you want.
  3. Click Show preview.
Rename a column. | In the Preview pane:
  1. Click Edit columns.
  2. Click Edit (pencil icon) and type the new name.
  3. Click Show preview.

Supported Formats

For file-based connectors, the common formats are listed in the following table. Data Prep's Intelligent Ingest identifies the format of the file by looking into the contents of the file instead of relying on the file extension. Even if your file does not have an extension or has an incorrect extension, Data Prep correctly identifies the format.

Common format | Import support for wildcards and globbing
Delimited files (CSV, TSV, etc.) | Yes
Fixed-width column data | Yes
JSON | Yes
XML | Yes
Apache Avro | Yes
Microsoft Excel (XLS, XLSX) | No. See Wildcard characters and Guidelines for combining datasets.
SAS BDAT | Yes

Data Prep supports the import of compressed files in one of the following formats: Deflate, LZ4, Snappy, ZIP, Gzip, or Bzip. In general, the decompressed file must be a common format as listed in the previous table.

Additionally, connectors that support Parquet files also support compressed versions of Parquet files.
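
As a rough illustration of identifying a format from a file's contents rather than its extension, the following Python sketch checks well-known leading bytes and then falls back to simple text heuristics. It is not Intelligent Ingest and makes no claim about how Data Prep detects formats; the function names guess_format and guess_delimiter, the candidate delimiters, and the sample inputs are all assumptions made for this example.

```python
# Leading "magic" bytes of some well-known formats.
MAGIC = [
    (b"\x1f\x8b", "Gzip"),
    (b"PK\x03\x04", "ZIP"),
    (b"BZh", "Bzip2"),
    (b"Obj\x01", "Apache Avro"),
    (b"PAR1", "Parquet"),
]

def guess_delimiter(text):
    """Return a delimiter that occurs equally often on the first few lines, if any."""
    lines = [line for line in text.splitlines()[:5] if line]
    for delim in (",", "\t", ";", "|"):
        counts = {line.count(delim) for line in lines}
        if len(counts) == 1 and counts.pop() > 0:
            return delim
    return None

def guess_format(data: bytes) -> str:
    """Guess a format name from the first bytes of a file (hypothetical helper)."""
    for magic, name in MAGIC:
        if data.startswith(magic):
            return name
    # Not a known binary signature: inspect the start of the file as text.
    text = data[:4096].decode("utf-8", errors="replace").lstrip()
    if text.startswith("<"):
        return "XML"
    if text.startswith(("{", "[")):
        return "JSON"
    if guess_delimiter(text) is not None:
        return "Delimited"
    return "Unknown"

print(guess_format(b"\x1f\x8b\x08\x00"))            # Gzip
print(guess_format(b'{"region": "East"}'))          # JSON
print(guess_format(b"region,amount\nEast,10\n"))    # Delimited
```

None of these checks look at the file name, which is why a file with a missing or misleading extension is classified the same way. A real detector would need many more cases; for example, an XLSX workbook is itself a ZIP archive and would need a further look inside.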

Note

When importing a ZIP file that contains multiple files, the largest file in the compressed set is automatically identified and selected for import to the Library.
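
If you want to check in advance which file in a multi-file ZIP is the largest, a short script with Python's standard zipfile module can do it. This is a sketch under one assumption that the note above does not confirm: that "largest" refers to uncompressed size. The function name largest_member and the file names are hypothetical.

```python
import io
import zipfile

def largest_member(zip_bytes: bytes) -> str:
    """Return the name of the largest file (by uncompressed size) in a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        files = [info for info in archive.infolist() if not info.is_dir()]
        return max(files, key=lambda info: info.file_size).filename

# Build a small in-memory archive with hypothetical file names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("small.csv", "id\n1\n")
    archive.writestr("large.csv", "id\n" + "1\n" * 100)

print(largest_member(buffer.getvalue()))  # large.csv
```

To rank by compressed size instead, replace file_size with compress_size in the key function.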


Updated October 28, 2021