Import to DataRobot directly¶
This section describes detailed steps for importing data to DataRobot. Before you import data, review DataRobot's data guidelines to understand dataset requirements, including file formats and sizes.
This section assumes that your data sources are configured. If not, see the JDBC connection instructions on selecting a data connection and data source, as well as creating SQL queries.
To get started¶
The first step to building models is to import your data. To get started:
Create a new DataRobot project in either of the following ways:
- Sign in to DataRobot and click the DataRobot logo in the upper left corner.
- Open the Projects folder in the upper right corner and click the Create New Project link.
Once the new project page is open, select a method to import an acceptable file type to the page. (Accepted types are listed at the bottom of the screen.) If a particular upload method is disabled on your cluster, the corresponding ingest button will be grayed out.
Once you sign in to DataRobot, you can import data and start a project. The following are ways to import your data.
Some import methods need to be configured before users in your organization can use them, as noted in the following sections.
|Method|Description|
|---|---|
|Drag and drop|Drag a dataset into DataRobot to begin an upload.|
|Use an existing data source|Import from a configured data source.|
|Import a dataset from a URL|Specify a URL from which to import data.|
|Import local files|Browse to a local file and import.|
|Import files from S3|Upload from an AWS S3 bucket.|
|Import files from Google Cloud Storage|Import directly from Google Cloud.|
|Import files from Azure Blob Storage|Import directly from Azure Blob.|
|Import a dataset from HDFS|Import directly from Hadoop.|
A particular upload method may be disabled on your cluster, in which case a button for that method does not appear. Contact your system administrator for information about the configured import methods.
For larger datasets, DataRobot provides special handling that lets you view your data and select project options earlier.
Drag and drop¶
To use drag and drop, simply drag a file onto the app. Note, however, that when dropping large files (greater than 100MB) the upload process may hang. If that happens:
- Compress the file into a supported format and then try again.
- Save the file to a remote data store (e.g., S3) and use URL ingest, which is more reliable for large files. If security is a concern, use a temporary pre-signed S3 URL.
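As a rough illustration of the first workaround, the following Python sketch checks whether a file is over the 100MB mark and writes a gzip-compressed copy of it. It assumes gzip is among the supported formats listed in the data guidelines. The function names and the byte interpretation of "100MB" are illustrative and not part of DataRobot.

```python
import gzip
import shutil
from pathlib import Path

# Threshold from the text above; treating 1MB as 1024 * 1024 bytes is an assumption.
LARGE_FILE_BYTES = 100 * 1024 * 1024


def is_large(path: str, threshold: int = LARGE_FILE_BYTES) -> bool:
    """Return True when the file is bigger than the threshold in bytes."""
    return Path(path).stat().st_size > threshold


def gzip_file(path: str) -> str:
    """Write a gzip-compressed copy next to the original and return its path."""
    target = path + ".gz"
    with open(path, "rb") as src, gzip.open(target, "wb") as dst:
        # copyfileobj streams in chunks, so the whole file is never held in memory
        shutil.copyfileobj(src, dst)
    return target
```

For a hypothetical file named data.csv, calling gzip_file("data.csv") when is_large("data.csv") is true produces data.csv.gz, which can then be dragged onto the app in place of the original.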
Use an existing data source¶
You can use this method if you have already configured data sources. If not, see the JDBC connection instructions for details on selecting a data connection and data source, as well as creating SQL queries.
When DataRobot ingests data from a data source, it copies the selected database rows for use in the project.
To use an existing data source:
On the new project screen, click Data Source.
Select the desired data source and click Next.
Use saved credentials or enter new credentials for the configured database.
Click Save and sign in.
Import a dataset from a URL¶
To import data from a URL:
On the new project screen, click URL.
Enter the URL of the dataset.
The ability to import from Google Cloud, Azure Blob Storage, or S3 using a URL must be configured for your organization's installation.
Click Create New Project to create a new project.
Import local files¶
Instead of copying data to the client and then uploading it via the browser, you can specify the URL link as file:///local/file/location. DataRobot will then ingest the file from the network storage drive connected to the cluster. This import method needs to be configured for your organization's installation.
The ability to load locally mounted files directly into DataRobot is not available for Managed AI Cloud users.
Import files from S3¶
On-premises installations with this import method configured can ingest S3 files via a URL by specifying the link to S3 as s3://<bucket-name>/<file-name.csv> (instead of, for example, https://s3.amazonaws.com/bucket/file?AWSAccessKeyId...). This allows you to ingest files from S3 without making your objects and buckets public.
This method is disabled for Managed AI Cloud users. Instead, import S3 files using one of the following methods:
- Use an Amazon S3 data connection.
- Generate a pre-signed URL, which grants time-limited access to a private S3 object, and then use that URL to ingest the dataset.
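One way to produce a pre-signed URL is with the AWS SDK for Python (boto3). That choice is an assumption here, since this section does not prescribe a tool. The helper name, the one-hour default expiry, and the optional client argument are illustrative.

```python
def presigned_get_url(bucket: str, key: str, expires_in: int = 3600, client=None) -> str:
    """Return a time-limited HTTPS URL for reading one private S3 object."""
    if client is None:
        # boto3 is a third-party package; it is imported lazily so the helper
        # can also be exercised with a stand-in client object.
        import boto3
        client = boto3.client("s3")
    return client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,  # seconds until the URL stops working
    )
```

The returned URL can be supplied to the URL import method. Anyone holding the URL can read the object until it expires, so a short expiry is usually the safer choice.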
Import files from Google Cloud Storage¶
You can configure DataRobot to directly import files stored in Google Cloud Storage using the link gs://<bucket-name>/<file-name.csv>. This import method needs to be configured for your organization's installation.
The ability to import files using the gs://<bucket-name>/<file-name.csv> link is not available for Managed AI Cloud users.
Import files from Azure Blob Storage¶
It is possible to directly import files stored in Azure Blob Storage using the link azure_blob://<container-name>/<file-name.csv>. This import method needs to be configured for your organization's installation.
The ability to import files using the azure_blob://<container-name>/<file-name.csv> link is not available for Managed AI Cloud users.
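The link formats described in the preceding sections (file://, s3://, gs://, and azure_blob://) share the same shape, which the following sketch makes explicit by splitting a link into its import method, bucket or container, and path. The function, class, and labels are hypothetical helpers for illustration; they are not part of DataRobot and do not check whether a method is enabled on your installation.

```python
from typing import NamedTuple

# Schemes named in this section, mapped to the import method they belong to.
SCHEMES = {
    "file": "local file",
    "s3": "S3",
    "gs": "Google Cloud Storage",
    "azure_blob": "Azure Blob Storage",
    "http": "URL",
    "https": "URL",
}


class IngestLink(NamedTuple):
    method: str     # human-readable import method
    container: str  # bucket, container, or host; empty for file:// links
    path: str       # object name or file path


def parse_ingest_link(link: str) -> IngestLink:
    """Split a link such as s3://bucket/file.csv into its parts."""
    # urllib.parse is avoided on purpose: it does not treat "_" as a valid
    # scheme character, so azure_blob:// would not be recognized as a scheme.
    scheme, sep, rest = link.partition("://")
    if not sep or scheme not in SCHEMES:
        raise ValueError(f"unsupported link: {link!r}")
    container, _, path = rest.partition("/")
    if scheme == "file":
        # file:///local/file/location has an empty host; restore the leading slash.
        return IngestLink(SCHEMES[scheme], "", "/" + path)
    return IngestLink(SCHEMES[scheme], container, path)
```

For example, parse_ingest_link("gs://my-bucket/data.csv") yields the method "Google Cloud Storage", the bucket "my-bucket", and the path "data.csv", where the bucket and file names are placeholders.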
Import a dataset from HDFS¶
If your dataset is stored on a Hadoop Distributed File System, you can import it directly into the DataRobot application (DataRobot Hadoop users only). See the section on scalable ingest for information on downsampling files that are larger than the 10GB file size limit for project creation.
The HDFS server(s) hosting the file that you specify in the URL field must be both resolvable and reachable from the machines hosting the DataRobot application.
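A quick way to check both conditions is to run a small script on a machine that hosts the DataRobot application. The sketch below is a generic TCP check, not a DataRobot or Hadoop utility; the function name is illustrative, and the host and port you pass in depend on how your HDFS cluster is set up.

```python
import socket


def check_host(host: str, port: int, timeout: float = 3.0) -> dict:
    """Report whether host resolves and accepts TCP connections on port."""
    result = {"resolvable": False, "reachable": False}
    try:
        socket.getaddrinfo(host, port)  # name resolution only, no connection yet
        result["resolvable"] = True
    except socket.gaierror:
        return result  # cannot be reachable if the name does not resolve
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result["reachable"] = True
    except OSError:
        pass  # refused, timed out, or no route to the host
    return result
```

A result with "resolvable" true but "reachable" false usually points to a firewall or port issue rather than DNS.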
On the new project screen, click HDFS. The starting (root) directory used is configured as part of the DataRobot installation.
Specify the dataset you want to use by browsing to the file location and clicking the file name. You can also navigate in the HDFS browser by entering the path in the URL field. The contents shown in the file browser filter automatically as you type. The path completes in the URL field.
If you do not have permission to access the file, DataRobot displays the file name grayed out.
You can use the search feature to filter the list and display only file names matching your search criteria. Note that this only filters within the files and directories currently displayed; it does not perform a new search against HDFS.
Click Select to begin file import.
Project creation and analysis¶
After you select a data source and import your data, DataRobot creates a new project and performs an initial exploratory data analysis, known as EDA1. (See the section on "Fast EDA" to understand how DataRobot handles larger datasets.)
Progress messages indicate that the file is being processed.
When EDA1 completes, DataRobot displays the Start screen. From here you can scroll down or click the Explore link to view a data summary. You can also specify the target feature to use for predictions.
Once you're in the data section, you can:
Click View Raw Data (1) to display a modal presenting a random sample, up to 1MB, of the raw data table that DataRobot will use to build models.
Set your target (2) by hovering over a feature name in the data display.
Work with feature lists (3).
You can also view a histogram for each feature. The histogram provides several options for modifying the display to help explain the feature and its relationship to the dataset.
More information becomes available once you set a target feature and begin your model build, which is the next step.