Model building FAQ
Does DataRobot support feature transformations?
In AutoML, DataRobot automatically transforms features recognized as type “date” and adds the derived features to the modeling dataset. You can also create manual transformations and change a feature's variable type. For image datasets, the train-time image augmentation process creates new training images. The time series feature derivation process creates a new modeling dataset. Feature Discovery generates new features from multiple datasets, consolidating them into a single dataset. Alternatively, use a Spark SQL query from the AI Catalog to prepare a new dataset from a single dataset or to blend two or more datasets. Transformed features are marked with an info icon on the data page.
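To make the idea of date-derived features concrete, here is a minimal sketch in plain Python. It is an illustration only, not DataRobot's implementation: the function name `derive_date_features` and the particular set of derived fields are assumptions chosen for the example.

```python
from datetime import date


def derive_date_features(d: date) -> dict:
    """Derive a few simple numeric features from one date value.

    The chosen fields are illustrative assumptions, not the list
    of features DataRobot actually generates.
    """
    return {
        "year": d.year,
        "month": d.month,
        "day_of_month": d.day,
        "day_of_week": d.weekday(),  # Monday = 0, Sunday = 6
    }


features = derive_date_features(date(2024, 3, 15))
print(features)
```

Each derived value becomes its own column alongside the original date, which is why the modeling dataset ends up with more features than the uploaded data.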
Can I choose which optimization metric to use?
The optimization metric defines how to score models. DataRobot selects a metric best-suited for your data from a comprehensive set of choices, but also computes alternative metrics. After EDA1 completes, you can change the selection from the Advanced Options > Additional tab. After EDA2 completes, you can redisplay the Leaderboard listing based on a different computed metric.
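Re-sorting a leaderboard by a different computed metric can be pictured with the following sketch. The model names, scores, and the `rank_by` helper are all hypothetical; the only point is that some metrics rank lower values first (such as LogLoss) while others rank higher values first (such as AUC).

```python
# Hypothetical scores; model names and values are made up for illustration.
leaderboard = [
    {"model": "A", "LogLoss": 0.41, "AUC": 0.88},
    {"model": "B", "LogLoss": 0.39, "AUC": 0.86},
    {"model": "C", "LogLoss": 0.45, "AUC": 0.90},
]

# Direction of each metric: True means a smaller score is better.
LOWER_IS_BETTER = {"LogLoss": True, "AUC": False}


def rank_by(models, metric):
    """Return the models sorted best-first for the given metric."""
    return sorted(
        models,
        key=lambda m: m[metric],
        reverse=not LOWER_IS_BETTER[metric],
    )


by_logloss = [m["model"] for m in rank_by(leaderboard, "LogLoss")]
by_auc = [m["model"] for m in rank_by(leaderboard, "AUC")]
print(by_logloss, by_auc)
```

The same set of models can produce a different ordering under each metric, so the top model depends on the metric selected.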
Can I change the project type?
Once you enter a target feature, DataRobot automatically analyzes the training dataset, determines the project type (classification if the target has categories or regression if the target is numerical), and displays the distribution of the target feature. If the project is classified as regression and eligible for multiclass conversion, you can change the project to a classification project, and DataRobot will interpret values as classes instead of continuous values.
How do I control how to group or partition my data for model training?
By default, DataRobot splits your data into a 20% holdout (test) partition and an 80% cross-validation (training and validation) partition, which is divided into five sub-partitions. After loading data and selecting a target, you can change these values from the Advanced Options > Partitioning tab. From there, you can set the method, the sizes of the data partitions, the number of cross-validation partitions, and how those partitions are created.
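The default split described above can be sketched in a few lines of Python. This is a simplified illustration that assumes purely random assignment of rows; the `partition` function and its parameters are hypothetical and do not reflect how DataRobot assigns rows internally.

```python
import random


def partition(n_rows, holdout_fraction=0.2, n_folds=5, seed=0):
    """Split row indices into a holdout set and n_folds CV folds.

    Assumes random assignment; other partitioning methods would
    assign rows differently.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    n_holdout = round(n_rows * holdout_fraction)
    holdout = indices[:n_holdout]
    remaining = indices[n_holdout:]

    # Deal the remaining rows round-robin into the folds.
    folds = [remaining[i::n_folds] for i in range(n_folds)]
    return holdout, folds


holdout, folds = partition(100)
print(len(holdout), [len(f) for f in folds])
```

With 100 rows, 20 go to holdout and the remaining 80 are divided into five folds of 16 rows each. In cross-validation, each fold takes a turn as the validation set while the other four are used for training.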
What do the green "importance" bars represent on the Data tab?
The green Importance bars, based on "Alternating Conditional Expectations" (ACE) scores, show the degree to which a feature is correlated with the target. Importance has two components, Value and Normalized Value, and is calculated independently for each feature in the dataset.
Does DataRobot handle Natural Language Processing (NLP)?
When text fields are detected in your data, DataRobot automatically detects the language and applies appropriate preprocessing. This may include advanced tokenization, data cleaning (stop word removal, stemming, etc.), and vectorization methods. DataRobot supports n-gram matrix (bag-of-words, bag-of-characters) analysis as well as word embedding techniques such as Word2Vec and fastText with both CBOW and Skip-Gram learning methods. Additional capabilities include Naive Bayes SVM and cosine similarity analysis. For visualization, there are per-class word clouds for text analysis. You can see the applied language preprocessing steps in the model blueprint.
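As a rough picture of what an n-gram matrix is, the sketch below builds word bigrams and character bigrams by hand and assembles a small document-by-term count matrix. The tokenization (lowercasing and splitting on whitespace) and the helper names are assumptions made for the example; DataRobot's actual preprocessing is more involved.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return word-level n-grams using naive whitespace tokenization."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return character-level n-grams of the lowercased text."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


docs = ["the cat sat", "the cat ran"]

# Count word bigrams per document, then build a shared vocabulary.
bigram_counts = [Counter(word_ngrams(d, 2)) for d in docs]
vocab = sorted(set().union(*bigram_counts))

# One row per document, one column per bigram in the vocabulary.
matrix = [[counts[term] for term in vocab] for counts in bigram_counts]

print(vocab)
print(matrix)
print(char_ngrams("cat", 2))
```

Each row of `matrix` is the bag-of-words representation of one document. Using `char_ngrams` in place of `word_ngrams` would give the bag-of-characters variant.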
How do I restart a project with the same data?
If your data is stored in the AI Catalog, you can create and recreate projects from that dataset. To recreate a project—using either just the data or the data and the settings (i.e., to duplicate the project)—use the Actions menu in the project control center.
Do I have to use the UI or can I interact programmatically?
DataRobot provides both a UI and a REST API, which offer nearly matching functionality. Additionally, the Python and R clients provide a subset of what you can do with the full API.
Does DataRobot provide partner integrations?
DataRobot offers an Excel add-in, an Alteryx add-in, and a Tableau extension. A Snowflake integration allows joint users to execute Feature Discovery projects in DataRobot while performing computations in Snowflake to minimize data movement.
What is the difference between prediction and modeling servers?
Modeling servers power all model creation and analysis done from the UI and from the R and Python clients. Prediction servers are used solely for making predictions and handling prediction requests on deployed models.