Eureqa Models¶
The Eureqa Models tab provides access to model blueprints for Eureqa generalized additive models (Eureqa GAM), Eureqa regression, and Eureqa classification models. These blueprints use a proprietary Eureqa machine learning algorithm to construct models that balance predictive accuracy against complexity.
The Eureqa modeling algorithm is robust to noise and highly flexible, and performs well across a wide variety of datasets. Eureqa typically finds simple, easily interpretable models with exportable expressions that provide an accurate fit to your data.
Eureqa GAM blueprints, a Eureqa/XGBoost hybrid, are available for both regression and classification projects.
When DataRobot runs a Eureqa blueprint, the Eureqa algorithm tries millions of candidate models and selects a handful (of varying complexity) which represent the best fit to the data. From the Eureqa Models tab you can inspect and compare those models, and select one which best balances your requirements for complexity against predictive accuracy.
You can select one or more Eureqa GAM models to add to the Leaderboard for later deployment. Additionally, the ability to recreate Eureqa models enables you to fully reproduce their predictions outside of DataRobot. This is helpful for meeting requirements in regulated industries as well as for simplifying the steps to embed models in production software. Recreating a Eureqa model is as simple as copying and pasting the model expression to the target database or production environment. (Also, for GAM models only, parameters can be exported to recreate models.)
See the associated considerations for additional information.
Benefits of Eureqa models¶
There are a number of advantages to using Eureqa models:
- They return human-readable and interpretable analytic expressions, which are easily reviewed by subject matter experts.
- They are very good at feature selection because they are forced to reduce complexity during the model building process. For example, if the data had 20 different columns used to predict the target variable, the search for a simple expression would result in an expression that only uses the strongest predictors.
- They work well with small datasets, so they are very popular with scientific researchers who gather data from physical experiments that don’t produce massive amounts of data.
- They provide an easy way to incorporate domain knowledge. If you know the underlying relationship in the system that you're modeling, you can give Eureqa a "hint" (for example, the formula for heat transfer or how house prices work in a particular neighborhood) as a building block or a starting point to learn from. Eureqa will build machine learning corrections from there.
Build a Eureqa model¶
Eureqa models are run in full Autopilot, not Quick, but can always be accessed from the model Repository. (See the reference on when models are available based on modeling mode and project type.)
If you ran Quick mode and Eureqa models were not built, or you chose manual mode, you can create them from the Repository. For Comprehensive mode, all Eureqa models are created during Autopilot. Running a Eureqa blueprint creates a model of that name.
To run a blueprint:
- Upload your dataset and select a target, select the modeling mode, and click Start to begin the model building process. If you used Manual mode to start your project, you see this message:
- Click Repository in the message or select Repository from the menu to add a Eureqa blueprint. (Note that Autopilot mode automatically creates a Eureqa generalized additive model and makes it available from the Leaderboard.)
- In the search box in the Repository, type eureqa to filter the display. Click Add from the dropdown for each Eureqa model you want to create.
- When ready, click Run Tasks.
DataRobot then begins processing the selected model(s); you can follow the status in the Worker Queue. When the build completes, models are available from the Leaderboard.
Eureqa Models tab¶
To view details for a Eureqa model, select it from the Leaderboard and then select the Eureqa Models tab:
Display component | Description |
---|---|
1 Eureqa Model summary | Displays the Leaderboard model’s Eureqa complexity, Eureqa error, and model expression. |
2 Decimal rounding | Sets the number of decimal places to display for rounding in Eureqa constants. |
3 Models by Error vs Complexity chart | Plots model error against model complexity. |
4 Selected Model Detail | Displays the mathematical expression and plot for the selected model. |
5 Export link | Exports the Leaderboard model's preprocessing and parameter information to CSV (for GAM only). |
Note that the tab's graphs and other UI elements update periodically as DataRobot creates and selects additional candidate models.
Eureqa model summary¶
The model summary information DataRobot displays in this upper section represents information for the Leaderboard model. It includes complexity and error scores as well as a mathematical representation of the model (i.e., model expression) and access to model export (for GAM only).
Note
When customizing a Eureqa model to configure a prior solution (prior_solutions), for example, you copy the model expression content to the right of the equal sign. Also, when using the model expression for a target expression string (target_expression_string), make sure to replace the original variable name with Target. For example, in the screenshot above the target expression would be:
Target = High Cardinality and Text features Modeling + 1.23938372292399*sqrt(perc_alumni) + 0.031847155305945*Top25perc*log(Enroll) + 0.000123426619061881*Outstate*log(Accept) - 23.3747552223482 - 0.00203437584904968*Personal
The complexity score reports the complexity of this model, as represented in the Models by Error vs. Complexity chart. The "Eureqa error" value provides a mechanism for comparing Eureqa models. Once you have selected the best-suited model, you can move that model to the Leaderboard to compare it against other DataRobot models. The model expression denotes the mathematical functions representing the model. The Export link opens a dialog for downloading model preprocessing and parameter data. See this note on data partitioning and error metrics.
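The copy-and-rename steps described in the note above can be sketched as simple string handling. This is only an illustration: the helper names (`split_model_expression`, `to_prior_solution`, `to_target_expression`) and the target name `my_target` are hypothetical and are not part of any DataRobot API.

```python
def split_model_expression(model_expression):
    """Split 'name = rhs' into (name, rhs) at the first equal sign."""
    if "=" not in model_expression:
        raise ValueError("expected an expression of the form 'name = ...'")
    left, right = model_expression.split("=", 1)
    return left.strip(), right.strip()


def to_prior_solution(model_expression):
    """Return only the content to the right of the equal sign."""
    _, rhs = split_model_expression(model_expression)
    return rhs


def to_target_expression(model_expression):
    """Replace the original variable name on the left with 'Target'."""
    _, rhs = split_model_expression(model_expression)
    return f"Target = {rhs}"


# Hypothetical expression, standing in for one copied from the tab.
expression = "my_target = 1.5*sqrt(perc_alumni) - 2"
prior_solution = to_prior_solution(expression)
target_expression = to_target_expression(expression)
```

Here `prior_solution` holds the right-hand side only, while `target_expression` keeps the same right-hand side with `Target` on the left.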
Decimal rounding¶
To improve readability, DataRobot shows constants rounded to two decimal places by default. You can change the precision displayed from the Rounding dropdown. Changes to the display do not affect the underlying model.
Default display:
With all points displayed:
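As a rough illustration of display-only rounding, the sketch below rounds the decimal constants in an expression string while leaving the original string (and therefore the model) untouched. The function name `round_constants` and the regular expression are assumptions made for this example; they do not describe how DataRobot implements the Rounding dropdown.

```python
import re

# Matches decimal constants such as 1.23938372292399; digits inside
# feature names (for example Top25perc) have no decimal point and are skipped.
_CONSTANT = re.compile(r"\d+\.\d+")


def round_constants(expression, places=2):
    """Return a copy of the expression with constants rounded for display."""
    return _CONSTANT.sub(
        lambda match: f"{float(match.group()):.{places}f}", expression
    )


full = "Target = 1.23938372292399*sqrt(perc_alumni) - 23.3747552223482"
shown = round_constants(full)  # "Target = 1.24*sqrt(perc_alumni) - 23.37"
```

Because the function returns a new string, `full` still carries every digit for export or reuse.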
Models by Error vs. Complexity graph¶
The left panel of the Eureqa Model display plots model error against model complexity. Each point on the resulting graph (known as a Pareto front) represents a different model created by Eureqa. The color range for each point varies from red for the simplest and lowest accuracy model to blue for the most complex and accurate model.
The location of the Leaderboard entry (the “current model”) is indicated on the graph. Hover over any other point to display a tooltip reporting the model’s Eureqa complexity and Eureqa error. Clicking a model (point) updates the Selected Model Detail graph on the right with details for that model.
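To make the idea of a Pareto front concrete, the following minimal sketch keeps only the candidates for which no simpler model achieves an equal or lower error. The `(complexity, error)` tuples and the `pareto_front` helper are hypothetical; Eureqa's internal selection logic is proprietary and may differ.

```python
def pareto_front(models):
    """Return the non-dominated (complexity, error) pairs, simplest first.

    A model is kept only if its error is strictly lower than the error of
    every model with lower or equal complexity seen so far.
    """
    front = []
    best_error = float("inf")
    for complexity, error in sorted(models):
        if error < best_error:
            front.append((complexity, error))
            best_error = error
    return front


candidates = [(1, 0.9), (3, 0.5), (2, 0.95), (5, 0.5), (4, 0.3)]
front = pareto_front(candidates)  # [(1, 0.9), (3, 0.5), (4, 0.3)]
```

Each remaining point trades additional complexity for lower error, which is the trade-off the chart is meant to expose.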
Selected Model Detail graph¶
The Selected Model Detail graph reports, for the selected model, the complexity and error scores, as well as the mathematical representation of the model.
Clicking a model (point) on the Models by Error vs. Complexity graph updates the Selected Model Detail graph. Additionally, selecting a different model activates the Move to Leaderboard button. Once you click the button, DataRobot creates a new, additional Leaderboard entry for the selected model. Because DataRobot already built the model, no new computations are needed.
The contents of the graphing portion are dependent on whether you are working with a regression or classification problem.
For regression projects¶
The Selected Model Detail graph for regression problems displays a scatter plot fit to data for the selected model. Similar to the Lift Chart, the orange points in the Selected Model Detail graph show the target value across the data; the blue line graphs model predictions. To see output for a different model, select a new model in the Models by Error vs. Complexity graph to the left.
Interpret the graph as follows:
Component | Description |
---|---|
1 | Complexity values, error values, and model expression for the selected model. |
2 | Action to send the selected model to the Leaderboard. Because all available Eureqa models are built when first run, there is no additional processing necessary. |
3 | Tooltip displaying target and model values. |
4 | Dropdown to control row ordering along the X-axis. |
The Order by dropdown has several options, including:
- Row (default): rows are ordered in the same order as the original data
- Data Values: rows are ordered by the target values
- Model Values: rows are ordered by the model predictions
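The three orderings can be pictured as different sort keys over the same rows. The sketch below is illustrative only: `order_rows` is a hypothetical helper, and ascending order is an assumption, since the text above does not state the sort direction.

```python
def order_rows(target_values, model_values, order_by="row"):
    """Return row indices in the order they would appear along the X-axis."""
    indices = list(range(len(target_values)))
    if order_by == "row":        # original data order
        return indices
    if order_by == "data":       # ordered by target value (ascending assumed)
        return sorted(indices, key=lambda i: target_values[i])
    if order_by == "model":      # ordered by model prediction (ascending assumed)
        return sorted(indices, key=lambda i: model_values[i])
    raise ValueError(f"unknown ordering: {order_by}")


targets = [3.0, 1.0, 2.0]
predictions = [0.2, 0.9, 0.1]
by_data = order_rows(targets, predictions, "data")    # [1, 2, 0]
by_model = order_rows(targets, predictions, "model")  # [2, 0, 1]
```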
For classification projects¶
The Selected Model Detail graph for classification problems displays a distribution histogram—a confusion matrix—for the selected model. That is, it shows the percentage of model predictions that fall into each of n buckets, spaced evenly across the range of model predictions. For more information about understanding a confusion matrix, see a general description in the ROC Curve details.
The histogram displays all predicted values applicable to the selected model. To see output for a different model, select a new model (different point) in the Models by Error vs. Complexity graph.
Interpret the graph as follows:
Component | Description |
---|---|
1 | Complexity values, error values, and model expression for the selected model. |
2 | Action to send the selected model to the Leaderboard. Because all available Eureqa models are built when first run, there is no additional processing necessary. |
3 | Tooltip describing the content of the bucket, including total values, range of values, and breakdown of true/false counts. |
4 | Order by value for the rows along the X-axis. By default, rows are ordered by model predictions. |
The histogram displays a vertical threshold line (0.5 in the above example), dividing the plot into four regions. The top portion of the plot shows all rows where the target value was 1 while the bottom portion includes all rows where the target value was 0. All predictions to the left of the threshold were predicted false (negative); lower left represents correct predictions, upper left incorrect predictions. Values to the right of the threshold are predicted to be true (positive); upper right represents correct predictions, lower right incorrect predictions. Histogram counts are computed across the entire training dataset.
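The bucketing and the four regions can be sketched as follows. This is a simplified illustration, not DataRobot's implementation: the helper names and the default of 10 buckets are hypothetical, and treating a prediction exactly equal to the threshold as positive is an assumption.

```python
def bucket_counts(predictions, targets, n_buckets=10):
    """Count true (1) and false (0) targets in evenly spaced prediction buckets."""
    low, high = min(predictions), max(predictions)
    width = (high - low) / n_buckets
    if width == 0:
        width = 1.0  # all predictions identical: everything lands in bucket 0
    buckets = [{"true": 0, "false": 0} for _ in range(n_buckets)]
    for prediction, target in zip(predictions, targets):
        index = min(int((prediction - low) / width), n_buckets - 1)
        buckets[index]["true" if target == 1 else "false"] += 1
    return buckets


def threshold_regions(predictions, targets, threshold=0.5):
    """Count rows in each of the four regions around the threshold line."""
    regions = {"upper_left": 0, "upper_right": 0, "lower_left": 0, "lower_right": 0}
    for prediction, target in zip(predictions, targets):
        vertical = "upper" if target == 1 else "lower"            # top: target was 1
        horizontal = "right" if prediction >= threshold else "left"
        regions[f"{vertical}_{horizontal}"] += 1
    return regions


preds = [0.1, 0.4, 0.6, 0.9]
actual = [0, 1, 0, 1]
regions = threshold_regions(preds, actual)
```

In this example each region receives exactly one row: `lower_left` and `upper_right` hold the correct predictions, while `upper_left` and `lower_right` hold the incorrect ones.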
Export model parameters¶
Note
Although you can recreate GAM models using the Export button, consider that another simple way to recreate any GAM or non-GAM Eureqa model is by copying and pasting the model expression into the target environment directly (such as a SQL query, Python, Java, etc.).
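As a hedged sketch of what pasting an expression into Python might look like, the function below transcribes the example expression from the Eureqa model summary section. Several points are assumptions: the products between constants and features are taken to be multiplications, `log` is taken to be the natural logarithm, and the "High Cardinality and Text features Modeling" term is passed in as a separately computed number because its definition is not part of the expression. Verify all three against your own exported model before relying on the output.

```python
import math


def eureqa_prediction(row, text_and_high_cardinality_term=0.0):
    """Evaluate the example model expression for one row of feature values.

    `row` is a mapping of feature name to numeric value. The extra argument
    stands in for the 'High Cardinality and Text features Modeling' term.
    """
    return (
        text_and_high_cardinality_term
        + 1.23938372292399 * math.sqrt(row["perc_alumni"])
        + 0.031847155305945 * row["Top25perc"] * math.log(row["Enroll"])
        + 0.000123426619061881 * row["Outstate"] * math.log(row["Accept"])
        - 23.3747552223482
        - 0.00203437584904968 * row["Personal"]
    )


# Made-up feature values, for illustration only.
example_row = {
    "perc_alumni": 16, "Top25perc": 50, "Enroll": 700,
    "Outstate": 9000, "Accept": 1200, "Personal": 1500,
}
prediction = eureqa_prediction(example_row)
```

The same arithmetic translates directly to a SQL `SELECT` expression or a Java method, which is what makes the expression portable.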
The Export button opens a window allowing you to download the Eureqa preprocessing and parameter table for the selected Leaderboard entry. This export provides all the information necessary to recreate the GAM model outside of DataRobot. Interpret the output in the same way as you would the export available from the Coefficients tab (with GAM-specific information here), with the following differences:
- The first section of output shows the Eureqa model formula. This is the mathematical equation (beginning with Target=...) displayed at the top of the Eureqa Models tab.
- The second section displays the DataRobot preprocessing parameters for each feature used in the model, which includes parameters for one or two input transformations (e.g., standardization). With Eureqa models, the Coefficient field is set to 0 when there are no text or high-cardinality features. “Coefficient” is used in linear models to denote the column’s linearly-fit coefficient.
- Eureqa model parameters can be exported to .csv format only (.png and .zip options are not selectable here).
More info...¶
With traditional DataRobot model building, data is split into training, validation, and holdout sets. Eureqa, by contrast, uses the training DataRobot split and then, to compute the Eureqa error, further splits that set using its own internal training/validation splitting logic.
Model availability¶
The following table describes the conditions under which Eureqa models for AutoML and time series projects are available in Autopilot and the Repository.
Eureqa model type | Autopilot | Repository |
---|---|---|
AutoML projects | ||
Regressor/Classifier | | |
GAM | | |
Time series projects | ||
Regressor/Classifier | | No restrictions |
GAM | | No restrictions |
Eureqa With Forecast Distance Modeling | N/A | |
Number of generations¶
The following table describes the number of generations performed, based on blueprint selected. Generation values are reflected in the blueprint name.
Eureqa model type | Autopilot generations | Repository generations |
---|---|---|
AutoML projects * | ||
Regressor/Classifier | 250 | 40, 250, or 3000 |
GAM | Dynamic* | 40, dynamic*, or 10,000 |
Time series projects | ||
Regressor/Classifier | 250 | 40 or 3000 |
GAM | 250 | 40, 250, dynamic* |
Eureqa With Forecast Distance Modeling (one model per forecast distance) | N/A | Number of generations is determined by the Advanced Tuning task_size parameter. Default is medium (1000 generations). |
* The dynamic option for the number of generations is based on the number of rows in the dataset. The value will be between 1000 and 3000 generations.
Eureqa and stacked predictions¶
Because it would be too computationally "expensive" to do so, Eureqa blueprints don't support stacked predictions. Most models use stacking to generate predictions on the data that was used to create the project. When you generate Eureqa predictions on the training data, all predictions will come from a single Eureqa model, not from stacking.
This means the Eureqa error isn't exactly the error on the data; it's the error on a filtered version of the data. This explains why the reported Eureqa error can be lower than the Leaderboard error when the error metrics are the same. You cannot change the Eureqa error metric, although you can change the DataRobot optimization metric (the value DataRobot uses to rank models on the Leaderboard).
The following lists differences from non-Eureqa modeling due to lack of stacked predictions:
- In AutoML, blenders that train on predictions (for example, GLM or ENET) are disabled. Other blenders are available (such as AVG or MED).
- Validation and Cross-Validation scores are hidden for Eureqa and Eureqa GAM models trained into Validation and/or Holdout.
- Downloading predictions on training data is disabled.
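The difference between stacked predictions and single-model predictions on training data can be illustrated with a toy "model" that simply predicts the mean of the targets it was trained on. This is a conceptual sketch only: the fold assignment (row index modulo the number of folds) and the helper names are assumptions, not DataRobot's partitioning logic.

```python
def stacked_predictions(targets, n_folds):
    """Out-of-fold predictions: each row is scored by a model that never saw it."""
    predictions = [None] * len(targets)
    for fold in range(n_folds):
        held_out = [i for i in range(len(targets)) if i % n_folds == fold]
        training = [targets[i] for i in range(len(targets)) if i % n_folds != fold]
        fold_model = sum(training) / len(training)  # toy model: mean of training targets
        for i in held_out:
            predictions[i] = fold_model
    return predictions


def single_model_predictions(targets):
    """All rows scored by one model trained on all rows (the Eureqa case)."""
    model = sum(targets) / len(targets)
    return [model] * len(targets)


targets = [1.0, 2.0, 3.0, 4.0]
stacked = stacked_predictions(targets, n_folds=2)  # [3.0, 2.0, 3.0, 2.0]
single = single_model_predictions(targets)         # [2.5, 2.5, 2.5, 2.5]
```

Stacking requires one extra model fit per fold, which is the computational cost the section refers to.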
Model training process¶
When training a Eureqa model, DataRobot executes either a new solution search or a refit:
- New solution search: The Eureqa evolution process does a complete search, looking for a new set of solutions. This mechanism is slower than refitting.
- Refit: Eureqa refits coefficients of the linear components. In other words, it takes the target expression from the existing solution, extracts linear components, and refits its coefficients using all the training data.
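A refit can be pictured as keeping the shape of an existing solution fixed and re-estimating only its linear coefficients. The sketch below does this for a hypothetical solution of the form `a*sqrt(x) + b` using ordinary least squares; the choice of least squares and the function name are assumptions for illustration, since the fitting method Eureqa uses is not described here.

```python
import math


def refit_linear_coefficients(xs, ys, term=math.sqrt):
    """Refit a and b in a*term(x) + b, leaving the nonlinear term unchanged."""
    features = [term(x) for x in xs]
    n = len(features)
    mean_f = sum(features) / n
    mean_y = sum(ys) / n
    covariance = sum((f - mean_f) * (y - mean_y) for f, y in zip(features, ys))
    variance = sum((f - mean_f) ** 2 for f in features)
    if variance == 0:
        raise ValueError("the term is constant on this data; slope is undefined")
    slope = covariance / variance
    intercept = mean_y - slope * mean_f
    return slope, intercept


# Data generated from 2*sqrt(x) + 1, so the refit should recover a=2, b=1.
xs = [1, 4, 9, 16]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = refit_linear_coefficients(xs, ys)
```

No search over expression structure happens here, which is why a refit is much cheaper than a new solution search.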
The following table describes, for each Eureqa model type, training behavior for validation/backtesting and frozen runs:
Model type | Backtesting/Cross-Validation | Frozen run |
---|---|---|
Eureqa Regressor/Classifier | Refits coefficients of existing solutions from the model trained on the first fold. | Refits coefficients of existing solutions from the parent model. |
Eureqa GAM* | Refits coefficients of existing solutions from the model trained on the first fold. | Freezes XGBoost hyperparameters; performs new solution search for Eureqa second-stage models. |
Eureqa with Forecast Distance Modeling (selects the best solution—per strategy—for each forecast distance) | Performs a new solution search. | Performs a new solution search with fixed Eureqa building blocks. |
* Eureqa GAM consists of two stages—first stage is XGBoost, second stage is Eureqa approximating the XGBoost model but trained on a subset of the training data.
Deterministic modeling¶
Like other DataRobot models, Eureqa's model-generation process is deterministic: if you run Eureqa twice against the same data, with the same configuration arguments, you will get the same model—same error, same complexity, same model equation. Because of Eureqa's unique model-generation process, if you make a very small change in its inputs, such as removing a single row or changing a tuning parameter slightly, it's possible that you will get a very different model equation.
Note
If the sync_migrations Advanced Tuning parameter is set to False, then Eureqa's model-generation process will be non-deterministic. If this is the case, DataRobot may identify good Eureqa models more quickly (though this isn't guaranteed), and it will better utilize all available CPUs.
Tune with error metrics¶
The metric used by Eureqa for Eureqa GAM (Mean Absolute Error) is a "surrogate" error, as the Eureqa GAM blueprint runs Eureqa on the output of XGBoost. It measures how well Eureqa could reproduce the raw output of XGBoost. For regression, you can change the loss function used in XGBoost in the advanced option but you cannot change the Eureqa error metric. You can also change the DataRobot optimization metric (the value DataRobot uses to rank models on the Leaderboard). This tuning affects the tuning of XGBoost and the default choice of XGBoost loss function, and leads to different results for Eureqa GAM.
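The surrogate error can be thought of as a Mean Absolute Error computed against the first-stage output rather than against the actual target. In the minimal sketch below, `xgboost_raw_output` and `eureqa_output` are hypothetical lists of numbers standing in for the two stages' outputs on the same rows.

```python
def mean_absolute_error(reference, approximation):
    """Average absolute difference between two equally long sequences."""
    if len(reference) != len(approximation):
        raise ValueError("sequences must have the same length")
    return sum(abs(r - a) for r, a in zip(reference, approximation)) / len(reference)


xgboost_raw_output = [1.0, 2.0, 3.0]  # what the first stage produced
eureqa_output = [1.5, 2.0, 2.0]       # what the Eureqa expression produced
surrogate_error = mean_absolute_error(xgboost_raw_output, eureqa_output)  # 0.5
```

A low surrogate error means the expression tracks XGBoost closely; it does not by itself say how close either stage is to the true target.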
Advanced Tuning parameters¶
You can tune your Eureqa models by modifying building blocks, customizing the target expression, and modifying other model parameters, such as support for building blocks, error metrics, row weighting, and data splitting. Eureqa models use expressions to represent mathematical relationships and transformations.
See the reference guide to Eureqa's Advanced Tuning options for more information.
Feature considerations¶
The following considerations apply to working with both GAM and general Eureqa models and for working with Eureqa models in time series projects, specifically.
Note
Eureqa model blueprints are deterministic only if the number of cores in the training and validation environments is kept constant. If the configurations differ, the resulting Eureqa blueprints produce different results.
- There is no support for multiclass modeling.
- Cross-validation can only be run from the Leaderboard (not from the Repository).
- For legacy Eureqa SaaS product users, accuracy may be comparatively reduced due to fewer cores. (Legacy users can contact their DataRobot representative to discuss options for addressing this.)
- Eureqa Scoring Code is available for both AutoML and time series projects. With time series, Scoring Code is supported for Eureqa regression and Eureqa GAMs only (no classification).
- There is no support for offsets with time series models.