Tune Eureqa models¶
You can customize Eureqa models by modifying various Advanced Tuning parameters and creating custom target expressions. Parameters you can adjust for your models include building blocks, target expressions, error metrics, row weighting, and prior solutions. Additionally, you can customize how DataRobot partitions data for the Eureqa model.
Eureqa model expressions use building blocks (discrete sets of mathematical functions) for combining variables and creating new features from a dataset. Building blocks range from simple arithmetic functions (addition, subtraction) to complex functions (logistic or gaussian) and more.
DataRobot creates Eureqa models using default sets of building blocks for preset problem types; however, certain problems may require different sets of building blocks. Advanced users dealing with systems that already have known or expected behavior may want to encourage certain model structures in DataRobot. For example, if you think that seasonality or some other cyclical trend may be a factor in your data, including the building blocks sin(x) and cos(x) will let DataRobot know to test those types of interactions against the data.
See Configuring building blocks for information on selecting building blocks for the target expression.
Building block complexity¶
Complexity settings are additional weights DataRobot can apply to specific building blocks and terms to penalize related aspects of a given model. Changing the complexity given to certain building blocks or terms will affect which models will appear on the pareto frontier (in the Eureqa Models tab) with the focus on finding the simplest possible models that achieve increasing levels of accuracy.
The default complexity settings typically work well; however, if you have prior knowledge of the system you are trying to model, you may want to modify those settings. If there are particular building blocks that you know will be, or expect to be, part of a solution that accurately captures the core dynamics of a system, you might lower the complexity values of those building blocks to make it more likely that they will appear in the related Eureqa models. Similarly, if there are building blocks that you don't want to appear unless they significantly improve the fit of the models, you might raise the complexity values of those building blocks.
See Setting building block complexity for more information.
The target expression tells DataRobot how to create the Eureqa model. Target expressions are comprised of variables that exist in your dataset and mathematical "building blocks". DataRobot creates the default target expression for a model using the selected target variable modeled as a function of all input variables.
Here's an example default target expression:
You can customize the expression (model formula) to specify the type of relationship you want to model and incorporate your domain expertise of the fundamental behavior of the system. Complex expressions are possible and give you the power to tune for complex relationships, including: differential equations, polynomial equations, and binary classification.
See Customizing target expressions for more information.
DataRobot uses error metrics to guide how the quality of potential solutions is assessed. Each Eureqa model has default error metrics settings; however, advanced users can choose to optimize for different error metrics. Changing the error metric will change how DataRobot optimizes the solutions.
See Configuring error metrics for more information.
You can designate one of your variables as an indicator of how much relative weight (i.e., importance) you want Eureqa to give to the data in each row. For example, if the designated row weight variable has a value of 10 in the first row and 20 in the second row, data in the second row will be given twice the weight of the data in the first row when Eureqa is calculating how well a model performs on the data. Row weight can be specified by using a row weight variable or by using a row weight expression.
See Configuring row weighting blocks for more information.
Prior solutions "seed" DataRobot with solutions or partial solutions that express relationships that you believe will play some role in an eventual solution. Entering prior solutions for a Eureqa model may speed search performance by initializing that model with known information. The Prior Solutions parameter, prior_solutions, is available within the Prediction Model Parameters and can be specified as part of tuning your Eureqa models. You can specify multiple expressions, one per line, where each expression is a valid target expression (such as from a previous Eureqa model).
The following shows an example of two prior solutions (expressions), (x - 1) and sin(2 * x), set for a model:
If you have entered a custom target expression that uses multiple functions (as explained here), enter a sub-expression for each function. Each f() is listed with its sub-expression, separated by a comma, on the same line. For example, if the expression contains two functions, such as Target = f0(x) * f1(x), and the prior model is Target = (x-1) * sin(2 * x), you will enter the prior solution as:
Target = (x - 1), f1 = sin(2 * x)
To specify multiple expressions from prior models, enter each set of functions on a new line. You can enter expressions for only some of the functions that exist in the target expression; if this is the case, DataRobot will fill in '0' as the seed for other functions.
For example, if you enter:
f1 = sin(2 * x)
DataRobot will translate this to:
f0 = 0, f1 = sin(2 * x)
Data partitioning for training and cross-validation¶
DataRobot performs its standard process for data partitioning (as explained here) for each Eureqa model. Then, it further subdivides the training set data into two more sets: a Eureqa internal training set and a Eureqa internal validation set. The data for these Eureqa internal sets is derived from the original DataRobot training set. (The original DataRobot validation set is never used as part of the Eureqa data partitioning process.)
DataRobot uses the Eureqa internal training set to drive the core Eureqa evolutionary algorithm, and uses both the Eureqa internal training and validation sets to select which models are the "best" and, therefore, selected for inclusion in the final Eureqa Pareto Front (within the Eureqa Models tab).
A random split will randomly assign rows for Eureqa internal training and Eureqa internal validation. Rows (within the original training set) are split based on the Eureqa internal training and Eureqa internal validation percentages. If the training and validation percentages total more than 100%, the overlapping percentage of rows will be assigned to both training and validation. Random split with 50% of data for Eureqa internal training and 50% for Eureqa internal validation is recommended for most (non-Time Series) modeling problems.
For very small data sets (e.g., under a few hundred points) it is usually best to use overlapping Eureqa internal training/Eureqa internal validation datasets. When the data is extremely small, or has very little or no noise, you may want to use 50% of the original DataRobot training data for Eureqa internal training and 100% for Eureqa validation. In extreme cases, you may want to include 100% of the data for both Eureqa internal training/Eureqa internal validation datasets, and then limit your model selection to those with lower complexities.
For large data sets (e.g., over 1,000 points) it is usually best to use a smaller fraction of data for the Eureqa training set. It is recommended to choose a fraction such that the size of the Eureqa training data is approximately 10,000 rows or less. Then, use all remaining data for the Eureqa validation set.
An in-order split maintains the original order of the input data (i.e., the original DataRobot training set) and selects a percentage of rows, starting with the first row, to use for the Eureqa internal training set and a different percentage of rows, starting with the last row, to use for the Eureqa internal validation set. If the training and validation percentages total more than 100%, the overlapping percentage of rows will be assigned to both the Eureqa internal training and internal Eureqa validation sets.
This option can be used if you have pre-arranged your data with rows you want to use for the Eureqa internal training set at the beginning of the dataset and rows you want to use for the Eureqa internal validation set at the end.
In-order split is applied by default when performing data partitioning for Time Series and OTV models, as explained here.
Split by variable¶
Split by variable allows you to manually indicate which rows to use for training and which to use for validation using variables that have been pre-defined in your project dataset. Rows are selected if the indicator variable has a value greater than 0. By default, the Eureqa internal training rows will be selected as the inverse of the Eureqa internal validation rows, unless a separate indicator is provided for training rows.
You may include a validation data variable and/or a training data variable in your data before uploading it to Eureqa, or use Eureqa to create a derived variable that will be used to split the data.
Split by expression¶
Split by expression allows you to manually identify which rows to use for Eureqa internal training and which rows to use for Eureqa internal validation using expressions entered as part of the target expression. Rows are selected if the expression has a value greater than 0. By default, the Eureqa internal training set rows will be selected as the inverse of the Eureqa internal validation set rows, unless a separate expression is provided for training rows.
Eureqa data partitioning¶
You can modify the default Data Partitioning settings using the training_fraction and validation_fraction parameters. To adjust how DataRobot splits the data for the model, modify the split_mode parameter. Finally, to direct DataRobot to create the Eureqa internal training and internal validation sets based on custom expressions (rather than the default settings, explained previously), add those expressions to the training_split _expr and/or validation_split _expr parameters, as applicable.
The default Eureqa data partitioning process (to create the Eureqa internal training and internal validation sets) differs between non-Time Series and Time Series models:
For non-Time Series models: DataRobot performs a 50/50 random split of the shuffled training set data and then uses the first half as the Eureqa internal training set and the second half as the Eureqa internal validation set (where split_mode = 1, for random).
For datetime-partitioned models (i.e., models created as either Time Series Modeling and Out-of-Time Validation (OTV): DataRobot performs a 70/30 in-order split of the chronologically sorted training set data and then uses the first 70% as the Eureqa internal training set and the second 30% as the Eureqa internal validation set (where split_mode = 2, for in-order).
If you selected random partitioning when you started your project (using Advanced options), it is strongly recommended that you do not select in-order split mode when tuning Eureqa models.
Data Partitioning for cross-validation¶
When performing cross-validation for Eureqa models, DataRobot uses only the first CV split for training; therefore, only that training data (from the first CV split) is split further into Eureqa internal training data and Eureqa internal validation data.