Error metric guidance¶

A small set of common error metrics works well for the large majority of problem types, so starting with one of them is usually a safe choice. This section identifies the top error metrics for different types of problems.

Numeric and time series models¶

Mean Absolute Error¶

Because it minimizes the absolute residual error rather than the squared residual error, Mean Absolute Error is a good general-purpose error metric that is more tolerant of outliers than Mean Squared Error and R^2 Goodness of Fit Error. This can be a good choice if outliers in the data are likely to be due to noise, or if capturing an overall trend is more important than avoiding a few large errors. Mean Absolute Error can also be interpreted as "on average, predictions are off by this amount."
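The difference in outlier sensitivity can be illustrated with a small sketch. The data and the helper names (`mean_absolute_error`, `mean_squared_error`) are hypothetical and use the textbook definitions; this is not Eureqa's internal implementation.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all rows."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of (actual - predicted)^2 over all rows."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical data: four rows with a residual of 1 and one outlier row
# with a residual of 38.
actual = [10.0, 12.0, 11.0, 13.0, 50.0]
predicted = [11.0, 11.0, 12.0, 12.0, 12.0]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)

# Fraction of each metric's total that comes from the outlier row alone.
n = len(actual)
outlier_share_mae = abs(50.0 - 12.0) / (mae * n)
outlier_share_mse = (50.0 - 12.0) ** 2 / (mse * n)

print(f"MAE={mae:.2f}, outlier share={outlier_share_mae:.3f}")
print(f"MSE={mse:.2f}, outlier share={outlier_share_mse:.3f}")
```

In this example the outlier accounts for a larger share of the squared-error total than of the absolute-error total, which is why a search guided by Mean Squared Error works harder to fit that single row.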

Mean Absolute Percentage Error¶

Mean Absolute Percentage Error is a common error metric for time series forecasting. It can be interpreted as the average absolute percentage by which predicted values deviate from the actuals. It can be a good choice when relative errors are more important than absolute error values. Mean Absolute Percentage Error may not be a good choice if there are very small actual values in the dataset, since small errors on these rows may dominate the metric calculation. Rows where the actual value is 0 are not included in the error metric calculation.
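The following is a minimal sketch of that calculation, assuming the textbook formula and a result scaled to percent (the scaling is an assumption; the source does not say whether Eureqa reports a fraction or a percentage). The function name and data are hypothetical. Rows with an actual value of 0 are skipped, as described above.

```python
def mean_absolute_percentage_error(actual, predicted):
    """Mean of |(actual - predicted) / actual| in percent, skipping actual == 0."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("every actual value is 0, so MAPE is undefined")
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


# Hypothetical data: the third row has actual == 0 and is excluded;
# the fourth row has a very small actual value.
actual = [100.0, 200.0, 0.0, 0.5]
predicted = [110.0, 190.0, 5.0, 1.0]

mape_all = mean_absolute_percentage_error(actual, predicted)
mape_without_small = mean_absolute_percentage_error(actual[:2], predicted[:2])

print(f"MAPE with the small-valued row:    {mape_all:.2f}%")
print(f"MAPE without the small-valued row: {mape_without_small:.2f}%")
```

The fourth row is off by only 0.5 in absolute terms, but because its actual value is 0.5 it contributes a 100% error and dominates the average.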

R^2 Goodness of Fit Error (R^2)¶

R^2 is a standard measure of model fitness. It can be interpreted as the “percent of variance explained”. Because it is expressed as a percentage, R^2 can be compared across models and datasets; its scale does not depend on the scale of the data.

Note

R^2 Goodness of Fit Error is the Eureqa default for numeric and time series models.

Mean Squared Error¶

Mean Squared Error is a common error metric. Optimizing for Mean Squared Error is equivalent to optimizing for R^2; however, Mean Squared Error values depend on the scale of the data. Because they depend on squared error, both Mean Squared Error and R^2 tend to be sensitive to outliers, and both are a good choice when there is a strong incentive to avoid individual large errors.
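The equivalence follows from the standard definition of R^2. The sketch below uses the textbook formulas; the symbols y_i (actuals), \hat{y}_i (predictions), \bar{y} (mean of the actuals), and n (number of rows) are notation introduced here, and how Eureqa converts R^2 into an error value is not shown.

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
    = 1 - \frac{n\,\mathrm{MSE}}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```

The denominator depends only on the data, so for a fixed dataset any model that lowers Mean Squared Error raises R^2 proportionally. Dividing by that denominator is also what removes the dependence on the scale of the data.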

Classification models¶

Mean Squared Error for classification¶

Mean Squared Error for Classification is the default error metric for classification problems in Eureqa. This metric optimizes Mean Squared Error but with internal optimizations for classification problems. The output values of logistic models that have been optimized for Mean Squared Error can be interpreted as the probability of a 1 outcome. Mean Squared Error may not be the best error metric when trying to classify rare events since it attempts to minimize overall error rather than separation between positive and negative cases.

Area Under ROC Curve Error (AUC)¶

AUC is a common error metric for classification and works by optimizing the ability of a model to separate the 1s from the 0s. AUC is not sensitive to the relative number of 0s and 1s in the target variable and can be a good choice for skewed classes. When optimized for AUC, predicted values will effectively order inputs from the most to least likely to be 1; however, they cannot be interpreted as a predicted probability.
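One way to see why AUC-optimized outputs only carry ordering information is to compute AUC as the fraction of (positive, negative) pairs that the model orders correctly. This is a sketch of the standard pairwise definition with hypothetical data and a hypothetical function name `auc`; it is not Eureqa's implementation.

```python
def auc(labels, scores):
    """Fraction of (1, 0) pairs where the 1 row gets the higher score.

    Tied scores count as half a correctly ordered pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one 1 and one 0 in the target")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical target values and model outputs.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]

# An order-preserving rescaling produces values far outside [0, 1] ...
rescaled = [1000.0 * s - 5.0 for s in scores]

# ... yet the AUC is unchanged, because only the ordering matters.
print(auc(labels, scores), auc(labels, rescaled))
```

Because any order-preserving transformation of the outputs scores the same, nothing in the optimization pushes the outputs toward calibrated probabilities.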

Error metrics and noise¶

One consideration for choosing an error metric is the expected amount of noise in the data. Different error metrics effectively make different assumptions about the distribution of the noise in the observed output. For example, for very noisy systems you might select an error metric that gives relatively less weight to some large errors (e.g., Mean Absolute Error, IQR Error, Median Error) under the assumption that these large errors may be due to noise in the input data rather than poor model fit. Conversely, when input data is expected to have very low noise, you might select an error metric that heavily penalizes large errors (e.g., R^2 or Maximum Error).

Noisy systems¶

Mean Logarithm Squared Error¶

Mean Logarithm Squared Error uses the log function to squash error values and decrease the impact of large errors.
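The squashing effect can be sketched as follows. The exact formula is an assumption: this example uses the mean of log(1 + residual^2), and the source does not spell out the form Eureqa uses. The function name is hypothetical.

```python
import math


def mean_log_squared_error(actual, predicted):
    """Mean of log(1 + residual^2). ASSUMED form, not confirmed by the source."""
    return sum(
        math.log1p((a - p) ** 2) for a, p in zip(actual, predicted)
    ) / len(actual)


# A residual of 1 versus a residual of 100 on a single row.
small = mean_log_squared_error([0.0], [1.0])
large = mean_log_squared_error([0.0], [100.0])

# Under plain squared error the second residual costs 10,000 times more.
squared_ratio = (100.0 ** 2) / (1.0 ** 2)
log_ratio = large / small

print(f"squared-error ratio: {squared_ratio:.0f}")
print(f"log-squared-error ratio: {log_ratio:.1f}")
```

Under this assumed form, a residual that is 100 times larger costs only about an order of magnitude more, so a few large errors cannot dominate the total.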

Interquartile Mean Absolute Error (IQME)¶

By ignoring the smallest 25% and largest 25% of error values, IQME will not be impacted by a significant number of outliers and may work well if you are most interested in “on average” performance.
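A minimal sketch of an interquartile mean of absolute errors, assuming it keeps the middle 50% of the sorted errors. The trimming rule used here (drop `len // 4` values from each end) is a simplification chosen for illustration; the source does not describe how Eureqa handles quartile boundaries. The function name and data are hypothetical.

```python
def interquartile_mean_absolute_error(actual, predicted):
    """Mean of the middle half of the sorted absolute errors."""
    errors = sorted(abs(a - p) for a, p in zip(actual, predicted))
    if not errors:
        raise ValueError("no rows to score")
    k = len(errors) // 4  # number of values dropped from each end
    middle = errors[k:len(errors) - k]
    return sum(middle) / len(middle)


# Hypothetical data: six ordinary rows and two extreme outliers.
actual = [0.0] * 8
predicted = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 50.0, 90.0]

iqme = interquartile_mean_absolute_error(actual, predicted)
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

print(f"IQME={iqme}, plain MAE={mae}")
```

Here a quarter of the rows are outliers, yet they have no effect on the interquartile value, while the plain mean is pulled far above the typical error.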

Median Absolute Error¶

By ignoring all residual values except for the median, Median Absolute Error is the most permissive of outliers.
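As a sketch, the median of the absolute residuals can be computed with Python's standard `statistics.median`. The function name and data below are hypothetical.

```python
import statistics


def median_absolute_error(actual, predicted):
    """Median of |actual - predicted|; every other residual is ignored."""
    return statistics.median(abs(a - p) for a, p in zip(actual, predicted))


# Hypothetical data: the last row is off by 1000.
actual = [10, 20, 30, 40, 50]
predicted = [11, 21, 32, 42, 1050]

with_outlier = median_absolute_error(actual, predicted)

# Making the outlier ten times worse does not change the metric at all.
predicted_worse = [11, 21, 32, 42, 10050]
with_worse_outlier = median_absolute_error(actual, predicted_worse)

print(with_outlier, with_worse_outlier)
```

Up to half of the rows can be arbitrarily wrong without moving the median, which is what makes this metric so permissive.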

Low-noise systems¶

Maximum Absolute Error¶

By ignoring all but the maximum error value, Maximum Absolute Error can work well if you are expecting a perfect or nearly perfect fit; for example, if you are using the Eureqa model for symbolic simplification.
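The symbolic simplification case can be sketched by scoring candidate expressions against a target function on sample points. The target, the candidates, and the function name are hypothetical; integer sample points are used so that the arithmetic is exact.

```python
def maximum_absolute_error(actual, predicted):
    """Largest |actual - predicted| over all rows."""
    return max(abs(a - p) for a, p in zip(actual, predicted))


# Hypothetical target expression to simplify: (x + 1)^2.
xs = list(range(-5, 6))
target = [(x + 1) ** 2 for x in xs]

# Candidate A is an exact identity; candidate B drops the constant term.
candidate_a = [x * x + 2 * x + 1 for x in xs]
candidate_b = [x * x + 2 * x for x in xs]

error_a = maximum_absolute_error(target, candidate_a)
error_b = maximum_absolute_error(target, candidate_b)

print(error_a, error_b)
```

Only an expression that matches at every sample point reaches zero, so this metric separates exact identities from close approximations.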

Error metrics for classification¶

In addition to the common classification error metrics outlined above, the error metrics described in this section are specific to classification problems.

Additional error metrics for classification¶

Log Loss Error¶

Log Loss Error is a common metric for classification problems. The log transformation on the errors heavily penalizes a high confidence in wrong predictions.
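The penalty on confident wrong predictions can be seen in a short sketch of the standard binary log loss. The function name and the `eps` clipping value are illustrative choices made here to avoid log(0), not Eureqa settings.

```python
import math


def log_loss(labels, probabilities, eps=1e-15):
    """Mean of -[y*log(p) + (1 - y)*log(1 - p)] for 0/1 labels."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# One row whose true label is 1, scored at decreasing predicted probabilities.
for p in (0.9, 0.6, 0.4, 0.1, 0.01):
    print(f"predicted P(1)={p:<5} log loss={log_loss([1], [p]):.3f}")
```

The loss grows slowly while the prediction is merely uncertain, then rises steeply as the model becomes confident in the wrong class.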

Maximum Classification Accuracy¶

Maximum Classification Accuracy optimizes the overall ability of a model to make correct 0 or 1 predictions. It may not work well for skewed classes (e.g., when only 1% of the data is ‘1’), since in these cases sometimes the highest predictive accuracy is achieved by simply predicting 0 all of the time.
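The skewed-class problem can be reproduced with a few lines. The data is hypothetical (1% of rows are 1), and `accuracy` and `recall` are simple helper functions written for this sketch.

```python
def accuracy(labels, predictions):
    """Fraction of rows where the 0/1 prediction matches the label."""
    correct = sum(1 for y, y_hat in zip(labels, predictions) if y == y_hat)
    return correct / len(labels)


def recall(labels, predictions):
    """Fraction of the actual 1 rows that were predicted as 1."""
    positives = [y_hat for y, y_hat in zip(labels, predictions) if y == 1]
    return sum(positives) / len(positives)


# Hypothetical skewed target: 1 positive row out of 100.
labels = [1] + [0] * 99

# A "model" that ignores its inputs and always predicts 0.
always_zero = [0] * 100

print(f"accuracy={accuracy(labels, always_zero):.2f}")
print(f"recall={recall(labels, always_zero):.2f}")
```

The constant predictor is right on 99 of 100 rows while finding none of the rare events, so a search that only maximizes accuracy has little reason to move away from it.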

Hinge Loss Error¶

Hinge Loss Error is used to optimize classification models that will be used for 0 or 1 predictions. It is a one-sided metric that increasingly penalizes wrong predictions as they get more confident, but treats all correct predictions identically after they reach a minimum threshold value. When optimizing Hinge Loss Error, the logistic() building block should not be used in the target expression, since this metric expects a large range of predicted score values.
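A sketch of the standard hinge loss shows both the one-sided behavior and why squashing scores into (0, 1) is a problem. The margin of 1 and the mapping of 0/1 labels to -1/+1 are the textbook convention and are assumptions here; the source does not state the threshold Eureqa uses. The `logistic` helper is defined locally for the illustration.

```python
import math


def hinge_loss(labels, scores):
    """Mean of max(0, 1 - t*score), with 0/1 labels mapped to t = -1/+1."""
    total = 0.0
    for y, s in zip(labels, scores):
        t = 1.0 if y == 1 else -1.0
        total += max(0.0, 1.0 - t * s)
    return total / len(labels)


def logistic(z):
    """Local helper: squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Correct predictions past the margin all cost nothing ...
print(hinge_loss([1], [1.0]), hinge_loss([1], [5.0]))

# ... while wrong predictions cost more as they get more confident.
print(hinge_loss([1], [-1.0]), hinge_loss([1], [-5.0]))

# Scores squashed into (0, 1) can never clear the margin for either class.
squashed = [logistic(10.0), logistic(-10.0)]
print(hinge_loss([1, 0], squashed))
```

Under this definition a logistic-squashed score leaves a residual loss on every row, even for a model that separates the classes perfectly, which is consistent with the guidance to leave the logistic function out of the target expression.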

Use case-specific error metrics¶

Predicting rank¶

Rank Correlation measures a model by its ability to rank-order observations rather than to predict a particular value. This can be useful when looking for a model that can predict a relative ranking, such as the finishing order of contestants in a race.
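As an illustration, the sketch below computes Spearman's rank correlation (the Pearson correlation of the ranks, with tied values sharing an average rank). Using Spearman's coefficient is an assumption; the source does not say which rank correlation Eureqa computes. All function names and data are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman_rank_correlation(actual, predicted):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(actual), average_ranks(predicted))


# Hypothetical race: actual finishing times and model scores that are on a
# completely different scale but put the contestants in the same order.
finish_times = [9.8, 10.1, 10.4, 11.0]
model_scores = [1.0, 2.0, 50.0, 51.0]

same_order = spearman_rank_correlation(finish_times, model_scores)
reversed_order = spearman_rank_correlation(finish_times, model_scores[::-1])
print(same_order, reversed_order)
```

The model scores are nowhere near the finishing times in value, yet the correlation is perfect because only the ordering is compared.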