To calculate custom metric values, you can use MetricEvaluator to compute metric values aggregated over time, BatchMetricEvaluator to compute metric values aggregated per batch, or IndividualMetricEvaluator to evaluate metrics without data aggregation.
The MetricEvaluator class calculates metric values over time using the selected source. This class is used to "stream" data through the metric object, generating metric values.
To use MetricEvaluator, create a metric class implementing the MetricBase interface and a source implementing DataSourceBase. Then, to define how metrics are generated over time, specify the level of aggregation granularity. Initialize MetricEvaluator with the following parameters:
metric: Union[str, List[str], MetricBase, List[MetricBase]]
If a string or a list of strings is passed, MetricEvaluator looks for matching sklearn metrics. If a metric object or a list of metric objects is passed, each must implement the MetricBase interface. (A sketch of the string form follows this parameter list.)
source: DataSourceBase
The source to pull the data from, DataRobotSource or DataFrameSource or other sources that implement the DataSourceBase interface.
time_bucket: TimeBucket
The time-bucket size to use for evaluating metrics, determining the granularity of aggregation.
prediction_col: Optional[str]
The name of the column that contains predictions.
actuals_col: Optional[str]
The name of the column that contains actuals.
timestamp_col: Optional[str]
The name of the column that contains timestamps.
filter_actuals: Optional[bool]
Whether the metric evaluator removes missing actuals values before scoring. True removes missing actuals; the default value is False.
filter_predictions: Optional[bool]
Whether the metric evaluator removes missing predictions values before scoring. True removes missing predictions; the default value is False.
filter_scoring_data: Optional[bool]
Whether the metric evaluator removes missing scoring values before scoring. True removes missing scoring values; the default value is False.
segment_attribute: Optional[str]
The name of the column with segment values.
segment_value: Optional[Union[str, List[str]]]
A single value or a list of values of the segment attribute to segment on.
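As noted for the metric parameter above, metrics can also be referenced by sklearn name. A minimal sketch, assuming MetricEvaluator and TimeBucket are importable from the top-level dmm package (as DataRobotSource is in the examples below) and that DEPLOYMENT_ID identifies an existing deployment:

from datetime import datetime, timedelta

from dmm import DataRobotSource, MetricEvaluator, TimeBucket

# Pull the last three hours of exported data from the deployment.
source = DataRobotSource(
    deployment_id=DEPLOYMENT_ID,
    start=datetime.utcnow() - timedelta(hours=3),
    end=datetime.utcnow(),
)

# Passing a string asks MetricEvaluator to look up the matching sklearn metric by name.
me = MetricEvaluator(metric="median_absolute_error", source=source, time_bucket=TimeBucket.HOUR)
aggregated_metric_per_time_bucket = me.score()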
The score method returns metric values aggregated as defined by TimeBucket. The output, returned as a pandas DataFrame, contains the results per time bucket for all data from the source.
source = DataRobotSource(
    deployment_id=DEPLOYMENT_ID,
    start=datetime.utcnow() - timedelta(hours=3),
    end=datetime.utcnow(),
)
metric = LogLossFromSklearn()
me = MetricEvaluator(metric=metric, source=source, time_bucket=TimeBucket.HOUR)
aggregated_metric_per_time_bucket = me.score()
print(aggregated_metric_per_time_bucket.to_string())

                          timestamp  samples  log_loss
0  2023-09-14 13:29:48.065000+00:00      499  0.539315
1  2023-09-14 14:01:51.484000+00:00      499  0.539397

# we can see the evaluator's statistics
stats = me.stats()
print(stats)
total rows: 998, score calls: 2, reduce calls: 2
To pass more than one metric at a time, do the following:
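A minimal sketch, reusing the source defined above and two metric classes that appear later in this section (every object in the list must implement MetricBase, and the resulting DataFrame contains a column per metric):

# Pass a list of metric objects instead of a single metric.
me = MetricEvaluator(
    metric=[LogLossFromSklearn(), RocAuc()],
    source=source,
    time_bucket=TimeBucket.HOUR,
)
aggregated_metric_per_time_bucket = me.score()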
If data is missing, use the filtering flags. In the following example, the data is missing actuals; in this scenario, without a filtering flag, an exception is raised:
test_df = gen_dataframe_for_accuracy_metric(
    nr_rows=10,
    rows_per_time_bucket=5,
    prediction_value=1,
    time_bucket=TimeBucket.HOUR,
)
test_df["actuals"].loc[2] = None
test_df["actuals"].loc[5] = None
print(test_df)

                    timestamp  predictions  actuals
0  01/06/2005 13:00:00.000000            1    0.999
1  01/06/2005 13:00:00.000000            1    0.999
2  01/06/2005 13:00:00.000000            1      NaN
3  01/06/2005 13:00:00.000000            1    0.999
4  01/06/2005 13:00:00.000000            1    0.999
5  01/06/2005 14:00:00.000000            1      NaN
6  01/06/2005 14:00:00.000000            1    0.999
7  01/06/2005 14:00:00.000000            1    0.999
8  01/06/2005 14:00:00.000000            1    0.999
9  01/06/2005 14:00:00.000000            1    0.999

source = DataFrameSource(df=test_df)
metric = MedianAbsoluteError()
me = MetricEvaluator(metric=metric, source=source, time_bucket=TimeBucket.HOUR)
aggregated_metric_per_time_bucket = me.score()

"ValueError: Could not apply metric median_absolute_error, make sure you are passing the right data (see the sklearn docs). The error message was: Input contains NaN."
Compare the previous result with the result when you enable the filter_actuals flag:
me = MetricEvaluator(metric=metric, source=source, time_bucket=TimeBucket.HOUR, filter_actuals=True)
aggregated_metric_per_time_bucket = me.score()

"removed 1 rows out of 5 in the data chunk before scoring, due to missing values in ['actuals'] data"
"removed 1 rows out of 5 in the data chunk before scoring, due to missing values in ['actuals'] data"

print(aggregated_metric_per_time_bucket.to_string())

                    timestamp  samples  median_absolute_error
0  01/06/2005 13:00:00.000000        4                  0.001
1  01/06/2005 14:00:00.000000        4                  0.001
Using the filter_actuals, filter_predictions, and filter_scoring_data flags, you can filter out missing values from the data before calculating the metric. By default, these flags are set to False. If all data needed to calculate the metric is missing from a data chunk, that data chunk is skipped and an appropriate message is logged:
test_df = gen_dataframe_for_accuracy_metric(
    nr_rows=4,
    rows_per_time_bucket=2,
    prediction_value=1,
    time_bucket=TimeBucket.HOUR,
)
test_df["actuals"].loc[0] = None
test_df["actuals"].loc[1] = None
print(test_df)

                    timestamp  predictions  actuals
0  01/06/2005 13:00:00.000000            1      NaN
1  01/06/2005 13:00:00.000000            1      NaN
2  01/06/2005 14:00:00.000000            1    0.999
3  01/06/2005 14:00:00.000000            1    0.999

source = DataFrameSource(df=test_df)
metric = MedianAbsoluteError()
me = MetricEvaluator(metric=metric, source=source, time_bucket=TimeBucket.HOUR, filter_actuals=True)
aggregated_metric_per_time_bucket = me.score()

"removed 2 rows out of 2 in the data chunk before scoring, due to missing values in ['actuals'] data"
"data chunk is empty, skipping scoring..."

print(aggregated_metric_per_time_bucket.to_string())

                    timestamp  samples  median_absolute_error
1  01/06/2005 14:00:00.000000        2                  0.001
Perform segmented analysis by defining the segment_attribute and each segment_value:
metric = LogLossFromSklearn()
me = MetricEvaluator(
    metric=metric,
    source=source,
    time_bucket=TimeBucket.HOUR,
    segment_attribute="insulin",
    segment_value="Down",
)
aggregated_metric_per_time_bucket = me.score()
print(aggregated_metric_per_time_bucket.to_string())

                          timestamp  samples  log_loss [Down]
0  2023-09-14 13:29:49.737000+00:00       49         0.594483
1  2023-09-14 14:01:52.437000+00:00       49         0.594483

# passing more than one segment value
me = MetricEvaluator(
    metric=metric,
    source=source,
    time_bucket=TimeBucket.HOUR,
    segment_attribute="insulin",
    segment_value=["Down", "Steady"],
)
aggregated_metric_per_time_bucket = me.score()
print(aggregated_metric_per_time_bucket.to_string())

                          timestamp  samples  log_loss [Down]  log_loss [Steady]
0  2023-09-14 13:29:48.502000+00:00      199         0.594483           0.515811
1  2023-09-14 14:01:51.758000+00:00      199         0.594483           0.515811

# passing more than one segment value and more than one metric
me = MetricEvaluator(
    metric=[LogLossFromSklearn(), RocAuc()],
    source=source,
    time_bucket=TimeBucket.HOUR,
    segment_attribute="insulin",
    segment_value=["Down", "Steady"],
)
aggregated_metric_per_time_bucket = me.score()
print(aggregated_metric_per_time_bucket.to_string())

                          timestamp  samples  log_loss [Down]  log_loss [Steady]  roc_auc_score [Down]  roc_auc_score [Steady]
0  2023-09-14 13:29:48.502000+00:00      199         0.594483           0.515811              0.783333                0.826632
1  2023-09-14 14:01:51.758000+00:00      199         0.594483           0.515811              0.783333                0.826632
The IndividualMetricEvaluator class evaluates metrics without data aggregation. It performs metric calculations on all exported data and returns a list of individual results. This evaluator lets you submit individual data points with a corresponding association ID, which is useful when you want to visualize metric results alongside predictions and actuals. To use this evaluator with custom metrics, implement a score() method that accepts, among other parameters, timestamps and association_ids.
from itertools import zip_longest
from typing import List
from datetime import datetime
from datetime import timedelta

from dmm import CustomMetric
from dmm import DataRobotSource
from dmm import SingleMetricResult
from dmm.individual_metric_evaluator import IndividualMetricEvaluator
from dmm.metric import LLMMetricBase
from nltk import sent_tokenize
import numpy as np
import pandas as pd

source = DataRobotSource(
    deployment_id=DEPLOYMENT_ID,
    start=datetime.utcnow() - timedelta(weeks=1),
    end=datetime.utcnow(),
)

custom_metric = CustomMetric.from_id()


class SentenceCount(LLMMetricBase):
    """
    Calculates the total number of sentences created while working with the LLM model.
    Returns the sum of the number of sentences from prompts and completions.
    """

    def __init__(self):
        super().__init__(
            name=custom_metric.name,
            description="Calculates the total number of sentences created while working with the LLM model.",
            need_training_data=False,
        )
        self.prompt_column = "promptColumn"

    def score(
        self,
        scoring_data: pd.DataFrame,
        predictions: np.ndarray,
        timestamps: np.ndarray,
        association_ids: np.ndarray,
        **kwargs,
    ) -> List[SingleMetricResult]:
        if self.prompt_column not in scoring_data.columns:
            raise ValueError(
                f"Prompt column {self.prompt_column} not found in the exported data, "
                f"modify 'PROMPT_COLUMN' runtime parameter"
            )
        prompts = scoring_data[self.prompt_column].to_numpy()
        sentence_count = []
        for prompt, completion, ts, a_id in zip_longest(
            prompts, predictions, timestamps, association_ids
        ):
            if not isinstance(prompt, str) or not isinstance(completion, str):
                continue
            value = len(sent_tokenize(prompt)) + len(sent_tokenize(completion))
            sentence_count.append(
                SingleMetricResult(value=value, timestamp=ts, association_id=a_id)
            )
        return sentence_count


sentence_count_evaluator = IndividualMetricEvaluator(
    metric=SentenceCount(),
    source=source,
)
metric_results = sentence_count_evaluator.score()
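Each element of metric_results is a SingleMetricResult. As a small follow-up sketch (assuming the object exposes the value, timestamp, and association_id fields it was constructed with above), you could collect the individual results into a DataFrame to inspect or plot them alongside predictions and actuals:

# Hypothetical inspection step: tabulate the per-prediction metric values.
results_df = pd.DataFrame(
    {
        "association_id": [r.association_id for r in metric_results],
        "timestamp": [r.timestamp for r in metric_results],
        "sentence_count": [r.value for r in metric_results],
    }
)
print(results_df.head())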