Prediction Explanations on small data
The described workaround is intended for users who are very familiar with the partitioning methods used in DataRobot modeling. Be certain you understand the implications of the changes and their impact on resulting models.
Can I get Prediction Explanations for a small dataset?
For small datasets, specifically those with validation subsets of fewer than 100 rows, we cannot run XEMP Prediction Explanations. (I assume that's true for SHAP also, but I haven't confirmed.) Is there a common workaround for this? I was considering just doubling or tripling the dataset by creating duplicates, but I'm not sure if others have used slicker approaches.
It’s not true for SHAP, actually. No minimum row count there. 🤠
I feel like I’ve seen workarounds described in #data-science or somewhere... One thing you can do is adjust the partitioning ratios to ensure 100 rows land in Validation. There might be other tricks too.
Right, that second idea makes sense, but you'd need probably > 200 rows. The user has a dataset with 86 rows.
I just don't want to have to eighty-six their use case. 🥁
OK, no dice there. 🎲
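For reference, the row-count arithmetic behind that exchange can be sketched in a few lines. This is only an illustration: the function name is made up here, and the only number taken from the discussion is the 100-row Validation minimum for XEMP.

```python
def required_validation_fraction(n_rows, min_validation_rows=100):
    """Fraction of the dataset that must land in Validation to reach the minimum."""
    return min_validation_rows / n_rows


for n in (86, 200, 500):
    fraction = required_validation_fraction(n)
    # A fraction of 1 or more leaves nothing to train on, so re-partitioning cannot help.
    feasible = fraction < 1
    print(f"{n} rows: validation fraction {fraction:.2f}, feasible by re-partitioning: {feasible}")
```

With 86 rows the required fraction is above 1, which is why adjusting the ratios alone cannot work for this dataset.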
I’d want to be really careful with duplicates, but this MIGHT finesse the issues:
Train on your actual dataset, do “Training Predictions”, and carefully note the partitions for all rows.
Supplement the dataset with copied rows, and add a partition column such that all original rows go in the same partitions as before, and all copied rows go in the Validation fold. I guess you probably want to leave the holdout the same.
Start a new project, select User CV, and train the model. Probably do Training Predictions again and make sure the original rows kept the same prediction values.
You should be able to run XEMP now.
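The data-preparation part of those steps might look roughly like the sketch below. It is a minimal illustration, not a DataRobot API: the function name, the `partition` column name, and the label values are all hypothetical, and it assumes you have already noted each original row's partition from Training Predictions. It also makes one choice the discussion leaves open, namely copying only rows that were already in Validation, on the theory that this disturbs the Validation feature distributions the least.

```python
def augment_with_copies(rows, partitions, validation_label="validation",
                        min_validation_rows=100, partition_column="partition"):
    """Return the original rows plus copied rows, each tagged with a partition label.

    rows: list of dicts, one per original row.
    partitions: list of labels (one per row) noted from the first project.
    """
    if len(rows) != len(partitions):
        raise ValueError("need exactly one partition label per row")

    # Original rows keep exactly the partition they had in the first project.
    out = [{**row, partition_column: label} for row, label in zip(rows, partitions)]

    # Assumption: copies are drawn only from rows that were already in Validation.
    validation_rows = [row for row, label in zip(rows, partitions)
                       if label == validation_label]
    if not validation_rows:
        raise ValueError("no validation rows available to copy")

    # Cycle through the Validation rows until the minimum row count is reached.
    needed = max(0, min_validation_rows - len(validation_rows))
    for i in range(needed):
        source = validation_rows[i % len(validation_rows)]
        out.append({**source, partition_column: validation_label})
    return out
```

The result can be written out with `csv.DictWriter` and uploaded as the dataset for the new project, with the hypothetical `partition` column selected as the user-defined partition column.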
I think (fingers crossed) that this would result in the same trained model, but you will have faked out XEMP. However, Validation scores for the modified model would be highly suspect. XEMP explanations would probably be OK, as long as you ensure the copied data didn’t appreciably change the distributions of any features in the Validation set.
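One rough way to check the distribution caveat for a numeric feature is to compare simple summary statistics of the Validation values before and after adding copies. The helper below is a hypothetical sketch using only the standard library; what counts as an "appreciable" change is left to your judgment, and a categorical feature would need a different comparison (for example, category frequencies).

```python
import statistics


def distribution_shift(original, augmented):
    """Difference in mean and population standard deviation (augmented minus original)."""
    return {
        "mean_diff": statistics.fmean(augmented) - statistics.fmean(original),
        "stdev_diff": statistics.pstdev(augmented) - statistics.pstdev(original),
    }


validation_before = [1.0, 2.0, 3.0, 4.0]

# Copying every Validation row the same number of times leaves both statistics unchanged.
print(distribution_shift(validation_before, validation_before * 3))

# Copying only the largest value shifts the mean upward.
print(distribution_shift(validation_before, validation_before + [4.0] * 4))
```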
I think if you scrupulously kept the Holdout rows the same, and the Holdout scores match in the two models, that is a sign of success.
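The "same predictions" check can be sketched as a comparison of per-row predictions from the two projects. This assumes you have exported predictions for the original rows from both projects into dictionaries keyed by some row identifier of your own; the function name and tolerances are illustrative, not part of any DataRobot tooling.

```python
import math


def mismatched_rows(preds_before, preds_after, rel_tol=1e-6, abs_tol=1e-9):
    """Return the row IDs whose prediction is missing or differs between the two projects."""
    mismatches = []
    for row_id, before in preds_before.items():
        after = preds_after.get(row_id)
        # A missing row or a prediction outside the tolerance counts as a mismatch.
        if after is None or not math.isclose(before, after,
                                             rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append(row_id)
    return mismatches
```

An empty list for the original rows, Holdout rows included, is consistent with the retrained model being the same as the original one; any mismatches suggest the copied rows changed training somewhere.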
Right, so if I ran Autopilot again, it would do unreasonably well on that Validation set, but if I just train the same blueprint from the original Autopilot, that would be fine.
Yes. Autopilot would probably run a different sequence of blueprints because the Leaderboard order would be wacky and the winning blueprint would quite likely be different.
It almost goes without saying, but this is more suspect the earlier you do it in the model selection process. If you’re doing a deep dive on a model you’ve almost locked in on, that’s one thing, but if you’re still choosing among many options, it’s a different situation.
Brilliant, thank you!