About augmented models¶
Image augmentation creates new training images by randomly transforming existing ones, letting you build insightful projects with datasets that might otherwise be too small. In addition, any image project that uses augmentation can achieve smaller overall loss by improving how well models generalize to unseen data. That is:
- Augmentation is the action taken on the image dataset.
- Transformations are the actions applied to an image.
After the augmentation process completes, each image in the dataset has been transformed.
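The distinction between the two terms can be sketched with a minimal NumPy example. This is an illustration only, not DataRobot's implementation; the array, the helper name, and the flip transformation are all assumptions made for the sketch:

```python
import numpy as np

# A toy 8x8 grayscale "image"; a NumPy array stands in for a real image file.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)

# A transformation is an action applied to a single image.
def horizontal_flip(img):
    return img[:, ::-1]

# Augmentation is the action taken on the dataset: applying transformations
# to existing images to create additional training images.
dataset = [image]
augmented = dataset + [horizontal_flip(img) for img in dataset]
print(len(dataset), len(augmented))  # 1 2
```

Here a single transformation doubles the number of training rows; real augmentation typically samples from a list of transformations with configured probabilities.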
For a general explanation of image augmentation, see the albumentations documentation; albumentations is the open-source library that helps power DataRobot's implementation of the augmentation feature.
This page provides a general overview of how to configure augmentation; the parameters themselves are detailed in the page on augmentation lists and transformation parameters.
There are two places where you can configure the Train-Time Image Augmentation step:
If you add a secondary dataset with images to a primary tabular dataset, the augmentation options described above are not available. Instead, if you have access to Composable ML, modify each blueprint that needs augmentation by adding an image augmentation vertex directly after the raw image input (as the first vertex in the image branch), and configure augmentation there.
A key advantage of train-time image augmentation is that it is applied only during training, so a model's prediction time is essentially unchanged by whether it was trained with augmentation. This lets you deploy models with better loss at no cost to prediction latency.
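Why latency is unaffected can be sketched in a few lines of Python. This is an illustration under assumed names (toy image, stand-in model, a hypothetical `random_flip` transformation), not DataRobot code; the point is that random transformations are drawn inside the training loop only, while prediction scores the raw image:

```python
import random

def random_flip(image, p=0.5):
    """Train-time transformation: reverse each row with probability p."""
    if random.random() < p:
        return [row[::-1] for row in image]
    return image

def training_batches(images, epochs=3):
    """Augmentation runs inside the training loop only: each epoch sees a
    freshly (randomly) transformed copy of every image."""
    for _ in range(epochs):
        for image in images:
            yield random_flip(image)

def predict(model, image):
    """Prediction scores the raw image directly, so serving latency is
    unaffected by how the model was trained."""
    return model(image)

# Toy 2x2 "image" and a stand-in model that just sums the pixels.
image = [[1, 2], [3, 4]]
model = lambda img: sum(sum(row) for row in img)
batches = list(training_batches([image]))
print(len(batches), predict(model, image))  # 3 10
```

Because `predict` never calls the transformation, the extra work of augmentation is paid once at training time rather than on every scoring request.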
Some performance notes:
- Benchmarking shows that in a project where image augmentation doubles the number of dataset rows, building in Autopilot takes about 50% longer.
- When image augmentation improves a model's LogLoss, the improvement averages approximately 10%, though variance is very large from model to model and dataset to dataset.
While models trained with image augmentation are often more robust to data drift than models trained without it, the transformations applied during augmentation should not be used to anticipate future data drift. For example, suppose you are training a model to detect species of freshwater fish, and you anticipate applying the model in a different region with larger fish. The best approach is to collect data from that region and incorporate it into your dataset. If you instead applied the Scale transformation to your current dataset to simulate larger fish, you would create images with larger fish in training, but when DataRobot scored the model against the validation or holdout partitions, performance would suffer because those partitions contain no larger fish. This makes it difficult to correctly evaluate an augmented model against other models on the Leaderboard, because your current training dataset is not representative of your future data.
Many research papers explain and provide evidence for the benefits of image augmentation, such as improved model performance and greater robustness. Below is a sample of external resources:
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020, November). A Simple Framework for Contrastive Learning of Visual Representations.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks.
Perez, L., & Wang, J. (2017). The Effectiveness of Data Augmentation in Image Classification using Deep Learning.
Zoph, B., Cubuk, E. D., Ghiasi, G., Lin, T. Y., Shlens, J., & Le, Q. V. (2020, August). Learning Data Augmentation Strategies for Object Detection.