Visual AI reference

The following sections provide a brief overview of the technological components of Visual AI.

A common approach for modeling image data is building neural networks that take raw pixel values as input. However, using a fully connected neural network for images often leads to enormous network sizes that are difficult to work with. For example, a color (i.e., red, green, blue), 224x224 pixel image has 150,528 input features (224 x 224 x 3). With a first hidden layer of just 1,000 units, the network would have more than 150 million weights in that layer alone. Additionally, because images contain a lot of "noise," it is very difficult to make sense of them by looking at individual pixels. Instead, pixels are most useful in the context of their neighbors. Since the position and rotation of pixels representing an object in an image can change, the network must be trained to, for example, detect a cat regardless of where it appears in the image. Visual AI provides automated and efficient techniques for solving these challenges, along with model interpretability, tuning, and predictions in a human-consumable and familiar workflow.
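The arithmetic behind those numbers can be checked with a short sketch. The hidden-layer width of 1,000 units is an assumption chosen for illustration, not a DataRobot setting.

```python
# Illustrative arithmetic only; hidden_units is a hypothetical layer width.
height, width, channels = 224, 224, 3
input_features = height * width * channels

hidden_units = 1000  # assumed width of the first fully connected layer
first_layer_weights = input_features * hidden_units

print(f"input features: {input_features:,}")
print(f"first-layer weights: {first_layer_weights:,}")
```

Every additional hidden unit adds another 150,528 weights, which is why fully connected layers on raw pixels grow so quickly.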

Pre-trained network architectures

To use images in a modeling algorithm, they must first be turned into numbers. In DataRobot blueprints, this is the responsibility of the blueprint tasks called "featurizers". The featurizer takes the binary content of an image file as input and produces a feature vector that represents key characteristics of that image at different levels of complexity. These feature vectors can further be combined with other features in the dataset (numeric, categorical, text, etc.) and used downstream as input to a modeler.

In addition, fine-tuned featurizers train the neural network on the given dataset after initializing it with pre-trained information, further customizing the output features. Fine-tuned featurizers are incorporated in a subset of blueprints and are only run by Autopilot in Comprehensive mode. If the project was built using a different mode, you can run them from the Repository. You can also edit an existing blueprint using Composable ML and replace a pre-trained featurizer with a fine-tuned featurizer.

How do different types of featurizers handle domain adaptation?

Separate sets of blueprints incorporate "fine-tuned" featurizers vs pre-trained (non-fine-tuned) featurizers. Both are pre-trained, but they handle transfer learning differently. The non-fine-tuned featurizer blueprints produce features of various complexity, but use the downstream ML algorithm to adapt to a different domain. Fine-tuned featurizer blueprints, on the other hand, adjust their own weights during training.

All DataRobot featurizers, fine-tuned classifiers/regressors, and fine-tuned featurizers are based on pre-trained neural network architectures. Architectures define the internal structure of the featurizer (the neural network), and they influence runtime and accuracy. DataRobot automates the selection of hyperparameters for these featurizers and fine-tuners to a certain extent, but it is also possible to further customize them manually to optimize results.

There is also a baseline blueprint that answers the question "What results would I see if I didn't bother with a neural network?" If selected, DataRobot builds a blueprint that contains a grayscale, downscaled featurizer that is not a neural network. These models are faster but less accurate. They are useful for investigating target leakage (Is one class brighter than others? Are there unique watermarks or visual patches for each class? Is accuracy too good to be true?).

Furthermore, DataRobot implements state-of-the-art neural network optimizations to run over popular architectures, making them significantly faster while preserving the same accuracy. DataRobot offers pruned versions of several of the top featurizers; where a pruned variant of an architecture exists, it is highly recommended because it provides the same accuracy at up to three times the speed.

DataRobot Visual AI not only offers the best imaging architectures but also automatically selects the best neural network architecture and featurizer_pooling type for the dataset and problem type. Automatic selection, known as Visual AI Heuristics, is based on optimizing the balance between accuracy and speed. Additionally, when Autopilot concludes, the logic automatically retrains the best Leaderboard model using the EfficientNet-B0-Pruned architecture for an accuracy boost. DataRobot integrates state-of-the-art architectures, allowing you to select the best for your needs. The following table lists the architectures DataRobot supports:

Featurizer Description
Darknet This simple neural network consists of eight 3x3 convolutional blocks with batch normalization, Leaky ReLU activation, and pooling. The channel depth increases by a factor of two with each block. Including a final dense layer, the network has nine layers in total.
Efficientnet-b0, Efficientnet-b4 The fastest network in the EfficientNet family of networks, the b0 model notably outperforms ResNet-50 top-1 and top-5 accuracy on ImageNet while having ~5x fewer parameters. The main building block of the EfficientNet models is the mobile inverted residual bottleneck (MBConv) convolutional block, which constrains the number of parameters. The b4 neural network is likely to be the most accurate for a given dataset. The implementation of the b4 model scales up the width of the network (number of channels in each convolution) by 1.4 and the depth of the network (the number of convolutional blocks) by 1.8, providing a more accurate and slower model than b0, with results comparable to ResNeXt-101 or PolyNet. EfficientNet-b4, while it takes longer to run, can deliver significant accuracy increases.
Preresnet10 Based on ResNet, except within each residual block the batch norm and ReLU activation happen before rather than after the convolutional layer. This implementation of the PreResNet architecture has four PreRes blocks with two convolutional blocks each, which yield 10 total layers when including an input convolutional layer and output dense layer. The model's computational complexity should scale linearly with the depth of the network, so this model should be about 5x faster than ResNet50. However, because the richness of the features generated can affect the fitting time of downstream modelers like XGB with Early Stopping, the time taken to train a model using a deeper featurizer like ResNet50 could be even more than 5x.
Resnet50 This classic neural network is based on residual blocks containing skip-ahead layers, which in practice allow for very deep networks that still train effectively. In each residual block, the inputs to the block are run through a 3x3 convolution, batch norm, and ReLU activation, twice. That result is added to the inputs to the block, which effectively turns the result into a residual of the layer. This implementation of ResNet has an input convolutional layer, 48 convolutional layers arranged in residual blocks, and a final dense layer, which yield 50 total layers.
Squeezenet The fastest neural network in DataRobot, this network was designed to achieve the accuracy of AlexNet with 50x fewer parameters, allowing for faster training, faster prediction, and a smaller storage size. It is based around the concept of fire modules, consisting of a combination of "squeeze" layers followed by "expand" layers, the purpose of which is to dramatically reduce the number of parameters used while preserving accuracy. This implementation of SqueezeNet v1.1 has an input convolutional layer followed by eight fire modules of three convolutions each, for a total of 25 layers.
Xception This neural network is an improvement in accuracy over the popular Inception V3 network that has comparable speed to ResNet-50 but with better accuracy on some datasets. It saves on parameters by learning spatial correlations separately from cross-channel correlations. The core building block is the depthwise separable convolution (a depthwise convolution + pointwise convolution) with residual layers added (similar to PreResNet-10). This building block aims to "decouple" the learning happening across the spatial dimensions (height and width) from the learning happening across the channel dimensions (depth), so that they are handled in separate parameters whose interaction can be learned from other parameters downstream in the network. Xception has 11 convolutional layers in the "entry flow" where the width and height are reduced and the depth increases, then 24 convolutional layers where the size remains constant for a total of 36 convolutional layers.
MobileNetV3-Small-Pruned MobileNetV3 is the latest in the MobileNet family of neural networks, which are specially designed for mobile phone CPUs and other low-resource devices. It comes in two variants: MobileNetV3-Large for high resource usage and MobileNetV3-Small for low resource usage. MobileNetV3-Small is 6.6% more accurate than the previous MobileNetV2 with the same or better latency. In addition to its lightweight blocks and operations, pruning is applied, resulting in faster feature extraction. This pruned version keeps the same architecture but with a significantly reduced number of layers (~50). Conv2D or DepthwiseConv2D layers followed by BatchNormalization are merged into a single Conv2D layer.
DarkNet-Pruned Based on the same architecture as DarkNet, the pruned version is optimized for inference speed. Conv2D layers followed by BatchNormalization layers are merged into a single Conv2D layer.
EfficientNet-b0-Pruned, EfficientNet-b4-Pruned A modified version of EfficientNet-b0 and -b4, the pruned variant removes the BatchNormalization layer after each Conv2D layer and merges it into the preceding Conv2D layer. This results in a network with fewer layers but the same accuracy, providing faster inference for both CPU and GPU.
EfficientNetV2-S-Pruned EfficientNetV2-S-Pruned is the latest neural network in the EfficientNet family. It combines all previous insights from the EfficientNetV1 models (2019) and applies the new Fused-MBConv approach from Google Neural Architecture Search as follows:
  • Replaces "DepthwiseConv2D 3x3 followed by Conv2D 1x1" with "Conv2D 3x3" (the operation is called Fused-MBConv).
  • Improves training procedures. Models are now pre-trained on ImageNet-21k, which has more than 13M images from 21k+ classes.
In addition, DataRobot applies a layer-reducing "pruning operation", removing the BatchNormalization layers after the Conv2D and DepthwiseConv2D. This results in a network with fewer layers, achieving the same accuracy but providing faster inference for both CPU and GPU.
ResNet50-Pruned The only difference between ResNet50 and ResNet50-Pruned is that the variant removes the BatchNormalization layers after the Conv2D layer and merges them into the preceding Conv2D layer. This results in a network with fewer layers but the same accuracy, providing faster inference for both CPU and GPU.
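The pruned variants above rely on merging a BatchNormalization layer into the preceding convolution. The following is a minimal NumPy sketch of that idea, not DataRobot's implementation: it treats a 1x1 convolution as a matrix multiply, and the function name `fold_batchnorm` and all shapes are assumptions made for illustration.

```python
import numpy as np

def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-3):
    """Fold inference-time batch norm parameters into the preceding layer.

    weights: (in_channels, out_channels) matrix of a 1x1 convolution.
    bias, gamma, beta, mean, var: per-output-channel vectors.
    """
    scale = gamma / np.sqrt(var + eps)
    folded_weights = weights * scale            # broadcasts over output channels
    folded_bias = (bias - mean) * scale + beta
    return folded_weights, folded_bias

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))                     # 5 pixels, 4 input channels
w = rng.normal(size=(4, 3))                     # 3 output channels
b = rng.normal(size=3)
gamma, beta = rng.normal(size=3), rng.normal(size=3)
mean, var = rng.normal(size=3), rng.uniform(0.5, 2.0, size=3)
eps = 1e-3

# Original two-step computation: convolution, then batch norm.
two_layers = gamma * ((x @ w + b) - mean) / np.sqrt(var + eps) + beta

# Single merged layer.
fw, fb = fold_batchnorm(w, b, gamma, beta, mean, var, eps)
one_layer = x @ fw + fb

print(np.allclose(two_layers, one_layer))
```

Because batch norm at inference time is a fixed per-channel scale and shift, the merged layer computes the same outputs with one fewer layer, which is consistent with the "same accuracy, faster inference" claim above.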

Images and neural networks

Featurizers are deep convolutional neural networks made of sequential layers, each layer aggregating information from previous layers. The first layers capture low-level patterns made of a few pixels: points, edges, corners. The next layers capture shapes and textures; final layers capture objects. You can select the level of features you want to extract from the neural network model, tuning and optimizing results, although the more layers you enable, the longer the run time.

Keras models

The Neural Network Visualizer, available from the Leaderboard, illustrates layer connectivity for each layer in a model's neural network. It applies to models where either a preprocessing step is a neural network (like in the case of Visual AI blueprints) or the algorithm making predictions is a Keras model (like with tabular Keras blueprints without images).

All Visual AI blueprints, except the Baseline Image Classifier/Regressor, use Keras for preprocessing images. Some Visual AI blueprints use Keras for preprocessing and another Keras model for making predictions—those blueprints have "Keras" in the name. There are also non-Visual AI blueprints that use Keras for making predictions; those blueprints also have "Keras" in the name.

Convolutional Neural Networks (CNNs)

CNNs are a class of deep learning networks applied to image processing for the purpose of turning image input to machine learning output. (See also the KDnuggets explanation of CNNs.) With CNNs, instead of having all pixels connected to all other pixels, the network only connects pixels within regions, and then regions to other regions. This local connectivity, implemented by convolutional layers that are typically followed by a "rectified linear unit" (ReLU) activation, significantly reduces the number of parameters.
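To make the parameter reduction concrete, the sketch below compares a fully connected layer with a convolutional layer on the same 224x224x3 input. The layer sizes (64 units, 64 filters of size 3x3) are assumptions chosen for illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus biases of a 2D convolution; independent of image size."""
    return kernel_h * kernel_w * in_channels * filters + filters

dense = dense_params(224 * 224 * 3, 64)   # every pixel connected to every unit
conv = conv2d_params(3, 3, 3, 64)         # each filter sees a 3x3 region only

print(f"dense layer: {dense:,} parameters")
print(f"conv layer:  {conv:,} parameters")
```

The convolutional layer shares the same small filters across all image positions, so its parameter count does not grow with image resolution.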


A drawback of CNNs is that they can require millions of rows of labeled data to train accurate models. Additionally, for large images, feature extraction can be quite slow. As the amount of training data and the resolution of the images increase, the computational resources required for training can also increase dramatically.

To address these issues, Visual AI relies on pre-trained networks to featurize images, speeding up processing because there is no need to train deep learning featurizers from scratch. Also, Visual AI requires much less training data: hundreds of images instead of thousands. By combining features from various layers, Visual AI is not limited to using the output of the pre-trained featurizers only, which means the subsequent modeling algorithm (XGBoost, Linear model, etc.) can learn the specificity of the training images. This is DataRobot's application of transfer learning, allowing you to apply Visual AI to any kind of problem. The mathematics of transfer learning also makes it possible to combine image and non-image data in a single project.
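As a rough sketch of how features from several layers can be combined with non-image columns, the example below pools stand-in activation maps into vectors and concatenates them with tabular values. The random arrays, layer shapes, average pooling, and tabular columns are all assumptions for illustration; in practice the activations come from a pre-trained network.

```python
import numpy as np

def global_average_pool(feature_map):
    """Collapse an (H, W, C) feature map into a length-C vector."""
    return feature_map.mean(axis=(0, 1))

rng = np.random.default_rng(0)

# Stand-ins for activations taken at three depths of a pre-trained network.
low_level = rng.normal(size=(56, 56, 64))     # edges, corners
mid_level = rng.normal(size=(14, 14, 256))    # shapes, textures
high_level = rng.normal(size=(7, 7, 1024))    # objects

image_features = np.concatenate(
    [global_average_pool(m) for m in (low_level, mid_level, high_level)]
)

# Hypothetical numeric columns from the same dataset row.
tabular_features = np.array([3.0, 120000.0])

# One row of input for a downstream modeler such as XGBoost.
row = np.concatenate([image_features, tabular_features])
print(row.shape)
```

Since the image ends up as an ordinary numeric vector, it can sit alongside other feature types in the same training row.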


There are two model-specific visualizations available to help understand how Visual AI grouped images and which aspects of the image were deemed most important.

Activation Maps

Activation Maps illustrate which areas of an image the model is paying attention to. They are computed similarly to Feature Impact for numeric and categorical variables, relying on the permutation method and/or SHAP techniques to capture how a prediction changes when data is modified. The implementation itself leverages a modified version of Class Activation Maps, highlighting the regions of interest. Based on Gradient-weighted Class Activation Mapping (Grad-CAM), DataRobot:

  • Takes covariates into account, while traditional activation maps are for “image only” datasets.

  • Calculates the activation map inclusive of the final model. For example, if an image model is connected to XGBoost, the activation maps are inclusive of the XGBoost model adjustments.

  • Scales activation maps to the target. In other words, it is not “the model looked at this region,” but instead “this region influences the target variable.”

See also "Understand your algorithm with Grad-CAM".

These maps are important because they allow you to verify that the model is learning the right information for your use case, does not contain undesired bias, and is not overfitting on spurious details. Furthermore, convolutional layers naturally retain spatial information which is otherwise lost in fully connected layers. As a result, the last convolutional layers have the best compromise between high-level object recognition and detailed spatial information. These layers look for class-specific information in the image. Knowing the importance of each class' activation helps to better understand the deep model's focus.
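For reference, the core of standard Grad-CAM can be sketched in a few lines of NumPy. This is the plain technique only; it does not include the DataRobot modifications listed above (covariates, the downstream model, or target scaling). The inputs are assumed to be the last convolutional layer's activations and the gradients of the class score with respect to them, and the toy values are made up.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Plain Grad-CAM heatmap from (H, W, K) activations and gradients."""
    # Importance of each channel: gradient averaged over spatial positions.
    channel_weights = gradients.mean(axis=(0, 1))              # shape (K,)
    # Weighted sum of channels at every spatial position.
    cam = np.tensordot(feature_maps, channel_weights, axes=([2], [0]))
    cam = np.maximum(cam, 0.0)       # keep only regions that raise the score
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy example: channel 0 supports the class, channel 1 works against it.
feature_maps = np.zeros((7, 7, 2))
feature_maps[2, 3, 0] = 1.0
feature_maps[5, 5, 1] = 1.0
gradients = np.ones((7, 7, 2)) * np.array([1.0, -1.0])

heatmap = grad_cam(feature_maps, gradients)
print(heatmap[2, 3], heatmap[5, 5])
```

The heatmap keeps the spatial layout of the convolutional layer, which is why it can be upsampled and overlaid on the original image.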

Image Embeddings

At the input layer, classes are quite tangled and less distinct. Visual AI uses the last layer of a pre-trained neural network (because the last layer of the network represents a high-level overview of what the network knows about forming complex objects). This layer produces a new representation in which the classes are much more separated, allowing them to be projected into a two-dimensional space, defined by similarity, and inspected with the Image Embeddings tab. DataRobot uses TriMap, a state-of-the-art unsupervised dimensionality reduction approach, for its image embedding implementation.

Image embeddings are about projection. Starting from the very high-dimensional space the images exist in (224x224x3, or 150,528 dimensions), DataRobot projects them into 2D space. Their proximity, while dependent on the data, can potentially aid in outlier detection.
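The idea of projecting high-dimensional features into 2D can be sketched as follows. Note that this example swaps in PCA (via NumPy's SVD) as a simple stand-in; it is not TriMap and will not reproduce DataRobot's embeddings. The feature vectors are synthetic and the function name `project_to_2d` is an assumption.

```python
import numpy as np

def project_to_2d(features):
    """Project (n_samples, n_features) data onto its top two principal axes."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)

# Synthetic stand-ins for last-layer feature vectors of two image classes.
class_a = rng.normal(loc=0.0, size=(20, 128))
class_b = rng.normal(loc=5.0, size=(20, 128))
features = np.vstack([class_a, class_b])

points = project_to_2d(features)
print(points.shape)
```

In the resulting 2D points, images with similar feature vectors land close together, so a point far from every cluster is a candidate outlier.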

Updated April 2, 2024