Modeling algorithms

DataRobot supports a comprehensive library of pre- and post-processing (modeling) steps, which combine to make up the model blueprint. Which steps are run, or made available in the model repository, depends on the dataset. Combining these steps in many different ways allows DataRobot to build a Leaderboard of your best modeling options. Examples of this flexibility include logistic regression with and without PCA as a pre-processor, and random forests with and without a greedy search for interaction terms.
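
The "with and without" idea can be sketched in a few lines of plain Python. This is an illustration only, not DataRobot's API: the function `make_variants` and the step names are hypothetical, and no model is actually fitted. Each optional pre-processing step is switched on or off, giving one blueprint per combination.

```python
from itertools import product

def make_variants(model_name, optional_steps):
    """Return one blueprint (ordered list of step names) per on/off
    combination of the optional pre-processing steps."""
    variants = []
    for mask in product([False, True], repeat=len(optional_steps)):
        steps = [step for step, enabled in zip(optional_steps, mask) if enabled]
        variants.append(steps + [model_name])
    return variants

# Hypothetical step names, mirroring the examples in the paragraph above.
logreg_variants = make_variants("logistic_regression", ["pca"])
forest_variants = make_variants("random_forest", ["interaction_search"])

for blueprint in logreg_variants + forest_variants:
    print(" -> ".join(blueprint))
```

With one optional step there are two variants per model; with two optional steps there would be four.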

As a result, for every model in the list below, DataRobot likely runs two to five variants, each with different pre-processing and/or variable selection. The following sections list the relevant algorithms:

Pre-processing tasks

Categorical

  • Bühlmann credibility estimates for high-cardinality features
  • Categorical embedding
  • Category count
  • One-hot encoding
  • Ordinal encoding of categorical variables
  • Univariate credibility estimates with L2
  • Efficient, sparse one-hot encoding for extremely high cardinality categorical variables
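
As a rough illustration of two of the encodings above, the following sketch shows ordinal encoding and a sparse form of one-hot encoding in plain Python. It is not DataRobot's implementation: the function names are hypothetical, and assigning codes in order of first appearance is an assumption made for simplicity.

```python
def ordinal_encode(values):
    """Map each distinct category to an integer code.
    Assumption: codes are assigned in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes

def sparse_one_hot(values):
    """Return the (row, column) coordinates of the 1-entries and the
    category-to-column mapping. Only non-zero cells are stored, so memory
    grows with the number of rows, not rows x distinct categories."""
    encoded, codes = ordinal_encode(values)
    coords = list(enumerate(encoded))
    return coords, codes

colors = ["red", "green", "red", "blue"]
encoded, codes = ordinal_encode(colors)
coords, _ = sparse_one_hot(colors)
```

Storing coordinates instead of a dense matrix is what makes one-hot encoding practical when a column has many thousands of distinct values.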

Numerical

  • Binning of numerical variables
  • Constant splines
  • Missing values imputed
  • Numeric data cleansing
  • Partial Principal Components Analysis
  • Truncated Singular Value Decomposition
  • Normalizer
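
To make two of these numeric steps concrete, here is a minimal sketch of missing-value imputation and binning. The choices shown (median fill, an added missing-value indicator, equal-width bins) and the function names are assumptions for illustration; the source does not specify which strategies DataRobot applies.

```python
from statistics import median

def impute_with_indicator(values):
    """Replace None with the median of the observed values and return a
    0/1 indicator marking which entries were imputed."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return imputed, indicator

def equal_width_bins(values, n_bins):
    """Assign each value a 0-based index into n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant columns
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [22, None, 35, 58, None, 41]
imputed, was_missing = impute_with_indicator(ages)
bins = equal_width_bins(imputed, n_bins=3)
```

Keeping the indicator column lets a downstream model learn from the fact that a value was missing, which matters when data is not missing at random.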

Geospatial

  • Geospatial Location Converter
  • Spatial Neighborhood Featurizer

Images

  • Greyscale Downscaled Image Featurizer
  • No Post Processing
  • OpenCV detect largest rectangle
  • OpenCV image featurizer
  • Pre-trained multi-level global average pooling image featurizer

Text models

  • Character / word n-grams
  • Pretrained byte-pair encoders (best of both worlds for char-grams and n-grams)
  • Stopword removal
  • TF-IDF scaling (optional sublinear scaling and binormal separation scaling)
  • Hashing vectorizers for big data
  • Cosine similarity between pairs of text columns (on datasets with 2+ text columns)
  • Support for all languages, including English, Japanese, Chinese, Korean, French, Spanish, Portuguese, Arabic, Ukrainian, Klingon, Elvish, Esperanto, etc.
  • Unsupervised fastText models
  • Linear n-gram models (character/word n-grams + TF-IDF + penalized linear/logistic regression)
  • SVD n-gram models (n-grams + TF-IDF + SVD)
  • Naive Bayes weighted SVM
  • TinyBERT / RoBERTa / MiniLM embedding models
  • Text CNNs
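
Several of the text steps above (word n-grams, TF-IDF scaling, cosine similarity between pairs of text columns) compose naturally. The sketch below shows one way they could fit together in plain Python. It is not DataRobot's implementation: the function names and sample data are hypothetical, tokenization is a simple whitespace split, and the IDF uses the common smoothed form ln((1 + N) / (1 + df)) + 1.

```python
import math
from collections import Counter

def word_ngrams(text, n_max=2):
    """Lowercase, split on whitespace, and emit word 1- to n_max-grams."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def fit_idf(documents):
    """Smoothed inverse document frequency for every n-gram seen in training."""
    n_docs = len(documents)
    df = Counter(g for doc in documents for g in set(word_ngrams(doc)))
    return {g: math.log((1 + n_docs) / (1 + count)) + 1 for g, count in df.items()}

def tfidf(text, idf):
    """Sparse TF-IDF vector as a dict; n-grams unseen during fitting are ignored."""
    counts = Counter(word_ngrams(text))
    return {g: c * idf[g] for g, c in counts.items() if g in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# Two hypothetical text columns from the same dataset.
titles = ["red running shoes", "blue denim jacket"]
descriptions = ["lightweight red running shoes", "warm wool scarf"]

idf = fit_idf(titles + descriptions)
similarities = [cosine(tfidf(t, idf), tfidf(d, idf)) for t, d in zip(titles, descriptions)]
```

The resulting per-row similarity is a single numeric feature: high when the two text columns agree, zero when they share no n-grams.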

Generalized Linear Models

  • NA imputation (methods for missing at random and missing not at random), standardization, ridit transform
  • Search for best transformations
  • Efficient, sparse one-hot encoding for extremely high cardinality categorical variables

Linear or additive models

Generalized Linear Models

  • Penalty: L1 (Lasso), L2 (Ridge), ElasticNet, None (Logistic Regression)
  • Distributions: Binomial, Gaussian, Poisson, Tweedie, Gamma, Huber
  • Special Cases: 2-stage model (Binomial + Gaussian) for zero-inflated regression
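
One common reading of a two-stage model for zero-inflated targets is: stage 1 estimates the probability that the target is non-zero, stage 2 estimates the expected value given that it is non-zero, and the prediction is their product. The sketch below illustrates that structure only. As a loudly stated simplification, per-segment frequencies and means stand in for the binomial and Gaussian models, and all names and data are hypothetical.

```python
from collections import defaultdict

def fit_two_stage(segments, targets):
    """Estimate, per segment, P(y > 0) and E[y | y > 0].
    Simplification: group statistics replace real stage-1/stage-2 models."""
    grouped = defaultdict(list)
    for segment, y in zip(segments, targets):
        grouped[segment].append(y)
    model = {}
    for segment, ys in grouped.items():
        positives = [y for y in ys if y > 0]
        p_nonzero = len(positives) / len(ys)  # stage 1: is the target non-zero?
        mean_positive = sum(positives) / len(positives) if positives else 0.0  # stage 2
        model[segment] = (p_nonzero, mean_positive)
    return model

def predict_two_stage(model, segment):
    """Expected value = P(y > 0) * E[y | y > 0]."""
    p_nonzero, mean_positive = model[segment]
    return p_nonzero * mean_positive

segments = ["a", "a", "a", "a", "b", "b"]
claims = [0.0, 0.0, 0.0, 100.0, 50.0, 150.0]
model = fit_two_stage(segments, claims)
```

Separating the two questions keeps the many zeros from distorting the estimate of how large non-zero values are.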

Support Vector Machines

  • Penalty: L1 (Lasso), L2 (Ridge), ElasticNet, None
  • Kernel: Linear, Nyström RBF, RBF
  • liblinear and libsvm

Generalized Additive Models

  • GAM
  • GA2M

Tree-based models

  • Decision Tree (or CART)
  • Random Forest
  • ExtraTrees (or Extremely Randomized Forests)
  • Gradient Boosted Trees (or GBM: Binomial, Gaussian, Poisson, Tweedie, Gamma, Huber)
  • Extreme Gradient Boosted Trees (or XGBoost: Binomial, Gaussian, Poisson)
  • LightGBM
  • AdaBoost
  • RuleFit

Deep learning and foundational models

  • Keras MLPs with residual connections, adaptive learning rates and adaptive batch sizes
  • Keras self-normalizing MLPs with residual connections
  • Keras neural architecture search MLPs using hyperband
  • DeepCTR
    • Neural Factorization Machines
    • AutoInt
    • Cross Networks
  • Pretrained CNNs for images using foundational models (especially EfficientNet)
    • Manually pruned and optimized for faster inference
  • Pretrained + fine-tuned CNNs for images
  • Image augmentation
  • Pretrained TinyBERT models for text
  • Keras Text CNNs
  • fastText models for text

Time series-specific models

  • LSTMs
  • DeepAR models
  • AutoArima
  • ETS, aka exponential smoothing
  • TBATS
  • Prophet

Unsupervised models

Anomaly detection models

  • Isolation Forest
  • Local Outlier Factor
  • One-Class SVM
  • Double Median Absolute Deviation
  • Mahalanobis Distance
  • Anomaly Detection Blenders
  • Keras Deep Autoencoder
  • Keras Deep Variational Autoencoder
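
Of the detectors above, Double Median Absolute Deviation is simple enough to sketch directly. The idea is to scale each point's distance from the median by the MAD of the side it falls on, so that skewed data is handled more fairly than with a single MAD. The code below is a generic version of the technique, not DataRobot's implementation; the function name, sample data, and the cutoff of 3 are assumptions.

```python
from statistics import median

def double_mad_scores(values):
    """Distance from the median, scaled by the MAD of the side
    (at or below vs. above the median) each value falls on."""
    med = median(values)
    left_mad = median([abs(v - med) for v in values if v <= med])
    right_mad = median([abs(v - med) for v in values if v >= med])
    if left_mad == 0 or right_mad == 0:
        raise ValueError("MAD is zero on one side; scores are undefined")
    scores = []
    for v in values:
        mad = left_mad if v <= med else right_mad
        scores.append(abs(v - med) / mad)
    return scores

readings = [1, 4, 4, 4, 5, 5, 5, 5, 7, 7, 8, 10, 16, 30]
scores = double_mad_scores(readings)
# Assumed cutoff: flag points more than 3 scaled deviations from the median.
outliers = [v for v, s in zip(readings, scores) if s > 3]
print(outliers)  # [1, 16, 30]
```

In this example the left-side MAD is 0.5 and the right-side MAD is 2, so the value 1 is flagged even though it is numerically close to the median, while 10 is not.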

Clustering models

  • K-means
  • HDBSCAN

Other model types

  • Eureqa (proprietary genetic algorithm for symbolic regression)
  • K-Nearest Neighbors (three distances)
  • Partial least squares (used for blenders)
  • Isotonic Regression (used for calibrating predictions from other models)
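
To show how isotonic regression can calibrate predictions, here is a minimal sketch of the pool-adjacent-violators algorithm in plain Python. It is a generic, equal-weight version, not DataRobot's implementation, and the function name and sample data are hypothetical. The input is assumed to be already ordered by the raw model score.

```python
def isotonic_fit(ys):
    """Pool Adjacent Violators: least-squares non-decreasing fit to ys.
    Assumes ys are ordered by the raw score and equally weighted."""
    blocks = []  # each block holds [sum of values, number of values]
    for y in ys:
        blocks.append([y, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

# Hypothetical observed positive rate per score bucket, lowest score first.
observed_rates = [0.1, 0.3, 0.2, 0.4, 0.7, 0.6, 0.9]
calibrated = isotonic_fit(observed_rates)
```

Wherever the observed rates dip (0.3 then 0.2, 0.7 then 0.6), the neighbors are pooled to their average, so a higher raw score never maps to a lower calibrated value.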

Click a blueprint node to access full model documentation. Using Composable ML, you can build models that best suit your needs using built-in tasks and custom Python/R code.


Updated February 5, 2024