

Data Quality Assessment

The Data Quality Assessment capability automatically detects and surfaces common data quality issues and often handles them with minimal or no action on the part of the user. The assessment saves time finding and addressing issues and also provides transparency into automated data processing by showing which automated handling has been applied. It includes a warning level to help determine issue severity.

As part of EDA1, DataRobot runs checks on features that don’t require date/time and/or target information. Once EDA2 starts, DataRobot runs additional checks. In the end, the following checks are run:

Additionally, for time series projects:

Once EDA1 completes, the Data Quality Assessment appears just above the feature listing on the Data page.

In addition to the baseline data quality assessment, DataRobot provides additional detail for time series and Visual AI projects. Once model building completes, if your organization has uncensored blueprints you can view the Data Quality Handling Report for additional imputation information.

For more information, refer to the following reference material:

Overview

The Data Quality Assessment provides information about data quality issues that are relevant to your stage of model building. It first runs as part of EDA1 (data ingest) and reports on the All Features list. It runs again and updates after EDA2, displaying information for the selected feature list (or, by default, All Features). For checks that are not applicable to individual features (for example, Inconsistent Gaps), the report provides a general summary. Click View Info to view (and then Close Info to dismiss) the report:

Each data quality check provides issue status flags, a short description of the issue, and a recommendation message, if appropriate:

  • Warning: Attention or action required

  • Informational: No action required

  • No issue

Because the results are feature-list based, changing the selected feature list on the Data page can cause new checks to appear in the assessment or current checks to disappear from it. For example, if feature list List 1 contains a feature named problem, which contains outliers, the outliers check will show in the assessment. If you change to List 2, which does not include problem (or any other feature with outliers), the outliers check will report "no issue".

From within the assessment modal, you can filter by issue type to see which features triggered the checks. Toggle on Show only affected features and check boxes next to the check names to select which checks to display:

On the Data page, DataRobot then displays only the features in the selected feature list that violate the selected data quality checks. Hover on an icon for more detail:

Explore the assessment

Once EDA1 completes and you have, perhaps, filtered the display, view the list of features impacted by the issues you are interested in investigating. To see the values that triggered a warning or information notification, expand a feature and review the Histogram and Frequent Values visualizations.

Interpret the Histogram tab

Use the Histogram chart, which "buckets" numeric feature values into equal-sized ranges to show a rough distribution of the variable, to visualize outliers. A box plot above the chart graphically displays the middle quartiles for a group of data. It is useful for helping to determine whether a distribution is skewed and/or whether the dataset contains a problematic number of outliers. Initially, the display shows the bucketed data:

Check Show outliers to calculate and then display outliers:

Note the change in the X-axis scale and compression of the box plot to allow for outlier display. Because there tend to be fewer rows recording an outlier value (it's what makes them outliers), the blue bar may not display. Hover on that column to display a tooltip with the actual row count.

Interpret Frequent Values

The Frequent Values chart, in addition to showing common values, reports inliers, disguised missing values, and excess zeros.

Time series assessment details

Time series projects run all the baseline data quality checks as well as checks for:

Visual AI assessment details

When EDA1 completes for a Visual AI project, the Data Quality Assessment runs the same baseline checks and an additional missing image check:

The summary provides an indication of the number of missing images and how DataRobot handled them. Click Preview Log for a more detailed view:

In this example, row 1 reports that a referenced file name did not exist in the uploaded file (1). Row 2 reports that a row was missing an image path (2). The log provides both the nature of the issue and the row in which the problem occurred. The log previews up to 100 rows; choose Download to export the log and view additional rows.

More info...

The following sections provide:

Quality check descriptions

The sections below detail the checks DataRobot runs for the potential data quality issues. The table that follows summarizes this information.

Outliers

Outliers, the observation points that lie far from the sample mean, may be the result of data variability. DataRobot automatically creates blueprints that handle outliers. Each blueprint applies an appropriate method for handling outliers, depending on the modeling algorithm used in the blueprint. For linear models, DataRobot adds a binary column inside of a blueprint to flag rows with outliers. Tree models handle outliers automatically.

How they are detected: DataRobot uses its own implementation of Ueda's algorithm for automatic detection of discordant outliers.

How they are handled: The data quality tool checks for outliers; to view outliers use the feature's histogram.

Multicategorical format errors

Multilabel modeling is a classification task that allows each row to contain one, several, or zero labels. To create a training dataset that can be used for multilabel modeling, you must follow the requirements for multicategorical features.

How they are detected: From a sampling of 100 random rows, DataRobot checks every feature that might qualify as multicategorical, looking for at least one value with the proper multicategorical format. If found, each row is checked to determine whether it complies with the multicategorical format. If there is at least one row that does not, the "multicategorical format error" is reported for the feature. The logic for the check is:

  • Value must be a valid JSON.
  • Value must represent a list of non-empty strings.
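The two format rules above can be sketched as a small validator. This is only an illustration of the stated logic, not DataRobot's implementation: the function names are hypothetical, the 100-row sampling step is omitted, and treating an empty list as valid is an assumption based on rows being allowed zero labels.

```python
import json


def is_valid_multicategorical(value):
    """Return True if value is valid JSON that encodes a list of non-empty strings."""
    try:
        parsed = json.loads(value)
    except (TypeError, ValueError):
        # Not a string at all, or not parseable as JSON.
        return False
    if not isinstance(parsed, list):
        return False
    # An empty list passes (assumption: a row may carry zero labels).
    return all(isinstance(item, str) and item != "" for item in parsed)


def has_multicategorical_format_error(values):
    """Report an error only when some values look multicategorical and others do not."""
    valid = [is_valid_multicategorical(v) for v in values]
    return any(valid) and not all(valid)
```

For example, `has_multicategorical_format_error(['["a", "b"]', 'oops'])` reports an error, while a column in which no value parses as a list of strings is simply not treated as a multicategorical candidate.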

How they are handled: A selection of errors is reported to the data quality tool. If a feature has a multicategorical format error, it is not detected as multicategorical. View the assessment log for details of the error:

Inliers

Inliers are values that are neither above nor below the range of common values for a feature but are anomalously frequent compared to nearby values (for example, 55555 as a zip code value, entered by people who don't want to disclose their real zip code). If not handled, they could negatively affect model performance.

How they are detected: For each value recorded for a feature, DataRobot computes the value's frequency for that feature and makes an array of the results. Inlier candidates are the outliers in that array. To reduce false positives, DataRobot then applies another condition, keeping as inliers only those values for which:

frequency > 50 * (number of non-missing rows in the feature) / (number of unique non-missing values in the feature)

The algorithm allows inlier detection in numeric features with many unique values where, due to the number of values, inliers wouldn’t be noticeable in a histogram plot. Note that this is a conservative approach for features with a smaller number of unique values. Additionally, it does not detect inliers in features with fewer than 50 unique values.
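As a rough illustration, the frequency condition can be written in a few lines of Python. The sketch applies only the threshold formula above; the first stage (finding outliers in the frequency array) is not reproduced. The function name and the use of `None` for missing values are assumptions made for the example.

```python
from collections import Counter


def values_passing_inlier_threshold(values, factor=50):
    """Return values whose frequency exceeds factor * (non-missing rows) / (unique values)."""
    non_missing = [v for v in values if v is not None]  # None stands in for "missing"
    counts = Counter(non_missing)
    if not counts:
        return []
    threshold = factor * len(non_missing) / len(counts)
    return sorted(v for v, count in counts.items() if count > threshold)
```

With fewer than 50 unique values the threshold is larger than the number of non-missing rows, so no value can pass, which matches the limitation noted above.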

How they are handled: A binary column is automatically added inside of a blueprint to flag rows with inliers. This allows the model to incorporate possible patterns behind abnormal values. No additional user action is required.

Excess zeros

Repeated zeros in a column could be regular values but could also represent missing values. For example, sales could be zero for a given item either because there was no demand for the item or due to no stock. Using 0s to impute missing values is often suboptimal, potentially leading to decreased model accuracy.

How they are detected: Using the array described in inliers, if the frequency of the value 0 is an outlier, DataRobot flags the feature.

How they are handled: A binary column is automatically added inside of a blueprint to flag rows with excess zeros. This allows the model to incorporate possible patterns behind abnormal values. No additional user action is required.

Disguised missing values

A "disguised missing value" is the term applied to a situation when a value (for example, -999) is inserted to encode what would otherwise be a missing value. Because machine learning algorithms do not handle them automatically, these values could negatively affect model performance if not handled.

How they are detected: DataRobot finds values that both repeat with greater frequency than other values and are also detected outliers. To be considered a disguised missing value, repeated outliers must meet one of the following heuristics:

  • All digits in the value are the same and repeat at least twice (e.g., 99, 88, 9999).
  • The value begins with 1 and is then followed by two or more zeros.
  • The value is equal to -1, 98, or 97.
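The three pattern heuristics can be sketched as follows for integer values. This covers only the pattern match, not the requirement that the value be a repeated outlier. The function name is hypothetical, and two choices are assumptions: the sign is ignored for the digit patterns (so that -999, the example given earlier, matches), and "begins with 1" is read as a 1 followed only by zeros.

```python
import re

SAME_DIGITS = re.compile(r"(\d)\1+")     # e.g. 99, 88, 9999
ONE_THEN_ZEROS = re.compile(r"10{2,}")   # e.g. 100, 1000
SPECIAL_VALUES = (-1, 98, 97)


def matches_disguised_missing_pattern(value):
    """Return True if an integer value matches one of the three heuristics."""
    if value in SPECIAL_VALUES:
        return True
    digits = str(abs(value))  # assumption: the sign is ignored for digit patterns
    return bool(SAME_DIGITS.fullmatch(digits) or ONE_THEN_ZEROS.fullmatch(digits))
```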

How they are handled: Disguised missing values are handled in the same way as standard missing values: a median value is imputed and a binary column flags the rows where imputation occurred.

Target leakage

The goal of predictive modeling is to develop a model that makes accurate predictions on new data, unseen during training. Because you cannot evaluate the model on data you don’t have, DataRobot estimates model performance on unseen data by saving off a portion of the historical dataset to use for evaluation.

A problem can occur, however, if the dataset uses information that is not known until the event occurs, causing target leakage. Target leakage refers to a feature whose value cannot be known at the time of prediction (for example, using the value for “churn reason” from the training dataset to predict whether a customer will churn). Including the feature in the model’s feature list would incorrectly influence the prediction and can lead to overly optimistic models.

How they are detected: DataRobot checks for target leakage during EDA2 by calculating ACE importance scores (Gini Norm metric) for each feature with regard to the target. Features that exceed the moderate-risk (0.85) threshold are flagged; features exceeding the high-risk (0.975) threshold are removed.

How they are handled: If the advanced option for leakage removal is enabled (which it is by default), DataRobot automatically creates a feature list (Informative Features - Leakage Removed) that removes the high-risk problematic columns. Moderate-risk features are marked with a yellow warning to alert you that you may want to investigate further.
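The threshold logic can be illustrated with a short sketch. It assumes the importance scores have already been computed and are passed in as a plain dictionary; the function names and the sample feature names are hypothetical, and "exceeds" is read as strictly greater than.

```python
MODERATE_RISK = 0.85
HIGH_RISK = 0.975


def leakage_risk(gini_norm):
    """Classify one feature's importance score against the two thresholds."""
    if gini_norm > HIGH_RISK:
        return "high"
    if gini_norm > MODERATE_RISK:
        return "moderate"
    return "none"


def leakage_removed_features(scores):
    """Keep every feature except the high-risk ones (moderate-risk features stay)."""
    return [name for name, score in scores.items() if leakage_risk(score) != "high"]


scores = {"churn_reason": 0.99, "tenure": 0.90, "age": 0.30}
kept = leakage_removed_features(scores)  # "churn_reason" is dropped
```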

After DataRobot detects leakage and creates Informative Features - Leakage Removed, it behaves according to the Advanced Option “Run Autopilot on feature list with target leakage removed” setting. If enabled (the default):

  • Quick, Full, or Comprehensive Autopilot: DataRobot runs the newly created feature list unless you specified a user-created list. To run on one of the other default lists, rebuild models after the initial build with any list you select.
  • Manual mode: DataRobot makes the list available so that you can apply it, at your discretion, from the Repository.
  • The target leakage list will be available when adding models after the initial build.

If disabled, DataRobot applies the above to Informative Features (with potential target leakage remaining) or any user-created list you specified.

Pre-derived lagged feature

When a time series project starts, DataRobot automatically creates multiple date/time-related features, like lags and rolling statistics. There are times, however, when you do not want to automate time-based feature engineering (for example, if you have extracted your own time-oriented features and do not want further derivation performed on them). In this case, you should flag those features as Excluded from derivation or Known in advance. The “Lagged feature” check helps to detect whether features that should have been flagged were not, which would lead to duplication of columns.

How they are detected: DataRobot compares each non-target feature with target(t-1), target(t-2) ... target(t-8).
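A minimal sketch of this comparison for a single series is shown below. It assumes both columns are lists of equal length, sorted by time, and that a lag must match exactly on every overlapping row; the function name and those matching rules are assumptions for illustration, not a description of DataRobot's internals.

```python
def detect_target_lag(feature, target, max_lag=8):
    """Return the lag k (1..max_lag) where feature[t] == target[t - k] on all overlapping rows."""
    for k in range(1, max_lag + 1):
        if len(target) <= k:
            break  # no overlapping rows left to compare
        if all(feature[t] == target[t - k] for t in range(k, len(target))):
            return k
    return None


target = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
lag_two = [None, None, 1, 2, 3, 4, 5, 6, 7, 8]  # target shifted by two steps
detected = detect_target_lag(lag_two, target)
```

On very short series, large lags leave only a few overlapping rows, so an exact-match rule like this one can produce false positives.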

How they are handled: All features detected as lags are automatically set as excluded from derivation to prevent "double derivation." Best practice suggests reviewing other uploaded features and setting all pre-derived features as “Excluded from derivation” or “Known in advance”, if applicable.

Irregular time steps

The “inconsistent gaps” check is flagged when a time series model has irregular time steps. These gaps cause inaccurate rolling statistics. Some examples:

  • Transactional data is not aggregated for a time series project and raw transactional data is used.

  • Transactional data is aggregated into a daily sales dataset, and dates with zero sales are not added to the dataset.

How they are detected: DataRobot detects when there are expected timestamps missing.

It is important to understand that gaps could be consistent (for example, no sales for each weekend). DataRobot accounts for that and only detects inconsistent or unexpected gaps.

How they are handled: Because their inclusion is not good for rolling statistics, if greater than 20% of expected time steps are missing, the project runs in row-based mode (i.e., a regular project with out-of-time validation (OTV)). If that is not the intended behavior, make corrections in the dataset and recreate the project.
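The 20% rule can be illustrated with a simplified sketch. It assumes a known, fixed step and a non-empty list of timestamps, and it treats every absent step as unexpected; it does not account for consistent gaps such as weekends. The function names are hypothetical.

```python
from datetime import date, timedelta


def missing_step_fraction(timestamps, step):
    """Fraction of expected, regularly spaced timestamps that are absent."""
    present = set(timestamps)
    current, end = min(present), max(present)
    expected = []
    while current <= end:
        expected.append(current)
        current += step
    missing = [t for t in expected if t not in present]
    return len(missing) / len(expected)


def exceeds_gap_threshold(timestamps, step, threshold=0.20):
    """True when more than `threshold` of the expected time steps are missing."""
    return missing_step_fraction(timestamps, step) > threshold


days = [date(2021, 1, d) for d in range(1, 11)]
sparse = [d for d in days if d.day not in (3, 5, 7)]  # 3 of 10 daily steps missing
```

Here `exceeds_gap_threshold(sparse, timedelta(days=1))` is true because 30% of the expected daily steps are absent.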

Leading or trailing zeros

Just as for excess zeros, this check works to detect zeros that are used to fill in missing values. It works for the special case where 0s are used to fill in missing values in the beginning or end of series that started later or finished earlier than others.

How they are detected: DataRobot estimates a total rate for zeros in each series and performs a statistical test to identify the number of consecutive zeros that cannot be considered a natural sequence of zeros.
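One simple way to picture such a test is sketched below. The actual statistical test is not specified here, so this substitutes a naive one: it assumes zeros occur independently at the series' overall zero rate, computes the probability of a run of that length, and flags the run when that probability is below 5%. The function names and the independence assumption are illustrative only.

```python
def edge_zero_run_lengths(series):
    """Count consecutive zeros at the start and at the end of a series."""
    leading = 0
    for value in series:
        if value != 0:
            break
        leading += 1
    trailing = 0
    for value in reversed(series):
        if value != 0:
            break
        trailing += 1
    return leading, trailing


def flag_edge_zeros(series, alpha=0.05):
    """Flag leading/trailing zero runs that are unlikely under the overall zero rate."""
    zero_rate = sum(1 for v in series if v == 0) / len(series)
    leading, trailing = edge_zero_run_lengths(series)
    return {
        # Probability of k consecutive zeros under independence is zero_rate ** k.
        "leading": leading > 0 and zero_rate ** leading < alpha,
        "trailing": trailing > 0 and zero_rate ** trailing < alpha,
    }
```

A series that starts with ten zeros and is almost entirely non-zero afterwards would be flagged for leading zeros, while a series that alternates between zero and non-zero values would not.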

How they are handled: If that is not the intended behavior, make corrections in the dataset and recreate the project.

Infrequent negative values

Data with excess zeros in the target can be modeled with a special two-stage model for zero-inflated cases. This model is only available when the minimum value of the target is zero (that is, a single negative value will invalidate its use). In sales data, for example, this can happen when returns are recorded along with sales. This data quality check identifies a negative value when two-stage models are appropriate and provides a warning to correct the target if the desire is to enable zero-inflated modeling and other additional blueprints.

How they are detected: If DataRobot detects that fewer than 2% of values are negative, it treats the project as zero-inflated.
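The 2% rule amounts to a simple proportion check, sketched here. The function name is hypothetical, and requiring at least one negative value before warning is an assumption drawn from the description of the check.

```python
def has_infrequent_negatives(target, threshold=0.02):
    """True when some, but fewer than `threshold`, of the target values are negative."""
    negatives = sum(1 for value in target if value < 0)
    return 0 < negatives / len(target) < threshold


sales = [-1] + [0] * 50 + [1] * 49  # one return among 100 rows
warn = has_infrequent_negatives(sales)
```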

How they are handled: DataRobot surfaces a warning message.

New series in validation

Depending on the project settings (training and validation partition sizes), a multiseries project might be configured so that a new series is introduced at the end of the dataset and therefore isn't part of the training data. For example, this could happen when a new store opens. This check returns an information message indicating that the new series is not within the training data.

How they are detected: DataRobot checks whether more than 20% of series are new (meaning that they are not in the training data).
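The check can be sketched as a set comparison of series identifiers. The function names are hypothetical, and measuring the share against all series in the project (rather than only those in the validation partition) is an assumption, because the text does not say which denominator is used.

```python
def new_series_fraction(training_ids, validation_ids):
    """Share of all series that appear in validation but never in training."""
    train, validation = set(training_ids), set(validation_ids)
    all_series = train | validation
    return len(validation - train) / len(all_series)


def has_many_new_series(training_ids, validation_ids, threshold=0.20):
    """True when more than `threshold` of series are new in validation."""
    return new_series_fraction(training_ids, validation_ids) > threshold
```

For instance, with training series a, b, c, d and validation series a, e, f, two of six series are new, so the informational message would apply.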

How they are handled: DataRobot surfaces an informational message.

Missing images

When an image dataset is used to build a Visual AI project, the CSV contains paths to images contained in the provided ZIP archive. These paths can be missing, refer to an image that does not exist, or refer to an invalid image. A missing path is not necessarily an issue as a row could contain a variable number of images or simply not have an image for that row and column.

How they are detected: DataRobot checks each image path provided to ensure it refers to an image that exists and is valid.

How they are handled: For paths that fail to resolve, DataRobot attempts to find the intended image and replace the problematic path. In the event that an auto-correction is not possible, the problematic path is removed. If the image was invalid, the path is removed.

All missing images, paths that fail to resolve (even when automatically fixed), and invalid images are logged and available for viewing.

Data quality check logic summary

The following table summarizes the logic behind each data quality check:

| Check / Run | Detection logic | Handling | Reported in... |
|---|---|---|---|
| Outliers / EDA2 | Ueda's algorithm | Linear: Flag added to feature in blueprint. Tree: Handled automatically | Data > Histogram |
| Multicategorical format error / EDA1 | Meets any of the following 3 conditions: (1) value is not valid JSON; (2) value does not represent a list; (3) list entry contains an empty string | Feature is not identified as multicategorical | Data Quality Assessment log |
| Disguised Missing Values / EDA1 | Meets the following 3 conditions: (1) value is an outlier; (2) frequency is an outlier; (3) value matches one of these patterns: is 2 or more digits and all digits are the same; begins with “1”, followed by multiple zeros; -1, 98, or 97 | Median imputed; flag feature in blueprint | Data > Frequent Values |
| Inliers / EDA1 | Value is not an outlier; frequency is an outlier | Flag added to feature in blueprint | Data > Frequent Values |
| Excess zeros / EDA1 | Frequency is an outlier; value is 0 | Flag added to feature in blueprint | Data > Frequent Values |
| Inconsistent gaps / EDA2 | Irregular time steps | Model runs in a row-based mode | Message in the time-aware modeling configuration |
| Target leakage / EDA2 | Importance score for each feature, calculated using Gini Norm metric. Threshold levels for reporting are moderate-risk (0.85) or high-risk (0.975). | High-risk leaky features excluded from Autopilot (using "Leakage Removed" feature list) | Data page; optionally, filter by issue type |
| Pre-derived lagged features / EDA2 | Features equal to target(t-1), target(t-2) ... target(t-8) | Excluded from derivation | Data page; optionally, filter by issue type |
| Leading/trailing zeros / EDA2 | For series starting/ending with 0, compute probability of consecutive 0s; flag series with <5% probability | User correction | Data page; optionally, filter by issue type |
| Missing images / EDA1 | Empty cell, missing file, broken link | Links are fixed automatically | Assessment preview log |
| New series in validation / EDA1 | More than 20% of series not seen in training data | User correction | Informational message |
| Infrequent negative values / EDA1 | Fewer than 2% of values are negative | User correction | Warning message |

Considerations

Consider the following when working with data assessments:

  • For disguised missing values, inlier, and excess zero issues, automated handling is only enabled for linear, Keras, and Vowpal Wabbit blueprints, where they have proven to reduce model error. Detection is applied to all blueprints.

  • There is currently no control over imputation (you cannot disable automated handling).

  • A public API is not yet available.
  • Automated feature engineering runs on raw data (instead of removing all excess zeros and disguised missing values before calculating rolling averages).

Updated November 23, 2021