Document AI insights

DataRobot provides a variety of visualizations to help better understand document features.

| Insight | Description |
|---|---|
| Prior to modeling | |
| AI Catalog Profile tab | Preview dataset column names and row data. |
| Data Quality Assessment (DQA) | After EDA1, use the DQA to find potential issues with the modeling data. |
| Document Insights | Understand how DataRobot processed document features for modeling. |
| Clustering Insights | Show how text (of type document) is clustered, which can capture latent features or identify segments of content. |
| Prediction Explanations* | Show extracted text from documents. Note that while you will see the document text for each row selected, and can get a preview of each feature, the highlighting that accompanies Text Explanations is not available. |
| Word Cloud* | Display the most relevant words and short phrases found in the project's document column. |
| Lift Chart* | View bin data for actual and predicted values of the document feature. |
| Blueprint | View the text extraction process represented as part of the model blueprint. |

* These insights work similarly to DataRobot's handling of text features, with minor differences.

Document Insights

The Document Insights tab provides document-specific visualizations to help you see and understand the unique nature of a document's text elements. It lets you compare the rendered pages of a document with the text extracted from them. The screen has several components:

| | Element | Description |
|---|---|---|
| 1 | Filters | Sets the display to match the classes selected by the filters. When both actual and predicted filter values are set, they are combined with a logical AND, so only documents matching both are displayed. |
| 2 | Task | Identifies the task used in the text extraction process. |
| 3 | High-level page preview | Scroll through or select the PDF documents that are used in the model. Click an entry to change the middle and right columns to reflect that text. |
| 4 | Mid-level page view | Shows the content of the selected document, page by page, highlighting the areas that were extracted as text. Use arrows below the page (if present) to cycle through the pages. |
| 5 | Detailed page view | Shows the individual text rows. |
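
The way the actual and predicted filters combine can be sketched in a few lines of Python. This is only an illustration of the AND behavior described for the Filters element, not DataRobot code: the `DocumentRow` type, the `filter_rows` helper, and the sample file names and classes are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DocumentRow:
    """Hypothetical stand-in for one document shown in the insight."""
    name: str
    actual: str
    predicted: str


def filter_rows(
    rows: List[DocumentRow],
    actual: Optional[str] = None,
    predicted: Optional[str] = None,
) -> List[DocumentRow]:
    """Keep rows that match every filter that is set (logical AND).

    A filter left as None is treated as "no restriction".
    """
    return [
        row
        for row in rows
        if (actual is None or row.actual == actual)
        and (predicted is None or row.predicted == predicted)
    ]


# Made-up sample data for illustration only.
rows = [
    DocumentRow("invoice_01.pdf", actual="invoice", predicted="invoice"),
    DocumentRow("invoice_02.pdf", actual="invoice", predicted="receipt"),
    DocumentRow("receipt_01.pdf", actual="receipt", predicted="receipt"),
]

# Actual = invoice AND predicted = receipt: only the misclassified invoice remains.
misclassified = filter_rows(rows, actual="invoice", predicted="receipt")
print([row.name for row in misclassified])  # ['invoice_02.pdf']
```

Setting both filters to different classes, as in the last call, is a quick way to narrow the display to misclassified documents.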

This insight is useful for double-checking which information DataRobot extracted from the document and whether you selected the correct task. For example, if you see that the information from an image is not available, and you need the text from within that image, you can then retry with the OCR task.

To use the insight:

  1. Click a high-level page preview (1) to select a page. The mid-level and detailed pages update to reflect the selected page.

  2. Select an individual line in the mid-level preview (2) and:

    • Use the zoom in/zoom out features to change the view.
    • Use pagination for documents with more than one page.
    • Notice that the line is highlighted in the detailed page view.

  3. Select a line in the detailed page view

Clustering Insights

Document AI also supports Cluster Insights. For each cluster based on document features, DataRobot displays the ngrams for features in the document column. Each ngram is listed according to importance. In the example below, the insight shows:

  • Previews of the images in the cluster. Hover to enlarge the image.

  • Ranked importance of the ngrams found. Hover on a feature for more details of its use within the document.
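
To make the idea of ranked ngrams concrete, the sketch below extracts 1- and 2-word ngrams from the text of a small, made-up cluster and ranks them. It is not DataRobot's method: the source does not describe how importance is computed, so raw frequency is used here purely as an assumed stand-in, and the `ngrams` and `top_ngrams` helpers are hypothetical.

```python
import re
from collections import Counter
from typing import List, Tuple


def ngrams(text: str, n: int) -> List[str]:
    """Return the word-level ngrams of length n found in text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(docs: List[str], max_n: int = 2, k: int = 3) -> List[Tuple[str, int]]:
    """Rank ngrams (lengths 1..max_n) across docs.

    Assumption: frequency stands in for importance; the real insight
    may use a different measure.
    """
    counts: Counter = Counter()
    for doc in docs:
        for n in range(1, max_n + 1):
            counts.update(ngrams(doc, n))
    return counts.most_common(k)


# Made-up extracted text for the documents in one cluster.
cluster_docs = ["Invoice total due", "Invoice total paid", "Invoice number"]

for ngram, count in top_ngrams(cluster_docs):
    print(f"{ngram}: {count}")
```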

Advanced Tuning

The Tesseract OCR engine may not recognize very small text in a document (some footnotes, for example). If that happens and the text is necessary for model accuracy, use Advanced Tuning to manually set model parameters.

When the Tesseract OCR task is present, Advanced Tuning exposes a Resolution option (as well as a language option). The resolution, expressed in DPI (dots per inch), is the value used to convert document pages to images before they are processed with the Tesseract library. A higher value may improve OCR results, but it also extends run times. For example, if you notice in Document Insights that text was missed, you could increase the value and compare results.
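
The trade-off between resolution and run time follows from simple arithmetic, sketched below. The calculation relies only on the standard definition of a typographic point (1/72 inch). The page size, font size, and DPI values are illustrative assumptions, not DataRobot defaults, and the helper names are hypothetical.

```python
from typing import Tuple

POINTS_PER_INCH = 72  # standard typographic point: 1 pt = 1/72 inch


def text_height_px(font_size_pt: float, dpi: int) -> float:
    """Nominal height in pixels of text of a given point size when rendered at dpi."""
    return font_size_pt * dpi / POINTS_PER_INCH


def page_size_px(width_in: float, height_in: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page rendered at dpi."""
    return round(width_in * dpi), round(height_in * dpi)


# Illustrative values: a US Letter page (8.5 x 11 in) with a 6 pt footnote.
for dpi in (150, 300, 600):
    width, height = page_size_px(8.5, 11, dpi)
    print(
        f"{dpi} DPI: page {width} x {height} px "
        f"({width * height:,} pixels), 6 pt text is {text_height_px(6, dpi)} px tall"
    )
```

Doubling the resolution doubles the pixel height of small text, which gives the OCR engine more detail to work with, but it also quadruples the number of pixels per page, which is consistent with the longer run times noted above.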

Updated February 15, 2024