Document AI insights¶

DataRobot provides a variety of visualizations to help you better understand document features.

Insight	Description
Prior to modeling
AI Catalog Profile tab	Preview dataset column names and row data.
Data Quality Assessment (DQA)	After EDA1, use the DQA to find potential issues with the modeling data.
Post-modeling
Document Insights	Understand how DataRobot processed `document` features for modeling.
Clustering Insights	Show how text (of type `document`) is clustered, which can capture latent features or identify segments of content.
Prediction Explanations*	Show extracted text from documents. Note that while you will see the `document` text for each row selected, and can get a preview of each feature, the highlighting that accompanies Text Explanations is not available.
Word Cloud*	Display the most relevant words and short phrases found in the project's `document` column.
Lift Chart*	View bin data for actual and predicted values of the `document` feature.
Blueprint	View the text extraction process represented as part of the model blueprint.

* These insights work similarly to DataRobot's handling of text features, with minor differences.

Document Insights¶

The Document Insights tab provides document-specific visualizations to help you see and understand the unique nature of a document's text elements. It lets you compare the rendered pages of a document with the text extracted from them. The screen has several components:

	Element	Description
1	Filters	Sets the display to match the classes selected by the filters. Both actual and predicted filter values are applied as an `and` to the display.
2	Task	Identifies the task used in the text extraction process.
3	High-level page preview	Scroll through or select the PDF documents that are used in the model. Click an entry to change the middle and right columns to reflect that text.
4	Mid-level page view	Shows the content of the selected document, page by page, highlighting the areas that were extracted as text. Use arrows below the page (if present) to cycle through the pages.
5	Detailed page view	Shows the individual text rows.
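
The `and` behavior of the filters can be sketched as follows. This is an illustration only, not DataRobot code: the `DocumentRow` class, the `apply_filters` function, and the file names and class labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DocumentRow:
    name: str        # hypothetical document file name
    actual: str      # actual class label
    predicted: str   # class label predicted by the model


def apply_filters(rows, actual=None, predicted=None):
    """Keep only rows that match every filter that is set (logical AND)."""
    return [
        row for row in rows
        if (actual is None or row.actual == actual)
        and (predicted is None or row.predicted == predicted)
    ]


rows = [
    DocumentRow("invoice_01.pdf", actual="invoice", predicted="invoice"),
    DocumentRow("invoice_02.pdf", actual="invoice", predicted="receipt"),
    DocumentRow("receipt_01.pdf", actual="receipt", predicted="receipt"),
]

# Both filters set: only invoices that were predicted as receipts remain.
misclassified = apply_filters(rows, actual="invoice", predicted="receipt")
print([row.name for row in misclassified])
```

Setting both an actual and a predicted value narrows the display to the rows that satisfy both, which is a quick way to isolate one kind of misclassification.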

This insight is useful for double-checking which information DataRobot extracted from the document and whether you selected the correct task. For example, if you see that the information from an image is not available, and you need the text from within that image, you can then retry with the OCR task.

To use the insight:

1. Click a high-level page preview (1) to select a page. The mid-level and detailed pages update to reflect the selected page.
2. Select an individual line in the mid-level preview (2) and:
   - Use the zoom in/zoom out features to change the view.
   - Use pagination for documents with more than one page.
   - Notice that the line is highlighted in the detailed page view.
3. Select a line in the detailed page view.

Clustering Insights¶

Document AI also supports Cluster Insights. For each cluster based on document features, DataRobot displays the ngrams for features in the document column. Each ngram is listed according to importance. In the example below, the insight shows:

- Previews of the images in the cluster. Hover to enlarge the image.
- Ranked importance of the ngrams found. Hover on a feature for more details of its use within the document.
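
To make the idea of ranked ngrams concrete, here is a minimal sketch. It assumes raw frequency as the ranking measure, which is a stand-in: this page does not describe how DataRobot computes ngram importance. The `ngrams` and `rank_ngrams` functions and the sample texts are hypothetical.

```python
from collections import Counter


def ngrams(text, n):
    """Split text on whitespace and return its word n-grams."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rank_ngrams(documents, n=2, top=3):
    """Rank n-grams by raw frequency (an assumed stand-in for importance)."""
    counts = Counter()
    for text in documents:
        counts.update(ngrams(text, n))
    return counts.most_common(top)


# Hypothetical extracted text from three documents in one cluster.
cluster_documents = [
    "payment due within thirty days",
    "payment due on receipt",
    "total payment due today",
]
print(rank_ngrams(cluster_documents, n=2, top=1))
```

Here the bigram "payment due" appears in all three documents, so it ranks first for this cluster.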

Advanced Tuning¶

The Tesseract OCR engine may not recognize very small text in documents (some footnotes, for example). If that happens and the text is necessary for model accuracy, use Advanced Tuning to manually set model parameters.

When the Tesseract OCR task is present, Advanced Tuning exposes a Resolution option (and a language option). The resolution sets the DPI (dots per inch) used to convert each document page to an image before it is processed with the Tesseract library. A higher value can improve OCR results but extends run times. If you notice that text is missed (in Document Insights, for example), increase the value and compare the results.
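
The trade-off can be seen with some simple arithmetic, sketched below. It relies only on the fact that a typographic point is 1/72 of an inch; the DPI values, the 6 pt footnote, and the US Letter page size are illustrative assumptions, not DataRobot defaults, and the function names are hypothetical.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 of an inch


def text_height_px(font_size_pt, dpi):
    """Approximate height in pixels of text rendered at the given DPI."""
    return font_size_pt * dpi / POINTS_PER_INCH


def page_size_px(width_in, height_in, dpi):
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# Illustrative values: a 6 pt footnote on a US Letter page (8.5 x 11 in).
for dpi in (150, 300, 600):
    width, height = page_size_px(8.5, 11, dpi)
    print(f"{dpi} DPI: footnote ~{text_height_px(6, dpi)} px tall, "
          f"page {width} x {height} = {width * height} pixels")
```

Doubling the resolution doubles the pixel height of small text, which gives the OCR engine more detail to work with, but it also quadruples the number of pixels per page, which is consistent with the longer run times noted above.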
