Document AI insights
DataRobot provides a variety of visualizations to help you better understand document data and the models built from it. The following insights are available:
| Stage | Insight | Description |
|-------|---------|-------------|
| Prior to modeling | AI Catalog Profile tab | Preview dataset column names and row data. |
| Prior to modeling | Data Quality Assessment (DQA) | After EDA1, use the DQA to find potential issues with the modeling data. |
| After modeling | Data page | Understand how DataRobot processed document features for modeling. |
| After modeling | Cluster Insights* | Show how text (of type document) is clustered, which can capture latent features or identify segments of content. |
| After modeling | Document Insights | Show extracted text from documents. Note that while you will see the document text for each row selected, and can get a preview of each feature, the highlighting that accompanies Text Explanations is not available. |
| After modeling | Word Cloud* | Display the most relevant words and short phrases found in the project's document features. |
| After modeling | Lift Chart | View bin data for actual and predicted values of the target. |
| After modeling | Blueprint | View the text extraction process represented as part of the model blueprint. |

\* These insights work similarly to DataRobot's handling of text features, with minor differences.
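To illustrate what "bin data for actual and predicted values" means, here is a minimal sketch of one common approach: sort rows by predicted value, split them into equal-sized bins, and average each bin. This is an illustration only, not DataRobot's implementation.

```python
def bin_actual_vs_predicted(actual, predicted, n_bins=10):
    """Sort rows by prediction, split into bins, and average each bin.

    Simplified sketch of actual-vs-predicted bin data; not DataRobot's
    implementation.
    """
    pairs = sorted(zip(predicted, actual))
    size = len(pairs) / n_bins
    bins = []
    for i in range(n_bins):
        chunk = pairs[round(i * size):round((i + 1) * size)]
        bins.append({
            "predicted": sum(p for p, _ in chunk) / len(chunk),
            "actual": sum(a for _, a in chunk) / len(chunk),
        })
    return bins

# Toy data: 100 rows with an increasing prediction and a binary target.
predicted = [i / 100 for i in range(100)]
actual = [round(p) for p in predicted]
bins = bin_actual_vs_predicted(actual, predicted, n_bins=5)
```

Plotting the per-bin averages of `actual` against `predicted` shows how well the predictions track reality across the range of scores.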
The Document Insights tab provides document-specific visualizations to help you see and understand the unique nature of a document's text elements. It lets you compare the rendered pages of a document with its extracted text. There are several components to the screen:
| Component | Description |
|-----------|-------------|
| Class filters | Sets the display to match the classes selected by the filters. Both actual and predicted filter values are applied to the display as an AND condition. |
| Extraction task | Identifies the task used in the text extraction process. |
| High-level page preview | Scroll through or select the PDF documents that are used in the model. Click an entry to change the middle and right columns to reflect that document. |
| Mid-level page view | Shows the content of the selected document, page by page, highlighting the areas that were extracted as text. Use the arrows below the page (if present) to cycle through the pages. |
| Detailed page view | Shows the individual extracted text rows. |
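The actual and predicted class filters combine as a logical AND: a row is displayed only if its actual label and its predicted label are both among the selected classes. A minimal sketch of that behavior (field names here are hypothetical, not DataRobot code):

```python
def filter_rows(rows, actual_classes, predicted_classes):
    """Keep rows whose actual AND predicted labels are both selected.

    Hypothetical sketch of AND-style class filtering; the row fields
    are illustrative, not DataRobot's data model.
    """
    return [
        r for r in rows
        if r["actual"] in actual_classes and r["predicted"] in predicted_classes
    ]

rows = [
    {"doc": "a.pdf", "actual": "invoice", "predicted": "invoice"},
    {"doc": "b.pdf", "actual": "invoice", "predicted": "receipt"},
    {"doc": "c.pdf", "actual": "receipt", "predicted": "receipt"},
]
# Only a.pdf has both labels in the selected sets.
kept = filter_rows(rows, {"invoice"}, {"invoice"})
```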
This insight is useful for double-checking which information DataRobot extracted from the document and whether you selected the correct task. For example, if you see that the information from an image is not available, and you need the text from within that image, you can then retry with the OCR task.
To use the insight:

- Click a high-level page preview (1) to select a page. The mid-level and detailed page views update to reflect the selected page.
- Select an individual line in the mid-level preview (2) and:
    - Use the zoom in/zoom out controls to change the view.
    - Use pagination for documents with more than one page.
    - Notice that the line is highlighted in the detailed page view.
- Select a line in the detailed page view to see the corresponding text highlighted in the mid-level page view.
Document AI also supports Cluster Insights. For each cluster based on document features, DataRobot displays the ngrams for features in the document column. Each ngram is listed according to importance. In the example below, the insight shows:

- Previews of the images in the cluster. Hover to enlarge an image.
- Ranked importance of the ngrams found. Hover on a feature for more details of its use within the document.
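As a rough illustration of ranking ngrams, the sketch below scores word bigrams by plain frequency across a set of documents. Frequency is a stand-in assumption here; this page does not specify DataRobot's actual importance measure.

```python
from collections import Counter

def top_ngrams(texts, n=2, k=5):
    """Count word n-grams across documents and rank by frequency.

    Plain-count sketch; DataRobot's actual importance measure is not
    documented here.
    """
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts.most_common(k)

# Toy extracted text from three documents in one cluster.
docs = [
    "total amount due on invoice",
    "invoice total amount due today",
    "amount due by end of month",
]
ranked = top_ngrams(docs, n=2, k=3)
```

Here `ranked` lists the bigrams most characteristic of the cluster, with "amount due" first because it appears in every document.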
The Tesseract OCR engine may not recognize documents with very small text (some footnotes, for example). If that happens and the text is necessary for model accuracy, use Advanced Tuning to manually set model parameters.

When the Tesseract OCR task is present, a Resolution option becomes available through this tuning (as does a language option). The resolution sets the DPI used to convert each document page to an image before it is processed with the Tesseract library. A higher value can improve OCR results, but it also extends run times. In other words, if you notice that text is missed (in Document Insights, for example), you can increase the value and compare results.