# Document AI insights

> Document AI insights - Use the Document AI visualizations to better understand the information
> contained in your documents.

This Markdown file sits beside the HTML page at the same path (with a `.md` suffix). It summarizes the topic and lists links for tools and LLM context.

Companion generated at `2026-04-24T16:03:56.604398+00:00` (UTC).

## Primary page

- [Document AI insights](https://docs.datarobot.com/en/docs/classic-ui/modeling/special-workflows/doc-ai/doc-ai-insights.html): Full documentation for this topic (HTML).

## Sections on this page

- [Document Insights](https://docs.datarobot.com/en/docs/classic-ui/modeling/special-workflows/doc-ai/doc-ai-insights.html#document-insights): In-page section heading.
- [Clustering Insights](https://docs.datarobot.com/en/docs/classic-ui/modeling/special-workflows/doc-ai/doc-ai-insights.html#clustering-insights): In-page section heading.
- [Advanced Tuning](https://docs.datarobot.com/en/docs/classic-ui/modeling/special-workflows/doc-ai/doc-ai-insights.html#advanced-tuning): In-page section heading.

## Related documentation

- [Classic UI documentation](https://docs.datarobot.com/en/docs/classic-ui/index.html): Linked from this page.
- [Modeling](https://docs.datarobot.com/en/docs/classic-ui/modeling/index.html): Linked from this page.
- [Specialized workflows](https://docs.datarobot.com/en/docs/classic-ui/modeling/special-workflows/index.html): Linked from this page.
- [Document AI](https://docs.datarobot.com/en/docs/classic-ui/modeling/special-workflows/doc-ai/index.html): Linked from this page.
- [Profile](https://docs.datarobot.com/en/docs/classic-ui/data/ai-catalog/catalog-asset.html#asset-details): Linked from this page.
- [Data Quality Assessment](https://docs.datarobot.com/en/docs/classic-ui/modeling/special-workflows/doc-ai/doc-ai-ingest.html#data-quality): Linked from this page.
- [Prediction Explanations](https://docs.datarobot.com/en/docs/classic-ui/modeling/analyze-models/understand/pred-explain/predex-text.html): Linked from this page.
- [Word Cloud](https://docs.datarobot.com/en/docs/classic-ui/modeling/analyze-models/understand/word-cloud-classic.html): Linked from this page.
- [Lift Chart](https://docs.datarobot.com/en/docs/classic-ui/modeling/analyze-models/evaluate/lift-chart-classic.html#lift-chart-drill-down): Linked from this page.
- [blueprint](https://docs.datarobot.com/en/docs/api/reference/public-api/blueprints.html): Linked from this page.
- [Cluster Insights](https://docs.datarobot.com/en/docs/classic-ui/modeling/analyze-models/understand/cluster-insights-classic.html): Linked from this page.
- [Advanced Tuning](https://docs.datarobot.com/en/docs/classic-ui/modeling/analyze-models/evaluate/adv-tuning.html): Linked from this page.

## Documentation content

# Document AI insights

DataRobot provides a variety of visualizations to help you better understand `document` features.

| Insight | Description |
| --- | --- |
| Prior to modeling |  |
| AI Catalog Profile tab | Preview dataset column names and row data. |
| Data Quality Assessment (DQA) | After EDA1, use the DQA to find potential issues with the modeling data. |
| Post-modeling |  |
| Document Insights | Understand how DataRobot processed document features for modeling. |
| Clustering Insights | Show how text (of type document) is clustered, which can capture latent features or identify segments of content. |
| Prediction Explanations* | Show extracted text from documents. You can see the document text for each selected row and preview each feature, but the highlighting that accompanies Text Explanations is not available. |
| Word Cloud* | Display the most relevant words and short phrases found in the project's document column. |
| Lift Chart* | View bin data for actual and predicted values of the document feature. |
| Blueprint | View the text extraction process represented as part of the model blueprint. |

\* These insights work similarly to DataRobot's handling of `text` features, with minor differences.

## Document Insights

The Document Insights tab provides `document`-specific visualizations to help you see and understand the unique nature of a document's text elements. It lets you compare the rendered pages of a document with the text extracted from them. The screen has several components:

|  | Element | Description |
| --- | --- | --- |
| (1) | Filters | Sets the display to match the classes selected by the filters. When both actual and predicted filters are set, they are combined with a logical AND: a document is shown only if it matches both. |
| (2) | Task | Identifies the task used in the text extraction process. |
| (3) | High-level page preview | Scroll through or select the PDF documents that are used in the model. Click an entry to update the middle and right columns to reflect that document. |
| (4) | Mid-level page view | Shows the content of the selected document, page by page, highlighting the areas that were extracted as text. Use arrows below the page (if present) to cycle through the pages. |
| (5) | Detailed page view | Shows the individual text rows. |
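The AND behavior described in the Filters row can be sketched in a few lines of Python. This is only an illustration of the filter logic, not DataRobot code: the `DocRow` class, the `filter_rows` function, and the sample class labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocRow:
    """Hypothetical stand-in for one document row in the insight."""
    name: str
    actual: str
    predicted: str


def filter_rows(rows, actual=None, predicted=None):
    """Keep rows that match every filter that is set (logical AND).

    A filter left as None is treated as "not set" and matches everything.
    """
    return [
        r for r in rows
        if (actual is None or r.actual == actual)
        and (predicted is None or r.predicted == predicted)
    ]


rows = [
    DocRow("a.pdf", "invoice", "invoice"),
    DocRow("b.pdf", "invoice", "receipt"),
    DocRow("c.pdf", "receipt", "receipt"),
]

# Both filters set: only rows matching actual AND predicted remain.
misclassified = filter_rows(rows, actual="invoice", predicted="receipt")
print([r.name for r in misclassified])  # ['b.pdf']
```

Setting both filters to different classes, as in the last lines, is a quick way to isolate documents the model misclassified.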

This insight is useful for double-checking which information DataRobot extracted from the document and whether you selected the correct task. For example, if you see that the information from an image is not available, and you need the text from within that image, you can then retry with the OCR task.

To use the insight:

1. Click a high-level page preview (1) to select a page. The mid-level and detailed pages update to reflect the selected page.
2. Select an individual line in the mid-level preview (2) and:
3. Select a line in the detailed page view

## Clustering Insights

Document AI also supports [Cluster Insights](https://docs.datarobot.com/en/docs/classic-ui/modeling/analyze-models/understand/cluster-insights-classic.html). For each cluster based on `document` features, DataRobot displays the ngrams for features in the document column, listed in order of importance. The insight shows:

- Previews of the images in the cluster. Hover to enlarge the image.
- Ranked importance of the ngrams found. Hover on a feature for more details of its use within the document.
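For readers unfamiliar with ngrams, the following sketch shows what they are and how a ranked list for one cluster might look. It assumes whitespace tokenization and uses raw frequency as a stand-in for importance; how DataRobot actually tokenizes text and scores importance is not described here, and the function names and sample texts are hypothetical.

```python
from collections import Counter


def ngrams(text, n):
    """Return the word n-grams of a text as space-joined strings."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rank_ngrams(docs, max_n=2, top=3):
    """Count 1..max_n-grams across a cluster's documents; return the most frequent.

    Frequency is an illustrative proxy for importance, not DataRobot's measure.
    """
    counts = Counter()
    for doc in docs:
        for n in range(1, max_n + 1):
            counts.update(ngrams(doc, n))
    return counts.most_common(top)


# Hypothetical extracted text for the documents in one cluster.
cluster_docs = [
    "payment due on receipt",
    "payment due in thirty days",
    "late payment fee applies",
]

for gram, count in rank_ngrams(cluster_docs):
    print(f"{gram!r}: {count}")
```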

## Advanced Tuning

The Tesseract OCR engine may not recognize very small text in documents (some footnotes, for example). If that happens and the text is necessary for model accuracy, use [Advanced Tuning](https://docs.datarobot.com/en/docs/classic-ui/modeling/analyze-models/evaluate/adv-tuning.html) to manually set model parameters.

When the Tesseract OCR task is present, a `Resolution` option becomes available through this tuning (as does a language option). The resolution, expressed in dots per inch (DPI), is the value used to convert each document page to an image before it is processed with the Tesseract library. A higher value can improve OCR results, but it also extends run times. For example, if you notice in Document Insights that text is missed, you could increase the value and compare results.
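The tradeoff can be made concrete with some simple arithmetic. The sketch below assumes a US Letter page (8.5 x 11 inches) and 6-point footnote text, and uses the standard definition of 72 typographic points per inch. The DPI values are examples only, not DataRobot defaults, and the function names are hypothetical.

```python
POINTS_PER_INCH = 72  # standard typographic points per inch


def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def text_height_pixels(font_size_pt, dpi):
    """Approximate height in pixels of text of a given point size."""
    return font_size_pt * dpi / POINTS_PER_INCH


# Example values: US Letter page with 6 pt footnote text.
for dpi in (150, 300, 600):
    width, height = page_pixels(8.5, 11, dpi)
    print(
        f"{dpi} DPI: {width} x {height} px "
        f"({width * height:,} pixels), "
        f"6 pt text is about {text_height_pixels(6, dpi):.1f} px tall"
    )
```

Doubling the resolution doubles the pixel height of small text, giving the OCR engine more detail to work with, but it quadruples the number of pixels per page, which is consistent with the longer run times noted above.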
