# Document AI overview

> Document AI overview - Read background information and a simplified workflow overview.

This Markdown file sits beside the HTML page at the same path (with a `.md` suffix). It summarizes the topic and lists links for tools and LLM context.

Companion generated at `2026-04-24T16:03:56.604634+00:00` (UTC).

## Primary page

- [Document AI overview](https://docs.datarobot.com/en/docs/classic-ui/modeling/special-workflows/doc-ai/doc-ai-overview.html): Full documentation for this topic (HTML).

## Sections on this page

- [Workflow overview](https://docs.datarobot.com/en/docs/classic-ui/modeling/special-workflows/doc-ai/doc-ai-overview.html#workflow-overview): In-page section heading.
- [Feature considerations](https://docs.datarobot.com/en/docs/classic-ui/modeling/special-workflows/doc-ai/doc-ai-overview.html#feature-considerations): In-page section heading.

## Related documentation

- [Classic UI documentation](https://docs.datarobot.com/en/docs/classic-ui/index.html): Linked from this page.
- [Modeling](https://docs.datarobot.com/en/docs/classic-ui/modeling/index.html): Linked from this page.
- [Specialized workflows](https://docs.datarobot.com/en/docs/classic-ui/modeling/special-workflows/index.html): Linked from this page.
- [Document AI](https://docs.datarobot.com/en/docs/classic-ui/modeling/special-workflows/doc-ai/index.html): Linked from this page.
- [PDF-based dataset](https://docs.datarobot.com/en/docs/classic-ui/modeling/special-workflows/doc-ai/doc-ai-ingest.html): Linked from this page.
- [document-specific insights](https://docs.datarobot.com/en/docs/classic-ui/modeling/special-workflows/doc-ai/doc-ai-insights.html): Linked from this page.
- [making predictions](https://docs.datarobot.com/en/docs/classic-ui/modeling/special-workflows/doc-ai/doc-ai-predictions.html): Linked from this page.

## Documentation content

# Document AI overview

Analysts and data scientists often want to build models from the information contained in PDF documents. However, the manual data preparation this requires is a significant barrier to using documents as a data source. The volume of documents is often too large to read through each one or to manually convert them into tabular formats. When information is spread across a large corpus, the valuable text in those documents is effectively inaccessible.

Document AI provides a way to build models on raw PDF documents without manually intensive data preparation steps. It provides end-to-end support for PDFs, extracting text with one of two methods depending on whether the document contains embedded, machine-readable text or scanned pages:

- DocumentTextExtractor (DTE): Extracts embedded text from a PDF document. Example: You save a document written on your computer as a PDF and then upload it.
- Optical Character Recognition (OCR): Extracts text from scanned documents. Example: You print a document, scan it, and upload it as a PDF. The content is stored as pixels, not as “known” text.

Document AI works with many project types, including regression, binary and multiclass classification, multilabel, clustering, and anomaly detection. The process extracts content and categorizes it as type `document` for modeling.

Projects can include not only one or more `document` features, but also any other feature type that DataRobot supports.

## Workflow overview

The Document AI workflow is as follows:

1. Create a PDF-based dataset for use in projects via the AI Catalog or local file upload.
2. Preview documents for potential data quality issues.
3. Build models using the standard DataRobot workflow.
4. Evaluate models on the Leaderboard with document-specific insights.
5. Select a model to use for making predictions via Make Predictions, the DataRobot API, or batch predictions.
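
As an illustration of step 1, the following minimal sketch bundles a folder of PDFs into a single archive for upload. This overview does not specify the upload format, so the ZIP layout is an assumption; check the PDF-based dataset page for the supported structure. The function name `build_pdf_archive` and the folder names in the usage note are hypothetical, and the sketch uses only the Python standard library.

```python
import zipfile
from pathlib import Path


def build_pdf_archive(source_dir: Path, archive_path: Path) -> list[str]:
    """Zip every PDF found under source_dir, keeping relative paths.

    Assumption: the dataset is uploaded as one ZIP archive of PDFs.
    Returns the archive member names in sorted order.
    """
    # Collect PDFs first so the archive itself is never picked up.
    pdf_paths = sorted(
        p for p in source_dir.rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    if not pdf_paths:
        raise ValueError(f"no PDF files found under {source_dir}")

    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for pdf_path in pdf_paths:
            # Store paths relative to source_dir so subfolders are preserved.
            archive.write(pdf_path, arcname=pdf_path.relative_to(source_dir).as_posix())

    with zipfile.ZipFile(archive_path) as archive:
        return sorted(archive.namelist())
```

For example, `build_pdf_archive(Path("contracts"), Path("contracts.zip"))` would write an archive that mirrors the subfolder structure of a hypothetical `contracts` directory and skip any non-PDF files in it.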

## Feature considerations

- Time series projects are not supported.