Document AI overview
Analysts and data scientists often want to use the information contained in PDF documents to build models. However, manually intensive data preparation is a significant barrier to using documents as a data source. The volume of documents is often too large to read each one or to manually convert them into tabular formats. As a result, the valuable text spread across a large corpus of documents often remains inaccessible.
Document AI provides a way to build models on raw PDF documents without manually intensive data preparation steps. It provides end-to-end support for PDFs whose text is embedded and readily machine-readable, as well as PDFs whose text is scanned:
- DocumentTextExtractor (DTE): Extracts embedded text from a PDF document. Example: You save a document written on your computer as a PDF, then upload it.
- Optical Character Recognition (OCR): Extracts scanned text. Example: You print a document, scan it, and upload it as a PDF. The content is seen as pixels (not as “known” text).
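The difference between the two extraction paths can be illustrated with a minimal sketch. It assumes the embedded text layer of each page has already been read with a PDF library of your choice (not shown). The function name `choose_extraction_path` and the `min_chars` threshold are hypothetical and are not a description of how DataRobot chooses between DTE and OCR internally.

```python
from typing import List


def choose_extraction_path(page_texts: List[str], min_chars: int = 20) -> str:
    """Return "DTE" if the pages carry an embedded text layer, else "OCR".

    page_texts: embedded text per page, as returned by a PDF reader
                (hypothetical input; reading the PDF is out of scope here).
    min_chars:  illustrative threshold, an assumption, not a product setting.
    """
    # Count non-whitespace characters across every page's text layer.
    embedded_chars = sum(len("".join(text.split())) for text in page_texts)
    # A PDF saved from an editor has real text; a scan has little or none.
    return "DTE" if embedded_chars >= min_chars else "OCR"


# A document exported from a word processor exposes its text directly.
print(choose_extraction_path(["Quarterly report", "Revenue grew by 4%."]))
# A scanned printout is only pixels, so its text layer is empty.
print(choose_extraction_path(["", "  "]))
```

A character-count threshold is only one possible heuristic. Real documents can mix embedded and scanned pages, in which case a per-page decision may be more appropriate.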
Document AI works with many project types, including regression, binary and multiclass classification, multilabel, clustering, and anomaly detection. The process extracts content and categorizes it as type document for modeling.
Projects can include not only one or more document features, but also any other feature type that DataRobot supports.
Workflow overview
The Document AI workflow is as follows:
- Create a PDF-based dataset for use in projects via the AI Catalog or local file upload.
- Preview documents for potential data quality issues.
- Build models using the standard DataRobot workflow.
- Evaluate models on the Leaderboard with document-specific insights.
- Select a model to use for making predictions via Make Predictions, the DataRobot API, or batch predictions.
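As a sketch of the first workflow step, the following packages a folder of PDFs into a single ZIP archive with a CSV manifest before upload. The archive layout, the manifest column names (`document`, `label`), and the helper `build_pdf_dataset` are assumptions made for illustration. Check the dataset requirements in the DataRobot documentation before relying on this layout.

```python
import csv
import io
import tempfile
import zipfile
from pathlib import Path
from typing import Dict


def build_pdf_dataset(pdf_dir: Path, labels: Dict[str, str], out_zip: Path) -> int:
    """Zip every PDF in pdf_dir with a manifest.csv; return the PDF count.

    labels maps a PDF file name to its target value (hypothetical column).
    """
    pdfs = sorted(p for p in pdf_dir.iterdir() if p.suffix.lower() == ".pdf")
    manifest = io.StringIO()
    writer = csv.writer(manifest, lineterminator="\n")
    writer.writerow(["document", "label"])  # assumed column names
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as archive:
        for pdf in pdfs:
            arcname = f"documents/{pdf.name}"
            archive.write(pdf, arcname)  # store each PDF under documents/
            writer.writerow([arcname, labels.get(pdf.name, "")])
        archive.writestr("manifest.csv", manifest.getvalue())
    return len(pdfs)


# Usage with throwaway files; the bytes are placeholders, not valid PDFs.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "pdfs"
    source.mkdir()
    (source / "invoice.pdf").write_bytes(b"%PDF-1.4 placeholder")
    (source / "notes.txt").write_text("ignored: not a PDF")
    count = build_pdf_dataset(source, {"invoice.pdf": "finance"}, root / "dataset.zip")
    with zipfile.ZipFile(root / "dataset.zip") as archive:
        names = sorted(archive.namelist())
        manifest_text = archive.read("manifest.csv").decode("utf-8")

print(count, names)
```

Keeping non-document columns in the manifest alongside the document path is one way to combine document features with other feature types in the same dataset.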
Feature considerations
- Time series projects are not supported.