
OCR for unstructured data

DataRobot’s OCR (optical character recognition) addresses one of the most significant bottlenecks in agentic workflows: the reliable transformation of messy unstructured data such as scanned PDFs, PowerPoints, and documents into AI-consumable formats. By automating the extraction of document hierarchies, tables, and figures without custom parsing code or brittle scripts, OCR lets AI agents process real-world enterprise information at scale. Preserving document structure is critical for retrieval-augmented generation (RAG) pipelines because it prevents the "flattening" of data, which often leads to silent errors, inaccurate grounding, and loss of context during reasoning.
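To make the "flattening" problem concrete, here is a minimal sketch that contrasts a flattened string with section-aware chunks. The element list and its `type` and `text` fields are hypothetical and chosen only for illustration; they are not the documented output schema of the OCR service.

```python
from typing import Dict, List

# Hypothetical parsed elements. The field names ("type", "text") are
# illustrative assumptions, not a documented schema.
elements: List[Dict[str, str]] = [
    {"type": "title", "text": "Q3 Results"},
    {"type": "text", "text": "Revenue grew in all regions."},
    {"type": "title", "text": "Risks"},
    {"type": "text", "text": "Supply constraints remain."},
    {"type": "table", "text": "Region | Revenue\nEMEA | 10"},
]


def flatten(items: List[Dict[str, str]]) -> str:
    """Join all text into one string, discarding headings and boundaries."""
    return " ".join(item["text"] for item in items)


def chunk_by_section(items: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Group body elements under the most recent title element."""
    chunks: List[Dict[str, object]] = []
    current: Dict[str, object] = {"heading": None, "body": []}
    for item in items:
        if item["type"] == "title":
            # Close the previous section before starting a new one.
            if current["heading"] is not None or current["body"]:
                chunks.append(current)
            current = {"heading": item["text"], "body": []}
        else:
            current["body"].append(item["text"])
    # Emit the final section if it holds anything.
    if current["heading"] is not None or current["body"]:
        chunks.append(current)
    return chunks
```

With the flattened string, a retriever cannot tell that the table belongs to the "Risks" section; the section-aware chunks keep that association, which is the context a RAG pipeline needs for grounding.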

With advanced parsing capabilities that integrate directly into the REST ecosystem, developers can move from manual document preparation to production-ready agent deployment in a fraction of the time. The integration provides broad format coverage and standardized JSON outputs that flow seamlessly into DataRobot’s orchestration, evaluation, and monitoring tools. This unified approach eliminates the need for fragmented "glue code" and separate parsing tools, allowing organizations to build, operate, and govern sophisticated agents that can reason over complex knowledge bases with higher confidence and reduced maintenance overhead.

Perform OCR

To perform OCR with unstructured data, use DataRobot's Python client:

import datarobot as dr

from datarobot import OCREngineSpecificParameters, OCRJobResource
from datarobot import OCRJobDatasetLanguage

dr_client = dr.Client(
    token='YOUR TOKEN',
    endpoint='YOUR ENDPOINT'
)

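# Upload the archive of documents to process
file_path = 'OCR_files_integration/demo_files.zip'
f = dr.Files.upload(file_path, use_archive_contents=True)
# Assumption: the uploaded file object exposes its catalog ID as `.id`
input_file_id = f.id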

resp = dr_client.post(
    url='ocrJobResources/',
    data={
        'dataset_id': input_file_id,
        'language': 'ENGLISH',
        'engine_specific_parameters': {'engine_type': 'ARYN', 'output_format': 'JSON'},  # output_format can be 'JSON' or 'MARKDOWN'
    }
)

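# Read the job resource ID and the output catalog ID from the response
job_info = resp.json()
job_resource_id = job_info['id']
output_file_id = job_info['outputCatalogId']

# Fetch the job resource and start the OCR job
job_resource = OCRJobResource.get(job_resource_id)
start_resp = job_resource.start_job()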

output_file = dr.Files.get(output_file_id)
output_file.list_contained_files()  # view all the files inside

output_file.download('demo_files/file1.arynjson', file_path='output_file1.json')
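
Once the JSON output is downloaded, a quick inventory of what the parser found can help confirm that tables and headings survived extraction. The sketch below is a minimal example under stated assumptions: it assumes the file is either a top-level list of elements or an object with an `elements` list, and that each element carries a `type` field. Both assumptions, and the helper name `summarize_elements`, are hypothetical; check the actual output file before relying on them.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_elements(path: str) -> Counter:
    """Count parsed elements by type in a downloaded OCR JSON file.

    Assumed (not documented) layout: a list of element objects, or an
    object with an "elements" list; each element may have a "type" key.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        elements = data.get("elements", [])
    else:
        elements = data
    # Elements without a "type" key are counted as "unknown".
    return Counter(
        element.get("type", "unknown")
        for element in elements
        if isinstance(element, dict)
    )


# Example call, using the file name from the download step above:
# summarize_elements('output_file1.json')
```

If the counts show no table or heading elements for a document that clearly contains them, that is a signal to inspect the raw output before feeding it into a RAG pipeline.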