
Vector database data sources

Generative modeling in DataRobot supports two types of vector databases:

  • Local, "in-house" built vector databases, identified in the application by the DataRobot badge and stored in the Data Registry.
  • External, hosted in the model workshop for validation and registration, and identified as External in the Use Case directory listing.

Dataset requirements

When uploading datasets to create a vector database, the supported formats are .zip and .csv. Two columns are mandatory in the files: document and document_file_path. You can add up to 50 additional metadata columns for use in filtering during prompt queries. Note that for purposes of metadata filtering, document_file_path is displayed as source.
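These requirements can be checked programmatically before upload. The helper below is a minimal sketch, not part of the DataRobot client; the mandatory column names and the 50-column metadata limit come from the description above.

```python
import csv

REQUIRED_COLUMNS = {"document", "document_file_path"}
MAX_METADATA_COLUMNS = 50

def validate_vdb_csv(path):
    """Check that a CSV meets the vector database dataset requirements:
    both mandatory columns present, at most 50 metadata columns."""
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f))
    missing = REQUIRED_COLUMNS - set(header)
    if missing:
        raise ValueError(f"Missing mandatory column(s): {sorted(missing)}")
    metadata = [c for c in header if c not in REQUIRED_COLUMNS]
    if len(metadata) > MAX_METADATA_COLUMNS:
        raise ValueError(
            f"{len(metadata)} metadata columns exceeds the limit of {MAX_METADATA_COLUMNS}"
        )
    return metadata  # names of the optional metadata columns
```

Running this before upload catches a missing mandatory column locally rather than at registration time.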

For .zip files, DataRobot processes the file to create a .csv version that contains a text column (document) with an associated reference ID (document_file_path) column. All content in the text column is treated as strings. The reference ID column is created automatically when the .zip is uploaded. All files should be either in the root of the archive or in a single folder inside the archive; a folder tree hierarchy is not supported.
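To illustrate that processing, the sketch below is a rough local approximation of the conversion the paragraph describes; the real conversion happens server-side, so the file-handling details here are illustrative only.

```python
import csv
import zipfile

def zip_to_vdb_csv(zip_path, csv_path):
    """Approximate the .zip -> .csv conversion described above:
    one row per file, with the file's text in `document` and its
    name in `document_file_path` (shown as `source` when filtering)."""
    with zipfile.ZipFile(zip_path) as zf, open(
        csv_path, "w", newline="", encoding="utf-8"
    ) as out:
        writer = csv.writer(out)
        writer.writerow(["document", "document_file_path"])
        for name in zf.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            text = zf.read(name).decode("utf-8", errors="replace")
            writer.writerow([text, name])  # all content treated as strings
```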

See the considerations for more information on supported file content.

Internal vector databases

Internal vector databases in DataRobot are optimized to maintain retrieval speed while ensuring an acceptable retrieval accuracy. To add data for an internal vector database:

  1. Prepare the data by:

    • Compressing the files that will make up your knowledge source into a single .zip file. You can either select and zip individual files or compress a folder containing all of the files.
    • Preparing a CSV with mandatory document and document_file_path columns as well as up to 50 additional metadata columns. The document_file_path column lists the individual items from the decompressed .zip file; the document column lists the content of each file. For purposes of metadata filtering, document_file_path is displayed as source.
    • Using a previously exported vector database.
  2. Upload the file. You can do this either from:

    • A Workbench Use Case from either a local file or data connection.

    • The AI Catalog from a local file, HDFS, URL, or JDBC data source. DataRobot converts a .zip file to .csv format. Once registered, you can use the Profile tab to explore the data.

Once the data is available on DataRobot, you can add it as a vector database for use in the playground.
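Step 1 above, compressing your knowledge source into a single flat .zip, can be scripted. This sketch writes every file at the archive root, since a folder tree hierarchy is not supported:

```python
import os
import zipfile

def build_flat_knowledge_zip(file_paths, zip_path):
    """Zip the given files into a single archive with no folder
    hierarchy: every entry lands at the archive root."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in file_paths:
            # arcname drops directory components so entries sit at the root
            zf.write(path, arcname=os.path.basename(path))
```

Note that because directory components are dropped, two source files with the same base name would collide; rename them before zipping.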

Export a vector database

You can export a vector database, or a specific version of a database, to the Data Registry for re-use in a different Use Case. To export, open the Vector database tab of your Use Case. Click the Actions menu and select Export latest vector database version to Data Registry.

When you export, you are notified that the job is submitted. If you open the Data tab, you can see the dataset registering for use via the Data Registry. It is also saved to the AI Catalog.

Once registered, you can create a new vector database from this dataset. To do so, from the Add vector database dropdown, select Data > Add data. The Data Registry opens. Click on the newly created dataset.

Notice that each chunk from the vector database is now a dataset row.

You can download the dataset from the AI Catalog, modify it on a chunk level, and then re-upload it, creating a new version or a new vector database.
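Because each dataset row is one chunk, chunk-level edits are just row edits. A minimal sketch, assuming the column names described earlier (the exported dataset may carry additional columns, which are preserved as-is):

```python
import csv

def edit_chunks(in_csv, out_csv, transform):
    """Apply `transform` to the `document` text of every chunk (row),
    writing a new CSV that can be re-uploaded as a new version or a
    new vector database."""
    with open(in_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    for row in rows:
        row["document"] = transform(row["document"])
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```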

External vector databases

The external "bring-your-own" (BYO) vector database capability lets you use your custom model deployments as vector databases for LLM blueprints, backed by your own models and data sources. External vector databases cannot be created via the UI; instead, review the notebook that walks through creating an external vector database using DataRobot's Python client.

Key features of external vector databases:

  • Custom model integration: Incorporate your own custom models as vector databases, enabling greater flexibility and customization.

  • Input and output format compatibility: External BYO vector databases must adhere to specified input and output formats to ensure seamless integration with LLM blueprints.

  • Validation and registration: Custom model deployments must be validated to ensure they meet the necessary requirements before being registered as an external vector database.

  • Seamless integration with LLM blueprints: Once registered, external vector databases can be used with LLM blueprints in the same way as local vector databases.

  • Error handling and updates: The feature provides error handling and update capabilities, allowing you to revalidate or create duplicates of LLM blueprints to address any issues or changes in custom model deployments.

Basic external workflow

The basic workflow, which is covered in depth in this notebook, is as follows:

  1. Create the vector database via the API.
  2. Create a custom model deployment to bring the vector database into DataRobot.
  3. Once the deployment is registered, link to it as part of vector database creation in your notebook.
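The three steps above look roughly like the sketch below. All names here (`client`, `deploy_custom_model`, `create_vector_database`) are hypothetical stand-ins, not the real DataRobot Python client API; consult the notebook for the actual calls and signatures.

```python
def register_external_vector_database(client, model_folder, use_case_id):
    """Hypothetical outline of the BYO vector database workflow.
    `client` stands in for an authenticated DataRobot client session;
    the method names are illustrative placeholders."""
    # Steps 1-2: bring the vector database into DataRobot as a
    # custom model deployment.
    deployment = client.deploy_custom_model(model_folder)
    # Step 3: link the registered deployment during vector database creation.
    return client.create_vector_database(
        use_case_id=use_case_id,
        deployment_id=deployment["id"],
    )
```

Because the function only talks to the `client` object, you can exercise the flow against a stub before wiring it to a live session.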

You can view all vector databases (and associated versions) for a Use Case from the Vector database tab within the Use Case. For external vector databases, you can see only the source type. Because these vector databases aren't managed by DataRobot, other data is not available for reporting.


Updated January 31, 2025