
Create a vector database

GPU usage for Self-Managed users

When working with datasets over 1GB, Self-Managed users who do not have GPU usage configured on their cluster may experience serious delays. Email DataRobot Support, or visit the Support site, for installation guidance.

The basic steps for creating a vector database for use in a playground are to choose a provider, set the basic configuration, and set text chunking.

  1. Choose a provider for the output of the vector database creation process, either DataRobot-resident or a connected source.

  2. Set the basic configuration, including the data source (from the Data Registry or File Registry) and the embedding model.

  3. Set text chunking to control how source documents are split before embeddings are generated.

Use the Vector databases tile in the Use Case directory to manage built vector databases and deployed embedding models. You will ultimately store the newly created vector database (the output of the vector database creation process) within the Use Case. They are stored either as internal vector databases (FAISS) or on connected provider instances.

Add a vector database

First, add a vector database from one of multiple points within the application. Each method opens the Create vector database modal and uses the same workflow from that point.

From within a Use Case, click the Add dropdown and expand Vector database:

  • Choose Create vector database to open the vector database creation modal.
  • Choose Add deployed vector database to add a deployment containing a vector database that you previously registered and deployed.

If there are not yet any assets associated with the Use Case, you can add a vector database from the tile landing page.

From the Data assets tile, open the Actions menu associated with a dataset and select Create vector database.

The Actions menu is only available if the data is detected as eligible, which means:

  1. Processing of the dataset has finished.
  2. The data source has the mandatory document and document_file_path columns.
  3. There are no more than 50 metadata columns.
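These eligibility requirements can be checked locally before you upload a dataset. Below is a minimal sketch with pandas; the file name is hypothetical, and the checks only mirror the rules listed above, not DataRobot's actual validation logic:

```python
import pandas as pd

df = pd.read_csv("knowledge_source.csv")  # hypothetical file name

# Rule 2: the mandatory columns must be present.
required = {"document", "document_file_path"}
missing = required - set(df.columns)
assert not missing, f"Missing mandatory columns: {missing}"

# Rule 3: everything beyond the mandatory columns counts as metadata.
metadata_cols = [c for c in df.columns if c not in required]
assert len(metadata_cols) <= 50, f"Too many metadata columns: {len(metadata_cols)}"
```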

From the Vector databases tile, click the Add dropdown:

  • Choose Create vector database to open the vector database creation modal.
  • Choose Add deployed vector database to add a deployment that contains a vector database that you previously registered and deployed.
  • Choose Add external vector database to begin assembling a custom vector database in the Registry workshop, which can then be linked to the Use Case.

If there are not yet any vector databases associated with the Use Case, the tile landing page will lead you to create one.

When in a playground, use the Vector database tab in the configuration section of the LLM blueprint.

Once you've chosen to add a vector database, start the creation process.

Select a provider

Select a vector database provider, either internal or connected (external) with credentials. This setting determines where the output of the vector database creation process lands. The input to the creation process is always either the Data or File Registry.

| Provider / type | Description | Max size |
|---|---|---|
| DataRobot / resident | An internal FAISS-based vector database, hosted locally in the Data Registry. These vector databases can be versioned and do not require credentials. | 10GB |
| BYO | A deployment containing a vector database that you previously registered and deployed. Use a notebook to bring your own vector database via a custom model deployment. | No constraints, but it must fit within the resource bundle memory of the vector database deployment. |
| Connected | An external vector database that allows you to use your own Pinecone, Elasticsearch, or Milvus instance, with credentials. This option lets you choose where content is stored while still experimenting with RAG pipelines built in DataRobot and leveraging DataRobot's out-of-the-box embedding and chunking functionality. | 100GB |

Use a resident vector database

Using a resident vector database means using data that is accessible within the application, either via the Data Registry or a custom model. Internal vector databases in DataRobot are optimized to maintain retrieval speed while ensuring acceptable retrieval accuracy. See the following for dataset requirements and specific retrieval methods.

Use an external vector database

To use an external (BYO) vector database, use the Add > Add deployed vector database option available from the Vector databases tile. Before adding a vector database this way, develop the vector database externally with the DataRobot Python client, assemble a custom model for it, and then deploy that custom model. See an example using ChromaDB. Note that this vector database type is identified as Read-only connected in the Use Case directory listing.
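The linked example covers the full custom model workflow; as a rough sketch of what an externally developed vector database looks like before it is wrapped in a custom model, here is a minimal ChromaDB snippet (the collection name, documents, and query are placeholders):

```python
import chromadb

client = chromadb.Client()  # in-memory instance; use PersistentClient for disk-backed storage
collection = client.create_collection("docs")  # placeholder collection name

# Index a few chunks; ChromaDB embeds them with its default embedding function.
collection.add(
    ids=["doc-1", "doc-2"],
    documents=["DataRobot supports RAG workflows.", "Vector databases store embeddings."],
    metadatas=[{"source": "guide.txt"}, {"source": "intro.txt"}],
)

# Retrieve the most similar chunk for a query, as the deployed custom model would.
results = collection.query(query_texts=["What stores embeddings?"], n_results=1)
print(results["documents"])
```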

Use a connected vector database

DataRobot allows direct connection to Pinecone, Elasticsearch, or Milvus external data sources for vector database creation. In this case, the data source is stored locally in the Data Registry, configuration settings are applied, and the created vector database is written back to the provider. Select your provider in the Create vector database modal.

To use a provider connection, select the provider and enter authentication information. To use saved credentials for any of the connected providers, simply select the appropriately named credentials from the dropdown. Available credentials are those created and stored in the credential management system.

If you choose to use new credentials, they must first be created in the provider instance. Once entered, they are stored in DataRobot's credential management system for reuse.

Connect to Pinecone

All connection requests to Pinecone must include an API key for connection authentication. If you do not have a Pinecone API key saved to the credential management system, click New credentials. In the API token (API key) field, paste the key you created in the Pinecone console and, optionally, change the display name. Once added, DataRobot saves the Pinecone API key in the credential management system for reuse when working with Pinecone vector databases.

Once the token is entered, select a cloud provider for your Pinecone instance (AWS, Azure, or GCP) and the assigned cloud region.

After selection, the vector database configuration options become available.
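For context, the cloud, region, and metric choices in the modal correspond to the parameters of a Pinecone serverless index. Here is a minimal sketch with the official Pinecone SDK; the index name and dimension are placeholders, and the dimension must match your embedding model's output size:

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")  # the key saved to the credential store

pc.create_index(
    name="datarobot-vdb",  # placeholder index name
    dimension=768,         # must match the embedding model's vector size
    metric="cosine",       # the distance metric (discussed below)
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # the cloud/region selected above
)
```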

Connect to Elasticsearch

All connection requests to the Elasticsearch cluster must include a username and password (basic credentials) or an API key for connection authentication.

If you do not have an Elasticsearch API key saved to the credential management system, or wish to add one, click New credentials. There are two types of credentials available for selection in the modal that appears:

  • Basic: Basic credentials consist of the username and password you use to access the Elasticsearch instance. Enter them here and they will be saved to DataRobot.

  • API key: In the API token (API key) field, paste the key you created, as described in the Elasticsearch documentation.

Once the credential type is selected, optionally change the display name and select a connection method, either Cloud ID (recommended by Elastic) or URL. See the Elasticsearch documentation for information on finding your cloud ID.

After selection, the vector database configuration options become available.
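Both connection methods map directly onto the official Elasticsearch Python client, which may help clarify what DataRobot is asking for. A minimal sketch; the cloud ID, URL, and credential values are placeholders:

```python
from elasticsearch import Elasticsearch

# Option 1: Cloud ID (recommended by Elastic) with an API key.
es = Elasticsearch(cloud_id="my-deployment:abc123==", api_key="YOUR_API_KEY")

# Option 2: URL with basic credentials (username and password).
# es = Elasticsearch("https://my-cluster.example.com:9243", basic_auth=("user", "password"))

print(es.info())  # verify the connection
```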

Connect to Milvus

Milvus, a leading open-source vector database project, is distributed under the Apache 2.0 license. All connection requests to Milvus must include a username and password (basic credentials) or an API token for connection authentication.

If you do not have a Milvus API token saved to the credential management system, or wish to add one, click New credentials. There are two types of credentials available for selection in the modal that appears:

  • Basic: Basic credentials consist of the username and password you use to access the Milvus instance. Enter them here and they will be saved to DataRobot.

  • API token: In the API token (API key) field, paste the key you created, as described in the Milvus documentation.

Once the credential type is selected, optionally change the display name and enter a URI for the Milvus server address.
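The URI-plus-credentials pairing mirrors the pymilvus client. A minimal sketch; the URI and credential values are placeholders:

```python
from pymilvus import MilvusClient

# Basic credentials are passed as a single "user:password" token.
client = MilvusClient(uri="http://my-milvus.example.com:19530", token="user:password")

# A managed cluster's API token is passed the same way.
# client = MilvusClient(uri="https://my-cluster.example.com", token="YOUR_API_TOKEN")

print(client.list_collections())  # verify the connection
```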

Note

When you complete the configuration and create the vector database, it will be available on the Milvus site for your cluster, under the Collections tab. Open the collection and then the Data tab to see the vectors, text, and various metadata.

Set basic configuration

The following table describes the configuration settings used in vector database creation:

| Field | Description |
|---|---|
| Name | The name the vector database is saved with. This name displays in the Use Case Vector databases tile and is selectable when configuring playgrounds. |
| Data source | The dataset used as the knowledge source for the vector database. The list populates based on the data assets associated with the Use Case, if any. If you started vector database creation from the actions menu on the Data assets tile, the field is prepopulated with that dataset. If no associated datasets exist, or none present are applicable, use the Add data option. |
| Attach metadata | The name of the file, in the Data Registry, used to append columns to the vector database to support filtering the citations returned by the prompt query. |
| Distance metric (connected providers only) | The vector similarity metric to use in nearest-neighbor search, ranking a vector's similarity against the query. |
| Embedding model | The model that defines the type of embedding used for encoding data. |

Add a data source

If no data sources are available, or if you want to add new sources, choose Add data in the Data source dropdown. The Add data modal opens. Vector database creation supports ZIP and CSV dataset formats and specific supported file types within the datasets. You can access a supported dataset from either the File Registry or the Data Registry.

| Registry type | Description |
|---|---|
| File | A "general purpose" storage system that can store any type of data. In contrast to the Data Registry, the File Registry does not perform CSV conversion on files uploaded to it. In the UI, vector database creation is the only place where the File Registry is applicable, and it is only accessible via the Add data modal. While any file type can be stored there, the same file types are supported for vector database creation regardless of registry type. |
| Data | In the Data Registry, a ZIP file is converted into a CSV with the content of each member file stored as a row of the CSV. The file path for each file becomes the document_file_path column and the file content (text or base64 encoding of the file) becomes the document column. |

Choose a dataset. Datasets from the Data Registry show a preview of the chunk (document) and the file it was sourced from (document_file_path).
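To make the ZIP-to-CSV conversion concrete, here is a rough sketch of the equivalent transformation. The file name is a placeholder, and DataRobot's actual conversion logic may differ in details:

```python
import base64
import zipfile

import pandas as pd

rows = []
with zipfile.ZipFile("knowledge_source.zip") as zf:  # placeholder file name
    for name in zf.namelist():
        if name.endswith("/"):  # skip directory entries
            continue
        raw = zf.read(name)
        try:
            content = raw.decode("utf-8")  # text files are kept as text
        except UnicodeDecodeError:
            content = base64.b64encode(raw).decode()  # binary files are base64-encoded
        rows.append({"document_file_path": name, "document": content})

df = pd.DataFrame(rows)  # one row per member file, matching the Data Registry layout
```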

Attach metadata

Optionally, you can select an additional file to define the metadata to attach to the chunks in the vector database. That file must reside in the Datasets container of the Data Registry. (It cannot reside in the Files container because the EDA process, which validates that the required columns are present, only runs on assets in the Datasets container.)

The file must contain, at minimum, the document_file_path column. A document column is optional. You can append up to 50 additional columns, which can be used for filtering during prompt queries.

You can upload a single CSV that contains both document and metadata information. Alternatively, you can upload data and metadata as separate files and achieve the same final result.

Note

Vector databases created before the introduction of metadata filtering do not support this feature. To use filtering with them, create a version from the original and configure the LLM blueprint to use the new vector database instead.

Either select an available file, or use the Add data modal to add a metadata file to the Data Registry. In this example, the file has a document_file_path column, which identifies the file or chunk, as well as a variety of other columns that define the metadata.

Once you select a metadata file, and if the dataset also contains metadata, you are prompted to choose whether DataRobot should keep both sets of metadata or overwrite it with the new file. The choice to replace or merge only applies if there are duplicate columns between the dataset and the metadata file; non-duplicate metadata columns from both are always maintained.

Duplicate metadata example

A dataset has col1 and col2. The metadata file has col2 and col3. The metadata going into the vector database is then col1, col2, and col3. Since col2 is a duplicate, DataRobot does one of the following, based on the setting:

  • Replace: DataRobot uses the values from the metadata file.
  • Keep both: DataRobot merges the two columns, with precedence going to the metadata file.
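In pandas terms, the example behaves roughly as follows. The column values are illustrative, and DataRobot's internal merge logic may differ:

```python
import pandas as pd

dataset_meta = pd.DataFrame(
    {"document_file_path": ["a.txt", "b.txt"], "col1": [1, 2], "col2": ["x", "y"]}
)
file_meta = pd.DataFrame(
    {"document_file_path": ["a.txt", "b.txt"], "col2": ["X", None], "col3": [10, 20]}
)

merged = dataset_meta.merge(file_meta, on="document_file_path", suffixes=("_ds", "_file"))

# Replace: the metadata file's col2 wins outright.
merged["col2_replaced"] = merged["col2_file"]

# Keep both: the two columns are merged, with the metadata file taking
# precedence wherever it has a value.
merged["col2_kept"] = merged["col2_file"].fillna(merged["col2_ds"])

print(merged[["document_file_path", "col1", "col2_replaced", "col2_kept", "col3"]])
```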

Set the distance metric

When using a connected provider, you can also set a distance metric, also known as a similarity metric (this setting is not available when DataRobot is selected as the provider). These metrics measure how similar vectors are; selecting the appropriate metric can substantially boost the effectiveness of classification and clustering tasks. Consider choosing the same similarity metric that was used to train the embedding model.

| Provider | Provider documentation |
|---|---|
| Pinecone | Similarity metrics |
| Elasticsearch | Similarity parameter |
| Milvus | Metric types: float |
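The metrics themselves are simple vector operations. A quick numpy sketch of three common choices (the vectors are toy values):

```python
import numpy as np

q = np.array([0.1, 0.7, 0.2])  # query embedding
v = np.array([0.2, 0.6, 0.1])  # stored chunk embedding

cosine = q @ v / (np.linalg.norm(q) * np.linalg.norm(v))  # angle-based similarity
dot_product = q @ v                                       # magnitude-sensitive similarity
euclidean = np.linalg.norm(q - v)                         # a distance: smaller is more similar

print(cosine, dot_product, euclidean)
```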

Set the embedding model

To encode your data, select the embedding model that best suits your Use Case. Use one of the DataRobot-provided embeddings or a BYO embedding. DataRobot supports the following types of embeddings; see the full embedding descriptions here.

| Embedding type | Description |
|---|---|
| cl-nagoya/sup-simcse-ja-base | A medium-sized language model for Japanese RAG. |
| huggingface.co/intfloat/multilingual-e5-base | A medium-sized language model used for multilingual RAG performance across multiple languages. |
| huggingface.co/intfloat/multilingual-e5-small | A smaller language model used for multilingual RAG, with faster performance than multilingual-e5-base. |
| intfloat/e5-base-v2 | A medium-sized language model used for medium-to-high RAG performance. With fewer parameters and a smaller architecture, it is faster than e5-large-v2. |
| intfloat/e5-large-v2 | A large language model designed for optimal RAG performance. It is classified as slow due to its architecture and size. |
| jinaai/jina-embedding-t-en-v1 | A tiny language model pre-trained on an English corpus; it is the fastest, and the default, embedding model offered by DataRobot. |
| jinaai/jina-embedding-s-en-v2 | Part of the Jina Embeddings v2 family, this embedding model is the optimal choice for long-document embeddings (large chunk sizes, up to 8192). |
| sentence-transformers/all-MiniLM-L6-v2 | A small language model fine-tuned on a 1B sentence-pairs dataset that is relatively fast and pre-trained on an English corpus. It is not recommended for RAG, however, as it was trained on old data. |
| Add deployed embedding model | Select a deployed embedding model to use during vector database creation. Column names can be found on the deployment's overview page in the Console. |

The embedding models that DataRobot provides are based on the SentenceBERT framework, providing an easy way to compute dense vector representations for sentences and paragraphs. The models are based on transformer networks (BERT, RoBERTa, T5) trained on a mixture of supervised and unsupervised data, and achieve state-of-the-art performance in various tasks. Text is embedded in a vector space such that similar text is grouped more closely and can efficiently be found using cosine similarity.
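Because the provided models follow the SentenceBERT approach, you can preview their behavior locally with the sentence-transformers library. A minimal sketch using one of the model names from the table above (note that the e5 family expects "query: " and "passage: " prefixes for best results):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-base-v2")  # one of the embeddings listed above

chunks = [
    "passage: Vector databases store embeddings.",
    "passage: The weather is sunny today.",
]
query = "query: Where are embeddings stored?"

chunk_vecs = model.encode(chunks)
query_vec = model.encode(query)

# Cosine similarity: semantically similar text scores higher.
print(util.cos_sim(query_vec, chunk_vecs))
```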

Add a deployed embedding model

While adding a vector database, you can select an embedding model deployed as an unstructured custom model. On the Create vector database panel, click the Embedding model dropdown, then click Add deployed embedding model.

On the Add deployed embedding model panel, configure the following settings:

| Setting | Description |
|---|---|
| Name | Enter a descriptive name for the embedding model. |
| Deployment name | Select the unstructured custom model deployment. |
| Prompt column name | Enter the name of the column containing the user prompt, defined when you created the custom embedding model in the workshop (for example, promptText). |
| Response (target) column name | Enter the name of the column containing the LLM response, defined when you created the custom embedding model in the workshop (for example, responseText or resultText). |

After you configure the deployed embedding model settings, click Validate and add. The deployed embedding model is added to the Embedding model dropdown list.

Chunking settings

Text chunking is the process of splitting a text document into smaller text chunks that are then used to generate embeddings. You can either:

  • Choose Text chunking and further configure how chunks are derived—method, separators, and other parameters.
  • Select No chunking. DataRobot will then treat each row as a chunk and directly generate an embedding on each row.

Chunking method

The chunking method sets how text from the data source is divided into smaller, more manageable pieces. It is used to improve the efficiency of nearest-neighbor searches so that when queried, the database first identifies the relevant chunks that are likely to contain the nearest neighbors, and then searches within those chunks rather than searching the entire dataset.

| Method | Description |
|---|---|
| Recursive | Splits text until chunks are smaller than a specified maximum size, discards oversized chunks, and, if necessary, splits text by individual characters to maintain the chunk size limit. |
| Semantic | Splits larger text into smaller, meaningful units based on the semantic content instead of length (chunk size). It is a fully automatic method, meaning that when it is selected, no further chunking configuration is available—it creates chunks where sentences are semantically "close". See the deep dive below for more information. |
Deep dive: Chunking methods

Recursive text chunking works by recursively splitting text documents according to an ordered list of text separators until a text chunk has a length that is less than the specified maximum chunk size. If the generated chunks are already smaller than the max chunk size, the subsequent separators are ignored. Otherwise, DataRobot applies the list of separators sequentially until chunks are smaller than the max chunk size.

If a generated chunk is still larger than the specified length at the end of this process, it is discarded. In that case, DataRobot uses a "separate each character" strategy to split on each character and then merges consecutive single-character chunks up to the max chunk size limit. If no "split on character" separator is listed, long chunks are cut off; that is, some parts of the text will be missing from embedding generation, but the entire chunk will still be available for document retrieval.
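DataRobot's exact splitter is not exposed, but the behavior described above closely matches a standard recursive character splitter. Here is a sketch using LangChain's implementation; the chunk size, overlap, and separators are illustrative, not DataRobot's defaults:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = (
    "First paragraph about vector databases.\n\n"
    "Second paragraph about embeddings and retrieval.\n\n"
    "Third paragraph about chunking strategies."
)

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # applied in order; "" splits on each character
    chunk_size=60,                       # maximum chunk size (in characters here)
    chunk_overlap=10,
)

for chunk in splitter.split_text(text):
    print(repr(chunk))
```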

Semantic chunking is the process of breaking down a larger piece of text into smaller, meaningful units (or "chunks") based on the semantic content or meaning of the text, rather than arbitrary character or word limits. Instead of splitting the text based solely on length, semantic chunking attempts to keep coherent ideas or topics intact within each chunk. This is useful in natural language processing (NLP), where understanding the meaning and context of the text is important for information retrieval, summarization, or generating embeddings for machine learning models. For example, in a semantic chunking process, paragraphs might be kept together if they discuss the same topic, even if they exceed a specific size limit, ensuring that the chunks represent complete thoughts or concepts.

That said, the DataRobot implementation of semantic chunking for out-of-the-box embedding models automatically detects the maximum supported chunk size of the selected embedding model and uses it as a safety cutoff. This forces a chunk to be cut if it would exceed the embedding model's maximum input length, ensuring that all text is actually embedded. BYO embeddings, which use the default version of the algorithm, do not support the safety cutoff. Their chunks can be of any length, even exceeding the embedding model's maximum input length, which can result in text being cut off (and therefore not embedded). The non-embedded text is still included in the citation returned from the vector database.

Work with separators

Separators are "rules" or search patterns (not regular expressions although they can be supported) for breaking up text by applying each separator, in order, to divide text into smaller components—they define the tokens by which the documents are split into chunks. Chunks will be large enough to group by topic, with size constraints determined by the model’s configuration. Recursive text chunking is the method applied to the chunking rules.

Each vector database starts with four default rules, which define what to split text on:

  • Double new lines
  • New lines
  • Spaces
  • Characters (the empty string)

While these rules use a word to identify them for easy understanding, on the backend they are interpreted as individual strings (i.e., \n\n, \n, " ", "").

There may be cases where none of the separators are present in the document, or there is not enough content to split into the desired chunk size. If this happens, DataRobot applies a "next-best character" fallback rule, moving characters into the next chunk until the chunk fits the defined chunk size. Without this fallback, the embedding model would simply truncate any chunk that exceeds its inherent context size.

Add custom rule

You can add up to five custom separators to apply as part of your chunking strategy. This provides a total of nine separators (when considered together with the four defaults). The following applies to custom separators:

  • Each separator can have a maximum of 20 characters.
  • There is no "translation logic" that allows use of words as a separator. For example, if you want to chunk on punctuation, you would need to add a separator for each type.

  • The order of separators matters. To reorder separators, simply click the cell and drag it to the desired location.

  • To delete separators, whether in fine-tuning your chunking strategy or to free space for additional separators, click the trashcan icon. You cannot delete the default separators.

Use regular expressions

Select Interpret separators as regular expressions to allow regular expressions in separators. It is important to understand that with this feature activated, all separators are treated as regex. This means, for example, that adding "." matches and splits on every character. If you instead want to split on literal dots, you must escape the expression (i.e., "\."). This rule applies to all separators, both custom and predefined (which are configured to act this way).
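The dot-escaping caveat is standard regex behavior, easy to verify with Python's re module:

```python
import re

text = "First sentence. Second sentence."

print(re.split(r".", text))   # "." matches every character, leaving only empty fragments
print(re.split(r"\.", text))  # "\." splits on literal dots only
```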

Chunking parameters

Chunking parameters further define the output of the vector database. The default values for chunking parameters are dependent on the embedding model.

Chunk overlap

Overlapping refers to the practice of allowing adjacent chunks to share some amount of data. The Chunk overlap parameter specifies the percentage of overlapping tokens between consecutive chunks. Overlap is useful for maintaining context continuity between chunks when processing the text with language models, at the cost of producing more chunks and increasing the size of the vector database.
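As a concrete illustration of overlap, consider chunks of four tokens with a 50% overlap. The numbers are made up; real chunk sizes are far larger:

```python
tokens = list(range(10))    # stand-in for ten tokens of text
chunk_size, overlap = 4, 2  # 50% overlap between consecutive chunks

step = chunk_size - overlap
chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens) - overlap, step)]
print(chunks)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```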

Retrieval limits

The value you set for Top K (nearest neighbors) determines how many relevant chunks are retrieved from the vector database and provided to the LLM. Chunk selection is based on similarity scores. Consider:

  • Larger values provide more comprehensive coverage but also require more processing overhead and may include less relevant results.
  • Smaller values provide more focused results and faster processing, but may miss relevant information.
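Top K is a standard nearest-neighbor cutoff. A numpy sketch of what retrieval does with it; the embeddings here are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
chunk_vecs = rng.normal(size=(100, 768))  # 100 stored chunk embeddings
query_vec = rng.normal(size=768)          # the embedded prompt

scores = chunk_vecs @ query_vec           # similarity score of every chunk vs. the query
top_k = 5
best = np.argsort(scores)[-top_k:][::-1]  # indices of the K most similar chunks
print(best, scores[best])
```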

Max tokens specifies:

  • The maximum size (in tokens) of each text chunk extracted from the dataset when building the vector database.
  • The length of the text that is used to create embeddings.
  • The size of the citations used in RAG operations.

Save the vector database

Once the configuration is complete, click Create vector database to make the database available in the playground.

Manage vector databases

The Vector databases tile lists all the vector databases and deployed embedding models associated with a Use Case. Vector database entries include information on the versions derived from the parent; see the section on versioning for detailed information on vector database versioning.

You can view all vector databases (and associated versions) for a Use Case from the Vector database tab within the Use Case. For external vector databases, you can see only the source type. Because these vector databases aren't managed by DataRobot, other data is not available for reporting.

Click any entry in the Vector databases tile listing to open a modal with an expanded view of that database's configuration and related items.

You can:

| | Description |
|---|---|
| 1 | Select a different vector database to explore from the dropdown in the breadcrumbs. |
| 2 | Select a different version of the vector database to explore from the dropdown. When you click a version, the details and reported assets (Related items) update to those associated with the specific version. Learn more about versioning. |
| 3 | Execute a variety of vector database actions. |
| 4 | View the versioning history. When you click a version, the details and reported assets (Related items) update to those associated with the specific version. Learn more about versioning. |
| 5 | View items associated with the vector database, such as the related Use Case, LLM blueprints, and deployed custom and registered models that use the vector database. Click an entity to open it in the corresponding Console tab. |
| 6 | Create a playground that uses the selected database. |

Vector database actions

The actions dropdown allows you to apply an action to the vector database you are viewing. The actions available are slightly different depending on the vector database type and where you are accessing the menu from.

  • Use the Actions menu from the vector database listing accessed from the Vector databases tile.

  • Open a vector database from the Vector databases tile—a modal displays an expanded view of that database's configuration. A menu button, Vector database actions, provides access to a set of actions.

From the Actions menu, you can:

For DataRobot (internal) vector databases:

| Action | Description |
|---|---|
| Export latest vector database version to Data Registry | Exports the most recent version of the selected vector database to the Data Registry for reuse in a different Use Case. |
| Create playground from latest version | Opens a new playground with the vector database loaded into the LLM configuration. |
| Create new vector database version | Creates a new version of the vector database based on the currently selected version. |
| Edit vector database info | Provides an input box for changing the vector database name. |
| Send to the workshop | Sends the vector database to the workshop for modification and deployment. For more information, see register and deploy vector databases. |
| Deploy this version | Deploys this version of the vector database to the selected prediction environment. For more information, see register and deploy vector databases. |
| Delete vector database and all versions | Deletes the parent vector database and all versions. Because the vector databases used by deployments are snapshots, deleting a vector database in a Use Case does not affect the deployments using that vector database. The deployment uses an independent snapshot of the vector database. |

For connected (external provider) vector databases:

| Action | Description |
|---|---|
| Add data | Appends a selected data source to the current vector database source. Data is added from within DataRobot and then written back to the provider. |
| Create playground | Opens a new playground with the vector database loaded into the LLM configuration. |
| Edit vector database info | Provides an input box for changing the vector database name. |
| Send to the workshop | Sends the vector database to the workshop for modification and deployment. For more information, see register and deploy vector databases. |
| Deploy vector database | Deploys the updated vector database to the selected prediction environment. For more information, see register and deploy vector databases. |
| Delete vector database | Deletes the vector database instance. Because the vector databases used by deployments are snapshots, deleting a vector database in a Use Case does not affect the deployments using that vector database. The deployment uses an independent snapshot of the vector database. |

From the Vector database actions dropdown menu, you can:

For DataRobot (internal) vector databases:

| Action | Description |
|---|---|
| Create playground using this version | Opens a new playground with the vector database loaded into the LLM configuration. |
| Create new version from this version | Creates a new version of the vector database based on the currently selected version. |
| Export this version to Data Registry | Exports the current vector database version to the Data Registry for reuse in a different Use Case. |
| Send to the workshop | Sends the vector database to the workshop for modification and deployment. For more information, see register and deploy vector databases. |
| Deploy this version | Deploys this version of the vector database to the selected prediction environment. For more information, see register and deploy vector databases. |
| Delete vector database | Deletes the parent vector database and all versions. Because the vector databases used by deployments are snapshots, deleting a vector database in a Use Case does not affect the deployments using that vector database. The deployment uses an independent snapshot of the vector database. |

For connected (external provider) vector databases:

| Action | Description |
|---|---|
| Create playground using this version | Opens a new playground with the vector database loaded into the LLM configuration. |
| Add data | Appends a selected data source to the current vector database source. Data is added from within DataRobot and then written back to the provider. |
| Export this version to Data Registry | Exports the current vector database version to the Data Registry. It can then be used in different Use Case playgrounds. |
| Send to the workshop | Sends the vector database to the workshop for modification and deployment. For more information, see register and deploy vector databases. |
| Deploy vector database | Deploys the vector database to the selected prediction environment. For more information, see register and deploy vector databases. |
| Edit authentication | Provides a modal where you can change the display name for saved credentials or add new authentication credentials, which are then stored in the credential management system. See the provider-specific credential information above. |
| Delete vector database | Deletes the vector database. Deleting a vector database in a Use Case does not affect the deployments using that vector database because the deployment uses an independent snapshot of the vector database. |


Vector database details

The details section of the vector database expanded view reports information for the selected version, whether you selected the version from the dropdown or the right-hand panel.

  • Basic vector database metadata: ID, creator and creation date, and data source name and size.
  • Chunking configuration settings: Embedding column, chunking method, and settings.
  • Metadata columns: Names of columns from the data source, which can later be used for metadata filtering.

Use this area to quickly compare versions to see how configuration changes impact chunking results. For example, you can see how the size and number of chunks changes between a parent version that uses the DataRobot English-language documentation and a later version that adds the Japanese-language documentation.