Create a vector database

GPU usage for Self-Managed users

When working with datasets over 1GB, Self-Managed users who do not have GPU usage configured on their cluster may experience serious delays. Email DataRobot Support, or visit the Support site, for installation guidance.

To create a vector database for use in a playground:

Add an appropriate data source to the Data Registry.
Set the basic configuration, including a data source and embedding model.
Set chunking.

Use the Vector databases tab in the Use Case directory to manage built vector databases and deployed embedding models.

Build a vector database

First, add a vector database from one of several entry points in the application:

The Use Case assets tile
The Data assets tile
The Vector databases tile
The playground

From within a Use Case, click the Add dropdown and:

Choose Vector database > Create vector database to create a vector database from data in the AI Catalog or Data Registry.
Choose Add deployed vector database to add a deployment that contains the vector database to be used during LLM prompting.

From the Data assets tile, open the Actions menu and select Create vector database. The Actions menu is only available if the data is detected as eligible, which means:

Processing of the dataset has finished.
The data source has the mandatory document and document_file_path (or ドキュメント and ドキュメントファイルパス) columns.
There are no more than 50 metadata columns.
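The eligibility rules above can be sketched as a pre-upload check. This is a minimal illustration, not DataRobot code: `is_eligible` and its constants are hypothetical names, and it assumes that every column other than the two mandatory ones counts as a metadata column.

```python
# Hypothetical helper that mirrors the eligibility rules listed above.
REQUIRED_COLUMNS = {"document", "document_file_path"}
REQUIRED_COLUMNS_JA = {"ドキュメント", "ドキュメントファイルパス"}
MAX_METADATA_COLUMNS = 50


def is_eligible(columns, processing_finished=True):
    """Return True if a dataset with these column names meets the listed rules."""
    if not processing_finished:
        return False
    names = set(columns)
    for required in (REQUIRED_COLUMNS, REQUIRED_COLUMNS_JA):
        if required <= names:
            # Assumption: every non-mandatory column counts as metadata.
            return len(names - required) <= MAX_METADATA_COLUMNS
    return False


print(is_eligible(["document", "document_file_path", "author", "year"]))
print(is_eligible(["document", "author"]))
```

Running a check like this before adding data to the Data Registry can explain why the Actions menu is not offered for a given dataset.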

From the Vector databases tile, either:

Click the Add vector database button to open the vector database creation modal. Expand the Data source dropdown in the resulting window, select from the data source(s) associated with the Use Case, or click Add data to add data from the Data Registry.
Alternatively, if no vector databases are associated with the Use Case yet, click Create vector database.

When in a playground, use the Vector database tab in the configuration section of the LLM blueprint:

Configure the database

When creating a vector database, you set a basic configuration and text chunking.

The following table describes the settings used in vector database creation:

Field	Description
Name	The name the vector database is saved with. This name displays in the Use Case Vector databases tile and is selectable when configuring playgrounds.
Data source	The dataset used as the knowledge source for the vector database. The list populates based on the entries under the Use Case's Vector databases tile. If you started the vector database creation from the action menu on the Data assets tile, the field is prepopulated with that dataset. Chunking settings become available after you select a data source.
Embedding model	The model that defines the type of embedding used for encoding data.
Chunking method	The method applied for splitting (or not) text documents into smaller sizes.

Set the embedding model

To encode your data, select the embedding model that best suits your Use Case. Use one of the DataRobot-provided embeddings or a bring-your-own (BYO) embedding. DataRobot supports the following types of embeddings; see the full embedding descriptions for more detail.

Embedding type	Description
cl-nagoya/sup-simcse-ja-base	A medium-sized language model for Japanese RAG.
huggingface.co/intfloat/multilingual-e5-base	A medium-sized language model used for RAG across multiple languages.
huggingface.co/intfloat/multilingual-e5-small	A smaller language model used for multilingual RAG. It is faster than multilingual-e5-base.
intfloat/e5-base-v2	A medium-sized language model used for medium-to-high RAG performance. With fewer parameters and a smaller architecture, it is faster than e5-large-v2.
intfloat/e5-large-v2	A large language model designed for optimal RAG performance. It is classified as slow due to its architecture and size.
jinaai/jina-embedding-t-en-v1	A tiny language model pre-trained on the English corpus. It is the fastest embedding model offered by DataRobot and is the default.
jinaai/jina-embedding-s-en-v2	Part of the Jina Embeddings v2 family, this embedding model is the optimal choice for long-document embeddings (large chunk sizes, up to 8192).
sentence-transformers/all-MiniLM-L6-v2	A small language model fine-tuned on a 1B sentence-pairs dataset that is relatively fast and pre-trained on the English corpus. It is not recommended for RAG, however, as it was trained on old data.
Add deployed embedding model	Select a deployed embedding model to use during vector database creation. Column names can be found on the deployment's overview page in the Console.

The embedding models that DataRobot provides are based on the SentenceBERT framework, providing an easy way to compute dense vector representations for sentences and paragraphs. The models are based on transformer networks (BERT, RoBERTa, T5) trained on a mixture of supervised and unsupervised data, and achieve state-of-the-art performance in various tasks. Text is embedded in a vector space such that similar text is grouped more closely and can efficiently be found using cosine similarity.
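To make the cosine similarity measure concrete, the following sketch computes it for small hand-written vectors. The vectors are toy values chosen for illustration, not output from any embedding model, and `cosine_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings"; real embeddings have hundreds of dimensions.
query = [0.9, 0.1, 0.0]
similar_text = [0.8, 0.2, 0.1]
unrelated_text = [0.0, 0.1, 0.9]

print(cosine_similarity(query, similar_text))
print(cosine_similarity(query, unrelated_text))
```

A score near 1 means the vectors point in nearly the same direction, so the first pair scores higher than the second.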

Add a deployed embedding model

While adding a vector database, you can select an embedding model deployed as an unstructured custom model. On the Create vector database panel, click the Embedding model dropdown, then click Add deployed embedding model.

On the Add deployed embedding model panel, configure the following settings:

Setting	Description
Name	Enter a descriptive name for the embedding model.
Deployment name	Select the unstructured custom model deployment.
Prompt column name	Enter the name of the column containing the user prompt, defined when you created the custom embedding model in the model workshop (for example, `promptText`).
Response (target) column name	Enter the name of the column containing the LLM response, defined when you created the custom embedding model in the model workshop (for example, `responseText` or `resultText`).

After you configure the deployed embedding model settings, click Validate and add. The deployed embedding model is added to the Embedding model dropdown list:

Chunking settings

Text chunking is the process of splitting a text document into smaller text chunks that are then used to generate embeddings. You can either:

Choose Text chunking and further configure how chunks are derived—method, separators, and other parameters.
Select No chunking. DataRobot will then treat each row as a chunk and directly generate an embedding on each row.

Chunking method

The chunking method sets how text from the data source is divided into smaller, more manageable pieces. It is used to improve the efficiency of nearest-neighbor searches so that when queried, the database first identifies the relevant chunks that are likely to contain the nearest neighbors, and then searches within those chunks rather than searching the entire dataset.

Method	Description
Recursive	Splits text until chunks are smaller than a specified max size, discards oversized chunks, and if necessary, splits text by individual characters to maintain the chunk size limit.
Semantic	Splits larger text into smaller, meaningful units based on the semantic content instead of length (chunk size). It is a fully automatic method: when it is selected, no further chunking configuration is available, and it creates chunks where sentences are semantically "closed." See the deep dive below for more information.

Deep dive: Chunking methods

Recursive text chunking works by recursively splitting text documents according to an ordered list of text separators until a text chunk has a length that is less than the specified maximum chunk size. If generated chunks already have a length that is less than the max chunk size, the subsequent separators are ignored. Otherwise, DataRobot applies the list of separators sequentially until chunks have a length that is less than the max chunk size. If, after all of these separators are applied, a generated chunk is still larger than the specified length, it is discarded. In that case, DataRobot uses a "separate each character" strategy to split on each character and then merge consecutive characters up to the max chunk size limit. If no "split on character" is listed as a separator, long chunks are cut off. That is, some parts of the text will be missing for the generation of embeddings, but the entire chunk will still be available for document retrieval.
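The recursive strategy described above can be sketched as follows. This is a simplified illustration rather than the DataRobot implementation: `recursive_split` is a hypothetical function, length is measured in characters instead of tokens, and an oversized piece is returned unchanged when no separators remain.

```python
def recursive_split(text, separators, max_size):
    """Split text with each separator in order until pieces fit max_size."""
    if len(text) <= max_size:
        return [text] if text else []
    if not separators:
        # No separators left: this sketch keeps the oversized piece as-is.
        return [text]
    sep, rest = separators[0], separators[1:]
    # An empty-string separator means "split on each character".
    pieces = list(text) if sep == "" else text.split(sep)
    chunks, current = [], ""
    for piece in pieces:
        # Greedily merge consecutive pieces while the result still fits.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_size:
            current = piece
        else:
            # Still too large: try the next separator in the list.
            chunks.extend(recursive_split(piece, rest, max_size))
    if current:
        chunks.append(current)
    return chunks


separators = ["\n\n", "\n", " ", ""]
print(recursive_split("aaa bbb ccc\n\nddd eee", separators, 7))
print(recursive_split("abcdefghij", [""], 4))
```

In the first call, the second paragraph already fits, so only the first paragraph is split further on spaces. In the second call, no separator matches, so the character-level fallback produces the chunks.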

Semantic chunking is the process of breaking down a larger piece of text into smaller, meaningful units (or "chunks") based on the semantic content or meaning of the text, rather than just arbitrary character or word limits. Instead of splitting the text based solely on length, semantic chunking attempts to keep coherent ideas or topics intact within each chunk. This method is useful for tasks like natural language processing (NLP), where understanding the meaning and context of the text is important for tasks like information retrieval, summarization, or generating embeddings for machine learning models. For example, in a semantic chunking process, paragraphs might be kept together if they discuss the same topic, even if they exceed a specific size limit, ensuring that the chunks represent complete thoughts or concepts.

That said, the DataRobot implementation of semantic chunking for out-of-the-box embedding models automatically detects the maximum supported chunk size of the selected embedding model and uses it as a safety cutoff. This forces a chunk to be cut if it would exceed the embedding model's maximum input length to ensure that all text is actually embedded. BYO embeddings, which use the default version of the algorithm, do not support the safety cutoff. Chunks can be of any length, even exceeding the embedding model's maximum input length, which can result in text being cut off (and therefore not embedded). The non-embedded text is still included in the citation returned from the vector database.

Work with separators

Separators are "rules" or search patterns for breaking up text. Each separator is applied, in order, to divide text into smaller components; together they define the tokens by which documents are split into chunks. Separators are plain strings by default, although regular expressions can be enabled. Chunks will be large enough to group by topic, with size constraints determined by the model's configuration. Recursive text chunking is the method that applies these rules.

Each vector database starts with four default rules, which define what to split text on:

Double new lines
New lines
Spaces
Characters

While these rules use a word to identify them for easy understanding, on the backend they are interpreted as individual strings (i.e., \n\n, \n, " ", "").

There may be cases where none of the separators are present in the document, or there is not enough content to split into the desired chunk size. If this happens, DataRobot applies a "next-best character" fallback rule, moving characters into the next chunk until the chunk fits the defined chunk size. Otherwise, the embedding model would just truncate the chunk if it exceeds the inherent context size.

Add custom rule

You can add up to five custom separators to apply as part of your chunking strategy. This provides a total of nine separators (when considered together with the four defaults). The following applies to custom separators:

Each separator can have a maximum of 20 characters.
There is no "translation logic" that allows use of words as a separator. For example, if you want to chunk on punctuation, you would need to add a separator for each type.
The order of separators matters. To reorder separators, simply click the cell and drag it to the desired location.
To delete separators, whether in fine-tuning your chunking strategy or to free space for additional separators, click the trashcan icon. You cannot delete the default separators.
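The limits above can be expressed as a small validation sketch. The function and constant names are hypothetical, and the choice to place custom separators ahead of the defaults is an assumption for illustration only; in the application, you control the order by dragging.

```python
# Default separators as they are interpreted on the backend.
DEFAULT_SEPARATORS = ["\n\n", "\n", " ", ""]
MAX_CUSTOM_SEPARATORS = 5
MAX_SEPARATOR_LENGTH = 20


def build_separators(custom):
    """Validate custom separators and combine them with the defaults."""
    if len(custom) > MAX_CUSTOM_SEPARATORS:
        raise ValueError("At most five custom separators are allowed.")
    for sep in custom:
        if len(sep) > MAX_SEPARATOR_LENGTH:
            raise ValueError("A separator can have at most 20 characters.")
    # Assumption: custom separators are tried before the defaults.
    return list(custom) + DEFAULT_SEPARATORS


# Chunking on punctuation needs one separator per punctuation mark.
print(build_separators([".", "!", "?"]))
```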

Use regular expressions

Select Interpret separators as regular expressions to allow regular expressions in separators. It is important to understand that with this feature activated, all separators are treated as regex. This means, for example, that adding "." matches and splits on every character. If you instead want to split on "dots," you must escape the expression (i.e., "\."). This rule applies to all separators, both custom and predefined (which are configured to act this way).
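The difference between an escaped and an unescaped dot can be seen with Python's `re` module, used here only as a stand-in because the regex dialect that DataRobot applies is not stated in this document.

```python
import re

text = "v1.2 is out. Upgrade now."

# Unescaped: "." is the regex wildcard, so every character acts as a
# separator and only empty strings remain.
unescaped = re.split(".", text)

# Escaped: r"\." matches a literal dot only.
escaped = re.split(r"\.", text)

print(unescaped)
print(escaped)
```

The unescaped pattern leaves no usable text, while the escaped pattern splits only where a dot appears.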

Chunking parameters

Chunking parameters further define the output of the vector database. The default values for chunking parameters are dependent on the embedding model.

Chunk overlap

Overlapping refers to the practice of allowing adjacent chunks to share some amount of data. The Chunk overlap parameter specifies the percentage of overlapping tokens between consecutive chunks. Overlap is useful for maintaining context continuity between chunks when processing the text with language models, at the cost of producing more chunks and increasing the size of the vector database.
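As a rough illustration of how an overlap percentage affects the number of chunks, the sketch below slides a fixed-size window over a token list. `chunk_with_overlap` is a hypothetical helper, and integers stand in for tokens; real tokenization depends on the embedding model.

```python
def chunk_with_overlap(tokens, chunk_size, overlap_pct):
    """Slide a window of chunk_size tokens, sharing overlap_pct percent."""
    overlap = int(chunk_size * overlap_pct / 100)
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("Overlap must be smaller than the chunk size.")
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


tokens = list(range(10))  # stand-in for ten tokens
print(chunk_with_overlap(tokens, 4, 0))
print(chunk_with_overlap(tokens, 4, 50))
```

With no overlap the ten tokens produce three chunks; with 50% overlap they produce four, each sharing two tokens with its neighbor.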

Retrieval limits

The value you set for Top K (nearest neighbors) instructs the LLM on how many relevant chunks to retrieve from the vector database. Chunk selection is based on similarity scores. Consider:

Larger values provide more comprehensive coverage but also require more processing overhead and may include less relevant results.
Smaller values provide more focused results and faster processing, but may miss relevant information.
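The trade-off above can be sketched with precomputed similarity scores. The scores and chunk labels are made-up values for illustration, and `top_k_chunks` is a hypothetical helper rather than a DataRobot API.

```python
import heapq


def top_k_chunks(scored_chunks, k):
    """Return the text of the k chunks with the highest similarity score."""
    best = heapq.nlargest(k, scored_chunks, key=lambda pair: pair[0])
    return [text for _, text in best]


# (similarity score, chunk text) pairs; values are illustrative only.
scored = [
    (0.91, "chunk about embeddings"),
    (0.15, "chunk about billing"),
    (0.78, "chunk about chunking"),
    (0.42, "chunk about deployments"),
]

print(top_k_chunks(scored, 2))
print(top_k_chunks(scored, 4))
```

Raising K from 2 to 4 pulls in the lower-scoring chunks, which adds coverage but also less relevant context.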

Max tokens specifies:

The maximum size (in tokens) of each text chunk extracted from the dataset when building the vector database.
The length of the text that is used to create embeddings.
The size of the citations used in RAG operations.

Save the vector database

Once the configuration is complete, click Create vector database to make the database available in the playground.

Manage vector databases

The Vector databases tile lists all the vector databases and deployed embedding models associated with a Use Case. Vector database entries include information on the versions derived from the parent; see the section on versioning for detailed information on vector database versioning.

You can view all vector databases (and associated versions) for a Use Case from the Vector databases tab within the Use Case. For external vector databases, you can see only the source type. Because these vector databases aren't managed by DataRobot, other data is not available for reporting.

Click on any entry in the Vector databases tile listing to open a new modal where you can view an expanded view of that database's configuration and related items.

You can:

	Description
1	Select a different vector database to explore from the dropdown in the breadcrumbs.
2	Select a different version of the vector database to explore from the dropdown. When you click a version, the details and reported assets (Related items) update to those associated with the specific version. Learn more about versioning.
3	Execute a variety of vector database actions.
4	View the versioning history. When you click a version, the details and reported assets (Related items) update to those associated with the specific version. Learn more about versioning.
5	View items associated with the vector database, such as the related Use Case, LLM blueprints, and deployed custom and registered models that use the vector database. Click on an entity to open it in the corresponding Console tab.
6	Create a playground that uses the selected database.

Vector database actions

The actions dropdown allows you to apply an action to the version of the vector database you are viewing. See the versioning documentation for additional information.

Action	Description
Create playground using this version	Opens a new playground with the vector database loaded into the LLM configuration.
Create new version from this version	Creates a new version of the vector database that is based on the version that is currently selected.
Export this version to the Data Registry	Exports the latest vector database version to Data Registry. It can then be used in different Use Case playgrounds.
Delete vector database	Deletes the parent vector database and all versions. Because deployments use independent snapshots of a vector database, deleting a vector database in a Use Case does not affect the deployments that use it.

Vector database details

The details section of the vector database expanded view reports information for the selected version, whether you selected the version from the dropdown or the right-hand panel.

Basic vector database metadata: ID, creator and creation date, data source name and size.
Chunking configuration settings: Embedding column and chunking method and settings.
Metadata columns: Names of columns from the data source, which can later be used for metadata filtering.

Use this area to quickly compare versions to see how configuration changes impact chunking results. For example, notice how the size and number of chunks change between the parent version that uses the DataRobot English language documentation:

And with the addition of the Japanese language documentation: