Create vector databases from BYO embeddings¶
The following notebook outlines how to build and validate a vector database from a bring-your-own (BYO) embedding model and register it with the DataRobot platform using the Python client. This notebook is designed for use with DataRobot Notebooks; DataRobot recommends downloading this notebook and uploading it for use in the platform.
Setup¶
The following steps outline the necessary configuration for integrating vector databases with the DataRobot platform.
This workflow uses the following feature flags. Contact your DataRobot representative or administrator for information on enabling these features.
- Enable Notebooks Filesystem Management
- Enable Public Network Access for all Custom Models
- Enable Monitoring Support for Generative Models
- Enable Custom Inference Models
- Enable GenAI Experimentation
- Use a codespace, not a DataRobot Notebook, to ensure this notebook has access to a filesystem.
- Set the notebook session timeout to 180 minutes.
- Restart the notebook container using at least a "Medium" (16GB RAM) instance.
- Optionally, upload your documents archive to the notebook filesystem.
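As a quick sanity check of the requirements above (a working filesystem and at least a "Medium" 16GB instance), you can inspect available memory and disk from inside the notebook. This sketch assumes a POSIX environment, which DataRobot codespaces provide:

```python
import os
import shutil

# Approximate total RAM (POSIX sysconf) and free disk space on the notebook filesystem.
ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
disk_free_gb = shutil.disk_usage("/").free / 1024**3
print(f"RAM: {ram_gb:.1f} GiB, free disk: {disk_free_gb:.1f} GiB")
```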
Install requirements¶
Import the following libraries and modules to interface with the DataRobot platform, prediction environments, and vector database functionality.
import datarobot as dr
from datarobot.enums import PredictionEnvironmentPlatform
from datarobot.enums import PredictionEnvironmentModelFormats
from datarobot.models.genai.custom_model_embedding_validation import CustomModelEmbeddingValidation
from datarobot.models.genai.vector_database import VectorDatabase
from datarobot.models.genai.vector_database import ChunkingParameters
Connect to DataRobot¶
Provide a DataRobot endpoint
and token
to connect to DataRobot through the DataRobot Python client. Read more about options for connecting to DataRobot from the Python client.
endpoint="https://app.datarobot.com/api/v2"
token="ADD_TOKEN_HERE"
dr.Client(endpoint=endpoint, token=token)
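As an alternative to hardcoding credentials, you can read them from environment variables. The `DATAROBOT_ENDPOINT` and `DATAROBOT_API_TOKEN` names below are the conventional ones recognized by the Python client; the `dr.Client` call is left commented so the cell is safe to run without credentials:

```python
import os

# Read credentials from environment variables instead of hardcoding them.
endpoint = os.environ.get("DATAROBOT_ENDPOINT", "https://app.datarobot.com/api/v2")
token = os.environ.get("DATAROBOT_API_TOKEN", "ADD_TOKEN_HERE")
# dr.Client(endpoint=endpoint, token=token)
print(endpoint)
```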
Select an environment¶
Using dr.ExecutionEnvironment.list()
, iterate through the environments available to your organization, selecting the environment named [DataRobot] Python 3.11 GenAI
to be the base environment of the vector database and assigning it to base_environment
.
execution_environments = dr.ExecutionEnvironment.list()
base_environment = None
environment_versions = None
for execution_environment in execution_environments:
    # print(execution_environment)
    if execution_environment.name == "[DataRobot] Python 3.11 GenAI":
        base_environment = execution_environment
        environment_versions = dr.ExecutionEnvironmentVersion.list(
            execution_environment.id
        )
        break
environment_version = environment_versions[0]
print(base_environment)
print(environment_version)
Create a custom embedding model¶
Using dr.CustomInferenceModel.list()
, search the available custom models for all-MiniLM-L6-v2-embedding-model
. If the custom model doesn't exist, create it as custom_model
using dr.CustomInferenceModel.create()
. If the custom model does exist, assign it to custom_model
.
CUSTOM_MODEL_NAME = "all-MiniLM-L6-v2-embedding-model"
if CUSTOM_MODEL_NAME not in [c.name for c in dr.CustomInferenceModel.list()]:
# Create a new custom model
print("Creating new custom model")
custom_model = dr.CustomInferenceModel.create(
name=CUSTOM_MODEL_NAME,
target_type=dr.TARGET_TYPE.UNSTRUCTURED,
is_training_data_for_versions_permanently_enabled=True
)
else:
print("Custom Model Exists")
custom_model = [c for c in dr.CustomInferenceModel.list() if c.name == CUSTOM_MODEL_NAME].pop()
Create a directory for custom code¶
Create a directory called custom_embedding_model
to write custom embedding model code into.
import os

# `exist_ok=True` avoids an error if the cell is re-run.
os.makedirs('custom_embedding_model', exist_ok=True)
Write custom embedding model code¶
Write custom embedding model code into the custom.py
file, creating an unstructured model from all-MiniLM-L6-v2
.
%%writefile ./custom_embedding_model/custom.py
from sentence_transformers import SentenceTransformer


def load_model(input_dir):
    return SentenceTransformer("all-MiniLM-L6-v2")


def score_unstructured(model, data, query, **kwargs):
    import json

    data_dict = json.loads(data)
    outputs = model.encode(data_dict["input"])
    return json.dumps(
        {
            "result": outputs.tolist(),
            "device": str(model._target_device),
        }
    )
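Before deploying, it can help to verify the JSON contract the hook implements: requests carry `{"input": [<texts>]}` and responses carry `{"result": [<vectors>], "device": ...}`. The sketch below exercises the same hook logic with a stand-in model object (the `_FakeModel` class is purely illustrative, not part of the DataRobot API), so it runs without downloading all-MiniLM-L6-v2:

```python
import json


class _FakeEmbeddings(list):
    def tolist(self):
        return list(self)


class _FakeModel:
    """Stand-in for SentenceTransformer, used only to illustrate the JSON contract."""

    _target_device = "cpu"

    def encode(self, texts):
        # One toy 3-dimensional vector per input string.
        return _FakeEmbeddings([[0.0, 1.0, 2.0] for _ in texts])


def score_unstructured(model, data, query, **kwargs):
    data_dict = json.loads(data)
    outputs = model.encode(data_dict["input"])
    return json.dumps({"result": outputs.tolist(), "device": str(model._target_device)})


# Request payload mirrors what the vector database sends to the deployment.
payload = json.dumps({"input": ["ahoy matey", "ship's cook resume"]})
response = json.loads(score_unstructured(_FakeModel(), payload, None))
print(len(response["result"]), response["device"])  # 2 cpu
```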
Write a requirements file¶
Write the requirements for the custom embedding model into the requirements.txt
file, ensuring the custom model environment includes the embedding model's dependencies.
%%writefile ./custom_embedding_model/requirements.txt
sentence-transformers==3.0.0
Create a custom model version¶
Using dr.CustomModelVersion.create_clean
, create a custom model version with the custom_model
, base_environment
, and files
defined in previous steps. In addition, enable public network access using dr.NETWORK_EGRESS_POLICY.PUBLIC
.
# Create a new custom model version in DataRobot
print("Upload new model version to DataRobot")
model_version = dr.CustomModelVersion.create_clean(
    custom_model_id=custom_model.id,
    base_environment_id=base_environment.id,
    files=[
        ("./custom_embedding_model/custom.py", "custom.py"),
        # Include requirements.txt so the dependency build below can install from it.
        ("./custom_embedding_model/requirements.txt", "requirements.txt"),
    ],
    network_egress_policy=dr.NETWORK_EGRESS_POLICY.PUBLIC,
)
Build custom model environment¶
Using dr.CustomModelVersionDependencyBuild
, build a custom model environment with the required dependencies installed.
# Build the custom model environment to ensure dependencies from `requirements.txt` are installed.
build_info = dr.CustomModelVersionDependencyBuild.start_build(
    custom_model_id=custom_model.id,
    custom_model_version_id=model_version.id,
    max_wait=60 * 10,  # Allow up to ten minutes for the build
)
Register the custom model¶
Using dr.RegisteredModel.list()
, search the available custom models for all-MiniLM-L6-v2-embedding-model
(assigned to CUSTOM_MODEL_NAME
). If the registered model doesn't exist, create a registered_model_version
using dr.RegisteredModelVersion.create_for_custom_model_version
. If the registered model does exist, assign it to registered_model
and create a registered_model_version
using dr.RegisteredModelVersion.create_for_custom_model_version
.
if CUSTOM_MODEL_NAME not in [m.name for m in dr.RegisteredModel.list()]:
    print("Creating new registered model")
    registered_model_version = dr.RegisteredModelVersion.create_for_custom_model_version(
        model_version.id,
        name=CUSTOM_MODEL_NAME,
        registered_model_name=CUSTOM_MODEL_NAME,
    )
else:
    print("Using existing registered model")
    registered_model = [
        m for m in dr.RegisteredModel.list() if m.name == CUSTOM_MODEL_NAME
    ].pop()
    registered_model_version = dr.RegisteredModelVersion.create_for_custom_model_version(
        model_version.id,
        name=CUSTOM_MODEL_NAME,
        registered_model_id=registered_model.id,
    )
Create a prediction environment for embedding models¶
Using dr.PredictionEnvironment.list()
, search the available prediction environments for Prediction environment for BYO embeddings models
. If the prediction environment doesn't exist, create a DataRobot Serverless prediction_environment
using dr.PredictionEnvironment.create
. If the prediction environment does exist, assign it to prediction_environment
.
PREDICTION_ENVIRONMENT_NAME = "Prediction environment for BYO embeddings models"
prediction_environment = None
for _prediction_environment in dr.PredictionEnvironment.list():
    if _prediction_environment.name == PREDICTION_ENVIRONMENT_NAME:
        prediction_environment = _prediction_environment
        break

if prediction_environment is None:
    prediction_environment = dr.PredictionEnvironment.create(
        name=PREDICTION_ENVIRONMENT_NAME,
        platform=PredictionEnvironmentPlatform.DATAROBOT_SERVERLESS,
        supported_model_formats=[
            PredictionEnvironmentModelFormats.DATAROBOT,
            PredictionEnvironmentModelFormats.CUSTOM_MODEL,
        ],
    )
Deploy the custom embedding model¶
Using dr.Deployment.list()
, search the available deployments for Deployment for all-MiniLM-L6-v2
. If the deployment doesn't exist, create a deployment
, deploying the registered_model_version
created in a previous section with dr.Deployment.create_from_registered_model_version
. If the deployment does exist, assign it to deployment
.
MODEL_DEPLOYMENT_NAME = "Deployment for all-MiniLM-L6-v2"
if MODEL_DEPLOYMENT_NAME not in [d.label for d in dr.Deployment.list()]:
deployment = dr.Deployment.create_from_registered_model_version(
registered_model_version.id,
label=MODEL_DEPLOYMENT_NAME,
max_wait=1000,
prediction_environment_id=prediction_environment.id
)
else:
deployment = [d for d in dr.Deployment.list() if d.label == MODEL_DEPLOYMENT_NAME][0]
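Once deployed, the embedding model can be queried over HTTP at the deployment's unstructured predictions endpoint. The sketch below only assembles the request pieces; the URL pattern is a typical one and should be confirmed on the deployment's Predictions tab, the token is a placeholder, and the actual POST call is left commented so the cell runs without credentials. The payload shape matches the `score_unstructured` hook defined earlier.

```python
import json

# Hypothetical values; replace with your API token and the id of the deployment above.
API_TOKEN = "ADD_TOKEN_HERE"
DEPLOYMENT_ID = "deployment-id-here"  # e.g., deployment.id

# Typical serverless unstructured-prediction URL pattern (an assumption; confirm
# the exact URL in the DataRobot UI for your deployment).
url = f"https://app.datarobot.com/predApi/v1.0/deployments/{DEPLOYMENT_ID}/predictionsUnstructured"
headers = {"Authorization": f"Bearer {API_TOKEN}", "Content-Type": "application/json"}
payload = json.dumps({"input": ["Ahoy! Ten years before the mast as ship's cook."]})

# import requests
# response = requests.post(url, headers=headers, data=payload)
# embeddings = response.json()["result"]
print(url.endswith("predictionsUnstructured"))
```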
Create a Use Case for BYO embeddings¶
Using dr.UseCase.create
, create the Use Case to use the vector database with and assign it to use_case
.
use_case = dr.UseCase.create(name="For BYO embeddings")
Upload a dataset to the Use Case¶
Using dr.Dataset.create_from_url
, upload the example dataset for the vector database and assign it to dataset
.
# This can be updated with any public URL that points to a .zip file
# in the expected format.
dataset_url = "https://s3.amazonaws.com/datarobot_public_datasets/genai/pirate_resumes.zip"
# The vector database is used with GenAI models, so upload a dataset containing the documents.
# To use a local file as the dataset, change this to
# `dataset = dr.Dataset.create_from_file(local_file_path)`.
dataset = dr.Dataset.create_from_url(dataset_url)
Then, add the dataset to the use_case
created in the previous section.
# Attach dataset to use case.
use_case.add(dataset)
Validate and create the custom embedding model¶
The CustomModelEmbeddingValidation.create
function validates the custom embedding deployment, applying the required settings and associating the deployment
and use_case
created earlier in this notebook. This step stores the resulting validation in custom_model_embedding_validation
.
# Create BYO embeddings validation using prepared deployment
custom_model_embedding_validation = CustomModelEmbeddingValidation.create(
    prompt_column_name="input",
    target_column_name="result",
    deployment_id=deployment.id,
    use_case=use_case,
    name="BYO embeddings",
    wait_for_completion=True,
    prediction_timeout=300,
)
Set chunking parameters and create a vector database¶
After validation completes, construct ChunkingParameters
and use VectorDatabase.create()
to integrate the vector database. This step uses the custom_model_embedding_validation
, dataset
, and use_case
defined in previous sections.
# Use created validation to set up chunking parameters
chunking_parameters = ChunkingParameters(
    embedding_validation=custom_model_embedding_validation,
    chunking_method="recursive",
    chunk_size=256,
    chunk_overlap_percentage=50,
    separators=["\n\n", "\n", " ", ""],
    embedding_model=None,
)
vdb = VectorDatabase.create(
    dataset_id=dataset.id,
    chunking_parameters=chunking_parameters,
    use_case=use_case,
)
vdb = VectorDatabase.get(vdb.id)
assert vdb.execution_status == "COMPLETED"
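To build intuition for what chunk_size and chunk_overlap_percentage control, the toy sliding-window splitter below shows how a 50% overlap ties consecutive chunks together. Note that DataRobot's actual "recursive" method additionally splits on the separators list, so this illustrates only the overlap arithmetic, not the real splitter.

```python
def sliding_chunks(tokens, chunk_size, overlap_pct):
    """Toy splitter: fixed-size windows advancing by chunk_size * (1 - overlap)."""
    step = max(1, int(chunk_size * (1 - overlap_pct / 100)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks


# With chunk_size=4 and 50% overlap, each chunk shares half its tokens with the next.
chunks = sliding_chunks(list(range(10)), chunk_size=4, overlap_pct=50)
print(chunks)
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```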