Create vector databases from BYO embeddings¶
The following notebook outlines how to build and validate a vector database from a bring-your-own (BYO) embedding model and register the vector database with the DataRobot platform using the Python client. This notebook is designed for use with DataRobot Notebooks; DataRobot recommends downloading this notebook and uploading it for use in the platform.
Setup¶
The following steps outline the necessary configuration for integrating vector databases with the DataRobot platform.
This workflow uses the following feature flags. Contact your DataRobot representative or administrator for information on enabling these features.
- Enable Notebooks Filesystem Management
- Enable Public Network Access for all Custom Models
- Enable Monitoring Support for Generative Models
- Enable Custom Inference Models
- Enable GenAI Experimentation
Enable the notebook filesystem for this notebook in the notebook sidebar.
Set the notebook session timeout to 180 minutes.
Restart the notebook container using at least a "Medium" (16GB RAM) instance.
Optionally, upload your documents archive to the notebook filesystem.
Install requirements¶
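DataRobot Notebooks typically include the datarobot client already, so the install step below is only needed if the imports in the next cell fail in your environment (this install cell is an addition to the original workflow).
!pip install datarobot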
Import the following libraries and modules to interface with the DataRobot platform, prediction environments, and vector database functionality.
import datarobot as dr
from datarobot.enums import PredictionEnvironmentPlatform
from datarobot.enums import PredictionEnvironmentModelFormats
from datarobot.models.genai.custom_model_embedding_validation import CustomModelEmbeddingValidation
from datarobot.models.genai.vector_database import VectorDatabase
from datarobot.models.genai.vector_database import ChunkingParameters
Connect to DataRobot¶
Provide a DataRobot endpoint and token to connect to DataRobot through the DataRobot Python client. Read more about options for connecting to DataRobot from the Python client.
endpoint="https://app.datarobot.com/api/v2"
token="ADD_TOKEN_HERE>"
dr.Client(endpoint=endpoint, token=token)
<datarobot.rest.RESTClientObject at 0x17e7f2d50>
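Alternatively, you can avoid hard-coding credentials in the notebook. When called with no arguments, dr.Client() reads the DATAROBOT_ENDPOINT and DATAROBOT_API_TOKEN environment variables (or a DataRobot configuration file). The cell below is an optional sketch of that approach and assumes both variables are already set in the notebook environment.
# Optional: connect without hard-coding the token in the notebook.
dr.Client()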
Select an environment¶
Using dr.ExecutionEnvironment.list(), iterate through the environments available to your organization, select the environment named [DataRobot] Python 3.11 GenAI to serve as the base environment for the custom embedding model, and assign it to base_environment along with its available versions.
execution_environments = dr.ExecutionEnvironment.list()
base_environment = None
environment_versions = None
for execution_environment in execution_environments:
    # print(execution_environment)
    if execution_environment.name == "[DataRobot] Python 3.11 GenAI":
        base_environment = execution_environment
        environment_versions = dr.ExecutionEnvironmentVersion.list(
            execution_environment.id
        )
        break
environment_version = environment_versions[0]
print(base_environment)
print(environment_version)
ExecutionEnvironment('[DataRobot] Python 3.11 GenAI')
ExecutionEnvironmentVersion('v23')
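If the GenAI environment is not available to your organization, base_environment remains None and indexing environment_versions raises an error. A more defensive version of the cell above could add a guard like the following before selecting a version (this check is an addition to the original workflow).
if base_environment is None or not environment_versions:
    raise RuntimeError(
        "Execution environment '[DataRobot] Python 3.11 GenAI' was not found "
        "among the environments available to your organization."
    )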
Create a custom embedding model¶
Using dr.CustomInferenceModel.list(), search the available custom models for all-MiniLM-L6-v2-embedding-model. If the custom model doesn't exist, create it as custom_model using dr.CustomInferenceModel.create(). If the custom model does exist, assign it to custom_model.
CUSTOM_MODEL_NAME = "all-MiniLM-L6-v2-embedding-model"
if CUSTOM_MODEL_NAME not in [c.name for c in dr.CustomInferenceModel.list()]:
    # Create a new custom model
    print("Creating new custom model")
    custom_model = dr.CustomInferenceModel.create(
        name=CUSTOM_MODEL_NAME,
        target_type=dr.TARGET_TYPE.UNSTRUCTURED,
        is_training_data_for_versions_permanently_enabled=True
    )
else:
    print("Custom Model Exists")
    custom_model = [c for c in dr.CustomInferenceModel.list() if c.name == CUSTOM_MODEL_NAME].pop()
Creating new custom model
Create a directory for custom code¶
Create a directory called custom_embedding_model to write custom embedding model code into.
import os

os.makedirs('custom_embedding_model', exist_ok=True)  # exist_ok makes the cell safe to re-run
Write custom embedding model code¶
Write custom embedding model code into the custom.py file, creating an unstructured model from all-MiniLM-L6-v2.
%%writefile ./custom_embedding_model/custom.py
from sentence_transformers import SentenceTransformer


def load_model(input_dir):
    # DataRobot calls this hook once when the custom model starts up.
    return SentenceTransformer("all-MiniLM-L6-v2")


def score_unstructured(model, data, query, **kwargs):
    # DataRobot calls this hook for each unstructured prediction request.
    # The request body is a JSON string with an "input" list of texts.
    import json

    data_dict = json.loads(data)
    outputs = model.encode(data_dict["input"])
    return json.dumps(
        {
            "result": outputs.tolist(),
            "device": str(model._target_device)
        }
    )
Writing ./custom_embedding_model/custom.py
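Optionally, sanity-check the embedding model locally before uploading it. The cell below assumes sentence-transformers is installed in the notebook environment (it is not required for the rest of the workflow) and only confirms that all-MiniLM-L6-v2 returns 384-dimensional embeddings.
# Optional local check; downloads the model from Hugging Face on first use.
from sentence_transformers import SentenceTransformer

test_model = SentenceTransformer("all-MiniLM-L6-v2")
test_embeddings = test_model.encode(["Ahoy, matey!", "Walk the plank."])
print(test_embeddings.shape)  # expected: (2, 384)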
Create a custom model version¶
Using dr.CustomModelVersion.create_clean, create a custom model version with the custom_model, base_environment, and files defined in previous steps. In addition, enable public network access using dr.NETWORK_EGRESS_POLICY.PUBLIC.
# Create a new custom model version in DataRobot
print("Upload new version of model to DataRobot")
model_version = dr.CustomModelVersion.create_clean(
    custom_model_id=custom_model.id,
    base_environment_id=base_environment.id,
    files=[("./custom_embedding_model/custom.py", "custom.py")],
    network_egress_policy=dr.NETWORK_EGRESS_POLICY.PUBLIC,
)
Upload new version of model to DataRobot
Register the custom model¶
Using dr.RegisteredModel.list(), search the available registered models for all-MiniLM-L6-v2-embedding-model (assigned to CUSTOM_MODEL_NAME). If the registered model doesn't exist, create it along with its first version, registered_model_version, using dr.RegisteredModelVersion.create_for_custom_model_version. If the registered model does exist, assign it to registered_model and add a new registered_model_version to it with the same method.
if CUSTOM_MODEL_NAME not in [m.name for m in dr.RegisteredModel.list()]:
    print("Creating New Registered Model")
    registered_model_version = dr.RegisteredModelVersion.create_for_custom_model_version(
        model_version.id,
        name=CUSTOM_MODEL_NAME,
        registered_model_name=CUSTOM_MODEL_NAME
    )
else:
    print("Using Existing Model")
    registered_model = [m for m in dr.RegisteredModel.list() if m.name == CUSTOM_MODEL_NAME].pop()
    registered_model_version = dr.RegisteredModelVersion.create_for_custom_model_version(
        model_version.id,
        name=CUSTOM_MODEL_NAME,
        registered_model_id=registered_model.id
    )
Creating New Registered Model
Create a prediction environment for embedding models¶
Using dr.PredictionEnvironment.list(), search the available prediction environments for Prediction environment for BYO embeddings models. If the prediction environment doesn't exist, create a DataRobot Serverless prediction_environment using dr.PredictionEnvironment.create. If the prediction environment does exist, assign it to prediction_environment.
PREDICTION_ENVIRONMENT_NAME = "Prediction environment for BYO embeddings models"
prediction_environment = None
for _prediction_environment in dr.PredictionEnvironment.list():
    if _prediction_environment.name == PREDICTION_ENVIRONMENT_NAME:
        prediction_environment = _prediction_environment

if prediction_environment is None:
    prediction_environment = dr.PredictionEnvironment.create(
        name=PREDICTION_ENVIRONMENT_NAME,
        platform=PredictionEnvironmentPlatform.DATAROBOT_SERVERLESS,
        supported_model_formats=[
            PredictionEnvironmentModelFormats.DATAROBOT,
            PredictionEnvironmentModelFormats.CUSTOM_MODEL
        ],
    )
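As an optional confirmation, print the identifier and name of the prediction environment the deployment will use.
print(prediction_environment.id, prediction_environment.name)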
Deploy the custom embedding model¶
Using dr.Deployment.list(), search the available deployments for Deployment for all-MiniLM-L6-v2. If the deployment doesn't exist, create it by deploying the registered_model_version created in the previous section with dr.Deployment.create_from_registered_model_version. If the deployment does exist, assign it to deployment.
MODEL_DEPLOYMENT_NAME = "Deployment for all-MiniLM-L6-v2"
if MODEL_DEPLOYMENT_NAME not in [d.label for d in dr.Deployment.list()]:
    deployment = dr.Deployment.create_from_registered_model_version(
        registered_model_version.id,
        label=MODEL_DEPLOYMENT_NAME,
        max_wait=1000,
        prediction_environment_id=prediction_environment.id
    )
else:
    # pop() returns the Deployment object itself rather than a one-element list
    deployment = [d for d in dr.Deployment.list() if d.label == MODEL_DEPLOYMENT_NAME].pop()
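Optionally, confirm which deployment will serve the embedding requests before moving on.
print(deployment.id, deployment.label)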
Create a Use Case for BYO embeddings¶
Using dr.UseCase.create, create the Use Case to use the vector database with and assign it to use_case.
use_case = dr.UseCase.create(name="For BYO embeddings")
Upload a dataset to the Use Case¶
Using dr.Dataset.create_from_url, upload the example dataset for the vector database and assign it to dataset.
# This can be updated with any public URL that points to a .zip file
# in the expected format.
dataset_url = "https://s3.amazonaws.com/datarobot_public_datasets/genai/pirate_resumes.zip"

# The vector database is built from a dataset of documents. To use a local file
# as the dataset instead, change this to
# `dataset = dr.Dataset.create_from_file(local_file_path)`.
dataset = dr.Dataset.create_from_url(dataset_url)
Then, add the dataset to the use_case created in the previous section.
# Attach dataset to use case.
use_case.add(dataset)
<datarobot.models.use_cases.use_case.UseCaseReferenceEntity at 0x17e4b6150>
Validate the custom embedding model¶
The CustomModelEmbeddingValidation.create function validates the deployed custom embedding model, setting the required prompt and target column names and associating the deployment with the use_case created earlier in this notebook. This step stores the resulting validation object in custom_model_embedding_validation.
# Create BYO embeddings validation using prepared deployment
custom_model_embedding_validation = CustomModelEmbeddingValidation.create(
    prompt_column_name="input",
    target_column_name="result",
    deployment_id=deployment.id,
    use_case=use_case,
    name="BYO embeddings",
    wait_for_completion=True,
    prediction_timeout=300,
)
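You can optionally inspect the result once the call returns; because wait_for_completion=True, the call only returns after validation finishes. The validation_status attribute below is an assumption based on the client's validation objects rather than something shown in the original notebook.
# Hypothetical status check; the attribute name is assumed.
print(custom_model_embedding_validation.id, custom_model_embedding_validation.validation_status)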
Set chunking parameters and create a vector database¶
After validation completes, construct ChunkingParameters and use VectorDatabase.create() to build the vector database. This step uses the custom_model_embedding_validation, dataset, and use_case defined in previous sections.
# Use created validation to set up chunking parameters
chunking_parameters = ChunkingParameters(
    embedding_validation=custom_model_embedding_validation,
    chunking_method="recursive",
    chunk_size=256,
    chunk_overlap_percentage=50,
    separators=["\n\n", "\n", " ", ""],
    embedding_model=None,
)

vdb = VectorDatabase.create(
    dataset_id=dataset.id,
    chunking_parameters=chunking_parameters,
    use_case=use_case
)
vdb = VectorDatabase.get(vdb.id)
assert vdb.execution_status == "COMPLETED"
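The vector database build can run asynchronously, so the assertion above may fail if the build has not finished yet. The polling sketch below re-fetches the vector database until it completes; the 10-minute deadline and 15-second interval are arbitrary choices, not part of the original workflow.
import time

deadline = time.time() + 600
while vdb.execution_status != "COMPLETED" and time.time() < deadline:
    time.sleep(15)
    vdb = VectorDatabase.get(vdb.id)

assert vdb.execution_status == "COMPLETED"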