Create external vector databases with code¶
The following notebook outlines how you can build, validate, and register an external vector database to the DataRobot platform using DataRobot's Python client. This notebook is designed for use with DataRobot Notebooks, so DataRobot recommends downloading this notebook and uploading it for use in the platform.
Setup¶
The following steps outline the necessary configuration to integrate external vector databases with the DataRobot platform.
This workflow uses the following feature flags. Contact your DataRobot representative or administrator for information on enabling these features.
- Enable Notebooks Filesystem Management
- Enable Proxy models
- Enable Public Network Access for all Custom Models
- Enable the Injection of Runtime Parameters for Custom Models
- Enable Monitoring Support for Generative Models
- Enable Custom Inference Models (GA: on by default)
Enable the notebook filesystem for this notebook in the notebook sidebar.
Add the following notebook environment variables and set their values to your Azure OpenAI credentials (a verification snippet follows these setup steps).
OPENAI_API_KEY
OPENAI_ORGANIZATION
OPENAI_API_BASE
OPENAI_API_TYPE
OPENAI_API_VERSION
OPENAI_DEPLOYMENT_NAME
Set the notebook session timeout to 180 minutes.
Restart the notebook container using at least a "Medium" (16GB RAM) instance.
Upload your documents archive to the notebook.
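Before running any cells, you can verify that the environment variables from the setup steps are present. This is a minimal sanity check, assuming the variable names listed above:

import os

# Fail fast if any of the Azure OpenAI variables from the setup steps are missing.
required_vars = [
    "OPENAI_API_KEY",
    "OPENAI_ORGANIZATION",
    "OPENAI_API_BASE",
    "OPENAI_API_TYPE",
    "OPENAI_API_VERSION",
    "OPENAI_DEPLOYMENT_NAME",
]
missing = [name for name in required_vars if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing notebook environment variables: {missing}")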
# Before making a call to validate an external VDB or LLM,
# you must configure the DataRobot client in the DR Notebook with a pre-existing API Token.
# This is because DR Notebooks use temporary API tokens.
import datarobot as dr
dr.Client(token="PUT_YOUR_API_TOKEN_HERE")
Install libraries¶
try:
    import os

    assert os.path.isfile('./storage/dr_docs.tar')
except Exception as e:
    raise RuntimeError('Please follow the setup steps before running the notebook.') from e
!pip install "langchain==0.0.244" \
"faiss-cpu==1.7.4" \
"sentence-transformers==2.2.2" \
"unstructured==0.8.4" \
"openai==0.27.8" \
"datarobotx==0.1.14"
# Decompress the documents
!tar -xf ./storage/dr_docs.tar -C ./storage/
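To confirm the archive unpacked correctly, you can count the extracted files. A quick sketch, assuming the archive extracts to the storage/datarobot_docs/ directory used in the next section:

import pathlib

# Count the markdown files extracted from the archive; the directory name
# matches SOURCE_DOCUMENTS_DIR in the next cell.
md_files = list(pathlib.Path("storage/datarobot_docs").rglob("*.md"))
print(f"Extracted {len(md_files)} markdown files")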
Load and split text¶
Next, load the DataRobot documentation dataset and split it into chunks. If you are applying this recipe to a different use case, consider the following:
- Use additional or alternative document loaders.
- Filter out extraneous and noisy documents (see the sketch after this list).
- Choose an appropriate `chunk_size` and `chunk_overlap`. These are counted in characters, not tokens.
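For example, a simple pre-filter can drop stub pages before splitting. This is a hypothetical sketch; the 200-character threshold is an arbitrary assumption to tune for your corpus:

def filter_noisy_documents(documents, min_chars=200):
    """Drop very short documents (e.g., stubs or redirects) before splitting.

    The min_chars default is illustrative, not a recommendation.
    """
    return [doc for doc in documents if len(doc.page_content) >= min_chars]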
import re

from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import MarkdownTextSplitter

SOURCE_DOCUMENTS_DIR = "storage/datarobot_docs/"
SOURCE_DOCUMENTS_FILTER = "**/*.md"

loader = DirectoryLoader(f"{SOURCE_DOCUMENTS_DIR}", glob=SOURCE_DOCUMENTS_FILTER)
splitter = MarkdownTextSplitter(
    chunk_size=2000,
    chunk_overlap=1000,
)

print(f"Loading {SOURCE_DOCUMENTS_DIR} directory")
data = loader.load()
print(f"Splitting {len(data)} documents")
docs = splitter.split_documents(data)

# Rewrite local file paths to the corresponding public documentation URLs
# so retrieved chunks can cite their sources.
for doc in docs:
    doc.metadata['source'] = re.sub(
        r'storage/datarobot_docs/en/(.+)\.md',
        r'https://docs.datarobot.com/en/docs/\1.html',
        doc.metadata['source'],
    )
print(f"Created {len(docs)} documents")
Loading storage/datarobot_docs/ directory
Splitting 726 documents
Created 3464 documents
Create a vector database from documents¶
Use the following cell to build a vector database from the DataRobot documentation dataset. Note that this notebook uses FAISS, an open source, in-memory vector store that can be serialized and loaded to disk (and is compatible with DataRobot Notebooks). Additionally, this notebook uses the open source HuggingFace `all-MiniLM-L6-v2` embeddings model.
import torch
from langchain.docstore.document import Document
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores.faiss import FAISS

# Use a larger embeddings model when a GPU is available.
if not torch.cuda.is_available():
    EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"
else:
    EMBEDDING_MODEL_NAME = "all-mpnet-base-v2"

# Will download the model the first time it runs
embedding_function = SentenceTransformerEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    cache_folder="storage/deploy/sentencetransformers",
)

try:
    # Load existing db from disk if previously built
    db = FAISS.load_local("storage/deploy/faiss-db", embedding_function)
except Exception:
    texts = [doc.page_content for doc in docs]
    metadatas = [doc.metadata for doc in docs]
    # Build and save the FAISS db to persistent notebook storage; this can take some time w/o GPUs
    db = FAISS.from_texts(texts, embedding_function, metadatas=metadatas)
    db.save_local("storage/deploy/faiss-db")

print(f"FAISS VectorDB has {db.index.ntotal} documents")
FAISS VectorDB has 3464 documents
Test the vector database¶
Use the following cell to test the vector database by having the model perform a similarity search with the query provided.
db.similarity_search("How do I replace a custom model on an existing custom environment?")
#db.max_marginal_relevance_search("How do I replace a custom model on an existing custom environment?")
[Document(page_content="title: Add custom model versions\ndescription: Update a model's contents to create a new version of the model due to new package versions, different preprocessing steps, hyperparameters, etc.\n\nAdd custom model versions {: #add-custom-model-versions }\n\nIf you want to update a model due to new package versions, different preprocessing steps, hyperparameters, and more, you can update the file contents to create a new version of the model. To upload a new version of a custom model environment, see Add an environment version.\n\nCreate a new minor version\n\nWhen you update the contents of a model, the minor version (1.1, 1.2, etc.) of the model automatically updates. To create a minor custom model version, select the model from the Custom Model Workshop and navigate to the Assemble tab. Under the Model header, click Add files and upload the files or folders you updated. The minor version is also updated if you delete a file.\n\nCreate a new major version\n\nTo create a new major version of a model (1.0, 2.0, etc.):\n\nSelect the model from the Custom Model Workshop and navigate to the Assemble tab.\n\nUnder the Model header, click New Version and, in the Create new model version dialog box: \n\n\nSelect a custom model version creation strategy: \n\n\nCopy contents of previous version: Bring the contents of the current version to the new version of the custom model.\n\n\nCreate empty version: Discard the contents of the current version and add new files for the new version of the custom model.\n\n\n\n\nSelect a Base Environment. The environment of the current version is selected by default.\n\n\nEnter a New version description. The version description is optional.", metadata={'source': 'https://docs.datarobot.com/en/docs/mlops/deployment/custom-models/custom-model-workshop/custom-model-versions.html'}), Document(page_content="When all fields are complete, click Add. The custom environment is ready for use in the Workshop.\n\nAfter you upload an environment, it is only available to you unless you share it with other individuals.\n\nTo make changes to an existing environment, create a new version.\n\nAdd an environment version {: #add-an-environment-version }\n\nTroubleshoot or update a custom environment by adding a new version of it to the Workshop. In the Versions tab, select New version.\n\nUpload the file for a new version and provide a brief description, then click Add.\n\nThe new version is available in the Verison tab; all past environment versions are saved for later use.\n\nView environment information {: #view-environment-information }\n\nThere is a variety of information available for each custom and built-in environment. To view:\n\nNavigate to Model Registry > Custom Model Workshop > Environments. The resulting list shows all environments available to your account, with summary information.\n\nFor more information on an individual environment, click to select:\n\nThe versions tab lists a variety of version-specific information and provides a link for downloading that version's environment context file.\n\nClick Current Deployments to see a list of all deployments in which the current environment has been used.\n\nClick Environment Info to view information about the general environment, not including version information.\n\nShare and download an environment {: #share-and-download-an-environment }\n\nYou can share custom environments with anyone in your organization from the menu options on the right. 
These options are not available to built-in environments because all organization members have access and these environment options should not be removed.", metadata={'source': 'https://docs.datarobot.com/en/docs/mlops/deployment/custom-models/custom-model-environments/custom-environments.html'}), Document(page_content="Navigate to Model Registry > Custom Model Workshop > Environments. The resulting list shows all environments available to your account, with summary information.\n\nFor more information on an individual environment, click to select:\n\nThe versions tab lists a variety of version-specific information and provides a link for downloading that version's environment context file.\n\nClick Current Deployments to see a list of all deployments in which the current environment has been used.\n\nClick Environment Info to view information about the general environment, not including version information.\n\nShare and download an environment {: #share-and-download-an-environment }\n\nYou can share custom environments with anyone in your organization from the menu options on the right. These options are not available to built-in environments because all organization members have access and these environment options should not be removed.\n\n!!! note\n An environment is not available in the model registry to other users unless it was explicitly shared. That does not, however, limit users' ability to use blueprints that include tasks that use that environment. See the description of implicit sharing for more information.\n\nFrom Model Registry > Custom Model Workshop > Environments, use the menu to share and/or delete any custom environment that you have appropriate permissions for. (Note that the link points to custom model actions, but the options are the same for custom tasks and environments.)\n\nSelf-Managed AI Platform admins {: #self-managed-ai-platform-admins }\n\nThe following is available only on the Self-Managed AI Platform.\n\nEnvironment availability {: #environment-availability }", metadata={'source': 'https://docs.datarobot.com/en/docs/mlops/deployment/custom-models/custom-model-environments/custom-environments.html'}), Document(page_content="Replace a model package {: #replace-a-model-package }\n\nActions (\n\nInventory or the\n\nOverview pages.\n\nYou are redirected to the Overview tab of the deployment. Click Import from to choose your method of model replacement.\n\nLocal File: Upload a model package exported from DataRobot AutoML to replace an existing model package (standalone MLOps users only).\n\nModel Registry: Select a model package from the Model Registry to replace an existing model package.\n\nPaste AutoML URL: Copy the URL of the model from the Leaderboard and paste it into the Replacement Model field.\n\nWhen you have confirmed the model package for replacement, select the replacement reason and click Accept and replace.\n\nModel replacement considerations {: #model-replacement-considerations }\n\nWhen replacing a deployed model, note the following:\n\nModel replacement is available for all deployments. Each deployment's model is provided as a model package, which can be replaced with another model package, provided it is compatible.\n!!! note\n The new model package cannot be the same leaderboad model as an existing champion or challenger; each challenger must be a unique model. 
If you create multiple model packages from the same leaderboard model, you can't use those models as challengers in the same deployment.\n\nWhile only the most current model is deployed, model history is maintained and can be used as a baseline for data drift.\n\nModel replacement validation {: #model-replacement-validation }\n\nDataRobot validates whether the new model is an appropriate replacement for the existing model and provides warning messages if issues are found. DataRobot compares the models to ensure that:\n\nThe target names and types match. For classification targets, the class names must match.\n\nThe feature types match.", metadata={'source': 'https://docs.datarobot.com/en/docs/mlops/manage-mlops/deploy-replace.html'})]
Define hooks for deploying an unstructured custom model¶
The following cell defines the methods used to deploy an unstructured custom model. These include loading the custom model and using the model for scoring.
import os


def load_model(input_dir):
    """Custom model hook for loading our knowledge base."""
    import os

    from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
    from langchain.vectorstores.faiss import FAISS

    embedding_function = SentenceTransformerEmbeddings(
        model_name=EMBEDDING_MODEL_NAME,
        cache_folder=input_dir + '/' + 'storage/deploy/sentencetransformers',
    )
    db = FAISS.load_local(input_dir + "/" + "storage/deploy/faiss-db", embedding_function)
    return db


def score_unstructured(model, data, query, **kwargs) -> str:
    """Custom model hook for retrieving relevant passages from our knowledge base.

    When requesting predictions from the deployment, pass a dictionary
    with the following key:
    - 'question' the question used to retrieve relevant documents

    datarobot-user-models (DRUM) handles loading the model and calling
    this function with the appropriate parameters.

    Returns:
    --------
    rv : str
        JSON dictionary with one of the following keys:
        - 'relevant' list of document passages relevant to the question
        - 'error' error message if an exception occurred while handling the request
    """
    import json

    from langchain.chains import ConversationalRetrievalChain
    from langchain.chat_models import AzureChatOpenAI
    from langchain.vectorstores.base import VectorStoreRetriever

    try:
        db = model
        data_dict = json.loads(data)
        retriever = VectorStoreRetriever(vectorstore=db)
        documents = retriever.get_relevant_documents(data_dict['question'])
        relevant_text_list = [doc.page_content for doc in documents]
        rv = {"relevant": relevant_text_list}
        # Optionally include source links:
        # rv["references"] = [doc.metadata["source"] for doc in documents]
    except Exception as e:
        rv = {'error': f"{e.__class__.__name__}: {str(e)}"}
    return json.dumps(rv), {"mimetype": "application/json", "charset": "utf8"}
Test hooks locally¶
Before proceeding with deployment, test that the custom model hooks function correctly using the cell below.
import json

# Test the hooks locally
score_unstructured(
    load_model("."),
    json.dumps(
        {
            "question": "How do I replace a custom model on an existing custom environment?",
        }
    ),
    None,
)
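The hook returns a two-element tuple: a JSON string and a response-metadata dictionary. A quick way to unpack the tuple and preview the first retrieved passage (a sketch that assumes the hooks above are defined in the current session):

import json

# Unpack the (json_string, headers) tuple returned by score_unstructured.
response_json, response_headers = score_unstructured(
    load_model("."),
    json.dumps({"question": "How do I replace a custom model on an existing custom environment?"}),
    None,
)
print(json.loads(response_json)["relevant"][0][:300])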
Deploy the knowledge base¶
The cell below uses a convenience method that does the following:
- Builds a new custom model environment containing the contents of `storage/deploy/`.
- Assembles a new custom model with the provided hooks.
- Deploys an unstructured custom model to DataRobot.
- Returns an object that can be used to make predictions.
You can also provide an `environment_id` to use an existing custom model environment instead, allowing for shorter iteration cycles on the custom model hooks.
import datarobotx as drx

deployment = drx.deploy(
    "storage/deploy/",
    name="External DR Knowledge Base",
    hooks={
        "score_unstructured": score_unstructured,
        "load_model": load_model,
    },
    # extra_requirements=["langchain", "faiss-cpu", "sentence-transformers", "openai"],
    # Reuse an existing environment when you only want to change the hook code,
    # not the requirements
    environment_id="64c964448dd3f0c07f47d040",
)
# Enable storing prediction data, necessary for Data Export for monitoring purposes
deployment.dr_deployment.update_predictions_data_collection_settings(enabled=True)
# Deploying custom model
- Unable to auto-detect model type; any provided paths and files will be exported - dependencies should be explicitly specified using extra_requirements
- Preparing model and environment...
- Using environment [[DataRobot] Python 3.9 GenAI v4](https://staging.datarobot.com/model-registry/custom-environments/64c964448dd3f0c07f47d040) for deployment
- Configuring and uploading custom model...
100%|█████████████████████████████████████| 104M/104M [00:00<00:00, 113MB/s]
- Registered custom model [External DR Knowledge Base](https://staging.datarobot.com/model-registry/custom-models/65170268672fcc8004090943/info) with target type: Unstructured
- Creating and deploying model package...
- Created deployment [External DR Knowledge Base](https://staging.datarobot.com/deployments/65170277d708c73281b95de4/overview)
# Custom model deployment complete
Test the deployment¶
Test that the deployment can successfully provide responses to questions.
deployment.predict_unstructured(
    {
        "question": "How do I replace a custom model on an existing custom environment?",
    }
)
# Making predictions - Making predictions with deployment [External DR Knowledge Base](https://staging.datarobot.com/deployments/65170277d708c73281b95de4/overview) # Predictions complete {'relevant': ["title: Add custom model versions\ndescription: Update a model's contents to create a new version of the model due to new package versions, different preprocessing steps, hyperparameters, etc.\n\nAdd custom model versions {: #add-custom-model-versions }\n\nIf you want to update a model due to new package versions, different preprocessing steps, hyperparameters, and more, you can update the file contents to create a new version of the model. To upload a new version of a custom model environment, see Add an environment version.\n\nCreate a new minor version\n\nWhen you update the contents of a model, the minor version (1.1, 1.2, etc.) of the model automatically updates. To create a minor custom model version, select the model from the Custom Model Workshop and navigate to the Assemble tab. Under the Model header, click Add files and upload the files or folders you updated. The minor version is also updated if you delete a file.\n\nCreate a new major version\n\nTo create a new major version of a model (1.0, 2.0, etc.):\n\nSelect the model from the Custom Model Workshop and navigate to the Assemble tab.\n\nUnder the Model header, click New Version and, in the Create new model version dialog box: \n\n\nSelect a custom model version creation strategy: \n\n\nCopy contents of previous version: Bring the contents of the current version to the new version of the custom model.\n\n\nCreate empty version: Discard the contents of the current version and add new files for the new version of the custom model.\n\n\n\n\nSelect a Base Environment. The environment of the current version is selected by default.\n\n\nEnter a New version description. The version description is optional.", "When all fields are complete, click Add. The custom environment is ready for use in the Workshop.\n\nAfter you upload an environment, it is only available to you unless you share it with other individuals.\n\nTo make changes to an existing environment, create a new version.\n\nAdd an environment version {: #add-an-environment-version }\n\nTroubleshoot or update a custom environment by adding a new version of it to the Workshop. In the Versions tab, select New version.\n\nUpload the file for a new version and provide a brief description, then click Add.\n\nThe new version is available in the Verison tab; all past environment versions are saved for later use.\n\nView environment information {: #view-environment-information }\n\nThere is a variety of information available for each custom and built-in environment. To view:\n\nNavigate to Model Registry > Custom Model Workshop > Environments. The resulting list shows all environments available to your account, with summary information.\n\nFor more information on an individual environment, click to select:\n\nThe versions tab lists a variety of version-specific information and provides a link for downloading that version's environment context file.\n\nClick Current Deployments to see a list of all deployments in which the current environment has been used.\n\nClick Environment Info to view information about the general environment, not including version information.\n\nShare and download an environment {: #share-and-download-an-environment }\n\nYou can share custom environments with anyone in your organization from the menu options on the right. 
These options are not available to built-in environments because all organization members have access and these environment options should not be removed.", "Navigate to Model Registry > Custom Model Workshop > Environments. The resulting list shows all environments available to your account, with summary information.\n\nFor more information on an individual environment, click to select:\n\nThe versions tab lists a variety of version-specific information and provides a link for downloading that version's environment context file.\n\nClick Current Deployments to see a list of all deployments in which the current environment has been used.\n\nClick Environment Info to view information about the general environment, not including version information.\n\nShare and download an environment {: #share-and-download-an-environment }\n\nYou can share custom environments with anyone in your organization from the menu options on the right. These options are not available to built-in environments because all organization members have access and these environment options should not be removed.\n\n!!! note\n An environment is not available in the model registry to other users unless it was explicitly shared. That does not, however, limit users' ability to use blueprints that include tasks that use that environment. See the description of implicit sharing for more information.\n\nFrom Model Registry > Custom Model Workshop > Environments, use the menu to share and/or delete any custom environment that you have appropriate permissions for. (Note that the link points to custom model actions, but the options are the same for custom tasks and environments.)\n\nSelf-Managed AI Platform admins {: #self-managed-ai-platform-admins }\n\nThe following is available only on the Self-Managed AI Platform.\n\nEnvironment availability {: #environment-availability }", "Replace a model package {: #replace-a-model-package }\n\nActions (\n\nInventory or the\n\nOverview pages.\n\nYou are redirected to the Overview tab of the deployment. Click Import from to choose your method of model replacement.\n\nLocal File: Upload a model package exported from DataRobot AutoML to replace an existing model package (standalone MLOps users only).\n\nModel Registry: Select a model package from the Model Registry to replace an existing model package.\n\nPaste AutoML URL: Copy the URL of the model from the Leaderboard and paste it into the Replacement Model field.\n\nWhen you have confirmed the model package for replacement, select the replacement reason and click Accept and replace.\n\nModel replacement considerations {: #model-replacement-considerations }\n\nWhen replacing a deployed model, note the following:\n\nModel replacement is available for all deployments. Each deployment's model is provided as a model package, which can be replaced with another model package, provided it is compatible.\n!!! note\n The new model package cannot be the same leaderboad model as an existing champion or challenger; each challenger must be a unique model. If you create multiple model packages from the same leaderboard model, you can't use those models as challengers in the same deployment.\n\nWhile only the most current model is deployed, model history is maintained and can be used as a baseline for data drift.\n\nModel replacement validation {: #model-replacement-validation }\n\nDataRobot validates whether the new model is an appropriate replacement for the existing model and provides warning messages if issues are found. 
DataRobot compares the models to ensure that:\n\nThe target names and types match. For classification targets, the class names must match.\n\nThe feature types match."]}
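Because this is a standard unstructured model deployment, you can also query it over REST instead of using datarobotx. The following is a sketch only; the prediction server URL, deployment ID, DataRobot-Key header, and the DATAROBOT_API_TOKEN environment variable are placeholders to replace with the values from your deployment's Predictions tab:

import json
import os

import requests

# Hypothetical direct call to the unstructured predictions endpoint;
# copy the real server URL, deployment ID, and key from your deployment.
PREDICTION_SERVER = "https://example.dynamic.orm.datarobot.com"
DEPLOYMENT_ID = "YOUR_DEPLOYMENT_ID"

response = requests.post(
    f"{PREDICTION_SERVER}/predApi/v1.0/deployments/{DEPLOYMENT_ID}/predictionsUnstructured",
    headers={
        "Authorization": f"Bearer {os.environ['DATAROBOT_API_TOKEN']}",
        "DataRobot-Key": "YOUR_DATAROBOT_KEY",
        "Content-Type": "application/json",
    },
    data=json.dumps({"question": "How do I replace a custom model on an existing custom environment?"}),
)
print(response.json())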
Import experimental generative AI methods to the Python client¶
Import two generative AI methods for use with DataRobot's Python client. These methods are experimental, so they are subject to change, and they require the experimental version of the DataRobot Python library. Contact your DataRobot representative or administrator for information on accessing the experimental library, or install it with `pip install datarobot-early-access`. These methods validate the external vector database and register it with a Use Case.
from datarobot._experimental.models.genai.vector_database import CustomModelVectorDatabaseValidation
from datarobot._experimental.models.genai.vector_database import VectorDatabase
`CustomModelVectorDatabaseValidation.create` executes the validation of the vector database. Be sure to provide the Deployment ID.
external_vdb_validation = CustomModelVectorDatabaseValidation.create(
    prompt_column_name="question",
    target_column_name="relevant",
    deployment_id=deployment.dr_deployment.id,
    wait_for_completion=True,
)
After validation completes, use `VectorDatabase.create_from_custom_model()`. You must provide the Use Case ID (accessible from DataRobot Workbench in the UI), a name for the external vector database, and the validation ID returned from the previous cell.
use_case_id = "63e5063007c97ab5edd150a5"

vdb = VectorDatabase.create_from_custom_model(
    use_case_id,
    name="DR Custom Model",
    validation_id=external_vdb_validation.id,
)
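As a final check, you can confirm the registration succeeded. A minimal sketch, assuming the experimental VectorDatabase object exposes id and name attributes:

# id and name are assumed attributes of the experimental VectorDatabase object.
print(f"Registered vector database '{vdb.name}' with ID {vdb.id}")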