Create external vector databases with code¶
The following notebook outlines how you can build, validate, and register an external vector database to the DataRobot platform using DataRobot's Python client. This notebook is designed for use with DataRobot Notebooks, so DataRobot recommends downloading this notebook and uploading it for use in the platform.
Setup¶
The following steps outline the necessary configuration to integrate external vector databases with the DataRobot platform.
This workflow uses the following feature flags. Contact your DataRobot representative or administrator for information on enabling these features.
- Enable Notebooks Filesystem Management
- Enable Proxy models
- Enable Public Network Access for all Custom Models
- Enable Monitoring Support for Generative Models
- Enable Custom Inference Models
1. Enable the notebook filesystem for this notebook in the notebook sidebar.
2. Set the notebook session timeout to 180 minutes.
3. Restart the notebook container using at least a "Medium" (16GB RAM) instance.
4. Optionally, upload your documents archive to the notebook filesystem.
Install libraries¶
Install the following libraries:
!pip install "langchain==0.0.244" \
"faiss-cpu==1.7.4" \
"sentence-transformers==2.2.2" \
"datarobotx==0.1.25"
import datarobot as dr
import datarobotx as drx
from datarobot.models.genai.vector_database import CustomModelVectorDatabaseValidation
from datarobot.models.genai.vector_database import VectorDatabase
Connect to DataRobot¶
Read more about options for connecting to DataRobot from the Python client.
endpoint = "https://app.datarobot.com/api/v2"
token = "<ADD_VALUE_HERE>"
dr.Client(endpoint=endpoint, token=token)
drx.Context(token=token, endpoint=endpoint)
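As an alternative to hard-coding credentials in the notebook, the DataRobot client can also read its configuration from environment variables. A minimal sketch, assuming DATAROBOT_ENDPOINT and DATAROBOT_API_TOKEN are set for the notebook session:
import os
# Assumes these environment variables are set for the session; dr.Client()
# falls back to them when no endpoint/token arguments are provided.
os.environ.setdefault("DATAROBOT_ENDPOINT", "https://app.datarobot.com/api/v2")
os.environ.setdefault("DATAROBOT_API_TOKEN", "<ADD_VALUE_HERE>")
dr.Client()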
Download sample data¶
This example references a sample dataset made from the DataRobot English documentation. To experiment with your own data, modify this section and/or the "Load and split text" section to reference your own local dataset.
Note: For self-managed users, code samples that reference app.datarobot.com need to be changed to the appropriate URL for your instance.
import requests, zipfile, io
SOURCE_DOCUMENTS_ZIP_URL = "https://s3.amazonaws.com/datarobot_public_datasets/ai_accelerators/datarobot_english_documentation_5th_December.zip"
UNZIPPED_DOCS_DIR = "datarobot_english_documentation"
STORAGE_DIR = "storage"
r = requests.get(SOURCE_DOCUMENTS_ZIP_URL)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall(f"{STORAGE_DIR}/")
Load and split text¶
Next, load the DataRobot documentation dataset and split it into chunks. If you are applying this recipe to a different use case, consider the following:
- Use additional or alternative document loaders.
- Filter out extraneous and noisy documents.
- Choose an appropriate chunk_size and chunk_overlap. These are counted in characters, not tokens.
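For example, to swap in a different loader you can pass a loader_cls to DirectoryLoader. A minimal sketch that loads only Markdown files with langchain's plain TextLoader; the path and glob below are illustrative, not part of this recipe:
from langchain.document_loaders import DirectoryLoader, TextLoader
# Illustrative alternative: read Markdown files as plain text instead of using
# the default unstructured loader. Adjust the path and glob for your own data.
alt_loader = DirectoryLoader(
    "storage/datarobot_english_documentation/",  # illustrative path
    glob="**/*.md",
    loader_cls=TextLoader,
)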
import re
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
SOURCE_DOCUMENTS_DIR = f"{STORAGE_DIR}/{UNZIPPED_DOCS_DIR}/"
SOURCE_DOCUMENTS_FILTER = "**/*.txt"
loader = DirectoryLoader(f"{SOURCE_DOCUMENTS_DIR}", glob=SOURCE_DOCUMENTS_FILTER)
splitter = RecursiveCharacterTextSplitter(
chunk_size=2000,
chunk_overlap=1000,
)
print(f"Loading {SOURCE_DOCUMENTS_DIR} directory")
data = loader.load()
print(f"Splitting {len(data)} documents")
docs = splitter.split_documents(data)
for doc in docs:
doc.metadata['source'] = re.sub(
rf'{STORAGE_DIR}/{UNZIPPED_DOCS_DIR}/datarobot_docs/en/(.+)\.md',
r'https://docs.datarobot.com/en/docs/\1.html',
doc.metadata['source']
)
print(f"Created {len(docs)} documents")
Create a vector database from documents¶
Use the following cell to build a vector database from the DataRobot documentation dataset. Note that this notebook uses FAISS, an open source, in-memory vector store that can be serialized to disk and reloaded (and is compatible with DataRobot Notebooks). Additionally, this notebook uses the open source Hugging Face all-MiniLM-L6-v2 embeddings model.
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores.faiss import FAISS
from langchain.docstore.document import Document
import torch
if not torch.cuda.is_available():
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"
else:
EMBEDDING_MODEL_NAME = "all-mpnet-base-v2"
# Will download the model the first time it runs
embedding_function = SentenceTransformerEmbeddings(
model_name=EMBEDDING_MODEL_NAME,
cache_folder="storage/deploy/sentencetransformers",
)
try:
# Load existing db from disk if previously built
db = FAISS.load_local("storage/deploy/faiss-db", embedding_function)
except Exception:
texts = [doc.page_content for doc in docs]
metadatas = [doc.metadata for doc in docs]
# Build and save the FAISS db to persistent notebook storage; this can take some time w/o GPUs
db = FAISS.from_texts(texts, embedding_function, metadatas=metadatas)
db.save_local("storage/deploy/faiss-db")
print(f"FAISS VectorDB has {db.index.ntotal} documents")
Test the vector database¶
Use the following cell to test the vector database by performing a similarity search with the provided query.
db.similarity_search("How do I replace a custom model on an existing custom environment?")
#db.max_marginal_relevance_search("How do I replace a custom model on an existing custom environment?")
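To see how the matches rank, you can also retrieve the FAISS distance scores alongside the documents (lower scores are closer matches for the default L2 index). A brief sketch:
# Inspect the top matches together with their FAISS distance scores
results = db.similarity_search_with_score(
    "How do I replace a custom model on an existing custom environment?", k=4
)
for doc, score in results:
    print(f"{score:.3f}  {doc.metadata['source']}")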
Define hooks for deploying an unstructured custom model¶
The following cell defines the hooks used by the unstructured custom model: load_model loads the knowledge base, and score_unstructured uses it to retrieve relevant documents when scoring.
import os
def load_model(input_dir):
"""Custom model hook for loading our knowledge base."""
import os
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores.faiss import FAISS
embedding_function = SentenceTransformerEmbeddings(
model_name=EMBEDDING_MODEL_NAME,
cache_folder=input_dir + '/' + 'storage/deploy/sentencetransformers',
)
db = FAISS.load_local(input_dir + "/" + "storage/deploy/faiss-db", embedding_function)
return db
def score_unstructured(model, data, query, **kwargs) -> str:
"""Custom model hook for retrieving relevant docs with our knowledge base.
When requesting predictions from the deployment, pass a dictionary
with the following keys:
- 'question' the question to be passed to the vector store retriever
datarobot-user-models (DRUM) handles loading the model and calling
this function with the appropriate parameters.
Returns:
--------
rv : str
Json dictionary with keys:
- 'question' user's original question
    - 'relevant' list of relevant text chunks retrieved from the vector store
    - 'references' list of source references for the retrieved documents
    - 'error' error message if an exception occurred while handling the request
"""
import json
from langchain.vectorstores.base import VectorStoreRetriever
try:
db = model
data_dict = json.loads(data)
retriever = VectorStoreRetriever(vectorstore=db)
documents = retriever.get_relevant_documents(data_dict['question'])
relevant_text_list = [doc.page_content for doc in documents]
references = [doc.metadata['source'] for doc in documents]
rv = {
"question": data_dict["question"],
"relevant": relevant_text_list,
"references": references
}
except Exception as e:
rv = {'error': f"{e.__class__.__name__}: {str(e)}"}
return json.dumps(rv), {"mimetype": "application/json", "charset": "utf8"}
Test hooks locally¶
Before proceeding with deployment, test that the custom model hooks function correctly using the cell below.
import json
# Test the hooks locally
score_unstructured(
load_model("."),
json.dumps(
{
"question": "How do I replace a custom model on an existing custom environment?",
}
),
None,
)
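The hook returns a tuple of the JSON response body and response headers, so you can also unpack the local result and inspect individual fields; for example:
# Unpack the (json_string, headers) tuple returned by the hook and inspect it
response_body, response_headers = score_unstructured(
    load_model("."),
    json.dumps({"question": "How do I replace a custom model on an existing custom environment?"}),
    None,
)
parsed = json.loads(response_body)
print(parsed["references"])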
Deploy the knowledge base¶
The cell below uses a convenience method that does the following:
- Builds a new custom model environment containing the contents of storage/deploy/.
- Assembles a new custom model with the provided hooks.
- Deploys an unstructured custom model to DataRobot.
- Returns an object that can be used to make predictions.
This example uses a pre-built environment by passing its environment_id. Providing an environment_id reuses an existing custom model environment instead of building a new one, which allows shorter iteration cycles when you only change the custom model hooks. You can view your account's pre-built environments in the DataRobot Custom Model Workshop.
deployment = drx.deploy(
model="storage/deploy/",
name="External DR Knowledge Base",
hooks={
"score_unstructured": score_unstructured,
"load_model": load_model
},
# extra_requirements=["langchain", "faiss-cpu", "sentence-transformers", "openai"],
    # Reuse an existing environment when you only change the hook code,
    # not the requirements
environment_id=dr.ExecutionEnvironment.list("Python 3.9 GenAI")[0].id,
)
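The returned object wraps the underlying DataRobot deployment. Its ID is required for the validation step later in this notebook, so it can be useful to record it now:
# The underlying DataRobot deployment ID is needed for the validation step below
print(f"Deployment ID: {deployment.dr_deployment.id}")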
Test the deployment¶
Test that the deployment can successfully provide responses to questions.
deployment.predict_unstructured(
{
"question": "How do I replace a custom model on an existing custom environment?",
}
)
Validate and create the vector database from the external VDB¶
The following methods validate the external vector database deployment and register it as a vector database in DataRobot.
This example associates a Use Case with the validation and creates the vector database within that Use Case.
Set use_case_id to specify an existing Use Case, or uncomment the line in the following cell to create a new one.
use_case_id = "<ADD_VALUE_HERE>"
use_case = dr.UseCase.get(use_case_id)
# UNCOMMENT if you wish to create a new UseCase
# use_case = dr.UseCase.create()
CustomModelVectorDatabaseValidation.create executes the validation of the vector database. Be sure to provide the deployment ID.
external_vdb_validation = CustomModelVectorDatabaseValidation.create(
prompt_column_name="question",
target_column_name="relevant",
deployment_id=deployment.dr_deployment.id,
use_case=use_case,
wait_for_completion=True
)
assert external_vdb_validation.validation_status == "PASSED"
After validation completes, use VectorDatabase.create_from_custom_model() to integrate the vector database. Provide the Use Case (or its ID), a name for the external vector database, and the validation ID returned by the previous cell.
vdb = VectorDatabase.create_from_custom_model(
name="DR External Vector Database",
use_case=use_case,
validation_id=external_vdb_validation.id
)
assert vdb.execution_status == "COMPLETED"
print(f"Vector Database ID: {vdb.id}")
This vector database ID can now be used in the GenAI E2E walk-through to create the LLM blueprint with an external vector database.
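As a rough, hedged sketch only: the calls below assume the Playground and LLMBlueprint classes in the datarobot.models.genai modules and their create signatures, which may differ by client version; follow the GenAI E2E walk-through for the authoritative steps.
from datarobot.models.genai.playground import Playground
from datarobot.models.genai.llm_blueprint import LLMBlueprint
# Assumed API sketch: create a playground in the Use Case and attach the
# external vector database to a new LLM blueprint by ID.
playground = Playground.create(name="External VDB playground", use_case=use_case)
llm_blueprint = LLMBlueprint.create(
    playground=playground,
    name="Blueprint with external VDB",
    vector_database=vdb.id,
)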