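Create an external vector database with code
The following notebook describes how to use DataRobot's Python client to build, validate, and register an external vector database with the DataRobot platform. This notebook is designed for use with DataRobot Notebooks, so we recommend downloading it and uploading it to the platform.
Setup
The following steps describe the configuration required to integrate an external vector database with the DataRobot platform.
This workflow uses the following feature flags. Contact your DataRobot representative or administrator to enable these features.
- Enable Notebooks Filesystem Management
- Enable Proxy models
- Enable Public Network Access for all Custom Models
- Enable Monitoring Support for Generative Models
- Enable Custom Inference Models (generally available; on by default)
In the notebook sidebar, enable the notebook filesystem for this notebook.
Add the following notebook environment variables and set their values using your Azure OpenAI credentials: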
OPENAI_API_KEY
OPENAI_ORGANIZATION
OPENAI_API_BASE
OPENAI_API_TYPE
OPENAI_API_VERSION
OPENAI_DEPLOYMENT_NAME
Set the notebook session timeout to 180 minutes.
Restart the notebook container using at least a "Medium" (16GB RAM) instance.
Upload the documents archive to the notebook.
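The environment variables listed above can be checked before running the rest of the notebook. The following is a minimal sketch that is not part of the original notebook; the helper name `missing_env_vars` is hypothetical, and only the variable names come from the setup steps.

```python
import os

# Variable names taken from the setup steps above.
REQUIRED_ENV_VARS = [
    "OPENAI_API_KEY",
    "OPENAI_ORGANIZATION",
    "OPENAI_API_BASE",
    "OPENAI_API_TYPE",
    "OPENAI_API_VERSION",
    "OPENAI_DEPLOYMENT_NAME",
]


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`.

    `env` defaults to the process environment (hypothetical helper).
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing notebook environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Running this cell first makes a misconfigured session fail early, rather than partway through the workflow.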
# Before making a call to validate an external VDB or LLM,
# you must configure the DataRobot client in the DR Notebook with a pre-existing API Token.
# This is because DR Notebooks use temporary API tokens.
import datarobot as dr
dr.Client(token="PUT_YOUR_API_TOKEN_HERE")
Install libraries
try:
    import os
    # The documents archive must already be uploaded to the notebook filesystem
    assert os.path.isfile('./storage/dr_docs.tar')
except Exception as e:
    raise RuntimeError('Please follow the setup steps before running the notebook.') from e
!pip install "langchain==0.0.244" \
"faiss-cpu==1.7.4" \
"sentence-transformers==2.2.2" \
"unstructured==0.8.4" \
"openai==0.27.8" \
"datarobotx==0.1.14"
# Decompress the documents
!tar -xf ./storage/dr_docs.tar -C ./storage/
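Load and split text
Next, load the DataRobot documentation dataset and split it into chunks. If you are applying this recipe to a different use case, consider the following:
- Use additional or alternative document loaders.
- Filter out extraneous and unnecessary documents.
- Choose an appropriate chunk_size and chunk_overlap. These are counted in characters, not tokens.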
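To make the character-based counting concrete, the sketch below splits a string into fixed-size windows that share a fixed number of characters. It is a simplified illustration only: the function name `split_by_characters` is hypothetical, and it does not reproduce how `MarkdownTextSplitter` chooses split points, since that splitter also takes Markdown structure into account.

```python
def split_by_characters(text, chunk_size, chunk_overlap):
    """Split `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters. Sizes are
    measured in characters, not tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = split_by_characters("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap repeats more context across neighboring chunks, at the cost of producing more chunks to embed and store.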
import re
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import MarkdownTextSplitter
SOURCE_DOCUMENTS_DIR = "storage/datarobot_docs/"
SOURCE_DOCUMENTS_FILTER = "**/*.md"
loader = DirectoryLoader(f"{SOURCE_DOCUMENTS_DIR}", glob=SOURCE_DOCUMENTS_FILTER)
splitter = MarkdownTextSplitter(
    chunk_size=2000,
    chunk_overlap=1000,
)
print(f"Loading {SOURCE_DOCUMENTS_DIR} directory")
data = loader.load()
print(f"Splitting {len(data)} documents")
docs = splitter.split_documents(data)
# Rewrite each local file path into the public documentation URL it came from
for doc in docs:
    doc.metadata['source'] = re.sub(
        r'storage/datarobot_docs/en/(.+)\.md',
        r'https://docs.datarobot.com/en/docs/\1.html',
        doc.metadata['source'],
    )
print(f"Created {len(docs)} documents")
Loading storage/datarobot_docs/ directory
Splitting 726 documents
Created 3464 documents
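The `re.sub` call in the cell above rewrites each chunk's local file path into a documentation URL. The sketch below isolates that substitution so it can be tried on a single path; the function name `source_path_to_url` and the example path are hypothetical, while the two patterns are copied from the cell.

```python
import re


def source_path_to_url(path):
    """Map a local Markdown path to its documentation URL (hypothetical helper)."""
    return re.sub(
        r"storage/datarobot_docs/en/(.+)\.md",
        r"https://docs.datarobot.com/en/docs/\1.html",
        path,
    )


# Hypothetical example path, chosen only for illustration.
print(source_path_to_url("storage/datarobot_docs/en/mlops/index.md"))
# https://docs.datarobot.com/en/docs/mlops/index.html

# A path that does not match the pattern is returned unchanged.
print(source_path_to_url("notes/readme.txt"))
# notes/readme.txt
```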
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores.faiss import FAISS
from langchain.docstore.document import Document
import torch
# Use a smaller embedding model when no GPU is available
if not torch.cuda.is_available():
    EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"
else:
    EMBEDDING_MODEL_NAME = "all-mpnet-base-v2"
# Will download the model the first time it runs
embedding_function = SentenceTransformerEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    cache_folder="storage/deploy/sentencetransformers",
)
try:
    # Load existing db from disk if previously built
    db = FAISS.load_local("storage/deploy/faiss-db", embedding_function)
except Exception:
    texts = [doc.page_content for doc in docs]
    metadatas = [doc.metadata for doc in docs]
    # Build and save the FAISS db to persistent notebook storage; this can take some time w/o GPUs
    db = FAISS.from_texts(texts, embedding_function, metadatas=metadatas)
    db.save_local("storage/deploy/faiss-db")
print(f"FAISS VectorDB has {db.index.ntotal} documents")
FAISS VectorDB has 3464 documents
Test the vector database
Use the following cell to test the vector database by running a similarity search for the given query.
db.similarity_search("How do I replace a custom model on an existing custom environment?")
#db.max_marginal_relevance_search("How do I replace a custom model on an existing custom environment?")
[Document(page_content="title: Add custom model versions\ndescription: Update a model's contents to create a new version of the model due to new package versions, different preprocessing steps, hyperparameters, etc.\n\nAdd custom model versions {: #add-custom-model-versions }\n\nIf you want to update a model due to new package versions, different preprocessing steps, hyperparameters, and more, you can update the file contents to create a new version of the model. To upload a new version of a custom model environment, see Add an environment version.\n\nCreate a new minor version\n\nWhen you update the contents of a model, the minor version (1.1, 1.2, etc.) of the model automatically updates. To create a minor custom model version, select the model from the Custom Model Workshop and navigate to the Assemble tab. Under the Model header, click Add files and upload the files or folders you updated. The minor version is also updated if you delete a file.\n\nCreate a new major version\n\nTo create a new major version of a model (1.0, 2.0, etc.):\n\nSelect the model from the Custom Model Workshop and navigate to the Assemble tab.\n\nUnder the Model header, click New Version and, in the Create new model version dialog box: \n\n\nSelect a custom model version creation strategy: \n\n\nCopy contents of previous version: Bring the contents of the current version to the new version of the custom model.\n\n\nCreate empty version: Discard the contents of the current version and add new files for the new version of the custom model.\n\n\n\n\nSelect a Base Environment. The environment of the current version is selected by default.\n\n\nEnter a New version description. The version description is (Optional)", metadata={'source': 'https://docs.datarobot.com/en/docs/mlops/deployment/custom-models/custom-model-workshop/custom-model-versions.html'}), Document(page_content="When all fields are complete, click Add. 
The custom environment is ready for use in the Workshop.\n\nAfter you upload an environment, it is only available to you unless you share it with other individuals.\n\nTo make changes to an existing environment, create a new version.\n\nAdd an environment version {: #add-an-environment-version }\n\nTroubleshoot or update a custom environment by adding a new version of it to the Workshop. In the Versions tab, select New version.\n\nUpload the file for a new version and provide a brief description, then click Add.\n\nThe new version is available in the Verison tab; all past environment versions are saved for later use.\n\nView environment information {: #view-environment-information }\n\nThere is a variety of information available for each custom and built-in environment. To view:\n\nNavigate to Model Registry > Custom Model Workshop > Environments. The resulting list shows all environments available to your account, with summary information.\n\nFor more information on an individual environment, click to select:\n\nThe versions tab lists a variety of version-specific information and provides a link for downloading that version's environment context file.\n\nClick Current Deployments to see a list of all deployments in which the current environment has been used.\n\nClick Environment Info to view information about the general environment, not including version information.\n\nShare and download an environment {: #share-and-download-an-environment }\n\nYou can share custom environments with anyone in your organization from the menu options on the right. These options are not available to built-in environments because all organization members have access and these environment options should not be removed.", metadata={'source': 'https://docs.datarobot.com/en/docs/mlops/deployment/custom-models/custom-model-environments/custom-environments.html'}), Document(page_content="Navigate to Model Registry > Custom Model Workshop > Environments. 
The resulting list shows all environments available to your account, with summary information.\n\nFor more information on an individual environment, click to select:\n\nThe versions tab lists a variety of version-specific information and provides a link for downloading that version's environment context file.\n\nClick Current Deployments to see a list of all deployments in which the current environment has been used.\n\nClick Environment Info to view information about the general environment, not including version information.\n\nShare and download an environment {: #share-and-download-an-environment }\n\nYou can share custom environments with anyone in your organization from the menu options on the right. These options are not available to built-in environments because all organization members have access and these environment options should not be removed.\n\n!!! note\n An environment is not available in the model registry to other users unless it was explicitly shared. That does not, however, limit users' ability to use blueprints that include tasks that use that environment. See the description of implicit sharing for more information.\n\nFrom Model Registry > Custom Model Workshop > Environments, use the menu to share and/or delete any custom environment that you have appropriate permissions for. (Note that the link points to custom model actions, but the options are the same for custom tasks and environments.)\n\nSelf-Managed AI Platform admins {: #self-managed-ai-platform-admins }\n\nThe following is available only on the Self-Managed AI Platform.\n\nEnvironment availability {: #environment-availability }", metadata={'source': 'https://docs.datarobot.com/en/docs/mlops/deployment/custom-models/custom-model-environments/custom-environments.html'}), Document(page_content="Replace a model package {: #replace-a-model-package }\n\nActions (\n\nInventory or the\n\nOverview pages.\n\nYou are redirected to the Overview tab of the deployment. 
Click Import from to choose your method of model replacement.\n\nLocal File: Upload a model package exported from DataRobot AutoML to replace an existing model package (standalone MLOps users only).\n\nModel Registry: Select a model package from the Model Registry to replace an existing model package.\n\nPaste AutoML URL: Copy the URL of the model from the Leaderboard and paste it into the Replacement Model field.\n\nWhen you have confirmed the model package for replacement, select the replacement reason and click Accept and replace.\n\nModel replacement considerations {: #model-replacement-considerations }\n\nWhen replacing a deployed model, note the following:\n\nModel replacement is available for all deployments. Each deployment's model is provided as a model package, which can be replaced with another model package, provided it is compatible.\n!!! note\n The new model package cannot be the same leaderboad model as an existing champion or challenger; each challenger must be a unique model. If you create multiple model packages from the same leaderboard model, you can't use those models as challengers in the same deployment.\n\nWhile only the most current model is deployed, model history is maintained and can be used as a baseline for data drift.\n\nModel replacement validation {: #model-replacement-validation }\n\nDataRobot validates whether the new model is an appropriate replacement for the existing model and provides warning messages if issues are found. DataRobot compares the models to ensure that:\n\nThe target names and types match. For classification targets, the class names must match.\n\nThe feature types match.", metadata={'source': 'https://docs.datarobot.com/en/docs/mlops/manage-mlops/deploy-replace.html'})]
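As a rough illustration of what a similarity search does, the sketch below ranks a few hand-written vectors by their distance to a query vector and returns the closest ones. It is a toy example under stated assumptions: the two-dimensional vectors and labels are invented, Euclidean distance is assumed as the metric, and the function names are hypothetical. It does not show how FAISS stores or searches its index.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vector, indexed_vectors, k=4):
    """Return the k (label, distance) pairs closest to `query_vector`."""
    scored = [
        (label, euclidean_distance(query_vector, vector))
        for label, vector in indexed_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Invented 2-dimensional "embeddings"; real embeddings have hundreds of dimensions.
toy_index = {
    "custom model versions": [0.9, 0.1],
    "custom environments": [0.7, 0.3],
    "data drift": [0.1, 0.9],
}

nearest = top_k([1.0, 0.0], toy_index, k=2)
print([label for label, _ in nearest])
# ['custom model versions', 'custom environments']
```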
Define hooks for deploying an unstructured custom model
The following cell defines the methods used to deploy an unstructured custom model. These include loading the custom model and using the model for scoring.
import os

def load_model(input_dir):
    """Custom model hook for loading our knowledge base."""
    import os
    from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
    from langchain.vectorstores.faiss import FAISS
    embedding_function = SentenceTransformerEmbeddings(
        model_name=EMBEDDING_MODEL_NAME,
        cache_folder=input_dir + '/' + 'storage/deploy/sentencetransformers',
    )
    db = FAISS.load_local(input_dir + "/" + "storage/deploy/faiss-db", embedding_function)
    return db

def score_unstructured(model, data, query, **kwargs) -> tuple:
    """Custom model hook for retrieving relevant text from our knowledge base.

    When requesting predictions from the deployment, pass a dictionary
    with the following key:
    - 'question' the question to be passed to the retriever

    datarobot-user-models (DRUM) handles loading the model and calling
    this function with the appropriate parameters.

    Returns:
    --------
    rv : tuple
        A Json dictionary (serialized as a string) and a dictionary of
        response settings. The Json dictionary has one of these keys:
        - 'relevant' list of document texts retrieved for the question
        - 'error' - error message if exception in handling request
    """
    import json
    from langchain.chains import ConversationalRetrievalChain
    from langchain.vectorstores.base import VectorStoreRetriever
    from langchain.chat_models import AzureChatOpenAI
    try:
        db = model
        data_dict = json.loads(data)
        retriever = VectorStoreRetriever(vectorstore=db)
        documents = retriever.get_relevant_documents(data_dict['question'])
        relevant_text_list = [doc.page_content for doc in documents]
        rv = {"relevant": relevant_text_list}
        #rv['references'] = [doc.metadata['source'] for doc in rv.pop('source_documents')]
    except Exception as e:
        rv = {'error': f"{e.__class__.__name__}: {str(e)}"}
    return json.dumps(rv), {"mimetype": "application/json", "charset": "utf8"}
Test the hooks locally
Before proceeding with deployment, use the following cell to test that the custom model hooks work correctly.
import json
# Test the hooks locally
score_unstructured(
    load_model("."),
    json.dumps(
        {
            "question": "How do I replace a custom model on an existing custom environment?",
        }
    ),
    None,
)
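The scoring hook returns a tuple: a JSON string and a dictionary of response settings. The sketch below shows one way to unpack and inspect that return value. To stay self-contained it uses a stand-in hook, `stub_score_unstructured` (a hypothetical name), whose "model" is a plain list of strings searched by naive keyword overlap instead of a FAISS database; only the shape of the return value mirrors the hook defined above.

```python
import json


def stub_score_unstructured(model, data, query, **kwargs):
    """Stand-in hook: `model` is a list of strings, not a FAISS database."""
    try:
        question = json.loads(data)["question"]
        words = set(question.lower().split())
        # Naive keyword overlap, used here only in place of vector retrieval.
        relevant = [text for text in model if words & set(text.lower().split())]
        rv = {"relevant": relevant}
    except Exception as e:
        rv = {"error": f"{e.__class__.__name__}: {str(e)}"}
    return json.dumps(rv), {"mimetype": "application/json", "charset": "utf8"}


# Unpack the tuple, then decode the JSON body.
body, response_kwargs = stub_score_unstructured(
    ["replace a custom model", "share an environment"],
    json.dumps({"question": "replace model"}),
    None,
)
payload = json.loads(body)
print(payload["relevant"])  # ['replace a custom model']

# Malformed input is reported under the 'error' key instead of raising.
bad_body, _ = stub_score_unstructured([], "not json", None)
print("error" in json.loads(bad_body))  # True
```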
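Deploy the knowledge base
The following cell uses a convenience method that does the following:
- Builds a new custom model environment containing the contents of storage/deploy/.
- Assembles a new custom model with the provided hooks.
- Deploys an unstructured custom model to DataRobot.
- Returns an object that can be used to make predictions.
You can also provide an environment_id to use an existing custom model environment instead, which shortens the iteration cycle on the custom model hooks.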
import datarobotx as drx
deployment = drx.deploy(
    "storage/deploy/",
    name="External DR Knowledge Base",
    hooks={
        "score_unstructured": score_unstructured,
        "load_model": load_model
    },
    # extra_requirements=["langchain", "faiss-cpu", "sentence-transformers", "openai"],
    # Re-use existing environment if you want to change the hook code,
    # and not requirements
    environment_id="64c964448dd3f0c07f47d040"
)
# Enable storing prediction data, necessary for Data Export for monitoring purposes
deployment.dr_deployment.update_predictions_data_collection_settings(enabled=True)
# Deploying custom model
- Unable to auto-detect model type; any provided paths and files will be exported - dependencies should be explicitly specified using extra_requirements
- Preparing model and environment...
- Using environment [[DataRobot] Python 3.9 GenAI v4](https://staging.datarobot.com/model-registry/custom-environments/64c964448dd3f0c07f47d040) for deployment
- Configuring and uploading custom model...
100%|█████████████████████████████████████| 104M/104M [00:00<00:00, 113MB/s]
- Registered custom model [External DR Knowledge Base](https://staging.datarobot.com/model-registry/custom-models/65170268672fcc8004090943/info) with target type: Unstructured
- Creating and deploying model package...
- Created deployment [External DR Knowledge Base](https://staging.datarobot.com/deployments/65170277d708c73281b95de4/overview)
# Custom model deployment complete
Test the deployment
Test that the deployment can successfully respond to a question.
deployment.predict_unstructured(
    {
        "question": "How do I replace a custom model on an existing custom environment?",
    }
)
# Making predictions - Making predictions with deployment [External DR Knowledge Base](https://staging.datarobot.com/deployments/65170277d708c73281b95de4/overview) # Predictions complete {'relevant': ["title: Add custom model versions\ndescription: Update a model's contents to create a new version of the model due to new package versions, different preprocessing steps, hyperparameters, etc.\n\nAdd custom model versions {: #add-custom-model-versions }\n\nIf you want to update a model due to new package versions, different preprocessing steps, hyperparameters, and more, you can update the file contents to create a new version of the model. To upload a new version of a custom model environment, see Add an environment version.\n\nCreate a new minor version\n\nWhen you update the contents of a model, the minor version (1.1, 1.2, etc.) of the model automatically updates. To create a minor custom model version, select the model from the Custom Model Workshop and navigate to the Assemble tab. Under the Model header, click Add files and upload the files or folders you updated. The minor version is also updated if you delete a file.\n\nCreate a new major version\n\nTo create a new major version of a model (1.0, 2.0, etc.):\n\nSelect the model from the Custom Model Workshop and navigate to the Assemble tab.\n\nUnder the Model header, click New Version and, in the Create new model version dialog box: \n\n\nSelect a custom model version creation strategy: \n\n\nCopy contents of previous version: Bring the contents of the current version to the new version of the custom model.\n\n\nCreate empty version: Discard the contents of the current version and add new files for the new version of the custom model.\n\n\n\n\nSelect a Base Environment. The environment of the current version is selected by default.\n\n\nEnter a New version description. The version description is optional.", "When all fields are complete, click Add. 
The custom environment is ready for use in the Workshop.\n\nAfter you upload an environment, it is only available to you unless you share it with other individuals.\n\nTo make changes to an existing environment, create a new version.\n\nAdd an environment version {: #add-an-environment-version }\n\nTroubleshoot or update a custom environment by adding a new version of it to the Workshop. In the Versions tab, select New version.\n\nUpload the file for a new version and provide a brief description, then click Add.\n\nThe new version is available in the Verison tab; all past environment versions are saved for later use.\n\nView environment information {: #view-environment-information }\n\nThere is a variety of information available for each custom and built-in environment. To view:\n\nNavigate to Model Registry > Custom Model Workshop > Environments. The resulting list shows all environments available to your account, with summary information.\n\nFor more information on an individual environment, click to select:\n\nThe versions tab lists a variety of version-specific information and provides a link for downloading that version's environment context file.\n\nClick Current Deployments to see a list of all deployments in which the current environment has been used.\n\nClick Environment Info to view information about the general environment, not including version information.\n\nShare and download an environment {: #share-and-download-an-environment }\n\nYou can share custom environments with anyone in your organization from the menu options on the right. These options are not available to built-in environments because all organization members have access and these environment options should not be removed.", "Navigate to Model Registry > Custom Model Workshop > Environments. 
The resulting list shows all environments available to your account, with summary information.\n\nFor more information on an individual environment, click to select:\n\nThe versions tab lists a variety of version-specific information and provides a link for downloading that version's environment context file.\n\nClick Current Deployments to see a list of all deployments in which the current environment has been used.\n\nClick Environment Info to view information about the general environment, not including version information.\n\nShare and download an environment {: #share-and-download-an-environment }\n\nYou can share custom environments with anyone in your organization from the menu options on the right. These options are not available to built-in environments because all organization members have access and these environment options should not be removed.\n\n!!! note\n An environment is not available in the model registry to other users unless it was explicitly shared. That does not, however, limit users' ability to use blueprints that include tasks that use that environment. See the description of implicit sharing for more information.\n\nFrom Model Registry > Custom Model Workshop > Environments, use the menu to share and/or delete any custom environment that you have appropriate permissions for. (Note that the link points to custom model actions, but the options are the same for custom tasks and environments.)\n\nSelf-Managed AI Platform admins {: #self-managed-ai-platform-admins }\n\nThe following is available only on the Self-Managed AI Platform.\n\nEnvironment availability {: #environment-availability }", "Replace a model package {: #replace-a-model-package }\n\nActions (\n\nInventory or the\n\nOverview pages.\n\nYou are redirected to the Overview tab of the deployment. 
Click Import from to choose your method of model replacement.\n\nLocal File: Upload a model package exported from DataRobot AutoML to replace an existing model package (standalone MLOps users only).\n\nModel Registry: Select a model package from the Model Registry to replace an existing model package.\n\nPaste AutoML URL: Copy the URL of the model from the Leaderboard and paste it into the Replacement Model field.\n\nWhen you have confirmed the model package for replacement, select the replacement reason and click Accept and replace.\n\nModel replacement considerations {: #model-replacement-considerations }\n\nWhen replacing a deployed model, note the following:\n\nModel replacement is available for all deployments. Each deployment's model is provided as a model package, which can be replaced with another model package, provided it is compatible.\n!!! note\n The new model package cannot be the same leaderboad model as an existing champion or challenger; each challenger must be a unique model. If you create multiple model packages from the same leaderboard model, you can't use those models as challengers in the same deployment.\n\nWhile only the most current model is deployed, model history is maintained and can be used as a baseline for data drift.\n\nModel replacement validation {: #model-replacement-validation }\n\nDataRobot validates whether the new model is an appropriate replacement for the existing model and provides warning messages if issues are found. DataRobot compares the models to ensure that:\n\nThe target names and types match. For classification targets, the class names must match.\n\nThe feature types match."]}
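Import experimental generative AI methods into the Python client
Import two generative AI methods for use with DataRobot's Python client. These methods are experimental, so they are subject to change and require the Experimental version of the DataRobot Python library.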
from datarobot._experimental.models.genai.vector_database import CustomModelVectorDatabaseValidation
from datarobot._experimental.models.genai.vector_database import VectorDatabase
CustomModelVectorDatabaseValidation.create runs the validation of the vector database. Be sure to provide the deployment ID.
external_vdb_validation = CustomModelVectorDatabaseValidation.create(
    prompt_column_name="question",
    target_column_name="relevant",
    deployment_id=deployment.dr_deployment.id,
    wait_for_completion=True
)
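The column names passed to the validation correspond to the keys used by the scoring hook: "question" in the request and "relevant" in the response. The sketch below is a local sanity check that these names line up with a request and response pair. It is an assumption-laden illustration, not a description of what DataRobot's validation does; the helper name `check_columns` and the sample dictionaries are hypothetical.

```python
def check_columns(request, response, prompt_column_name, target_column_name):
    """Return a list of problems; an empty list means the names line up.

    `request` and `response` are plain dictionaries (hypothetical helper).
    """
    problems = []
    if prompt_column_name not in request:
        problems.append(f"request is missing prompt column '{prompt_column_name}'")
    if "error" in response:
        problems.append(f"deployment returned an error: {response['error']}")
    elif target_column_name not in response:
        problems.append(f"response is missing target column '{target_column_name}'")
    return problems


# Sample dictionaries shaped like the request and response used in this notebook.
sample_request = {"question": "How do I replace a custom model?"}
sample_response = {"relevant": ["Add custom model versions ..."]}

print(check_columns(sample_request, sample_response, "question", "relevant"))
# []
```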
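After validation completes, use VectorDatabase.create_from_custom_model(). You must provide the Use Case ID (accessible from DataRobot Workbench in the UI), a name for the external vector database, and the validation ID returned from the previous cell.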
use_case_id = "63e5063007c97ab5edd150a5"
vdb = VectorDatabase.create_from_custom_model(
    use_case_id,
    name="DR Custom Model",
    validation_id=external_vdb_validation.id
)