Deploying LLMs from the Hugging Face Hub requires access to premium features for GenAI experimentation and GPU inference. Contact your DataRobot representative or administrator for information on enabling the required features.
This infrastructure uses the vLLM library, an open source framework for LLM inference and serving, together with Hugging Face libraries to seamlessly download and load popular open source LLMs from the Hugging Face Hub. To get started, customize the text generation model template. The template uses the Llama-3.1-8b LLM by default; however, you can change the selected model by modifying the engine_config.json file to specify the name of the OSS model you want to use.
Before uploading the custom LLM's required files, select [GenAI] vLLM Inference Server from the Base environment list. The model environment is used for testing the custom model and deploying the registered custom model.
After adding the model files, you can select the Hugging Face model to load. By default, this text generation example uses the Llama-3.1-8b model. To change the selected model, click edit next to the engine_config.json file to modify the --model argument.
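For illustration, an edited engine_config.json that points the vLLM server at a different Hub model might look like the sketch below. This is an assumption about the file's shape, not the authoritative schema: it assumes the config passes vLLM server arguments through an args list, and the model name shown is an arbitrary example. Refer to the engine_config.json shipped with the template for the actual structure.

```json
{
  "args": [
    "--model",
    "mistralai/Mistral-7B-Instruct-v0.3"
  ]
}
```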
After the custom LLM's files are assembled, configure the runtime parameters defined in the custom model's model-metadata.yaml file. The following runtime parameters are Not set and require configuration:
A "universal" prompt prepended to all individual prompts for the custom LLM. It instructs and formats the LLM response. The system prompt can impact the structure, tone, format, and content that is created during the generation of the response.
In addition, you can update the default values of the following runtime parameters.
| Runtime parameter | Description |
|---|---|
| `max_tokens` | The maximum number of tokens that can be generated in the chat completion. This value can be used to control costs for text generated via the API. |
| `max_model_len` | The model context length. If unspecified, this value is automatically derived from the model configuration. |
| `prompt_column_name` | The name of the input column containing the LLM prompt. |
| `gpu_memory_utilization` | The fraction of GPU memory to use for the model executor, ranging from 0 to 1. For example, a value of 0.5 indicates 50% GPU memory utilization. If unspecified, the default value of 0.9 is used. |
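As a rough sketch, declaring these runtime parameters in the custom model's model-metadata.yaml might look like the excerpt below. It follows DataRobot's runtime parameter definition format (fieldName, type, defaultValue, description), but the specific values are illustrative only; the template ships its own definitions, which take precedence.

```yaml
# Illustrative excerpt of model-metadata.yaml; values below are examples only.
name: vllm-text-generation
targetType: textgeneration
runtimeParameterDefinitions:
  - fieldName: system_prompt
    type: string
    description: Universal prompt prepended to every user prompt.
  - fieldName: max_tokens
    type: numeric
    defaultValue: 512
    description: Maximum number of tokens generated per chat completion.
  - fieldName: gpu_memory_utilization
    type: numeric
    defaultValue: 0.9
    description: Fraction of GPU memory available to the model executor.
```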
Advanced configuration
For more in-depth information on the runtime parameters and configuration options available for the [GenAI] vLLM Inference Server execution environment, see the environment README.
You can also calculate GPU memory requirements in Hugging Face.
For our example model, Llama-3.1-8b, this evaluates to the following:
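The arithmetic below is a sketch of the commonly used approximation M = (P × 4 bytes) / (32 / Q) × 1.2, where P is the parameter count in billions, Q is the number of bits used to load the model (16-bit here), and 1.2 accounts for roughly 20% overhead; confirm the exact formula against the Hugging Face reference above.

```python
# Approximate GPU memory needed to serve Llama-3.1-8b in 16-bit precision.
# M = (P * 4 bytes) / (32 / Q) * 1.2  (approximation only)
params_billions = 8        # Llama-3.1-8b
bits_per_weight = 16       # bfloat16 / float16 loading
overhead = 1.2             # ~20% extra beyond the raw weights

memory_gb = (params_billions * 4) / (32 / bits_per_weight) * overhead
print(f"Estimated GPU memory: {memory_gb:.1f} GB")  # -> 19.2 GB
```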
Therefore, our model requires 19.2 GB of memory, indicating that you should select the GPU - L bundle (1 x NVIDIA A10G | 24GB VRAM | 8 CPU | 32GB RAM).
In the code snippet below, replace the {DATAROBOT_DEPLOYMENT_ID} and {DATAROBOT_API_KEY} placeholders with the LLM deployment's ID and your DataRobot API key.
Call the Chat API for the deployment
```python
from openai import OpenAI
from openai import Stream
import os

dr_base_url = "https://app.datarobot.com/api/v2/deployments/{DATAROBOT_DEPLOYMENT_ID}/"
dr_api_key = os.getenv("{DATAROBOT_API_KEY}")

client = OpenAI(base_url=dr_base_url, api_key=dr_api_key)

completion = client.chat.completions.create(
    model="datarobot-deployed-llm",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Where is DataRobot headquartered?"},
    ],
    stream=False,
)

print(completion.to_json(indent=2))
```
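Because the snippet imports Stream but sends the request with stream=False, a streaming variant is sketched below for reference. It is not part of the original example; it reuses the same client and deployment and only changes how the response is consumed.

```python
# Streaming variant of the same request: stream=True returns an iterator
# of chunks instead of a single completion object.
stream = client.chat.completions.create(
    model="datarobot-deployed-llm",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Where is DataRobot headquartered?"},
    ],
    stream=True,
)

for chunk in stream:
    # Each chunk carries an incremental delta of the generated text.
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
```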