Direct Preference Optimization for reinforcement learning
Access this AI accelerator on GitHub
Direct Preference Optimization (DPO) is a machine learning technique used to train AI agents, providing a simpler, more stable alternative to traditional Reinforcement Learning from Human Feedback (RLHF). It helps align AI models with human preferences in ways that are hard to capture through pre-training alone, making models more helpful, harmless, and honest. It can improve things like following instructions accurately, avoiding harmful content, and producing more useful responses.
Generally, the technique works as follows:
- A model is trained on large amounts of text data to predict what comes next in sequences (pre-training).
- Humans evaluate and rank the model's outputs for a set of prompts.
- Based on this feedback, a separate model learns to predict the preferred response.
- The main model is fine-tuned using reinforcement learning algorithms to generate higher-scoring outputs.
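DPO collapses the last two steps into a single supervised objective computed directly on preference pairs, with a frozen copy of the model acting as the reference instead of a learned reward model. For a prompt $x$ with preferred response $y_w$ and rejected response $y_l$, the loss from the original DPO paper (Rafailov et al., 2023) is:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

Here $\pi_\theta$ is the model being fine-tuned, $\pi_{\mathrm{ref}}$ is the frozen reference model, $\sigma$ is the sigmoid function, and $\beta$ controls how far the fine-tuned model may drift from the reference.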
This accelerator automates the process of fine-tuning an LLM using Direct Preference Optimization (DPO) and then deploying that model to DataRobot. Essentially, it takes a base model, teaches it to prefer specific types of responses based on a provided dataset, and prepares it for production use.
Specific accelerator actions:
Data preparation
- Downloads a specific preference dataset from the DataRobot Registry.
- Uses the Hugging Face datasets library to load the CSV, which typically contains three columns: a prompt, a chosen (preferred) response, and a rejected response.
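The two data preparation steps above might look roughly like the following sketch, which assumes the public datarobot and Hugging Face datasets packages; the dataset ID, API token, and file name are placeholders rather than values from this accelerator.

```python
# Illustrative sketch only: IDs, tokens, and file names below are placeholders.
import datarobot as dr
from datasets import load_dataset

dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")  # placeholder credentials

# Download the preference dataset from the DataRobot Registry to a local CSV file.
registry_dataset = dr.Dataset.get("<dataset-id>")            # placeholder dataset ID
registry_dataset.get_file(file_path="preference_data.csv")   # hypothetical local file name

# Load the CSV with Hugging Face `datasets`; each row holds a prompt,
# a chosen (preferred) response, and a rejected response.
preference_data = load_dataset("csv", data_files="preference_data.csv")["train"]
print(preference_data.column_names)  # expected: ['prompt', 'chosen', 'rejected']
```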
Model training (DPO)
- Initializes the Qwen2-0.5B-Instruct model in bfloat16 precision to save memory.
- Applies DPOTrainer from the TRL (Transformer Reinforcement Learning) library. This is a modern alternative to RLHF that aligns models to human preferences without needing a separate reward model.
- For hardware efficiency, the script is configured for FSDP (Fully Sharded Data Parallel). This allows the model to be trained across multiple GPUs by "sharding" the model weights, and it uses gradient checkpointing to further reduce VRAM usage.
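A minimal sketch of this training setup is shown below, assuming recent versions of transformers, datasets, and trl (argument names such as processing_class differ across trl releases). FSDP itself is typically enabled through an accelerate launch configuration rather than in the script, so only the in-script pieces appear here; the output directory, batch size, and beta value are illustrative.

```python
# Hedged sketch of DPO fine-tuning with TRL; hyperparameters are illustrative.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference pairs with 'prompt', 'chosen', and 'rejected' columns.
train_dataset = load_dataset("csv", data_files="preference_data.csv")["train"]  # hypothetical path

config = DPOConfig(
    output_dir="dpo-qwen2-0.5b",        # hypothetical output directory
    per_device_train_batch_size=2,
    gradient_checkpointing=True,        # trades extra compute for lower VRAM usage
    bf16=True,
    beta=0.1,                           # strength of the preference constraint
)

trainer = DPOTrainer(
    model=model,                        # a frozen reference copy is created internally when ref_model is omitted
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```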
Model consolidation and saving
- Applies a specialized weight-gathering routine to "gather" the FSDP shards back into a single, cohesive model file.
- Ensures that only the main process (rank 0) handles the final file writing to avoid data corruption or redundant saves.
DataRobot deployment
- Once the model is saved locally, the script uses the DataRobot API to create a custom model.
- The accelerator uploads the fine-tuned weights and configuration to a specific DataRobot runtime environment, making the model ready for deployment and accessible via a REST API.
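The registration step can be sketched with the public datarobot client as follows; class and argument names may differ between client versions, and the model name, environment ID, and paths are placeholders rather than values from this accelerator.

```python
# Hedged sketch of registering the fine-tuned model as a DataRobot custom model.
import datarobot as dr

dr.Client(endpoint="https://app.datarobot.com/api/v2", token="YOUR_API_TOKEN")  # placeholder credentials

# Create the custom model entry in DataRobot.
custom_model = dr.CustomInferenceModel.create(
    name="qwen2-0.5b-dpo",                        # hypothetical model name
    target_type=dr.TARGET_TYPE.TEXT_GENERATION,
    target_name="response",                       # hypothetical target column
)

# Upload the consolidated weights and configuration as a new model version,
# pinned to a suitable runtime environment.
version = dr.CustomModelVersion.create_clean(
    custom_model_id=custom_model.id,
    base_environment_id="<runtime-environment-id>",   # placeholder environment ID
    folder_path="dpo-qwen2-0.5b/final",                # model files (weights, tokenizer, any custom.py hooks)
)
print(f"Custom model {custom_model.id}, version {version.id} is ready to deploy.")
```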