
Fine-Tuning an Open-Source LLM with Axolotl Using Direct Preference Optimization (DPO) — SitePoint

Published on December 10th, 2024

Introduction

The power of Large Language Models (LLMs) has unlocked a range of new possibilities in artificial intelligence (AI) applications. These models, trained on massive datasets, are capable of understanding and generating human-like responses. However, while pre-trained models like GPT-4 are incredibly powerful, fine-tuning a smaller model can be a cost-effective way to achieve similar results. This article will walk you through the process of fine-tuning an open-source LLM using Axolotl and Direct Preference Optimization (DPO), allowing you to customize the model without needing to write any code.

What Is an LLM?

A Large Language Model (LLM) is an AI model trained on a vast amount of text data, enabling it to predict the next word in a sequence. The success of these models has been made possible by advances in GPU compute, allowing models with tens of billions of parameters to be trained in a matter of weeks. These models, like ChatGPT or Claude, are capable of understanding context and generating human-like responses, making them invaluable tools for various AI applications.

Why Fine-Tune an LLM?

While using powerful models like GPT-4 might seem like a good solution, fine-tuning smaller models can offer comparable results at a significantly reduced cost. Fine-tuning also allows you to maintain control over your intellectual property, eliminating the reliance on third-party service providers. Furthermore, it enables customization tailored to specific needs, whether it’s for improved accuracy, specialized language, or even compliance with particular guidelines.

Types of LLMs: Base, Instruct, and Chat Models

To understand fine-tuning, it’s important to know the different types of LLMs:

  1. Base Models: These are pretrained on large, unstructured text datasets. They have a broad understanding of language but are not optimized for following instructions or holding conversations.
  2. Instruct Models: Built on top of base models, these models are fine-tuned with structured data like prompt-response pairs. They are better at following specific instructions and answering questions.
  3. Chat Models: Similar to instruct models, chat models are trained using conversational data to engage in more natural, back-and-forth dialogue.
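The distinction shows up concretely in the input format: instruct and chat models consume role-tagged messages rather than raw text. Here is a minimal sketch in Python (the field names follow the widely used messages convention, and the render function is a deliberately naive stand-in for a real, model-specific chat template):

```python
# A sketch of the message structure most instruct/chat models expect.
# The "system" turn sets behavior; "user" and "assistant" turns alternate.
conversation = [
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "What does fine-tuning an LLM mean?"},
    {"role": "assistant", "content": "Training a pretrained model further on targeted data to adapt it to a task."},
]

def render_prompt(messages):
    """Naively render messages into one training string. Real chat
    templates (such as Qwen's) use model-specific control tokens;
    this is only an illustration of the idea."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

print(render_prompt(conversation))
```

A base model, by contrast, would simply be fed the raw concatenated text with no roles at all.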

Understanding Direct Preference Optimization (DPO)

Reinforcement Learning (RL) is one approach used to refine model behavior, but Direct Preference Optimization (DPO) offers a more targeted way to optimize models. DPO works by training models using pairs of “good” and “bad” responses for the same prompt, teaching the model to prioritize better responses.

DPO is particularly useful for:

  • Style Adjustments: Tweaking response length, detail, or confidence levels.
  • Safety Measures: Teaching models to refuse unsafe or inappropriate requests.

However, DPO is not suitable for teaching new facts or knowledge, as it focuses on refining responses rather than expanding the model’s knowledge base. For new information, Supervised Fine-Tuning (SFT) or Retrieval-Augmented Generation (RAG) would be better alternatives.
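For the mathematically inclined, DPO trains on a simple logistic loss over each preference pair: it pushes the tuned model to prefer the chosen response over the rejected one, relative to a frozen reference copy of the model. Using the notation of the original DPO paper, where $\pi_\theta$ is the model being tuned, $\pi_{\mathrm{ref}}$ the frozen reference, $y_w$/$y_l$ the chosen and rejected responses, and $\beta$ a temperature-like hyperparameter:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Axolotl handles this loss for you; you only supply the preference pairs.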

Creating a DPO Dataset

Creating an effective DPO dataset can involve either user feedback or synthetic data generation. For example:

  • User Feedback: Collect responses to prompts and ask users to rate them or choose the better option.
  • Synthetic Data: If user feedback is unavailable, you can generate pairs with a larger model like GPT-4, prompting it to produce a strong response and a deliberately weaker one for each prompt. This yields a synthetic dataset of "good" and "bad" responses.

A typical DPO dataset should include at least 500–1,000 pairs to avoid overfitting. Larger datasets can range up to 15,000–20,000 pairs, which offer more robust training.
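In practice, such a dataset is often stored as JSONL, with one preference pair per line. Below is a minimal sketch that writes two invented example pairs using the same field names the Axolotl configuration later in this article expects (`conversation`, `chosen`, `rejected`):

```python
import json

# Each record holds the shared prompt turns plus a preferred ("chosen")
# and a dispreferred ("rejected") assistant reply. The content here is
# invented purely for illustration.
pairs = [
    {
        "conversation": [{"role": "user", "content": "Summarize DPO in one sentence."}],
        "chosen": {"role": "assistant", "content": "DPO fine-tunes a model directly on preference pairs, teaching it to favor the better response."},
        "rejected": {"role": "assistant", "content": "DPO is a thing for models."},
    },
    {
        "conversation": [{"role": "user", "content": "Is it safe to share my password?"}],
        "chosen": {"role": "assistant", "content": "No. Never share passwords; use a password manager instead."},
        "rejected": {"role": "assistant", "content": "Sure, just tell me your password."},
    },
]

with open("dpo_pairs.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```

A real dataset would simply repeat this structure for hundreds or thousands of pairs.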

Fine-Tuning a Model Using Axolotl

In this guide, we’ll fine-tune the Qwen2.5 3B Instruct model using Axolotl. Axolotl is a powerful tool that allows you to fine-tune models using a simple YAML configuration file, eliminating the need to write any code.

Configuration File

Here’s an example of the config.yml file we’ll use to fine-tune the model:

```yaml
base_model: Qwen/Qwen2.5-3B-Instruct
strict: false

chat_template: qwen_25
rl: dpo
datasets:
  - path: olivermolenschot/alpaca_messages_dpo_test
    type: chat_template.default
    field_messages: conversation
    field_chosen: chosen
    field_rejected: rejected
    message_field_role: role
    message_field_content: content

output_dir: /workspace/dpo-output
sequence_len: 8192
sample_packing: false
pad_to_sequence_len: true

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: constant
learning_rate: 0.00005

bf16: auto
gradient_checkpointing: true
flash_attention: true

saves_per_epoch: 1
logging_steps: 1
warmup_steps: 0
```

Setting Up the Cloud Environment

To run the training, you’ll need cloud infrastructure such as Runpod or Vultr. Here are the key requirements:

  • Docker Image: Use the winglian/axolotl-cloud:main Docker image.
  • Hardware: A GPU with at least 80GB of VRAM (e.g., a single A100 PCIe 80GB node).
  • Storage: 200GB of volume storage for model files.
  • CUDA Version: Version 12.1 or higher.

Steps to Start Training

  1. Set the HuggingFace Cache Directory:

     ```bash
     export HF_HOME=/workspace/hf
     ```

  2. Create the Configuration File: Save the config.yml shown above to /workspace/config.yml.
  3. Start Training: Run the following command to begin the fine-tuning process:

     ```bash
     python -m axolotl.cli.train /workspace/config.yml
     ```

Uploading the Fine-Tuned Model

Once the training is complete, you can upload the model to HuggingFace using the CLI:

  1. Install the HuggingFace Hub CLI (the quotes keep the extras spec from being interpreted as a glob in shells like zsh):

     ```bash
     pip install "huggingface_hub[cli]"
     ```

  2. Upload the Model (the repository ID comes first, then the local folder):

     ```bash
     huggingface-cli upload yourname/yourrepo /workspace/dpo-output
     ```

Evaluating the Fine-Tuned Model

To evaluate the fine-tuned model, you can host both the original and fine-tuned models using a tool like Text Generation Inference (TGI). By performing inference with both models, you can manually compare their outputs to ensure that the fine-tuning process has met your expectations.
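One lightweight way to run that comparison is to query each server's /generate endpoint and print the outputs side by side. Here's a minimal sketch, assuming the original model is served at port 8080 and the fine-tuned one at 8081 (the URLs and prompts are placeholders):

```python
import json
import urllib.request

def build_payload(prompt, max_new_tokens=200):
    """Build the JSON body that TGI's /generate endpoint expects."""
    return {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}

def generate(base_url, prompt, max_new_tokens=200):
    """POST a prompt to a TGI server and return the generated text."""
    data = json.dumps(build_payload(prompt, max_new_tokens)).encode()
    req = urllib.request.Request(
        f"{base_url}/generate",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["generated_text"]

# Example usage (requires both TGI servers to be running):
#   for prompt in ("Explain DPO in two sentences.", "How do I reset my password?"):
#       print("PROMPT:", prompt)
#       print("  original  :", generate("http://localhost:8080", prompt))
#       print("  fine-tuned:", generate("http://localhost:8081", prompt))
```

Running a handful of representative prompts through both endpoints makes style or safety regressions easy to spot by eye.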

Conclusion

Fine-tuning an LLM using DPO with Axolotl provides a simple, efficient way to customize an open-source model to fit your needs. Whether you want to adjust the style, improve safety, or enhance the model’s performance for specific tasks, DPO offers a straightforward approach to refining your LLM. With tools like Axolotl, you can easily leverage cloud computing to perform this training without writing any code.
