How to Fine-Tune an LLM

Fine-tuning adapts a pre-trained language model to a specific domain or task by training it on curated data. This guide covers when fine-tuning makes sense, how to prepare data, which methods to use, and how to evaluate results.

When to Fine-Tune

Fine-tuning is not always the right choice. Consider this decision framework:

Approach | When to Use | Cost | Effort
Prompt Engineering | Output format or tone adjustments | Low | Minutes to hours
RAG | Need access to external/current knowledge | Medium | Days
Fine-Tuning | Domain-specific language, consistent style, or instruction following | High | Days to weeks

Fine-tune when the model needs to learn patterns that cannot be expressed through prompts alone – specialized terminology, consistent output formats, or domain-specific reasoning.

Rule of thumb: If LoRA/QLoRA fine-tuning does not improve results, full fine-tuning likely will not either. Start with parameter-efficient methods first.

Data Preparation

Data quality matters far more than quantity. Keep examples clean and consistently formatted, deduplicate aggressively, and make sure every example reflects the style and structure you want the model to reproduce.

Example JSONL entry:

{"instruction": "Summarize this medical report", "input": "Patient presented with...", "output": "Summary: The patient..."}
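Before training, it is worth checking the file programmatically. A minimal validation sketch (the required keys match the example above; adapt them to your own schema):

```python
import json

# Required keys for the instruction-tuning schema shown above.
REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_jsonl(path):
    """Return a list of (line_number, problem) pairs; empty means the file is clean."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                errors.append((lineno, "invalid JSON"))
                continue
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                errors.append((lineno, f"missing keys: {sorted(missing)}"))
    return errors
```

Running this before every training job catches truncated lines and schema drift early, when they are cheap to fix.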


Fine-Tuning Methods

Method | Description | VRAM Required | When to Use
Full Fine-Tuning | Updates all model weights | Very high (multi-GPU) | Drastically different domains only
LoRA | Freezes weights, adds trainable low-rank adapters | Moderate (single GPU) | Most use cases
QLoRA | LoRA with 4-bit quantized base model | Low (consumer GPU) | Large models on limited hardware
Spectrum | Selects informative layers via SNR analysis | Moderate | Distributed training

LoRA (Low-Rank Adaptation) is the recommended starting point. It trains only a small number of additional parameters while keeping the base model frozen, drastically reducing compute requirements.
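The low-rank idea fits in a few lines of NumPy. The dimensions, rank, and scaling below are illustrative, not prescriptive:

```python
import numpy as np

# LoRA in one equation: the frozen weight W is used as W + (alpha/r) * B @ A,
# where A (r x d) and B (d x r) are the only trainable matrices.
d, r, alpha = 4096, 16, 32          # hidden size, rank, scaling (illustrative)

W = np.random.randn(d, d)           # frozen pretrained weight
A = np.random.randn(r, d) * 0.01    # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-initialized

# Because B starts at zero, B @ A == 0, so the adapted model is exactly
# the base model at initialization and drifts only as A and B train.
W_eff = W + (alpha / r) * (B @ A)

trainable_fraction = (A.size + B.size) / W.size
print(trainable_fraction)           # -> 0.0078125, under 1% of the matrix
```

At rank 16 on a 4096-wide layer, the adapters hold 2·r·d parameters versus d² in the full matrix, which is where the dramatic memory savings come from.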

QLoRA extends LoRA by quantizing the base model to 4-bit precision, enabling fine-tuning of models in the 30B range on a single consumer GPU (24GB VRAM); the QLoRA paper demonstrated fine-tuning a 65B-parameter model on a single 48GB GPU.

Tools and Platforms

Tool | Strengths | Best For
Hugging Face TRL + SFTTrainer | Industry standard, supports QLoRA, DeepSpeed, Flash Attention | Full control over training
Unsloth | 2x faster training, beginner-friendly notebooks | Quick experiments, consumer hardware
Axolotl | YAML-config training pipelines | Reproducible workflows
OpenAI Fine-Tuning API | Managed service, no hardware needed | GPT model customization

A typical Hugging Face QLoRA setup:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTTrainer
from peft import LoraConfig

# Load the base model in 4-bit precision (the "Q" in QLoRA).
# The model name is a placeholder; substitute your own.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config
)

peft_config = LoraConfig(
    r=16,                # rank of the low-rank update matrices
    lora_alpha=32,       # scaling factor, commonly 2x the rank
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,  # your prepared instruction dataset
    peft_config=peft_config,
    packing=True,        # concatenate short samples into full-length sequences
)
trainer.train()


Hardware Requirements

Method | Model Size | Minimum GPU
QLoRA | 7B | RTX 4080 (16GB)
QLoRA | 70B | 48GB card (e.g., RTX A6000)
LoRA | 7B | RTX 4090 (24GB)
Full | 7B | 2-4x A100 (80GB)
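These figures follow from simple arithmetic on weight storage. A quick sketch (weights only; gradients, optimizer state, and activations add substantially more):

```python
# Rough VRAM needed just to hold the model weights at a given precision.
def weight_memory_gb(params_billion, bits_per_param):
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

print(weight_memory_gb(7, 16))   # fp16 7B   -> 14.0 GB
print(weight_memory_gb(7, 4))    # 4-bit 7B  -> 3.5 GB
print(weight_memory_gb(70, 4))   # 4-bit 70B -> 35.0 GB, beyond a 24GB card
```

This is why a 7B model quantized to 4 bits trains comfortably on a 16GB card, while a 70B model needs a 48GB-class GPU even before training overhead is counted.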

Optimizations that reduce memory usage include gradient checkpointing, gradient accumulation, mixed-precision training, 8-bit optimizers, and Flash Attention.

Evaluation

Track training loss and validation loss during training; after training, measure task-level quality with metrics suited to your task, such as exact match, ROUGE, or LLM-as-judge scoring.

Use early stopping and save checkpoints frequently. Test on held-out data that the model has never seen.
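Early stopping can be as simple as watching the validation-loss history. A homemade sketch (when using the Hugging Face Trainer, the built-in EarlyStoppingCallback does this for you):

```python
# Stop once validation loss has failed to improve for `patience`
# consecutive evaluations.
def should_stop(val_losses, patience=3):
    best, evals_since_best = float("inf"), 0
    for loss in val_losses:
        if loss < best:
            best, evals_since_best = loss, 0
        else:
            evals_since_best += 1
        if evals_since_best >= patience:
            return True
    return False
```

For example, `should_stop([1.2, 0.9, 0.92, 0.93, 0.95])` returns True, because three evaluations in a row failed to beat the best loss of 0.9, which is a typical sign the model has begun to overfit.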

Common Pitfalls

See Also
