Fine-tuning adapts a pre-trained language model to a specific domain or task by training it on curated data. This guide covers when fine-tuning makes sense, how to prepare data, which methods to use, and how to evaluate results.
Fine-tuning is not always the right choice. Consider this decision framework:
| Approach | When to Use | Cost | Effort |
|---|---|---|---|
| Prompt Engineering | Output format or tone adjustments | Low | Minutes to hours |
| RAG | Need access to external/current knowledge | Medium | Days |
| Fine-Tuning | Domain-specific language, consistent style, or instruction following | High | Days to weeks |
Fine-tune when the model needs to learn patterns that cannot be expressed through prompts alone: specialized terminology, consistent output formats, or domain-specific reasoning.
Rule of thumb: start with parameter-efficient methods. If LoRA/QLoRA fine-tuning does not improve results, full fine-tuning likely will not either.
Data quality matters far more than quantity. Structure each training example with `instruction`, `input`, and `output` fields. Example JSONL entry:

```json
{"instruction": "Summarize this medical report", "input": "Patient presented with...", "output": "Summary: The patient..."}
```
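A quick sanity check over such a file catches malformed rows before they silently degrade training. A minimal sketch using only the standard library (the field names follow the example above; `validate_jsonl` is an illustrative helper, not part of any training framework):

```python
import json

REQUIRED_FIELDS = {"instruction", "input", "output"}

def validate_jsonl(lines):
    """Parse JSONL lines, returning valid rows and (line_number, reason) errors."""
    valid, errors = [], []
    for i, line in enumerate(lines, start=1):
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            errors.append((i, "invalid JSON"))
            continue
        missing = REQUIRED_FIELDS - row.keys()
        if missing:
            errors.append((i, f"missing fields: {sorted(missing)}"))
        else:
            valid.append(row)
    return valid, errors

sample = [
    '{"instruction": "Summarize this medical report", "input": "Patient presented with...", "output": "Summary: The patient..."}',
    '{"instruction": "Summarize", "input": "..."}',  # missing "output"
]
valid, errors = validate_jsonl(sample)
print(len(valid), errors)  # 1 [(2, "missing fields: ['output']")]
```

Running a check like this before every training run is cheap insurance; a single truncated line can otherwise abort a multi-hour job.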
| Method | Description | VRAM Required | When to Use |
|---|---|---|---|
| Full Fine-Tuning | Updates all model weights | Very high (multi-GPU) | Drastically different domains only |
| LoRA | Freezes weights, adds trainable low-rank adapters | Moderate (single GPU) | Most use cases |
| QLoRA | LoRA with 4-bit quantized base model | Low (consumer GPU) | Large models on limited hardware |
| Spectrum | Selects informative layers via SNR analysis | Moderate | Distributed training |
LoRA (Low-Rank Adaptation) is the recommended starting point. It trains only a small number of additional parameters while keeping the base model frozen, drastically reducing compute requirements.
QLoRA extends LoRA by quantizing the base model to 4-bit precision; the original QLoRA work fine-tuned a 65B-parameter model on a single 48GB GPU, and it brings 7B-13B models within reach of a 24GB consumer card.
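The parameter savings are easy to quantify: next to a frozen weight matrix of shape d x k, LoRA trains two adapters A (d x r) and B (r x k), so only r * (d + k) parameters update per layer. A back-of-the-envelope sketch (the layer size is illustrative of a 7B-class attention projection, not taken from a specific model):

```python
def lora_trainable_params(d, k, r):
    """LoRA trains A (d x r) and B (r x k) beside a frozen d x k weight."""
    return r * (d + k)

d = k = 4096   # typical attention projection size in a 7B-class model
r = 16         # a common LoRA rank
full = d * k
lora = lora_trainable_params(d, k, r)
print(full, lora, f"{lora / full:.2%}")  # 16777216 131072 0.78%
```

At rank 16, the adapters for this layer are under 1% of the frozen weights, which is why LoRA fits on a single GPU where full fine-tuning does not.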
| Tool | Strengths | Best For |
|---|---|---|
| Hugging Face TRL + SFTTrainer | Industry standard, supports QLoRA, DeepSpeed, Flash Attention | Full control over training |
| Unsloth | 2x faster training, beginner-friendly notebooks | Quick experiments, consumer hardware |
| Axolotl | YAML-config training pipelines | Reproducible workflows |
| OpenAI Fine-Tuning API | Managed service, no hardware needed | GPT model customization |
A typical Hugging Face QLoRA setup (the model checkpoint is illustrative; `dataset` is your prepared training set):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer

# Load the base model in 4-bit precision (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
)

# Trainable low-rank adapters on the attention projections
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,  # your prepared instruction dataset
    peft_config=peft_config,
    packing=True,           # pack short examples into full-length sequences
)
trainer.train()
```
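The `packing=True` flag concatenates short examples into fixed-length sequences so no compute is wasted on padding. A dependency-free sketch of the idea (token IDs are invented; real packing operates on tokenizer output and the actual EOS id varies by model):

```python
EOS = 0  # assumed end-of-sequence token id

def pack_sequences(examples, block_size):
    """Concatenate tokenized examples, EOS-separated, into fixed-size blocks."""
    stream = []
    for ex in examples:
        stream.extend(ex)
        stream.append(EOS)
    # Drop the ragged tail so every block is exactly block_size long
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
blocks = pack_sequences(examples, block_size=4)
print(blocks)  # [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
```

With instruction datasets full of short examples, packing can substantially raise effective throughput, since every position in every batch carries a real token.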
| Method | Model Size | Minimum GPU |
|---|---|---|
| QLoRA | 7B | RTX 4080 (16GB) |
| QLoRA | 70B | A6000 / A40 (48GB) |
| LoRA | 7B | RTX 4090 (24GB) |
| Full | 7B | 2-4x A100 (80GB) |
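These minimums follow from simple arithmetic: the model weights alone occupy parameters times bytes-per-parameter, before adapters, optimizer state, and activations are added. A rough estimator (weights only; real usage is higher):

```python
def base_weight_gb(n_params_billion, bits_per_param):
    """Approximate memory for model weights alone, in GB (1 GB ~ 1e9 bytes)."""
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

print(base_weight_gb(7, 4))    # 3.5  -> a 4-bit 7B model fits a 16GB card easily
print(base_weight_gb(70, 4))   # 35.0 -> 4-bit 70B weights alone exceed 24GB
print(base_weight_gb(7, 16))   # 14.0 -> 16-bit 7B leaves little headroom once
                               #         optimizer state and activations are added
```

This is why full fine-tuning of even a 7B model needs multi-GPU setups: 16-bit weights plus Adam optimizer state (roughly 2-3x the weights again) plus activations quickly outgrow a single card.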
Optimizations that reduce memory usage further include gradient checkpointing (recompute activations rather than storing them), gradient accumulation (simulate a large batch with several small micro-batches), mixed-precision training, and paged optimizers.
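One common memory optimization, gradient accumulation, simulates a large batch by averaging gradients over several micro-batches before each optimizer step; for equal-sized micro-batches the result matches a single large-batch step exactly. A toy scalar sketch, no framework required:

```python
def grad(w, batch):
    """Gradient of mean squared error 0.5*(w*x - y)^2 over a batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.0

# One full-batch gradient vs. the average of two micro-batch gradients
full = grad(w, data)
accumulated = (grad(w, data[:2]) + grad(w, data[2:])) / 2
print(full, accumulated)  # identical: -15.0 -15.0
```

Because only one micro-batch of activations is live at a time, peak memory drops while the optimization trajectory is unchanged, which is why trainers expose this as a simple `gradient_accumulation_steps`-style knob.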
Track training loss, validation loss, and task-level metrics (accuracy, ROUGE, or human preference, depending on the task) during and after training. Use early stopping and save checkpoints frequently. Test on held-out data that the model has never seen.
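The early-stopping logic itself is just patience over validation loss; trainers expose it as a callback, but the underlying rule fits in a few lines. A minimal sketch (the loss values are invented):

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the 1-indexed epoch at which to stop, or None if never triggered.

    Stops once validation loss has failed to improve for `patience`
    consecutive evaluations.
    """
    best = float("inf")
    bad_evals = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, bad_evals = loss, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return epoch
    return None

losses = [1.9, 1.4, 1.2, 1.25, 1.3, 1.31]
print(early_stop_epoch(losses))  # 5: no improvement at epochs 4 and 5
```

Restoring the checkpoint from the best epoch (epoch 3 in this example) rather than the last one is what actually guards against overfitting.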