Fine-tuning adapts a pre-trained language model to a specific domain or task by training it on curated data. This guide covers when fine-tuning makes sense, how to prepare data, which methods to use, and how to evaluate results.
Fine-tuning is not always the right choice. Consider this decision framework:
| Approach | When to Use | Cost | Effort |
|---|---|---|---|
| Prompt Engineering | Output format or tone adjustments | Low | Minutes to hours |
| RAG | Need access to external/current knowledge | Medium | Days |
| Fine-Tuning | Domain-specific language, consistent style, or instruction following | High | Days to weeks |
Fine-tune when the model needs to learn patterns that cannot be expressed through prompts alone: specialized terminology, consistent output formats, or domain-specific reasoning.
Rule of thumb: start with parameter-efficient methods. If LoRA/QLoRA fine-tuning does not improve results, full fine-tuning likely will not either.
Data quality matters far more than quantity. Structure each training example with `instruction`, `input`, and `output` fields. Example JSONL entry:

```json
{"instruction": "Summarize this medical report", "input": "Patient presented with...", "output": "Summary: The patient..."}
```
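A quick sanity check over such a file catches malformed rows before they silently degrade training. A minimal sketch using only the standard library (the field names follow the example above; `validate_jsonl` is an illustrative helper, not part of any training framework):

```python
import json

REQUIRED_FIELDS = {"instruction", "input", "output"}

def validate_jsonl(lines):
    """Parse JSONL lines, returning valid rows and (line_number, reason) errors."""
    valid, errors = [], []
    for i, line in enumerate(lines, start=1):
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            errors.append((i, "invalid JSON"))
            continue
        missing = REQUIRED_FIELDS - row.keys()
        if missing:
            errors.append((i, f"missing fields: {sorted(missing)}"))
        else:
            valid.append(row)
    return valid, errors

sample = [
    '{"instruction": "Summarize this medical report", "input": "Patient presented with...", "output": "Summary: The patient..."}',
    '{"instruction": "Summarize", "input": "..."}',  # missing "output"
]
valid, errors = validate_jsonl(sample)
print(len(valid), errors)  # 1 [(2, "missing fields: ['output']")]
```

Running a check like this before every training run is cheap insurance; a single truncated line can otherwise abort a multi-hour job.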
| Method | Description | VRAM Required | When to Use |
|---|---|---|---|
| Full Fine-Tuning | Updates all model weights | Very high (multi-GPU) | Drastically different domains only |
| LoRA | Freezes weights, adds trainable low-rank adapters | Moderate (single GPU) | Most use cases |
| QLoRA | LoRA with 4-bit quantized base model | Low (consumer GPU) | Large models on limited hardware |
| Spectrum | Selects informative layers via SNR analysis | Moderate | Distributed training |
LoRA (Low-Rank Adaptation) is the recommended starting point. It trains only a small number of additional parameters while keeping the base model frozen, drastically reducing compute requirements.
QLoRA extends LoRA by quantizing the base model to 4-bit precision; the original QLoRA work fine-tuned a 65B-parameter model on a single 48GB GPU, and it brings 7B-13B models within reach of a 24GB consumer card.
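The parameter savings are easy to quantify: next to a frozen weight matrix of shape d x k, LoRA trains two adapters A (d x r) and B (r x k), so only r * (d + k) parameters update per layer. A back-of-the-envelope sketch (the layer size is illustrative of a 7B-class attention projection, not taken from a specific model):

```python
def lora_trainable_params(d, k, r):
    """LoRA trains A (d x r) and B (r x k) beside a frozen d x k weight."""
    return r * (d + k)

d = k = 4096   # typical attention projection size in a 7B-class model
r = 16         # a common LoRA rank
full = d * k
lora = lora_trainable_params(d, k, r)
print(full, lora, f"{lora / full:.2%}")  # 16777216 131072 0.78%
```

At rank 16, the adapters for this layer are under 1% of the frozen weights, which is why LoRA fits on a single GPU where full fine-tuning does not.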
| Tool | Strengths | Best For |
|---|---|---|
| Hugging Face TRL + SFTTrainer | Industry standard, supports QLoRA, DeepSpeed, Flash Attention | Full control over training |
| Unsloth | 2x faster training, beginner-friendly notebooks | Quick experiments, consumer hardware |
| Axolotl | YAML-config training pipelines | Reproducible workflows |
| OpenAI Fine-Tuning API | Managed service, no hardware needed | GPT model customization |
A typical Hugging Face QLoRA setup (the model checkpoint is illustrative; `dataset` is your prepared training set):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer

# Load the base model in 4-bit precision (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
)

# Trainable low-rank adapters on the attention projections
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,  # your prepared instruction dataset
    peft_config=peft_config,
    packing=True,           # pack short examples into full-length sequences
)
trainer.train()
```
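The `packing=True` flag concatenates short examples into fixed-length sequences so no compute is wasted on padding. A dependency-free sketch of the idea (token IDs are invented; real packing operates on tokenizer output and the actual EOS id varies by model):

```python
EOS = 0  # assumed end-of-sequence token id

def pack_sequences(examples, block_size):
    """Concatenate tokenized examples, EOS-separated, into fixed-size blocks."""
    stream = []
    for ex in examples:
        stream.extend(ex)
        stream.append(EOS)
    # Drop the ragged tail so every block is exactly block_size long
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

examples = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
blocks = pack_sequences(examples, block_size=4)
print(blocks)  # [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
```

With instruction datasets full of short examples, packing can substantially raise effective throughput, since every position in every batch carries a real token.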
| Method | Model Size | Minimum GPU |
|---|---|---|
| QLoRA | 7B | RTX 4080 (16GB) |
| QLoRA | 70B | A6000 / A40 (48GB) |
| LoRA | 7B | RTX 4090 (24GB) |
| Full | 7B | 2-4x A100 (80GB) |
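These minimums follow from simple arithmetic: the model weights alone occupy parameters times bytes-per-parameter, before adapters, optimizer state, and activations are added. A rough estimator (weights only; real usage is higher):

```python
def base_weight_gb(n_params_billion, bits_per_param):
    """Approximate memory for model weights alone, in GB (1 GB ~ 1e9 bytes)."""
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

print(base_weight_gb(7, 4))    # 3.5  -> a 4-bit 7B model fits a 16GB card easily
print(base_weight_gb(70, 4))   # 35.0 -> 4-bit 70B weights alone exceed 24GB
print(base_weight_gb(7, 16))   # 14.0 -> 16-bit 7B leaves little headroom once
                               #         optimizer state and activations are added
```

This is why full fine-tuning of even a 7B model needs multi-GPU setups: 16-bit weights plus Adam optimizer state (roughly 2-3x the weights again) plus activations quickly outgrow a single card.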
Optimizations that reduce memory usage further include gradient checkpointing (recompute activations rather than storing them), gradient accumulation (simulate a large batch with several small micro-batches), mixed-precision training, and paged optimizers.
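One common memory optimization, gradient accumulation, simulates a large batch by averaging gradients over several micro-batches before each optimizer step; for equal-sized micro-batches the result matches a single large-batch step exactly. A toy scalar sketch, no framework required:

```python
def grad(w, batch):
    """Gradient of mean squared error 0.5*(w*x - y)^2 over a batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.0

# One full-batch gradient vs. the average of two micro-batch gradients
full = grad(w, data)
accumulated = (grad(w, data[:2]) + grad(w, data[2:])) / 2
print(full, accumulated)  # identical: -15.0 -15.0
```

Because only one micro-batch of activations is live at a time, peak memory drops while the optimization trajectory is unchanged, which is why trainers expose this as a simple `gradient_accumulation_steps`-style knob.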
Track training loss, validation loss, and task-level metrics (accuracy, ROUGE, or human preference, depending on the task) during and after training. Use early stopping and save checkpoints frequently. Test on held-out data that the model has never seen.
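The early-stopping logic itself is just patience over validation loss; trainers expose it as a callback, but the underlying rule fits in a few lines. A minimal sketch (the loss values are invented):

```python
def early_stop_epoch(val_losses, patience=2):
    """Return the 1-indexed epoch at which to stop, or None if never triggered.

    Stops once validation loss has failed to improve for `patience`
    consecutive evaluations.
    """
    best = float("inf")
    bad_evals = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, bad_evals = loss, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return epoch
    return None

losses = [1.9, 1.4, 1.2, 1.25, 1.3, 1.31]
print(early_stop_epoch(losses))  # 5: no improvement at epochs 4 and 5
```

Restoring the checkpoint from the best epoch (epoch 3 in this example) rather than the last one is what actually guards against overfitting.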