====== How to Fine-Tune an LLM ======

Fine-tuning adapts a pre-trained language model to a specific domain or task by training it on curated data. This guide covers when fine-tuning makes sense, how to prepare data, which methods to use, and how to evaluate results.

===== When to Fine-Tune =====

Fine-tuning is not always the right choice. Consider this decision framework:

^ Approach ^ When to Use ^ Cost ^ Effort ^
| Prompt Engineering | Output format or tone adjustments | Low | Minutes to hours |
| RAG | Need access to external/current knowledge | Medium | Days |
| Fine-Tuning | Domain-specific language, consistent style, or instruction following | High | Days to weeks |

Fine-tune when the model needs to learn patterns that cannot be expressed through prompts alone -- specialized terminology, consistent output formats, or domain-specific reasoning. ((Source: [[https://www.heavybit.com/library/article/llm-fine-tuning|Heavybit LLM Fine-Tuning Guide]]))

**Rule of thumb:** If LoRA/QLoRA fine-tuning does not improve results, full fine-tuning likely will not either. Start with parameter-efficient methods first. ((Source: [[https://unsloth.ai/docs/get-started/fine-tuning-llms-guide|Unsloth Fine-Tuning Guide]]))

===== Data Preparation =====

Data quality matters far more than quantity.
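One concrete way to put quality over quantity into practice is an aggressive cleaning pass before training. A minimal sketch -- the ''load_and_clean'' helper and the instruction/input/output field names are illustrative assumptions, not a library API:

<code python>
import json

REQUIRED_FIELDS = {"instruction", "input", "output"}  # assumed schema


def load_and_clean(lines):
    """Parse JSONL lines, dropping malformed rows, rows missing
    required fields, and duplicate (instruction, input) pairs."""
    seen = set()
    cleaned = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            example = json.loads(line)
        except json.JSONDecodeError:
            continue  # drop malformed rows
        if not REQUIRED_FIELDS.issubset(example):
            continue  # drop rows missing required fields
        key = (example["instruction"], example["input"])
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        cleaned.append(example)
    return cleaned
</code>

A pass like this is cheap to run on every dataset revision; manual review of a random sample should still follow it.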
Key principles:

  * **1,000 curated examples outperform 50,000 scraped ones** -- focus on high-confidence, diverse samples
  * **Format as instruction-response pairs** -- the standard format for supervised fine-tuning (SFT)
  * **Use JSONL format** with fields like ''instruction'', ''input'', and ''output''
  * **Clean aggressively** -- remove duplicates, fix formatting, validate accuracy
  * **Consider synthetic data** -- generate QA pairs from documents using a stronger model

Example JSONL entry:

<code json>
{"instruction": "Summarize this medical report", "input": "Patient presented with...", "output": "Summary: The patient..."}
</code>

((Source: [[https://unsloth.ai/docs/get-started/fine-tuning-llms-guide|Unsloth Fine-Tuning Guide]]))

===== Fine-Tuning Methods =====

^ Method ^ Description ^ VRAM Required ^ When to Use ^
| Full Fine-Tuning | Updates all model weights | Very high (multi-GPU) | Drastically different domains only |
| LoRA | Freezes weights, adds trainable low-rank adapters | Moderate (single GPU) | Most use cases |
| QLoRA | LoRA with 4-bit quantized base model | Low (consumer GPU) | Large models on limited hardware |
| Spectrum | Selects informative layers via SNR analysis | Moderate | Distributed training |

**LoRA** (Low-Rank Adaptation) is the recommended starting point. It trains only a small number of additional parameters while keeping the base model frozen, drastically reducing compute requirements. ((Source: [[https://aisera.com/blog/fine-tuning-llms/|Aisera Fine-Tuning LLMs]]))

**QLoRA** extends LoRA by quantizing the base model to 4-bit precision, enabling fine-tuning of models up to roughly 70B parameters on a single 48GB GPU -- the 4-bit weights of a 70B model alone occupy about 35GB.
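The memory savings from quantization follow from simple arithmetic: memory for weights is parameter count times bytes per parameter. A sketch (weights only -- activations, gradients, and optimizer state add more on top):

<code python>
def weight_memory_gb(n_params_billion: float, bits: int) -> float:
    """Approximate GPU memory (GB) for model weights alone at a
    given precision. Excludes activations and optimizer state."""
    bytes_per_param = bits / 8
    return n_params_billion * bytes_per_param


fp16_7b = weight_memory_gb(7, 16)   # 14.0 GB at fp16
nf4_7b = weight_memory_gb(7, 4)     # 3.5 GB at 4-bit
nf4_70b = weight_memory_gb(70, 4)   # 35.0 GB at 4-bit
</code>

This is why a quantized 7B model fine-tunes comfortably on a 16GB card, while a 70B model needs a 48GB-class GPU even at 4-bit.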
((Source: [[https://www.philschmid.de/fine-tune-llms-in-2025|Phil Schmid - Fine-Tune LLMs in 2025]]))

===== Tools and Platforms =====

^ Tool ^ Strengths ^ Best For ^
| Hugging Face TRL + SFTTrainer | Industry standard, supports QLoRA, DeepSpeed, Flash Attention | Full control over training |
| Unsloth | 2x faster training, beginner-friendly notebooks | Quick experiments, consumer hardware |
| Axolotl | YAML-config training pipelines | Reproducible workflows |
| OpenAI Fine-Tuning API | Managed service, no hardware needed | GPT model customization |

A typical Hugging Face QLoRA setup:

<code python>
from trl import SFTTrainer
from peft import LoraConfig

# LoRA adapter configuration: rank-16 adapters on the attention
# query/value projections
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)

# model and dataset are assumed to be loaded beforehand
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    packing=True,  # pack short examples into full-length sequences
)
trainer.train()
</code>

((Source: [[https://www.philschmid.de/fine-tune-llms-in-2025|Phil Schmid - Fine-Tune LLMs in 2025]]))

===== Hardware Requirements =====

^ Method ^ Model Size ^ Minimum GPU ^
| QLoRA | 7B | RTX 4080 (16GB) |
| QLoRA | 70B | RTX A6000 (48GB) |
| LoRA | 7B | RTX 4090 (24GB) |
| Full | 7B | 2-4x A100 (80GB) |

Optimizations that reduce memory usage:

  * **Flash Attention** -- faster attention computation with lower memory overhead
  * **Gradient checkpointing** -- trades compute for memory
  * **DeepSpeed ZeRO** -- distributes optimizer state across GPUs
  * **Liger Kernels** -- fused CUDA kernels for training efficiency

((Source: [[https://www.philschmid.de/fine-tune-llms-in-2025|Phil Schmid - Fine-Tune LLMs in 2025]]))

===== Evaluation =====

Track these metrics during and after training:

  * **Training and validation loss** -- watch for divergence indicating overfitting
  * **Task-specific benchmarks** -- GSM8K for math, MMLU for general knowledge
  * **Perplexity** -- lower is better for generation quality
  * **Human evaluation** -- blind comparison against the base model

Use early stopping and save checkpoints frequently.
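Perplexity is simply the exponential of the average per-token cross-entropy loss, so it can be tracked straight off the validation loss curve. A minimal sketch:

<code python>
import math


def perplexity(token_losses):
    """Perplexity = exp(mean cross-entropy loss per token, in nats).
    Lower values mean the model assigns higher probability to the
    held-out text."""
    return math.exp(sum(token_losses) / len(token_losses))


# A validation loss hovering around 2.0 nats/token corresponds to
# a perplexity of roughly 7.4
</code>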
Test on held-out data that the model has never seen. ((Source: [[https://www.heavybit.com/library/article/llm-fine-tuning|Heavybit LLM Fine-Tuning Guide]]))

===== Common Pitfalls =====

  * **Skipping prompt engineering** -- always try prompts and RAG before fine-tuning
  * **Poor data quality** -- garbage in, garbage out applies strongly here
  * **Overfitting** -- large gap between training and validation loss
  * **Wrong method** -- jumping to full fine-tuning when QLoRA would suffice
  * **Ignoring evaluation** -- fine-tuned models can degrade on general tasks (catastrophic forgetting)

===== See Also =====

  * [[how_to_self_host_an_llm|How to Self-Host an LLM]]
  * [[how_to_use_ollama|How to Use Ollama]]
  * [[how_to_implement_guardrails|How to Implement Guardrails]]

===== References =====