====== How to Fine-Tune an LLM ======

Fine-tuning adapts a pre-trained language model to a specific domain or task by training it on curated data. This guide covers when fine-tuning makes sense, how to prepare data, which methods to use, and how to evaluate results.

===== When to Fine-Tune =====

Fine-tuning is not always the right choice. Consider this decision framework:

^ Approach ^ When to Use ^ Cost ^ Effort ^
| Prompt Engineering | Output format or tone adjustments | Low | Minutes to hours |
| RAG | Need access to external/current knowledge | Medium | Days |
| Fine-Tuning | Domain-specific language, consistent style, or instruction following | High | Days to weeks |

Fine-tune when the model needs to learn patterns that cannot be expressed through prompts alone -- specialized terminology, consistent output formats, or domain-specific reasoning. ((Source: [[https://www.heavybit.com/library/article/llm-fine-tuning|Heavybit LLM Fine-Tuning Guide]]))

**Rule of thumb:** If LoRA/QLoRA fine-tuning does not improve results, full fine-tuning likely will not either. Start with parameter-efficient methods first. ((Source: [[https://unsloth.ai/docs/get-started/fine-tuning-llms-guide|Unsloth Fine-Tuning Guide]]))

===== Data Preparation =====

Data quality matters far more than quantity.
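One concrete way to put quality over quantity into practice is an aggressive cleaning pass before training. A minimal sketch -- the ''load_and_clean'' helper and the instruction/input/output field names are illustrative assumptions, not a library API:

<code python>
import json

REQUIRED_FIELDS = {"instruction", "input", "output"}  # assumed schema


def load_and_clean(lines):
    """Parse JSONL lines, dropping malformed rows, rows missing
    required fields, and duplicate (instruction, input) pairs."""
    seen = set()
    cleaned = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            example = json.loads(line)
        except json.JSONDecodeError:
            continue  # drop malformed rows
        if not REQUIRED_FIELDS.issubset(example):
            continue  # drop rows missing required fields
        key = (example["instruction"], example["input"])
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        cleaned.append(example)
    return cleaned
</code>

A pass like this is cheap to run on every dataset revision; manual review of a random sample should still follow it.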
Key principles:

  * **1,000 curated examples outperform 50,000 scraped ones** -- focus on high-confidence, diverse samples
  * **Format as instruction-response pairs** -- the standard format for supervised fine-tuning (SFT)
  * **Use JSONL format** with fields like ''instruction'', ''input'', and ''output''
  * **Clean aggressively** -- remove duplicates, fix formatting, validate accuracy
  * **Consider synthetic data** -- generate QA pairs from documents using a stronger model

Example JSONL entry:

<code json>
{"instruction": "Summarize this medical report", "input": "Patient presented with...", "output": "Summary: The patient..."}
</code>

((Source: [[https://unsloth.ai/docs/get-started/fine-tuning-llms-guide|Unsloth Fine-Tuning Guide]]))

===== Fine-Tuning Methods =====

^ Method ^ Description ^ VRAM Required ^ When to Use ^
| Full Fine-Tuning | Updates all model weights | Very high (multi-GPU) | Drastically different domains only |
| LoRA | Freezes weights, adds trainable low-rank adapters | Moderate (single GPU) | Most use cases |
| QLoRA | LoRA with 4-bit quantized base model | Low (consumer GPU) | Large models on limited hardware |
| Spectrum | Selects informative layers via SNR analysis | Moderate | Distributed training |

**LoRA** (Low-Rank Adaptation) is the recommended starting point. It trains only a small number of additional parameters while keeping the base model frozen, drastically reducing compute requirements. ((Source: [[https://aisera.com/blog/fine-tuning-llms/|Aisera Fine-Tuning LLMs]]))

**QLoRA** extends LoRA by quantizing the base model to 4-bit precision, enabling fine-tuning of models up to roughly 70B parameters on a single 48GB GPU -- the 4-bit weights of a 70B model alone occupy about 35GB.
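The memory savings from quantization follow from simple arithmetic: memory for weights is parameter count times bytes per parameter. A sketch (weights only -- activations, gradients, and optimizer state add more on top):

<code python>
def weight_memory_gb(n_params_billion: float, bits: int) -> float:
    """Approximate GPU memory (GB) for model weights alone at a
    given precision. Excludes activations and optimizer state."""
    bytes_per_param = bits / 8
    return n_params_billion * bytes_per_param


fp16_7b = weight_memory_gb(7, 16)   # 14.0 GB at fp16
nf4_7b = weight_memory_gb(7, 4)     # 3.5 GB at 4-bit
nf4_70b = weight_memory_gb(70, 4)   # 35.0 GB at 4-bit
</code>

This is why a quantized 7B model fine-tunes comfortably on a 16GB card, while a 70B model needs a 48GB-class GPU even at 4-bit.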
((Source: [[https://www.philschmid.de/fine-tune-llms-in-2025|Phil Schmid - Fine-Tune LLMs in 2025]]))

===== Tools and Platforms =====

^ Tool ^ Strengths ^ Best For ^
| Hugging Face TRL + SFTTrainer | Industry standard, supports QLoRA, DeepSpeed, Flash Attention | Full control over training |
| Unsloth | 2x faster training, beginner-friendly notebooks | Quick experiments, consumer hardware |
| Axolotl | YAML-config training pipelines | Reproducible workflows |
| OpenAI Fine-Tuning API | Managed service, no hardware needed | GPT model customization |

A typical Hugging Face QLoRA setup:

<code python>
from trl import SFTTrainer
from peft import LoraConfig

# LoRA adapter configuration: rank-16 adapters on the attention
# query/value projections
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)

# model and dataset are assumed to be loaded beforehand
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    packing=True,  # pack short examples into full-length sequences
)
trainer.train()
</code>

((Source: [[https://www.philschmid.de/fine-tune-llms-in-2025|Phil Schmid - Fine-Tune LLMs in 2025]]))

===== Hardware Requirements =====

^ Method ^ Model Size ^ Minimum GPU ^
| QLoRA | 7B | RTX 4080 (16GB) |
| QLoRA | 70B | RTX A6000 (48GB) |
| LoRA | 7B | RTX 4090 (24GB) |
| Full | 7B | 2-4x A100 (80GB) |

Optimizations that reduce memory usage:

  * **Flash Attention** -- faster attention computation with lower memory overhead
  * **Gradient checkpointing** -- trades compute for memory
  * **DeepSpeed ZeRO** -- distributes optimizer state across GPUs
  * **Liger Kernels** -- fused CUDA kernels for training efficiency

((Source: [[https://www.philschmid.de/fine-tune-llms-in-2025|Phil Schmid - Fine-Tune LLMs in 2025]]))

===== Evaluation =====

Track these metrics during and after training:

  * **Training and validation loss** -- watch for divergence indicating overfitting
  * **Task-specific benchmarks** -- GSM8K for math, MMLU for general knowledge
  * **Perplexity** -- lower is better for generation quality
  * **Human evaluation** -- blind comparison against the base model

Use early stopping and save checkpoints frequently.
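Perplexity is simply the exponential of the average per-token cross-entropy loss, so it can be tracked straight off the validation loss curve. A minimal sketch:

<code python>
import math


def perplexity(token_losses):
    """Perplexity = exp(mean cross-entropy loss per token, in nats).
    Lower values mean the model assigns higher probability to the
    held-out text."""
    return math.exp(sum(token_losses) / len(token_losses))


# A validation loss hovering around 2.0 nats/token corresponds to
# a perplexity of roughly 7.4
</code>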
Test on held-out data that the model has never seen. ((Source: [[https://www.heavybit.com/library/article/llm-fine-tuning|Heavybit LLM Fine-Tuning Guide]]))

===== Common Pitfalls =====

  * **Skipping prompt engineering** -- always try prompts and RAG before fine-tuning
  * **Poor data quality** -- garbage in, garbage out applies strongly here
  * **Overfitting** -- large gap between training and validation loss
  * **Wrong method** -- jumping to full fine-tuning when QLoRA would suffice
  * **Ignoring evaluation** -- fine-tuned models can degrade on general tasks (catastrophic forgetting)

===== See Also =====

  * [[how_to_self_host_an_llm|How to Self-Host an LLM]]
  * [[how_to_use_ollama|How to Use Ollama]]
  * [[how_to_implement_guardrails|How to Implement Guardrails]]

===== References =====