====== Parameter-Efficient Fine-Tuning (PEFT) and LoRA ======

Parameter-Efficient Fine-Tuning (PEFT) refers to a family of techniques that adapt large pre-trained models to new tasks by updating only 0.01–1% of model parameters while keeping the rest frozen. This dramatically reduces compute, memory, and storage requirements compared to full fine-tuning.((Ding et al. 2022, "Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models" [[https://arxiv.org/abs/2203.06904|arXiv:2203.06904]]))

===== Definition and Motivation =====

Full fine-tuning requires updating every parameter of a model — often billions of weights — for each downstream task. This is prohibitively expensive at scale. PEFT addresses this by identifying a small, task-specific parameter subspace to update.

**Why PEFT matters:**

  * **Cost reduction** — training a 7B model with LoRA requires roughly a tenth of the GPU memory of full fine-tuning
  * **Mitigates catastrophic forgetting** — the frozen backbone preserves general capabilities while adapters capture task-specific behavior
  * **Lower overfitting risk** — fewer trainable parameters means less risk of memorizing small datasets
  * **Multi-task scalability** — a single base model can host dozens of lightweight adapters, one per task, swapped at inference time

**Full Fine-Tuning vs. PEFT:**

^ Dimension ^ Full Fine-Tuning ^ PEFT (e.g. LoRA) ^
| Trainable params | 100% | 0.01–1% |
| GPU memory (7B model) | ~80 GB (bf16) | ~8–16 GB |
| Training time | Days–weeks | Hours–days |
| Catastrophic forgetting | High risk | Low risk |
| Per-task storage | Full model copy | Small adapter (~10–100 MB) |
| Multi-task serving | One model per task | One base + N adapters |

===== LoRA: Low-Rank Adaptation =====

LoRA((Hu et al. 2021, "LoRA: Low-Rank Adaptation of Large Language Models" [[https://arxiv.org/abs/2106.09685|arXiv:2106.09685]])) is the most widely adopted PEFT method.
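The trainable-parameter percentages in the table above can be reproduced with a quick back-of-the-envelope calculation. A minimal sketch, assuming hypothetical Llama-7B-like shapes (hidden size 4096, 32 transformer layers, rank-8 adapters on the query and value projections):

```python
# Back-of-the-envelope LoRA parameter count (hypothetical 7B-class shapes).
d = 4096                       # hidden size
r = 8                          # LoRA rank
layers = 32                    # transformer layers
modules_per_layer = 2          # adapters on q_proj and v_proj only
total_params = 6_742_609_920   # assumed base-model parameter count

# Each adapted d x d matrix gets B (d x r) and A (r x d): 2 * d * r params.
lora_params = layers * modules_per_layer * 2 * d * r
print(lora_params)                                  # -> 4194304
print(f"{100 * lora_params / total_params:.4f}%")   # -> 0.0622%
```

The result (~0.06% trainable) sits at the low end of the 0.01–1% range quoted for PEFT methods.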
Instead of updating a weight matrix **W_0** directly, it learns a low-rank decomposition of the update:

  delta_W = B A

where:
  * **W_0** is the original frozen weight (d × k)
  * **B** is a trainable matrix (d × r), initialized to zeros
  * **A** is a trainable matrix (r × k)
  * **r** ≪ min(d, k) is the rank

During the forward pass: //h = W_0 x + B A x//

==== Alpha Scaling ====

LoRA introduces a scaling hyperparameter //alpha// applied to the adapter output:

  output = W_0 x + (alpha / r) B A x

Typically //alpha// is set equal to //r// (so the scale factor is 1), or to //2r// for slightly stronger adaptation. Dividing by //r// decouples learning-rate sensitivity from the choice of rank.

==== Rank Selection ====

  * **r = 4–8**: most common default; sufficient for instruction following and style adaptation
  * **r = 16–64**: for domain adaptation or tasks requiring a significant behavioral shift
  * **r = 128+**: rarely needed; approaches full fine-tuning parameter counts

The rule of thumb is to start with **r = 8** and increase only if validation loss stagnates.

==== Target Layers ====

LoRA is typically applied to the attention projection matrices:

  * **W_q** (query) and **W_v** (value) — the minimum effective set
  * **W_k** (key) and **W_o** (output projection) — a common addition
  * MLP layers (up/down/gate projections) — for stronger adaptation

Applying LoRA to all linear layers generally yields the best results at modest rank.

==== Zero Inference Latency ====

After training, LoRA adapters can be **merged** back into the base weights:

  W_merged = W_0 + (alpha / r) B A

The merged model is architecturally identical to the original — no extra compute at inference.

===== QLoRA and Variants =====

==== QLoRA ====

QLoRA((Dettmers et al. 2023, "QLoRA: Efficient Finetuning of Quantized LLMs" [[https://arxiv.org/abs/2305.14314|arXiv:2305.14314]])) combines LoRA with 4-bit NF4 (NormalFloat4) quantization of the frozen base model weights, plus double quantization of the quantization constants.
This enables fine-tuning a **65B parameter model on a single 48 GB GPU** — previously out of reach on a single device. Key innovations:

  * **NF4 quantization**: bins weights optimally under the assumption that they follow a normal distribution
  * **Double quantization**: quantizes the quantization constants themselves, saving ~0.37 bits/param
  * **Paged optimizers**: uses NVIDIA unified memory to absorb optimizer-state memory spikes

==== DoRA (Weight-Decomposed Low-Rank Adaptation) ====

DoRA((Liu et al. 2024 [[https://arxiv.org/abs/2402.09353|arXiv:2402.09353]])) decomposes weight updates into **magnitude** and **direction** components, applying LoRA only to the directional component. It typically outperforms LoRA at the same rank, especially on reasoning tasks.

==== AdaLoRA ====

AdaLoRA((Zhang et al. 2023 [[https://arxiv.org/abs/2303.10512|arXiv:2303.10512]])) dynamically allocates the rank budget across weight matrices based on an SVD-based importance score. Less important layers are pruned to rank 0; critical layers receive higher rank. This achieves better performance per trainable parameter.

==== rsLoRA ====

rsLoRA((Kalajdzievski 2023)) changes the scaling from //alpha/r// to //alpha/sqrt(r)//, which stabilizes training at higher ranks and allows effective use of r = 128+ without learning-rate collapse.

==== VeRA (Vector-based Random Matrix Adaptation) ====

VeRA((Kopiczko et al. 2024 [[https://arxiv.org/abs/2310.11454|arXiv:2310.11454]])) shares frozen random matrices **B** and **A** across all layers and trains only small per-layer scaling vectors, reducing trainable parameters by another ~10x vs. LoRA.

==== GaLore (Gradient Low-Rank Projection) ====

GaLore((Zhao et al. 2024 [[https://arxiv.org/abs/2403.03507|arXiv:2403.03507]])) projects gradients into a low-rank subspace during full fine-tuning rather than restricting the weight updates themselves, bringing memory usage close to PEFT levels while still updating all weights and leaving the model architecture unchanged.

==== LoRA+ ====

LoRA+((Hayou et al.
2024 [[https://arxiv.org/abs/2402.12354|arXiv:2402.12354]])) sets different learning rates for the **A** and **B** matrices (typically lr_B = 16x lr_A), which better matches the optimal signal-propagation regime and often improves convergence speed.

===== PEFT Methods Comparison =====

^ Method ^ Trainable Params ^ Performance ^ Inference Overhead ^ Notes ^
| LoRA | 0.1–1% | High | None (mergeable) | Best general choice |
| QLoRA | 0.1–1% | High | None (mergeable) | LoRA + 4-bit base; max memory savings |
| DoRA | 0.1–1% | Higher than LoRA | None (mergeable) | Better on reasoning; slightly slower training |
| AdaLoRA | 0.1–1% | High | None (mergeable) | Dynamic rank; best param efficiency |
| Prefix Tuning | 0.1% | Medium | Low (+prefix tokens) | Prepends learned tokens; no merging |
| Prompt Tuning | < 0.01% | Low–Medium | Low (+prompt tokens) | Minimal params; underperforms at small scale |
| Adapters | 0.5–3% | High | Low (serial layers) | Proven; inference latency from extra layers |
| IA3 | < 0.01% | Medium | None (mergeable) | Learns rescaling vectors; very few params |
| VeRA | < 0.01% | Medium–High | None (mergeable) | Minimal storage; shared random matrices |

===== Tools and Libraries =====

==== Hugging Face PEFT ====

The [[https://github.com/huggingface/peft|Hugging Face PEFT library]] is the de facto standard. It supports LoRA, QLoRA, DoRA, AdaLoRA, Prefix Tuning, Prompt Tuning, IA3, and more across transformers models.

<code python>
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622
</code>

==== Unsloth ====

[[https://github.com/unslothai/unsloth|Unsloth]] provides hand-optimized LoRA/QLoRA kernels achieving **2x faster training** with **60% less memory** vs.
standard PEFT, via hand-written Triton kernels. Drop-in compatible with Hugging Face.

==== NVIDIA NeMo ====

[[https://github.com/NVIDIA/NeMo|NVIDIA NeMo]] provides enterprise-grade PEFT pipelines with multi-GPU support, mixed precision, and deployment to Triton Inference Server.

==== BitsAndBytes ====

[[https://github.com/TimDettmers/bitsandbytes|BitsAndBytes]] provides the 4-bit and 8-bit quantization kernels underlying QLoRA. It is a required dependency for QLoRA workflows.

==== TRL (Transformer Reinforcement Learning) ====

[[https://github.com/huggingface/trl|TRL]] by Hugging Face integrates PEFT with SFT (SFTTrainer), RLHF, DPO, and PPO training loops. It is the standard toolkit for alignment fine-tuning.

==== Axolotl ====

[[https://github.com/OpenAccess-AI-Collective/axolotl|Axolotl]] is a config-driven fine-tuning framework supporting LoRA/QLoRA across Llama, Mistral, Mixtral, Falcon, and others, with an emphasis on reproducibility via YAML configs.

==== LLaMA-Factory ====

[[https://github.com/hiyouga/LLaMA-Factory|LLaMA-Factory]] provides a unified training interface with a WebUI for LoRA/QLoRA fine-tuning across 100+ models, with built-in dataset preprocessing and evaluation.

===== Agent Fine-Tuning Use Cases =====

PEFT is particularly well suited to fine-tuning foundation models into specialized agents:

==== Tool Calling ====

Fine-tuning on tool-use datasets (function schemas + examples) with LoRA teaches models to produce syntactically correct JSON tool invocations. Small LoRA adapters (r = 8) suffice for models already familiar with function calling; a higher rank helps for models trained without it.

==== Structured Output ====

LoRA fine-tuning on domain-specific schemas trains reliable structured JSON/XML output — critical for agents that must interface with APIs and databases without post-hoc parsing workarounds.
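A tool-use training record of the kind described above might look like the following. This is an illustrative sketch: the field names and schema are hypothetical, not any particular dataset's format.

```python
import json

# Hypothetical tool-use training example: the model is taught to emit a
# syntactically valid JSON invocation for a declared function schema.
example = {
    "tools": [{
        "name": "get_weather",
        "parameters": {"city": {"type": "string"}},
    }],
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"},
        # Target completion: the well-formed tool call the adapter learns to produce.
        {"role": "assistant",
         "content": json.dumps({"tool": "get_weather", "arguments": {"city": "Paris"}})},
    ],
}

# Minimal validity check of the kind used when filtering such datasets:
call = json.loads(example["messages"][-1]["content"])
assert call["tool"] == "get_weather"
print(call["arguments"]["city"])  # -> Paris
```

Filtering training data with checks like the `json.loads` round trip above is a cheap way to guarantee every target completion in the dataset is parseable.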
==== Domain-Specific Agents ====

QLoRA enables adapting 13B–70B models to specialized domains (legal, medical, code review) on consumer hardware, producing agents that can rival much larger general-purpose models on narrow benchmarks at a fraction of the inference cost.

==== Model Distillation ====

LoRA adapters can encode knowledge distilled from a larger teacher model: the student is fine-tuned via LoRA on the teacher's outputs (often combined with DPO or KTO), compressing capability into a smaller deployable model.

==== Multi-LoRA Serving ====

Frameworks like [[https://github.com/vllm-project/vllm|vLLM]] and [[https://github.com/S-LoRA/S-LoRA|S-LoRA]] support serving **hundreds of LoRA adapters** on a single GPU cluster using a shared base model. Each user, task, or tenant gets its own adapter, enabling personalization at scale with near-zero marginal cost per adapter.

==== LoRA-as-Tools Pattern ====

The LoRA-as-Tools pattern(("LoRA-as-Tools" [[https://arxiv.org/abs/2510.15416|arXiv:2510.15416]])) treats individual LoRA adapters as callable **tools** within an agent architecture. A router model selects which LoRA adapter to activate per inference step, enabling compositional specialization: one adapter for code generation, one for retrieval formatting, one for safety filtering — dynamically composed at runtime without multi-model overhead.

===== See Also =====

  * [[fine_tuning_agents]]
  * [[fireact_agent_finetuning]]
  * [[agent_distillation]]
  * [[direct_preference_optimization]]