====== Parameter-Efficient Fine-Tuning (PEFT) and LoRA ======
Parameter-Efficient Fine-Tuning (PEFT) refers to a family of techniques that adapt large pre-trained models to new tasks by updating only 0.01–1% of model parameters while keeping the rest frozen. This dramatically reduces compute, memory, and storage requirements compared to full fine-tuning.((Ding et al., "Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models" [[https://arxiv.org/abs/2203.06904|arXiv:2203.06904]]))
===== Definition and Motivation =====
Full fine-tuning requires updating every parameter of a model — often billions of weights — for each downstream task. This is prohibitively expensive at scale. PEFT addresses this by identifying a small, task-specific parameter subspace to update.
**Why PEFT matters:**
* **Cost reduction** — Training a 7B model with LoRA requires ~1/10th the GPU memory of full fine-tuning
* **Mitigates catastrophic forgetting** — Frozen backbone preserves general capabilities while adapters capture task-specific behavior
* **Lower overfitting risk** — Fewer trainable parameters means less risk of memorizing small datasets
* **Multi-task scalability** — A single base model can host dozens of lightweight adapters, one per task, swapped at inference time
**Full Fine-Tuning vs. PEFT:**
^ Dimension ^ Full Fine-Tuning ^ PEFT (e.g. LoRA) ^
| Trainable params | 100% | 0.01–1% |
| GPU memory (7B model) | ~80 GB (bf16) | ~8–16 GB |
| Training time | Days–weeks | Hours–days |
| Catastrophic forgetting | High risk | Low risk |
| Per-task storage | Full model copy | Small adapter (~10–100 MB) |
| Multi-task serving | One model per task | One base + N adapters |
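The memory figures in the table follow from simple arithmetic. A rough sketch, assuming Adam with fp32 optimizer state and ignoring activation memory:

<code python>
# Back-of-envelope memory arithmetic for a 7B model (assumptions: Adam
# optimizer with fp32 state, activation memory not counted).
params = 7e9          # 7B parameters
bf16 = 2              # bytes per bf16 value

full_ft_bytes = (
    params * bf16     # weights
    + params * bf16   # gradients
    + params * 8      # Adam fp32 first + second moments
)
print(round(full_ft_bytes / 1e9))   # -> 84 (GB), before activations

trainable = 0.001 * params          # ~0.1% trainable with LoRA
lora_bytes = params * bf16 + trainable * (bf16 + 8)   # frozen weights + adapter state
print(round(lora_bytes / 1e9))      # -> 14 (GB): adapter optimizer state is negligible
</code>

Quantizing the frozen base (as in QLoRA) shrinks the weight term further, which is where the ~8–16 GB figure comes from.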
===== LoRA: Low-Rank Adaptation =====
LoRA((Hu et al. 2021, "LoRA: Low-Rank Adaptation of Large Language Models" [[https://arxiv.org/abs/2106.09685|arXiv:2106.09685]])) is the most widely adopted PEFT method. Instead of updating a weight matrix **W** directly, it injects a low-rank decomposition:
  delta_W = B A
where:
* **W_0** is the original frozen weight matrix (d x k)
* **B** is a trainable matrix (d x r), initialized to zero
* **A** is a trainable matrix (r x k), initialized with small random values
* **r** << min(d, k) is the rank
Because **B** starts at zero, //B A// is zero at initialization, so the adapter initially leaves the model's behavior unchanged. During the forward pass: //h = W_0 x + B A x//
==== Alpha Scaling ====
LoRA introduces a scaling hyperparameter //alpha// applied to the adapter output:
  h = W_0 x + (alpha / r) B A x
Typically //alpha// is set equal to //r// (so the scale factor is 1), or to //2r// for slightly stronger adaptation. This decouples learning rate sensitivity from rank choice.
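A minimal numpy sketch of this scaled forward pass (dimensions and initializations are illustrative; because **B** starts at zero, the adapter is a no-op at initialization):

<code python>
import numpy as np

d, k, r, alpha = 512, 512, 8, 16
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d, k))          # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01    # trainable, small random init
B = np.zeros((d, r))                  # trainable, zero init: delta_W starts at 0
x = rng.normal(size=(k,))

h = W0 @ x + (alpha / r) * (B @ (A @ x))
assert np.allclose(h, W0 @ x)         # adapter output is zero at init

print(W0.size, A.size + B.size)       # 262144 frozen vs 8192 trainable
</code>

Note that the trainable fraction shrinks as //d// and //k// grow, which is why LoRA is so cheap on billion-parameter models.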
==== Rank Selection ====
* **r = 4–8**: Most common default; sufficient for instruction following and style adaptation
* **r = 16–64**: For domain adaptation or tasks requiring significant behavioral shift
* **r = 128+**: Rarely needed; approaches full fine-tuning parameter counts
The rule of thumb is to start with **r = 8** and increase only if validation loss stagnates.
==== Target Layers ====
LoRA is typically applied to the attention projection matrices:
* **W_q** (query) and **W_v** (value) — minimum effective set
* **W_k** (key) and **W_o** (output projection) — common addition
* MLP layers (up/down/gate projections) — for stronger adaptation
Applying LoRA to all linear layers generally yields the best results at modest rank.
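The three tiers above correspond to module-name lists like the following sketch (names follow the Llama-family convention as an assumption; they vary across architectures, so check ''model.named_modules()''):

<code python>
# Illustrative target-module sets for a Llama-family model.
minimal    = ["q_proj", "v_proj"]                       # query + value only
attention  = ["q_proj", "k_proj", "v_proj", "o_proj"]   # all attention projections
all_linear = attention + ["gate_proj", "up_proj", "down_proj"]  # + MLP layers
</code>

Any of these lists can be passed as ''target_modules'' when configuring the adapter.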
==== Zero Inference Latency ====
After training, LoRA adapters can be **merged** back into the base weights:
  W_merged = W_0 + (alpha / r) B A
The merged model is identical in architecture to the original — no extra compute at inference.
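A toy numpy sketch of merging and unmerging (adapter values are random stand-ins for trained weights):

<code python>
import numpy as np

rng = np.random.default_rng(1)
d, k, r, alpha = 64, 64, 4, 8
W0 = rng.normal(size=(d, k))   # frozen base weight
B = rng.normal(size=(d, r))    # "trained" adapter factors
A = rng.normal(size=(r, k))

scale = alpha / r
W_merged = W0 + scale * (B @ A)        # one dense matrix: zero extra latency

x = rng.normal(size=(k,))
assert np.allclose(W_merged @ x, W0 @ x + scale * (B @ (A @ x)))

# Subtracting the update restores the base exactly, enabling adapter swapping.
assert np.allclose(W_merged - scale * (B @ A), W0)
</code>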
===== QLoRA and Variants =====
==== QLoRA ====
QLoRA((Dettmers et al. 2023, "QLoRA: Efficient Finetuning of Quantized LLMs" [[https://arxiv.org/abs/2305.14314|arXiv:2305.14314]])) combines LoRA with 4-bit NF4 (NormalFloat4) quantization of the frozen base model weights, plus double quantization of the quantization constants. This enables fine-tuning a **65B parameter model on a single 48 GB GPU** — previously impossible.
Key innovations:
* **NF4 quantization**: Optimally bins weights assuming a normal distribution
* **Double quantization**: Quantizes the quantization constants themselves, saving ~0.37 bits/param
* **Paged optimizers**: Uses NVIDIA unified memory to handle optimizer state spikes
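In the Hugging Face stack these pieces map onto a ''BitsAndBytesConfig''; a hedged sketch (parameter names per the transformers quantization API, model id purely illustrative):

<code python>
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize frozen base weights to 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 bins
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls dequantize to bf16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
</code>

LoRA adapters are then attached to this quantized model exactly as in the non-quantized case.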
==== DoRA (Weight-Decomposed Low-Rank Adaptation) ====
DoRA((Liu et al. 2024 [[https://arxiv.org/abs/2402.09353|arXiv:2402.09353]])) decomposes weight updates into **magnitude** and **direction** components, applying LoRA only to the directional component. This typically outperforms LoRA at the same rank, especially on reasoning tasks.
==== AdaLoRA ====
AdaLoRA((Zhang et al. 2023 [[https://arxiv.org/abs/2303.10512|arXiv:2303.10512]])) dynamically allocates rank budget across weight matrices based on an SVD-based importance score. Less important layers receive rank 0 (pruned); critical layers receive higher rank. Achieves better performance per trainable parameter.
==== rsLoRA ====
rsLoRA((Kalajdzievski 2023, "A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA" [[https://arxiv.org/abs/2312.03732|arXiv:2312.03732]])) changes the scaling from //alpha/r// to //alpha/sqrt(r)//, which stabilizes training at higher ranks and makes r = 128+ effective, where the standard //alpha/r// factor would suppress the adapter's contribution.
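The effect of the two scaling schemes is easy to see numerically (alpha fixed at 16 for illustration):

<code python>
import math

alpha = 16
for r in (8, 64, 256):
    lora_scale = alpha / r               # shrinks linearly with rank
    rslora_scale = alpha / math.sqrt(r)  # shrinks only as 1/sqrt(r)
    print(r, lora_scale, rslora_scale)
# At r=256 the LoRA factor is 0.0625, effectively muting the adapter,
# while the rsLoRA factor stays at 1.0.
</code>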
==== VeRA (Vector-based Random Matrix Adaptation) ====
VeRA((Kopiczko et al. 2024 [[https://arxiv.org/abs/2310.11454|arXiv:2310.11454]])) shares frozen random matrices **B** and **A** across all layers and trains only small per-layer scaling vectors. Reduces trainable parameters by another ~10x vs. LoRA.
==== GaLore (Gradient Low-Rank Projection) ====
GaLore((Zhao et al. 2024 [[https://arxiv.org/abs/2403.03507|arXiv:2403.03507]])) projects gradients into a low-rank subspace during full fine-tuning rather than restricting weight updates. Enables full fine-tuning memory efficiency without changing the model architecture.
==== LoRA+ ====
LoRA+((Hayou et al. 2024 [[https://arxiv.org/abs/2402.12354|arXiv:2402.12354]])) sets different learning rates for the **A** and **B** matrices (typically lr_B = 16x lr_A), which better matches the optimal signal-propagation regime and often improves convergence speed.
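A minimal sketch of how this looks as optimizer parameter groups (assumes PEFT-style parameter names containing ''lora_A'' / ''lora_B''; the 16x ratio is the paper's default, not a universal constant):

<code python>
def loraplus_param_groups(named_params, base_lr=2e-4, ratio=16):
    """Split LoRA parameters into two groups with lr_B = ratio * lr_A."""
    group_a = {"params": [], "lr": base_lr}           # A matrices
    group_b = {"params": [], "lr": base_lr * ratio}   # B matrices, higher lr
    for name, param in named_params:
        if "lora_A" in name:
            group_a["params"].append(param)
        elif "lora_B" in name:
            group_b["params"].append(param)
    return [group_a, group_b]
</code>

The returned list can be passed directly to an optimizer such as ''torch.optim.AdamW''.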
===== PEFT Methods Comparison =====
^ Method ^ Trainable Params ^ Performance ^ Inference Overhead ^ Notes ^
| LoRA | 0.1–1% | High | None (mergeable) | Best general choice |
| QLoRA | 0.1–1% | High | None (mergeable) | LoRA + 4-bit base; max memory savings |
| DoRA | 0.1–1% | Higher than LoRA | None (mergeable) | Better on reasoning; slightly slower train |
| AdaLoRA | 0.1–1% | High | None (mergeable) | Dynamic rank; best param efficiency |
| Prefix Tuning | 0.1% | Medium | Low (+prefix tokens) | Prepends learned tokens; no merging |
| Prompt Tuning | < 0.01% | Low–Medium | Low (+prompt tokens) | Minimal params; underperforms at small scale |
| Adapters | 0.5–3% | High | Low (serial layers) | Proven; inference latency from extra layers |
| IA3 | < 0.01% | Medium | None (mergeable) | Learns rescaling vectors; very few params |
| VeRA | < 0.01% | Medium–High | None (mergeable) | Minimal storage; shared random matrices |
===== Tools and Libraries =====
==== Hugging Face PEFT ====
The [[https://github.com/huggingface/peft|Hugging Face PEFT library]] is the de-facto standard. Supports LoRA, QLoRA, DoRA, AdaLoRA, Prefix Tuning, Prompt Tuning, IA3, and more across transformers models.
<code python>
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622
</code>
==== Unsloth ====
[[https://github.com/unslothai/unsloth|Unsloth]] provides hand-optimized LoRA/QLoRA kernels achieving **2x faster training** with **60% less memory** vs. standard PEFT, via hand-written Triton kernels. Drop-in compatible with Hugging Face.
==== NVIDIA NeMo ====
[[https://github.com/NVIDIA/NeMo|NVIDIA NeMo]] provides enterprise-grade PEFT pipelines with multi-GPU support, mixed precision, and deployment to Triton Inference Server.
==== BitsAndBytes ====
[[https://github.com/TimDettmers/bitsandbytes|BitsAndBytes]] provides the 4-bit and 8-bit quantization kernels underlying QLoRA. Required dependency for QLoRA workflows.
==== TRL (Transformer Reinforcement Learning) ====
[[https://github.com/huggingface/trl|TRL]] by Hugging Face integrates PEFT with SFT (SFTTrainer), RLHF, DPO, and PPO training loops. The standard toolkit for alignment fine-tuning.
==== Axolotl ====
[[https://github.com/OpenAccess-AI-Collective/axolotl|Axolotl]] is a config-driven fine-tuning framework supporting LoRA/QLoRA across Llama, Mistral, Mixtral, Falcon, and others. Emphasis on reproducibility via YAML configs.
==== LLaMA-Factory ====
[[https://github.com/hiyouga/LLaMA-Factory|LLaMA-Factory]] provides a unified training interface with WebUI for LoRA/QLoRA fine-tuning across 100+ models, with built-in dataset preprocessing and evaluation.
===== Agent Fine-Tuning Use Cases =====
PEFT is particularly well-suited for fine-tuning foundation models into specialized agents:
==== Tool Calling ====
Fine-tuning on tool-use datasets (function schemas + examples) with LoRA teaches models to produce syntactically correct JSON tool invocations. Small LoRA adapters (r=8) suffice for models already familiar with function calling; higher rank helps for models trained without it.
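As a hypothetical illustration of such a training record (schema, tool name, and message format invented for the example), the target output the adapter learns to emit is a syntactically valid JSON function call:

<code python>
import json

record = {
    "messages": [
        {"role": "system", "content": "Tools: get_weather(city: str)"},
        {"role": "user", "content": "What's the weather in Oslo?"},
        {"role": "assistant",
         "content": json.dumps({"name": "get_weather",
                                "arguments": {"city": "Oslo"}})},
    ]
}

# The assistant turn must parse as JSON with the expected call structure.
call = json.loads(record["messages"][-1]["content"])
assert call["name"] == "get_weather"
assert call["arguments"] == {"city": "Oslo"}
</code>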
==== Structured Output ====
LoRA fine-tuning on domain-specific schemas trains reliable structured JSON/XML output — critical for agents that must interface with APIs and databases without post-hoc parsing workarounds.
==== Domain-Specific Agents ====
QLoRA enables adapting 13B–70B models to specialized domains (legal, medical, code review) on consumer hardware, producing agents that can outperform GPT-4 on narrow benchmarks at a fraction of inference cost.
==== Model Distillation ====
LoRA adapters can encode distilled knowledge from a larger teacher model — the student is fine-tuned via LoRA on the teacher's outputs (often combined with DPO or KTO), compressing capability into a smaller deployable model.
==== Multi-LoRA Serving ====
Frameworks like [[https://github.com/vllm-project/vllm|vLLM]] and [[https://github.com/S-LoRA/S-LoRA|S-LoRA]] support serving **hundreds of LoRA adapters** on a single GPU cluster using a shared base model. Each user/task/tenant gets their own adapter, enabling personalization at scale with near-zero marginal cost per adapter.
==== LoRA-as-Tools Pattern ====
The LoRA-as-Tools pattern(("LoRA-as-Tools" [[https://arxiv.org/abs/2510.15416|arXiv:2510.15416]])) treats individual LoRA adapters as callable **tools** within an agent architecture. A router model selects which LoRA adapter to activate per inference step, enabling compositional specialization: one adapter for code generation, one for retrieval formatting, one for safety filtering — dynamically composed at runtime without multi-model overhead.
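The routing idea can be sketched as an adapter registry plus a per-step selector (everything here is hypothetical: adapter names, paths, and the toy keyword router stand in for a trained routing model):

<code python>
ADAPTER_TOOLS = {
    "codegen":   "adapters/lora-codegen",
    "retrieval": "adapters/lora-retrieval-format",
    "safety":    "adapters/lora-safety-filter",
}

def select_adapter(step: str) -> str:
    """Toy router; a real system would use a classifier or the LLM itself."""
    if "code" in step:
        return ADAPTER_TOOLS["codegen"]
    if "retrieve" in step or "cite" in step:
        return ADAPTER_TOOLS["retrieval"]
    return ADAPTER_TOOLS["safety"]
</code>

At serving time, the selected adapter path would be handed to a multi-LoRA backend such as vLLM, which swaps adapters per request against the shared base model.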
===== See Also =====
* [[fine_tuning_agents]]
* [[fireact_agent_finetuning]]
* [[agent_distillation]]
* [[direct_preference_optimization]]