====== Parameter-Efficient Fine-Tuning (PEFT) and LoRA ======

Parameter-Efficient Fine-Tuning (PEFT) refers to a family of techniques that adapt large pre-trained models to new tasks by updating only 0.01–1% of model parameters while keeping the rest frozen. This dramatically reduces compute, memory, and storage requirements compared to full fine-tuning.((Ding et al. 2022, "Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models" [[https://arxiv.org/abs/2203.06904|arXiv:2203.06904]]))

===== Definition and Motivation =====

Full fine-tuning requires updating every parameter of a model — often billions of weights — for each downstream task. This is prohibitively expensive at scale. PEFT addresses this by identifying a small, task-specific parameter subspace to update.

**Why PEFT matters:**

  * **Cost reduction** — training a 7B model with LoRA requires roughly a tenth of the GPU memory of full fine-tuning
  * **Mitigates catastrophic forgetting** — the frozen backbone preserves general capabilities while adapters capture task-specific behavior
  * **Lower overfitting risk** — fewer trainable parameters means less risk of memorizing small datasets
  * **Multi-task scalability** — a single base model can host dozens of lightweight adapters, one per task, swapped at inference time

**Full Fine-Tuning vs. PEFT:**

^ Dimension ^ Full Fine-Tuning ^ PEFT (e.g. LoRA) ^
| Trainable params | 100% | 0.01–1% |
| GPU memory (7B model) | ~80 GB (bf16) | ~8–16 GB |
| Training time | Days–weeks | Hours–days |
| Catastrophic forgetting | High risk | Low risk |
| Per-task storage | Full model copy | Small adapter (~10–100 MB) |
| Multi-task serving | One model per task | One base + N adapters |

===== LoRA: Low-Rank Adaptation =====

LoRA((Hu et al. 2021, "LoRA: Low-Rank Adaptation of Large Language Models" [[https://arxiv.org/abs/2106.09685|arXiv:2106.09685]])) is the most widely adopted PEFT method.
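The trainable-parameter percentages in the table above can be reproduced with a quick back-of-the-envelope calculation. A minimal sketch, assuming hypothetical Llama-7B-like shapes (hidden size 4096, 32 transformer layers, rank-8 adapters on the query and value projections):

```python
# Back-of-the-envelope LoRA parameter count (hypothetical 7B-class shapes).
d = 4096                       # hidden size
r = 8                          # LoRA rank
layers = 32                    # transformer layers
modules_per_layer = 2          # adapters on q_proj and v_proj only
total_params = 6_742_609_920   # assumed base-model parameter count

# Each adapted d x d matrix gets B (d x r) and A (r x d): 2 * d * r params.
lora_params = layers * modules_per_layer * 2 * d * r
print(lora_params)                                  # -> 4194304
print(f"{100 * lora_params / total_params:.4f}%")   # -> 0.0622%
```

The result (~0.06% trainable) sits at the low end of the 0.01–1% range quoted for PEFT methods.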
Instead of updating a weight matrix **W_0** directly, it learns a low-rank decomposition of the update:

  delta_W = B A

where:
  * **W_0** is the original frozen weight (d × k)
  * **B** is a trainable matrix (d × r), initialized to zeros
  * **A** is a trainable matrix (r × k)
  * **r** ≪ min(d, k) is the rank

During the forward pass: //h = W_0 x + B A x//

==== Alpha Scaling ====

LoRA introduces a scaling hyperparameter //alpha// applied to the adapter output:

  output = W_0 x + (alpha / r) B A x

Typically //alpha// is set equal to //r// (so the scale factor is 1), or to //2r// for slightly stronger adaptation. Dividing by //r// decouples learning-rate sensitivity from the choice of rank.

==== Rank Selection ====

  * **r = 4–8**: most common default; sufficient for instruction following and style adaptation
  * **r = 16–64**: for domain adaptation or tasks requiring a significant behavioral shift
  * **r = 128+**: rarely needed; approaches full fine-tuning parameter counts

The rule of thumb is to start with **r = 8** and increase only if validation loss stagnates.

==== Target Layers ====

LoRA is typically applied to the attention projection matrices:

  * **W_q** (query) and **W_v** (value) — the minimum effective set
  * **W_k** (key) and **W_o** (output projection) — a common addition
  * MLP layers (up/down/gate projections) — for stronger adaptation

Applying LoRA to all linear layers generally yields the best results at modest rank.

==== Zero Inference Latency ====

After training, LoRA adapters can be **merged** back into the base weights:

  W_merged = W_0 + (alpha / r) B A

The merged model is architecturally identical to the original — no extra compute at inference.

===== QLoRA and Variants =====

==== QLoRA ====

QLoRA((Dettmers et al. 2023, "QLoRA: Efficient Finetuning of Quantized LLMs" [[https://arxiv.org/abs/2305.14314|arXiv:2305.14314]])) combines LoRA with 4-bit NF4 (NormalFloat4) quantization of the frozen base model weights, plus double quantization of the quantization constants.
This enables fine-tuning a **65B parameter model on a single 48 GB GPU** — previously out of reach on a single device. Key innovations:

  * **NF4 quantization**: bins weights optimally under the assumption that they follow a normal distribution
  * **Double quantization**: quantizes the quantization constants themselves, saving ~0.37 bits/param
  * **Paged optimizers**: uses NVIDIA unified memory to absorb optimizer-state memory spikes

==== DoRA (Weight-Decomposed Low-Rank Adaptation) ====

DoRA((Liu et al. 2024 [[https://arxiv.org/abs/2402.09353|arXiv:2402.09353]])) decomposes weight updates into **magnitude** and **direction** components, applying LoRA only to the directional component. It typically outperforms LoRA at the same rank, especially on reasoning tasks.

==== AdaLoRA ====

AdaLoRA((Zhang et al. 2023 [[https://arxiv.org/abs/2303.10512|arXiv:2303.10512]])) dynamically allocates the rank budget across weight matrices based on an SVD-based importance score. Less important layers are pruned to rank 0; critical layers receive higher rank. This achieves better performance per trainable parameter.

==== rsLoRA ====

rsLoRA((Kalajdzievski 2023)) changes the scaling from //alpha/r// to //alpha/sqrt(r)//, which stabilizes training at higher ranks and allows effective use of r = 128+ without learning-rate collapse.

==== VeRA (Vector-based Random Matrix Adaptation) ====

VeRA((Kopiczko et al. 2024 [[https://arxiv.org/abs/2310.11454|arXiv:2310.11454]])) shares frozen random matrices **B** and **A** across all layers and trains only small per-layer scaling vectors, reducing trainable parameters by another ~10x vs. LoRA.

==== GaLore (Gradient Low-Rank Projection) ====

GaLore((Zhao et al. 2024 [[https://arxiv.org/abs/2403.03507|arXiv:2403.03507]])) projects gradients into a low-rank subspace during full fine-tuning rather than restricting the weight updates themselves, bringing memory usage close to PEFT levels while still updating all weights and leaving the model architecture unchanged.

==== LoRA+ ====

LoRA+((Hayou et al.
2024 [[https://arxiv.org/abs/2402.12354|arXiv:2402.12354]])) sets different learning rates for the **A** and **B** matrices (typically lr_B = 16x lr_A), which better matches the optimal signal-propagation regime and often improves convergence speed.

===== PEFT Methods Comparison =====

^ Method ^ Trainable Params ^ Performance ^ Inference Overhead ^ Notes ^
| LoRA | 0.1–1% | High | None (mergeable) | Best general choice |
| QLoRA | 0.1–1% | High | None (mergeable) | LoRA + 4-bit base; max memory savings |
| DoRA | 0.1–1% | Higher than LoRA | None (mergeable) | Better on reasoning; slightly slower training |
| AdaLoRA | 0.1–1% | High | None (mergeable) | Dynamic rank; best param efficiency |
| Prefix Tuning | 0.1% | Medium | Low (+prefix tokens) | Prepends learned tokens; no merging |
| Prompt Tuning | < 0.01% | Low–Medium | Low (+prompt tokens) | Minimal params; underperforms at small scale |
| Adapters | 0.5–3% | High | Low (serial layers) | Proven; inference latency from extra layers |
| IA3 | < 0.01% | Medium | None (mergeable) | Learns rescaling vectors; very few params |
| VeRA | < 0.01% | Medium–High | None (mergeable) | Minimal storage; shared random matrices |

===== Tools and Libraries =====

==== Hugging Face PEFT ====

The [[https://github.com/huggingface/peft|Hugging Face PEFT library]] is the de facto standard. It supports LoRA, QLoRA, DoRA, AdaLoRA, Prefix Tuning, Prompt Tuning, IA3, and more across transformers models.

<code python>
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622
</code>

==== Unsloth ====

[[https://github.com/unslothai/unsloth|Unsloth]] provides hand-optimized LoRA/QLoRA kernels achieving **2x faster training** with **60% less memory** vs.
standard PEFT, via hand-written Triton kernels. Drop-in compatible with Hugging Face.

==== NVIDIA NeMo ====

[[https://github.com/NVIDIA/NeMo|NVIDIA NeMo]] provides enterprise-grade PEFT pipelines with multi-GPU support, mixed precision, and deployment to Triton Inference Server.

==== BitsAndBytes ====

[[https://github.com/TimDettmers/bitsandbytes|BitsAndBytes]] provides the 4-bit and 8-bit quantization kernels underlying QLoRA. It is a required dependency for QLoRA workflows.

==== TRL (Transformer Reinforcement Learning) ====

[[https://github.com/huggingface/trl|TRL]] by Hugging Face integrates PEFT with SFT (SFTTrainer), RLHF, DPO, and PPO training loops. It is the standard toolkit for alignment fine-tuning.

==== Axolotl ====

[[https://github.com/OpenAccess-AI-Collective/axolotl|Axolotl]] is a config-driven fine-tuning framework supporting LoRA/QLoRA across Llama, Mistral, Mixtral, Falcon, and others, with an emphasis on reproducibility via YAML configs.

==== LLaMA-Factory ====

[[https://github.com/hiyouga/LLaMA-Factory|LLaMA-Factory]] provides a unified training interface with a WebUI for LoRA/QLoRA fine-tuning across 100+ models, with built-in dataset preprocessing and evaluation.

===== Agent Fine-Tuning Use Cases =====

PEFT is particularly well suited to fine-tuning foundation models into specialized agents:

==== Tool Calling ====

Fine-tuning on tool-use datasets (function schemas + examples) with LoRA teaches models to produce syntactically correct JSON tool invocations. Small LoRA adapters (r = 8) suffice for models already familiar with function calling; a higher rank helps for models trained without it.

==== Structured Output ====

LoRA fine-tuning on domain-specific schemas trains reliable structured JSON/XML output — critical for agents that must interface with APIs and databases without post-hoc parsing workarounds.
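A tool-use training record of the kind described above might look like the following. This is an illustrative sketch: the field names and schema are hypothetical, not any particular dataset's format.

```python
import json

# Hypothetical tool-use training example: the model is taught to emit a
# syntactically valid JSON invocation for a declared function schema.
example = {
    "tools": [{
        "name": "get_weather",
        "parameters": {"city": {"type": "string"}},
    }],
    "messages": [
        {"role": "user", "content": "What's the weather in Paris?"},
        # Target completion: the well-formed tool call the adapter learns to produce.
        {"role": "assistant",
         "content": json.dumps({"tool": "get_weather", "arguments": {"city": "Paris"}})},
    ],
}

# Minimal validity check of the kind used when filtering such datasets:
call = json.loads(example["messages"][-1]["content"])
assert call["tool"] == "get_weather"
print(call["arguments"]["city"])  # -> Paris
```

Filtering training data with checks like the `json.loads` round trip above is a cheap way to guarantee every target completion in the dataset is parseable.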
==== Domain-Specific Agents ====

QLoRA enables adapting 13B–70B models to specialized domains (legal, medical, code review) on consumer hardware, producing agents that can rival much larger general-purpose models on narrow benchmarks at a fraction of the inference cost.

==== Model Distillation ====

LoRA adapters can encode knowledge distilled from a larger teacher model: the student is fine-tuned via LoRA on the teacher's outputs (often combined with DPO or KTO), compressing capability into a smaller deployable model.

==== Multi-LoRA Serving ====

Frameworks like [[https://github.com/vllm-project/vllm|vLLM]] and [[https://github.com/S-LoRA/S-LoRA|S-LoRA]] support serving **hundreds of LoRA adapters** on a single GPU cluster using a shared base model. Each user, task, or tenant gets its own adapter, enabling personalization at scale with near-zero marginal cost per adapter.

==== LoRA-as-Tools Pattern ====

The LoRA-as-Tools pattern(("LoRA-as-Tools" [[https://arxiv.org/abs/2510.15416|arXiv:2510.15416]])) treats individual LoRA adapters as callable **tools** within an agent architecture. A router model selects which LoRA adapter to activate per inference step, enabling compositional specialization: one adapter for code generation, one for retrieval formatting, one for safety filtering — dynamically composed at runtime without multi-model overhead.

===== See Also =====

  * [[fine_tuning_agents]]
  * [[fireact_agent_finetuning]]
  * [[agent_distillation]]
  * [[direct_preference_optimization]]