Parameter-Efficient Fine-Tuning (PEFT) and LoRA

Parameter-Efficient Fine-Tuning (PEFT) refers to a family of techniques that adapt large pre-trained models to new tasks by updating only 0.01–1% of model parameters while keeping the rest frozen. This dramatically reduces compute, memory, and storage requirements compared to full fine-tuning.1)

Definition and Motivation

Full fine-tuning requires updating every parameter of a model — often billions of weights — for each downstream task. This is prohibitively expensive at scale. PEFT addresses this by identifying a small, task-specific parameter subspace to update.

Why PEFT matters:

- GPU memory and compute drop by roughly an order of magnitude, bringing large-model fine-tuning to single-GPU hardware
- Training takes hours to days instead of days to weeks
- Frozen base weights sharply reduce the risk of catastrophic forgetting
- Per-task storage shrinks from a full model copy to an adapter of ~10–100 MB
- One base model can serve many tasks by swapping small adapters

Full Fine-Tuning vs. PEFT:

Dimension               | Full Fine-Tuning   | PEFT (e.g. LoRA)
------------------------|--------------------|----------------------------
Trainable params        | 100%               | 0.01–1%
GPU memory (7B model)   | ~80 GB (bf16)      | ~8–16 GB
Training time           | Days–weeks         | Hours–days
Catastrophic forgetting | High risk          | Low risk
Per-task storage        | Full model copy    | Small adapter (~10–100 MB)
Multi-task serving      | One model per task | One base + N adapters

LoRA: Low-Rank Adaptation

LoRA2) is the most widely adopted PEFT method. Instead of updating a weight matrix W directly, it injects a low-rank decomposition:

delta_W = B * A

where:
  W  is the original frozen weight  (d x k)
  B  is a trainable matrix           (d x r)
  A  is a trainable matrix           (r x k)
  r << min(d, k)   [the rank]

During the forward pass: h = W_0 x + (B A) x

Alpha Scaling

LoRA introduces a scaling hyperparameter alpha applied to the adapter output:

output = W_0 x + (alpha / r) * B * A * x

Typically alpha is set equal to r (so the scale factor is 1), or to 2r for slightly stronger adaptation. This decouples learning rate sensitivity from rank choice.
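To make the shapes and the scaling concrete, here is a minimal pure-Python sketch of the forward pass above. Dimensions are toy values and no ML framework is used; variable names follow the formulas.

```python
# Toy LoRA forward pass: h = W_0 x + (alpha / r) * B A x
d, k, r, alpha = 6, 4, 2, 4          # out dim, in dim, rank, scaling hyperparameter

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

W0 = [[0.1] * k for _ in range(d)]   # frozen base weight, d x k
B  = [[0.0] * r for _ in range(d)]   # trainable, initialized to zero, d x r
A  = [[0.5] * k for _ in range(r)]   # trainable, randomly initialized in practice, r x k
x  = [1.0] * k

scale = alpha / r                    # here 4 / 2 = 2.0
h = [w + scale * l for w, l in zip(matvec(W0, x), matvec(B, matvec(A, x)))]

# Because B starts at zero, the adapter contributes nothing at step 0,
# so training begins exactly at the pre-trained model's behavior:
assert h == matvec(W0, x)

# Parameter savings: r*(d+k) trainable values instead of d*k.
print(f"trainable: {r * (d + k)} vs full: {d * k}")
```

Note that the zero initialization of B is what makes the adapter a no-op before training starts, regardless of how A is initialized.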

Rank Selection

The rule of thumb is to start with r = 8 and increase only if validation loss stagnates.

Target Layers

LoRA is typically applied to the attention projection matrices: the query, key, value, and output projections (q_proj, k_proj, v_proj, o_proj).

Applying LoRA to all linear layers (attention plus MLP projections) generally yields the best results at modest rank.

Zero Inference Latency

After training, LoRA adapters can be merged back into the base weights:

W_merged = W_0 + (alpha / r) * B * A

The merged model is identical in architecture to the original — no extra compute at inference.
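The merge identity is easy to verify numerically. The following pure-Python sketch uses toy matrices; in real workflows the same step is performed by peft's merge_and_unload().

```python
# Verify W_merged x == W_0 x + (alpha / r) * B A x on toy matrices.
d, k, r, alpha = 3, 3, 1, 2
W0 = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(d)]  # identity base
B  = [[0.2], [0.0], [0.1]]                                           # d x r
A  = [[0.5, 0.5, 0.5]]                                               # r x k
scale = alpha / r

def matvec(M, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

# Merge once, offline:
W_merged = [[W0[i][j] + scale * sum(B[i][t] * A[t][j] for t in range(r))
             for j in range(k)] for i in range(d)]

x = [1.0, 2.0, 3.0]
unmerged = [b + scale * l for b, l in zip(matvec(W0, x), matvec(B, matvec(A, x)))]
merged = matvec(W_merged, x)

# Identical outputs -- the merged model needs no adapter at inference time.
assert all(abs(m - u) < 1e-9 for m, u in zip(merged, unmerged))
```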

QLoRA and Variants

QLoRA

QLoRA3) combines LoRA with 4-bit NF4 (NormalFloat4) quantization of the frozen base model weights, plus double quantization of the quantization constants. This enables fine-tuning a 65B parameter model on a single 48 GB GPU — previously impossible.

Key innovations:

- 4-bit NormalFloat (NF4) quantization, an information-theoretically motivated data type for normally distributed weights
- Double quantization: the quantization constants are themselves quantized, saving roughly 0.4 bits per parameter
- Paged optimizers, which spill optimizer state to CPU memory to survive transient memory spikes
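A typical QLoRA setup quantizes the base model at load time. The sketch below uses the transformers library's BitsAndBytesConfig; the model name is a placeholder and the snippet assumes a GPU environment with bitsandbytes installed.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA base-model quantization config: NF4 plus double quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize frozen base weights to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
)

# "meta-llama/Llama-2-7b-hf" is a placeholder; any causal LM works.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
# LoRA adapters are then attached on top with peft's get_peft_model as usual.
```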

DoRA (Weight-Decomposed Low-Rank Adaptation)

DoRA4) decomposes weight updates into magnitude and direction components, applying LoRA only to the directional component. This typically outperforms LoRA at the same rank, especially on reasoning tasks.

AdaLoRA

AdaLoRA5) dynamically allocates rank budget across weight matrices based on an SVD-based importance score. Less important layers receive rank 0 (pruned); critical layers receive higher rank. Achieves better performance per trainable parameter.

rsLoRA

rsLoRA6) changes the scaling from alpha/r to alpha/sqrt(r), which stabilizes training at higher ranks and allows effective use of r=128+ without learning rate collapse.

VeRA (Vector-based Random Matrix Adaptation)

VeRA7) shares frozen random matrices B and A across all layers and trains only small per-layer scaling vectors. Reduces trainable parameters by another ~10x vs. LoRA.

GaLore (Gradient Low-Rank Projection)

GaLore8) projects gradients into a low-rank subspace during full fine-tuning rather than restricting weight updates. Enables full fine-tuning memory efficiency without changing the model architecture.

LoRA+

LoRA+9) sets different learning rates for the A and B matrices (typically lr_B = 16x lr_A), which better matches the optimal signal-propagation regime and often improves convergence speed.
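The rule is simple enough to sketch without a framework: group parameters by name and give the B matrices a larger learning rate. The parameter names and the ratio below are illustrative; in PyTorch this maps onto optimizer parameter groups.

```python
# LoRA+-style learning-rate bookkeeping: lora_B parameters get base_lr * ratio.
base_lr, ratio = 1e-4, 16   # a ratio around 16 is the commonly cited recommendation

param_names = ["layers.0.q_proj.lora_A", "layers.0.q_proj.lora_B",
               "layers.0.v_proj.lora_A", "layers.0.v_proj.lora_B"]

def lr_for(name: str) -> float:
    """Assign the higher learning rate to B matrices, the base rate to A."""
    return base_lr * ratio if "lora_B" in name else base_lr

groups = {name: lr_for(name) for name in param_names}
assert groups["layers.0.q_proj.lora_B"] == 16 * groups["layers.0.q_proj.lora_A"]
```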

PEFT Methods Comparison

Method        | Trainable Params | Performance      | Inference Overhead   | Notes
--------------|------------------|------------------|----------------------|--------------------------------------------
LoRA          | 0.1–1%           | High             | None (mergeable)     | Best general choice
QLoRA         | 0.1–1%           | High             | None (mergeable)     | LoRA + 4-bit base; max memory savings
DoRA          | 0.1–1%           | Higher than LoRA | None (mergeable)     | Better on reasoning; slightly slower train
AdaLoRA       | 0.1–1%           | High             | None (mergeable)     | Dynamic rank; best param efficiency
Prefix Tuning | 0.1%             | Medium           | Low (+prefix tokens) | Prepends learned tokens; no merging
Prompt Tuning | < 0.01%          | Low–Medium       | Low (+prompt tokens) | Minimal params; underperforms at small scale
Adapters      | 0.5–3%           | High             | Low (serial layers)  | Proven; inference latency from extra layers
IA3           | < 0.01%          | Medium           | None (mergeable)     | Learns rescaling vectors; very few params
VeRA          | < 0.01%          | Medium–High      | None (mergeable)     | Minimal storage; shared random matrices

Tools and Libraries

Hugging Face PEFT

The Hugging Face PEFT library is the de-facto standard. Supports LoRA, QLoRA, DoRA, AdaLoRA, Prefix Tuning, Prompt Tuning, IA3, and more across transformers models.

from peft import LoraConfig, get_peft_model
 
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622

Unsloth

Unsloth provides hand-optimized LoRA/QLoRA kernels, claiming up to 2x faster training with 60% less memory than standard PEFT via custom Triton kernels. Drop-in compatible with Hugging Face.

NVIDIA NeMo

NVIDIA NeMo provides enterprise-grade PEFT pipelines with multi-GPU support, mixed precision, and deployment to Triton Inference Server.

BitsAndBytes

BitsAndBytes provides the 4-bit and 8-bit quantization kernels underlying QLoRA. Required dependency for QLoRA workflows.

TRL (Transformer Reinforcement Learning)

TRL by Hugging Face integrates PEFT with SFT (SFTTrainer), RLHF, DPO, and PPO training loops. The standard toolkit for alignment fine-tuning.

Axolotl

Axolotl is a config-driven fine-tuning framework supporting LoRA/QLoRA across Llama, Mistral, Mixtral, Falcon, and others. Emphasis on reproducibility via YAML configs.

LLaMA-Factory

LLaMA-Factory provides a unified training interface with WebUI for LoRA/QLoRA fine-tuning across 100+ models, with built-in dataset preprocessing and evaluation.

Agent Fine-Tuning Use Cases

PEFT is particularly well-suited for fine-tuning foundation models into specialized agents:

Tool Calling

Fine-tuning on tool-use datasets (function schemas + examples) with LoRA teaches models to produce syntactically correct JSON tool invocations. Small LoRA adapters (r=8) suffice for models already familiar with function calling; higher rank helps for models trained without it.
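What such a training example looks like can be sketched with the standard library alone. The schema, field names, and prompt format below are illustrative assumptions, not a specific dataset's format.

```python
import json

# One hypothetical supervised example for tool-calling fine-tuning:
# the prompt carries the function schema, the target is the JSON invocation.
tool_schema = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

example = {
    "prompt": f"Tools: {json.dumps(tool_schema)}\nUser: What's the weather in Oslo?",
    "target": json.dumps({"tool": "get_weather", "arguments": {"city": "Oslo"}}),
}

# The property LoRA training reinforces: the target must parse as valid JSON
# that matches the declared schema.
assert json.loads(example["target"])["arguments"]["city"] == "Oslo"
```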

Structured Output

LoRA fine-tuning on domain-specific schemas trains reliable structured JSON/XML output — critical for agents that must interface with APIs and databases without post-hoc parsing workarounds.

Domain-Specific Agents

QLoRA enables adapting 13B–70B models to specialized domains (legal, medical, code review) on consumer hardware, producing agents that can outperform larger general-purpose models such as GPT-4 on narrow benchmarks at a fraction of the inference cost.

Model Distillation

LoRA adapters can encode distilled knowledge from a larger teacher model — the student is fine-tuned via LoRA on the teacher's outputs (often combined with DPO or KTO), compressing capability into a smaller deployable model.

Multi-LoRA Serving

Frameworks like vLLM and S-LoRA support serving hundreds of LoRA adapters on a single GPU cluster using a shared base model. Each user/task/tenant gets their own adapter, enabling personalization at scale with near-zero marginal cost per adapter.

LoRA-as-Tools Pattern

The LoRA-as-Tools pattern10) treats individual LoRA adapters as callable tools within an agent architecture. A router model selects which LoRA adapter to activate per inference step, enabling compositional specialization: one adapter for code generation, one for retrieval formatting, one for safety filtering — dynamically composed at runtime without multi-model overhead.
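At its core the pattern reduces to a routing table from step intent to adapter. Everything in the sketch below is illustrative: the adapter names, intent labels, and the route function are assumptions, not an API from the cited paper.

```python
# Minimal sketch of LoRA-as-Tools routing: one adapter per capability,
# selected per inference step. All names are hypothetical.
ADAPTERS = {
    "codegen":   "adapters/code-lora",
    "retrieval": "adapters/retrieval-format-lora",
    "safety":    "adapters/safety-filter-lora",
}

def select_adapter(step_intent: str) -> str:
    """Router: map the current step's intent to an adapter, base model as fallback."""
    return ADAPTERS.get(step_intent, "base")

assert select_adapter("codegen") == "adapters/code-lora"
assert select_adapter("chitchat") == "base"
```

In a multi-LoRA serving stack, the returned adapter name would be passed along with each request so the shared base model activates the right weights.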


1)
Ding et al. 2022, “Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models” arXiv:2203.06904
2)
Hu et al. 2021, “LoRA: Low-Rank Adaptation of Large Language Models” arXiv:2106.09685
3)
Dettmers et al. 2023, “QLoRA: Efficient Finetuning of Quantized LLMs” arXiv:2305.14314
4)
Liu et al. 2024, “DoRA: Weight-Decomposed Low-Rank Adaptation” arXiv:2402.09353
5)
Zhang et al. 2023, “AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning” arXiv:2303.10512
6)
Kalajdzievski 2023, “A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA” arXiv:2312.03732
7)
Kopiczko et al. 2024, “VeRA: Vector-based Random Matrix Adaptation” arXiv:2310.11454
8)
Zhao et al. 2024, “GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection” arXiv:2403.03507
9)
Hayou et al. 2024, “LoRA+: Efficient Low Rank Adaptation of Large Models” arXiv:2402.12354
10)
“LoRA-as-Tools” arXiv:2510.15416