Parameter-Efficient Fine-Tuning (PEFT) and LoRA

Parameter-Efficient Fine-Tuning (PEFT) refers to a family of techniques that adapt large pre-trained models to new tasks by updating only 0.01–1% of model parameters while keeping the rest frozen. This dramatically reduces compute, memory, and storage requirements compared to full fine-tuning.1)

Definition and Motivation

Full fine-tuning requires updating every parameter of a model — often billions of weights — for each downstream task. This is prohibitively expensive at scale. PEFT addresses this by identifying a small, task-specific parameter subspace to update.

Why PEFT matters:

- GPU memory and compute drop by roughly an order of magnitude, bringing large-model fine-tuning to single-GPU hardware
- Training takes hours to days instead of days to weeks
- Frozen base weights sharply reduce the risk of catastrophic forgetting
- Per-task storage shrinks from a full model copy to an adapter of ~10–100 MB
- One base model can serve many tasks by swapping small adapters

Full Fine-Tuning vs. PEFT:

Dimension               | Full Fine-Tuning   | PEFT (e.g. LoRA)
------------------------|--------------------|----------------------------
Trainable params        | 100%               | 0.01–1%
GPU memory (7B model)   | ~80 GB (bf16)      | ~8–16 GB
Training time           | Days–weeks         | Hours–days
Catastrophic forgetting | High risk          | Low risk
Per-task storage        | Full model copy    | Small adapter (~10–100 MB)
Multi-task serving      | One model per task | One base + N adapters

LoRA: Low-Rank Adaptation

LoRA2) is the most widely adopted PEFT method. Instead of updating a weight matrix W directly, it injects a low-rank decomposition:

delta_W = B * A

where:
  W  is the original frozen weight  (d x k)
  B  is a trainable matrix           (d x r)
  A  is a trainable matrix           (r x k)
  r << min(d, k)   [the rank]

During the forward pass: h = W_0 x + (B A) x

Alpha Scaling

LoRA introduces a scaling hyperparameter alpha applied to the adapter output:

output = W_0 x + (alpha / r) * B * A * x

Typically alpha is set equal to r (so the scale factor is 1), or to 2r for slightly stronger adaptation. This decouples learning rate sensitivity from rank choice.
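To make the shapes and the scaling concrete, here is a minimal pure-Python sketch of the forward pass above. Dimensions are toy values and no ML framework is used; variable names follow the formulas.

```python
# Toy LoRA forward pass: h = W_0 x + (alpha / r) * B A x
d, k, r, alpha = 6, 4, 2, 4          # out dim, in dim, rank, scaling hyperparameter

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

W0 = [[0.1] * k for _ in range(d)]   # frozen base weight, d x k
B  = [[0.0] * r for _ in range(d)]   # trainable, initialized to zero, d x r
A  = [[0.5] * k for _ in range(r)]   # trainable, randomly initialized in practice, r x k
x  = [1.0] * k

scale = alpha / r                    # here 4 / 2 = 2.0
h = [w + scale * l for w, l in zip(matvec(W0, x), matvec(B, matvec(A, x)))]

# Because B starts at zero, the adapter contributes nothing at step 0,
# so training begins exactly at the pre-trained model's behavior:
assert h == matvec(W0, x)

# Parameter savings: r*(d+k) trainable values instead of d*k.
print(f"trainable: {r * (d + k)} vs full: {d * k}")
```

Note that the zero initialization of B is what makes the adapter a no-op before training starts, regardless of how A is initialized.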

Rank Selection

The rule of thumb is to start with r = 8 and increase only if validation loss stagnates.

Target Layers

LoRA is typically applied to the attention projection matrices: the query, key, value, and output projections (q_proj, k_proj, v_proj, o_proj).

Applying LoRA to all linear layers (attention plus MLP projections) generally yields the best results at modest rank.

Zero Inference Latency

After training, LoRA adapters can be merged back into the base weights:

W_merged = W_0 + (alpha / r) * B * A

The merged model is identical in architecture to the original — no extra compute at inference.
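The merge identity is easy to verify numerically. The following pure-Python sketch uses toy matrices; in real workflows the same step is performed by peft's merge_and_unload().

```python
# Verify W_merged x == W_0 x + (alpha / r) * B A x on toy matrices.
d, k, r, alpha = 3, 3, 1, 2
W0 = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(d)]  # identity base
B  = [[0.2], [0.0], [0.1]]                                           # d x r
A  = [[0.5, 0.5, 0.5]]                                               # r x k
scale = alpha / r

def matvec(M, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

# Merge once, offline:
W_merged = [[W0[i][j] + scale * sum(B[i][t] * A[t][j] for t in range(r))
             for j in range(k)] for i in range(d)]

x = [1.0, 2.0, 3.0]
unmerged = [b + scale * l for b, l in zip(matvec(W0, x), matvec(B, matvec(A, x)))]
merged = matvec(W_merged, x)

# Identical outputs -- the merged model needs no adapter at inference time.
assert all(abs(m - u) < 1e-9 for m, u in zip(merged, unmerged))
```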

QLoRA and Variants

QLoRA

QLoRA3) combines LoRA with 4-bit NF4 (NormalFloat4) quantization of the frozen base model weights, plus double quantization of the quantization constants. This enables fine-tuning a 65B parameter model on a single 48 GB GPU — previously impossible.

Key innovations:

- 4-bit NormalFloat (NF4) quantization, an information-theoretically motivated data type for normally distributed weights
- Double quantization: the quantization constants are themselves quantized, saving roughly 0.4 bits per parameter
- Paged optimizers, which spill optimizer state to CPU memory to survive transient memory spikes
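A typical QLoRA setup quantizes the base model at load time. The sketch below uses the transformers library's BitsAndBytesConfig; the model name is a placeholder and the snippet assumes a GPU environment with bitsandbytes installed.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA base-model quantization config: NF4 plus double quantization.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize frozen base weights to 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
)

# "meta-llama/Llama-2-7b-hf" is a placeholder; any causal LM works.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
# LoRA adapters are then attached on top with peft's get_peft_model as usual.
```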

DoRA (Weight-Decomposed Low-Rank Adaptation)

DoRA4) decomposes weight updates into magnitude and direction components, applying LoRA only to the directional component. This typically outperforms LoRA at the same rank, especially on reasoning tasks.

AdaLoRA

AdaLoRA5) dynamically allocates rank budget across weight matrices based on an SVD-based importance score. Less important layers receive rank 0 (pruned); critical layers receive higher rank. Achieves better performance per trainable parameter.

rsLoRA

rsLoRA6) changes the scaling from alpha/r to alpha/sqrt(r), which stabilizes training at higher ranks and allows effective use of r=128+ without learning rate collapse.

VeRA (Vector-based Random Matrix Adaptation)

VeRA7) shares frozen random matrices B and A across all layers and trains only small per-layer scaling vectors. Reduces trainable parameters by another ~10x vs. LoRA.

GaLore (Gradient Low-Rank Projection)

GaLore8) projects gradients into a low-rank subspace during full fine-tuning rather than restricting weight updates. Enables full fine-tuning memory efficiency without changing the model architecture.

LoRA+

LoRA+9) sets different learning rates for the A and B matrices (typically lr_B = 16x lr_A), which better matches the optimal signal-propagation regime and often improves convergence speed.
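The rule is simple enough to sketch without a framework: group parameters by name and give the B matrices a larger learning rate. The parameter names and the ratio below are illustrative; in PyTorch this maps onto optimizer parameter groups.

```python
# LoRA+-style learning-rate bookkeeping: lora_B parameters get base_lr * ratio.
base_lr, ratio = 1e-4, 16   # a ratio around 16 is the commonly cited recommendation

param_names = ["layers.0.q_proj.lora_A", "layers.0.q_proj.lora_B",
               "layers.0.v_proj.lora_A", "layers.0.v_proj.lora_B"]

def lr_for(name: str) -> float:
    """Assign the higher learning rate to B matrices, the base rate to A."""
    return base_lr * ratio if "lora_B" in name else base_lr

groups = {name: lr_for(name) for name in param_names}
assert groups["layers.0.q_proj.lora_B"] == 16 * groups["layers.0.q_proj.lora_A"]
```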

PEFT Methods Comparison

Method        | Trainable Params | Performance      | Inference Overhead   | Notes
--------------|------------------|------------------|----------------------|--------------------------------------------
LoRA          | 0.1–1%           | High             | None (mergeable)     | Best general choice
QLoRA         | 0.1–1%           | High             | None (mergeable)     | LoRA + 4-bit base; max memory savings
DoRA          | 0.1–1%           | Higher than LoRA | None (mergeable)     | Better on reasoning; slightly slower train
AdaLoRA       | 0.1–1%           | High             | None (mergeable)     | Dynamic rank; best param efficiency
Prefix Tuning | 0.1%             | Medium           | Low (+prefix tokens) | Prepends learned tokens; no merging
Prompt Tuning | < 0.01%          | Low–Medium       | Low (+prompt tokens) | Minimal params; underperforms at small scale
Adapters      | 0.5–3%           | High             | Low (serial layers)  | Proven; inference latency from extra layers
IA3           | < 0.01%          | Medium           | None (mergeable)     | Learns rescaling vectors; very few params
VeRA          | < 0.01%          | Medium–High      | None (mergeable)     | Minimal storage; shared random matrices

Tools and Libraries

Hugging Face PEFT

The Hugging Face PEFT library is the de-facto standard. Supports LoRA, QLoRA, DoRA, AdaLoRA, Prefix Tuning, Prompt Tuning, IA3, and more across transformers models.

from peft import LoraConfig, get_peft_model
 
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622

Unsloth

Unsloth provides hand-optimized LoRA/QLoRA kernels, claiming up to 2x faster training with 60% less memory than standard PEFT via custom Triton kernels. Drop-in compatible with Hugging Face.

NVIDIA NeMo

NVIDIA NeMo provides enterprise-grade PEFT pipelines with multi-GPU support, mixed precision, and deployment to Triton Inference Server.

BitsAndBytes

BitsAndBytes provides the 4-bit and 8-bit quantization kernels underlying QLoRA. Required dependency for QLoRA workflows.

TRL (Transformer Reinforcement Learning)

TRL by Hugging Face integrates PEFT with SFT (SFTTrainer), RLHF, DPO, and PPO training loops. The standard toolkit for alignment fine-tuning.

Axolotl

Axolotl is a config-driven fine-tuning framework supporting LoRA/QLoRA across Llama, Mistral, Mixtral, Falcon, and others. Emphasis on reproducibility via YAML configs.

LLaMA-Factory

LLaMA-Factory provides a unified training interface with WebUI for LoRA/QLoRA fine-tuning across 100+ models, with built-in dataset preprocessing and evaluation.

Agent Fine-Tuning Use Cases

PEFT is particularly well-suited for fine-tuning foundation models into specialized agents:

Tool Calling

Fine-tuning on tool-use datasets (function schemas + examples) with LoRA teaches models to produce syntactically correct JSON tool invocations. Small LoRA adapters (r=8) suffice for models already familiar with function calling; higher rank helps for models trained without it.
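What such a training example looks like can be sketched with the standard library alone. The schema, field names, and prompt format below are illustrative assumptions, not a specific dataset's format.

```python
import json

# One hypothetical supervised example for tool-calling fine-tuning:
# the prompt carries the function schema, the target is the JSON invocation.
tool_schema = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

example = {
    "prompt": f"Tools: {json.dumps(tool_schema)}\nUser: What's the weather in Oslo?",
    "target": json.dumps({"tool": "get_weather", "arguments": {"city": "Oslo"}}),
}

# The property LoRA training reinforces: the target must parse as valid JSON
# that matches the declared schema.
assert json.loads(example["target"])["arguments"]["city"] == "Oslo"
```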

Structured Output

LoRA fine-tuning on domain-specific schemas trains reliable structured JSON/XML output — critical for agents that must interface with APIs and databases without post-hoc parsing workarounds.

Domain-Specific Agents

QLoRA enables adapting 13B–70B models to specialized domains (legal, medical, code review) on consumer hardware, producing agents that can outperform larger general-purpose models such as GPT-4 on narrow benchmarks at a fraction of the inference cost.

Model Distillation

LoRA adapters can encode distilled knowledge from a larger teacher model — the student is fine-tuned via LoRA on the teacher's outputs (often combined with DPO or KTO), compressing capability into a smaller deployable model.

Multi-LoRA Serving

Frameworks like vLLM and S-LoRA support serving hundreds of LoRA adapters on a single GPU cluster using a shared base model. Each user/task/tenant gets their own adapter, enabling personalization at scale with near-zero marginal cost per adapter.

LoRA-as-Tools Pattern

The LoRA-as-Tools pattern10) treats individual LoRA adapters as callable tools within an agent architecture. A router model selects which LoRA adapter to activate per inference step, enabling compositional specialization: one adapter for code generation, one for retrieval formatting, one for safety filtering — dynamically composed at runtime without multi-model overhead.
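At its core the pattern reduces to a routing table from step intent to adapter. Everything in the sketch below is illustrative: the adapter names, intent labels, and the route function are assumptions, not an API from the cited paper.

```python
# Minimal sketch of LoRA-as-Tools routing: one adapter per capability,
# selected per inference step. All names are hypothetical.
ADAPTERS = {
    "codegen":   "adapters/code-lora",
    "retrieval": "adapters/retrieval-format-lora",
    "safety":    "adapters/safety-filter-lora",
}

def select_adapter(step_intent: str) -> str:
    """Router: map the current step's intent to an adapter, base model as fallback."""
    return ADAPTERS.get(step_intent, "base")

assert select_adapter("codegen") == "adapters/code-lora"
assert select_adapter("chitchat") == "base"
```

In a multi-LoRA serving stack, the returned adapter name would be passed along with each request so the shared base model activates the right weights.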


1)
Ding et al. 2022, “Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models” arXiv:2203.06904
2)
Hu et al. 2021, “LoRA: Low-Rank Adaptation of Large Language Models” arXiv:2106.09685
3)
Dettmers et al. 2023, “QLoRA: Efficient Finetuning of Quantized LLMs” arXiv:2305.14314
4)
Liu et al. 2024, “DoRA: Weight-Decomposed Low-Rank Adaptation” arXiv:2402.09353
5)
Zhang et al. 2023, “AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning” arXiv:2303.10512
6)
Kalajdzievski 2023, “A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA” arXiv:2312.03732
7)
Kopiczko et al. 2024, “VeRA: Vector-based Random Matrix Adaptation” arXiv:2310.11454
8)
Zhao et al. 2024, “GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection” arXiv:2403.03507
9)
Hayou et al. 2024, “LoRA+: Efficient Low Rank Adaptation of Large Models” arXiv:2402.12354
10)
“LoRA-as-Tools” arXiv:2510.15416