Parameter-Efficient Fine-Tuning (PEFT) refers to a family of techniques that adapt large pre-trained models to new tasks by updating only 0.01–1% of model parameters while keeping the rest frozen. This dramatically reduces compute, memory, and storage requirements compared to full fine-tuning.1)
Full fine-tuning requires updating every parameter of a model — often billions of weights — for each downstream task. This is prohibitively expensive at scale. PEFT addresses this by identifying a small, task-specific parameter subspace to update.
Why PEFT matters is easiest to see in a direct comparison of the two approaches.
Full Fine-Tuning vs. PEFT:
| Dimension | Full Fine-Tuning | PEFT (e.g. LoRA) |
|---|---|---|
| Trainable params | 100% | 0.01–1% |
| GPU memory (7B model) | ~80 GB (bf16) | ~8–16 GB |
| Training time | Days–weeks | Hours–days |
| Catastrophic forgetting | High risk | Low risk |
| Per-task storage | Full model copy | Small adapter (~10–100 MB) |
| Multi-task serving | One model per task | One base + N adapters |
LoRA2) is the most widely adopted PEFT method. Instead of updating a weight matrix W directly, it injects a low-rank decomposition:
delta_W = B * A

where:
  W is the original frozen weight (d x k)
  B is a trainable matrix (d x r)
  A is a trainable matrix (r x k)
  r << min(d, k)  [the rank]
During the forward pass: h = W_0 x + (B A) x
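A minimal numpy sketch of the decomposition and forward pass above (dimensions are illustrative; LoRA initializes B to zero and A randomly, so the adapter starts as a no-op):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 32, 4                      # illustrative dims, r << min(d, k)

W0 = rng.normal(size=(d, k))             # frozen pretrained weight
B = np.zeros((d, r))                     # trainable, zero-initialized
A = rng.normal(scale=0.01, size=(r, k))  # trainable, small random init

x = rng.normal(size=(k,))

# forward pass: frozen path plus low-rank adapter path
h = W0 @ x + B @ (A @ x)

# with B = 0 at initialization, delta_W = B A = 0 and h equals W0 @ x
assert np.allclose(h, W0 @ x)
```

Note that the adapter path costs two thin matrix products (r x k and d x r) instead of touching the d x k base weight.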
LoRA introduces a scaling hyperparameter alpha applied to the adapter output:
output = W_0 x + (alpha / r) * B * A * x
Typically alpha is set equal to r (so the scale factor is 1), or to 2r for slightly stronger adaptation. This decouples learning rate sensitivity from rank choice.
The rule of thumb is to start with r = 8 and increase only if validation loss stagnates.
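The parameter savings behind this rule of thumb are easy to compute; a single 4096 x 4096 projection (typical for a 7B model) is assumed for illustration:

```python
d = k = 4096                  # one attention projection matrix
full = d * k                  # params touched by full fine-tuning: 16,777,216
for r in (8, 16, 64):
    lora = r * (d + k)        # params in B (d x r) plus A (r x k)
    print(f"r={r}: {lora} trainable params ({lora / full:.4%} of the matrix)")
# at r=8 the adapter holds 65,536 params, under 0.4% of the 16.8M-param matrix
```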
LoRA is typically applied to the attention projection matrices (q_proj, k_proj, v_proj, o_proj).
Applying LoRA to all linear layers generally yields the best results at modest rank.
After training, LoRA adapters can be merged back into the base weights:
W_merged = W_0 + (alpha / r) * B @ A
The merged model is identical in architecture to the original — no extra compute at inference.
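The merge identity can be checked numerically; a numpy sketch with illustrative shapes:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r, alpha = 32, 32, 4, 8

W0 = rng.normal(size=(d, k))
B = rng.normal(size=(d, r))
A = rng.normal(size=(r, k))
x = rng.normal(size=(k,))

scale = alpha / r
# adapter kept separate (training-time form)
h_adapter = W0 @ x + scale * (B @ (A @ x))
# adapter folded into the base weight (inference-time form)
W_merged = W0 + scale * (B @ A)
h_merged = W_merged @ x

assert np.allclose(h_adapter, h_merged)  # identical outputs, zero extra compute
```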
QLoRA3) combines LoRA with 4-bit NF4 (NormalFloat4) quantization of the frozen base model weights, plus double quantization of the quantization constants. This enables fine-tuning a 65B parameter model on a single 48 GB GPU — previously impossible.
Key innovations: the NF4 data type (information-theoretically optimal for normally distributed weights), double quantization of the per-block quantization constants, and paged optimizers to absorb memory spikes during training.
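A simplified numpy sketch of block-wise absmax quantization, the structure underlying QLoRA's scheme. Two deliberate simplifications: a uniform signed 4-bit grid is used where real NF4 uses quantiles of a standard normal, and the per-block scales are left in float where QLoRA quantizes them again (double quantization):

```python
import numpy as np

def quantize_4bit(w, block=64):
    # each block of 64 values gets its own absmax scale
    flat = w.ravel().astype(np.float64)
    pad = (-flat.size) % block
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12
    codes = np.round(blocks / scales * 7).astype(np.int8)  # grid in [-7, 7]
    return codes, scales, w.shape, pad

def dequantize_4bit(codes, scales, shape, pad):
    flat = (codes / 7.0 * scales).ravel()
    if pad:
        flat = flat[:-pad]
    return flat.reshape(shape)

rng = np.random.default_rng(2)
W = rng.normal(size=(128, 128))
codes, scales, shape, pad = quantize_4bit(W)
W_hat = dequantize_4bit(codes, scales, shape, pad)
# the frozen base stays quantized; the LoRA delta is trained in bf16 on top
```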
DoRA4) decomposes weight updates into magnitude and direction components, applying LoRA only to the directional component. This typically outperforms LoRA at the same rank, especially on reasoning tasks.
AdaLoRA5) dynamically allocates rank budget across weight matrices based on an SVD-based importance score. Less important layers receive rank 0 (pruned); critical layers receive higher rank. Achieves better performance per trainable parameter.
rsLoRA6) changes the scaling from alpha/r to alpha/sqrt(r), which stabilizes training at higher ranks and allows effective use of r=128+ without learning rate collapse.
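The difference between the two scalings grows with rank; alpha = 16 is assumed for illustration:

```python
import math

alpha = 16
for r in (8, 32, 128):
    print(f"r={r}: alpha/r = {alpha / r}, alpha/sqrt(r) = {alpha / math.sqrt(r):.4f}")
# as r goes 8 -> 128, alpha/r shrinks 16x (suppressing the adapter),
# while alpha/sqrt(r) shrinks only 4x, keeping high-rank updates effective
```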
VeRA7) shares frozen random matrices B and A across all layers and trains only small per-layer scaling vectors. Reduces trainable parameters by another ~10x vs. LoRA.
GaLore8) projects gradients into a low-rank subspace during full fine-tuning rather than restricting weight updates. Enables full fine-tuning memory efficiency without changing the model architecture.
LoRA+9) sets different learning rates for the A and B matrices (typically lr_B = 16x lr_A), which better matches the optimal signal-propagation regime and often improves convergence speed.
| Method | Trainable Params | Performance | Inference Overhead | Notes |
|---|---|---|---|---|
| LoRA | 0.1–1% | High | None (mergeable) | Best general choice |
| QLoRA | 0.1–1% | High | None (mergeable) | LoRA + 4-bit base; max memory savings |
| DoRA | 0.1–1% | Higher than LoRA | None (mergeable) | Better on reasoning; slightly slower train |
| AdaLoRA | 0.1–1% | High | None (mergeable) | Dynamic rank; best param efficiency |
| Prefix Tuning | 0.1% | Medium | Low (+prefix tokens) | Prepends learned tokens; no merging |
| Prompt Tuning | < 0.01% | Low–Medium | Low (+prompt tokens) | Minimal params; underperforms at small scale |
| Adapters | 0.5–3% | High | Low (serial layers) | Proven; inference latency from extra layers |
| IA3 | < 0.01% | Medium | None (mergeable) | Learns rescaling vectors; very few params |
| VeRA | < 0.01% | Medium–High | None (mergeable) | Minimal storage; shared random matrices |
The Hugging Face PEFT library is the de-facto standard. Supports LoRA, QLoRA, DoRA, AdaLoRA, Prefix Tuning, Prompt Tuning, IA3, and more across transformers models.
```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622
```
Unsloth provides hand-written Triton kernels for LoRA/QLoRA, achieving 2x faster training with 60% less memory vs. standard PEFT. Drop-in compatible with Hugging Face.
NVIDIA NeMo provides enterprise-grade PEFT pipelines with multi-GPU support, mixed precision, and deployment to Triton Inference Server.
BitsAndBytes provides the 4-bit and 8-bit quantization kernels underlying QLoRA. Required dependency for QLoRA workflows.
TRL by Hugging Face integrates PEFT with SFT (SFTTrainer), RLHF, DPO, and PPO training loops. The standard toolkit for alignment fine-tuning.
Axolotl is a config-driven fine-tuning framework supporting LoRA/QLoRA across Llama, Mistral, Mixtral, Falcon, and others. Emphasis on reproducibility via YAML configs.
LLaMA-Factory provides a unified training interface with WebUI for LoRA/QLoRA fine-tuning across 100+ models, with built-in dataset preprocessing and evaluation.
PEFT is particularly well-suited for fine-tuning foundation models into specialized agents:
Fine-tuning on tool-use datasets (function schemas + examples) with LoRA teaches models to produce syntactically correct JSON tool invocations. Small LoRA adapters (r=8) suffice for models already familiar with function calling; higher rank helps for models trained without it.
LoRA fine-tuning on domain-specific schemas trains reliable structured JSON/XML output — critical for agents that must interface with APIs and databases without post-hoc parsing workarounds.
QLoRA enables adapting 13B–70B models to specialized domains (legal, medical, code review) on consumer hardware, producing agents that outperform GPT-4 on narrow benchmarks at a fraction of inference cost.
LoRA adapters can encode distilled knowledge from a larger teacher model — the student is fine-tuned via LoRA on the teacher's outputs (often combined with DPO or KTO), compressing capability into a smaller deployable model.
Frameworks like vLLM and S-LoRA support serving hundreds of LoRA adapters on a single GPU cluster using a shared base model. Each user/task/tenant gets their own adapter, enabling personalization at scale with near-zero marginal cost per adapter.
The LoRA-as-Tools pattern10) treats individual LoRA adapters as callable tools within an agent architecture. A router model selects which LoRA adapter to activate per inference step, enabling compositional specialization: one adapter for code generation, one for retrieval formatting, one for safety filtering — dynamically composed at runtime without multi-model overhead.
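A toy numpy sketch of this routing idea, with the learned router reduced to a keyword lookup; all adapter names and the `route` heuristic are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
d = k = 16
W0 = rng.normal(size=(d, k))          # shared frozen base weight

# one (B, A) pair per specialization, each tiny compared to W0
adapters = {
    "code":   (rng.normal(size=(d, 2)), rng.normal(size=(2, k))),
    "safety": (rng.normal(size=(d, 2)), rng.normal(size=(2, k))),
}

def route(task):
    # stand-in for a learned router model: pick an adapter by task label
    return "safety" if "filter" in task else "code"

def forward(x, task):
    B, A = adapters[route(task)]
    return W0 @ x + B @ (A @ x)       # base path + dynamically selected adapter

x = rng.normal(size=(k,))
y_code = forward(x, "generate a function")
y_safe = forward(x, "filter this output")
assert not np.allclose(y_code, y_safe)  # different adapters, different behavior
```

The base weight is shared across every call; only the small (B, A) pair swaps per step, which is what keeps the marginal cost per adapter near zero.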