====== What Is a LoRA Adapter ======

A **LoRA adapter** (Low-Rank Adaptation) is a lightweight, trainable module that allows fine-tuning of large language models without modifying their original weights. Instead of retraining an entire model, which may have billions of parameters, LoRA freezes the base weights and injects small, trainable matrices that learn task-specific adjustments. ((Source: [[https://arxiv.org/abs/2106.09685|Hu et al. 2021 - LoRA: Low-Rank Adaptation of Large Language Models]]))

===== The Problem LoRA Solves =====

Fine-tuning a large model traditionally requires updating **all** of its parameters. For a model with 70 billion parameters, this demands:

  * Hundreds of gigabytes of GPU memory
  * Expensive multi-GPU clusters
  * Hours or days of training time
  * A separate full copy of the model for each fine-tuned variant

This makes fine-tuning prohibitively expensive for most organizations and individuals. LoRA reduces the trainable parameter count by up to **10,000x** and GPU memory requirements by approximately **3x**, while matching or exceeding full fine-tuning performance. ((Source: [[https://arxiv.org/abs/2106.09685|Hu et al. 2021 - LoRA]]))

===== How LoRA Works =====

LoRA exploits a key insight: when adapting a pre-trained model to a new task, the **weight updates occupy a low-dimensional subspace**. The full weight change matrix does not need to be high-rank.

For a pre-trained weight matrix W of dimensions n x k, LoRA decomposes the update into two smaller matrices:

  * Matrix **A** with dimensions r x k (projects the input down to the low rank)
  * Matrix **B** with dimensions n x r (projects back up to the full dimension)

where **r** (the rank) is much smaller than both n and k, typically 4, 8, or 16.
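The decomposition above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up dimensions, not code from the cited paper; the name ''lora_forward'' is hypothetical.

<code python>
import numpy as np

rng = np.random.default_rng(0)

n, k, r = 64, 32, 8   # output dim, input dim, LoRA rank (assumed for illustration)
alpha = 16            # scaling factor

W = rng.normal(size=(n, k))              # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, k))  # down-projection: small Gaussian init
B = np.zeros((n, r))                     # up-projection: zero init

def lora_forward(x):
    # output = W*x + (alpha/r) * B*A*x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=k)

# Because B starts at zero, the adapter contributes nothing before training:
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters: r*(n+k) = 768 instead of n*k = 2048
print(A.size + B.size, "vs", W.size)
</code>

Note that only A and B would receive gradients during training; W stays fixed, which is where the memory savings come from.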
The forward pass becomes:

  output = W*x + (alpha/r) * B*A*x

Key implementation details:

  * **A** is initialized with small random values (Gaussian)
  * **B** is initialized to zero, so the adapter has no effect at the start of training
  * Only A and B are updated via gradient descent; the base weights W stay frozen
  * A scaling factor **alpha** controls the magnitude of the adaptation ((Source: [[https://www.ml6.eu/en/blog/low-rank-adaptation-a-technical-deep-dive|ML6 - Low-Rank Adaptation Deep Dive]]))

The total trainable parameter count drops from n*k (the full matrix) to r*(n+k), a dramatic reduction when r is small.

===== At Inference Time =====

After training, the adapter matrices can be **merged** into the base weights:

  W_merged = W + (alpha/r) * B*A

This produces a single weight matrix with **zero additional inference latency**: the adapted model runs at the same speed as the original. Alternatively, adapters can be kept separate and swapped dynamically to switch between tasks. ((Source: [[https://arxiv.org/abs/2106.09685|Hu et al. 2021 - LoRA]]))

===== QLoRA =====

**QLoRA** (Quantized LoRA) combines LoRA with 4-bit quantization of the base model weights. This further reduces memory requirements, making it possible to fine-tune a 65-billion-parameter model on a single 48 GB GPU. The base weights are stored in 4-bit precision while the LoRA adapter matrices train in higher precision. ((Source: [[https://www.geeksforgeeks.org/deep-learning/what-is-low-rank-adaptation-lora/|GeeksforGeeks - LoRA]]))

===== Relationship to PEFT =====

LoRA belongs to the family of **PEFT** (Parameter-Efficient Fine-Tuning) methods.
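The merge step from the inference-time discussion above can be verified numerically. The NumPy sketch below (illustrative dimensions, not from the cited sources) checks that the merged matrix reproduces the adapter's output exactly, and that subtracting the adapter recovers the original weights for task switching.

<code python>
import numpy as np

rng = np.random.default_rng(1)
n, k, r, alpha = 64, 32, 8, 16   # assumed dimensions for illustration

W = rng.normal(size=(n, k))  # frozen base weight
A = rng.normal(size=(r, k))  # adapter matrices, as if already trained
B = rng.normal(size=(n, r))

# Merge: W_merged = W + (alpha/r) * B*A -> one matrix, zero extra latency
W_merged = W + (alpha / r) * (B @ A)

x = rng.normal(size=k)
adapter_out = W @ x + (alpha / r) * (B @ (A @ x))
assert np.allclose(W_merged @ x, adapter_out)

# To swap tasks, subtract this adapter and add another:
W_restored = W_merged - (alpha / r) * (B @ A)
assert np.allclose(W_restored, W)
</code>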
PEFT encompasses several approaches for adapting large models with minimal trainable parameters:

  * **LoRA** — Low-rank weight decomposition (no inference overhead)
  * **Prefix tuning** — Trainable prefix tokens prepended to inputs
  * **Adapters** — Small bottleneck layers inserted between transformer blocks
  * **Prompt tuning** — Learnable soft prompts

LoRA is the most widely adopted PEFT method because it adds **no inference latency** after merging and consistently matches full fine-tuning quality. ((Source: [[https://arxiv.org/abs/2106.09685|Hu et al. 2021 - LoRA]]))

For deeper coverage of PEFT methods, see [[peft_and_lora|PEFT and LoRA]].

===== Democratizing AI Customization =====

LoRA's efficiency has fundamentally changed who can customize AI models. Individual researchers, small companies, and hobbyists can now fine-tune state-of-the-art models for specific domains — medical, legal, creative, multilingual — on modest hardware. The adapter files themselves are small (often megabytes rather than gigabytes), making them easy to share, version, and distribute. ((Source: [[https://www.ibm.com/think/topics/lora|IBM - LoRA]]))

===== Key References =====

  * **Hu et al. (2021)** — "LoRA: Low-Rank Adaptation of Large Language Models" (arXiv:2106.09685)
  * **Dettmers et al. (2023)** — "QLoRA: Efficient Finetuning of Quantized LLMs" (arXiv:2305.14314)

===== See Also =====

  * [[peft_and_lora|PEFT and LoRA]]
  * [[inference_economics|Inference Economics]]
  * [[open_weights_vs_open_source|Open-Weights vs Open-Source AI]]

===== References =====