====== What Is a LoRA Adapter ======

A **LoRA adapter** (Low-Rank Adaptation) is a lightweight, trainable module that allows fine-tuning of large language models without modifying their original weights. Instead of retraining an entire model, which may have billions of parameters, LoRA freezes the base weights and injects small, trainable matrices that learn task-specific adjustments. ((Source: [[https://arxiv.org/abs/2106.09685|Hu et al. 2021 - LoRA: Low-Rank Adaptation of Large Language Models]]))

===== The Problem LoRA Solves =====

Fine-tuning a large model traditionally requires updating **all** of its parameters. For a model with 70 billion parameters, this demands:

  * Hundreds of gigabytes of GPU memory
  * Expensive multi-GPU clusters
  * Hours or days of training time
  * A separate full copy of the model for each fine-tuned variant

This makes fine-tuning prohibitively expensive for most organizations and individuals. LoRA reduces the trainable parameter count by up to **10,000x** and GPU memory requirements by approximately **3x**, while matching or exceeding full fine-tuning performance. ((Source: [[https://arxiv.org/abs/2106.09685|Hu et al. 2021 - LoRA]]))

===== How LoRA Works =====

LoRA exploits a key insight: when adapting a pre-trained model to a new task, the **weight updates occupy a low-dimensional subspace**. The full weight change matrix does not need to be high-rank.

For a pre-trained weight matrix W of dimensions n x k, LoRA decomposes the update into two smaller matrices:

  * Matrix **A** with dimensions r x k (projects the input down to the low rank)
  * Matrix **B** with dimensions n x r (projects back up to the full dimension)

where **r** (the rank) is much smaller than both n and k, typically 4, 8, or 16.
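The decomposition above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up dimensions, not code from the cited paper; the name ''lora_forward'' is hypothetical.

<code python>
import numpy as np

rng = np.random.default_rng(0)

n, k, r = 64, 32, 8   # output dim, input dim, LoRA rank (assumed for illustration)
alpha = 16            # scaling factor

W = rng.normal(size=(n, k))              # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, k))  # down-projection: small Gaussian init
B = np.zeros((n, r))                     # up-projection: zero init

def lora_forward(x):
    # output = W*x + (alpha/r) * B*A*x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=k)

# Because B starts at zero, the adapter contributes nothing before training:
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters: r*(n+k) = 768 instead of n*k = 2048
print(A.size + B.size, "vs", W.size)
</code>

Note that only A and B would receive gradients during training; W stays fixed, which is where the memory savings come from.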
The forward pass becomes:

  output = W*x + (alpha/r) * B*A*x

Key implementation details:

  * **A** is initialized with small random values (Gaussian)
  * **B** is initialized to zero, so the adapter has no effect at the start of training
  * Only A and B are updated via gradient descent; the base weights W stay frozen
  * A scaling factor **alpha** controls the magnitude of the adaptation ((Source: [[https://www.ml6.eu/en/blog/low-rank-adaptation-a-technical-deep-dive|ML6 - Low-Rank Adaptation Deep Dive]]))

The total trainable parameter count drops from n*k (the full matrix) to r*(n+k), a dramatic reduction when r is small.

===== At Inference Time =====

After training, the adapter matrices can be **merged** into the base weights:

  W_merged = W + (alpha/r) * B*A

This produces a single weight matrix with **zero additional inference latency**: the adapted model runs at the same speed as the original. Alternatively, adapters can be kept separate and swapped dynamically to switch between tasks. ((Source: [[https://arxiv.org/abs/2106.09685|Hu et al. 2021 - LoRA]]))

===== QLoRA =====

**QLoRA** (Quantized LoRA) combines LoRA with 4-bit quantization of the base model weights. This further reduces memory requirements, making it possible to fine-tune a 65-billion-parameter model on a single 48 GB GPU. The base weights are stored in 4-bit precision while the LoRA adapter matrices train in higher precision. ((Source: [[https://www.geeksforgeeks.org/deep-learning/what-is-low-rank-adaptation-lora/|GeeksforGeeks - LoRA]]))

===== Relationship to PEFT =====

LoRA belongs to the family of **PEFT** (Parameter-Efficient Fine-Tuning) methods.
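The merge step from the inference-time discussion above can be verified numerically. The NumPy sketch below (illustrative dimensions, not from the cited sources) checks that the merged matrix reproduces the adapter's output exactly, and that subtracting the adapter recovers the original weights for task switching.

<code python>
import numpy as np

rng = np.random.default_rng(1)
n, k, r, alpha = 64, 32, 8, 16   # assumed dimensions for illustration

W = rng.normal(size=(n, k))  # frozen base weight
A = rng.normal(size=(r, k))  # adapter matrices, as if already trained
B = rng.normal(size=(n, r))

# Merge: W_merged = W + (alpha/r) * B*A -> one matrix, zero extra latency
W_merged = W + (alpha / r) * (B @ A)

x = rng.normal(size=k)
adapter_out = W @ x + (alpha / r) * (B @ (A @ x))
assert np.allclose(W_merged @ x, adapter_out)

# To swap tasks, subtract this adapter and add another:
W_restored = W_merged - (alpha / r) * (B @ A)
assert np.allclose(W_restored, W)
</code>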
PEFT encompasses several approaches for adapting large models with minimal trainable parameters:

  * **LoRA** — Low-rank weight decomposition (no inference overhead)
  * **Prefix tuning** — Trainable prefix tokens prepended to inputs
  * **Adapters** — Small bottleneck layers inserted between transformer blocks
  * **Prompt tuning** — Learnable soft prompts

LoRA is the most widely adopted PEFT method because it adds **no inference latency** after merging and consistently matches full fine-tuning quality. ((Source: [[https://arxiv.org/abs/2106.09685|Hu et al. 2021 - LoRA]]))

For deeper coverage of PEFT methods, see [[peft_and_lora|PEFT and LoRA]].

===== Democratizing AI Customization =====

LoRA's efficiency has fundamentally changed who can customize AI models. Individual researchers, small companies, and hobbyists can now fine-tune state-of-the-art models for specific domains — medical, legal, creative, multilingual — on modest hardware. The adapter files themselves are small (often megabytes rather than gigabytes), making them easy to share, version, and distribute. ((Source: [[https://www.ibm.com/think/topics/lora|IBM - LoRA]]))

===== Key References =====

  * **Hu et al. (2021)** — "LoRA: Low-Rank Adaptation of Large Language Models" (arXiv:2106.09685)
  * **Dettmers et al. (2023)** — "QLoRA: Efficient Finetuning of Quantized LLMs" (arXiv:2305.14314)

===== See Also =====

  * [[peft_and_lora|PEFT and LoRA]]
  * [[inference_economics|Inference Economics]]
  * [[open_weights_vs_open_source|Open-Weights vs Open-Source AI]]

===== References =====