LoRA
Definition
LoRA (Low-Rank Adaptation)
LoRA is a parameter-efficient fine-tuning (PEFT) method that adapts a large pretrained model to a downstream task without updating its original weights. Instead of fine-tuning a weight matrix $W_0 \in \mathbb{R}^{d \times k}$ directly, LoRA freezes $W_0$ and learns a low-rank additive update $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with rank $r \ll \min(d, k)$. Only $A$ and $B$ are trained, so the number of trainable parameters drops by orders of magnitude.
In this course LoRA is the mechanism that makes the LLM-as-RS formulation tractable: a frozen LLM backbone plus a small trainable LoRA adapter is fine-tuned on recommendation data (e.g. TALLRec, LLaRA, CoLLM).
Intuition
Fine-tune the change, not the weights — and keep it skinny
Full fine-tuning of a billion-parameter LLM means learning a dense update $\Delta W$ the same size as $W_0$: expensive in compute, memory, and storage (one full copy of the model per task). The empirical observation behind LoRA is that the update needed to adapt a model has a low "intrinsic rank": the meaningful change lives in a tiny subspace. So instead of a full $d \times k$ matrix, we factor the update through a narrow $r$-dimensional bottleneck, $\Delta W = BA$. With $r$ as small as 4–16, you train a fraction of a percent of the parameters yet recover most of the quality of full fine-tuning.
For recommendation this is the enabling trick: a vanilla LLM never saw click signal, Top-K ranking, or item structure during pretraining (the alignment problem). LoRA lets you inject that recommendation-specific signal cheaply while leaving the frozen LLM’s world knowledge intact — the snowflake (frozen LLM) + flame (trainable LoRA) picture from the LLM-as-RS slides.
Mathematical Formulation
Low-Rank Weight Update
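For a pretrained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA replaces the forward pass $h = W_0 x$ with

$$h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x$$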
where:
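- $W_0 \in \mathbb{R}^{d \times k}$: frozen pretrained weights (not updated)
- $A \in \mathbb{R}^{r \times k}$: trainable down-projection, initialized random Gaussian
- $B \in \mathbb{R}^{d \times r}$: trainable up-projection, initialized to $0$
- $r$: rank of the adapter, $r \ll \min(d, k)$ (the bottleneck width)
- $\alpha$: scaling constant; the update is multiplied by $\alpha / r$, so its magnitude stays roughly independent of $r$
- $x \in \mathbb{R}^{k}$: layer input; $h \in \mathbb{R}^{d}$: layer output
Because $B$ is initialized to $0$, $\Delta W = BA = 0$ at the start, so training begins exactly at the pretrained model and only learns to deviate as needed.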
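A minimal pure-Python sketch of the forward pass and initialization above, assuming a single linear layer stored as lists of rows. The class `LoRALinear`, the helper `matvec`, and the Gaussian standard deviation 0.02 are hypothetical choices for illustration, not any library's API. It checks the key property: with $B = 0$, the adapted layer initially reproduces $W_0 x$.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * vj for m, vj in zip(row, v)) for row in M]

class LoRALinear:
    """Hypothetical toy layer: h = W0 x + (alpha / r) * B (A x), with W0 frozen."""

    def __init__(self, W0, r, alpha, seed=0):
        rng = random.Random(seed)
        d, k = len(W0), len(W0[0])
        self.W0 = W0                        # frozen pretrained weights (d x k), never updated
        self.scale = alpha / r              # alpha / r rescaling of the update
        # A: trainable down-projection (r x k), random Gaussian init (std 0.02 is an assumption)
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(k)] for _ in range(r)]
        # B: trainable up-projection (d x r), zero init, so B A = 0 at the start
        self.B = [[0.0] * r for _ in range(d)]

    def forward(self, x):
        base = matvec(self.W0, x)           # frozen path: W0 x
        bottleneck = matvec(self.A, x)      # r-dimensional projection: A x
        delta = matvec(self.B, bottleneck)  # back up to d dimensions: B (A x)
        return [b + self.scale * u for b, u in zip(base, delta)]

W0 = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]   # toy pretrained weights, d = 2, k = 3
layer = LoRALinear(W0, r=1, alpha=2)
x = [1.0, 1.0, 1.0]
print(layer.forward(x), matvec(W0, x))    # [3.0, 0.0] [3.0, 0.0]: the adapter starts as a no-op
```

Computing $B(Ax)$ rather than forming $BA$ first keeps the work proportional to $r(d + k)$ instead of $dk$, which is the point of the bottleneck.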
Trainable Parameter Count
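A full update $\Delta W$ has $d \times k$ parameters; the LoRA update has only $r(d + k)$. For example, with $d = k = 4096$ and $r = 8$ (illustrative values): $8 \cdot (4096 + 4096) = 65{,}536$ trainable parameters vs $4096^2 \approx 16.8$M, a $256\times$ reduction per adapted matrix.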
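The count can be checked with a short Python sketch. The function name `lora_param_counts` is hypothetical, and the square $4096 \times 4096$ projection is an assumed size rather than a specific model; the loop shows how the reduction shrinks as $r$ grows.

```python
def lora_param_counts(d, k, r):
    """Trainable parameters for one adapted d x k matrix: dense update vs. B (d x r) + A (r x k)."""
    full = d * k
    lora = d * r + r * k   # = r * (d + k)
    return full, lora

# Illustrative square projection (d = k = 4096); the sizes are assumptions, not a specific model.
for r in (4, 8, 16):
    full, lora = lora_param_counts(4096, 4096, r)
    print(f"r={r}: full={full:,} lora={lora:,} reduction={full // lora}x")
# r=4: full=16,777,216 lora=32,768 reduction=512x
# r=8: full=16,777,216 lora=65,536 reduction=256x
# r=16: full=16,777,216 lora=131,072 reduction=128x
```

The reduction factor is $dk / (r(d + k))$, which for square matrices simplifies to $d / (2r)$.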
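In the LLM-as-RS training objective, the LoRA adapter is optimized with the same next-token / instruction loss as the LLM, e.g. supervised fine-tuning on instances like "Given the user's liked/disliked items and a target item, answer Yes/No":

$$\max_{A, B} \sum_{(x, y) \in \mathcal{D}} \sum_{t=1}^{|y|} \log P_{\Phi_0 + \Delta\Phi(A, B)}\left(y_t \mid x, y_{<t}\right)$$

where $\mathcal{D}$ is the set of (instruction $x$, target response $y$) pairs, $\Phi_0$ denotes the frozen LLM parameters (including every $W_0$), and $\Delta\Phi(A, B)$ the LoRA updates. The gradient flows only into $A$ and $B$; $W_0$ stays frozen. In LLM-as-RS the target $y$ is text (the Yes/No answer or item titles); more generally, the adapted LLM carries the injected collaborative signal.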
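To illustrate that gradients reach only $A$ and $B$, here is a toy SGD loop on one LoRA layer. As a deliberate simplification, it swaps the next-token log-likelihood for a squared-error loss on a single $(x, y)$ pair, and fixes $A$ to hand-picked values instead of a Gaussian draw so the run is reproducible; `lora_sgd_step` is a hypothetical helper, and real training would use autograd on the LLM's token loss. It also exposes a detail of the zero initialization: since $\partial L / \partial A = \frac{\alpha}{r} B^\top (h - y) x^\top$, only $B$ moves on the first step.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * vj for m, vj in zip(row, v)) for row in M]

def lora_sgd_step(W0, A, B, x, y, scale, lr):
    """One SGD step on 0.5 * ||h - y||^2 with h = W0 x + scale * B (A x).

    Squared error stands in for the next-token loss (a simplification).
    Returns (new_A, new_B, loss_before_step); W0 is read but never updated.
    """
    ax = matvec(A, x)                                              # A x, length r
    h = [w + scale * u for w, u in zip(matvec(W0, x), matvec(B, ax))]
    err = [hi - yi for hi, yi in zip(h, y)]                        # dL/dh
    loss = 0.5 * sum(e * e for e in err)
    # dL/dB = scale * err (A x)^T
    grad_B = [[scale * e * a for a in ax] for e in err]
    # dL/dA = scale * (B^T err) x^T, which is zero while B is still zero
    bt_err = [sum(B[i][j] * err[i] for i in range(len(err))) for j in range(len(ax))]
    grad_A = [[scale * g * xl for xl in x] for g in bt_err]
    new_A = [[a - lr * g for a, g in zip(ra, rg)] for ra, rg in zip(A, grad_A)]
    new_B = [[b - lr * g for b, g in zip(rb, rg)] for rb, rg in zip(B, grad_B)]
    return new_A, new_B, loss

W0 = [[1.0, 0.0], [0.0, 1.0]]    # frozen toy weights (d = k = 2)
A0 = [[0.5, 0.25]]               # r = 1; fixed values stand in for a Gaussian draw
B0 = [[0.0], [0.0]]              # zero init, as in LoRA
x, y = [1.0, 2.0], [2.0, 1.0]    # one toy (input, target) pair

A, B, first_loss = lora_sgd_step(W0, A0, B0, x, y, scale=1.0, lr=0.1)
print(A == A0, B)                # True [[0.1], [-0.1]]: only B moves on step one
losses = [first_loss]
for _ in range(49):
    A, B, loss = lora_sgd_step(W0, A, B, x, y, scale=1.0, lr=0.1)
    losses.append(loss)
print(losses[0], losses[-1] < losses[0])   # 1.0 True
```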
Key Properties / Variants
- No inference latency. After training, $BA$ can be merged into the frozen weights: $W = W_0 + \frac{\alpha}{r} BA$. The merged model has exactly the same shape as the original, so unlike adapter-layer methods LoRA adds no extra latency at serving time.
- Cheap task switching. Each task is just a small $(A, B)$ pair per adapted matrix (megabytes, not gigabytes). You keep one frozen backbone and swap adapters, which suits multi-domain / multi-task recommendation.
- Where it is applied. Typically injected into the attention projection matrices ($W_q, W_k, W_v, W_o$) of each Transformer block; sometimes the feed-forward layers too.
- Hyperparameters. Rank $r$ trades capacity against cost; the scaling $\alpha$ controls the update magnitude. Both are small (e.g. $r \in \{4, 8, 16\}$).
- In the LLM-based GR taxonomy, LoRA appears in two of the three alignment paradigms:
  - Text prompting (paradigm ①): TALLRec uses lightweight LoRA fine-tuning on natural-language preference instructions.
  - Inject collaborative signal (paradigm ②): iLoRA, LLaRA, and CoLLM project a learned CF embedding into the LLM's token-embedding space and fine-tune a LoRA adapter on top of the frozen LLM. iLoRA additionally customizes the adapter per instance.
- Relation to other PEFT. LoRA belongs to the broader parameter-efficient fine-tuning family (alongside prefix-tuning, prompt-tuning, soft prompts, and adapter layers); it is the most widely used because merged LoRA weights add no inference latency.
- Contrast with full fine-tuning / SFT. LoRA can implement SFT cheaply, and is compatible with later preference optimization or RL stages on the same frozen backbone.
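The merge and adapter-swap properties can be checked numerically with a small sketch, assuming toy $2 \times 2$ weights and two hypothetical rank-1 task adapters (`movies`, `books`) whose values are made up for illustration. For each task, the merged weight $W_0 + \frac{\alpha}{r} BA$ gives the same output as the unmerged two-path forward, and $W_0$ itself is never modified.

```python
def matmul(P, Q):
    """Multiply matrices given as lists of rows: (n x m) @ (m x p)."""
    return [[sum(P[i][t] * Q[t][j] for t in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * vj for m, vj in zip(row, v)) for row in M]

def merge(W0, A, B, scale):
    """Return a new matrix W0 + scale * (B @ A); W0 itself is left untouched."""
    BA = matmul(B, A)
    return [[w + scale * u for w, u in zip(rw, ru)] for rw, ru in zip(W0, BA)]

def unmerged_forward(W0, A, B, scale, x):
    """Two-path forward used during training: W0 x + scale * B (A x)."""
    delta = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(matvec(W0, x), delta)]

W0 = [[1.0, 2.0], [3.0, 4.0]]           # frozen toy backbone weight (d = k = 2)
adapters = {                             # hypothetical rank-1 adapters: (A, B) per task
    "movies": ([[1.0, 0.0]], [[0.5], [0.0]]),
    "books": ([[0.0, 1.0]], [[0.0], [0.25]]),
}
x = [1.0, 1.0]
scale = 2.0                              # alpha / r with alpha = 2, r = 1 (illustrative)
for task, (A, B) in adapters.items():
    W = merge(W0, A, B, scale)           # swap tasks by merging a different (A, B)
    print(task, matvec(W, x), unmerged_forward(W0, A, B, scale, x))
# movies [4.0, 7.0] [4.0, 7.0]
# books [3.0, 7.5] [3.0, 7.5]
```

Serving the merged $W$ is a single matrix-vector product, which is where the no-extra-latency property comes from; switching tasks means recomputing the merge from the frozen $W_0$ with a different $(A, B)$ pair. The exact float equality here relies on the chosen values being exactly representable; in general, compare with a tolerance.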
Connections
- Subtype of: Parameter-Efficient Fine-Tuning
- Enables: LLM-as-RS (frozen LLM + trainable LoRA adapter)
- Applied to: Transformer attention projections in Large Language Models
- Training objective: Supervised Fine-Tuning, optionally followed by DPO / RL
- Alternative alignment routes that avoid text-only adapters: Item Tokenization → Semantic IDs (the SID-based GR formulation)
- Contrasts with: full fine-tuning, in-context learning (zero-training prompting)