LoRA

Resources

Main Idea

We fine-tune a weight matrix by adding a low-rank matrix to it, and this low-rank matrix is stored efficiently as a product of two small factors. Let $W_0 \in \mathbb{R}^{d \times k}$ be the matrix we wish to fine-tune. We construct a rank-$r$ update to this matrix ($r \ll \min(d, k)$) as $\Delta W = BA$, with $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$. This was famously applied to the Transformer architecture. Overall, the technique makes fine-tuning less compute- and memory-intensive. It was motivated by the observation that the change to $W_0$ during adaptation often has a low intrinsic rank. Other methods instead select specific layers (such as the output layers) to fine-tune; heuristically, LoRA is more efficient, with more representational capacity per trainable parameter (or so I'd imagine).
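To make the savings concrete, a worked count (the $4096$ dimension and $r = 8$ are illustrative choices, not from the paper): full fine-tuning of $W_0$ trains $dk$ parameters, while LoRA trains only $r(d + k)$. For a square $4096 \times 4096$ projection,

$$dk = 4096 \cdot 4096 \approx 16.8\text{M}, \qquad r(d + k) = 8 \cdot (4096 + 4096) = 65{,}536,$$

a $256\times$ reduction in trainable parameters for that matrix.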

The update is given formally as

$$\Delta W = \frac{\alpha}{r} BA,$$

where, for instance, $r = 4$ and $\alpha = 16$ (or maybe $\alpha = 0.1$), and we initialize $A$ randomly (Kaiming) and $B$ as zeros, so that $\Delta W = 0$ at the start of training.
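A minimal sketch of such a layer in PyTorch; the class name `LoRALinear` and the default hyperparameters are my own, so treat this as an illustration rather than the reference implementation:

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (sketch)."""

    def __init__(self, d: int, k: int, r: int = 4, alpha: float = 16.0):
        super().__init__()
        # Frozen pre-trained weight W0 (here just randomly initialized).
        self.W0 = nn.Parameter(torch.randn(d, k), requires_grad=False)
        # A is initialized randomly (Kaiming), B as zeros, so ΔW = 0 at step 0.
        self.A = nn.Parameter(torch.empty(r, k))
        self.B = nn.Parameter(torch.zeros(d, r))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = (W0 + (alpha / r) * B A) x
        W = self.W0 + self.scale * (self.B @ self.A)
        return x @ W.T
```

In practice one computes $x A^\top B^\top$ rather than materializing $W_0 + \Delta W$, keeping the extra per-step cost proportional to $r$; the forward pass above forms $W$ only for clarity.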

The limited capacity of $\Delta W$ might reduce the risk of catastrophic forgetting in practice. But if $\operatorname{rank}(W_0) \le r$, then $\Delta W$ can entirely overwrite $W_0$; in other words, we are still capable of the most extreme case of catastrophic forgetting.
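A small sketch of this overwrite case (the shapes are my own, and the $\alpha/r$ scaling is absorbed into $B$): if $\operatorname{rank}(W_0) \le r$, a truncated SVD factors $-W_0$ exactly as $BA$, so the merged weight $W_0 + BA$ vanishes.

```python
import torch

d, k, r = 32, 32, 4
# Construct a W0 whose rank is exactly r (rank <= r suffices).
W0 = torch.randn(d, r) @ torch.randn(r, k)

# Factor -W0 = B A via its (rank-r) truncated SVD.
U, S, Vh = torch.linalg.svd(-W0)
B = U[:, :r] * S[:r]   # shape (d, r), columns scaled by singular values
A = Vh[:r, :]          # shape (r, k)

# The LoRA update exactly cancels W0: total forgetting is representable.
print(torch.allclose(W0 + B @ A, torch.zeros(d, k), atol=1e-3))  # True
```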

ReLoRA

ReLoRA notes that we can make $\operatorname{rank}(\Delta W) > r$ while, at any given time, still training only two small matrices as before (i.e., the same memory footprint). The update is formed as

$$\Delta W = \frac{\alpha}{r} \sum_{i=1}^{N} B_i A_i.$$

At certain points during training, we merge $B_i A_i$ into the frozen weights and begin training $B_{i+1}$ and $A_{i+1}$. The optimizer state must also be reset, as the accumulated moments in Adam would otherwise push $B_{i+1}$ and $A_{i+1}$ to align too closely with $B_i$ and $A_i$. The authors also use a jagged learning-rate schedule that resets to zero at each restart and warms back up. With a few other details, ReLoRA can go beyond fine-tuning and replace the original pre-training procedure, requiring a significantly smaller memory footprint for the vast majority of training. This revolves around the idea of "local low-rank training."
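A sketch of one such restart, reusing the hypothetical `LoRALinear` (and imports) from the earlier sketch; the paper's partial optimizer reset, which magnitude-prunes most of the state, is simplified here to a full reset:

```python
import torch

@torch.no_grad()
def relora_restart(layer: LoRALinear, optimizer: torch.optim.Optimizer) -> None:
    # 1. Merge the current low-rank update B_i A_i into the frozen weights.
    layer.W0 += layer.scale * (layer.B @ layer.A)

    # 2. Reinitialize for the next cycle: A_{i+1} random (Kaiming), B_{i+1}
    #    zeros, so the model's function is unchanged at the restart.
    nn.init.kaiming_uniform_(layer.A, a=math.sqrt(5))
    layer.B.zero_()

    # 3. Reset the optimizer state so stale Adam moments don't pull the new
    #    factors into alignment with the old ones. (The paper prunes most of
    #    the state rather than discarding all of it.)
    optimizer.state.clear()
```

After each restart the learning rate warms back up from zero, per the jagged schedule.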