LoRA

Resources

Main Idea

We fine-tune a weight matrix by adding a low-rank matrix to it, and this low-rank matrix is stored efficiently as a product of two small factors. Let $W_0 \in \mathbb{R}^{d \times k}$ be the matrix we wish to fine-tune. We construct a rank-$r$ update to this matrix ($r \ll \min(d, k)$) as $\Delta W = BA$, with $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$. This was famously applied to the Transformer architecture. Overall, the technique makes fine-tuning less compute- and memory-intensive. It was motivated by the observation that the change to $W_0$ during adaptation often has a low intrinsic rank. Other methods instead select specific layers (such as the output layers) to fine-tune; heuristically, LoRA is more efficient, with more representational capacity per trainable parameter (or so I'd imagine).
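To make the savings concrete, a worked count (the $4096$ dimension and $r = 8$ are illustrative choices, not from the paper): full fine-tuning of $W_0$ trains $dk$ parameters, while LoRA trains only $r(d + k)$. For a square $4096 \times 4096$ projection,

$$dk = 4096 \cdot 4096 \approx 16.8\text{M}, \qquad r(d + k) = 8 \cdot (4096 + 4096) = 65{,}536,$$

a $256\times$ reduction in trainable parameters for that matrix.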

The update is given formally as

$$\Delta W = \frac{\alpha}{r} BA,$$

where, for instance, $r = 4$ and $\alpha = 16$ (or maybe $\alpha = 0.1$), and we initialize $A$ randomly (Kaiming) and $B$ as zeros, so that $\Delta W = 0$ at the start of training.
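A minimal sketch of such a layer in PyTorch; the class name `LoRALinear` and the default hyperparameters are my own, so treat this as an illustration rather than the reference implementation:

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (sketch)."""

    def __init__(self, d: int, k: int, r: int = 4, alpha: float = 16.0):
        super().__init__()
        # Frozen pre-trained weight W0 (here just randomly initialized).
        self.W0 = nn.Parameter(torch.randn(d, k), requires_grad=False)
        # A is initialized randomly (Kaiming), B as zeros, so ΔW = 0 at step 0.
        self.A = nn.Parameter(torch.empty(r, k))
        self.B = nn.Parameter(torch.zeros(d, r))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = (W0 + (alpha / r) * B A) x
        W = self.W0 + self.scale * (self.B @ self.A)
        return x @ W.T
```

In practice one computes $x A^\top B^\top$ rather than materializing $W_0 + \Delta W$, keeping the extra per-step cost proportional to $r$; the forward pass above forms $W$ only for clarity.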

The limited capacity of $\Delta W$ might reduce the risk of catastrophic forgetting in practice. But if $\operatorname{rank}(W_0) \le r$, then $\Delta W$ can entirely overwrite $W_0$; in other words, we are still capable of the most extreme case of catastrophic forgetting.
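A small sketch of this overwrite case (the shapes are my own, and the $\alpha/r$ scaling is absorbed into $B$): if $\operatorname{rank}(W_0) \le r$, a truncated SVD factors $-W_0$ exactly as $BA$, so the merged weight $W_0 + BA$ vanishes.

```python
import torch

d, k, r = 32, 32, 4
# Construct a W0 whose rank is exactly r (rank <= r suffices).
W0 = torch.randn(d, r) @ torch.randn(r, k)

# Factor -W0 = B A via its (rank-r) truncated SVD.
U, S, Vh = torch.linalg.svd(-W0)
B = U[:, :r] * S[:r]   # shape (d, r), columns scaled by singular values
A = Vh[:r, :]          # shape (r, k)

# The LoRA update exactly cancels W0: total forgetting is representable.
print(torch.allclose(W0 + B @ A, torch.zeros(d, k), atol=1e-3))  # True
```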

ReLoRA

ReLoRA notes that we can make $\operatorname{rank}(\Delta W) > r$ while, at any given time, still training only two small matrices as before (i.e., the same memory footprint). The update is formed as

$$\Delta W = \frac{\alpha}{r} \sum_{i=1}^{N} B_i A_i.$$

At certain points during training, we merge $B_i A_i$ into the frozen weights and begin training $B_{i+1}$ and $A_{i+1}$. The optimizer state must also be reset, as the accumulated moments in Adam would otherwise push $B_{i+1}$ and $A_{i+1}$ to align too closely with $B_i$ and $A_i$. The authors also use a jagged learning-rate schedule that resets to zero at each restart and warms back up. With a few other details, ReLoRA can go beyond fine-tuning and replace the original pre-training procedure, requiring a significantly smaller memory footprint for the vast majority of training. This revolves around the idea of "local low-rank training."
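A sketch of one such restart, reusing the hypothetical `LoRALinear` (and imports) from the earlier sketch; the paper's partial optimizer reset, which magnitude-prunes most of the state, is simplified here to a full reset:

```python
import torch

@torch.no_grad()
def relora_restart(layer: LoRALinear, optimizer: torch.optim.Optimizer) -> None:
    # 1. Merge the current low-rank update B_i A_i into the frozen weights.
    layer.W0 += layer.scale * (layer.B @ layer.A)

    # 2. Reinitialize for the next cycle: A_{i+1} random (Kaiming), B_{i+1}
    #    zeros, so the model's function is unchanged at the restart.
    nn.init.kaiming_uniform_(layer.A, a=math.sqrt(5))
    layer.B.zero_()

    # 3. Reset the optimizer state so stale Adam moments don't pull the new
    #    factors into alignment with the old ones. (The paper prunes most of
    #    the state rather than discarding all of it.)
    optimizer.state.clear()
```

After each restart the learning rate warms back up from zero, per the jagged schedule.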