Hu, Edward J., Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. “LoRA: Low-Rank Adaptation of Large Language Models.” arXiv. http://arxiv.org/abs/2106.09685.
We fine-tune a matrix by adding a low-rank matrix to it, and this low-rank matrix is stored efficiently. Let $W_0 \in \mathbb{R}^{d \times k}$ be the matrix we wish to fine-tune. We construct a rank-$r$ update to this matrix ($\Delta W$) as $\Delta W = BA$, with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$. This was famously applied to the Transformer architecture. Overall, this technique makes fine-tuning less compute- and memory-intensive. It was motivated by noting that $\Delta W$ often had a low intrinsic rank. Other methods instead select specific layers (such as the output layers) to fine-tune; heuristically, LoRA is more efficient, with more representational capacity per parameter (or so I'd imagine).
The update is given formally as
$$h = W_0 x + \Delta W x = W_0 x + BAx,$$
where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with $r \ll \min(d, k)$, and we initialize $A$ randomly (Kaiming) and $B$ as zeros.
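As a concrete sketch (my own, not the authors' code), here is the LoRA forward pass in NumPy with hypothetical dimensions; because $B$ starts at zero, the adapted layer initially matches the frozen layer exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 32, 4  # hypothetical dimensions, with r << min(d, k)

W0 = rng.standard_normal((d, k))                     # frozen pretrained weight
A = rng.standard_normal((r, k)) * np.sqrt(2.0 / k)   # Kaiming-style random init
B = np.zeros((d, r))                                 # zeros, so B @ A = 0 at start

def lora_forward(x):
    # h = W0 x + B A x; only A and B receive gradients during fine-tuning
    return W0 @ x + B @ (A @ x)

x = rng.standard_normal(k)
# At initialization the adapted layer reproduces the frozen layer.
assert np.allclose(lora_forward(x), W0 @ x)
```

Note the parameter count: training $B$ and $A$ costs $r(d + k)$ parameters instead of $dk$ for full fine-tuning.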
The limited capacity of $BA$ might reduce the risk of catastrophic forgetting in practice. But if $r = \min(d, k)$, then $BA$ can entirely overwrite $W_0$; in other words, we remain capable of the most extreme case of catastrophic forgetting.
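To see the extreme case concretely: when $r = k$ we can pick $A = I$ and $B = W_{\text{target}} - W_0$, so $W_0 + BA$ becomes any matrix at all (a toy NumPy check, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 8, 5
r = k  # full rank: no capacity constraint left

W0 = rng.standard_normal((d, k))
W_target = rng.standard_normal((d, k))  # arbitrary replacement weights

A = np.eye(k)          # rank-k "adapter"
B = W_target - W0      # chosen to overwrite W0 entirely

# The "low-rank" update has fully replaced the pretrained weights.
assert np.allclose(W0 + B @ A, W_target)
```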
ReLoRA
ReLoRA notes that we can make $\mathrm{rank}(\Delta W) > r$ while at any given time still only training two small matrices as before (i.e. the same memory footprint). The update is formed as
$$W = W_0 + \sum_{i=1}^{N} B_i A_i,$$
where each product $B_i A_i$ has rank at most $r$, so the cumulative update can reach rank up to $Nr$.
At certain points in training, we incorporate the current $B_i$ and $A_i$ into the frozen weights and begin training a freshly initialized $B_{i+1}$ and $A_{i+1}$. The optimizer state must also be (partially) reset, as the moments in Adam would otherwise push $B_{i+1}$ and $A_{i+1}$ to align too closely with $B_i$ and $A_i$. The authors also use a learning rate schedule that restarts near zero after each merge, followed by a quick warm-up. With some other details, ReLoRA can go beyond fine-tuning and replace the original pre-training procedure, requiring a significantly smaller memory footprint for the vast majority of training. This revolves around the idea of "local low-rank training."
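A minimal sketch of the merge-and-restart loop (my own simplification: the training step is a stand-in, and the real method also prunes optimizer states and uses the jagged learning rate schedule mentioned above):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, r, restarts = 16, 16, 2, 3

W = rng.standard_normal((d, k))  # weights, frozen between restarts

def new_adapter():
    # Fresh LoRA pair: A random (Kaiming-style), B zero.
    A = rng.standard_normal((r, k)) * np.sqrt(2.0 / k)
    B = np.zeros((d, r))
    return B, A

max_total_rank = 0
for i in range(restarts):
    B, A = new_adapter()
    # ... train B and A here (gradient steps on B, A only) ...
    B += rng.standard_normal(B.shape) * 0.01  # stand-in for training
    W = W + B @ A           # merge the low-rank update into the frozen weights
    max_total_rank += r     # the cumulative update can now exceed rank r
    # Optimizer moments for B, A would be reset here, and the learning
    # rate restarted near zero, before the next cycle begins.

assert max_total_rank == restarts * r  # up to rank N*r overall
```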