A parameter-efficient method for fine-tuning large pre-trained models using low-rank matrices.
LoRA (Low-Rank Adaptation) is a technique for fine-tuning large pre-trained models—particularly transformers—without modifying their original weights directly. Instead of updating all parameters in a layer, LoRA injects a pair of small, trainable matrices alongside specific layers (typically the attention projections); their product represents a low-rank approximation of the weight update. Because the rank of these matrices is kept much smaller than the full weight dimensions, the trainable parameters they introduce amount to a tiny fraction of the original model size, often less than 1%. The original pre-trained weights remain frozen throughout training, preserving the knowledge encoded during pretraining.
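As a rough illustration of the parameter savings, here is a minimal NumPy sketch of a single LoRA-augmented projection. The dimensions d, k and rank r are hypothetical choices for illustration, not values from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for one projection matrix; the rank r is
# kept far smaller than d and k.
d, k, r = 1024, 1024, 4

W = rng.standard_normal((d, k))  # frozen pre-trained weight
B = np.zeros((d, r))             # trainable, initialized to zero
A = rng.standard_normal((r, k))  # trainable

def lora_forward(x):
    # Frozen path plus the low-rank bypass; B @ A is never
    # materialized as a full d-by-k matrix during training.
    return W @ x + B @ (A @ x)

trainable = A.size + B.size  # parameters actually updated
print(f"trainable fraction: {trainable / W.size:.2%}")
# prints "trainable fraction: 0.78%"
```

Only A and B receive gradients; the frozen W contributes nothing to the optimizer state, which is where most of the memory savings come from.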
The mechanics rely on a straightforward decomposition: for a weight matrix W of shape d × k, LoRA adds a bypass path BA, where B is d × r and A is r × k, with rank r ≪ min(d, k). During training, only A and B are updated; in the original formulation, A is initialized randomly and B to zero, so BA = 0 and training starts exactly from the pre-trained model. At inference time, the product BA can be merged directly into W with a simple addition, meaning LoRA introduces zero additional latency compared to a fully fine-tuned model. This merge-at-inference property is a key practical advantage over other parameter-efficient methods that require persistent adapter modules.
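The merge step can be checked numerically. A minimal sketch with toy dimensions (all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4  # toy sizes with r << min(d, k)

W = rng.standard_normal((d, k))  # frozen pre-trained weight
B = rng.standard_normal((d, r))  # "trained" LoRA factors (arbitrary values here)
A = rng.standard_normal((r, k))
x = rng.standard_normal(k)

# Training-time forward pass: frozen path plus the low-rank bypass.
y_bypass = W @ x + B @ (A @ x)

# Inference-time merge: fold BA into W once, then use a single matmul.
W_merged = W + B @ A
y_merged = W_merged @ x

assert np.allclose(y_bypass, y_merged)  # identical outputs, no extra latency path
```

Because W_merged has exactly the same shape as W, the merged model runs through the same code path as the unmodified base model; the merge can also be undone by subtracting BA to recover the original weights.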
LoRA matters because fine-tuning frontier-scale language models—which can have tens or hundreds of billions of parameters—is prohibitively expensive for most practitioners. LoRA reduces GPU memory requirements dramatically, enabling adaptation on consumer-grade hardware and making customization of powerful models broadly accessible. It also simplifies multi-task deployment: a single base model can be paired with multiple lightweight LoRA adapters, each specialized for a different task, without storing redundant copies of the full model weights.
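The multi-adapter deployment pattern can be sketched as follows; the task names and dictionary layout are purely illustrative, not the API of any particular framework.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 32, 32, 2

W_base = rng.standard_normal((d, k))  # one frozen base weight, stored once

# Hypothetical per-task adapters: each stores only the small (B, A) factors.
adapters = {
    "sentiment": (rng.standard_normal((d, r)), rng.standard_normal((r, k))),
    "ner":       (rng.standard_normal((d, r)), rng.standard_normal((r, k))),
}

def forward(task, x):
    # Switch behavior by selecting an adapter; W_base is never duplicated.
    B, A = adapters[task]
    return W_base @ x + B @ (A @ x)

per_adapter = d * r + r * k  # extra parameters per task
print(f"{per_adapter} adapter params vs {W_base.size} base params")
```

Each additional task costs only the adapter's parameters, so serving many specializations of one base model stays cheap in both storage and memory.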
Since its introduction, LoRA has become a foundational tool in the LLM ecosystem, underpinning popular fine-tuning frameworks and spawning variants such as QLoRA (which combines quantization with LoRA for even greater memory efficiency) and DoRA (which decomposes weight updates into magnitude and direction components). Its combination of simplicity, efficiency, and strong empirical performance has made it the default starting point for practitioners adapting large models to specific domains or tasks.