LoRA: Low-Rank Adaptation Explained

LoRA (Low-Rank Adaptation) is one of the most widely used PEFT methods. It makes fine-tuning efficient by expressing the weight update as a low-rank decomposition.

Core Idea

Low-Rank Decomposition

The core assumption behind LoRA: the weight update produced by fine-tuning a pretrained model has a low intrinsic rank.

For a pretrained weight matrix W₀ ∈ ℝ^(d×k), LoRA represents its update as:

W₀ + ΔW = W₀ + BA

where:

  • A ∈ ℝ^(r×k): down-projection matrix
  • B ∈ ℝ^(d×r): up-projection matrix
  • r ≪ min(d, k): the rank (typically 4-64)
Original approach: update W₀ (all d×k parameters)
W_new = W₀ + ΔW

LoRA approach: factor ΔW into two small matrices
W_new = W₀ + BA
       = W₀ + (d×r) × (r×k)

       Parameter count drops from d×k to r×(d+k)
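The parameter arithmetic above is easy to check directly; a minimal sketch with hypothetical sizes (a square 4096×4096 weight, rank 8):

```python
def lora_param_counts(d: int, k: int, r: int):
    """Parameter counts for one d×k weight: full update vs. LoRA factors."""
    full = d * k          # updating W₀ directly
    lora = r * (d + k)    # B is d×r, A is r×k
    return full, lora

full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora)  # 16777216 65536 → LoRA trains about 0.39% of the parameters
```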

Forward Pass

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """
    LoRA linear layer.

    Original: y = Wx + b
    LoRA:     y = (W₀ + BA)x + b = W₀x + BAx
    """
    def __init__(self, 
                 in_features: int, 
                 out_features: int, 
                 rank: int = 4,
                 alpha: int = 1,
                 dropout: float = 0.0,
                 bias: bool = True):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank  # scaling factor

        # Frozen pretrained weight (random here; in practice loaded from the base model)
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features), 
            requires_grad=False
        )
        if bias:
            self.bias = nn.Parameter(torch.zeros(out_features))
        else:
            self.register_parameter('bias', None)

        # Trainable LoRA parameters:
        # A gets a Gaussian init, B starts at zero, so BA = 0 initially
        # and the model begins training exactly equivalent to the original
        self.lora_A = nn.Parameter(torch.empty(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.dropout = nn.Dropout(dropout)
        nn.init.normal_(self.lora_A, mean=0.0, std=1.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base path
        base_output = nn.functional.linear(x, self.weight, self.bias)

        # Trainable LoRA branch
        lora_output = (self.dropout(x) @ self.lora_A.T @ self.lora_B.T) * self.scaling

        return base_output + lora_output
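A quick standalone sanity check of the zero-initialization property, using bare tensors with toy shapes rather than the class above:

```python
import torch

d, k, r = 16, 32, 4
x = torch.randn(2, k)
W0 = torch.randn(d, k)
A = torch.randn(r, k)      # Gaussian init, like lora_A
B = torch.zeros(d, r)      # zero init, like lora_B

base = x @ W0.t()
lora = x @ A.t() @ B.t()   # exactly zero because B = 0
assert torch.equal(base + lora, base)
```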

Mathematical Derivation

Why does it work?

Suppose the weight update from full fine-tuning is ΔW, with singular value decomposition:

ΔW = UΣVᵀ = Σᵢ σᵢ uᵢ vᵢᵀ

LoRA keeps only the directions corresponding to the r largest singular values:

ΔW ≈ Σᵢ₌₁..ᵣ σᵢ uᵢ vᵢᵀ

Core insight: when fine-tuning a pretrained model, the dominant directions of the weight change are low-rank.
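The truncation claim can be illustrated numerically: if ΔW truly has rank r, cutting its SVD off at r loses nothing. A synthetic example with made-up sizes:

```python
import torch

torch.manual_seed(0)
d, k, r = 64, 64, 4
# Construct a ΔW with true rank 4
delta_W = torch.randn(d, r) @ torch.randn(r, k)

U, S, Vh = torch.linalg.svd(delta_W, full_matrices=False)
approx = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]
print(torch.dist(delta_W, approx).item())  # ~0: the top-4 directions capture everything
```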

Gradient Analysis

Let the loss be L and write ΔW = BA. The gradients with respect to the LoRA parameters are:

∂L/∂B = (∂L/∂ΔW) Aᵀ        ∂L/∂A = Bᵀ (∂L/∂ΔW)

Since B is initialized to zero, ∂L/∂A vanishes at the first step while ∂L/∂B does not, so early training is driven mostly through B.

Choosing the Rank r

| r      | Trainable params | Quality         | Use case                      |
|--------|------------------|-----------------|-------------------------------|
| 2-4    | Very few         | Acceptable      | Extreme resource constraints  |
| 8      | Few              | Good            | Lightweight tasks             |
| 16-32  | Moderate         | Very good       | General-purpose scenarios     |
| 64-128 | Larger           | Near full FT    | Complex tasks                 |
Parameter-count example (LLaMA-7B, d_model=4096, per layer):
- One LoRA pair each for W_q, W_k, W_v: 3 matrices
- r=8:  3 × 2 × 4096 × 8  = 196,608 parameters
- r=64: 3 × 2 × 4096 × 64 = 1,572,864 parameters
- Full fine-tuning: 3 × 4096 × 4096 = 50,331,648 parameters
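Those counts can be reproduced in a few lines (per layer, q/k/v projections only):

```python
d_model = 4096
n_matrices = 3  # W_q, W_k, W_v

def lora_params(r: int) -> int:
    # Each adapted matrix adds A (r×d_model) and B (d_model×r)
    return n_matrices * 2 * d_model * r

print(lora_params(8))                    # 196608
print(lora_params(64))                   # 1572864
print(n_matrices * d_model * d_model)    # 50331648 (full fine-tuning)
```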

LoRA Variants

1. QLoRA

Quantized LoRA: combines 4-bit quantization of the base weights with LoRA, drastically reducing memory requirements.

# QLoRA core idea (illustrative pseudocode; QuantizedTensor and
# dequantize stand in for bitsandbytes' NF4 machinery)
class QLoRALinear(nn.Module):
    def __init__(self, base_model, lora_config):
        super().__init__()
        # 1. Base weights stored in 4-bit quantized form
        self.weight = QuantizedTensor(
            base_model.weight, 
            dtype='nf4'  # 4-bit NormalFloat
        )

        # 2. LoRA parameters kept in bfloat16 (trainable)
        self.lora_A = nn.Parameter(...)
        self.lora_B = nn.Parameter(...)

    def forward(self, x):
        # 3. Dequantize on the fly, then compute
        w = dequantize(self.weight)
        base_out = x @ w.t()
        lora_out = x @ self.lora_A.t() @ self.lora_B.t()
        return base_out + lora_out
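NF4 details aside, the storage trick can be sketched with plain symmetric absmax quantization; this is a toy stand-in, not the actual bitsandbytes scheme:

```python
import torch

def quantize_absmax_4bit(w: torch.Tensor):
    """Toy symmetric 4-bit absmax quantization (integer levels in [-7, 7])."""
    scale = w.abs().max() / 7
    q = torch.clamp(torch.round(w / scale), -7, 7).to(torch.int8)
    return q, scale

def dequantize_absmax(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4, 4)
q, scale = quantize_absmax_4bit(w)
w_hat = dequantize_absmax(q, scale)
# Round-trip error is bounded by half a quantization step
assert (w_hat - w).abs().max() <= scale / 2 + 1e-6
```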

QLoRA Memory Savings

| Method           | LLaMA-7B VRAM | LLaMA-65B VRAM |
|------------------|---------------|----------------|
| Full fine-tuning | ~48GB         | ~400GB         |
| LoRA             | ~12GB         | ~80GB          |
| QLoRA            | ~6GB          | ~48GB          |

2. DoRA (Weight-Decomposed LoRA)

DoRA decomposes the pretrained weight into a magnitude vector and a direction matrix, applying the LoRA update inside the direction:

class DoRALinear(nn.Module):
    def __init__(self, weight, rank=8):
        super().__init__()
        self.original_weight = weight
        # Magnitude vector, initialized to the row norms of W₀
        # so the layer exactly reproduces W₀ at the start of training
        self.m = nn.Parameter(weight.norm(dim=1, keepdim=True).clone())
        self.A = nn.Parameter(torch.randn(rank, weight.shape[1]))
        self.B = nn.Parameter(torch.zeros(weight.shape[0], rank))

    def forward(self, x):
        # Direction: row-normalize the adapted weight
        combined = self.original_weight + self.B @ self.A
        direction = combined / combined.norm(dim=1, keepdim=True)

        # Recombine: learned magnitude × direction
        weight = self.m * direction
        return x @ weight.t()
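At initialization (B = 0, magnitude set to the row norms of W₀) DoRA reproduces the pretrained weight exactly; a standalone check under those assumptions, with toy shapes:

```python
import torch

torch.manual_seed(0)
W0 = torch.randn(8, 16)
r = 4
A = torch.randn(r, 16)
B = torch.zeros(8, r)
m = W0.norm(dim=1, keepdim=True)   # magnitude initialized from W₀

combined = W0 + B @ A              # equals W₀ while B = 0
direction = combined / combined.norm(dim=1, keepdim=True)
W_adapted = m * direction          # recovers W₀ exactly
assert torch.allclose(W_adapted, W0, atol=1e-6)
```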

3. LoRA+

Unlike standard LoRA, LoRA+ uses different learning rates for A and B:

# The key change in LoRA+
optimizer = torch.optim.AdamW([
    {'params': model.lora_A, 'lr': lr_A},
    {'params': model.lora_B, 'lr': lr_B},  # lr_B = λ · lr_A with λ > 1
])

Finding: the optimal learning rate for B is substantially larger than for A; the LoRA+ paper suggests a ratio λ on the order of 16.
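In a model with many LoRA layers, the two parameter groups can be collected by name; a small helper sketch (the 16× default ratio follows the LoRA+ paper's suggestion, and the name matching assumes the lora_A/lora_B naming convention used above):

```python
import torch
import torch.nn as nn

def loraplus_param_groups(model: nn.Module, lr: float, lr_ratio: float = 16.0):
    """Build optimizer param groups giving lora_B a larger learning rate."""
    a_params, b_params = [], []
    for name, param in model.named_parameters():
        if "lora_A" in name:
            a_params.append(param)
        elif "lora_B" in name:
            b_params.append(param)
    return [
        {"params": a_params, "lr": lr},
        {"params": b_params, "lr": lr * lr_ratio},
    ]
```

The returned list can be passed straight to torch.optim.AdamW.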

4. AdaLoRA

Adaptively reallocates rank across layers during training (illustrative sketch):

class AdaLoRALinear(nn.Module):
    def __init__(self, max_rank=16):
        super().__init__()
        self.max_rank = max_rank
        # Soft rank allocation (pruned by importance during training)
        self.current_rank = nn.Parameter(
            torch.ones(max_rank) / max_rank  # uniform at initialization
        )

    def compute_importance(self):
        # Estimate importance from gradient magnitudes
        grad_A = self.lora_A.grad
        grad_B = self.lora_B.grad
        importance = grad_A.norm() * grad_B.norm()
        return importance

5. PiSSA (Principal Singular values and Singular vectors Adaptation)

PiSSA initializes the adapter from the principal singular directions of the weight and fine-tunes those, freezing the residual (sketch; the SVD is performed once at initialization, not per forward pass):

class PiSSALinear(nn.Module):
    def __init__(self, weight, rank):
        super().__init__()
        self.rank = rank
        # One-time SVD of the pretrained weight
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        Ur, Sr, Vr = U[:, :rank], S[:rank], Vh[:rank, :]

        # Principal part initializes the trainable adapter: B @ A = Ur diag(Sr) Vr
        self.A = nn.Parameter(torch.diag(Sr.sqrt()) @ Vr)
        self.B = nn.Parameter(Ur @ torch.diag(Sr.sqrt()))

        # Residual stays frozen
        self.register_buffer('residual', weight - Ur @ torch.diag(Sr) @ Vr)

    def forward(self, x):
        return x @ (self.residual + self.B @ self.A).t()
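The principal/residual split is exact, and the residual carries only the tail of the spectrum; a quick check with a random weight of hypothetical size:

```python
import torch

torch.manual_seed(0)
W = torch.randn(32, 48)
r = 8
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
principal = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]
residual = W - principal

# The residual's largest singular value is the (r+1)-th of W,
# confirming the split separates head from tail of the spectrum
S_res = torch.linalg.svdvals(residual)
assert torch.allclose(S_res[0], S[r], atol=1e-3)
```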

6. Recent Developments (2024-2025)

Dynamic LoRA

Adjusts the LoRA configuration per layer dynamically (sketch; extract_layer_depth and LoRAConfig are assumed helpers):

class DynamicLoRA:
    def __init__(self, model):
        self.layer_configs = {}
        for name, module in model.named_modules():
            if 'attention' in name:
                # Set the rank based on layer depth
                depth = extract_layer_depth(name)
                rank = min(64, 4 + depth * 4)  # deeper layers get larger ranks
                self.layer_configs[name] = LoRAConfig(rank=rank)

GoRA (Gradient-driven Adaptive Rank)

Allocates rank adaptively from gradient information (sketch; compute_grad_norms and distribute_budget are assumed helpers):

class GoRAConfig:
    def __init__(self, model, total_budget):
        # Distribute the rank budget according to gradient magnitudes
        grads = compute_grad_norms(model)
        ranks = distribute_budget(total_budget, grads)

Practical Configuration

Recommended Configuration (LLaMA family)

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                     # rank, typically 8-64
    lora_alpha=16,            # scaling factor, usually set to r or 2r
    target_modules=[          # modules to adapt
        "q_proj", "k_proj", 
        "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.05,        # dropout on the LoRA branch
    bias="none",              # do not train biases
    task_type="CAUSAL_LM"     # task type
)

model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# Output: trainable params: 41,843,520 || all params: 6,738,415,616 || trainable%: 0.621%

Configuration Suggestions by Task

| Task                | r     | Target modules  | alpha  |
|---------------------|-------|-----------------|--------|
| Text classification | 4-8   | q_proj, v_proj  | 8-16   |
| Dialogue generation | 16-32 | all QKV         | 32-64  |
| Instruction tuning  | 8-16  | QKV + FFN       | 16-32  |
| Domain adaptation   | 32-64 | all layers      | 64-128 |

Using the HuggingFace PEFT Library

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

# Load the base model
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Configure LoRA
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# Apply LoRA
model = get_peft_model(model, peft_config)

# Train
trainer = transformers.Trainer(
    model=model,
    train_dataset=train_dataset,
    args=training_arguments,
    data_collator=data_collator
)
trainer.train()

# Save (only the adapter weights are written)
model.save_pretrained("lora_model")

Multi-LoRA Deployment

from peft import PeftModel

# One base model, multiple LoRA adapters
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Switch adapters per task
for task in ["math", "code", "creative"]:
    lora_path = f"./loras/{task}"
    model = PeftModel.from_pretrained(base_model, lora_path)
    # Inference (generate is a placeholder for your decoding routine)
    generate(model, prompt)
