Matrix Factorization and Neural Network Compression

Introduction

As large language model (LLM) parameter counts push past the trillion scale, model compression has become a key technology for deployment. Matrix factorization is one of the core compression methods: by factoring a large matrix into a product of smaller matrices, it substantially reduces both parameter count and compute cost.

Core idea: weight matrices in neural networks often have approximately low-rank structure, and exploiting that structure lets us approximate the original matrix with far fewer parameters.
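To see this structure empirically, the following sketch (ours, with made-up sizes) builds a low-rank matrix plus small noise and inspects how fast its singular values decay:

import torch

# A matrix that is rank-16 plus small full-rank noise
m, n, true_rank = 512, 512, 16
W = torch.randn(m, true_rank) @ torch.randn(true_rank, n)
W = W + 0.01 * torch.randn(m, n)

# Singular values collapse sharply after the true rank
S = torch.linalg.svdvals(W)
print(S[:4])                        # large leading values
print(S[true_rank:true_rank + 4])   # near the noise floor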


1. Low-Rank Factorization Basics

1.1 Problem Formulation

For a weight matrix $W \in \mathbb{R}^{m \times n}$, low-rank factorization seeks

$$W \approx AB$$

where $A \in \mathbb{R}^{m \times r}$, $B \in \mathbb{R}^{r \times n}$, and $r \ll \min(m, n)$.

Parameter count comparison

  • Original: $mn$
  • Factorized: $r(m + n)$
  • Compression ratio: $\frac{r(m+n)}{mn}$ (see the quick check below)
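As a quick sanity check with hypothetical sizes, a 4096×4096 layer factored at rank 64:

m, n, r = 4096, 4096, 64
original = m * n              # 16,777,216 parameters
factorized = r * (m + n)      # 524,288 parameters
print(factorized / original)  # 0.03125 -> 32x fewer parameters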

1.2 SVD and Optimal Approximation

By the Eckart-Young-Mirsky theorem, the optimal rank-$r$ approximation (in Frobenius or spectral norm) is given by the truncated SVD:

import torch

def low_rank_approximation_svd(W, rank):
    """
    Compute the optimal low-rank approximation via truncated SVD.
    
    W: (m, n) original weight matrix
    rank: target rank r
    """
    # Thin SVD
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    
    # Keep the top-r singular triplets
    U_r = U[:, :rank]
    S_r = S[:rank]
    Vh_r = Vh[:rank, :]
    
    # Reconstructed approximation
    W_approx = U_r @ torch.diag(S_r) @ Vh_r
    
    # Factored representation (singular values folded into A)
    A = U_r @ torch.diag(S_r)  # (m, r)
    B = Vh_r                   # (r, n)
    
    return A, B, W_approx
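A short usage example (ours), showing the reconstruction error shrink as the rank grows:

W = torch.randn(256, 512)
for rank in (8, 32, 128):
    A, B, W_approx = low_rank_approximation_svd(W, rank)
    rel_err = (W - W_approx).norm() / W.norm()
    print(f"rank={rank:3d}  relative error={rel_err:.4f}")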

2. LoRA: The Classic Low-Rank Adaptation Method

2.1 Core Method

LoRA (Low-Rank Adaptation)[1] assumes the weight update $\Delta W$ of a pretrained model has low-rank structure:

$$W = W_0 + \Delta W = W_0 + BA, \quad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k)$$

Key advantage: only $A$ and $B$ are trained while $W_0$ stays frozen, drastically reducing the number of trainable parameters.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """
    A LoRA layer.
    
    W_0: frozen pretrained weight
    A, B: trainable low-rank matrices
    """
    def __init__(self, in_features, out_features, rank=4, alpha=1.0):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        
        # Frozen original weight
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features), 
            requires_grad=False
        )
        
        # Low-rank update matrices
        # A: Gaussian init, B: zero init (so the update starts at zero)
        self.lora_A = nn.Parameter(torch.randn(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
    
    def forward(self, x):
        # Base output from the frozen weight
        base_output = F.linear(x, self.weight)
        
        # LoRA update: x A^T B^T, scaled by alpha / r
        lora_update = self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
        
        return base_output + lora_update
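Because the update is additive, it can be folded back into the base weight for inference, so LoRA adds no extra latency. A minimal sketch (merge_lora is our name):

def merge_lora(layer):
    """Fold W_0 + (alpha/r) * B A into a single dense nn.Linear."""
    out_features, in_features = layer.weight.shape
    merged = nn.Linear(in_features, out_features, bias=False)
    with torch.no_grad():
        merged.weight.copy_(
            layer.weight + layer.scaling * (layer.lora_B @ layer.lora_A)
        )
    return merged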

2.2 Choosing the Rank: Performance Trade-offs

def lora_rank_analysis(model, dataloader, ranks=[1, 2, 4, 8, 16, 32]):
    """
    Analyze the effect of different LoRA ranks on performance.
    
    Assumes helper functions apply_lora / train / evaluate exist.
    """
    results = []
    
    for rank in ranks:
        # Create a LoRA-wrapped model
        lora_model = apply_lora(model, rank=rank)
        
        # Fine-tune briefly
        train(lora_model, dataloader, epochs=3)
        
        # Evaluate
        performance = evaluate(lora_model, dataloader)
        
        # Count trainable parameters
        trainable_params = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)
        
        results.append({
            'rank': rank,
            'performance': performance,
            'trainable_params': trainable_params,
            'compression_ratio': trainable_params / model.num_parameters()
        })
    
    return results

2.3 LoRA Variants

| Variant | Core improvement | Paper |
| --- | --- | --- |
| QLoRA | 4-bit quantization + LoRA | Dettmers et al., 2023 |
| DoRA | Weight decomposition into magnitude and direction | Liu et al., 2024 |
| LoRA+ | Adaptive learning rates | Hayou et al., 2024 |
| AdaLoRA | Dynamic rank adjustment | Zhang et al., 2023 |
| LoRAfin | Fisher-information weighting | 2024 |
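As one example of how these variants modify the basic recipe, below is a minimal sketch of DoRA's magnitude/direction idea as we read it: the adapted weight is split into a learned per-column magnitude and a normalized direction. Class name and details are ours; see Liu et al., 2024 for the exact formulation.

class DoRALinear(nn.Module):
    """Sketch: DoRA-style magnitude/direction reparameterization."""
    def __init__(self, in_features, out_features, rank=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features),
                                   requires_grad=False)  # frozen W_0
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        # Learned magnitude, initialized to the column norms of W_0
        self.magnitude = nn.Parameter(
            self.weight.norm(dim=0, keepdim=True).detach()
        )
    
    def forward(self, x):
        adapted = self.weight + self.lora_B @ self.lora_A        # W_0 + BA
        direction = adapted / (adapted.norm(dim=0, keepdim=True) + 1e-8)
        return F.linear(x, self.magnitude * direction)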

3. Double Sparse Factorization

3.1 The Double Sparse Method

Recent work[2] proposes factoring the weight matrix into a product of two sparse matrices:

$$W \approx S_1 S_2$$

where both $S_1$ and $S_2$ are sparse.

class DoubleSparseLayer(nn.Module):
    """
    Double sparse layer: the weight is the product of two sparse
    factors. Sparsity is enforced here with fixed random binary
    masks; this is an illustrative sketch, as the paper derives the
    sparsity patterns differently.
    """
    def __init__(self, in_features, out_features, density=0.1, hidden=None):
        super().__init__()
        self.density = density
        hidden = hidden or in_features
        
        # Sparse factor matrices
        self.S1 = nn.Parameter(torch.randn(out_features, hidden))
        self.S2 = nn.Parameter(torch.randn(hidden, in_features))
        
        # Fixed binary masks with the requested density
        self.register_buffer('mask1', (torch.rand(out_features, hidden) < density).float())
        self.register_buffer('mask2', (torch.rand(hidden, in_features) < density).float())
    
    def forward(self, x):
        # Product of the two masked (sparse) factors
        return F.linear(x, (self.S1 * self.mask1) @ (self.S2 * self.mask2))
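A quick check (ours) of the effective nonzero count at density 0.1:

layer = DoubleSparseLayer(1024, 1024, density=0.1)
nnz = int(layer.mask1.sum() + layer.mask2.sum())
print(nnz, 1024 * 1024, nnz / (1024 * 1024))  # ~0.2x the dense count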

3.2 Empirical Results

According to the paper's experiments, LLaMA2-13B pruned to 50% of its parameters can still outperform LLaMA2-7B.[2]


4. Tensor Decomposition Methods

4.1 CP Decomposition

Canonical Polyadic (CP) decomposition expresses a higher-order tensor as a sum of rank-1 tensors:

$$\mathcal{T} \approx \sum_{i=1}^{R} a_i \circ b_i \circ c_i$$

where $\circ$ denotes the outer product (shown here for a third-order tensor).
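Reconstructing a rank-$R$ CP approximation from its factor matrices takes a single einsum (fitting the factors, e.g. via ALS, is a separate problem; this sketch only shows the reconstruction):

# Factor matrices for a (32, 16, 24) tensor at CP rank R = 8
A = torch.randn(32, 8)
B = torch.randn(16, 8)
C = torch.randn(24, 8)

# Sum of R outer products a_r ∘ b_r ∘ c_r
T_approx = torch.einsum('ir,jr,kr->ijk', A, B, C)
print(T_approx.shape)  # torch.Size([32, 16, 24])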

4.2 Tucker Decomposition

Tucker decomposition generalizes the SVD to higher-order tensors:

$$\mathcal{T} \approx \mathcal{G} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)}$$

where $\mathcal{G}$ is a small core tensor and each $U^{(i)}$ is a factor matrix applied along mode $i$.
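The mode products can likewise be written as one contraction; a reconstruction sketch (ours):

# Core tensor and factor matrices for a (32, 16, 24) tensor
G = torch.randn(4, 4, 4)
U1, U2, U3 = torch.randn(32, 4), torch.randn(16, 4), torch.randn(24, 4)

# G ×1 U1 ×2 U2 ×3 U3 in a single einsum
T_approx = torch.einsum('abc,ia,jb,kc->ijk', G, U1, U2, U3)
print(T_approx.shape)  # torch.Size([32, 16, 24])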

4.3 TT (Tensor Train) Decomposition

Tensor train decomposition represents a high-order tensor entrywise as a product of small matrices:

$$\mathcal{T}(i_1, i_2, \dots, i_k) = G_1[i_1]\, G_2[i_2] \cdots G_k[i_k]$$

where each core slice $G_j[i_j]$ is an $r_{j-1} \times r_j$ matrix, with $r_0 = r_k = 1$.

def tensor_train_decompose(tensor, ranks):
    """
    TT-SVD: decompose a tensor into tensor-train cores.
    
    tensor: (d1, d2, ..., dk)
    ranks: [1, r1, r2, ..., r_{k-1}, 1]
    """
    shape = tensor.shape
    n_dims = len(shape)
    cores = []
    
    # Iterative SVD over unfoldings
    current = tensor.reshape(shape[0], -1)
    r_prev = 1
    
    for i in range(n_dims - 1):
        # Unfold so rows combine the previous rank with the current mode
        current = current.reshape(r_prev * shape[i], -1)
        U, S, Vh = torch.linalg.svd(current, full_matrices=False)
        
        # Truncate to the target TT-rank
        r = min(ranks[i + 1], S.shape[0])
        cores.append(U[:, :r].reshape(r_prev, shape[i], r))
        
        # Carry the remainder into the next step
        current = torch.diag(S[:r]) @ Vh[:r, :]
        r_prev = r
    
    # Last core absorbs whatever is left
    cores.append(current.reshape(r_prev, shape[-1], 1))
    
    return cores
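A quick round-trip check (ours): at full TT-ranks the cores reconstruct the tensor exactly, up to floating-point error.

T = torch.randn(6, 7, 8, 9)
cores = tensor_train_decompose(T, ranks=[1, 6, 42, 9, 1])

# Contract the cores back together: (1, d1, r1) x (r1, d2, r2) x ...
result = cores[0]
for core in cores[1:]:
    result = torch.einsum('...a,aib->...ib', result, core)
print((T - result.reshape(T.shape)).norm() / T.norm())  # ~0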

5. Monarch Factorization

Monarch factorization[3] is a matrix factorization designed specifically for hardware-efficient computation:

$$M = P L P^{\top} R$$

where $M$ is a product of block-diagonal matrices ($L$, $R$) and a fixed permutation ($P$); for an $n \times n$ matrix this brings the cost of a matrix product down from $O(n^2)$ to $O(n\sqrt{n})$.

def monarch_factorize(A, n_blocks):
    """
    A simplified sketch in the spirit of Monarch: factor strided
    row-blocks independently. The paper's actual projection onto
    Monarch matrices is more involved.
    """
    m, n = A.shape
    
    # SVD each strided row-block independently
    block_matrices = []
    for i in range(n_blocks):
        block = A[i::n_blocks, :]
        U, S, Vh = torch.linalg.svd(block, full_matrices=False)
        block_matrices.append((U, S, Vh))
    
    return block_matrices

6. Factorizing Neural Network Weights in Practice

6.1 Factorizing Fully Connected Layers

class FactorizedLinear(nn.Module):
    """
    A factorized fully connected layer.
    """
    def __init__(self, in_features, out_features, rank):
        super().__init__()
        self.rank = rank
        
        # Low-rank factors: x -> A x -> B A x
        self.A = nn.Linear(in_features, rank, bias=False)
        self.B = nn.Linear(rank, out_features, bias=True)
    
    def forward(self, x):
        return self.B(self.A(x))
    
    @property
    def weight(self):
        # Effective dense weight
        return self.B.weight @ self.A.weight
 
 
def decompose_fc_layer(layer, rank):
    """Decompose a fully connected layer into two linear layers."""
    in_f = layer.in_features
    out_f = layer.out_features
    
    # Optimal approximation via SVD
    W = layer.weight.data
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    
    # Build the factorized layer
    new_layer = FactorizedLinear(in_f, out_f, rank)
    
    # Copy the truncated factors into the new weights
    with torch.no_grad():
        new_layer.A.weight.copy_(Vh[:rank, :])
        new_layer.B.weight.copy_(U[:, :rank] * S[:rank])
        if layer.bias is not None:
            new_layer.B.bias.copy_(layer.bias)
    
    return new_layer
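A sketch (ours) of applying this across a whole model, replacing each nn.Linear only when the factorization actually saves parameters:

def factorize_model(model, rank):
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            dense = child.in_features * child.out_features
            factored = rank * (child.in_features + child.out_features)
            if factored < dense:
                setattr(model, name, decompose_fc_layer(child, rank))
        else:
            factorize_model(child, rank)  # recurse into submodules
    return model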

6.2 Factorizing Embedding Matrices

class FactorizedEmbedding(nn.Module):
    """
    A factorized embedding layer.
    
    Original:   V x d (vocabulary x embedding dim)
    Factorized: V x r + r x d, with r << min(V, d)
    """
    def __init__(self, vocab_size, embedding_dim, rank, bias=True):
        super().__init__()
        self.rank = rank
        
        # Compress tokens into a low-dimensional space
        self.encoder = nn.Embedding(vocab_size, rank)
        # Expand back to the full embedding dimension
        self.decoder = nn.Linear(rank, embedding_dim, bias=False)
        
        # Optional bias over the embedding dimension
        self.has_bias = bias
        if self.has_bias:
            self.bias = nn.Parameter(torch.zeros(embedding_dim))
    
    def forward(self, x):
        # Encode
        compressed = self.encoder(x)  # (batch, seq, rank)
        # Decode
        output = self.decoder(compressed)  # (batch, seq, d)
        
        if self.has_bias:
            output = output + self.bias
        
        return output
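Parameter savings at GPT-2-like sizes (numbers are hypothetical):

vocab, dim, rank = 50257, 768, 128
dense = vocab * dim
factored = vocab * rank + rank * dim
print(dense, factored, factored / dense)  # ratio ≈ 0.17, ~6x fewer parameters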

7. Progressive Factorization Training

7.1 Progressive Training from Large to Small Ranks

import numpy as np

def progressive_factorization_train(model, dataloader, initial_rank, final_rank, steps):
    """
    Progressive factorization training.
    
    Start from a large rank and anneal down to the target rank.
    Assumes helper functions adjust_rank / train_step exist.
    """
    rank_schedule = np.linspace(initial_rank, final_rank, steps)
    
    for step, target_rank in enumerate(rank_schedule):
        rank = int(target_rank)
        
        # Resize every factorized layer
        for name, module in model.named_modules():
            if hasattr(module, 'rank'):
                # Grow or shrink the rank
                adjust_rank(module, rank)
        
        # One training step at the current rank
        train_step(model, dataloader)
        
        print(f"Step {step}: rank = {rank}")

7.2 Dynamic Rank Adjustment

class AdaptiveRankLayer(nn.Module):
    """
    Adaptive-rank layer: adjusts its rank dynamically from learned
    importance scores.
    """
    def __init__(self, in_features, out_features, max_rank):
        super().__init__()
        self.max_rank = max_rank
        
        # Full-rank base weight
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features)
        )
        
        # One learnable importance score per rank-1 component
        self.importance = nn.Parameter(
            torch.ones(max_rank)
        )
    
    def get_effective_rank(self):
        """Effective rank = number of open sigmoid gates, in [1, max_rank]."""
        gates = torch.sigmoid(self.importance)
        return max(1, min(self.max_rank, int(gates.sum().item())))
    
    def forward(self, x):
        # Current effective rank
        rank = self.get_effective_rank()
        
        # SVD on every forward pass is costly; fine for illustration
        U, S, Vh = torch.linalg.svd(self.weight, full_matrices=False)
        
        # Truncate to the effective rank
        W_approx = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]
        
        return F.linear(x, W_approx)

8. Evaluating Factorization Quality

8.1 Metrics

def evaluate_decomposition(original, decomposed, original_layer=None):
    """
    Evaluate factorization quality.
    """
    # Relative reconstruction error
    recon_error = (original - decomposed).norm() / original.norm()
    
    # Compare the leading singular subspaces
    U_o, S_o, _ = torch.linalg.svd(original, full_matrices=False)
    U_d, S_d, _ = torch.linalg.svd(decomposed, full_matrices=False)
    
    # Overlap between the top-10 left singular subspaces
    overlap = torch.abs(U_o[:, :10].T @ U_d[:, :10]).mean()
    
    # If the original layer is available, measure retained task performance
    performance_kept = None
    if original_layer is not None:
        # Compare on a validation set (placeholder)
        pass
    
    return {
        'reconstruction_error': recon_error.item(),
        'subspace_overlap': overlap.item(),
        'performance_kept': performance_kept
    }

9. Summary

Key takeaways

  1. Low-rank factorization exploits the low-rank structure of weight matrices to drastically cut parameter counts
  2. LoRA is the classic low-rank adaptation method: the pretrained weights stay frozen and only a low-rank update is trained
  3. Double sparse factorization reaches higher compression ratios while preserving performance
  4. Tensor decompositions (CP, Tucker, TT) handle higher-order tensors (e.g., multi-dimensional weights in graph neural networks)
  5. Monarch factorization is designed specifically for hardware-efficient computation

Method comparison

| Method | Compression ratio | Accuracy retention | Compute efficiency | Typical scenario |
| --- | --- | --- | --- | --- |
| Truncated SVD | Medium | Theoretically optimal | – | Offline compression |
| LoRA | – | Good | – | Lightweight fine-tuning |
| Double sparse | Very high | Good | – | Extreme compression |
| Monarch | – | Good | Very high | Hardware deployment |


Footnotes

  1. Hu, E. J., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.

  2. Double Sparse (2024). Two Sparse Matrices are Better than One. arXiv:2409.18850.

  3. Dao, T., et al. (2022). Monarch: Expressive Structured Matrices for Efficient and Accurate Training. ICML 2022.