Model Quantization Techniques

Model quantization converts model weights and activations from high-precision (FP32) to low-precision (INT8/INT4/FP4) representations, and is one of the core techniques for compressing large models.

1. Quantization Basics

1.1 Definition of Quantization

Quantization maps continuous values to a discrete set of values:

$$q = \mathrm{clamp}\left(\mathrm{round}\left(\frac{x}{s}\right) + z,\ 0,\ 2^b - 1\right)$$

where:

  • $s$: scaling factor (scale)
  • $z$: zero point
  • $b$: bit width

Dequantization:

$$\hat{x} = s \cdot (q - z)$$

1.2 Quantization Types

| Type | Formula | Characteristics |
| --- | --- | --- |
| Symmetric quantization | $q = \mathrm{round}(x / s)$ | Zero point is 0; suited to symmetrically distributed data |
| Asymmetric quantization | $q = \mathrm{round}(x / s) + z$ | More flexible, but more complex to implement |

Symmetric quantization:

$$s = \frac{\max|x|}{2^{b-1} - 1}, \qquad q = \mathrm{clamp}\left(\mathrm{round}\left(\frac{x}{s}\right),\ -2^{b-1},\ 2^{b-1} - 1\right)$$

Asymmetric quantization:

$$s = \frac{x_{\max} - x_{\min}}{2^b - 1}, \qquad z = \mathrm{round}\left(-\frac{x_{\min}}{s}\right), \qquad q = \mathrm{clamp}\left(\mathrm{round}\left(\frac{x}{s}\right) + z,\ 0,\ 2^b - 1\right)$$

1.3 Quantization Error Analysis

The error introduced by quantization is $\epsilon = x - \hat{x}$, where $\hat{x}$ is the dequantized value.

Mean squared error:

$$\mathrm{MSE} = \mathbb{E}\left[(x - \hat{x})^2\right]$$

SNR (signal-to-noise ratio):

$$\mathrm{SNR} = 10 \log_{10} \frac{\mathbb{E}[x^2]}{\mathbb{E}\left[(x - \hat{x})^2\right]}\ \mathrm{dB}$$
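
As a quick numerical illustration of these formulas, the sketch below (all tensors and sizes are arbitrary) symmetrically quantizes a random tensor to 8 bits, dequantizes it, and reports MSE and SNR:

import torch

def quantize_symmetric(x, bits=8):
    """Symmetric b-bit quantization of a tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max() / qmax                       # s = max|x| / (2^(b-1) - 1)
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q, scale

x = torch.randn(4096)                                  # weights are roughly normal
q, scale = quantize_symmetric(x, bits=8)
x_hat = q * scale                                      # dequantize

mse = torch.mean((x - x_hat) ** 2)
snr = 10 * torch.log10(torch.mean(x ** 2) / mse)       # SNR in dB
print(f"MSE = {mse.item():.2e}, SNR = {snr.item():.1f} dB")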

2. Post-Training Quantization (PTQ)

2.1 Classification

Post-Training Quantization (PTQ)
├── Dynamic quantization
│   └── Activations quantized on the fly at inference time
├── Static quantization
│   └── Offline calibration
└── Mixed-precision quantization
    └── Different precision for different layers

2.2 Dynamic Quantization

The simplest form of quantization: weights are quantized offline, while activations are quantized dynamically at runtime:

import torch
import torch.quantization

# Dynamic quantization (weights stored as INT8; activations kept in FP32
# and quantized on the fly at inference time)
model_dynamic = torch.quantization.quantize_dynamic(
    model,               # original FP32 model
    {torch.nn.Linear},   # layer types to quantize
    dtype=torch.qint8    # target precision
)
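
To see the effect on model size, one can serialize both models and compare (a rough check; `model` and `model_dynamic` are the objects from the snippet above):

import io
import torch

def model_size_mb(m):
    # Serialize the state dict to an in-memory buffer and report its size
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"FP32 model:   {model_size_mb(model):.1f} MB")
print(f"Dynamic INT8: {model_size_mb(model_dynamic):.1f} MB")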

2.3 Static Quantization

A calibration dataset is required to determine the scaling factors:

# Static quantization configuration
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(model, inplace=True)

# Calibration: run representative data through the model to collect activation statistics
with torch.no_grad():
    for batch in calibration_data:
        model(batch)

# Convert to the quantized model
model_int8 = torch.quantization.convert(model, inplace=False)

3. GPTQ

3.1 Core Idea

GPTQ1 uses the **Optimal Brain Compression (OBC)** framework to perform 4-bit quantization.

Core algorithm

The weight matrix $W$ is quantized column by column. For each column:

  1. Compute the inverse of the Hessian matrix, $H^{-1}$
  2. Select the weights whose quantization error is smallest
  3. Exactly compensate for the quantization error on the remaining weights via a closed-form update (see the sketch below)
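
The loop below is a minimal sketch of that quantize-and-compensate procedure (illustrative only: `quantizer` stands in for any round-to-nearest quantizer, and real GPTQ implementations add blocked lazy updates and a Cholesky factorization of the inverse Hessian):

import torch

def gptq_column_loop(W, Hinv, quantizer):
    """Per-column quantization with closed-form error compensation (simplified).

    W:    (out_features, in_features) weight matrix, modified in place
    Hinv: (in_features, in_features) inverse Hessian from calibration data
    quantizer: callable mapping a float column to its quantized/dequantized value
    """
    Q = torch.zeros_like(W)
    for j in range(W.shape[1]):
        w = W[:, j]
        q = quantizer(w)                            # quantize column j
        Q[:, j] = q
        err = (w - q) / Hinv[j, j]                  # error scaled by the Hessian diagonal
        # Compensate: spread the error onto the columns that are not yet quantized
        W[:, j + 1:] -= err.unsqueeze(1) * Hinv[j, j + 1:].unsqueeze(0)
    return Q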

3.2 GPTQ Algorithm

import torch

def gptq_quantize(W, bits=4, per_channel=True):
    """
    Simplified weight quantization used as a building block of GPTQ
    (round-to-nearest; the full GPTQ algorithm additionally compensates the
    rounding error using the inverse Hessian, as sketched in 3.1).

    Args:
        W: weight matrix (out_features, in_features)
        bits: number of quantization bits
        per_channel: whether to quantize per channel
    """
    rows, cols = W.shape
    device = W.device

    # Scaling factors (symmetric quantization, so no zero point)
    if per_channel:
        # Per-channel: one scale per output row
        max_val = W.abs().max(dim=1, keepdim=True)[0]
        scales = max_val / (2**(bits-1) - 1)
    else:
        # Per-tensor: a single scale
        max_val = W.abs().max()
        scales = max_val / (2**(bits-1) - 1)

    # Quantize
    W_quant = torch.round(W / scales)
    W_quant = torch.clamp(W_quant, -(2**(bits-1)), 2**(bits-1) - 1)

    # Dequantize (only needed to inspect the reconstruction error)
    W_dequant = W_quant * scales

    return W_quant.to(torch.int8), scales.to(torch.float16)
 
 
class GPTQ:
    """
    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
    (simplified skeleton: the scale/round logic is shown, Hessian-based error
    compensation is omitted)
    """
    def __init__(self, model, bits=4, groupsize=-1):
        self.model = model
        self.bits = bits
        self.groupsize = groupsize  # -1 means per-layer (per-tensor) quantization

    def quantize_layer(self, layer, name):
        """Quantize a single layer."""
        # Extract the weights
        W = layer.weight.data.clone()
        orig_shape = W.shape

        # Handle 2D weight matrices (e.g. nn.Linear), optionally in column groups
        if len(W.shape) == 2:
            out_features, in_features = W.shape
            if self.groupsize == -1:
                # Per-layer quantization
                W_quant, scales = self._quantize_tensor(W)
            else:
                # Per-group quantization
                W_quant, scales = self._quantize_groups(W, self.groupsize)

        # Return the quantized weights and scales for storage in an int format
        return W_quant, scales

    def _quantize_tensor(self, W):
        """Quantize the whole tensor with a single scale."""
        # Scaling factor
        scales = W.abs().max() / (2**(self.bits-1) - 1)
        # Quantize
        W_quant = torch.round(W / scales)
        W_quant = torch.clamp(W_quant, -(2**(self.bits-1)), 2**(self.bits-1)-1)
        return W_quant, scales

    def _quantize_groups(self, W, groupsize):
        """Quantize group by group along the input dimension."""
        out_features, in_features = W.shape
        num_groups = in_features // groupsize

        W_quant = torch.zeros_like(W)
        scales = torch.zeros(out_features, num_groups, device=W.device)

        for g in range(num_groups):
            start = g * groupsize
            end = min((g + 1) * groupsize, in_features)

            W_g = W[:, start:end]
            scale_g = W_g.abs().max(dim=1, keepdim=True)[0] / (2**(self.bits-1) - 1)

            W_quant[:, start:end] = torch.round(W_g / scale_g)
            W_quant[:, start:end] = torch.clamp(
                W_quant[:, start:end],
                -(2**(self.bits-1)),
                2**(self.bits-1)-1
            )
            scales[:, g] = scale_g.squeeze(-1)

        return W_quant, scales
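
As a toy usage of the class above (layer size and group size chosen arbitrarily):

import torch
import torch.nn as nn

layer = nn.Linear(768, 768, bias=False)
gptq = GPTQ(model=None, bits=4, groupsize=128)

W_quant, scales = gptq.quantize_layer(layer, name="toy_linear")
print(W_quant.shape, scales.shape)  # (768, 768) weights and 768 x 6 group scales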

3.3 GPTQ Variants

| Variant | Improvement | Reference |
| --- | --- | --- |
| GPTQ | Base OBC-style algorithm | arXiv:2210.17323 |
| AutoGPTQ | Integrated quantization framework | open-source library |
| GPTQ-for-LLaMa | LLaMA-specific optimizations | GitHub |

4. AWQ (Activation-Aware Weight Quantization)

4.1 Core Idea

AWQ2 observes that weights differ in how sensitive they are to quantization error:

Not all weights are equally important; protecting the most sensitive (salient) weights can significantly reduce quantization error.

Sensitivity measure (weight-magnitude based):

$$s_i = \frac{|w_i|}{\max_j |w_j|}$$

Or, based on the activation values flowing through each input channel:

$$s_i = \mathbb{E}_x\big[\,|x_i|\,\big]$$

4.2 AWQ Algorithm

import torch
import torch.nn as nn
import torch.nn.functional as F

def awq_quantize(W, A, bits=4, alpha=0.5):
    """
    AWQ-style quantization (simplified): activation-aware channel scaling
    followed by round-to-nearest quantization.

    Args:
        W: weight matrix (out_features, in_features)
        A: calibration activations (num_samples, in_features), used to measure sensitivity
        bits: number of quantization bits
        alpha: scaling exponent; AWQ searches it, typical values fall around 0.4-0.7
    """
    # Per-input-channel activation magnitude: channels with large activations are salient
    act_scale = A.abs().mean(dim=0)                      # (in_features,)

    # AWQ-style scaling: s = act_scale^alpha, normalized to keep magnitudes stable.
    # Scaling salient channels up before quantization protects them from rounding error.
    scales = act_scale.clamp(min=1e-5).pow(alpha)
    scales = scales / scales.mean()

    # Apply the per-channel scaling to the weights
    # (the inverse scaling is folded into the preceding activations at inference time)
    W_scaled = W * scales.unsqueeze(0)

    # Symmetric quantization of the scaled weights
    max_val = 2**(bits-1) - 1
    q_scale = W_scaled.abs().max() / max_val
    W_quant = torch.round(W_scaled / q_scale)
    W_quant = torch.clamp(W_quant, -max_val, max_val)

    return W_quant.to(torch.int8), q_scale, scales
 
 
class AWQLinear(nn.Module):
    """
    AWQ-quantized linear layer (simplified sketch: the scaling is calibrated,
    the actual low-bit packing of the weights is omitted).
    """
    def __init__(self, in_features, out_features, bias=False):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        if bias:
            self.bias = nn.Parameter(torch.empty(out_features))
        else:
            self.register_parameter('bias', None)

        # Quantization parameters (filled in during calibration)
        self.scales = None
        self.zero_points = None

    def calibrate(self, dataloader, model):
        """Calibration: collect activation statistics and compute the scaling factors."""
        act_stats = {}

        def hook_fn(module, input, output):
            act = input[0].detach()
            if module not in act_stats:
                act_stats[module] = []
            act_stats[module].append(act)

        hooks = []
        for name, module in model.named_modules():
            if isinstance(module, AWQLinear):
                h = module.register_forward_hook(hook_fn)
                hooks.append(h)

        with torch.no_grad():
            for batch in dataloader:
                model(batch)

        for h in hooks:
            h.remove()

        # Compute the scaling factors from the collected activations
        for module in act_stats:
            A = torch.cat(act_stats[module], dim=0)
            # Per-input-channel activation magnitude drives the AWQ scaling
            # (s = s_X^alpha with alpha = 0.5 here)
            act_scale = A.abs().reshape(-1, module.in_features).mean(dim=0)
            module.scales = act_scale.clamp(min=1e-5).pow(0.5)
            module.scales = module.scales / module.scales.mean()

    def forward(self, x):
        # AWQ re-parameterization: scale weights up and activations down by the
        # same per-channel factor; mathematically equivalent to the original layer,
        # but the scaled weights are what would be stored in 4-bit form.
        weight = self.weight * self.scales.unsqueeze(0)
        return F.linear(x / self.scales, weight, self.bias)
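
A toy run of the simplified `awq_quantize` above (random weights and calibration activations, arbitrary shapes), checking the reconstruction error:

import torch

W = torch.randn(512, 1024)     # (out_features, in_features)
A = torch.randn(2048, 1024)    # calibration activations (samples, in_features)

W_quant, q_scale, channel_scales = awq_quantize(W, A, bits=4, alpha=0.5)

# Undo the quantization and channel scaling to recover the effective weights
W_hat = (W_quant.float() * q_scale) / channel_scales.unsqueeze(0)
print("mean abs error:", (W - W_hat).abs().mean().item())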

5. QLoRA

5.1 Core Idea

QLoRA3 combines 4-bit quantization with low-rank adapters to enable efficient fine-tuning:

4-bit NormalFloat quantization + LoRA + paged optimizers

5.2 4-bit NormalFloat Quantization

**NF4 (4-bit NormalFloat)** is a data type designed specifically for neural network weights:

  • Weights are approximately normally distributed
  • NF4's quantization levels are spaced non-uniformly to match that normal distribution

import torch

class NF4Tensor:
    """4-bit NormalFloat data type (simplified)."""
    def __init__(self, device):
        # 16 non-uniformly spaced quantization levels, denser around zero
        # (illustrative values; the exact NF4 code book used by bitsandbytes
        # also contains 0 and +/-1)
        self.qlevels = torch.tensor([
            -0.9565, -0.8142, -0.6868, -0.5704,
            -0.4592, -0.3512, -0.2448, -0.1390,
            0.1390, 0.2448, 0.3512, 0.4592,
            0.5704, 0.6868, 0.8142, 0.9565
        ], device=device)

    def quantize(self, x):
        """Map each value (normalized to roughly [-1, 1]) to the index of its nearest level."""
        dist = (x.flatten().unsqueeze(-1) - self.qlevels).abs()
        idx = dist.argmin(dim=-1)
        return idx.view_as(x)

    def dequantize(self, indices):
        """Look up the stored NF4 level for each index."""
        return self.qlevels[indices]
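
In QLoRA this code book is applied block-wise: each block of weights (typically 64 values) is normalized by its absolute maximum, mapped to 4-bit indices, and the per-block constants are stored (and quantized again under double quantization). A small sketch using the class above, with an illustrative block size:

import torch

nf4 = NF4Tensor(device="cpu")
w = torch.randn(4096)
blocksize = 64

blocks = w.view(-1, blocksize)                     # (num_blocks, blocksize)
absmax = blocks.abs().max(dim=1, keepdim=True)[0]  # one constant per block
idx = nf4.quantize(blocks / absmax)                # 4-bit indices (packed 2 per byte in practice)
w_hat = nf4.dequantize(idx) * absmax               # reconstruct

print("max abs error:", (w - w_hat.view(-1)).abs().max().item())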

5.3 QLoRA Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import bitsandbytes as bnb

class QLoRALinear(nn.Module):
    """
    QLoRA: Quantized Low-Rank Adaptation (simplified sketch)
    """
    def __init__(self, in_features, out_features, rank=4, lora_alpha=16):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.rank = rank
        self.lora_alpha = lora_alpha

        # Frozen base weight stored as 4-bit NF4 (bitsandbytes packs it
        # when the parameter is moved to a CUDA device)
        self.weight = bnb.nn.Params4bit(
            torch.empty(out_features, in_features),
            requires_grad=False,
            quant_type="nf4"
        )
        self.bias = None

        # LoRA adapters kept in higher precision (the only trainable parameters)
        self.lora_A = nn.Parameter(torch.randn(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

        self.scaling = self.lora_alpha / self.rank

    def forward(self, x):
        # Dequantize the 4-bit base weight to FP16 for the matmul
        # (schematic; assumes the module is on GPU so the weight has been packed.
        # bnb.nn.Linear4bit handles this internally in practice)
        weight = bnb.functional.dequantize_4bit(
            self.weight.data, self.weight.quant_state
        ).to(torch.float16)

        # Low-rank update: B @ A, scaled by alpha / r
        lora_update = self.lora_B @ self.lora_A * self.scaling

        return F.linear(x, weight + lora_update, self.bias)
 
 
# Using the bitsandbytes integration in Hugging Face transformers
from transformers import BitsAndBytesConfig

# 4-bit quantization configuration for loading a model in NF4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True  # double quantization of the quantization constants
)
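
Putting the pieces together, a typical QLoRA setup loads the base model with the 4-bit config above and attaches LoRA adapters. The sketch below uses the Hugging Face transformers and peft libraries; the model name, target modules, and LoRA hyperparameters are placeholders:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "meta-llama/Llama-2-7b-hf"   # placeholder; any causal LM works

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,   # the BitsAndBytesConfig defined above
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # which projections receive adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()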

6. Method Comparison

| Method | Bit width | Approach | Accuracy loss | Compute overhead |
| --- | --- | --- | --- | --- |
| Dynamic quantization | INT8 | dynamic, at inference time | | |
| Static quantization | INT8 | offline calibration | | moderate |
| GPTQ | INT4 | OBC-based | | moderate |
| AWQ | INT4 | activation-aware scaling | | moderate |
| QLoRA | INT4 + LoRA | NF4 + adapters | very low | high (during fine-tuning) |

7. Quantization-Aware Training (QAT)

7.1 Basic Principle

Quantization effects are simulated during training so that the model adapts to low precision:

import torch
import torch.nn as nn

class FakeQuantize(nn.Module):
    """Fake-quantization module."""
    def __init__(self, num_bits=8):
        super().__init__()
        self.num_bits = num_bits
        self.scale = None
        self.zero_point = None

    def forward(self, x):
        if not self.training:
            return x

        # STE (Straight-Through Estimator):
        # forward pass uses the quantized values,
        # backward pass treats quantization as the identity function
        qmin = -(2 ** (self.num_bits - 1))
        qmax = 2 ** (self.num_bits - 1) - 1

        # Compute scale and zero point (symmetric, so the zero point is 0)
        self.scale = x.detach().abs().max() / qmax
        self.zero_point = 0

        # Quantize
        x_quant = torch.round(x / self.scale).clamp(qmin, qmax)
        # Dequantize
        x_dequant = x_quant * self.scale

        # Straight-through: forward value is x_dequant, gradient flows as identity
        return x + (x_dequant - x).detach()
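
To check that gradients really flow through the straight-through estimator, the toy snippet below fake-quantizes a full-precision weight inside a forward pass (the layer shapes and data are arbitrary):

import torch
import torch.nn as nn
import torch.nn.functional as F

fq = FakeQuantize(num_bits=8)
fq.train()

weight = nn.Parameter(torch.randn(16, 32))   # full-precision "shadow" weight
x = torch.randn(8, 32)

# Forward with fake-quantized weights; the backward pass goes through the STE
y = F.linear(x, fq(weight))
loss = y.pow(2).mean()
loss.backward()

print("weight grad is non-zero:", weight.grad.abs().sum().item() > 0)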

7.2 Learnable Quantization

Methods such as LSQ (Learnable Step Size Quantization) make the quantization parameters themselves learnable:

class LSQQuantize(nn.Module):
    """
    LSQ: Learnable Step Size Quantization (simplified; the original paper
    additionally rescales the gradient of the step size)
    """
    def __init__(self, num_bits=4):
        super().__init__()
        self.num_bits = num_bits
        # Learnable scaling factor, parameterized in log space so it stays positive
        self.logScale = nn.Parameter(torch.zeros(1))

    @property
    def scale(self):
        return torch.exp(self.logScale)

    def forward(self, x):
        # Quantization range for signed integers
        Q_n = -(2 ** (self.num_bits - 1))
        Q_p = 2 ** (self.num_bits - 1) - 1

        x_scaled = x / self.scale
        x_round = torch.round(x_scaled)
        x_clip = x_round.clamp(Q_n, Q_p)

        # STE: forward uses the rounded/clipped values, backward treats them as x_scaled
        x_quant = x_clip.detach() + x_scaled - x_scaled.detach()

        return x_quant * self.scale

8. Practical Guidelines

8.1 Choosing a Quantization Method

| Scenario | Recommended method |
| --- | --- |
| Inference deployment (4-bit) | GPTQ / AWQ |
| Fine-tuning | QLoRA |
| Extreme compression (2-bit) | GGUF / GPTQ variants |
| Quick experiments | Dynamic quantization |

8.2 Quantization Configuration

# Recommended AWQ configuration
awq_config = {
    "bits": 4,
    "group_size": 128,  # 128 is a common choice
    "zero_point": True,
    "activation_scheme": "per_token"
}

# Recommended GPTQ configuration
gptq_config = {
    "bits": 4,
    "group_size": 128,
    "desc_act": True,  # act-order: quantize columns in decreasing activation order
    "static_groups": False
}

8.3 Post-Quantization Evaluation

| Check | How |
| --- | --- |
| Perplexity change | Evaluate PPL (see the sketch below) |
| Downstream tasks | Standard benchmarks |
| Generation quality | Human evaluation / automatic metrics |
| Numeric range | Check for overflow |
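
A minimal perplexity check might look as follows (a hedged sketch: it assumes a Hugging Face causal LM and tokenizer, and `model_fp16`, `model_int4`, and `eval_texts` are placeholders; the per-text averaging is approximate):

import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, texts, device="cuda"):
    """Approximate average per-token perplexity over a list of strings."""
    nll, n_tokens = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt").to(device)
        out = model(**enc, labels=enc["input_ids"])   # causal LM loss = mean NLL per token
        n = enc["input_ids"].numel()
        nll += out.loss.item() * n
        n_tokens += n
    return math.exp(nll / n_tokens)

ppl_fp16 = perplexity(model_fp16, tokenizer, eval_texts)
ppl_int4 = perplexity(model_int4, tokenizer, eval_texts)
print(f"PPL fp16: {ppl_fp16:.2f} | PPL int4: {ppl_int4:.2f}")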

9. References


Footnotes

  1. Frantar E, Ashkboos S, Hoefler T, Alistarh D. GPTQ: Accurate post-training quantization for generative pre-trained transformers. ICLR, 2023. arXiv:2210.17323

  2. Lin J, Tang J, Tang H, et al. AWQ: Activation-aware weight quantization for LLM compression. arXiv:2306.00978, 2023.

  3. Dettmers T, Pagnoni A, Holtzman A, et al. QLoRA: Efficient finetuning of quantized LLMs. NeurIPS, 2023. arXiv:2305.14314