Model Quantization Techniques
Model quantization converts model weights and activations from high-precision (FP32) to low-precision representations (INT8/INT4/FP4), and is one of the core techniques for compressing large models.
1. Quantization Basics
1.1 Definition
Quantization maps continuous values to discrete values:

$$x_q = \mathrm{clamp}\left(\mathrm{round}\left(\frac{x}{s}\right) + z,\ 0,\ 2^b - 1\right)$$

where:
- $s$: scaling factor (scale)
- $z$: zero point
- $b$: bit width

Dequantization:

$$\hat{x} = s \cdot (x_q - z)$$
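To make the mapping concrete, here is a minimal sketch of the quantize/dequantize pair defined above (the helper names are illustrative, not from any library):

import torch

def quantize(x, bits=8):
    # affine quantization: x_q = clamp(round(x / s) + z, 0, 2^b - 1)
    qmax = 2 ** bits - 1
    s = (x.max() - x.min()) / qmax           # scale
    z = torch.round(-x.min() / s)            # zero point
    x_q = torch.clamp(torch.round(x / s) + z, 0, qmax)
    return x_q, s, z

def dequantize(x_q, s, z):
    # x_hat = s * (x_q - z)
    return s * (x_q - z)

x = torch.randn(1000)
x_q, s, z = quantize(x)
print((x - dequantize(x_q, s, z)).abs().max())  # worst-case rounding error ~ s/2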
1.2 Quantization Types
| Type | Formula | Characteristics |
|---|---|---|
| Symmetric | $x_q = \mathrm{round}(x / s)$ | Zero point is 0; suits symmetrically distributed data |
| Asymmetric | $x_q = \mathrm{round}(x / s) + z$ | More flexible, but more complex to implement |

Symmetric quantization:

$$s = \frac{\max|x|}{2^{b-1} - 1}, \qquad x_q = \mathrm{clamp}\left(\mathrm{round}\left(\frac{x}{s}\right),\ -2^{b-1},\ 2^{b-1} - 1\right)$$

Asymmetric quantization:

$$s = \frac{x_{\max} - x_{\min}}{2^b - 1}, \qquad z = \mathrm{round}\left(-\frac{x_{\min}}{s}\right), \qquad x_q = \mathrm{round}\left(\frac{x}{s}\right) + z$$
1.3 Quantization Error Analysis
The error introduced by quantization:

$$\epsilon = \hat{x} - x$$

Mean squared error (for a uniform quantizer with step size $s$, approximately $s^2 / 12$):

$$\mathrm{MSE} = \mathbb{E}\left[(\hat{x} - x)^2\right]$$

SNR (signal-to-noise ratio):

$$\mathrm{SNR} = 10 \log_{10} \frac{\mathbb{E}[x^2]}{\mathbb{E}[\epsilon^2]} \ \text{(dB)}$$
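A quick empirical check of these quantities, reusing the quantize/dequantize helpers sketched in section 1.1:

import torch

x = torch.randn(100_000)
x_q, s, z = quantize(x, bits=8)
eps = dequantize(x_q, s, z) - x
mse = eps.pow(2).mean()
snr_db = 10 * torch.log10(x.pow(2).mean() / eps.pow(2).mean())
print(f"MSE={mse:.2e} (~s^2/12={s ** 2 / 12:.2e}), SNR={snr_db:.1f} dB")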
2. Post-Training Quantization (PTQ)
2.1 Taxonomy
Post-Training Quantization (PTQ)
├── Dynamic quantization
│   └── Activations quantized on the fly at inference time
├── Static quantization
│   └── Offline calibration
└── Mixed-precision quantization
    └── Different precision for different layers
2.2 Dynamic Quantization
The simplest form of quantization: weights are quantized offline, while activations are quantized dynamically at inference time:
import torch
import torch.quantization

# dynamic quantization (INT8 weights; activations kept in FP32
# and quantized on the fly, per batch)
model_dynamic = torch.quantization.quantize_dynamic(
    model,               # original FP32 model
    {torch.nn.Linear},   # layer types to quantize
    dtype=torch.qint8    # target precision
)
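One quick sanity check is to compare serialized sizes before and after (a rough proxy; the exact saving depends on which layers were quantized):

import os, tempfile

def state_dict_size_mb(m):
    # serialize the state dict and measure its on-disk size
    fd, path = tempfile.mkstemp()
    os.close(fd)
    torch.save(m.state_dict(), path)
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

print(state_dict_size_mb(model), "->", state_dict_size_mb(model_dynamic), "MB")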
2.3 Static Quantization
Static quantization requires a calibration dataset to determine the scaling factors:
# static quantization config
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(model, inplace=True)
# calibration
with torch.no_grad():
    for batch in calibration_data:
        model(batch)
# conversion
model_int8 = torch.quantization.convert(model, inplace=False)
3. GPTQ
3.1 Core Idea
GPTQ[^1] uses the **Optimal Brain Compression (OBC)** framework for 4-bit quantization.
Core algorithm:
The weight matrix $W$ is quantized column by column. For each column:
- compute the inverse Hessian $H^{-1}$, where $H = 2XX^\top$ is estimated from calibration inputs $X$
- select the subset of weights with the smallest quantization error
- compensate the quantization error exactly on the remaining weights via a closed-form update
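The code in the next section only shows the round-to-nearest step; for the compensation itself, here is a heavily simplified sketch of the closed-form update, assuming the inverse Hessian has already been computed from calibration data (the paper adds a Cholesky factorization and lazy batched updates on top):

import torch

def gptq_compensate(W, H_inv, bits=4):
    # Quantize columns left to right, pushing each column's quantization
    # error onto the remaining columns via the inverse Hessian (closed form).
    W = W.clone()
    Q = torch.zeros_like(W)
    max_q = 2 ** (bits - 1) - 1
    scale = W.abs().max() / max_q                 # single scale, for brevity
    for i in range(W.shape[1]):
        w = W[:, i]
        q = torch.clamp(torch.round(w / scale), -max_q - 1, max_q) * scale
        Q[:, i] = q
        err = (w - q) / H_inv[i, i]
        # closed-form compensation of the not-yet-quantized columns
        W[:, i + 1:] -= err.unsqueeze(1) * H_inv[i, i + 1:].unsqueeze(0)
    return Q

# X: calibration inputs (n_samples, in_features); damping keeps H invertible:
# H = 2 * X.T @ X + lam * torch.eye(X.shape[1]); H_inv = torch.linalg.inv(H)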
3.2 The GPTQ Algorithm
import torch

def gptq_quantize(W, bits=4, per_channel=True):
    """
    Simplified quantization step (round-to-nearest baseline; full GPTQ
    additionally applies the Hessian-based error compensation above).
    Args:
        W: weight matrix (out_features, in_features)
        bits: bit width
        per_channel: quantize per channel (row) instead of per tensor
    """
    # scaling factors (symmetric quantization, so the zero point is 0)
    if per_channel:
        # per-channel: one scale per row
        max_val = W.abs().max(dim=1, keepdim=True)[0]
    else:
        # per-tensor: a single scale
        max_val = W.abs().max()
    scales = (max_val / (2 ** (bits - 1))).clamp(min=1e-8)
    # quantize
    W_quant = torch.round(W / scales)
    W_quant = torch.clamp(W_quant, -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    # dequantization, when needed, is W_quant * scales
    return W_quant.to(torch.int8), scales.to(torch.float16)
class GPTQ:
    """
    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
    """
    def __init__(self, model, bits=4, groupsize=-1):
        self.model = model
        self.bits = bits
        self.groupsize = groupsize  # -1 means per-layer quantization

    def quantize_layer(self, layer, name):
        """Quantize a single layer"""
        # extract the weight
        W = layer.weight.data.clone()
        if W.dim() != 2:
            raise ValueError(f"{name}: expected a 2D weight matrix")
        if self.groupsize == -1:
            # per-layer quantization
            W_quant, scales = self._quantize_tensor(W)
        else:
            # per-group quantization
            W_quant, scales = self._quantize_groups(W, self.groupsize)
        # returned in integer format for storage
        return W_quant, scales

    def _quantize_tensor(self, W):
        """Quantize the whole tensor with a single scale"""
        scales = (W.abs().max() / (2 ** (self.bits - 1))).clamp(min=1e-8)
        W_quant = torch.round(W / scales)
        W_quant = torch.clamp(W_quant, -(2 ** (self.bits - 1)), 2 ** (self.bits - 1) - 1)
        return W_quant, scales

    def _quantize_groups(self, W, groupsize):
        """Per-group quantization along the input dimension"""
        out_features, in_features = W.shape
        num_groups = (in_features + groupsize - 1) // groupsize  # round up
        W_quant = torch.zeros_like(W)
        scales = torch.zeros(out_features, num_groups, device=W.device)
        for g in range(num_groups):
            start = g * groupsize
            end = min((g + 1) * groupsize, in_features)
            W_g = W[:, start:end]
            scale_g = (W_g.abs().max(dim=1, keepdim=True)[0] / (2 ** (self.bits - 1))).clamp(min=1e-8)
            W_quant[:, start:end] = torch.round(W_g / scale_g)
            W_quant[:, start:end] = torch.clamp(
                W_quant[:, start:end],
                -(2 ** (self.bits - 1)),
                2 ** (self.bits - 1) - 1
            )
            scales[:, g] = scale_g.squeeze(-1)
        return W_quant, scales
3.3 GPTQ Variants
| Version | Improvement | Reference |
|---|---|---|
| GPTQ | base OBC framework | arXiv:2210.17323 |
| AutoGPTQ | integrated tooling | open-source library |
| GPTQ-for-llama | LLaMA-specific optimizations | GitHub |
4. AWQ (Activation-Aware Weight Quantization)
4.1 Core Idea
AWQ[^2] observes that weights differ in how sensitive they are to quantization error:
Not all weights are equally important; protecting the salient weights significantly reduces quantization error.
Sensitivity measure (weight-based):

$$s = \left(\frac{|W|}{\max|W|}\right)^{\alpha}$$

or based on activation magnitudes:

$$s_j = \left(\frac{\mathbb{E}\left[|x_j|\right]}{\max_j \mathbb{E}\left[|x_j|\right]}\right)^{\alpha}, \qquad \alpha \in [0.4,\ 0.7]$$
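In practice AWQ does not fix the exponent: it grid-searches $\alpha$ per layer to minimize the layer's output error. A minimal sketch of that search, where `quantize_fn` is a stand-in for any scale-then-quantize routine such as the one in the next section:

import torch

def search_alpha(W, X, quantize_fn, grid=torch.arange(0.0, 1.05, 0.1)):
    # Grid-search the scaling exponent alpha to minimize output MSE.
    # quantize_fn(W, X, alpha) -> dequantized weight matrix.
    best_alpha, best_err = None, float("inf")
    ref = X @ W.T                                  # FP reference output
    for alpha in grid:
        err = (X @ quantize_fn(W, X, alpha.item()).T - ref).pow(2).mean()
        if err < best_err:
            best_alpha, best_err = alpha.item(), err
    return best_alpha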
4.2 The AWQ Algorithm
import torch
import torch.nn as nn
import torch.nn.functional as F

def awq_quantize(W, A, bits=4, alpha=0.5):
    """
    AWQ: Activation-Aware Weight Quantization (simplified sketch)
    Args:
        W: weight matrix (out_features, in_features)
        A: activations (n_samples, in_features), used to measure saliency
        bits: bit width
        alpha: scaling exponent, typically in [0.4, 0.7]
    """
    # Per-input-channel scaling factors from activation magnitudes:
    # s_j = E[|x_j|]^alpha, normalized around 1 -- salient channels get s_j > 1
    act_mag = A.abs().mean(dim=0).clamp(min=1e-4)
    scales = act_mag.pow(alpha)
    scales = scales / (scales.max() * scales.min()).sqrt()
    # Scale up the salient input channels before quantization
    # (at inference the inverse scaling is folded into the preceding layer)
    W_scaled = W * scales
    # symmetric quantization per output channel
    max_val = 2 ** (bits - 1) - 1
    q_scale = (W_scaled.abs().max(dim=1, keepdim=True)[0] / max_val).clamp(min=1e-8)
    W_quant = torch.clamp(torch.round(W_scaled / q_scale), -max_val - 1, max_val)
    # dequantize: undo both the quantization scale and the channel scaling
    W_dequant = W_quant * q_scale / scales
    return W_quant.to(torch.int8), q_scale, scales
class AWQLinear(nn.Module):
    """
    AWQ-quantized linear layer (sketch)
    """
    def __init__(self, in_features, out_features, bias=False):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        if bias:
            self.bias = nn.Parameter(torch.empty(out_features))
        else:
            self.register_parameter('bias', None)
        # quantization parameters
        self.scales = None
        self.zero_points = None

    def calibrate(self, dataloader, model):
        """Calibration: derive the per-channel scaling factors"""
        # collect activation statistics
        act_stats = {}
        def hook_fn(module, input, output):
            act = input[0].detach().flatten(0, -2)   # (N, in_features)
            act_stats.setdefault(module, []).append(act)
        hooks = []
        for name, module in model.named_modules():
            if isinstance(module, AWQLinear):
                hooks.append(module.register_forward_hook(hook_fn))
        with torch.no_grad():
            for batch in dataloader:
                model(batch)
        for h in hooks:
            h.remove()
        # compute the scaling factors with the AWQ formula
        for module, acts in act_stats.items():
            act_mag = torch.cat(acts, dim=0).abs().mean(dim=0).clamp(min=1e-4)
            module.scales = act_mag.pow(0.5)         # alpha = 0.5

    def forward(self, x):
        # Fold the channel scales: (x / s) @ (W * s)^T equals x @ W^T in FP,
        # but W * s quantizes with less error on the salient channels.
        weight = self.weight * self.scales
        return F.linear(x / self.scales, weight, self.bias)
5. QLoRA
5.1 Core Idea
QLoRA[^3] combines 4-bit quantization with low-rank adapters for efficient fine-tuning:
4-bit NormalFloat quantization + LoRA + paged optimizers
5.2 4-bit NormalFloat Quantization
The **NF4 (4-bit NormalFloat)** data type is designed specifically for neural-network weights:
- weights approximately follow a normal distribution
- NF4's quantization levels are spaced non-uniformly according to that normal distribution (see the construction sketch below)
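Conceptually, NormalFloat levels are (normalized) quantiles of a standard normal distribution. The following sketch shows the idea; the actual NF4 table used by bitsandbytes is built slightly differently (it is asymmetric and pins 0 and ±1 exactly):

import torch

# hypothetical helper, for illustration only
def normalfloat_levels(k=16):
    normal = torch.distributions.Normal(0.0, 1.0)
    # probabilities strictly inside (0, 1) so the inverse CDF stays finite
    probs = torch.linspace(1.0 / (2 * k), 1.0 - 1.0 / (2 * k), k)
    levels = normal.icdf(probs)
    return levels / levels.abs().max()   # normalize to [-1, 1]

print(normalfloat_levels())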
class NF4Tensor:
    """4-bit NormalFloat data type (sketch)"""
    def __init__(self, device):
        # the 16 NF4 quantization levels (non-uniform, asymmetric;
        # values as used in bitsandbytes, rounded to 4 decimals)
        self.qlevels = torch.tensor([
            -1.0000, -0.6962, -0.5251, -0.3949,
            -0.2844, -0.1848, -0.0911,  0.0000,
             0.0796,  0.1609,  0.2461,  0.3379,
             0.4407,  0.5626,  0.7230,  1.0000
        ], device=device)

    def quantize(self, x):
        """Quantize to NF4 (x is assumed normalized to [-1, 1],
        e.g. by the per-block absmax as in QLoRA)"""
        # index of the nearest quantization level
        x_flat = x.flatten()
        idx = (x_flat.unsqueeze(-1) - self.qlevels).abs().argmin(dim=-1)
        return idx.view_as(x)

    def dequantize(self, indices):
        """Dequantize: look the levels back up"""
        return self.qlevels[indices]
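A quick round trip with this class; the input is pre-normalized to [-1, 1] by absmax, matching the assumption noted in quantize:

nf4 = NF4Tensor(device="cpu")
w = torch.randn(4, 4)
w = w / w.abs().max()                    # absmax normalization
idx = nf4.quantize(w)
print((w - nf4.dequantize(idx)).abs().max())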
5.3 QLoRA Implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
import bitsandbytes as bnb

class QLoRALinear(nn.Module):
    """
    QLoRA: Quantized Low-Rank Adaptation (sketch)
    """
    def __init__(self, in_features, out_features, rank=4, lora_alpha=16):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.rank = rank
        self.lora_alpha = lora_alpha
        # 4-bit NF4 base weight, kept frozen (bitsandbytes quantizes it
        # when the module is moved to the GPU)
        self.weight = bnb.nn.Params4bit(
            torch.empty(out_features, in_features),
            requires_grad=False,
            quant_type="nf4"
        )
        self.register_parameter('bias', None)
        # LoRA adapters (kept in higher precision and trained)
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = self.lora_alpha / self.rank

    def forward(self, x):
        # dequantize the 4-bit weight for the matmul
        # (valid once the module is on GPU and quant_state is populated)
        weight = bnb.functional.dequantize_4bit(
            self.weight.data, self.weight.quant_state
        ).to(x.dtype)
        # LoRA update: W + (alpha / r) * B A
        lora_update = self.lora_B @ self.lora_A * self.scaling
        return F.linear(x, weight + lora_update, self.bias)
# using the bitsandbytes integration in transformers
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# load a model with 4-bit NF4 quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True  # double quantization of the quant constants
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",      # example model id
    quantization_config=bnb_config
)
6. Method Comparison
| Method | Bit width | Scheme | Accuracy loss | Compute overhead |
|---|---|---|---|---|
| Dynamic quantization | INT8 | dynamic | low | low |
| Static quantization | INT8 | calibration | moderate | low |
| GPTQ | INT4 | OBC | low | moderate |
| AWQ | INT4 | activation-aware | low | moderate |
| QLoRA | INT4 + LoRA | NF4 + adapters | very low | high (during fine-tuning) |
7. Quantization-Aware Training (QAT)
7.1 Basic Principle
QAT simulates quantization effects during training so the model adapts to low precision:
import torch
import torch.nn as nn

class FakeQuantize(nn.Module):
    """Fake-quantization module"""
    def __init__(self, num_bits=8):
        super().__init__()
        self.num_bits = num_bits
        self.scale = None
        self.zero_point = None

    def forward(self, x):
        if not self.training:
            return x
        qmin = -(2 ** (self.num_bits - 1))
        qmax = 2 ** (self.num_bits - 1) - 1
        # compute scale and zero point (symmetric, so zero_point = 0)
        self.scale = (x.abs().max() / qmax).clamp(min=1e-8)
        self.zero_point = 0
        # quantize
        x_quant = torch.round(x / self.scale).clamp(qmin, qmax)
        # dequantize
        x_dequant = x_quant * self.scale
        # STE (Straight-Through Estimator):
        # forward pass uses the quantized value,
        # backward pass treats quantization as the identity
        return x + (x_dequant - x).detach()
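A minimal illustration of using the module during training (a toy two-layer model, with fake quantization applied to the intermediate activations):

model = nn.Sequential(
    nn.Linear(16, 32), FakeQuantize(num_bits=8),
    nn.ReLU(),
    nn.Linear(32, 1)
)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(8, 16), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()     # gradients flow through the STE
opt.step()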
7.2 Learnable Quantization
LSQ (Learnable Step Size Quantization) and related methods make the quantization parameters themselves learnable:
class LSQQuantize(nn.Module):
    """
    LSQ: Learnable Step Size Quantization (simplified; the paper
    additionally rescales the step-size gradient)
    """
    def __init__(self, num_bits=4):
        super().__init__()
        self.num_bits = num_bits
        # learnable scaling factor (log-parameterized to stay positive)
        self.logScale = nn.Parameter(torch.zeros(1))

    @property
    def scale(self):
        return torch.exp(self.logScale)

    def forward(self, x):
        # quantization range
        Q_n = -(2 ** (self.num_bits - 1))
        Q_p = 2 ** (self.num_bits - 1) - 1
        x_scaled = x / self.scale
        x_round = torch.round(x_scaled)
        x_clip = x_round.clamp(Q_n, Q_p)
        # STE: forward uses the rounded value, backward the unrounded one
        x_quant = x_clip.detach() + x_scaled - x_scaled.detach()
        return x_quant * self.scale
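Because the step size appears inside the STE expression, it receives gradients from the task loss. A quick check:

q = LSQQuantize(num_bits=4)
x = torch.randn(8, requires_grad=True)
q(x).sum().backward()
print(q.logScale.grad)   # non-zero: the scale is trained jointly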
8. Practical Guide
8.1 Choosing a Quantization Method
| Scenario | Recommended method |
|---|---|
| Inference deployment (4-bit) | GPTQ / AWQ |
| Fine-tuning | QLoRA |
| Extreme compression (2-bit) | GGUF / GPTQ variants |
| Quick experiments | dynamic quantization |
8.2 Quantization Configuration
# recommended AWQ configuration
quant_config = {
    "bits": 4,
    "group_size": 128,      # 128 is the common choice
    "zero_point": True,
    "activation_scheme": "per_token"
}
# recommended GPTQ configuration
quant_config = {
    "bits": 4,
    "group_size": 128,
    "desc_act": True,       # act-order: quantize columns by activation magnitude
    "static_groups": False
}
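As one concrete way to apply the GPTQ configuration, a sketch against the AutoGPTQ library (parameter names mirror the dict above; consult the library's docs for the exact current API):

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=True)
model = AutoGPTQForCausalLM.from_pretrained("facebook/opt-125m", quantize_config)
model.quantize(calibration_examples)   # list of tokenized calibration samples
model.save_quantized("opt-125m-4bit")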
8.3 Post-Quantization Evaluation
| Check | Method |
|---|---|
| Perplexity change | evaluate PPL |
| Downstream tasks | standard benchmarks |
| Generation quality | human evaluation / automatic metrics |
| Numeric range | check for overflow |
9. References
Footnotes

[^1]: Frantar E, Ashkboos S, Hoefler T, Alistarh D. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR, 2023. arXiv:2210.17323
[^2]: Lin J, Tang J, Tang H, et al. AWQ: Activation-aware Weight Quantization for LLM Compression. arXiv:2306.00978, 2023.
[^3]: Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS, 2023. arXiv:2305.14314