Model Quantization Techniques
Model quantization converts model weights and activations from high-precision (FP32) to low-precision representations (INT8/INT4/FP4), and is one of the core techniques for compressing large models.
1. Quantization Basics
1.1 Definition
Quantization maps continuous values to discrete values:

$$x_q = \mathrm{clamp}\left(\mathrm{round}\left(\frac{x}{s}\right) + z,\ 0,\ 2^b - 1\right)$$

where:
- $s$: scaling factor (scale)
- $z$: zero point
- $b$: bit width

Dequantization:

$$\hat{x} = s \cdot (x_q - z)$$
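To make the mapping concrete, here is a minimal sketch of the quantize/dequantize pair defined above (the helper names are illustrative, not from any library):

import torch

def quantize(x, bits=8):
    # affine quantization: x_q = clamp(round(x / s) + z, 0, 2^b - 1)
    qmax = 2 ** bits - 1
    s = (x.max() - x.min()) / qmax           # scale
    z = torch.round(-x.min() / s)            # zero point
    x_q = torch.clamp(torch.round(x / s) + z, 0, qmax)
    return x_q, s, z

def dequantize(x_q, s, z):
    # x_hat = s * (x_q - z)
    return s * (x_q - z)

x = torch.randn(1000)
x_q, s, z = quantize(x)
print((x - dequantize(x_q, s, z)).abs().max())  # worst-case rounding error ~ s/2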
1.2 Quantization Types
| Type | Formula | Characteristics |
|---|---|---|
| Symmetric | $x_q = \mathrm{round}(x / s)$ | Zero point is 0; suits symmetrically distributed data |
| Asymmetric | $x_q = \mathrm{round}(x / s) + z$ | More flexible, but more complex to implement |

Symmetric quantization:

$$s = \frac{\max|x|}{2^{b-1} - 1}, \qquad x_q = \mathrm{clamp}\left(\mathrm{round}\left(\frac{x}{s}\right),\ -2^{b-1},\ 2^{b-1} - 1\right)$$

Asymmetric quantization:

$$s = \frac{x_{\max} - x_{\min}}{2^b - 1}, \qquad z = \mathrm{round}\left(-\frac{x_{\min}}{s}\right), \qquad x_q = \mathrm{round}\left(\frac{x}{s}\right) + z$$
1.3 Quantization Error Analysis
The error introduced by quantization:

$$\epsilon = \hat{x} - x$$

Mean squared error (for a uniform quantizer with step size $s$, approximately $s^2 / 12$):

$$\mathrm{MSE} = \mathbb{E}\left[(\hat{x} - x)^2\right]$$

SNR (signal-to-noise ratio):

$$\mathrm{SNR} = 10 \log_{10} \frac{\mathbb{E}[x^2]}{\mathbb{E}[\epsilon^2]} \ \text{(dB)}$$
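A quick empirical check of these quantities, reusing the quantize/dequantize helpers sketched in section 1.1:

import torch

x = torch.randn(100_000)
x_q, s, z = quantize(x, bits=8)
eps = dequantize(x_q, s, z) - x
mse = eps.pow(2).mean()
snr_db = 10 * torch.log10(x.pow(2).mean() / eps.pow(2).mean())
print(f"MSE={mse:.2e} (~s^2/12={s ** 2 / 12:.2e}), SNR={snr_db:.1f} dB")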
2. Post-Training Quantization (PTQ)
2.1 Taxonomy
Post-Training Quantization (PTQ)
├── Dynamic quantization
│   └── Activations quantized on the fly at inference time
├── Static quantization
│   └── Offline calibration
└── Mixed-precision quantization
    └── Different precision for different layers
2.2 Dynamic Quantization
The simplest form of quantization: weights are quantized offline, while activations are quantized dynamically at inference time:
import torch
import torch.quantization

# dynamic quantization (INT8 weights; activations kept in FP32
# and quantized on the fly, per batch)
model_dynamic = torch.quantization.quantize_dynamic(
    model,               # original FP32 model
    {torch.nn.Linear},   # layer types to quantize
    dtype=torch.qint8    # target precision
)
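One quick sanity check is to compare serialized sizes before and after (a rough proxy; the exact saving depends on which layers were quantized):

import os, tempfile

def state_dict_size_mb(m):
    # serialize the state dict and measure its on-disk size
    fd, path = tempfile.mkstemp()
    os.close(fd)
    torch.save(m.state_dict(), path)
    size = os.path.getsize(path) / 1e6
    os.remove(path)
    return size

print(state_dict_size_mb(model), "->", state_dict_size_mb(model_dynamic), "MB")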
2.3 Static Quantization
Static quantization requires a calibration dataset to determine the scaling factors:
# static quantization config
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(model, inplace=True)
# calibration
with torch.no_grad():
    for batch in calibration_data:
        model(batch)
# conversion
model_int8 = torch.quantization.convert(model, inplace=False)
3. GPTQ
3.1 Core Idea
GPTQ[^1] uses the **Optimal Brain Compression (OBC)** framework for 4-bit quantization.
Core algorithm:
The weight matrix $W$ is quantized column by column. For each column:
- compute the inverse Hessian $H^{-1}$, where $H = 2XX^\top$ is estimated from calibration inputs $X$
- select the subset of weights with the smallest quantization error
- compensate the quantization error exactly on the remaining weights via a closed-form update
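The code in the next section only shows the round-to-nearest step; for the compensation itself, here is a heavily simplified sketch of the closed-form update, assuming the inverse Hessian has already been computed from calibration data (the paper adds a Cholesky factorization and lazy batched updates on top):

import torch

def gptq_compensate(W, H_inv, bits=4):
    # Quantize columns left to right, pushing each column's quantization
    # error onto the remaining columns via the inverse Hessian (closed form).
    W = W.clone()
    Q = torch.zeros_like(W)
    max_q = 2 ** (bits - 1) - 1
    scale = W.abs().max() / max_q                 # single scale, for brevity
    for i in range(W.shape[1]):
        w = W[:, i]
        q = torch.clamp(torch.round(w / scale), -max_q - 1, max_q) * scale
        Q[:, i] = q
        err = (w - q) / H_inv[i, i]
        # closed-form compensation of the not-yet-quantized columns
        W[:, i + 1:] -= err.unsqueeze(1) * H_inv[i, i + 1:].unsqueeze(0)
    return Q

# X: calibration inputs (n_samples, in_features); damping keeps H invertible:
# H = 2 * X.T @ X + lam * torch.eye(X.shape[1]); H_inv = torch.linalg.inv(H)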
3.2 The GPTQ Algorithm
import torch

def gptq_quantize(W, bits=4, per_channel=True):
    """
    Simplified quantization step (round-to-nearest baseline; full GPTQ
    additionally applies the Hessian-based error compensation above).
    Args:
        W: weight matrix (out_features, in_features)
        bits: bit width
        per_channel: quantize per channel (row) instead of per tensor
    """
    # scaling factors (symmetric quantization, so the zero point is 0)
    if per_channel:
        # per-channel: one scale per row
        max_val = W.abs().max(dim=1, keepdim=True)[0]
    else:
        # per-tensor: a single scale
        max_val = W.abs().max()
    scales = (max_val / (2 ** (bits - 1))).clamp(min=1e-8)
    # quantize
    W_quant = torch.round(W / scales)
    W_quant = torch.clamp(W_quant, -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    # dequantization, when needed, is W_quant * scales
    return W_quant.to(torch.int8), scales.to(torch.float16)
class GPTQ:
    """
    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
    """
    def __init__(self, model, bits=4, groupsize=-1):
        self.model = model
        self.bits = bits
        self.groupsize = groupsize  # -1 means per-layer quantization

    def quantize_layer(self, layer, name):
        """Quantize a single layer"""
        # extract the weight
        W = layer.weight.data.clone()
        if W.dim() != 2:
            raise ValueError(f"{name}: expected a 2D weight matrix")
        if self.groupsize == -1:
            # per-layer quantization
            W_quant, scales = self._quantize_tensor(W)
        else:
            # per-group quantization
            W_quant, scales = self._quantize_groups(W, self.groupsize)
        # returned in integer format for storage
        return W_quant, scales

    def _quantize_tensor(self, W):
        """Quantize the whole tensor with a single scale"""
        scales = (W.abs().max() / (2 ** (self.bits - 1))).clamp(min=1e-8)
        W_quant = torch.round(W / scales)
        W_quant = torch.clamp(W_quant, -(2 ** (self.bits - 1)), 2 ** (self.bits - 1) - 1)
        return W_quant, scales

    def _quantize_groups(self, W, groupsize):
        """Per-group quantization along the input dimension"""
        out_features, in_features = W.shape
        num_groups = (in_features + groupsize - 1) // groupsize  # round up
        W_quant = torch.zeros_like(W)
        scales = torch.zeros(out_features, num_groups, device=W.device)
        for g in range(num_groups):
            start = g * groupsize
            end = min((g + 1) * groupsize, in_features)
            W_g = W[:, start:end]
            scale_g = (W_g.abs().max(dim=1, keepdim=True)[0] / (2 ** (self.bits - 1))).clamp(min=1e-8)
            W_quant[:, start:end] = torch.round(W_g / scale_g)
            W_quant[:, start:end] = torch.clamp(
                W_quant[:, start:end],
                -(2 ** (self.bits - 1)),
                2 ** (self.bits - 1) - 1
            )
            scales[:, g] = scale_g.squeeze(-1)
        return W_quant, scales
3.3 GPTQ Variants
| Version | Improvement | Reference |
|---|---|---|
| GPTQ | base OBC framework | arXiv:2210.17323 |
| AutoGPTQ | integrated tooling | open-source library |
| GPTQ-for-llama | LLaMA-specific optimizations | GitHub |
4. AWQ (Activation-Aware Weight Quantization)
4.1 Core Idea
AWQ[^2] observes that weights differ in how sensitive they are to quantization error:
Not all weights are equally important; protecting the salient weights significantly reduces quantization error.
Sensitivity measure (weight-based):

$$s = \left(\frac{|W|}{\max|W|}\right)^{\alpha}$$

or based on activation magnitudes:

$$s_j = \left(\frac{\mathbb{E}\left[|x_j|\right]}{\max_j \mathbb{E}\left[|x_j|\right]}\right)^{\alpha}, \qquad \alpha \in [0.4,\ 0.7]$$
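In practice AWQ does not fix the exponent: it grid-searches $\alpha$ per layer to minimize the layer's output error. A minimal sketch of that search, where `quantize_fn` is a stand-in for any scale-then-quantize routine such as the one in the next section:

import torch

def search_alpha(W, X, quantize_fn, grid=torch.arange(0.0, 1.05, 0.1)):
    # Grid-search the scaling exponent alpha to minimize output MSE.
    # quantize_fn(W, X, alpha) -> dequantized weight matrix.
    best_alpha, best_err = None, float("inf")
    ref = X @ W.T                                  # FP reference output
    for alpha in grid:
        err = (X @ quantize_fn(W, X, alpha.item()).T - ref).pow(2).mean()
        if err < best_err:
            best_alpha, best_err = alpha.item(), err
    return best_alpha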
4.2 The AWQ Algorithm
import torch
import torch.nn as nn
import torch.nn.functional as F

def awq_quantize(W, A, bits=4, alpha=0.5):
    """
    AWQ: Activation-Aware Weight Quantization (simplified sketch)
    Args:
        W: weight matrix (out_features, in_features)
        A: activations (n_samples, in_features), used to measure saliency
        bits: bit width
        alpha: scaling exponent, typically in [0.4, 0.7]
    """
    # Per-input-channel scaling factors from activation magnitudes:
    # s_j = E[|x_j|]^alpha, normalized around 1 -- salient channels get s_j > 1
    act_mag = A.abs().mean(dim=0).clamp(min=1e-4)
    scales = act_mag.pow(alpha)
    scales = scales / (scales.max() * scales.min()).sqrt()
    # Scale up the salient input channels before quantization
    # (at inference the inverse scaling is folded into the preceding layer)
    W_scaled = W * scales
    # symmetric quantization per output channel
    max_val = 2 ** (bits - 1) - 1
    q_scale = (W_scaled.abs().max(dim=1, keepdim=True)[0] / max_val).clamp(min=1e-8)
    W_quant = torch.clamp(torch.round(W_scaled / q_scale), -max_val - 1, max_val)
    # dequantize: undo both the quantization scale and the channel scaling
    W_dequant = W_quant * q_scale / scales
    return W_quant.to(torch.int8), q_scale, scales
class AWQLinear(nn.Module):
    """
    AWQ-quantized linear layer (sketch)
    """
    def __init__(self, in_features, out_features, bias=False):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        if bias:
            self.bias = nn.Parameter(torch.empty(out_features))
        else:
            self.register_parameter('bias', None)
        # quantization parameters
        self.scales = None
        self.zero_points = None

    def calibrate(self, dataloader, model):
        """Calibration: derive the per-channel scaling factors"""
        # collect activation statistics
        act_stats = {}
        def hook_fn(module, input, output):
            act = input[0].detach().flatten(0, -2)   # (N, in_features)
            act_stats.setdefault(module, []).append(act)
        hooks = []
        for name, module in model.named_modules():
            if isinstance(module, AWQLinear):
                hooks.append(module.register_forward_hook(hook_fn))
        with torch.no_grad():
            for batch in dataloader:
                model(batch)
        for h in hooks:
            h.remove()
        # compute the scaling factors with the AWQ formula
        for module, acts in act_stats.items():
            act_mag = torch.cat(acts, dim=0).abs().mean(dim=0).clamp(min=1e-4)
            module.scales = act_mag.pow(0.5)         # alpha = 0.5

    def forward(self, x):
        # Fold the channel scales: (x / s) @ (W * s)^T equals x @ W^T in FP,
        # but W * s quantizes with less error on the salient channels.
        weight = self.weight * self.scales
        return F.linear(x / self.scales, weight, self.bias)
5. QLoRA
5.1 Core Idea
QLoRA[^3] combines 4-bit quantization with low-rank adapters for efficient fine-tuning:
4-bit NormalFloat quantization + LoRA + paged optimizers
5.2 4-bit NormalFloat Quantization
The **NF4 (4-bit NormalFloat)** data type is designed specifically for neural-network weights:
- weights approximately follow a normal distribution
- NF4's quantization levels are spaced non-uniformly according to that normal distribution (see the construction sketch below)
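Conceptually, NormalFloat levels are (normalized) quantiles of a standard normal distribution. The following sketch shows the idea; the actual NF4 table used by bitsandbytes is built slightly differently (it is asymmetric and pins 0 and ±1 exactly):

import torch

# hypothetical helper, for illustration only
def normalfloat_levels(k=16):
    normal = torch.distributions.Normal(0.0, 1.0)
    # probabilities strictly inside (0, 1) so the inverse CDF stays finite
    probs = torch.linspace(1.0 / (2 * k), 1.0 - 1.0 / (2 * k), k)
    levels = normal.icdf(probs)
    return levels / levels.abs().max()   # normalize to [-1, 1]

print(normalfloat_levels())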
class NF4Tensor:
    """4-bit NormalFloat data type (sketch)"""
    def __init__(self, device):
        # the 16 NF4 quantization levels (non-uniform, asymmetric;
        # values as used in bitsandbytes, rounded to 4 decimals)
        self.qlevels = torch.tensor([
            -1.0000, -0.6962, -0.5251, -0.3949,
            -0.2844, -0.1848, -0.0911,  0.0000,
             0.0796,  0.1609,  0.2461,  0.3379,
             0.4407,  0.5626,  0.7230,  1.0000
        ], device=device)

    def quantize(self, x):
        """Quantize to NF4 (x is assumed normalized to [-1, 1],
        e.g. by the per-block absmax as in QLoRA)"""
        # index of the nearest quantization level
        x_flat = x.flatten()
        idx = (x_flat.unsqueeze(-1) - self.qlevels).abs().argmin(dim=-1)
        return idx.view_as(x)

    def dequantize(self, indices):
        """Dequantize: look the levels back up"""
        return self.qlevels[indices]
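A quick round trip with this class; the input is pre-normalized to [-1, 1] by absmax, matching the assumption noted in quantize:

nf4 = NF4Tensor(device="cpu")
w = torch.randn(4, 4)
w = w / w.abs().max()                    # absmax normalization
idx = nf4.quantize(w)
print((w - nf4.dequantize(idx)).abs().max())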
5.3 QLoRA Implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
import bitsandbytes as bnb

class QLoRALinear(nn.Module):
    """
    QLoRA: Quantized Low-Rank Adaptation (sketch)
    """
    def __init__(self, in_features, out_features, rank=4, lora_alpha=16):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.rank = rank
        self.lora_alpha = lora_alpha
        # 4-bit NF4 base weight, kept frozen (bitsandbytes quantizes it
        # when the module is moved to the GPU)
        self.weight = bnb.nn.Params4bit(
            torch.empty(out_features, in_features),
            requires_grad=False,
            quant_type="nf4"
        )
        self.register_parameter('bias', None)
        # LoRA adapters (kept in higher precision and trained)
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = self.lora_alpha / self.rank

    def forward(self, x):
        # dequantize the 4-bit weight for the matmul
        # (valid once the module is on GPU and quant_state is populated)
        weight = bnb.functional.dequantize_4bit(
            self.weight.data, self.weight.quant_state
        ).to(x.dtype)
        # LoRA update: W + (alpha / r) * B A
        lora_update = self.lora_B @ self.lora_A * self.scaling
        return F.linear(x, weight + lora_update, self.bias)
# using the bitsandbytes integration in transformers
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# load a model with 4-bit NF4 quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True  # double quantization of the quant constants
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",      # example model id
    quantization_config=bnb_config
)
6. Method Comparison
| Method | Bit width | Scheme | Accuracy loss | Compute overhead |
|---|---|---|---|---|
| Dynamic quantization | INT8 | dynamic | low | low |
| Static quantization | INT8 | calibration | moderate | low |
| GPTQ | INT4 | OBC | low | moderate |
| AWQ | INT4 | activation-aware | low | moderate |
| QLoRA | INT4 + LoRA | NF4 + adapters | very low | high (during fine-tuning) |
7. Quantization-Aware Training (QAT)
7.1 Basic Principle
QAT simulates quantization effects during training so the model adapts to low precision:
import torch
import torch.nn as nn

class FakeQuantize(nn.Module):
    """Fake-quantization module"""
    def __init__(self, num_bits=8):
        super().__init__()
        self.num_bits = num_bits
        self.scale = None
        self.zero_point = None

    def forward(self, x):
        if not self.training:
            return x
        qmin = -(2 ** (self.num_bits - 1))
        qmax = 2 ** (self.num_bits - 1) - 1
        # compute scale and zero point (symmetric, so zero_point = 0)
        self.scale = (x.abs().max() / qmax).clamp(min=1e-8)
        self.zero_point = 0
        # quantize
        x_quant = torch.round(x / self.scale).clamp(qmin, qmax)
        # dequantize
        x_dequant = x_quant * self.scale
        # STE (Straight-Through Estimator):
        # forward pass uses the quantized value,
        # backward pass treats quantization as the identity
        return x + (x_dequant - x).detach()
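A minimal illustration of using the module during training (a toy two-layer model, with fake quantization applied to the intermediate activations):

model = nn.Sequential(
    nn.Linear(16, 32), FakeQuantize(num_bits=8),
    nn.ReLU(),
    nn.Linear(32, 1)
)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(8, 16), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()     # gradients flow through the STE
opt.step()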
7.2 Learnable Quantization
LSQ (Learnable Step Size Quantization) and related methods make the quantization parameters themselves learnable:
class LSQQuantize(nn.Module):
    """
    LSQ: Learnable Step Size Quantization (simplified; the paper
    additionally rescales the step-size gradient)
    """
    def __init__(self, num_bits=4):
        super().__init__()
        self.num_bits = num_bits
        # learnable scaling factor (log-parameterized to stay positive)
        self.logScale = nn.Parameter(torch.zeros(1))

    @property
    def scale(self):
        return torch.exp(self.logScale)

    def forward(self, x):
        # quantization range
        Q_n = -(2 ** (self.num_bits - 1))
        Q_p = 2 ** (self.num_bits - 1) - 1
        x_scaled = x / self.scale
        x_round = torch.round(x_scaled)
        x_clip = x_round.clamp(Q_n, Q_p)
        # STE: forward uses the rounded value, backward the unrounded one
        x_quant = x_clip.detach() + x_scaled - x_scaled.detach()
        return x_quant * self.scale
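Because the step size appears inside the STE expression, it receives gradients from the task loss. A quick check:

q = LSQQuantize(num_bits=4)
x = torch.randn(8, requires_grad=True)
q(x).sum().backward()
print(q.logScale.grad)   # non-zero: the scale is trained jointly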
8. Practical Guide
8.1 Choosing a Quantization Method
| Scenario | Recommended method |
|---|---|
| Inference deployment (4-bit) | GPTQ / AWQ |
| Fine-tuning | QLoRA |
| Extreme compression (2-bit) | GGUF / GPTQ variants |
| Quick experiments | dynamic quantization |
8.2 Quantization Configuration
# recommended AWQ configuration
quant_config = {
    "bits": 4,
    "group_size": 128,      # 128 is the common choice
    "zero_point": True,
    "activation_scheme": "per_token"
}
# recommended GPTQ configuration
quant_config = {
    "bits": 4,
    "group_size": 128,
    "desc_act": True,       # act-order: quantize columns by activation magnitude
    "static_groups": False
}
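As one concrete way to apply the GPTQ configuration, a sketch against the AutoGPTQ library (parameter names mirror the dict above; consult the library's docs for the exact current API):

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=True)
model = AutoGPTQForCausalLM.from_pretrained("facebook/opt-125m", quantize_config)
model.quantize(calibration_examples)   # list of tokenized calibration samples
model.save_quantized("opt-125m-4bit")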
8.3 Post-Quantization Evaluation
| Check | Method |
|---|---|
| Perplexity change | evaluate PPL |
| Downstream tasks | standard benchmarks |
| Generation quality | human evaluation / automatic metrics |
| Numeric range | check for overflow |
9. References
Footnotes

[^1]: Frantar E, Ashkboos S, Hoefler T, Alistarh D. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. ICLR, 2023. arXiv:2210.17323
[^2]: Lin J, Tang J, Tang H, et al. AWQ: Activation-aware Weight Quantization for LLM Compression. arXiv:2306.00978, 2023.
[^3]: Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L. QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS, 2023. arXiv:2305.14314