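A Survey of Adversarial Attack Methods
Overview
Adversarial attacks craft input perturbations that cause a deep learning model to make wrong predictions. Depending on the attacker's knowledge and capabilities, they are commonly grouped into white-box, black-box, and transfer attacks. This survey covers the mainstream attack methods: single-step first-order methods, iterative methods, and optimization-based methods.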
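First-Order Attack Methods
FGSM (Fast Gradient Sign Method)
FGSM is one of the earliest and simplest adversarial attacks, proposed by Goodfellow et al. in 2015 [1].
Core idea: use the gradient of the loss with respect to the input and take a single large step along the sign of that gradient.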
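Written out, the single step described above is commonly expressed as follows. The symbols are notation assumed here rather than taken from the text: $x$ is the clean input, $y$ its label, $f_\theta$ the model, $\mathcal{L}$ the classification loss, and $\epsilon$ the $L_\infty$ budget.

```latex
x_{\mathrm{adv}} = \mathrm{clip}_{[0,1]}\bigl(x + \epsilon \cdot \mathrm{sign}(\nabla_x \mathcal{L}(f_\theta(x), y))\bigr)
```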
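
    import torch
    import torch.nn.functional as F

    def fgsm(image, label, model, epsilon=8/255):
        """
        Fast Gradient Sign Method (untargeted, L-infinity).
        Args:
            image: input images [B, C, H, W], pixel values in [0, 1]
            label: ground-truth labels [B]
            model: target model (returns logits)
            epsilon: perturbation budget on the [0, 1] pixel scale
        """
        # Work on a detached copy so the caller's tensor is left untouched
        image = image.clone().detach().requires_grad_(True)
        output = model(image)
        loss = F.cross_entropy(output, label)
        # Gradient of the loss with respect to the input only
        grad = torch.autograd.grad(loss, image)[0]
        # Build the adversarial example and keep it a valid image
        perturbation = epsilon * grad.sign()
        adversarial = (image + perturbation).clamp(0, 1)
        return adversarial.detach()

Characteristics: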
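- Fast: a single gradient computation
- Effective against undefended models, though weaker than iterative attacks
- A common baseline for other attack methods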
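To make the arithmetic concrete without a deep learning framework, the sketch below applies the one-step sign rule to a tiny logistic-regression model whose input gradient is known in closed form. The weights, the input, and the helper names (`predict`, `loss`, `fgsm_logistic`) are hypothetical and chosen only for illustration; clipping to [0, 1] is omitted because the toy features are not pixels.

```python
import math

def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy for a single example with label y in {0, 1}."""
    p = predict(w, b, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def fgsm_logistic(w, b, x, y, epsilon):
    """One FGSM step; for this model d(loss)/dx_i = (p - y) * w_i."""
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    return [xi + epsilon * sign(gi) for xi, gi in zip(x, grad)]

# Hypothetical toy model and input (assumptions for illustration only)
w, b = [2.0, -1.0, 0.5], 0.1
x, y = [0.5, 0.2, 0.8], 1
x_adv = fgsm_logistic(w, b, x, y, epsilon=0.1)

clean_loss = loss(w, b, x, y)
adv_loss = loss(w, b, x_adv, y)
max_change = max(abs(a - c) for a, c in zip(x_adv, x))
```

Every coordinate moves by exactly `epsilon`, so the $L_\infty$ distance to the clean input equals the budget (up to floating-point rounding), and the loss on the true label increases.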
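FGM (Fast Gradient Method)
FGM is the $L_2$-norm counterpart of FGSM: the step follows the gradient normalized to unit $L_2$ norm instead of its sign: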
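
    import torch
    import torch.nn.functional as F

    def fgm(image, label, model, epsilon=8/255):
        """Fast Gradient Method: one step of L2 length epsilon per sample."""
        # Note: L2 budgets are typically much larger than L-infinity ones; tune epsilon accordingly
        image = image.clone().detach().requires_grad_(True)
        output = model(image)
        loss = F.cross_entropy(output, label)
        grad = torch.autograd.grad(loss, image)[0]
        # Per-sample L2 norm of the gradient, shaped [B, 1, 1, 1] for broadcasting
        grad_norm = grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
        perturbation = epsilon * grad / (grad_norm + 1e-10)
        adversarial = (image + perturbation).clamp(0, 1)
        return adversarial.detach()

R-FGSM (Randomized FGSM)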
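R-FGSM prepends a small random step to FGSM, which improves the attack success rate:
$$x' = x + \alpha \cdot \mathrm{sign}(\mathcal{N}(0, I)), \qquad x_{\mathrm{adv}} = x' + (\epsilon - \alpha) \cdot \mathrm{sign}(\nabla_{x'} \mathcal{L}(f_\theta(x'), y))$$
where $\alpha < \epsilon$, which ensures that the total perturbation does not exceed $\epsilon$.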
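The way the budget is split between the random step and the gradient step can be checked with a few lines of framework-free Python. This is a sketch under assumptions: `grad_fn` is a hypothetical callback standing in for the input gradient of the loss, and inputs are plain lists of floats.

```python
import random

def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def r_fgsm(x, grad_fn, epsilon, alpha, rng):
    """Random sign step of size alpha, then a gradient-sign step of size epsilon - alpha."""
    assert 0 < alpha < epsilon
    # Random start: move each coordinate by +/- alpha
    x_rand = [xi + alpha * sign(rng.gauss(0.0, 1.0)) for xi in x]
    # Gradient step with the remaining budget, evaluated at the random start
    grad = grad_fn(x_rand)
    return [xi + (epsilon - alpha) * sign(gi) for xi, gi in zip(x_rand, grad)]

# Hypothetical gradient: loss = sum of squares, so d(loss)/dx_i = 2 * x_i
grad_fn = lambda x: [2.0 * xi for xi in x]

x = [0.2, -0.4, 0.7, 0.0]
x_adv = r_fgsm(x, grad_fn, epsilon=8 / 255, alpha=4 / 255, rng=random.Random(0))
max_change = max(abs(a - c) for a, c in zip(x_adv, x))
```

Each coordinate moves by at most `alpha + (epsilon - alpha) = epsilon`, so the result stays inside the $L_\infty$ ball regardless of the random draw.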
Iterative Attack Methods
PGD (Projected Gradient Descent)
PGD is the iterative version of FGSM and one of the strongest $L_\infty$ attacks [2].
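
    import torch
    import torch.nn.functional as F

    def pgd_attack(image, label, model, epsilon=8/255, alpha=2/255, iterations=10):
        """
        Projected Gradient Descent attack (untargeted, L-infinity).
        Args:
            alpha: step size per iteration
            iterations: number of iterations
        """
        image = image.clone().detach()
        # Random start: a uniformly sampled point inside the epsilon ball
        x_adv = image + torch.empty_like(image).uniform_(-epsilon, epsilon)
        x_adv = torch.clamp(x_adv, 0, 1)
        for _ in range(iterations):
            x_adv.requires_grad_(True)
            output = model(x_adv)
            loss = F.cross_entropy(output, label)
            grad = torch.autograd.grad(loss, x_adv)[0]
            # Gradient ascent on the loss (the attacker's objective)
            x_adv = x_adv.detach() + alpha * grad.sign()
            # Project back onto the epsilon ball, then onto the valid pixel range
            delta = torch.clamp(x_adv - image, -epsilon, epsilon)
            x_adv = torch.clamp(image + delta, 0, 1)
        return x_adv

Why PGD is considered a strong attack (empirical arguments from Madry et al. [2] rather than formal guarantees):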
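- PGD is widely regarded as the strongest first-order attack under an $L_\infty$ constraint
- A model that is robust to PGD is expected to be robust to other first-order attacks
- Random starting points (optionally with restarts) let the attack explore more of the loss landscape around the input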
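The role of the projection and of random starts can be illustrated on a toy problem in plain Python. The sketch below runs PGD from several random starting points against a hypothetical logistic-regression model and keeps the highest-loss result; the model, the input, and the function names are assumptions made for illustration.

```python
import math
import random

def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def clip(v, lo, hi):
    """Clamp v into the interval [lo, hi]."""
    return max(lo, min(hi, v))

def loss_and_grad(w, b, x, y):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss, [(p - y) * wi for wi in w]

def pgd_with_restarts(w, b, x, y, epsilon, alpha, steps, restarts, rng):
    """Run PGD from several random starts and keep the highest-loss result."""
    best_x, best_loss = list(x), loss_and_grad(w, b, x, y)[0]
    for _ in range(restarts):
        # Random start inside the epsilon ball
        x_adv = [xi + rng.uniform(-epsilon, epsilon) for xi in x]
        for _ in range(steps):
            _, grad = loss_and_grad(w, b, x_adv, y)
            x_adv = [xa + alpha * sign(g) for xa, g in zip(x_adv, grad)]
            # Projection: clip each coordinate back into [x_i - eps, x_i + eps]
            x_adv = [clip(xa, xi - epsilon, xi + epsilon) for xa, xi in zip(x_adv, x)]
        final_loss = loss_and_grad(w, b, x_adv, y)[0]
        if final_loss > best_loss:
            best_x, best_loss = x_adv, final_loss
    return best_x, best_loss

# Hypothetical toy model and input (assumptions for illustration only)
w, b = [2.0, -1.0, 0.5], 0.1
x, y = [0.5, 0.2, 0.8], 1
clean_loss = loss_and_grad(w, b, x, y)[0]
x_adv, adv_loss = pgd_with_restarts(w, b, x, y, epsilon=0.1, alpha=0.025,
                                    steps=10, restarts=3, rng=random.Random(0))
max_change = max(abs(a - c) for a, c in zip(x_adv, x))
```

For this linear toy model every restart ends at the same corner of the ball, so restarts add nothing here; they matter for the non-convex losses of deep networks, where different starts can reach different local maxima.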
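BIM (Basic Iterative Method)
BIM can be viewed as PGD without the random initialization:
$$x_0 = x, \qquad x_{t+1} = \mathrm{Clip}_{x,\epsilon}\bigl(x_t + \alpha \cdot \mathrm{sign}(\nabla_x \mathcal{L}(f_\theta(x_t), y))\bigr)$$
where $\mathrm{Clip}_{x,\epsilon}$ projects its argument onto the $\epsilon$-ball around $x$.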
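MI-FGSM (Momentum Iterative FGSM)
MI-FGSM adds a momentum term that stabilizes the update direction across iterations [3]:
$$g_{t+1} = \mu \cdot g_t + \frac{\nabla_x \mathcal{L}(f_\theta(x_t), y)}{\lVert \nabla_x \mathcal{L}(f_\theta(x_t), y) \rVert_1}, \qquad x_{t+1} = \mathrm{Clip}_{x,\epsilon}\bigl(x_t + \alpha \cdot \mathrm{sign}(g_{t+1})\bigr)$$
where $\mu$ is the momentum decay factor and $g_0 = 0$.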
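The momentum update can be sketched in a few lines of plain Python. This follows the scheme of Dong et al., in which the gradient is normalized by its $L_1$ norm before being accumulated; `grad_fn`, the toy loss, and the choice `alpha = epsilon / steps` are assumptions made for illustration.

```python
def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def mi_fgsm(x, grad_fn, epsilon, steps=10, mu=1.0):
    """Momentum iterative FGSM on a list of floats (L-infinity budget epsilon)."""
    alpha = epsilon / steps          # per-step size so that steps * alpha = epsilon
    g = [0.0] * len(x)               # accumulated (momentum) gradient
    x_adv = list(x)
    for _ in range(steps):
        grad = grad_fn(x_adv)
        l1 = sum(abs(gi) for gi in grad) + 1e-12
        # Accumulate the L1-normalized gradient into the momentum buffer
        g = [mu * g_old + gi / l1 for g_old, gi in zip(g, grad)]
        x_adv = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        # Keep the iterate inside the epsilon ball around the clean input
        x_adv = [max(xi - epsilon, min(xi + epsilon, xa)) for xa, xi in zip(x_adv, x)]
    return x_adv

# Hypothetical gradient of a toy loss 0.5 * sum((x_i - t_i)^2) with respect to x
target = [1.0, -1.0, 0.3]
grad_fn = lambda x: [xi - ti for xi, ti in zip(x, target)]

x = [0.0, 0.0, 0.0]
x_adv = mi_fgsm(x, grad_fn, epsilon=0.1, steps=10, mu=1.0)
max_change = max(abs(a - c) for a, c in zip(x_adv, x))
```

Because the attack ascends the loss, each coordinate moves away from `target`; the momentum buffer keeps the sign of the update stable even if an individual gradient is noisy.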
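EOT (Expectation over Transformation)
EOT targets physical-world attacks: the adversarial perturbation is optimized in expectation over a distribution of input transformations: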

    import random
    import torch
    import torch.nn.functional as F

    def eot_attack(image, target_label, model, num_samples=100, epsilon=8/255, steps=1000):
        """EOT attack framework (targeted): minimize the expected loss toward target_label."""
        image = image.clone().detach()
        delta = torch.zeros_like(image, requires_grad=True)
        optimizer = torch.optim.Adam([delta], lr=0.01)
        # Example transformations; the model must accept the resized inputs
        transforms = [
            lambda x: F.interpolate(x, scale_factor=0.9),
            lambda x: F.interpolate(x, scale_factor=1.1),
            lambda x: torch.rot90(x, k=1, dims=(2, 3)),
            # more transformations...
        ]
        for _ in range(steps):
            # Sample transformations
            sampled_transforms = random.choices(transforms, k=num_samples)
            # Monte Carlo estimate of the expected loss over the transformations
            loss = 0.0
            for T in sampled_transforms:
                x_transformed = T((image + delta).clamp(0, 1))
                loss = loss + F.cross_entropy(model(x_transformed), target_label)
            loss = loss / num_samples
            # Backpropagate through the transformations to delta and take a step
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # Project back onto the feasible set
            with torch.no_grad():
                delta.clamp_(-epsilon, epsilon)
        return (image + delta).clamp(0, 1).detach()

Optimization-Based Attacks
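C&W Attack (Carlini & Wagner)
The C&W attack formulates adversarial example generation as an optimization problem:
$$\min_{\delta} \; \lVert \delta \rVert_2^2 + c \cdot f(x + \delta), \qquad f(x') = \max\bigl(\max_{i \neq t} Z(x')_i - Z(x')_t, \; -\kappa\bigr)$$
where:
- $Z(\cdot)$ is the model's logit output and $t$ is the target class
- $\kappa$ is the confidence parameter
- $c$ is the balancing coefficient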
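The effect of the confidence parameter is easiest to see on concrete numbers. The plain-Python sketch below evaluates the targeted C&W margin term on a few hypothetical logit vectors; the function name `cw_margin` and the example values are assumptions made for illustration.

```python
def cw_margin(logits, target, kappa=0):
    """Targeted C&W margin: max(max_{i != target} Z_i - Z_target, -kappa)."""
    best_other = max(z for i, z in enumerate(logits) if i != target)
    return max(best_other - logits[target], -kappa)

# Hypothetical logits for a 4-class model, target class 1
not_yet_adversarial = [3.0, 1.0, 0.5, -2.0]     # class 0 still wins
barely_adversarial = [1.0, 1.5, 0.5, -2.0]      # target wins by 0.5
confident_adversarial = [1.0, 6.0, 0.5, -2.0]   # target wins by 5.0

m0 = cw_margin(not_yet_adversarial, target=1)                # 3.0 - 1.0 = 2.0
m1 = cw_margin(barely_adversarial, target=1)                 # max(-0.5, 0) = 0
m2 = cw_margin(barely_adversarial, target=1, kappa=2.0)      # max(-0.5, -2.0) = -0.5
m3 = cw_margin(confident_adversarial, target=1, kappa=2.0)   # max(-5.0, -2.0) = -2.0
```

A positive margin means the target class is not yet predicted. Once the target logit leads by at least `kappa`, the term bottoms out at `-kappa`, so the optimizer stops pushing on the classification term and only shrinks the distortion.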
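
    import torch
    import torch.nn.functional as F

    def cw_attack(image, label, model, target=None, c=1.0, kappa=0, max_iter=1000):
        """
        Carlini & Wagner L2 attack (fixed c, no binary search).
        Untargeted if target is None; otherwise target is a tensor of class indices [B].
        """
        image = image.clone().detach()
        # Change of variables: x_adv = (tanh(w) + 1) / 2 always lies in [0, 1]
        def to_image(w):
            return (torch.tanh(w) + 1) / 2
        # Start from the clean image: w = atanh(2x - 1), clamped for numerical stability
        w = torch.atanh((2 * image - 1).clamp(-1 + 1e-6, 1 - 1e-6)).requires_grad_(True)
        optimizer = torch.optim.Adam([w], lr=0.01)
        reference = label if target is None else target
        for _ in range(max_iter):
            adv_image = to_image(w)
            logits = model(adv_image)
            one_hot = F.one_hot(reference, logits.shape[1]).bool()
            ref_logit = logits[one_hot]
            other_logit = logits.masked_fill(one_hot, float('-inf')).max(dim=1).values
            if target is None:
                # Untargeted: push the true-class logit below the best other class
                margin = (ref_logit - other_logit).clamp(min=-kappa)
            else:
                # Targeted: push the target logit above every other class
                margin = (other_logit - ref_logit).clamp(min=-kappa)
            # Squared L2 distortion per sample
            distortion = ((adv_image - image) ** 2).flatten(1).sum(dim=1)
            loss = (distortion + c * margin).sum()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return to_image(w).detach()

C&W attack characteristics: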
- Applicable to several $L_p$ norms ($L_0$, $L_2$, $L_\infty$)
- Can bypass some defenses
- Computationally expensive
EOT Attack Variants
These variants combine the EOT framework with optimization-based attacks.
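One way such a combination can be written, given here as an illustrative assumption rather than a formulation taken from this survey, is to place a C&W-style margin term $f$ inside an expectation over transformations $t$ drawn from a distribution $T$, with $c$ balancing distortion against attack success:

```latex
\min_{\delta} \; \lVert \delta \rVert_2^2 + c \cdot \mathbb{E}_{t \sim T}\bigl[ f\bigl(t(x + \delta)\bigr) \bigr]
```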
DeepFool Attack
DeepFool finds a minimal perturbation by iteratively linearizing the decision boundary [4]:

    import torch

    def deepfool(image, model, num_classes=10, max_iter=50, overshoot=0.02):
        """
        DeepFool L2 attack for a single image (batch size 1).
        The model's prediction on the clean image is used as the reference label.
        """
        x0 = image.clone().detach()
        with torch.no_grad():
            true_label = model(x0).argmax(dim=1).item()
        r_total = torch.zeros_like(x0)
        perturbed = x0.clone()
        for _ in range(max_iter):
            perturbed.requires_grad_(True)
            logits = model(perturbed)[0]
            if logits.argmax().item() != true_label:
                break  # the prediction has flipped
            grad_true = torch.autograd.grad(logits[true_label], perturbed, retain_graph=True)[0]
            best_dist, best_step = float('inf'), None
            for k in range(num_classes):
                if k == true_label:
                    continue
                grad_k = torch.autograd.grad(logits[k], perturbed, retain_graph=True)[0]
                # Linearized boundary between class k and the reference class
                w_k = grad_k - grad_true
                f_k = (logits[k] - logits[true_label]).item()
                dist = abs(f_k) / (w_k.norm().item() + 1e-10)
                if dist < best_dist:
                    # Minimal L2 step that reaches this boundary
                    best_dist, best_step = dist, (dist + 1e-4) * w_k / (w_k.norm() + 1e-10)
            r_total = r_total + best_step
            # Apply the accumulated perturbation with a small overshoot
            perturbed = torch.clamp(x0 + (1 + overshoot) * r_total, 0, 1).detach()
        return perturbed.detach()

AutoAttack
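AutoAttack is an ensemble of four complementary attacks [5]:
- APGD-CE: PGD with adaptive step size, using the cross-entropy loss
- APGD-T: targeted APGD using the DLR (Difference of Logits Ratio) loss
- FAB: Fast Adaptive Boundary attack
- Square: a query-based black-box attack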
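The untargeted DLR loss from the AutoAttack paper (Croce & Hein, 2020) divides the logit margin by the gap between the largest and third-largest logits, which makes it invariant to rescaling of the logits. The function below is a minimal plain-Python sketch of that definition with hypothetical example logits; the targeted variant used by APGD-T is not shown, and the small constant in the denominator is an assumption added for numerical safety.

```python
def dlr_loss(logits, label):
    """Untargeted DLR: -(z_y - max_{i != y} z_i) / (z_(1) - z_(3)), z_(k) = k-th largest logit."""
    assert len(logits) >= 3, "DLR needs at least three classes"
    ordered = sorted(logits, reverse=True)
    best_other = max(z for i, z in enumerate(logits) if i != label)
    return -(logits[label] - best_other) / (ordered[0] - ordered[2] + 1e-12)

# Hypothetical logits for a 4-class model with true class 0
logits = [4.0, 2.0, 1.0, -1.0]
loss = dlr_loss(logits, label=0)                       # -(4 - 2) / (4 - 1)
scaled = dlr_loss([10 * z for z in logits], label=0)   # same value after rescaling
misclassified = dlr_loss([1.0, 3.0, 0.0, -1.0], label=0)
```

The loss is negative while the sample is correctly classified and turns positive once another class overtakes the true one; the attacker maximizes it.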

    # AutoAttack usage example
    from autoattack import AutoAttack

    adversary = AutoAttack(model, norm='Linf', eps=8/255, version='standard')
    adversarial = adversary.run_standard_evaluation(images, labels)

AutoAttack is widely used as the de facto standard for evaluating adversarial robustness.
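Comparison of Attack Methods
| Attack | Threat model | Norm | Attack strength | Computational cost |
|---|---|---|---|---|
| FGSM | White-box | $L_\infty$ | Medium | Low |
| FGM | White-box | $L_2$ | Medium | Low |
| PGD | White-box | $L_\infty$ | High | Medium |
| BIM | White-box | $L_\infty$ | Medium-high | Medium |
| MI-FGSM | White-box | $L_\infty$ | High | Medium |
| C&W | White-box | Any ($L_0$, $L_2$, $L_\infty$) | Highest | High |
| DeepFool | White-box | $L_2$ | High | Medium |
| EOT | White-box | Physical | High | High |
| AutoAttack | White-box | $L_\infty$, $L_2$ | Highest | High |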
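Related Topics
- adversarial-robustness-fundamentals: fundamentals of adversarial robustness
- projected-gradient-descent-attack: the PGD attack in detail
- universal-adversarial-perturbations: universal adversarial perturbations
- adversarial-training-methods: adversarial training defenses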
References
1. Goodfellow, I. J., et al. (2015). Explaining and Harnessing Adversarial Examples. ICLR 2015. https://arxiv.org/abs/1412.6572
2. Madry, A., et al. (2018). Towards Deep Learning Models Resistant to Adversarial Attacks. ICLR 2018. https://arxiv.org/abs/1706.06083
3. Dong, Y., et al. (2018). Boosting Adversarial Attacks with Momentum. CVPR 2018. https://arxiv.org/abs/1710.06081
4. Moosavi-Dezfooli, S. M., et al. (2016). DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks. CVPR 2016. https://arxiv.org/abs/1511.04599
5. Croce, F., & Hein, M. (2020). Reliable Evaluation of Adversarial Robustness with an Ensemble of Diverse Parameter-free Attacks. ICML 2020. https://arxiv.org/abs/2003.01690