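A Survey of Adversarial Attack Methods

Overview

An adversarial attack generates input perturbations that cause a deep learning model to make wrong predictions. Depending on the attacker's knowledge and capabilities, attacks are commonly divided into white-box attacks, black-box attacks, and transfer attacks. This survey covers the mainstream attack methods: first-order (single-step) methods, iterative methods, and optimization-based methods.

First-Order Attack Methods

FGSM (Fast Gradient Sign Method)

FGSM is one of the earliest and simplest adversarial attacks, proposed by Goodfellow et al. in 2015 [1].

Core idea: use the gradient of the loss with respect to the input and take a single step of size $\epsilon$ along the sign of that gradient, $x_{adv} = x + \epsilon \cdot \mathrm{sign}(\nabla_x L(f(x), y))$.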

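import torch
import torch.nn.functional as F

def fgsm(image, label, model, epsilon=8/255):
    """
    Fast Gradient Sign Method

    Args:
        image: input images [B, C, H, W], values in [0, 1]
        label: ground-truth labels [B]
        model: target model
        epsilon: L_inf perturbation budget (on the [0, 1] pixel scale)
    """
    # Work on a detached copy so the caller's tensor is not modified
    image = image.clone().detach().requires_grad_(True)
    output = model(image)
    loss = F.cross_entropy(output, label)

    # Gradient of the loss with respect to the input only
    grad, = torch.autograd.grad(loss, image)

    # Build the adversarial example and keep it a valid image
    perturbation = epsilon * grad.sign()
    adversarial = (image + perturbation).clamp(0, 1).detach()

    return adversarial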

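Characteristics

  • Fast: a single gradient computation
  • Effective against undefended models, though weaker than iterative attacks under the same budget
  • Widely used as a baseline for other attacks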
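
To see why the sign direction is used, consider a linear score $w^\top x$: among all perturbations with $\|\delta\|_\infty \le \epsilon$, the choice $\delta = \epsilon \cdot \mathrm{sign}(w)$ raises the score the most, by exactly $\epsilon \|w\|_1$. The sketch below checks this numerically. The weight vector and input are random stand-ins rather than a real model, and `linear_score_gain` is a hypothetical helper name.

```python
import torch

def linear_score_gain(w, x, epsilon):
    """Increase of the linear score w.x caused by the step epsilon * sign(w)."""
    delta = epsilon * w.sign()
    return torch.dot(w, x + delta) - torch.dot(w, x)

torch.manual_seed(0)
w = torch.randn(1000, dtype=torch.float64)  # hypothetical weight vector
x = torch.rand(1000, dtype=torch.float64)   # hypothetical input in [0, 1]
epsilon = 8 / 255

gain = linear_score_gain(w, x, epsilon)
bound = epsilon * w.abs().sum()  # epsilon * ||w||_1, the largest gain an L_inf-bounded step can reach

# A random sign perturbation with the same L_inf norm gains less
random_delta = epsilon * torch.randn_like(w).sign()
random_gain = torch.dot(w, random_delta)
```

The gain grows with the input dimension through $\|w\|_1$, which is the linear-model argument Goodfellow et al. give for why small per-pixel changes can move the output substantially.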

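FGM (Fast Gradient Method)

FGM is the $L_2$-norm version of FGSM: the step follows the gradient normalized to unit $L_2$ length instead of its sign.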

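import torch
import torch.nn.functional as F

def fgm(image, label, model, epsilon=8/255):
    """Fast Gradient Method: one step of L2 length epsilon along the gradient."""
    # Work on a detached copy so the caller's tensor is not modified
    image = image.clone().detach().requires_grad_(True)
    output = model(image)
    loss = F.cross_entropy(output, label)

    # Gradient of the loss with respect to the input only
    grad, = torch.autograd.grad(loss, image)

    # Normalize each sample's gradient to unit L2 norm
    grad_norm = grad.flatten(1).norm(p=2, dim=1).view(-1, 1, 1, 1)
    perturbation = epsilon * grad / (grad_norm + 1e-10)
    adversarial = (image + perturbation).clamp(0, 1).detach()

    return adversarial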

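R-FGSM (Randomized FGSM)

R-FGSM takes a small random step before the FGSM step, which raises the attack success rate:

$$x' = x + \alpha \cdot \mathrm{sign}(\mathcal{N}(0, I)), \qquad x_{adv} = x' + (\epsilon - \alpha) \cdot \mathrm{sign}(\nabla_{x'} L(f(x'), y))$$

where $\alpha < \epsilon$, which ensures that the total perturbation does not exceed $\epsilon$.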
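
A minimal sketch of R-FGSM, following the same interface as the `fgsm` function above. The default `alpha = 4/255` (half of `epsilon`) and the toy linear model used for the demonstration are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rfgsm(image, label, model, epsilon=8/255, alpha=4/255):
    """R-FGSM sketch: random sign step of size alpha, then a gradient sign step of size epsilon - alpha."""
    assert 0 < alpha < epsilon
    image = image.clone().detach()

    # Random step: move every pixel by +alpha or -alpha
    x_rand = (image + alpha * torch.randn_like(image).sign()).clamp(0, 1)
    x_rand.requires_grad_(True)

    # Gradient step taken from the randomly shifted point
    loss = F.cross_entropy(model(x_rand), label)
    grad, = torch.autograd.grad(loss, x_rand)
    adversarial = x_rand.detach() + (epsilon - alpha) * grad.sign()
    return adversarial.clamp(0, 1)

# Toy usage with a hypothetical linear classifier on 8x8 RGB inputs
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
images = torch.rand(4, 3, 8, 8)
labels = torch.randint(0, 10, (4,))
adv = rfgsm(images, labels, toy_model)
max_linf = (adv - images).abs().max().item()  # stays within epsilon
```

Because the two steps have sizes $\alpha$ and $\epsilon - \alpha$, their sum never leaves the $\epsilon$ ball, so no explicit projection is needed.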

Iterative Attack Methods

PGD (Projected Gradient Descent)

PGD is the iterative version of FGSM and one of the strongest first-order $L_\infty$ attacks [2].

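import torch
import torch.nn.functional as F

def pgd_attack(image, label, model, epsilon=8/255, alpha=2/255, iterations=10):
    """
    Projected Gradient Descent Attack (L_inf)

    Args:
        image: input images [B, C, H, W], values in [0, 1]
        label: ground-truth labels [B]
        model: target model
        epsilon: radius of the L_inf ball
        alpha: step size per iteration
        iterations: number of iterations
    """
    image = image.clone().detach()

    # Random start: a uniformly sampled point inside the epsilon ball
    x_adv = image + torch.empty_like(image).uniform_(-epsilon, epsilon)
    x_adv = x_adv.clamp(0, 1)

    for _ in range(iterations):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), label)
        grad, = torch.autograd.grad(loss, x_adv)

        # Gradient ascent on the loss (the attacker's objective)
        x_adv = x_adv.detach() + alpha * grad.sign()

        # Project back onto the epsilon ball, then onto the valid pixel range
        delta = (x_adv - image).clamp(-epsilon, epsilon)
        x_adv = (image + delta).clamp(0, 1)

    return x_adv.detach()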

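Robustness Claims for PGD

  • Madry et al. argue, on empirical grounds, that PGD is the strongest attack that uses only first-order information under an $L_\infty$ constraint [2]
  • A model that is robust to PGD is therefore expected to be robust to other first-order attacks; this is an empirical observation rather than a formal guarantee
  • Random starting points let the attack explore different local maxima of the loss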

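BIM (Basic Iterative Method)

BIM is a simplified version of PGD without random initialization; it starts from the clean input $x_0 = x$:

$$x_{t+1} = \mathrm{Clip}_{x, \epsilon}\left(x_t + \alpha \cdot \mathrm{sign}(\nabla_x L(f(x_t), y))\right)$$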
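
A minimal sketch of BIM, assuming the same interface and defaults as the `pgd_attack` function above. The only difference is the starting point: the clean image instead of a random point in the $\epsilon$ ball. The toy linear model is a stand-in for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def bim_attack(image, label, model, epsilon=8/255, alpha=2/255, iterations=10):
    """Basic Iterative Method sketch: iterated FGSM steps starting from the clean image."""
    image = image.clone().detach()
    x_adv = image.clone()  # no random initialization

    for _ in range(iterations):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), label)
        grad, = torch.autograd.grad(loss, x_adv)

        # Sign-gradient ascent step, then clip to the epsilon ball and pixel range
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = image + (x_adv - image).clamp(-epsilon, epsilon)
        x_adv = x_adv.clamp(0, 1)

    return x_adv.detach()

# Toy usage with a hypothetical linear classifier on 8x8 RGB inputs
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
images = torch.rand(4, 3, 8, 8)
labels = torch.randint(0, 10, (4,))
adv = bim_attack(images, labels, toy_model)

with torch.no_grad():
    loss_before = F.cross_entropy(toy_model(images), labels).item()
    loss_after = F.cross_entropy(toy_model(adv), labels).item()
```

Without the random start the attack is deterministic, so repeated runs on the same input return the same adversarial example.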

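MI-FGSM (Momentum Iterative FGSM)

MI-FGSM adds a momentum term that stabilizes the update direction across iterations [3]:

$$g_{t+1} = \mu \cdot g_t + \frac{\nabla_x L(f(x_t), y)}{\|\nabla_x L(f(x_t), y)\|_1}, \qquad x_{t+1} = x_t + \alpha \cdot \mathrm{sign}(g_{t+1})$$

where $\mu$ is the momentum decay factor and $g_0 = 0$.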
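
A minimal sketch of the momentum update, assuming the step size $\alpha = \epsilon / T$ and decay factor $\mu = 1.0$ used by Dong et al. [3]. The function name `mi_fgsm` and the toy linear model are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mi_fgsm(image, label, model, epsilon=8/255, iterations=10, mu=1.0):
    """MI-FGSM sketch: accumulate L1-normalized gradients with momentum, step along their sign."""
    alpha = epsilon / iterations
    image = image.clone().detach()
    x_adv = image.clone()
    g = torch.zeros_like(image)  # accumulated (momentum) gradient

    for _ in range(iterations):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), label)
        grad, = torch.autograd.grad(loss, x_adv)

        # Normalize each sample's gradient by its L1 norm before accumulating
        l1 = grad.abs().flatten(1).sum(dim=1).view(-1, 1, 1, 1)
        g = mu * g + grad / (l1 + 1e-12)

        # Step along the sign of the accumulated gradient, then clip
        x_adv = x_adv.detach() + alpha * g.sign()
        x_adv = image + (x_adv - image).clamp(-epsilon, epsilon)
        x_adv = x_adv.clamp(0, 1)

    return x_adv.detach()

# Toy usage with a hypothetical linear classifier on 8x8 RGB inputs
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
images = torch.rand(4, 3, 8, 8)
labels = torch.randint(0, 10, (4,))
adv = mi_fgsm(images, labels, toy_model)
```

The $L_1$ normalization keeps gradients from different iterations on a comparable scale, so no single iteration dominates the accumulated direction.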

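EOT (Expectation over Transformation)

EOT targets physical-world attacks: the adversarial perturbation is optimized over a distribution of transformations, so that it stays effective after changes such as rescaling or rotation: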

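import random

import torch
import torch.nn.functional as F

def eot_attack(image, target_label, model, epsilon=8/255, num_samples=100, steps=1000, lr=0.01):
    """EOT attack framework (targeted): minimize the expected loss toward target_label."""
    image = image.clone().detach()
    delta = torch.zeros_like(image, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)

    # Candidate transformations; the resizing ones assume the model accepts variable input sizes
    transforms = [
        lambda x: F.interpolate(x, scale_factor=0.9),
        lambda x: F.interpolate(x, scale_factor=1.1),
        lambda x: torch.rot90(x, k=1, dims=(2, 3)),
        # more transformations...
    ]

    for _ in range(steps):
        # Sample transformations
        sampled_transforms = random.choices(transforms, k=num_samples)

        # Monte Carlo estimate of the expected gradient: backpropagate each
        # sample's share of the loss so the gradients accumulate in delta.grad
        optimizer.zero_grad()
        for T in sampled_transforms:
            loss = F.cross_entropy(model(T(image + delta)), target_label)
            (loss / num_samples).backward()
        optimizer.step()

        # Project onto the feasible set: epsilon ball and valid pixel range
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)
            delta.copy_((image + delta).clamp(0, 1) - image)

    return (image + delta).detach()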

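Optimization-Based Attacks

C&W Attack (Carlini & Wagner)

The C&W attack formulates the search for an adversarial example as an optimization problem. For a targeted attack with target class $t$:

$$\min_{\delta} \; \|\delta\|_2^2 + c \cdot g(x + \delta), \qquad g(x') = \max\left(\max_{i \ne t} Z(x')_i - Z(x')_t, \; -\kappa\right)$$

where:

  • $Z(x')$ is the model's logits output
  • $\kappa$ is the confidence parameter
  • $c$ is the balancing coefficient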
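
import torch
import torch.nn.functional as F

def cw_attack(image, label, model, target=None, c=1.0, kappa=0, max_iter=1000):
    """
    Carlini & Wagner L2 Attack (fixed c, without the binary search over c)

    image: [B, C, H, W] in [0, 1]; label / target: class indices [B]
    """
    image = image.clone().detach()

    # Change of variables: x_adv = (tanh(w) + 1) / 2 always lies in [0, 1]
    def to_image(w):
        return (torch.tanh(w) + 1) / 2

    # Start from the clean image: w = atanh(2x - 1), clamped for numerical stability
    w = torch.atanh((2 * image - 1).clamp(-1 + 1e-6, 1 - 1e-6)).requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=0.01)

    # Reference class: the true label (untargeted) or the target label (targeted)
    ref = label if target is None else target

    for _ in range(max_iter):
        adv_image = to_image(w)
        logits = model(adv_image)

        # Logit of the reference class and the largest logit among all other classes
        ref_logit = logits.gather(1, ref.unsqueeze(1)).squeeze(1)
        ref_mask = F.one_hot(ref, logits.size(1)).bool()
        other_max = logits.masked_fill(ref_mask, float("-inf")).max(dim=1).values

        if target is None:
            # Untargeted: push the true-class logit below the best other class
            margin = (ref_logit - other_max).clamp(min=-kappa)
        else:
            # Targeted: push the target logit above every other class
            margin = (other_max - ref_logit).clamp(min=-kappa)

        # Squared L2 distance plus c times the adversarial margin term
        dist = ((adv_image - image) ** 2).flatten(1).sum(dim=1)
        loss = (dist + c * margin).sum()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return to_image(w).detach()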

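Characteristics of the C&W Attack

  • Variants exist for several $L_p$ norms ($L_0$, $L_2$, $L_\infty$)
  • Can bypass some defenses, such as defensive distillation
  • Computationally expensive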

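EOT Attack Variants

The EOT framework can be combined with optimization-based methods: the optimization objective is minimized in expectation over the sampled transformations rather than for a single fixed input.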

DeepFool Attack

DeepFool finds a minimal perturbation by iteratively linearizing the decision boundary [4]:

import torch

def deepfool(image, model, num_classes=10, max_iter=50, overshoot=0.02):
    """
    DeepFool L2 Attack for a single image of shape [1, C, H, W]
    """
    image = image.clone().detach()

    # The model's prediction on the clean image serves as the label to flip
    with torch.no_grad():
        true_label = model(image).argmax(dim=1).item()

    r_total = torch.zeros_like(image)  # accumulated perturbation
    perturbed = image.clone()

    for _ in range(max_iter):
        perturbed.requires_grad_(True)
        output = model(perturbed)
        if output.argmax(dim=1).item() != true_label:
            break  # the prediction has changed

        grad_true = torch.autograd.grad(output[0, true_label], perturbed, retain_graph=True)[0]

        # Find the closest linearized decision boundary among the other classes
        best_ratio, best_w = float("inf"), None
        for c in range(num_classes):
            if c == true_label:
                continue
            grad_c = torch.autograd.grad(output[0, c], perturbed, retain_graph=True)[0]
            w_c = grad_c - grad_true
            f_c = (output[0, c] - output[0, true_label]).item()
            ratio = abs(f_c) / (w_c.norm().item() + 1e-10)
            if ratio < best_ratio:
                best_ratio, best_w = ratio, w_c

        # Minimal L2 step onto that boundary, accumulated and applied with overshoot
        r_total = r_total + (best_ratio + 1e-4) * best_w / (best_w.norm() + 1e-10)
        perturbed = (image + (1 + overshoot) * r_total).clamp(0, 1).detach()

    return perturbed.detach()
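
For an affine binary classifier $f(x) = w^\top x + b$ the linearization is exact, and the minimal $L_2$ perturbation that reaches the decision boundary has the closed form $r = -\frac{f(x)}{\|w\|_2^2} w$. DeepFool applies this step repeatedly to a local linearization of a nonlinear model. The sketch below checks the closed form on random data; `deepfool_linear_binary` is a hypothetical helper name and the weights are stand-ins.

```python
import torch

def deepfool_linear_binary(x, w, b, overshoot=0.02):
    """Closed-form DeepFool step for the affine binary classifier f(x) = w.x + b."""
    f_x = torch.dot(w, x) + b
    r = -f_x / torch.dot(w, w) * w  # orthogonal projection of x onto the hyperplane f = 0
    return x + (1 + overshoot) * r, r

torch.manual_seed(0)
w = torch.randn(20, dtype=torch.float64)    # hypothetical weights
b = torch.tensor(0.3, dtype=torch.float64)  # hypothetical bias
x = torch.randn(20, dtype=torch.float64)    # hypothetical input

x_adv, r = deepfool_linear_binary(x, w, b)
f_before = (torch.dot(w, x) + b).item()
f_boundary = (torch.dot(w, x + r) + b).item()  # x + r lies on the boundary
f_after = (torch.dot(w, x_adv) + b).item()     # the overshoot pushes x_adv across it
distance_to_boundary = abs(f_before) / w.norm().item()
```

The norm of `r` equals the distance from `x` to the hyperplane, which is why DeepFool perturbations are often used as an estimate of the minimal adversarial distance.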

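AutoAttack

AutoAttack is an ensemble of four complementary attacks [5]:

  1. APGD-CE: PGD with adaptive step size (cross-entropy loss)
  2. APGD-T: targeted APGD with the DLR loss
  3. FAB: Fast Adaptive Boundary attack (the targeted variant, FAB-T, in the standard version)
  4. Square: a query-based black-box attack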

# AutoAttack usage example
from autoattack import AutoAttack

adversary = AutoAttack(model, norm='Linf', eps=8/255, version='standard')
adversarial = adversary.run_standard_evaluation(images, labels)

AutoAttack is widely treated as the de facto standard for evaluating adversarial robustness.
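
The DLR (Difference of Logits Ratio) loss named in the list above can be sketched as follows. This is the untargeted form from Croce & Hein [5], $\mathrm{DLR}(z, y) = -\frac{z_y - \max_{i \ne y} z_i}{z_{\pi_1} - z_{\pi_3}}$ with $\pi$ sorting the logits in descending order; APGD-T uses a targeted variant. The function below is an illustrative re-implementation, not the code of the `autoattack` package, and it assumes at least three classes.

```python
import torch

def dlr_loss(logits, labels):
    """Untargeted Difference of Logits Ratio, one value per sample (needs >= 3 classes)."""
    sorted_logits, sorted_idx = logits.sort(dim=1, descending=True)
    true_logit = logits.gather(1, labels.unsqueeze(1)).squeeze(1)

    # Largest logit among the classes other than the true one
    top_is_true = sorted_idx[:, 0] == labels
    best_other = torch.where(top_is_true, sorted_logits[:, 1], sorted_logits[:, 0])

    # Normalize the margin by the gap between the largest and third-largest logit
    return -(true_logit - best_other) / (sorted_logits[:, 0] - sorted_logits[:, 2] + 1e-12)

logits = torch.tensor([[4.0, 1.0, 0.0],   # true class 0, classified correctly
                       [1.0, 3.0, 0.0]])  # true class 0, classified as class 1
labels = torch.tensor([0, 0])
loss = dlr_loss(logits, labels)           # negative for the first row, positive for the second
rescaled = dlr_loss(10 * logits, labels)  # unchanged when the logits are rescaled
```

The normalization makes the loss invariant to rescaling of the logits, which is the motivation given for using it instead of cross-entropy when a model's logits are very large or very small.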

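Comparison of Adversarial Attacks

| Attack     | Threat model | Norm                       | Strength    | Cost |
|------------|--------------|----------------------------|-------------|------|
| FGSM       | White-box    | $L_\infty$                 | -           | -    |
| FGM        | White-box    | $L_2$                      | -           | -    |
| PGD        | White-box    | $L_\infty$                 | -           | -    |
| BIM        | White-box    | $L_\infty$                 | Medium-high | -    |
| MI-FGSM    | White-box    | $L_\infty$                 | -           | -    |
| C&W        | White-box    | Any                        | Highest     | -    |
| DeepFool   | White-box    | $L_2$                      | -           | -    |
| EOT        | White-box    | Physical                   | -           | -    |
| AutoAttack | White-box    | $L_\infty$, $L_2$          | Highest     | -    |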
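
Attack strength in a comparison like this is usually measured as robust accuracy: the fraction of samples the model still classifies correctly after the attack. The sketch below shows one way to compute it for any attack with the `(image, label, model)` calling convention used in this survey. The helper names, the compact FGSM re-implementation, and the toy linear model are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def robust_accuracy(model, attack, images, labels):
    """Fraction of samples still classified correctly after applying `attack`."""
    model.eval()
    adv = attack(images, labels, model)
    with torch.no_grad():
        preds = model(adv).argmax(dim=1)
    return (preds == labels).float().mean().item()

def no_attack(image, label, model):
    """Identity 'attack', used to measure clean accuracy with the same helper."""
    return image

def fgsm_attack(image, label, model, epsilon=8/255):
    """Compact FGSM, included only to keep this example self-contained."""
    x = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    grad, = torch.autograd.grad(loss, x)
    return (x.detach() + epsilon * grad.sign()).clamp(0, 1)

# Toy setup: a hypothetical linear classifier, labeled by its own predictions
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
images = torch.rand(64, 3, 8, 8)
with torch.no_grad():
    labels = toy_model(images).argmax(dim=1)  # clean accuracy is 1.0 by construction

clean_acc = robust_accuracy(toy_model, no_attack, images, labels)
fgsm_acc = robust_accuracy(toy_model, fgsm_attack, images, labels)
```

Running several attacks through the same helper under an identical budget gives directly comparable numbers; a lower robust accuracy indicates a stronger attack.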

References

  1. Goodfellow, I. J., et al. (2015). Explaining and Harnessing Adversarial Examples. ICLR 2015. https://arxiv.org/abs/1412.6572

  2. Madry, A., et al. (2018). Towards Deep Learning Models Resistant to Adversarial Attacks. ICLR 2018. https://arxiv.org/abs/1706.06083

  3. Dong, Y., et al. (2018). Boosting Adversarial Attacks with Momentum. CVPR 2018. https://arxiv.org/abs/1710.06081

  4. Moosavi-Dezfooli, S. M., et al. (2016). DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks. CVPR 2016. https://arxiv.org/abs/1511.04599

  5. Croce, F., & Hein, M. (2020). Reliable Evaluation of Adversarial Robustness with an Ensemble of Diverse Parameter-free Attacks. ICML 2020. https://arxiv.org/abs/2003.01690