对抗训练

引言

**对抗训练(Adversarial Training)**是提升神经网络对抗鲁棒性的最有效方法之一。其核心思想是在训练过程中注入对抗样本,使模型学会抵抗对抗扰动。1

对抗训练的形式化

鲁棒优化视角

Madry等人(2018)将对抗训练形式化为一个min-max优化问题:1

$$\min_\theta \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \max_{\|\delta\|_\infty \le \epsilon} \mathcal{L}\big(f_\theta(x+\delta),\, y\big) \right]$$
其中:

  • 内层最大化:找到最强对抗扰动
  • 外层最小化:训练鲁棒模型参数

与标准训练对比

| 训练方式 | 目标 | 优化问题 |
| --- | --- | --- |
| 标准训练 | 最小化标准损失 | $\min_\theta \mathbb{E}_{(x,y)}\,[\mathcal{L}(f_\theta(x), y)]$ |
| 对抗训练 | 最小化对抗损失 | $\min_\theta \mathbb{E}_{(x,y)}\,[\max_{\|\delta\|_\infty \le \epsilon} \mathcal{L}(f_\theta(x+\delta), y)]$ |

PGD对抗训练

标准PGD-AT

import torch
import torch.nn.functional as F

def pgd_adversarial_training(model, optimizer, x, y,
                             epsilon=0.03, alpha=0.003, steps=7):
    """
    PGD-based Adversarial Training(单个 batch 的一次参数更新)。

    Args:
        epsilon: 扰动上界(l∞ 范数)
        alpha: 每步步长
        steps: PGD迭代步数
    """
    # 生成对抗样本时切换到 eval 模式,避免攻击的前向传播污染 BatchNorm 统计量
    model.eval()

    # 随机初始化起点,并裁剪回合法像素范围 [0, 1]
    x_adv = x.clone().detach()
    x_adv = x_adv + torch.empty_like(x_adv).uniform_(-epsilon, epsilon)
    x_adv = torch.clamp(x_adv, 0, 1).detach()

    for _ in range(steps):
        x_adv.requires_grad_(True)

        output = model(x_adv)
        loss = F.cross_entropy(output, y)

        # 只需要对输入的梯度,不必累积模型参数的梯度
        grad = torch.autograd.grad(loss, x_adv)[0]

        with torch.no_grad():
            # 沿符号梯度上升一步,再投影回 epsilon-球和 [0, 1] 范围
            x_adv = x_adv + alpha * grad.sign()
            x_adv = torch.maximum(x_adv, x - epsilon)
            x_adv = torch.minimum(x_adv, x + epsilon)
            x_adv = torch.clamp(x_adv, 0, 1)

    # 使用对抗样本更新模型参数
    model.train()
    output = model(x_adv.detach())
    loss_adv = F.cross_entropy(output, y)

    optimizer.zero_grad()
    loss_adv.backward()
    optimizer.step()

    return loss_adv.item()

完整训练循环

def train_adversarial(model, train_loader, optimizer,
                      epsilon=8/255, alpha=2/255, steps=7, epochs=100):
    """完整的对抗训练流程。"""
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.cuda(), y.cuda()

            # PGD对抗样本生成(pgd_attack 的参考实现见下方示例)
            x_adv = pgd_attack(model, x, y, epsilon, alpha, steps)

            # 在对抗样本上计算损失并更新参数
            optimizer.zero_grad()
            output = model(x_adv)
            loss = F.cross_entropy(output, y)
            loss.backward()
            optimizer.step()
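
上面的循环调用了 `pgd_attack`,原文未给出其定义。下面是一个最小示意实现,与前面 `pgd_adversarial_training` 中的攻击步骤一致(函数名与签名为本文假设):

def pgd_attack(model, x, y, epsilon=8/255, alpha=2/255, steps=7):
    """仅生成 PGD 对抗样本,不更新模型(示意实现)。"""
    model.eval()
    x_adv = x.clone().detach()
    x_adv = torch.clamp(x_adv + torch.empty_like(x_adv).uniform_(-epsilon, epsilon), 0, 1)

    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = torch.minimum(torch.maximum(x_adv, x - epsilon), x + epsilon)
            x_adv = torch.clamp(x_adv, 0, 1)

    model.train()
    return x_adv.detach()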

TRADES方法

Zhang等人(2019)提出的TRADES在标准准确率和鲁棒性之间取得更好平衡:2

目标函数

$$\min_\theta \; \frac{1}{n}\sum_{i=1}^{n} \left[ \mathcal{L}\big(f_\theta(x_i),\, y_i\big) + \beta \cdot \max_{\|x_i' - x_i\| \le \epsilon} \mathrm{KL}\big(f_\theta(x_i)\,\|\,f_\theta(x_i')\big) \right]$$

其中 $x_i'$ 是第 $i$ 个样本对应的对抗样本(在 $\epsilon$-球内使 KL 散度最大),$\beta$ 控制标准损失与鲁棒正则项之间的权衡。

TRADES的理论动机

TRADES通过KL散度正则化,强制模型在对抗扰动下保持与原始样本相似的预测分布:

“The robustness can be achieved by forcing the model to output similar predictions for nearby points.”

TRADES实现

def trades_loss(model, x, y, x_adv, beta=6.0):
    """
    TRADES loss function.

    TRADES = TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization
    """
    # 标准交叉熵损失(在干净样本上)
    logits_clean = model(x)
    loss_clean = F.cross_entropy(logits_clean, y)

    # KL散度正则化:约束对抗样本处的预测分布接近干净样本处的预测分布
    # 按官方实现的方向,计算 KL(f(x) || f(x_adv))
    loss_robust = F.kl_div(
        model(x_adv).log_softmax(dim=1),
        logits_clean.softmax(dim=1),
        reduction='batchmean'
    )

    # 组合损失
    return loss_clean + beta * loss_robust
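
TRADES 生成对抗样本的方式与 PGD-AT 略有不同:内层最大化的目标是 KL 散度而不是交叉熵。下面是按这一思路写的最小示意(仍用带随机初始化的 PGD 迭代,函数名为本文假设):

def trades_attack(model, x, epsilon=8/255, alpha=2/255, steps=10):
    """为 TRADES 生成对抗样本:最大化 KL(f(x) || f(x_adv))(示意实现)。"""
    model.eval()
    p_clean = model(x).softmax(dim=1).detach()

    # TRADES 官方实现以小幅高斯噪声作随机初始化,这里沿用该做法
    x_adv = (x + 0.001 * torch.randn_like(x)).clamp(0, 1).detach()

    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss_kl = F.kl_div(model(x_adv).log_softmax(dim=1), p_clean,
                           reduction='batchmean')
        grad = torch.autograd.grad(loss_kl, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = torch.minimum(torch.maximum(x_adv, x - epsilon), x + epsilon)
            x_adv = torch.clamp(x_adv, 0, 1)

    model.train()
    return x_adv.detach()

生成的对抗样本可直接与上面的 trades_loss 配合使用:先 x_adv = trades_attack(model, x),再计算 trades_loss(model, x, y, x_adv)。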

MART方法

Wang等人(2020)提出MART(Misclassification-Aware adveRsarial Training,误分类感知的对抗训练),强调被误分类样本与被正确分类样本在对抗训练中的不同作用:3

$$\ell_{\text{MART}}(x_i, y_i; \theta) = \mathrm{BCE}\big(p(x_i', \theta),\, y_i\big) + \lambda \cdot \mathrm{KL}\big(p(x_i, \theta)\,\|\,p(x_i', \theta)\big) \cdot \big(1 - p_{y_i}(x_i, \theta)\big)$$

其中 $p(x_i', \theta)$ 是模型在对抗样本 $x_i'$ 上的概率估计,$p_{y_i}(x_i, \theta)$ 是干净样本上真实类别的预测概率:该概率越低(样本越接近被误分类),KL 正则项的权重 $1 - p_{y_i}(x_i, \theta)$ 越大。
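
原文没有给出 MART 的实现。下面是按上式写的最小示意(boosted CE 与误分类感知加权的 KL 两项;x_adv 的生成方式与 PGD-AT 相同,细节以原论文与官方实现为准):

def mart_loss(model, x, y, x_adv, beta=6.0):
    """MART loss 的示意实现:boosted CE + 误分类感知加权的 KL 正则。"""
    logits_adv = model(x_adv)
    logits_clean = model(x)
    probs_adv = logits_adv.softmax(dim=1)
    probs_clean = logits_clean.softmax(dim=1)

    # Boosted cross-entropy:提升真实类别概率,同时压低最大的错误类别概率
    true_prob_adv = probs_adv.gather(1, y.unsqueeze(1)).squeeze(1)
    wrong_probs = probs_adv.clone()
    wrong_probs.scatter_(1, y.unsqueeze(1), 0.0)
    max_wrong_prob = wrong_probs.max(dim=1).values
    loss_bce = (-torch.log(true_prob_adv + 1e-12)
                - torch.log(1.0 - max_wrong_prob + 1e-12)).mean()

    # KL 正则,按干净样本上真实类别概率的低置信度加权
    kl_per_sample = (probs_clean * (torch.log(probs_clean + 1e-12)
                                    - torch.log(probs_adv + 1e-12))).sum(dim=1)
    true_prob_clean = probs_clean.gather(1, y.unsqueeze(1)).squeeze(1)
    loss_robust = (kl_per_sample * (1.0 - true_prob_clean)).mean()

    return loss_bce + beta * loss_robust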

对抗训练的技术细节

1. 随机初始化

def random_start_pgd(x, epsilon):
    """随机初始化以避免对抗样本的平凡性。"""
    x_adv = x + torch.empty_like(x).uniform_(-epsilon, epsilon)
    return torch.clamp(x_adv, 0, 1)

2. Early Stopping

一种做法是:对每个样本,只有在攻击成功(模型在对抗样本上预测错误)时,才将其计入模型更新:

logits = model(x_adv)
success = logits.argmax(dim=1) != y   # 每个样本是否被成功攻击
if success.any():
    loss = F.cross_entropy(logits[success], y[success])
    loss.backward()

3. 课程对抗训练

逐步增加扰动强度:

epsilon_schedule = [0.01, 0.02, 0.03, 0.05]  # 课程:逐步增大的扰动上界
for stage, eps in enumerate(epsilon_schedule):
    # 使用当前epsilon训练若干个epoch
    train_adversarial(model, ..., epsilon=eps)

4. Label Smoothing

结合标签平滑提升泛化:

def smoothed_loss(logits, y, smoothing=0.1):
    n_classes = logits.size(1)
    # 将整数类别标签转为 one-hot 后再平滑
    y_onehot = F.one_hot(y, n_classes).float()
    y_smooth = y_onehot * (1 - smoothing) + smoothing / n_classes
    return torch.sum(-y_smooth * F.log_softmax(logits, dim=1), dim=1).mean()
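
较新版本的 PyTorch(1.10 及以上)中,`F.cross_entropy` 自带 `label_smoothing` 参数,可直接替代上面的手写实现:

# 等价写法:使用内置的标签平滑参数
loss = F.cross_entropy(logits, y, label_smoothing=0.1)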

最新进展

1. AWP(Adversarial Weight Perturbation)

AWP 在对输入施加对抗扰动的同时,也对模型权重施加对抗扰动,相当于同时在输入空间和权重空间做正则化:4

def awp_loss(model, x, y, epsilon_w=1e-2, epsilon=8/255):
    """Adversarial Weight Perturbation(高度简化的示意)。"""
    # 第一步:在输入空间生成对抗样本
    x_adv = pgd_attack(model, x, y, epsilon)

    # 第二步:在权重空间沿使对抗损失增大的方向扰动权重
    # get_params / set_params 为示意用的辅助函数;p.grad 需先对对抗损失
    # 做一次 backward 得到,官方实现中还会按层对权重扰动做范数归一化
    params = get_params(model)
    params_adv = [p + epsilon_w * p.grad.sign() for p in params]
    set_params(model, params_adv)

    # 第三步:在扰动后的权重上计算对抗损失,用于参数更新
    output = model(x_adv)
    loss = F.cross_entropy(output, y)

    # 恢复原始权重
    set_params(model, params)

    return loss

2. Fast is Better than Free(FIB)

通过带随机初始化的单步 FGSM 攻击代替多步 PGD,大幅降低对抗训练的计算开销:5
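
下面是这一思路的最小示意:随机初始化扰动后只做一步 FGSM 更新(常称 FGSM-RS),每个 batch 只需一次额外的前向/反向传播;论文中的循环学习率等加速技巧从略,函数名与步长取值为本文假设:

def fgsm_rs_adversarial_step(model, optimizer, x, y, epsilon=8/255, alpha=10/255):
    """随机初始化 + 单步 FGSM 的对抗训练步骤(示意实现)。"""
    # 在 epsilon-球内随机初始化扰动
    model.eval()
    delta = torch.empty_like(x).uniform_(-epsilon, epsilon)
    delta.requires_grad_(True)

    # 单步 FGSM:对扰动做一次符号梯度上升(论文建议步长略大于 epsilon)
    loss = F.cross_entropy(model(torch.clamp(x + delta, 0, 1)), y)
    grad = torch.autograd.grad(loss, delta)[0]
    with torch.no_grad():
        delta = torch.clamp(delta + alpha * grad.sign(), -epsilon, epsilon)

    # 在对抗样本上更新模型
    model.train()
    optimizer.zero_grad()
    loss_adv = F.cross_entropy(model(torch.clamp(x + delta, 0, 1)), y)
    loss_adv.backward()
    optimizer.step()
    return loss_adv.item()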

3. AT-PR(Adversarial Training for Probabilistic Robustness)

ICCV 2025提出的新方法,专门针对概率鲁棒性(probabilistic robustness)进行优化。6

对抗训练的挑战

1. 计算开销

对抗训练需要在内层循环生成对抗样本,计算量是标准训练的3-10倍。

解决方案

  • 使用少量PGD步数(如7步)
  • 提前终止攻击(Early Stopping,见下方示意代码)
  • 分布式训练
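
这里的"提前终止攻击"指在 PGD 内层循环中,一旦某个样本已被成功误分类,就不再对它继续迭代,从而节省前向/反向计算。下面是按样本掩码实现的最小示意(函数名与具体写法为本文假设):

def pgd_attack_early_stop(model, x, y, epsilon=8/255, alpha=2/255, steps=20):
    """提前终止版 PGD:已被误分类的样本不再继续更新(示意实现)。"""
    model.eval()
    x_adv = torch.clamp(x + torch.empty_like(x).uniform_(-epsilon, epsilon), 0, 1).detach()
    active = torch.ones(x.size(0), dtype=torch.bool, device=x.device)  # 仍需继续攻击的样本

    for _ in range(steps):
        if not active.any():
            break
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            # 只对尚未攻击成功的样本走一步
            step = alpha * grad.sign() * active.float().view(-1, 1, 1, 1)
            x_adv = torch.clamp(x_adv + step, 0, 1)
            x_adv = torch.minimum(torch.maximum(x_adv, x - epsilon), x + epsilon)
            active = model(x_adv).argmax(dim=1) == y  # 预测仍正确的样本继续攻击

    model.train()
    return x_adv.detach()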

2. 过拟合问题

对抗训练容易出现“鲁棒过拟合”:训练后期测试集上的鲁棒准确率反而下降,模型也可能过拟合到训练时所用的特定攻击:

# 使用Clean Finetuning缓解过拟合
def clean_finetune(model, clean_loader, epochs=10, lr=1e-5):
    """在干净数据上以较小学习率微调,缓解过拟合。"""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for epoch in range(epochs):
        for x, y in clean_loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            optimizer.step()

3. 泛化-鲁棒性权衡

标准准确率和鲁棒性之间存在基本张力:7

| 指标 | 标准训练 | 对抗训练 |
| --- | --- | --- |
| Clean Accuracy | 高 | 略低 |
| Robust Accuracy | 接近 0 | 显著提升 |
| Standard-Robust Gap | 大 | 较小 |

实验对比

使用CIFAR-10数据集($\ell_\infty$ 扰动,$\epsilon = 8/255$,与上文代码默认值一致)的典型结果:

| 方法 | Clean Acc | Robust Acc |
| --- | --- | --- |
| Standard | 95.0% | 0.0% |
| PGD-AT | 87.3% | 56.1% |
| TRADES | 88.6% | 55.8% |
| MART | 88.1% | 57.0% |
| AWP | 89.2% | 57.8% |

本章小结

对抗训练是对抗鲁棒性的核心方法:

  1. 形式化:min-max优化问题
  2. PGD-AT:标准对抗训练方法
  3. TRADES:准确率-鲁棒性平衡
  4. MART:误分类感知的对抗训练
  5. 技术细节:随机初始化、课程学习、早停
  6. 最新进展:AWP、FIB、AT-PR

对抗训练虽然有效,但计算开销大。下一章将介绍认证鲁棒性方法,提供可证明的鲁棒性保证。

参考文献

Footnotes

  1. Madry, A., et al. (2018). Towards Deep Learning Models Resistant to Adversarial Attacks. ICLR 2018.

  2. Zhang, H., et al. (2019). Theoretically Principled Trade-off between Robustness and Accuracy. ICML 2019.

  3. Wang, Y., et al. (2020). Improving Adversarial Robustness Requires Revisiting Misclassified Examples. ICLR 2020.

  4. Wu, D., et al. (2020). Adversarial Weight Perturbation Helps Robust Generalization. NeurIPS 2020.

  5. Wong, E., et al. (2020). Fast Is Better Than Free: Revisiting Adversarial Training. ICLR 2020.

  6. Zhang, Y., et al. (2025). Adversarial Training for Probabilistic Robustness. ICCV 2025.

  7. Tsipras, D., et al. (2019). Robustness May Be at Odds with Accuracy. ICLR 2019.