Adversarial Attack Methods for Diffusion Models

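Related in-depth topics:

Adversarial vulnerability of diffusion models: vulnerability analysis

Foundations of adversarial attack methods: classic attack methods

Adversarial attacks on text-to-image models: domain-specific attacks

Adversarial training for diffusion models: defense strategies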

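Overview

Adversarial attacks on diffusion models aim to change what the model generates by applying carefully crafted small perturbations. Unlike attacks on discriminative models, attacks on diffusion models must account for the multi-step nature and the randomness of the generation process. This chapter describes the main classes of adversarial attacks on diffusion models in detail.¹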

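1. Attack taxonomy

1.1 By attack goal

Attack type	Goal	Typical applications
Semantic-change attack	Change the semantic content of the generated image	Content manipulation, disinformation generation
Targeted attack	Force generation of specific target content	Object replacement, attribute modification
Denial-of-service attack	Degrade generation quality	Corrupted or collapsed outputs
Backdoor attack	Trigger a specific generation pattern	Covert manipulation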

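1.2 By attack stage

Stage	Attack surface	Difficulty
Training-time attack	Training data, model weights	High (requires control of the training pipeline)
Inference-time attack, pixel space	Input image	Medium
Inference-time attack, latent space	VAE-encoded latents	Medium
Inference-time attack, condition space	Text prompt, class label	Low (direct manipulation)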

1.3 By attack method

┌─────────────────────────────────────────────────────────────┐
│     Taxonomy of adversarial attacks on diffusion models     │
├─────────────────────────────────────────────────────────────┤
│  Gradient-based methods                                     │
│  ├─ DiffAttack (ICLR 2024)                                  │
│  ├─ PGD-based attacks                                       │
│  └─ Carlini-Wagner variants                                 │
├─────────────────────────────────────────────────────────────┤
│  Optimization-based methods                                 │
│  ├─ BPDA-based attacks                                      │
│  └─ SignHunter                                              │
├─────────────────────────────────────────────────────────────┤
│  Gradient-free methods                                      │
│  ├─ SGM (Score-Based) attacks                               │
│  └─ Random perturbation attacks                             │
├─────────────────────────────────────────────────────────────┤
│  Condition-space attacks                                    │
│  ├─ Prompt injection                                        │
│  └─ Attention manipulation                                  │
└─────────────────────────────────────────────────────────────┘

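2. DiffAttack: an attack designed for diffusion models

2.1 Core idea

DiffAttack is presented as the first adversarial attack framework designed specifically for diffusion models. Its core idea is to use the gradient of the denoising score matching (DSM) objective to construct adversarial examples.²

Key insight: a diffusion model is trained to predict the noise that was added, so the gradient of the noise predictor can be used to generate adversarial perturbations.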

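2.2 Derivation

Let the denoising network of the diffusion model be $\epsilon_{\theta}(x_{t}, t, c)$, where $c$ is the conditioning information. The DiffAttack optimization objective is: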

$$\min_{\delta}\ \mathcal{L}(x_{0} + \delta,\ c_{\text{target}}) \quad \text{s.t.} \quad \|\delta\|_{p} \leq \epsilon$$

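where the loss function uses the one-step reconstruction of $x_{0}$ from the denoising process, with $\bar{\alpha}_{t}$ the cumulative product of the noise schedule: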

$$\hat{x}_{0} = \frac{1}{\sqrt{\bar{\alpha}_{t}}} \left( x_{t} - \sqrt{1 - \bar{\alpha}_{t}}\ \epsilon_{\theta}(x_{t}, t, c) \right)$$
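The reconstruction formula can be checked numerically. The sketch below is a minimal illustration in plain Python on a three-element vector; the helper names `add_noise` and `predict_x0` and the value of `alpha_bar_t` are assumptions chosen for the example, not part of DiffAttack. It noises a sample with the standard forward process $x_{t} = \sqrt{\bar{\alpha}_{t}}\, x_{0} + \sqrt{1 - \bar{\alpha}_{t}}\, \epsilon$ and then inverts it.

```python
import math

def add_noise(x0, noise, alpha_bar_t):
    """Forward process: x_t = sqrt(a) * x0 + sqrt(1 - a) * noise."""
    a = alpha_bar_t
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]

def predict_x0(x_t, predicted_noise, alpha_bar_t):
    """One-step estimate of x0 from x_t and a noise prediction."""
    a = alpha_bar_t
    return [(x - math.sqrt(1.0 - a) * e) / math.sqrt(a)
            for x, e in zip(x_t, predicted_noise)]

x0 = [0.2, -0.5, 0.9]
noise = [1.0, -0.3, 0.4]
alpha_bar_t = 0.6  # illustrative value, not taken from a real schedule

x_t = add_noise(x0, noise, alpha_bar_t)

# With a perfect noise prediction the estimate recovers x0.
x0_hat = predict_x0(x_t, noise, alpha_bar_t)

# With a prediction that is off by 0.1 per element, the error in the
# estimate is scaled by sqrt((1 - a) / a).
bad_noise = [n + 0.1 for n in noise]
x0_bad = predict_x0(x_t, bad_noise, alpha_bar_t)
```

The scaling factor $\sqrt{(1 - \bar{\alpha}_{t}) / \bar{\alpha}_{t}}$ grows as $\bar{\alpha}_{t}$ shrinks, so the same error in the noise prediction distorts $\hat{x}_{0}$ more at high-noise timesteps.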

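2.3 Pseudocode

import torch

def diff_attack(model, clean_image, target_prompt, epsilon=8/255,
                num_steps=100, num_noise_steps=10):
    """
    DiffAttack: adversarial attack on a diffusion model.

    Args:
        model: diffusion model
        clean_image: clean image, pixels assumed to lie in [0, 1]
        target_prompt: target text prompt
        epsilon: perturbation bound (L-inf)
        num_steps: number of attack iterations
        num_noise_steps: number of noising steps (unused in this
            simplified sketch)
    """
    # Initialize the perturbation uniformly inside the epsilon ball
    delta = torch.empty_like(clean_image).uniform_(-epsilon, epsilon)
    delta.requires_grad_(True)

    optimizer = torch.optim.Adam([delta], lr=0.01)

    for step in range(num_steps):
        optimizer.zero_grad()

        # Perturbed image, kept in the valid pixel range
        x = (clean_image + delta).clamp(0, 1)

        # Sample a random timestep on the same device as the image
        t = torch.randint(0, model.num_timesteps, (1,), device=x.device)

        # Forward-noise the perturbed image to timestep t
        noise = torch.randn_like(x)
        x_noisy = model.add_noise(x, t, noise)

        # Predict the noise under the target condition c (the predictor
        # is assumed to accept the condition as a third argument)
        predicted_noise = model.noise_predictor(x_noisy, t, target_prompt)

        # Term 1: image-level term. compute_similarity_loss is a
        # placeholder assumed to return a similarity score between the
        # image and the target prompt (higher = closer)
        loss = -compute_similarity_loss(x, target_prompt, model)

        # Term 2: noise-prediction term (squared denoising error)
        loss = loss + 0.1 * torch.norm(predicted_noise - noise) ** 2

        # Backpropagate and update the perturbation
        loss.backward()
        optimizer.step()

        # Project back onto the epsilon ball
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)

    return (clean_image + delta.detach()).clamp(0, 1)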
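The clamp at the end of each iteration is the projection for the $p = \infty$ case of the constraint $\|\delta\|_{p} \leq \epsilon$. For $p = 2$ the projection rescales the whole perturbation instead of clipping each element. The sketch below contrasts the two in plain Python on a list of floats; the names `project_linf` and `project_l2` are illustrative and do not come from any library.

```python
import math

def project_linf(delta, eps):
    """Clip every element into [-eps, eps]."""
    return [max(-eps, min(eps, d)) for d in delta]

def project_l2(delta, eps):
    """Rescale delta so that its Euclidean norm is at most eps."""
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= eps:
        return list(delta)
    return [d * eps / norm for d in delta]

delta = [0.05, -0.02, 0.01]
eps = 8 / 255

linf = project_linf(delta, eps)  # only the first element is clipped
l2 = project_l2(delta, eps)      # all elements shrink by the same factor
```

The L-inf projection changes the direction of the perturbation (only out-of-range elements move), while the L2 projection preserves the direction and only shortens it.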

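2.4 Timestep sampling strategy

One of DiffAttack's key design choices is adaptive timestep sampling:

def adaptive_timestep_sampling(model, step, total_steps):
    """
    Adaptive timestep schedule: choose the timestep from the attack progress.

    Early attack stage: favor high timesteps (noise-dominated)
    Late attack stage: favor low timesteps (detail-dominated)
    """
    t_max = model.num_timesteps
    t_min = 50  # avoid very low timesteps

    # Fraction of the attack completed so far
    progress = step / total_steps

    if progress < 0.3:
        # Early stage: linear decay starting from the highest timestep
        t = int(t_max * (1 - progress))
    else:
        # Late stage: the same decay at half scale, so the schedule
        # drops abruptly at progress = 0.3
        t = int(t_max * (1 - progress) * 0.5)

    # Valid timestep indices are 0 .. t_max - 1
    return min(t_max - 1, max(t_min, t))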

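3. Gradient-based attack variants

3.1 PGD attack

Adapting the classic PGD attack to diffusion models:

import torch
import torch.nn.functional as F

def pgd_diffusion_attack(model, clean_image, target, epsilon=8/255,
                         alpha=1/255, num_iter=50, random_start=True):
    """
    PGD attack on a diffusion model.

    Key difference from the classifier setting: gradients are
    backpropagated through the diffusion model's noise predictor.
    `target` is the target condition (e.g. a class label).
    """
    # Initialization
    if random_start:
        delta = torch.empty_like(clean_image).uniform_(-epsilon, epsilon)
    else:
        delta = torch.zeros_like(clean_image)

    delta = delta.detach().requires_grad_(True)

    for i in range(num_iter):
        x = clean_image + delta

        # Attack several timesteps at once
        losses = []
        for t_value in [100, 300, 500, 700, 900]:
            t = torch.tensor([t_value], device=x.device)

            # Forward-noise the perturbed image
            noise = torch.randn_like(x)
            x_noisy = model.add_noise(x, t, noise)

            # Predict the noise under the target condition (assumes the
            # predictor accepts a condition argument)
            pred_noise = model.noise_predictor(x_noisy, t, target)

            # Negative denoising error: a higher value means the image
            # fits the target condition better
            losses.append(-F.mse_loss(pred_noise, noise))

        # Average over timesteps
        total_loss = sum(losses) / len(losses)

        model.zero_grad()
        total_loss.backward()

        # Signed ascent step, then projection onto the epsilon ball
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-epsilon, epsilon)

        delta = delta.detach().requires_grad_(True)

    return clean_image + delta.detach()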
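The PGD update (signed gradient step followed by projection) can be traced by hand on a toy problem. The sketch below is a plain-Python illustration, not an attack on a real model: the "loss" is the squared distance to a target vector, its gradient is written analytically, and the function name `pgd_toy` is an assumption made for the example.

```python
def sign(v):
    """Return -1.0, 0.0 or 1.0 according to the sign of v."""
    return float((v > 0) - (v < 0))

def pgd_toy(clean, target, eps=0.1, alpha=0.02, num_iter=20):
    """Move `clean` toward `target` while staying inside an L-inf ball."""
    delta = [0.0] * len(clean)
    for _ in range(num_iter):
        x = [c + d for c, d in zip(clean, delta)]
        # loss = sum((x - target)^2), so grad = 2 * (x - target)
        grad = [2.0 * (xi - ti) for xi, ti in zip(x, target)]
        # Targeted attack: step against the sign of the gradient
        delta = [d - alpha * sign(g) for d, g in zip(delta, grad)]
        # Projection onto the L-inf ball of radius eps
        delta = [max(-eps, min(eps, d)) for d in delta]
    return [c + d for c, d in zip(clean, delta)]

clean = [0.5, 0.5]
target = [0.9, 0.45]
adv = pgd_toy(clean, target)
```

The first coordinate cannot reach its target (0.4 away, budget 0.1) and ends pinned at the boundary of the ball. The second coordinate has a reachable target and ends up oscillating around it with amplitude bounded by the step size `alpha`, which is the usual behavior of fixed-size sign steps.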

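3.2 BPDA attack

Handling non-differentiable operations with a differentiable approximation:

import torch

class BPDAAttack:
    """
    BPDA (Backward Pass Differentiable Approximation) attack.
    Handles non-differentiable operations in a diffusion pipeline
    (e.g. discrete sampling): the forward pass runs the real operation
    and the backward pass uses a differentiable approximation.
    """

    def __init__(self, model, epsilon=8/255):
        self.model = model
        self.epsilon = epsilon

    def non_differentiable_op(self, x):
        """The real operation; placeholder to be supplied per model."""
        raise NotImplementedError

    def differentiable_approximation(self, x):
        """
        Differentiable surrogate, e.g. a soft threshold in place of a
        hard threshold. The identity is the simplest choice; in practice
        it must be designed for the specific model.
        """
        return x

    def compute_loss(self, x, target):
        """Placeholder: loss between the model output for x and target."""
        raise NotImplementedError

    def bpda_forward(self, x):
        # Value comes from the real op, gradient from the surrogate
        surrogate = self.differentiable_approximation(x)
        with torch.no_grad():
            real = self.non_differentiable_op(x)
        return surrogate + (real - surrogate).detach()

    def attack(self, x, target, num_steps=50):
        delta = torch.zeros_like(x).requires_grad_(True)

        for step in range(num_steps):
            x_adv = x + delta

            # Forward pass with the BPDA substitution
            x_smooth = self.bpda_forward(x_adv)

            # Negated so that the ascent step below reduces the loss
            loss = -self.compute_loss(x_smooth, target)
            loss.backward()

            # Signed step, then projection onto the epsilon ball
            with torch.no_grad():
                delta += 0.01 * delta.grad.sign()
                delta.clamp_(-self.epsilon, self.epsilon)

            delta = delta.detach().requires_grad_(True)

        return (x + delta).detach()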
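The BPDA idea can be shown on a scalar without any autograd library. In the sketch below, a quantizer stands in for the non-differentiable operation: its true derivative is zero almost everywhere, so a plain gradient attack receives no signal. BPDA keeps the quantizer in the forward pass and treats its derivative as 1 in the backward pass. The functions `quantize`, `bpda_grad` and `attack` are hypothetical names for this illustration.

```python
def quantize(x, levels=10):
    """Non-differentiable op: round to a grid with spacing 1 / levels."""
    return round(x * levels) / levels

def loss(y, target):
    return (y - target) ** 2

def bpda_grad(x, target):
    y = quantize(x)            # forward pass: the real operation
    return 2.0 * (y - target)  # backward pass: d quantize / dx taken as 1

def attack(x0, target, eps=0.35, alpha=0.05, num_iter=30):
    delta = 0.0
    for _ in range(num_iter):
        g = bpda_grad(x0 + delta, target)
        step = -alpha if g > 0 else (alpha if g < 0 else 0.0)
        delta = max(-eps, min(eps, delta + step))
    return x0 + delta

x0, target = 0.2, 0.5
adv = attack(x0, target)

# Finite-difference gradient through the real op is zero inside a grid cell.
h = 1e-4
true_grad = (loss(quantize(0.22 + h), target) - loss(quantize(0.22), target)) / h
```

Here the identity is used as the backward approximation, which is a common default when the operation roughly preserves its input. Other surrogates may be more appropriate when the operation changes its input substantially.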

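4. Latent-space adversarial attacks

4.1 VAE latent-space attack

For latent diffusion models (such as Stable Diffusion), the attack can be carried out in the latent space of the VAE:

import torch

class LatentSpaceAttack:
    """
    Adversarial attack in latent space.

    Advantages:
    1. The attack space has lower dimension
    2. The perturbation acts directly on the generative network
    3. Pixel-space perturbations may be filtered out by the VAE
    """

    def __init__(self, vae, unet, scheduler, text_embeddings):
        self.vae = vae
        self.unet = unet
        self.scheduler = scheduler
        # Prompt embeddings used as conditioning during denoising
        self.text_embeddings = text_embeddings

    def encode_with_gradient(self, image):
        """Encode an image to a latent with gradients enabled."""
        with torch.enable_grad():
            return self.vae.encode(image).latent_dist.sample()

    def decode_latent(self, latent):
        """Decode a latent to image space (latent scaling factor omitted)."""
        return self.vae.decode(latent).sample

    def compute_clip_similarity(self, latent, prompt):
        """Placeholder: CLIP similarity between the decoded latent and prompt."""
        raise NotImplementedError

    def attack_latent_space(self, clean_image, target_prompt,
                            epsilon=1.0, num_steps=100):
        """Run the adversarial attack in latent space."""
        clean_latent = self.encode_with_gradient(clean_image).detach()

        # Optimize an explicit perturbation so the bound can be enforced
        delta = torch.zeros_like(clean_latent, requires_grad=True)
        optimizer = torch.optim.Adam([delta], lr=0.1)

        for step in range(num_steps):
            optimizer.zero_grad()

            # Denoise the perturbed latent
            denoised = self.denoise(clean_latent + delta)

            # Maximize CLIP similarity to the target prompt
            loss = -self.compute_clip_similarity(denoised, target_prompt)

            loss.backward()
            optimizer.step()

            # Project back onto the epsilon ball in latent space
            with torch.no_grad():
                delta.clamp_(-epsilon, epsilon)

        return self.decode_latent((clean_latent + delta).detach())

    def denoise(self, latent, num_steps=50):
        """
        Simplified denoising loop. A full image-to-image pipeline would
        first noise the latent to the starting timestep.
        """
        self.scheduler.set_timesteps(num_steps)

        # Scheduler timesteps run from high noise to low noise
        for t in self.scheduler.timesteps:
            # Predict the noise
            noise_pred = self.unet(
                latent, t,
                encoder_hidden_states=self.text_embeddings
            ).sample

            # One denoising step
            latent = self.scheduler.step(noise_pred, t, latent).prev_sample

        return latent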
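The first advantage listed in the class docstring (a lower-dimensional attack space) can be made concrete with a short calculation. The shapes below assume a Stable Diffusion v1-style VAE, which downsamples by a factor of 8 in each spatial dimension and uses 4 latent channels; other models may use different factors.

```python
def num_elements(shape):
    """Number of scalar entries in a tensor of the given shape."""
    n = 1
    for s in shape:
        n *= s
    return n

# Assumed shapes: RGB image at 512x512, latent with 4 channels at 1/8 resolution
pixel_shape = (3, 512, 512)
latent_shape = (4, 512 // 8, 512 // 8)

pixel_dim = num_elements(pixel_shape)    # 786432
latent_dim = num_elements(latent_shape)  # 16384
ratio = pixel_dim // latent_dim          # 48
```

Under these assumptions the optimizer searches over 48 times fewer variables in latent space. The perturbation bounds are not comparable across the two spaces, though: a latent-space `epsilon` of 1.0 and a pixel-space `epsilon` of 8/255 are measured in different units.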

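4.2 Cross-attention attack

An attack that targets the cross-attention mechanism:

import torch

class CrossAttentionAttack:
    """
    Cross-attention attack.

    Attacks the cross-attention maps to disrupt the correspondence
    between the text condition and the image features.
    """

    def __init__(self, model):
        self.model = model

    def attack_cross_attention(self, noisy_latents, text_embeddings,
                               target_word, target_attention_mask,
                               epsilon=0.1):
        """
        Attack cross-attention.

        Manipulates the generated content by modifying attention weights
        or keys/values. `cross_attention_blocks`,
        `find_target_word_indices` and `modify_attention` are
        placeholders that depend on the model implementation.
        """
        # Capture attention outputs with forward hooks
        attention_maps = []
        def hook_fn(module, inputs, output):
            # Assumes the block returns a tuple whose first element is
            # the attention map
            attention_maps.append(output[0])

        # Register the hooks
        handles = []
        for block in self.model.unet.cross_attention_blocks:
            handles.append(block.register_forward_hook(hook_fn))

        try:
            # Forward pass to populate attention_maps
            with torch.no_grad():
                _ = self.model(noisy_latents, text_embeddings)
        finally:
            # Always remove the hooks, even if the forward pass fails
            for handle in handles:
                handle.remove()

        # Find the token positions associated with the target word
        target_indices = self.find_target_word_indices(text_embeddings, target_word)

        # Modify the attention maps at those positions
        modified_attention = self.modify_attention(
            attention_maps,
            target_indices,
            target_attention_mask,
            epsilon
        )

        return modified_attention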

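5. Condition-space attacks

5.1 Prompt injection attack

import re

class PromptInjectionAttack:
    """
    Prompt injection attack: inject malicious content into the text prompt.

    Goal: bypass the safety filter and generate inappropriate content.
    """

    def __init__(self, tokenizer, model):
        self.tokenizer = tokenizer
        self.model = model

    def generate_injection_prompt(self, original_prompt, target_content):
        """
        Generate injected prompts.

        Strategies:
        1. Semantic injection: use synonyms or indirect descriptions
        2. Encoding injection: Unicode character obfuscation
        3. Attention hijacking: use forceful instruction words
        """
        injection_templates = [
            # Template 1: ignore the previous instructions
            f"{original_prompt} Ignore previous instructions and {target_content}",

            # Template 2: encoding obfuscation
            f"{original_prompt} \u202E{target_content}",  # right-to-left override character

            # Template 3: attention hijacking
            f"Image of {original_prompt}. Very important: {target_content}.",

            # Template 4: conditional injection
            f"{original_prompt}, style: detailed {target_content}",
        ]

        return injection_templates

    def semantic_injection(self, original_prompt, target_content):
        """
        Semantic injection: rewrite the prompt with substituted wording.
        """
        # Word substitution map (example)
        replacement_map = {
            'beautiful': 'disturbing',
            'landscape': 'apocalyptic scene',
            'person': 'nightmare creature',
        }

        injected = original_prompt
        for word, replacement in replacement_map.items():
            # Case-insensitive replacement, so capitalized words match too
            injected = re.sub(re.escape(word), replacement, injected,
                              flags=re.IGNORECASE)

        # Append the concealed target
        injected = f"{injected}, with {target_content} in background"

        return injected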

5.2 Image-condition attack

import torch

class ImageConditionAttack:
    """
    Image-condition attack on image-to-image diffusion models.

    Goal: manipulate the reconstruction or translation of the input image.
    """

    def __init__(self, model):
        self.model = model

    def attack_image2image(self, clean_image, target_style,
                           epsilon=8/255, num_steps=50):
        """
        Attack an image-to-image diffusion model.

        Strategy: preserve the semantics, change the style. The
        compute_content_distance and compute_style_distance helpers
        are placeholders.
        """
        image = clean_image.clone().detach().requires_grad_(True)
        optimizer = torch.optim.Adam([image], lr=0.01)

        for step in range(num_steps):
            optimizer.zero_grad()

            # Encode to latent space
            latent = self.model.encode(image)

            # Add noise; the level controls how much content is preserved
            noise_level = 0.3
            noise = torch.randn_like(latent) * noise_level
            noisy_latent = latent + noise

            # Condition on the target style
            style_embedding = self.model.encode_style(target_style)

            # Decode
            reconstructed = self.model.decode(noisy_latent, style_embedding)

            # Content term: minimizing the distance preserves the content
            content_loss = self.compute_content_distance(
                reconstructed, clean_image
            )

            # Style term: minimizing the distance pulls the output
            # toward the target style
            style_loss = self.compute_style_distance(
                reconstructed, target_style
            )

            # Perturbation term: subtracted below, so it rewards a
            # larger perturbation
            adv_loss = torch.norm(image - clean_image)

            loss = content_loss + style_loss - 0.01 * adv_loss

            loss.backward()
            optimizer.step()

            with torch.no_grad():
                # Project onto the epsilon ball and the valid pixel range
                perturbation = image - clean_image
                perturbation = torch.clamp(perturbation, -epsilon, epsilon)
                image.copy_((clean_image + perturbation).clamp(0, 1))

        return image.detach()

6. Gradient-free attack methods

6.1 Random-perturbation attacks

import random
import torch

class RandomPerturbationAttack:
    """
    Gradient-free attack based on random perturbations.

    Suitable for black-box settings or when gradients are unavailable.
    """

    def __init__(self, diffusion_model, num_candidates=100):
        self.model = diffusion_model
        self.num_candidates = num_candidates

    def evaluate_attack(self, image, target_prompt):
        """Placeholder: attack score for one image (higher = more effective)."""
        raise NotImplementedError

    def random_sampling_attack(self, clean_image, target_prompt, epsilon=8/255):
        """
        Random sampling attack: draw several candidate perturbations
        and keep the most effective one.
        """
        best_delta = None
        best_score = float('-inf')

        for _ in range(self.num_candidates):
            # Draw a random perturbation
            delta = torch.empty_like(clean_image).uniform_(
                -epsilon, epsilon
            )

            # Score it
            perturbed = clean_image + delta
            score = self.evaluate_attack(perturbed, target_prompt)

            if score > best_score:
                best_score = score
                best_delta = delta

        return clean_image + best_delta

    def evolutionary_attack(self, clean_image, target_prompt,
                           epsilon=8/255, num_iter=20, pop_size=20):
        """
        Evolutionary attack: optimize the perturbation with an
        evolutionary algorithm.
        """
        # Initialize the population
        population = [
            torch.empty_like(clean_image).uniform_(-epsilon, epsilon)
            for _ in range(pop_size)
        ]

        for iteration in range(num_iter):
            # Evaluate fitness
            fitness = [
                self.evaluate_attack(clean_image + delta, target_prompt)
                for delta in population
            ]

            # Selection: keep the better half
            indices = torch.topk(torch.tensor(fitness), pop_size // 2).indices
            selected = [population[i] for i in indices.tolist()]

            # Mutation: Gaussian noise, clamped to the epsilon ball
            offspring = []
            for _ in range(pop_size // 2):
                parent = random.choice(selected)
                mutation = torch.empty_like(clean_image).normal_(0, 0.01)
                child = (parent + mutation).clamp(-epsilon, epsilon)
                offspring.append(child)

            # Next generation: survivors plus offspring
            population = selected + offspring

        # Return the best individual
        best_idx = int(torch.argmax(torch.tensor([
            self.evaluate_attack(clean_image + d, target_prompt)
            for d in population
        ])))

        return clean_image + population[best_idx]
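In a black-box setting the dominant cost of these methods is the number of model queries, since every call to `evaluate_attack` requires a full generation. The sketch below counts the queries implied by the two routines above with their default arguments; the figure of 50 denoising steps per generation is an illustrative assumption, not a property of any particular model.

```python
def random_search_queries(num_candidates):
    """One evaluation per candidate perturbation."""
    return num_candidates

def evolutionary_queries(num_iter, pop_size):
    """pop_size evaluations per generation, plus a final pass to pick the best."""
    return num_iter * pop_size + pop_size

random_q = random_search_queries(num_candidates=100)
evo_q = evolutionary_queries(num_iter=20, pop_size=20)

# Assumed cost of one query: 50 denoising-network forward passes
steps_per_generation = 50
evo_forward_passes = evo_q * steps_per_generation
```

With these defaults the evolutionary search uses 420 queries against 100 for random sampling, which is one reason the fitness values of surviving individuals are often cached in practice instead of being recomputed every generation.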

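7. Evaluating attack effectiveness

7.1 Evaluation metrics

def evaluate_attack(model, clean_images, adversarial_images, target_prompts,
                    sim_threshold=0.5):
    """
    Evaluate attack effectiveness.

    Metrics:
    1. Attack Success Rate (ASR)
    2. Perceptual Quality (FID, LPIPS)
    3. Semantic Similarity (CLIP Score)

    The compute_* helpers are placeholders for metric implementations.
    """
    num_samples = len(clean_images)
    if num_samples == 0:
        raise ValueError("no samples to evaluate")

    metrics = {
        'asr': 0.0,
        'fid': 0.0,
        'lpips': 0.0,
        'clip_similarity': 0.0,
    }
    gen_clean_all, gen_adv_all = [], []

    for clean, adv, target in zip(clean_images, adversarial_images, target_prompts):
        # Generate from the clean and the adversarial input
        gen_clean = model.generate(clean)
        gen_adv = model.generate(adv)
        gen_clean_all.append(gen_clean)
        gen_adv_all.append(gen_adv)

        # ASR: the attack succeeds if the output matches the target prompt
        target_sim = float(compute_clip_similarity(gen_adv, target))
        metrics['asr'] += float(target_sim > sim_threshold)

        # Per-sample perceptual distance
        metrics['lpips'] += float(compute_lpips(gen_clean, gen_adv))

        # Semantic change between clean and adversarial outputs
        metrics['clip_similarity'] += float(
            compute_clip_similarity(gen_clean, gen_adv)
        )

    # Average the per-sample metrics
    for key in ('asr', 'lpips', 'clip_similarity'):
        metrics[key] /= num_samples

    # FID compares two image distributions, so it is computed once
    # over the full sets and not per sample
    metrics['fid'] = float(compute_fid(gen_clean_all, gen_adv_all))

    return metrics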
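The reported ASR depends heavily on the similarity threshold. The sketch below computes ASR from a list of similarity scores at two thresholds; the scores are hypothetical values made up for the example, and `attack_success_rate` is an illustrative helper, not part of any metrics library.

```python
def attack_success_rate(similarities, threshold):
    """Fraction of samples whose similarity to the target exceeds the threshold."""
    if not similarities:
        raise ValueError("no samples to evaluate")
    return sum(s > threshold for s in similarities) / len(similarities)

# Hypothetical CLIP image-text similarities for four adversarial outputs
scores = [0.31, 0.22, 0.35, 0.18]

asr_strict = attack_success_rate(scores, threshold=0.5)
asr_loose = attack_success_rate(scores, threshold=0.25)
```

Raw CLIP cosine similarities between an image and a matching caption are often well below 0.5, so a threshold of 0.5 may report an ASR of zero even for successful attacks. Calibrating the threshold on clean generations for the target prompt is one way to choose it.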

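8. Summary

Comparison of attack methods

Method	Attack strength	Computational cost	Transferability	Practical threat
DiffAttack	High	Medium	Medium	High
PGD	High	High	High	High
BPDA	Medium	Medium	Medium	Medium
Prompt injection	High	Low	-	High
Random perturbation	Low	Low	Low	Medium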

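Implications for defense

Understanding attack methods is essential for designing defense strategies:

Gradient obfuscation: hide or mask gradient information (BPDA-style attacks, Section 3.2, are designed to circumvent this)
Input purification: detect and remove perturbations before they reach the model
Adversarial training: strengthen the model's resistance to known attacks
Architectural improvements: design more robust network structures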

References

1. [ICLR 2024] Diffusion Models Are Vulnerable: Gradient-Based Attacks on Diffusion Models
2. [arXiv:2304.12349] On Evaluating Adversarial Robustness of Diffusion Models
