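Sampling Techniques in Deep Learning

1 Introduction

Sampling plays several roles in deep learning: it serves as a regularizer, a tool for uncertainty estimation, the core mechanism of generative models, and a component of optimization algorithms. This chapter surveys how sampling is used during neural network training and inference. [1]

Core topics

  1. Dropout as a Bayesian approximation
  2. Stochastic neural networks and Bayesian treatments
  3. Sampling techniques in generative models
  4. Stochasticity in neural network optimization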

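2 Dropout as a Bayesian Approximation

2.1 Dropout recap

During training, dropout randomly zeroes activations (neurons):

import torch

def dropout_train(x, rate=0.5):
    """Dropout at training time (inverted dropout); `rate` is the drop probability."""
    # Keep each element with probability 1 - rate.
    mask = (torch.rand_like(x) > rate).to(x.dtype)
    # Rescale the survivors so the expected activation is unchanged.
    return x * mask / (1 - rate)

def dropout_eval(x):
    """Dropout is disabled at evaluation time."""
    return x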
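
The rescaling by 1 / (1 - rate) is what keeps the expected activation unchanged. The following is a minimal standard-library sketch (not the PyTorch code above) that checks this numerically; the helper names `dropout_train_list` and `mean_after_dropout` are made up for illustration.

```python
import random

def dropout_train_list(xs, rate=0.5, rng=random):
    """Inverted dropout on a plain Python list (illustrative helper)."""
    keep = 1.0 - rate
    # Each unit survives with probability `keep` and is rescaled by 1 / keep.
    return [x / keep if rng.random() > rate else 0.0 for x in xs]

def mean_after_dropout(x, rate, n_trials, seed=0):
    """Monte Carlo estimate of E[dropout(x)] for a single activation x."""
    rng = random.Random(seed)
    total = sum(dropout_train_list([x], rate, rng)[0] for _ in range(n_trials))
    return total / n_trials

estimate = mean_after_dropout(x=2.0, rate=0.3, n_trials=100_000)
print(estimate)  # close to 2.0, the value of the undropped activation
```

Without the division by `keep`, the same estimate would settle near 1.4 (that is, 0.7 * 2.0), and the network would see systematically smaller activations in training than at evaluation time.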

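2.2 Variational interpretation

Gal & Ghahramani (2016) showed that training with dropout can be interpreted as approximate variational inference. [2]

Model assumptions

  • Prior: $p(W_i) = \mathcal{N}(0, l^{-2} I)$ for each layer $i$
  • Approximate posterior: $q(W_i)$ defined by $W_i = M_i\,\mathrm{diag}(z_i)$ with $z_{i,j} \sim \mathrm{Bernoulli}(p_i)$

Variational objective

$$\mathcal{L} = -\sum_{n=1}^{N} \mathbb{E}_{q(W)}\big[\log p(y_n \mid x_n, W)\big] + \mathrm{KL}\big(q(W)\,\|\,p(W)\big) \approx -\sum_{n=1}^{N} \log p(y_n \mid x_n, \hat{W}_n) + \mathrm{KL}\big(q(W)\,\|\,p(W)\big)$$

where $\hat{W}_n \sim q(W)$ are the weights after dropout has been applied.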

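2.3 MC Dropout

At inference time, run several stochastic forward passes and aggregate them:

import torch
import torch.nn as nn

class MCDropout(nn.Module):
    """MC Dropout wrapper around a model that contains dropout layers."""

    def __init__(self, model, dropout_rate=0.5, n_samples=50):
        super().__init__()
        self.model = model
        # Informational only: the rate actually used is the one configured
        # in the wrapped model's own dropout layers.
        self.dropout_rate = dropout_rate
        self.n_samples = n_samples

    def predict(self, x):
        """Return the predictive mean and variance over n_samples passes."""
        # Enable only the dropout layers. Calling model.train() would also
        # switch BatchNorm layers to batch statistics.
        self.model.eval()
        for module in self.model.modules():
            if module.__class__.__name__.startswith("Dropout"):
                module.train()

        with torch.no_grad():
            # Shape: (n_samples, *output_shape).
            predictions = torch.stack(
                [self.model(x) for _ in range(self.n_samples)])

        # Sample mean and sample variance across the stochastic passes.
        mean = predictions.mean(dim=0)
        variance = predictions.var(dim=0)

        return mean, variance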
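
To make the mechanics concrete without PyTorch, here is a small standard-library sketch of the same idea on a toy one-hidden-layer ReLU network. The weights `W1` and `W2` and the function names are invented for the example; they are not part of the chapter's model.

```python
import random
import statistics

# Made-up weights for a 1-input, 4-hidden-unit, 1-output ReLU network.
W1 = [0.5, -1.0, 1.5, 0.8]
W2 = [1.0, 0.5, -0.7, 0.3]

def forward(x, rate, rng):
    """One forward pass with inverted dropout on the hidden layer."""
    keep = 1.0 - rate
    out = 0.0
    for w1, w2 in zip(W1, W2):
        h = max(0.0, w1 * x)                      # ReLU hidden activation
        if rate > 0.0:
            h = h / keep if rng.random() > rate else 0.0
        out += w2 * h
    return out

def toy_mc_dropout(x, rate=0.5, n_samples=50, seed=0):
    """Mean and variance of the output over n_samples stochastic passes."""
    rng = random.Random(seed)
    samples = [forward(x, rate, rng) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.variance(samples)

mean, var = toy_mc_dropout(2.0, n_samples=2000)
deterministic = forward(2.0, rate=0.0, rng=random.Random(0))
print(mean, var, deterministic)
```

Because dropout is applied to the last hidden layer of a network whose output is linear in that layer, the MC mean agrees with the deterministic pass up to sampling error, while the variance is the extra information that a single deterministic pass cannot provide.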

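2.4 Uncertainty estimation

MC Dropout can be used to estimate two kinds of uncertainty. The spread across stochastic forward passes reflects epistemic uncertainty; aleatoric uncertainty additionally requires the model to predict a noise term, for example a variance output.

Epistemic uncertainty

  • Caused by uncertainty about the model itself
  • Can be reduced with more data

Aleatoric uncertainty

  • Caused by noise inherent in the data
  • Cannot be reduced with more data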
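
One common way to separate the two is the law of total variance. The sketch below assumes a model that returns both a mean and a noise variance on every stochastic pass (an assumption, since the chapter does not define such a head); `decompose_uncertainty` is a hypothetical helper name.

```python
import statistics

def decompose_uncertainty(means, variances):
    """Split predictive variance via the law of total variance.

    means[t], variances[t]: predicted mean and predicted noise variance
    from the t-th stochastic forward pass.
    """
    aleatoric = statistics.fmean(variances)      # E_t[sigma_t^2]
    epistemic = statistics.pvariance(means)      # Var_t[mu_t]
    return epistemic, aleatoric, epistemic + aleatoric

means = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.30, 0.25, 0.35, 0.30, 0.30]
epistemic, aleatoric, total = decompose_uncertainty(means, variances)
print(epistemic, aleatoric, total)
```

Here the passes disagree only slightly about the mean, so most of the total variance is attributed to data noise rather than to the model.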

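2.5 Practical applications

Active learning

import numpy as np

def active_learning_mc_dropout(model, unlabeled_loader, n_samples=30, top_k=100):
    """
    Active learning with MC Dropout: return the indices (in loader order)
    of the top_k most uncertain samples.
    """
    # Uses the MCDropout wrapper from Section 2.3, which enables dropout itself.
    mc_model = MCDropout(model, n_samples=n_samples)
    uncertainties = []

    for x, _ in unlabeled_loader:
        _, variance = mc_model.predict(x)

        # Sum the variance over output dimensions: larger means more uncertain.
        uncertainty = variance.sum(dim=-1)
        uncertainties.extend(uncertainty.tolist())

    # Indices of the top_k largest uncertainties (assumes an unshuffled loader).
    indices = np.argsort(uncertainties)[-top_k:]

    return indices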

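3 Stochastic Neural Networks

3.1 DropConnect

DropConnect generalizes dropout by randomly dropping individual connections (weights) rather than whole neurons:

import torch

def dropconnect(x, weights, rate=0.5):
    """DropConnect: x has shape (batch, in), weights has shape (in, out)."""
    # Keep each weight with probability 1 - rate.
    mask = (torch.rand_like(weights) > rate).to(weights.dtype)
    # No rescaling is applied, so the expected output is (1 - rate)
    # times the output of the unmasked layer.
    output = x @ (weights * mask)
    return output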

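3.2 Stochastic pooling

Stochastic pooling for convolutional neural networks:

import torch

def stochastic_pooling(features, kernel_size=2):
    """
    Stochastic pooling: in each region, sample one activation with
    probability proportional to its value (assumes non-negative
    inputs, e.g. post-ReLU).
    """
    batch, channels, h, w = features.shape
    k2 = kernel_size * kernel_size

    # Output size for non-overlapping regions (leftover rows/columns are dropped).
    h_out = h // kernel_size
    w_out = w // kernel_size

    pooled = []
    for i in range(h_out):
        row = []
        for j in range(w_out):
            # Extract the region: (batch, channels, k, k).
            region = features[:, :,
                              i*kernel_size:(i+1)*kernel_size,
                              j*kernel_size:(j+1)*kernel_size]

            # Flatten to (batch, channels, k*k).
            region_flat = region.flatten(2)
            # Sampling weights; the epsilon keeps all-zero regions valid.
            weights = region_flat.clamp(min=0) + 1e-12

            # torch.multinomial treats each row as unnormalized weights.
            idx = torch.multinomial(weights.view(-1, k2), 1)
            idx = idx.view(batch, channels, 1)
            sampled = region_flat.gather(2, idx).squeeze(-1)
            row.append(sampled)

        pooled.append(torch.stack(row, dim=-1))

    # Final shape: (batch, channels, h_out, w_out).
    return torch.stack(pooled, dim=-2)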
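
The sampling rule is easier to see on a single pooling region. The following standard-library sketch (with the hypothetical helper `stochastic_pool_region`) draws one activation from a flattened 2x2 region with probability proportional to its value.

```python
import random

def stochastic_pool_region(activations, rng):
    """Pick one activation from a pooling region with probability
    proportional to its (non-negative) value."""
    weights = [max(a, 0.0) for a in activations]
    if sum(weights) == 0.0:
        return 0.0                    # all-zero region: nothing to sample
    return rng.choices(activations, weights=weights, k=1)[0]

rng = random.Random(0)
region = [0.0, 1.0, 3.0, 0.0]        # a flattened 2x2 region after ReLU
draws = [stochastic_pool_region(region, rng) for _ in range(10_000)]
freq_of_3 = draws.count(3.0) / len(draws)
print(freq_of_3)                     # near 3 / (1 + 3)
```

Zero activations are never selected, and the largest activation is chosen most often but not always, which is what distinguishes this from max pooling.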

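3.3 Stochasticity in batch normalization

Batch Normalization behaves differently during training and testing:

Phase      Mean               Variance
Training   batch statistics   batch statistics
Testing    moving average     moving average

Stochasticity during training

Because the batch statistics depend on which examples fall into each mini-batch, training-time normalization is itself noisy. The layer below adds explicit input noise on top of that and averages the result:

import torch
import torch.nn as nn

class BayesianBatchNorm2d(nn.Module):
    """"Bayesian" batch norm: averages BN outputs over noisy copies of the input."""

    def __init__(self, num_features, momentum=0.1, n_samples=10):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, momentum=momentum)
        self.n_samples = n_samples
        # nn.Module already tracks self.training; toggle it with train()/eval().

    def forward(self, x):
        if self.training:
            outputs = []
            for _ in range(self.n_samples):
                # Perturb the input with Gaussian noise (std 0.1), then normalize.
                # Note: every call also updates the BN running statistics.
                noisy_bn = self.bn(x + torch.randn_like(x) * 0.1)
                outputs.append(noisy_bn)
            return torch.stack(outputs).mean(dim=0)
        else:
            return self.bn(x)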

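3.4 Noise injection

Adding noise to the inputs of a neural network is an effective form of regularization:

Gaussian noise injection

$$\tilde{x} = x + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I)$$

For small $\sigma$ and a squared-error loss, this is equivalent to adding a Tikhonov regularization term to the loss function.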
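
For a one-dimensional linear model the equivalence is exact and easy to check: the expected squared error under input noise equals the clean squared error plus the penalty $\sigma^2 w^2$. The sketch below verifies this by Monte Carlo; the function names and the numbers are illustrative assumptions.

```python
import random

def noisy_loss(w, x, y, sigma, n_trials=200_000, seed=0):
    """Monte Carlo estimate of E[(w * (x + eps) - y)^2], eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        eps = rng.gauss(0.0, sigma)
        total += (w * (x + eps) - y) ** 2
    return total / n_trials

def tikhonov_loss(w, x, y, sigma):
    """Clean squared error plus the penalty sigma^2 * w^2."""
    return (w * x - y) ** 2 + sigma ** 2 * w ** 2

w, x, y, sigma = 1.5, 2.0, 2.5, 0.3
mc = noisy_loss(w, x, y, sigma)
exact = tikhonov_loss(w, x, y, sigma)
print(mc, exact)
```

For nonlinear networks the same statement holds only approximately, with the penalty acting on the input gradient of the network output instead of on $w$.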


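4 Bayesian Neural Networks

4.1 Levels of Bayesian treatment

The weights of a neural network can be given a Bayesian treatment at different levels:

Level            Method                        Uncertainty
Last layer       Bayesian linear regression    output only
Selected layers  MC Dropout, variational       local
All layers       variational inference, MCMC   full model

See bayesian-last-layer-deep-learning and bayesian-neural-networks.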

4.2 Training BNNs with variational inference

Mean-field variational inference

import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Bayesian linear layer (mean-field variational posterior)."""

    def __init__(self, in_features, out_features, prior_std=1.0):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.prior_std = prior_std

        # Variational parameters: a mean and a log standard deviation per weight.
        self.weight_mean = nn.Parameter(torch.randn(out_features, in_features))
        self.weight_log_std = nn.Parameter(torch.zeros(out_features, in_features))

        self.bias_mean = nn.Parameter(torch.zeros(out_features))
        self.bias_log_std = nn.Parameter(torch.zeros(out_features))

    def forward(self, x, sample=True):
        if sample:
            # Reparameterized sample: w = mean + std * eps, eps ~ N(0, I).
            weight = self.weight_mean + torch.randn_like(self.weight_mean) * \
                    torch.exp(self.weight_log_std)
            bias = self.bias_mean + torch.randn_like(self.bias_mean) * \
                  torch.exp(self.bias_log_std)
        else:
            # Deterministic pass using the posterior means.
            weight = self.weight_mean
            bias = self.bias_mean

        return F.linear(x, weight, bias)

    def _kl_to_prior(self, mean, log_std):
        # Prior N(0, prior_std^2), built on the same device/dtype as the parameters.
        prior = torch.distributions.Normal(
            torch.zeros_like(mean), torch.full_like(mean, self.prior_std))
        post = torch.distributions.Normal(mean, torch.exp(log_std))
        return torch.distributions.kl.kl_divergence(post, prior).sum()

    def kl_loss(self):
        """KL divergence from the variational posterior to the prior."""
        kl_w = self._kl_to_prior(self.weight_mean, self.weight_log_std)
        kl_b = self._kl_to_prior(self.bias_mean, self.bias_log_std)

        return kl_w + kl_b
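
For a factorized Gaussian posterior and a zero-mean Gaussian prior, the KL term has a closed form per weight. A small standard-library sketch of that formula, with the hypothetical helper name `kl_normal`, is:

```python
import math

def kl_normal(mu, sigma, prior_sigma=1.0):
    """KL( N(mu, sigma^2) || N(0, prior_sigma^2) ) for one scalar weight."""
    return (math.log(prior_sigma / sigma)
            + (sigma ** 2 + mu ** 2) / (2.0 * prior_sigma ** 2)
            - 0.5)

kl_same = kl_normal(0.0, 1.0)     # posterior equals the prior
kl_shift = kl_normal(1.0, 1.0)    # mean moved by one prior standard deviation
kl_narrow = kl_normal(0.0, 0.1)   # a confident posterior pays a KL cost
print(kl_same, kl_shift, kl_narrow)
```

The layer-level KL loss is the sum of this quantity over all weights and biases, which is why the penalty grows both when means drift from zero and when the posterior becomes much narrower than the prior.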

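4.3 Training BNNs with MCMC

Stochastic gradient Langevin dynamics (SGLD)

import math
import torch

class SGLDOptimizer:
    """Stochastic gradient Langevin dynamics (SGLD) optimizer."""

    def __init__(self, model, lr=0.01, noise_std=0.1):
        self.model = model
        self.lr = lr
        # Multiplier on the sqrt(2 * lr) Langevin noise; 1.0 leaves it unscaled.
        self.noise_std = noise_std

    def step(self, loss):
        """One SGLD update; returns a copy of the parameters before the update."""
        # Snapshot the current parameters (detached, so no graph is kept).
        params_before = [p.detach().clone() for p in self.model.parameters()]

        # Compute gradients.
        loss.backward()

        # SGLD update.
        with torch.no_grad():
            for param in self.model.parameters():
                if param.grad is not None:
                    # Gradient descent step plus Langevin noise.
                    param.add_(param.grad, alpha=-self.lr)
                    param.add_(torch.randn_like(param),
                               alpha=math.sqrt(2 * self.lr) * self.noise_std)

        # Clear gradients for the next step.
        self.model.zero_grad()

        return params_before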

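4.4 Uncertainty propagation

An important application of Bayesian neural networks is uncertainty propagation: pushing a distribution over inputs through the network to obtain a distribution over outputs.

import torch

def uncertainty_propagation(bnn, input_dist, n_samples=100):
    """
    Propagate input uncertainty through a Bayesian neural network.
    """
    output_samples = []

    with torch.no_grad():
        for _ in range(n_samples):
            # Sample an input; the BNN is assumed to draw fresh weights
            # on every forward pass.
            x_sample = input_dist.sample()
            output = bnn(x_sample)
            output_samples.append(output)

    # Statistics of the output distribution.
    stacked = torch.stack(output_samples)
    output_mean = stacked.mean(dim=0)
    output_var = stacked.var(dim=0)

    return output_mean, output_var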

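5 Sampling in Generative Models

5.1 Reparameterization in VAEs

Variational autoencoders sample latent variables with the reparameterization trick:

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Variational autoencoder."""

    def __init__(self, encoder, decoder, latent_dim):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.latent_dim = latent_dim

    def reparameterize(self, mu, log_var):
        """Reparameterization trick: z = mu + std * eps, eps ~ N(0, I)."""
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        # Encode; the encoder is assumed to output 2 * latent_dim columns.
        z_params = self.encoder(x)
        mu, log_var = z_params[:, :self.latent_dim], \
                      z_params[:, self.latent_dim:]

        # Reparameterized sample.
        z = self.reparameterize(mu, log_var)

        # Decode.
        x_recon = self.decoder(z)

        return x_recon, mu, log_var

    def elbo_loss(self, x, x_recon, mu, log_var):
        """Negative ELBO (the quantity to minimize)."""
        # Reconstruction loss (sum of squared errors).
        recon_loss = F.mse_loss(x_recon, x, reduction='sum')

        # KL( N(mu, sigma^2) || N(0, I) ) in closed form.
        kl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

        return recon_loss + kl_loss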

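5.2 Invertible sampling in flow models

Normalizing flows allow exact log-density computation and sampling through invertible transformations:

import torch
import torch.nn as nn

class PlanarFlow(nn.Module):
    """Planar normalizing flow: f(z) = z + u * tanh(w^T z + b)."""

    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim, 1))
        self.b = nn.Parameter(torch.zeros(1))
        self.u = nn.Parameter(torch.randn(dim, 1))

    def forward(self, z):
        """Forward pass for z of shape (batch, dim)."""
        # Linear projection followed by tanh: (batch, 1).
        inner = torch.mm(z, self.w) + self.b
        h = torch.tanh(inner)
        # Broadcast u over the batch: (batch, 1) * (1, dim) -> (batch, dim).
        z_transformed = z + h * self.u.t()

        # log|det J| = log|1 + u^T psi(z)|, with psi(z) = (1 - tanh^2) * w.
        # Invertibility additionally requires w^T u >= -1 (not enforced here).
        psi = (1 - h.pow(2)) * self.w.t()
        log_det = torch.log(torch.abs(1 + torch.mm(psi, self.u)) + 1e-8)

        return z_transformed, log_det.squeeze(-1)

    def inverse(self, z):
        """Inverse transform (requires a numerical solve)."""
        # Solve with Newton iteration
        # ...
        raise NotImplementedError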
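
In one dimension the log-determinant reduces to the log of the absolute derivative, which makes the formula easy to check against a finite difference. The sketch below is a scalar illustration with made-up parameter values, not the batched layer above.

```python
import math

def planar_1d(z, u, w, b):
    """One-dimensional planar flow f(z) = z + u * tanh(w * z + b)."""
    h = math.tanh(w * z + b)
    f = z + u * h
    # df/dz = 1 + u * w * (1 - tanh^2); its log-abs is the log-determinant.
    log_det = math.log(abs(1.0 + u * w * (1.0 - h * h)))
    return f, log_det

def finite_difference_log_det(z, u, w, b, step=1e-6):
    """Central-difference estimate of log|df/dz|."""
    f_plus, _ = planar_1d(z + step, u, w, b)
    f_minus, _ = planar_1d(z - step, u, w, b)
    return math.log(abs((f_plus - f_minus) / (2.0 * step)))

z, u, w, b = 0.3, 0.8, 1.2, -0.1
_, analytic = planar_1d(z, u, w, b)
numeric = finite_difference_log_det(z, u, w, b)
print(analytic, numeric)
```

With this log-determinant, the density of the transformed variable follows from the change-of-variables rule: log q(f(z)) = log q(z) - log|df/dz|.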

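5.3 Multi-step sampling in diffusion models

Diffusion models generate samples by iterative denoising:

import torch
import torch.nn as nn

class DDPM(nn.Module):
    """Denoising diffusion probabilistic model."""

    def __init__(self, network, n_steps=1000, beta_start=0.0001, beta_end=0.02):
        super().__init__()
        self.network = network
        self.n_steps = n_steps

        # Linear noise schedule; buffers move with the module across devices.
        betas = torch.linspace(beta_start, beta_end, n_steps)
        self.register_buffer("betas", betas)
        self.register_buffer("alphas", 1 - betas)
        self.register_buffer("alpha_bar", torch.cumprod(1 - betas, dim=0))

    def forward_process(self, x0, t):
        """Forward noising process q(x_t | x_0), available in closed form.

        t is an int or a LongTensor of shape (batch,).
        """
        noise = torch.randn_like(x0)
        # Reshape to (batch, 1, ..., 1) so the coefficients broadcast over x0.
        shape = (-1,) + (1,) * (x0.dim() - 1)
        sqrt_alpha_bar = self.alpha_bar[t].sqrt().view(shape)
        sqrt_one_minus_alpha_bar = (1 - self.alpha_bar[t]).sqrt().view(shape)

        x_t = sqrt_alpha_bar * x0 + sqrt_one_minus_alpha_bar * noise

        return x_t, noise

    @torch.no_grad()
    def reverse_process(self, xT, temperature=1.0):
        """Reverse denoising process (ancestral sampling)."""
        x_t = xT

        for t in reversed(range(self.n_steps)):
            # Predict the noise; the network is assumed to accept
            # an integer step index.
            noise_pred = self.network(x_t, t)

            # Fresh noise on every step except the last one.
            if t > 0:
                z = torch.randn_like(x_t) * temperature
            else:
                z = torch.zeros_like(x_t)

            # Update with sigma_t^2 = beta_t.
            alpha = self.alphas[t]
            alpha_bar = self.alpha_bar[t]
            beta = self.betas[t]

            x_t = (1 / alpha.sqrt()) * (
                x_t - beta / (1 - alpha_bar).sqrt() * noise_pred
            ) + beta.sqrt() * z

        return x_t

See diffusion-model and score-matching-sde.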
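
The closed-form forward process depends only on the cumulative product of (1 - beta). As a rough illustration in plain Python, using the same default schedule values as the class above and hypothetical helper names, one can inspect how much signal is left at the first and last step:

```python
import math

def linear_alpha_bar(n_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for a linear beta schedule."""
    alpha_bar, running = [], 1.0
    for t in range(n_steps):
        beta = beta_start + (beta_end - beta_start) * t / (n_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def q_sample_stats(x0, alpha_bar_t):
    """Mean and std of q(x_t | x_0) = N(sqrt(ab) * x0, 1 - ab)."""
    return math.sqrt(alpha_bar_t) * x0, math.sqrt(1.0 - alpha_bar_t)

alpha_bar = linear_alpha_bar()
mean_early, std_early = q_sample_stats(1.0, alpha_bar[0])
mean_late, std_late = q_sample_stats(1.0, alpha_bar[-1])
print(mean_early, std_early, mean_late, std_late)
```

At the first step the sample is almost the clean input, while at the last step the mean has shrunk close to zero and the standard deviation is close to one, so starting the reverse process from standard Gaussian noise is a reasonable approximation.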


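6 Stochasticity in Neural Network Optimization

6.1 SGD as stochastic optimization

SGD works with stochastic gradients:

$$\theta_{t+1} = \theta_t - \eta\, g_t, \qquad g_t = \frac{1}{|B_t|} \sum_{i \in B_t} \nabla_\theta \ell_i(\theta_t)$$

where $g_t$ is an unbiased estimate of the true gradient $\nabla_\theta L(\theta_t)$.

Statistical properties of gradient noise

$$g_t = \nabla_\theta L(\theta_t) + \xi_t, \qquad \mathbb{E}[\xi_t] = 0$$

It is usually assumed that $\xi_t \sim \mathcal{N}(0, \Sigma(\theta_t))$.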
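
Unbiasedness can be checked directly on a toy problem. The sketch below uses a scalar least-squares model and a made-up six-point dataset; averaging the minibatch gradients over a partition of the data recovers the full-batch gradient, and the per-batch deviations play the role of the noise term.

```python
def grad_mse(theta, batch):
    """Gradient of the mean of 0.5 * (theta * x - y)^2 over a batch of (x, y)."""
    return sum((theta * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 3.5), (3.0, 6.5), (4.0, 8.0), (5.0, 9.5), (6.0, 12.5)]
theta = 1.0

full_grad = grad_mse(theta, data)

# Size-2 minibatches from a fixed partition; picking one uniformly at random
# gives an estimator whose expectation is the full gradient.
batches = [data[i:i + 2] for i in range(0, len(data), 2)]
batch_grads = [grad_mse(theta, b) for b in batches]
expected_batch_grad = sum(batch_grads) / len(batch_grads)
noise = [g - full_grad for g in batch_grads]   # zero-mean gradient noise
print(full_grad, batch_grads, noise)
```

The individual minibatch gradients differ substantially from each other, which is the noise SGD injects at every step, yet their average matches the full gradient.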

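6.2 Langevin dynamics for optimization

Langevin dynamics introduces thermal noise into optimization:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t) + \sqrt{2 \eta T}\, \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I)$$

Physical interpretation

  • First term: gradient descent
  • Second term: thermally driven noise
  • $T$: temperature parameter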
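
The role of the temperature shows up clearly on a quadratic loss, where the iterates settle into a Gaussian whose variance is roughly the temperature. This is a minimal sketch of the discretized update under that assumption; the step size and function names are illustrative choices.

```python
import math
import random

def langevin_samples(grad, theta0, lr, temperature, n_steps, seed=0):
    """Unadjusted Langevin iteration:
    theta <- theta - lr * grad(theta) + sqrt(2 * lr * T) * N(0, 1)."""
    rng = random.Random(seed)
    theta, trace = theta0, []
    noise_scale = math.sqrt(2.0 * lr * temperature)
    for _ in range(n_steps):
        theta = theta - lr * grad(theta) + noise_scale * rng.gauss(0.0, 1.0)
        trace.append(theta)
    return trace

def variance(xs):
    """Population variance of a list of floats."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def quadratic_grad(theta):
    return theta                                   # L(theta) = theta^2 / 2

cold = langevin_samples(quadratic_grad, 0.0, lr=0.05, temperature=0.1, n_steps=100_000)
hot = langevin_samples(quadratic_grad, 0.0, lr=0.05, temperature=1.0, n_steps=100_000)
var_cold = variance(cold[1000:])                   # discard burn-in
var_hot = variance(hot[1000:])
print(var_cold, var_hot)
```

Lowering the temperature concentrates the iterates around the minimum, and in the limit of zero temperature the update reduces to plain gradient descent. The finite step size adds a small bias, so the variances are close to, not exactly equal to, the temperature.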

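6.3 Random search

Evolution strategies

import numpy as np

def evolution_strategy(objective, n_samples=100, sigma=0.1, lr=0.01,
                       n_iterations=1000, theta_init=None, dim=None):
    """
    Evolution strategy optimization; maximizes objective(theta).
    `dim` (added here) sets the size of the zero start when theta_init is None.
    """
    if theta_init is not None:
        # Copy so the caller's array is not modified in place.
        theta = np.array(theta_init, dtype=float)
    elif dim is not None:
        theta = np.zeros(dim)
    else:
        raise ValueError("provide theta_init or dim")

    for _ in range(n_iterations):
        # Sample Gaussian perturbations: (n_samples, dim).
        epsilons = np.random.randn(n_samples, len(theta))

        # Evaluate the objective at each perturbed point.
        rewards = np.array([objective(theta + sigma * eps)
                            for eps in epsilons])

        # Standardize the rewards.
        rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

        # Gradient-ascent step on the smoothed objective.
        theta += lr / (n_samples * sigma) * np.sum(
            rewards[:, None] * epsilons, axis=0)

    return theta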

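6.4 Dropout as regularization

The implicit regularization effect of dropout:

To first order, and for generalized linear models, this is equivalent to $L_2$ regularization with an adaptive, per-weight decay.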
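
For linear regression with squared loss the statement can be made exact. As a worked sketch, assume inverted dropout on the inputs with drop rate $r$ (the same convention as `dropout_train` in Section 2.1), so each input is multiplied by $m_j / (1 - r)$ with $m_j \sim \mathrm{Bernoulli}(1 - r)$ independently. The scaled mask has mean 1 and variance $r / (1 - r)$, and the bias-variance split of the squared error gives:

```latex
\mathbb{E}_{m}\!\left[\Big(y - \sum_j \frac{m_j}{1-r}\, x_j w_j\Big)^{2}\right]
  = \big(y - x^\top w\big)^{2}
  + \frac{r}{1-r} \sum_j x_j^{2}\, w_j^{2}
```

The second term is an $L_2$ penalty whose strength for weight $w_j$ is set by $x_j^2$, which is the sense in which the weight decay is adaptive: weights attached to large-magnitude features are penalized more.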


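7 Practical Tips

7.1 MC Dropout usage guide

When to use MC Dropout

  • Uncertainty estimates are needed
  • No validation data is available
  • The model was already trained with dropout

Parameter selection

  • : general tasks
  • : real-time applications
  • : high-accuracy requirements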

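7.2 Choosing the number of samples

Task                  Recommended samples   Trade-off
Predictive mean       10-30                 speed
Predictive variance   50-100                speed/accuracy
Uncertainty bounds    100-500               accuracy
Active learning       30-50                 speed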
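
The ranges above are rules of thumb. One way to reason about them is the standard error of a Monte Carlo mean, which shrinks as one over the square root of the sample count. The sketch below assumes independent forward passes; the helper names are hypothetical.

```python
import math

def mc_standard_error(sample_std, n_samples):
    """Standard error of a Monte Carlo mean over n independent samples."""
    return sample_std / math.sqrt(n_samples)

def samples_for_error(sample_std, target_error):
    """Smallest n whose standard error is at most target_error."""
    return math.ceil((sample_std / target_error) ** 2)

se_10 = mc_standard_error(1.0, 10)
se_100 = mc_standard_error(1.0, 100)
n_needed = samples_for_error(2.0, 0.25)
print(se_10, se_100, n_needed)
```

Going from 10 to 100 samples reduces the error of the mean only by a factor of about 3.2, which is why a few dozen passes are often enough for the mean while tail quantities such as uncertainty bounds need many more.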

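7.3 Variance reduction techniques

Applying variance reduction in deep learning:

Rao-Blackwellization

Uses conditional expectations to reduce variance: replacing an estimator $f(X, Y)$ with $\mathbb{E}[f(X, Y) \mid X]$ leaves the mean unchanged and never increases the variance.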
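
As a toy illustration, consider estimating $\mathbb{E}[m z^2]$ where $m$ is a Bernoulli dropout mask and $z$ is standard normal, independent of $m$. Since $\mathbb{E}[m \mid z]$ is just the keep probability, the mask can be integrated out analytically. The setup and names below are assumptions made for the example.

```python
import random
import statistics

def naive_and_rb_estimates(keep=0.5, n_samples=20, seed=0):
    """Two estimators of E[m * z^2], m ~ Bernoulli(keep), z ~ N(0, 1)."""
    rng = random.Random(seed)
    naive, rao_blackwell = [], []
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)
        m = 1.0 if rng.random() < keep else 0.0
        naive.append(m * z * z)              # samples both m and z
        rao_blackwell.append(keep * z * z)   # integrates m out: E[m | z] = keep
    return statistics.fmean(naive), statistics.fmean(rao_blackwell)

# Repeat both estimators with different seeds and compare their spread.
pairs = [naive_and_rb_estimates(seed=s) for s in range(2000)]
var_naive = statistics.pvariance([p[0] for p in pairs])
var_rb = statistics.pvariance([p[1] for p in pairs])
print(var_naive, var_rb)
```

Both estimators target the same value, but the Rao-Blackwellized one has visibly lower variance across repetitions because one source of randomness has been removed exactly.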


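8 Connections to Related Topics

8.1 Generative models

Sampling is at the core of generative models:

8.2 Bayesian deep learning

A full treatment of Bayesian deep learning methods:

8.3 Sampling theory

Foundational sampling theory: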


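9 Summary

This chapter surveyed the uses of sampling in deep learning:

  1. Dropout as a Bayesian approximation: variational interpretation, MC Dropout, uncertainty estimation
  2. Stochastic neural networks: DropConnect, stochastic pooling, noise injection
  3. Bayesian neural networks: variational inference, MCMC training, uncertainty propagation
  4. Sampling in generative models: reparameterization, invertible sampling in flows, iterative denoising in diffusion models
  5. Stochasticity in optimization: SGD, Langevin dynamics, evolution strategies

Sampling runs through every stage of deep learning, from regularization and uncertainty estimation to generative modeling and optimization.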


References

  1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

  2. Gal, Y., & Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. International Conference on Machine Learning, 1050-1059.