Bayesian Deep Learning and Uncertainty Quantification

Bayesian Deep Learning (BDL) is a framework that combines Bayesian probabilistic inference with deep neural networks, aiming to model and quantify the uncertainty in a network's predictions.[^1] Unlike the point estimates of conventional deep learning, BDL learns a posterior distribution over the network weights, which makes it possible to distinguish different types of uncertainty and to make informed predictions.

1. Defining Bayesian Neural Networks

1.1 From Deterministic to Bayesian Networks

Training objective of a conventional neural network

$$w^* = \arg\min_w \sum_{i=1}^{N} \mathcal{L}\big(f_w(x_i),\, y_i\big)$$

This yields a point estimate of the weights, with no information about model uncertainty.

The Bayesian neural network framework

Instead of a single weight vector, a BNN specifies three ingredients:

  1. A prior distribution $p(w)$ over the weights
  2. A likelihood $p(\mathcal{D} \mid w)$ of the observed data given the weights
  3. The posterior distribution $p(w \mid \mathcal{D}) = \dfrac{p(\mathcal{D} \mid w)\, p(w)}{p(\mathcal{D})}$

1.2 The Probabilistic Model of a BNN

Hierarchical representation

$$w \sim p(w), \qquad y_i \mid x_i, w \sim p(y \mid x_i, w)$$

Predictive distribution

$$p(y \mid x, \mathcal{D}) = \int p(y \mid x, w)\, p(w \mid \mathcal{D})\, dw$$

This averages over all plausible weight configurations and therefore carries uncertainty.

1.3 Structural Comparison

| Aspect | Deterministic network | Bayesian network |
| --- | --- | --- |
| Weights | Point estimate $w^*$ | Distribution $p(w \mid \mathcal{D})$ |
| Prediction | Single output $\hat{y} = f_{w^*}(x)$ | Predictive distribution $p(y \mid x, \mathcal{D})$ |
| Overfitting | Prone to overfitting | Naturally regularized |
| Uncertainty | Cannot be quantified | Built in |
| Compute cost | One pass per forward | Multiple sampled passes per forward |

1.4 Example Network Structure

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianNetwork(nn.Module):
    """
    Basic structure of a Bayesian neural network.

    Unlike a deterministic network, each weight layer learns the
    parameters of a distribution rather than a single value.
    """
    
    def __init__(self, layer_dims, prior_std=1.0):
        super().__init__()
        self.layer_dims = layer_dims
        self.prior_std = prior_std
        
        # Create the Bayesian layers
        self.layers = nn.ModuleList()
        for i in range(len(layer_dims) - 1):
            self.layers.append(
                BayesianLinear(
                    layer_dims[i], 
                    layer_dims[i+1],
                    prior_std=prior_std
                )
            )
    
    def forward(self, x, sample=True):
        """
        前向传播
        
        Args:
            x: 输入
            sample: 是否从后验采样权重
        """
        for layer in self.layers:
            if sample:
                x = layer.sample_forward(x)
            else:
                # Use the mean weights
                x = layer.mean_forward(x)
            if layer != self.layers[-1]:
                x = F.relu(x)
        return x
    
    def predict(self, x, n_samples=50):
        """
        贝叶斯预测:多次采样获取预测分布
        """
        predictions = []
        
        with torch.no_grad():
            for _ in range(n_samples):
                pred = self(x, sample=True)
                predictions.append(pred)
        
        predictions = torch.stack(predictions)
        
        # Summary statistics
        mean = predictions.mean(dim=0)
        std = predictions.std(dim=0)
        var = predictions.var(dim=0)
        
        return mean, std, var, predictions
 
 
class BayesianLinear(nn.Module):
    """贝叶斯线性层"""
    
    def __init__(self, in_features, out_features, prior_std=1.0):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.prior_std = prior_std
        
        # Variational parameters: mean and log-variance
        self.weight_mu = nn.Parameter(
            torch.randn(out_features, in_features) * 0.1
        )
        self.weight_log_var = nn.Parameter(
            torch.zeros(out_features, in_features) - 6  # log σ² = -6, so σ ≈ 0.05
        )
        
        self.bias_mu = nn.Parameter(torch.zeros(out_features))
        self.bias_log_var = nn.Parameter(
            torch.zeros(out_features) - 6
        )
    
    def sample_weights(self):
        """从变分后验采样"""
        std = (0.5 * self.weight_log_var).exp()
        eps = torch.randn_like(self.weight_mu)
        weight = self.weight_mu + eps * std
        
        std = (0.5 * self.bias_log_var).exp()
        eps = torch.randn_like(self.bias_mu)
        bias = self.bias_mu + eps * std
        
        return weight, bias
    
    def sample_forward(self, x):
        """使用采样权重的正向传播"""
        weight, bias = self.sample_weights()
        return F.linear(x, weight, bias)
    
    def mean_forward(self, x):
        """使用均值权重的正向传播"""
        return F.linear(x, self.weight_mu, self.bias_mu)
    
    def kl_divergence(self):
        """
        计算与先验的 KL 散度
        
        D_KL(N(μ,σ²) || N(0,σ_p²))
        """
        prior_var = self.prior_std ** 2
        
        def kl_gaussian(mu, log_var):
            var = torch.exp(log_var)
            return 0.5 * (
                torch.log(torch.tensor(prior_var, device=mu.device)) - log_var
                + (var + mu ** 2) / prior_var
                - 1.0
            ).sum()
        
        return kl_gaussian(self.weight_mu, self.weight_log_var) + \
               kl_gaussian(self.bias_mu, self.bias_log_var)
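A quick usage sketch of the two classes above (the toy layer sizes and the random input are assumptions for illustration):

# Hypothetical dimensions: 2 inputs, 32 hidden units, 1 output
net = BayesianNetwork(layer_dims=[2, 32, 1])
x = torch.randn(5, 2)
mean, std, var, samples = net.predict(x, n_samples=100)
print(mean.shape, std.shape)  # torch.Size([5, 1]) torch.Size([5, 1])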

2. The Challenge of Inferring the Parameter Posterior

2.1 Computational Complexity

Computing the posterior

$$p(w \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid w)\, p(w)}{\int p(\mathcal{D} \mid w')\, p(w')\, dw'}$$

Sources of difficulty

| Difficulty | Explanation |
| --- | --- |
| High-dimensional parameter space | Modern networks have millions to billions of parameters |
| Nonlinear mapping | The likelihood is non-convex in the weights |
| Normalizing constant | $p(\mathcal{D}) = \int p(\mathcal{D} \mid w)\, p(w)\, dw$ has no analytic form |
| Non-Gaussian posterior | The posterior has a complex, multi-modal shape |

2.2 Why Exact Inference Is Infeasible

Theoretical limitation

For a network with millions of parameters, computing the posterior exactly is computationally intractable, for two reasons (a back-of-the-envelope illustration follows this list):

  • Number of parameters: the normalizing integral runs over a space with millions of dimensions
  • Integration complexity (the curse of dimensionality)
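The illustration below makes the curse of dimensionality concrete (the 1M-parameter count and the 2-point grid are arbitrary assumptions): even the crudest quadrature over the weight space is astronomically expensive.

import math

n_params = 1_000_000
# Two grid points per weight already means 2**n_params likelihood
# evaluations for a naive numerical integration of p(D)
print(f"log10(evaluations) ≈ {n_params * math.log10(2):.0f}")  # ≈ 301030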

2.3 A Spectrum of Approximate Inference Methods

                      Exact inference
                           ↑
                           │
             ┌─────────────┼─────────────┐
             │             │             │
      Variational     Monte Carlo    Point estimate
       inference        sampling    + uncertainty
             │             │             │
             ↓             ↓             ↓
       Mean field       HMC/NUTS      MC Dropout
    Normalizing flows     SGLD        Ensembles
     Local reparam.   Particle filters

2.4 Trading Off Approximation Quality Against Compute

| Method | Compute cost | Approximation quality | Implementation complexity |
| --- | --- | --- | --- |
| HMC | Very high | Highest | High |
| SGLD | High | High | Medium |
| VI (mean field) | Medium | Medium | Medium |
| VI (normalizing flows) | Medium-high | Medium-high | High |
| MC Dropout | Low | Medium | Low |
| Ensembles | Tunable | Medium-high | Low |

3. Variational Inference: The Mean Field Approximation

3.1 The Mean Field Assumption

Factorization assumption

$$q(w) = \prod_i q_i(w_i) = \prod_i \mathcal{N}(w_i;\, \mu_i, \sigma_i^2)$$

Why this assumption is reasonable

  • Computational feasibility: only one-dimensional Gaussian parameters need to be optimized per weight
  • Regularization effect: it encourages the weights to be independent
  • Theoretical grounding: within the factorized family, mean field yields the approximation with minimal KL divergence to the true posterior

3.2 The Variational Objective

ELBO

$$\mathrm{ELBO}(q) = \mathbb{E}_{q(w)}\big[\log p(\mathcal{D} \mid w)\big] - D_{\mathrm{KL}}\big(q(w)\,\|\,p(w)\big)$$

Expanded form

$$\log p(\mathcal{D}) = \mathrm{ELBO}(q) + D_{\mathrm{KL}}\big(q(w)\,\|\,p(w \mid \mathcal{D})\big) \;\ge\; \mathrm{ELBO}(q)$$

so maximizing the ELBO both tightens the bound on the evidence and pulls $q$ toward the true posterior.

3.3 Computing the KL Divergence

Gaussian-to-Gaussian KL divergence, in closed form:

$$D_{\mathrm{KL}}\big(\mathcal{N}(\mu,\sigma^2)\,\|\,\mathcal{N}(0,\sigma_p^2)\big) = \frac{1}{2}\left[\log\frac{\sigma_p^2}{\sigma^2} + \frac{\sigma^2 + \mu^2}{\sigma_p^2} - 1\right]$$

def kl_divergence_gaussian_to_prior(mu, log_var, prior_std=1.0):
    """
    KL divergence between a Gaussian variational distribution and a
    Gaussian prior:

    D_KL(N(μ, σ²) || N(0, σ_p²))

    Args:
        mu: mean
        log_var: log-variance log(σ²)
        prior_std: prior standard deviation

    Returns:
        KL divergence (scalar)
    """
    prior_var = prior_std ** 2
    var = torch.exp(log_var)

    kl = 0.5 * (
        torch.log(torch.tensor(prior_var, device=mu.device)) - log_var
        + (var + mu ** 2) / prior_var
        - 1.0
    )

    return kl.sum()

3.4 The Optimization Algorithm

Stochastic Gradient Variational Bayes (SGVB)

class MeanFieldVI:
    """
    Mean-field variational inference.

    Uses the reparameterization trick for gradient-based optimization.
    """
    
    def __init__(self, model, prior_std=1.0, kl_weight=1.0):
        self.model = model
        self.prior_std = prior_std
        self.kl_weight = kl_weight
    
    def elbo(self, x, y, n_samples=1):
        """
        Compute the negative-ELBO loss.

        ELBO = E_q[log p(y|x,w)] - KL(q(w) || p(w))
        """
        batch_size = x.size(0)
        
        total_loss = 0.0
        total_recon = 0.0
        total_kl = 0.0
        
        for _ in range(n_samples):
            # KL term between the variational posterior and the prior
            kl = self.sample_and_compute_kl()

            # Forward pass with freshly sampled weights
            output = self.model(x, sample=True)

            # Reconstruction (negative log-likelihood) term
            recon = F.cross_entropy(output, y, reduction='sum')
            
            total_loss += recon + self.kl_weight * kl
            total_recon += recon
            total_kl += kl
        
        # Average over samples
        loss = total_loss / n_samples
        recon_loss = total_recon / n_samples
        kl_loss = total_kl / n_samples
        
        return loss, recon_loss, kl_loss
    
    def sample_and_compute_kl(self):
        """Accumulate the KL term over all Bayesian layers."""
        kl = 0.0
        for module in self.model.modules():
            if isinstance(module, BayesianLinear):
                kl = kl + module.kl_divergence()
        return kl
    
    def fit(self, dataloader, n_epochs=100, lr=1e-3):
        """Fit the model by minimizing the negative ELBO."""
        optimizer = torch.optim.Adam(self.model.parameters(), lr=lr)
        
        for epoch in range(n_epochs):
            for x, y in dataloader:
                optimizer.zero_grad()
                
                loss, recon, kl = self.elbo(x, y, n_samples=1)
                
                loss.backward()
                optimizer.step()
            
            if (epoch + 1) % 10 == 0:
                print(f"Epoch {epoch+1}: Loss={loss.item():.4f}, "
                      f"Recon={recon.item():.4f}, KL={kl.item():.4f}")

4. Bayes by Backprop in Detail

4.1 The Core Algorithm

Bayes by Backprop learns a distribution over the weights via variational inference, while using ordinary backpropagation for gradient-based optimization.[^2]

Algorithm steps

  1. Initialize the variational parameters $\theta = (\mu, \log\sigma^2)$
  2. Repeat (a runnable sketch of a single step follows this list):
    • Sample $\varepsilon \sim \mathcal{N}(0, I)$ and set $w = \mu + \sigma \odot \varepsilon$
    • Compute the gradient of the variational free energy $\mathcal{F}$ with respect to $\theta$
    • Update $\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{F}$
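A self-contained sketch of one such update step, using this document's log-variance parameterization (the 10-dimensional weight vector, the toy quadratic likelihood, and the unit Gaussian prior are assumptions):

import torch

mu = torch.zeros(10, requires_grad=True)
log_var = torch.full((10,), -6.0, requires_grad=True)

eps = torch.randn(10)                        # ε ~ N(0, I)
w = mu + torch.exp(0.5 * log_var) * eps      # reparameterized sample w = μ + σ ⊙ ε
nll = ((w - 1.0) ** 2).sum()                 # stand-in negative log-likelihood
kl = 0.5 * (log_var.exp() + mu ** 2 - log_var - 1.0).sum()  # KL to N(0, I)
loss = nll + kl                              # free-energy estimate
loss.backward()                              # gradients reach μ and log σ²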

4.2 Deriving the Loss Function

Variational free energy

$$\mathcal{F}(\mathcal{D}, \theta) = D_{\mathrm{KL}}\big(q(w \mid \theta)\,\|\,p(w)\big) - \mathbb{E}_{q(w \mid \theta)}\big[\log p(\mathcal{D} \mid w)\big]$$

Monte Carlo estimate used in practice

$$\mathcal{F}(\mathcal{D}, \theta) \approx \frac{1}{S}\sum_{s=1}^{S}\Big[\log q(w^{(s)} \mid \theta) - \log p(w^{(s)}) - \log p(\mathcal{D} \mid w^{(s)})\Big]$$

where $w^{(s)} = \mu + \sigma \odot \varepsilon^{(s)}$ is the $s$-th reparameterized sample.
4.3 Full Implementation

class BayesByBackpropNet(nn.Module):
    """
    Bayes by Backprop implementation.

    Core ideas:
    1. Approximate every weight with a Gaussian
    2. Sample via the reparameterization trick
    3. Minimize the variational free energy
    """
    
    def __init__(self, input_dim, hidden_dim, output_dim, prior_std=1.0):
        super().__init__()
        
        # Network structure
        self.fc1 = BayesianLinear(input_dim, hidden_dim, prior_std)
        self.fc2 = BayesianLinear(hidden_dim, hidden_dim, prior_std)
        self.fc3 = BayesianLinear(hidden_dim, output_dim, prior_std)
        
        self.layers = [self.fc1, self.fc2, self.fc3]
    
    def forward(self, x, sample=True):
        """前向传播"""
        if sample:
            h = F.relu(self.fc1.sample_forward(x))
            h = F.relu(self.fc2.sample_forward(h))
            logits = self.fc3.sample_forward(h)
        else:
            h = F.relu(self.fc1.mean_forward(x))
            h = F.relu(self.fc2.mean_forward(h))
            logits = self.fc3.mean_forward(h)
        
        return logits
    
    def kl_divergence(self):
        """总 KL 散度"""
        return sum(layer.kl_divergence() for layer in self.layers)
    
    def predict(self, x, n_samples=50):
        """
        贝叶斯预测
        
        Returns:
            mean: 预测均值
            variance: 预测方差
            predictions: 所有采样预测
        """
        predictions = []
        
        with torch.no_grad():
            for _ in range(n_samples):
                logits = self.forward(x, sample=True)
                predictions.append(logits)
        
        predictions = torch.stack(predictions)
        
        # Mean and variance
        mean = predictions.mean(dim=0)
        variance = predictions.var(dim=0)
        
        return mean, variance, predictions
 
 
class BayesByBackpropTrainer:
    """
    Bayes by Backprop 训练器
    """
    
    def __init__(
        self, 
        model: BayesByBackpropNet,
        lr: float = 1e-3,
        kl_weight: float = 1.0,
        n_samples: int = 1
    ):
        self.model = model
        self.kl_weight = kl_weight
        self.n_samples = n_samples
        
        self.optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    
    def train_step(self, x, y):
        """Single training step."""
        self.optimizer.zero_grad()

        # Estimate the expectation with several weight samples
        total_loss = 0.0
        total_recon = 0.0
        total_kl = 0.0

        for _ in range(self.n_samples):
            # Reconstruction (negative log-likelihood) term
            logits = self.model(x, sample=True)
            recon = F.cross_entropy(logits, y, reduction='mean')

            # KL term, scaled by batch size to match the per-example NLL
            kl = self.model.kl_divergence() / len(x)

            loss = recon + self.kl_weight * kl

            # Average gradients over samples (not their sum)
            (loss / self.n_samples).backward()

            total_loss += loss.item()
            total_recon += recon.item()
            total_kl += kl.item()

        # Averaged metrics for logging
        avg_loss = total_loss / self.n_samples
        avg_recon = total_recon / self.n_samples
        avg_kl = total_kl / self.n_samples

        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)

        self.optimizer.step()

        return avg_loss, avg_recon, avg_kl
    
    def train(self, dataloader, n_epochs):
        """完整训练"""
        for epoch in range(n_epochs):
            epoch_loss = 0
            epoch_recon = 0
            epoch_kl = 0
            n_batches = 0
            
            for x, y in dataloader:
                loss, recon, kl = self.train_step(x, y)
                
                epoch_loss += loss
                epoch_recon += recon
                epoch_kl += kl
                n_batches += 1
            
            if (epoch + 1) % 5 == 0:
                print(f"Epoch {epoch+1}: Loss={epoch_loss/n_batches:.4f}, "
                      f"Recon={epoch_recon/n_batches:.4f}, "
                      f"KL={epoch_kl/n_batches:.4f}")

4.4 The Local Reparameterization Trick

Optimization: sample directly in the output space instead of the weight space, which reduces the variance of the gradient estimates. For a linear layer $y = Wx + b$ with independent $W_{ij} \sim \mathcal{N}(\mu_{ij}, \sigma_{ij}^2)$, each output is itself Gaussian:

$$y_j \sim \mathcal{N}\Big(\textstyle\sum_i \mu_{ji} x_i + \mu_{b_j},\;\; \sum_i \sigma_{ji}^2 x_i^2 + \sigma_{b_j}^2\Big)$$

class LocalReparamBayesianLinear(nn.Module):
    """
    Bayesian linear layer with local reparameterization.

    Key optimization:
    - Do not sample in parameter space
    - Sample directly in output space
    - Reduces the variance of the gradient estimator
    """
    
    def __init__(self, in_features, out_features, prior_std=1.0):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.prior_std = prior_std
        
        # Posterior means
        self.weight_mu = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
        self.bias_mu = nn.Parameter(torch.zeros(out_features))
        
        # Posterior log-variances
        self.weight_log_var = nn.Parameter(torch.zeros(out_features, in_features) - 6)
        self.bias_log_var = nn.Parameter(torch.zeros(out_features) - 6)
    
    def forward(self, x):
        """
        Forward pass (sampling in output space).

        E[y]   = x @ W_mu.T + b_mu
        Var[y] = (x² @ Var[W].T) + Var[b]
        """
        # Output mean
        mean = F.linear(x, self.weight_mu, self.bias_mu)

        # Output variance via local reparameterization:
        # Var[Wx + b] = Var[W] x² + Var[b] (weights assumed independent)
        weight_var = torch.exp(self.weight_log_var)
        bias_var = torch.exp(self.bias_log_var)

        # (x² @ Var[W].T) + Var[b]
        output_var = F.linear(x ** 2, weight_var, bias_var)

        # Sample in output space
        output_std = torch.sqrt(output_var + 1e-8)
        eps = torch.randn_like(mean)
        output = mean + eps * output_std

        return output
    
    def kl_divergence(self):
        """KL divergence to the Gaussian prior."""
        prior_var = self.prior_std ** 2

        def kl_gaussian(mu, log_var):
            var = torch.exp(log_var)
            return 0.5 * (
                torch.log(torch.tensor(prior_var, device=mu.device)) - log_var
                + (var + mu ** 2) / prior_var
                - 1.0
            ).sum()

        return kl_gaussian(self.weight_mu, self.weight_log_var) + \
               kl_gaussian(self.bias_mu, self.bias_log_var)
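A quick behavioral check (the toy shapes are assumptions): two forward passes on the same input differ, because the noise is injected in output space rather than in the weights.

layer = LocalReparamBayesianLinear(4, 3)
x = torch.randn(2, 4)
print(layer(x))  # stochastic sample 1
print(layer(x))  # stochastic sample 2 (differs)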

5. MC Dropout

5.1 A Bayesian Interpretation of Dropout

Key insight: dropout can be interpreted as variational inference against an approximate Bayesian posterior.[^3]

Correspondence

| Dropout operation | Variational-inference interpretation |
| --- | --- |
| Dropout mask | A sample from the variational distribution |
| Training with dropout | Maximizing a variational lower bound |
| Dropout at test time | Sampling from the variational posterior |

5.2 The MC Dropout Algorithm

Prediction: keep dropout active at test time and average $T$ stochastic forward passes,

$$p(y \mid x) \approx \frac{1}{T}\sum_{t=1}^{T} p\big(y \mid x, \hat{w}_t\big), \qquad \hat{w}_t \sim \text{dropout masks}$$

class MCDropoutModel(nn.Module):
    """
    MC Dropout model.

    Uses dropout at inference time for uncertainty estimation.
    """

    def __init__(self, input_dim, hidden_dim, output_dim, dropout_rate=0.5):
        super().__init__()

        self.dropout_rate = dropout_rate

        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, output_dim)

        self.activation = nn.ReLU()

    def forward(self, x, dropout=True):
        """
        Forward pass.

        Args:
            x: input
            dropout: apply dropout even in eval mode (required for MC
                Dropout; nn.Dropout modules silently become no-ops
                after model.eval(), so F.dropout is used with an
                explicit flag instead)
        """
        use_drop = dropout or self.training

        h = self.activation(self.fc1(x))
        h = F.dropout(h, p=self.dropout_rate, training=use_drop)

        h = self.activation(self.fc2(h))
        h = F.dropout(h, p=self.dropout_rate, training=use_drop)

        logits = self.fc3(h)

        return logits
    
    def predict(self, x, n_samples=50):
        """
        MC Dropout prediction.

        Estimates uncertainty from multiple stochastic forward passes;
        dropout stays active through the `dropout=True` flag.
        """
        self.eval()  # dropout is still applied via F.dropout above
        
        predictions = []
        
        with torch.no_grad():
            for _ in range(n_samples):
                logits = self.forward(x, dropout=True)
                predictions.append(logits)
        
        predictions = torch.stack(predictions)
        
        # Predictive statistics
        mean = predictions.mean(dim=0)
        variance = predictions.var(dim=0)
        std = predictions.std(dim=0)
        
        # Predicted class
        pred_class = mean.argmax(dim=-1)
        
        return {
            'mean': mean,
            'variance': variance,
            'std': std,
            'pred_class': pred_class,
            'predictions': predictions
        }
 
 
def mc_dropout_predict(model, x, n_samples=50):
    """
    Generic MC Dropout prediction.

    Works with any model that contains Dropout layers.

    Args:
        model: a model with Dropout layers
        x: input data
        n_samples: number of stochastic forward passes

    Returns:
        predictive mean, variance, standard deviation, and all samples
    """
    model.train()  # keep nn.Dropout active
    
    predictions = []
    
    with torch.no_grad():
        for _ in range(n_samples):
            pred = model(x)
            predictions.append(pred)
    
    predictions = torch.stack(predictions)
    
    mean = predictions.mean(dim=0)
    variance = predictions.var(dim=0)
    std = predictions.std(dim=0)
    
    return mean, variance, std, predictions
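Usage sketch (the toy model dimensions and the random batch are assumptions):

model = MCDropoutModel(input_dim=2, hidden_dim=64, output_dim=2)
mean, var, std, preds = mc_dropout_predict(model, torch.randn(8, 2), n_samples=100)
print(mean.shape, preds.shape)  # torch.Size([8, 2]) torch.Size([100, 8, 2])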

5.3 MC Dropout Compared with Other Methods

| Aspect | MC Dropout | Bayes by Backprop | Ensembles |
| --- | --- | --- | --- |
| Implementation complexity | Low | Medium | Low |
| Compute cost | T forward passes | Training with per-step sampling | K independent trainings |
| Uncertainty quality | Medium | Medium | High |
| Training procedure | Standard dropout | Variational inference | Independent training runs |
| Theoretical grounding | Approximate VI | Variational inference | Non-Bayesian |

5.4 Variance Estimation with MC Dropout

Decomposing the predictive variance (law of total variance):

$$\mathrm{Var}[y \mid x] = \underbrace{\mathbb{E}_{w}\big[\mathrm{Var}[y \mid x, w]\big]}_{\text{aleatoric}} + \underbrace{\mathrm{Var}_{w}\big[\mathbb{E}[y \mid x, w]\big]}_{\text{epistemic}}$$
def estimate_uncertainty(model, x, y_true=None, n_samples=100):
    """
    Estimate predictive uncertainty for a classifier.

    Returns:
        epistemic: epistemic (model) uncertainty
        aleatoric: aleatoric (data) uncertainty
    """
    model.train()  # keep dropout active

    # Collect stochastic predictions
    logits_list = []
    predictions_list = []

    with torch.no_grad():
        for _ in range(n_samples):
            logits = model(x)
            probs = F.softmax(logits, dim=-1)
            logits_list.append(logits)
            predictions_list.append(probs)

    logits = torch.stack(logits_list)
    predictions = torch.stack(predictions_list)

    # Predictive means
    mean_logits = logits.mean(dim=0)
    mean_probs = predictions.mean(dim=0)

    # Epistemic uncertainty: spread of predictions across weight samples
    epistemic_var = predictions.var(dim=0)
    epistemic_std = epistemic_var.sqrt()

    # Total uncertainty: entropy of the mean predictive distribution
    # H[p̄] = -Σ p̄ log p̄
    total_entropy = -(mean_probs * (mean_probs + 1e-8).log()).sum(dim=-1)

    # Aleatoric uncertainty: expected entropy of individual predictions
    # E_w[H[p(y|x,w)]]
    aleatoric_entropy = -(predictions * (predictions + 1e-8).log()) \
        .sum(dim=-1).mean(dim=0)

    return {
        'mean_logits': mean_logits,
        'mean_probs': mean_probs,
        'epistemic_std': epistemic_std,
        'aleatoric_entropy': aleatoric_entropy,
        # Mutual information: total minus aleatoric (epistemic, in nats)
        'epistemic_mi': total_entropy - aleatoric_entropy,
        'total_entropy': total_entropy
    }

6. Types of Uncertainty: Epistemic and Aleatoric

6.1 Mathematical Definitions

Total uncertainty:

$$\mathbb{H}[y \mid x, \mathcal{D}] = \mathbb{H}\Big[\mathbb{E}_{p(w \mid \mathcal{D})}\big[p(y \mid x, w)\big]\Big]$$

Epistemic uncertainty (the mutual information between prediction and weights):

$$\mathbb{I}[y; w \mid x, \mathcal{D}] = \mathbb{H}[y \mid x, \mathcal{D}] - \mathbb{E}_{p(w \mid \mathcal{D})}\big[\mathbb{H}[y \mid x, w]\big]$$

Aleatoric uncertainty:

$$\mathbb{E}_{p(w \mid \mathcal{D})}\big[\mathbb{H}[y \mid x, w]\big]$$

6.2 Variance Decomposition

Predictive variance (law of total variance):

$$\mathrm{Var}[y \mid x, \mathcal{D}] = \mathbb{E}_{p(w \mid \mathcal{D})}\big[\mathrm{Var}[y \mid x, w]\big] + \mathrm{Var}_{p(w \mid \mathcal{D})}\big[\mathbb{E}[y \mid x, w]\big]$$

def decompose_uncertainty(model, x, n_samples=100):
    """
    Decompose total uncertainty into epistemic and aleatoric parts.

    Args:
        model: Bayesian neural network (forward accepts sample=True)
        x: input
        n_samples: number of posterior samples

    Returns:
        epistemic_var: epistemic uncertainty (variance across samples)
        aleatoric_entropy: aleatoric uncertainty (expected entropy)
        total_entropy: entropy of the mean prediction
    """
    predictions = []
    logits_list = []

    with torch.no_grad():
        for _ in range(n_samples):
            logits = model(x, sample=True)
            probs = F.softmax(logits, dim=-1)

            logits_list.append(logits)
            predictions.append(probs)

    predictions = torch.stack(predictions)
    logits = torch.stack(logits_list)

    # Epistemic uncertainty: variance of the class probabilities across
    # weight samples, shape (batch, num_classes)
    mean_pred = predictions.mean(dim=0)
    epistemic_var = predictions.var(dim=0)

    # Aleatoric uncertainty: expected entropy of the per-sample
    # predictions, E_w[H(y|x,w)] with H = -Σ p log p
    aleatoric_entropy = -(predictions * (predictions + 1e-8).log()) \
        .sum(dim=-1).mean(dim=0)

    # Total uncertainty: entropy of the averaged prediction
    total_entropy = -(mean_pred * (mean_pred + 1e-8).log()).sum(dim=-1)

    return {
        'epistemic_var': epistemic_var,
        'aleatoric_entropy': aleatoric_entropy,
        'total_entropy': total_entropy,
        'mean_pred': mean_pred
    }
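Usage sketch (assumes the BayesByBackpropNet from section 4.3; the toy dimensions are illustrative):

bnn = BayesByBackpropNet(input_dim=2, hidden_dim=32, output_dim=2)
out = decompose_uncertainty(bnn, torch.randn(8, 2), n_samples=64)
print(out['epistemic_var'].shape)   # torch.Size([8, 2])
print(out['total_entropy'].shape)   # torch.Size([8])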

6.3 Visualizing Uncertainty

def visualize_uncertainty(model, X_train, y_train, X_test, n_samples=100):
    """
    Visualize predictive uncertainty for a 1-D regression task
    (assumes X_test is sorted, for a clean line plot).
    """
    import matplotlib.pyplot as plt

    model.eval()
    
    # Monte Carlo predictions
    predictions = []
    with torch.no_grad():
        for _ in range(n_samples):
            pred = model(X_test, sample=True)
            predictions.append(pred)
    
    predictions = torch.stack(predictions)
    
    mean = predictions.mean(dim=0)
    std = predictions.std(dim=0)
    
    # Plot
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Left: predictive mean with uncertainty band
    axes[0].scatter(X_train, y_train, c='blue', alpha=0.5, label='Training data')
    axes[0].plot(X_test, mean, 'r-', label='Mean prediction')
    axes[0].fill_between(
        X_test.flatten(), 
        (mean - 2*std).flatten(), 
        (mean + 2*std).flatten(),
        alpha=0.3, 
        color='red',
        label='95% CI'
    )
    axes[0].legend()
    axes[0].set_xlabel('x')
    axes[0].set_ylabel('y')
    axes[0].set_title('Regression with Uncertainty')
    
    # Right: distribution of predictive std devs
    axes[1].hist(std.numpy(), bins=50, edgecolor='black')
    axes[1].set_xlabel('Predictive Std Dev')
    axes[1].set_ylabel('Count')
    axes[1].set_title('Uncertainty Distribution')
    
    plt.tight_layout()
    plt.savefig('uncertainty_visualization.png')

6.4 Where Each Type of Uncertainty Matters

| Uncertainty type | Typical situation | Response strategy |
| --- | --- | --- |
| Epistemic (high) | Out-of-distribution inputs | Collect more data, or abstain |
| Epistemic (low) | In-distribution inputs | Predict normally |
| Aleatoric (high) | Inherently noisy data | Report high uncertainty |
| Aleatoric (low) | Near-deterministic data | Predict with high confidence |

7. Practical Applications and PyTorch Implementation

7.1 Uncertainty-Aware Regression

class HeteroscedasticRegressor(nn.Module):
    """
    Heteroscedastic regressor.

    Predicts both a mean and an input-dependent variance.
    """
    
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        
        # Mean network
        self.mean_net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )
        
        # Variance network
        self.var_net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Softplus()  # keep the variance positive
        )
    
    def forward(self, x):
        mean = self.mean_net(x)
        var = self.var_net(x)
        return mean, var
    
    def loss(self, x, y):
        """
        Gaussian negative log-likelihood:

        NLL = 0.5 * (log(2πσ²) + (y - μ)² / σ²)
        """
        mean, var = self.forward(x)

        # Work with the log-variance for numerical stability
        log_var = torch.log(var + 1e-8)
        
        nll = 0.5 * (
            np.log(2 * np.pi) + 
            log_var + 
            (y - mean) ** 2 / var
        )
        
        return nll.mean()
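A minimal training sketch for the heteroscedastic regressor (the synthetic 1-D data, whose noise level grows with |x|, is an assumption for illustration):

x = torch.linspace(-3, 3, 500).unsqueeze(-1)
y = torch.sin(x) + torch.randn_like(x) * (0.1 + 0.3 * x.abs())  # noise grows with |x|

reg = HeteroscedasticRegressor(input_dim=1, hidden_dim=64)
opt = torch.optim.Adam(reg.parameters(), lr=1e-3)
for step in range(2000):
    opt.zero_grad()
    loss = reg.loss(x, y)
    loss.backward()
    opt.step()

mean, var = reg(x)  # var should track the x-dependent noise level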
 
 
class BayesianRegressor(nn.Module):
    """
    Bayesian regressor.

    Learns uncertainty over the weights.
    """
    
    def __init__(self, input_dim, hidden_dim, prior_std=1.0):
        super().__init__()
        
        self.fc1 = LocalReparamBayesianLinear(input_dim, hidden_dim, prior_std)
        self.fc2 = LocalReparamBayesianLinear(hidden_dim, hidden_dim, prior_std)
        self.fc3 = LocalReparamBayesianLinear(hidden_dim, 1, prior_std)
        
        self.activation = nn.ReLU()
    
    def forward(self, x, sample=True):
        # Note: the local-reparameterization layers always sample, so
        # the `sample` flag is kept only for interface compatibility.
        h = self.activation(self.fc1(x))
        h = self.activation(self.fc2(h))
        return self.fc3(h)
    
    def predict(self, x, n_samples=50):
        """贝叶斯预测"""
        predictions = []
        
        with torch.no_grad():
            for _ in range(n_samples):
                pred = self(x, sample=True)
                predictions.append(pred)
        
        predictions = torch.stack(predictions)
        
        mean = predictions.mean(dim=0)
        std = predictions.std(dim=0)
        
        return mean, std, predictions

7.2 Uncertainty-Aware Classification

class UncertaintyAwareClassifier(nn.Module):
    """
    Uncertainty-aware classifier.

    Combines:
    1. A heteroscedastic loss term (aleatoric uncertainty)
    2. MC Dropout (epistemic uncertainty)
    """
    
    def __init__(self, input_dim, hidden_dim, num_classes, dropout_rate=0.5):
        super().__init__()
        
        self.feature_extractor = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout_rate)
        )
        
        self.classifier = nn.Linear(hidden_dim, num_classes)
        
        # Heteroscedastic noise parameter (kept for extensions; not
        # used in this simplified version)
        self.log_noise = nn.Parameter(torch.zeros(1))
    
    def forward(self, x):
        features = self.feature_extractor(x)
        logits = self.classifier(features)
        return logits
    
    def predict_with_uncertainty(self, x, n_samples=50):
        """
        Prediction with uncertainty estimates.
        """
        self.train()  # keep the nn.Dropout layers active
        
        logits_list = []
        probs_list = []
        
        with torch.no_grad():
            for _ in range(n_samples):
                logits = self.forward(x)
                probs = F.softmax(logits, dim=-1)
                logits_list.append(logits)
                probs_list.append(probs)
        
        logits = torch.stack(logits_list)
        probs = torch.stack(probs_list)
        
        # Point estimates
        mean_logits = logits.mean(dim=0)
        mean_probs = probs.mean(dim=0)
        pred_class = mean_logits.argmax(dim=-1)
        
        # Epistemic uncertainty (variance of the sampled probabilities)
        epistemic_var = probs.var(dim=0)
        epistemic_std = epistemic_var.sqrt()
        
        # Predictive entropy
        pred_entropy = -(mean_probs * (mean_probs + 1e-8).log()).sum(dim=-1)
        
        # Confidence: maximum predicted probability
        max_prob, _ = mean_probs.max(dim=-1)
        
        return {
            'pred_class': pred_class,
            'mean_probs': mean_probs,
            'epistemic_std': epistemic_std,
            'pred_entropy': pred_entropy,
            'confidence': max_prob,
            'logits': logits,
            'probs': probs
        }
    
    def detect_out_of_distribution(self, x, threshold=0.5):
        """
        Detect out-of-distribution samples.

        Uses the maximum predicted probability as a confidence score.
        """
        results = self.predict_with_uncertainty(x)

        # Low confidence -> possibly OOD
        is_ood = results['confidence'] < threshold
        
        return is_ood, results
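A quick OOD-detection sketch (the inputs, scaled well outside a typical training range, and the 0.6 threshold are assumptions):

clf = UncertaintyAwareClassifier(input_dim=2, hidden_dim=64, num_classes=2)
is_ood, results = clf.detect_out_of_distribution(torch.randn(8, 2) * 5, threshold=0.6)
print(is_ood)  # True where the max predicted probability falls below 0.6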

7.3 A Complete Training Pipeline

def train_bayesian_model():
    """
    End-to-end example: training an uncertainty-aware classifier.
    """
    # Device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Small synthetic dataset for demonstration (two moons)
    torch.manual_seed(42)

    from sklearn.datasets import make_moons

    X, y = make_moons(n_samples=1000, noise=0.1, random_state=42)
    X = torch.tensor(X, dtype=torch.float32)
    y = torch.tensor(y, dtype=torch.long)

    dataset = torch.utils.data.TensorDataset(X, y)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True)
    
    # Model
    model = UncertaintyAwareClassifier(
        input_dim=2,
        hidden_dim=64,
        num_classes=2,
        dropout_rate=0.5
    ).to(device)
    
    # Optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    
    # Training loop
    n_epochs = 100
    
    for epoch in range(n_epochs):
        model.train()
        total_loss = 0
        n_batches = 0
        
        for x, y in dataloader:
            x, y = x.to(device), y.to(device)
            
            optimizer.zero_grad()
            
            # Forward pass
            logits = model(x)

            # Cross-entropy loss
            loss = F.cross_entropy(logits, y)
            
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
            n_batches += 1
        
        if (epoch + 1) % 20 == 0:
            print(f"Epoch {epoch+1}: Loss = {total_loss/n_batches:.4f}")
    
    # Evaluate the uncertainty estimates
    model.eval()

    # Test inputs drawn from a wider range than the training data
    X_test = torch.tensor(
        np.random.uniform(-2, 3, size=(200, 2)),
        dtype=torch.float32
    )
    
    results = model.predict_with_uncertainty(X_test.to(device), n_samples=100)
    
    print("\n不确定性统计:")
    print(f"  平均置信度: {results['confidence'].mean():.4f}")
    print(f"  平均预测熵: {results['pred_entropy'].mean():.4f}")
    print(f"  最大认知不确定性: {results['epistemic_std'].max():.4f}")
    
    return model, results
 
 
# Run the example
if __name__ == "__main__":
    model, results = train_bayesian_model()

7.4 Uncertainty Calibration

class UncertaintyCalibrator:
    """
    Calibration of predictive uncertainty.

    Checks that predicted confidence matches empirical accuracy.
    """
    
    def __init__(self, n_bins=10):
        self.n_bins = n_bins
        self.bin_boundaries = np.linspace(0, 1, n_bins + 1)
    
    def calibrate(self, model, dataloader, n_samples=50):
        """
        Evaluate the model's calibration.

        Computes the ECE (Expected Calibration Error).
        """
        device = next(model.parameters()).device
        
        model.eval()
        
        all_confidences = []
        all_accuracies = []
        all_predictions = []
        all_true = []
        
        with torch.no_grad():
            for x, y in dataloader:
                x, y = x.to(device), y.to(device)
                
                results = model.predict_with_uncertainty(x, n_samples=n_samples)
                
                all_confidences.append(results['confidence'].cpu())
                all_accuracies.append(
                    (results['pred_class'] == y).float().cpu()
                )
                all_predictions.append(results['pred_class'].cpu())
                all_true.append(y.cpu())
        
        confidences = torch.cat(all_confidences)
        accuracies = torch.cat(all_accuracies)
        
        # Expected calibration error
        ece = self.expected_calibration_error(confidences, accuracies)

        # Reliability-diagram data
        reliability_diagram = self.reliability_diagram(confidences, accuracies)
        
        return {
            'ece': ece,
            'reliability_diagram': reliability_diagram,
            'confidences': confidences,
            'accuracies': accuracies
        }
    
    def expected_calibration_error(self, confidences, accuracies):
        """
        Compute the ECE:

        ECE = Σ_b (|B_b| / n) * |acc(B_b) - conf(B_b)|
        """
        ece = 0.0
        
        for i in range(self.n_bins):
            bin_lower = self.bin_boundaries[i]
            bin_upper = self.bin_boundaries[i + 1]
            
            in_bin = (confidences > bin_lower) & (confidences <= bin_upper)
            
            if in_bin.sum() > 0:
                bin_confidence = confidences[in_bin].mean()
                bin_accuracy = accuracies[in_bin].mean()
                
                ece += in_bin.float().mean() * abs(bin_accuracy - bin_confidence)
        
        return ece.item()
    
    def reliability_diagram(self, confidences, accuracies):
        """
        Compute the data for a reliability diagram.
        """
        bin_accuracies = []
        bin_confidences = []
        bin_counts = []
        
        for i in range(self.n_bins):
            bin_lower = self.bin_boundaries[i]
            bin_upper = self.bin_boundaries[i + 1]
            
            in_bin = (confidences > bin_lower) & (confidences <= bin_upper)
            
            if in_bin.sum() > 0:
                bin_accuracies.append(accuracies[in_bin].mean().item())
                bin_confidences.append(confidences[in_bin].mean().item())
                bin_counts.append(in_bin.sum().item())
            else:
                bin_accuracies.append(0)
                bin_confidences.append((bin_lower + bin_upper) / 2)
                bin_counts.append(0)
        
        return {
            'accuracies': bin_accuracies,
            'confidences': bin_confidences,
            'counts': bin_counts
        }
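Usage sketch (model and test_loader are assumptions; the model must expose predict_with_uncertainty as defined in section 7.2):

calibrator = UncertaintyCalibrator(n_bins=10)
report = calibrator.calibrate(model, test_loader, n_samples=50)
print(f"ECE: {report['ece']:.4f}")  # lower means better calibrated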

8. Connections to Recent Advances in Deep Learning

8.1 Deep Ensembles and Bayesian Methods

Deep Ensembles can be viewed as a practical approximation to a BNN:

$$p(y \mid x) \approx \frac{1}{M}\sum_{m=1}^{M} p\big(y \mid x, \theta_m\big)$$

where the $\theta_m$ are models trained independently from different random initializations.

class DeepEnsemble:
    """
    Deep ensemble.

    Combines several independently trained models.
    """
    
    def __init__(self, models):
        self.models = models
    
    def predict(self, x, return_all=False):
        """集成预测"""
        predictions = []
        
        for model in self.models:
            model.eval()
            with torch.no_grad():
                pred = model(x)
                predictions.append(pred)
        
        predictions = torch.stack(predictions)
        
        mean = predictions.mean(dim=0)
        std = predictions.std(dim=0)
        
        if return_all:
            return mean, std, predictions
        return mean, std

8.2 Dropout and Attention Mechanisms

Dropout inside Transformer layers

class BayesianTransformerLayer(nn.Module):
    """
    "Bayesian" Transformer layer.

    Uses MC Dropout for uncertainty estimation.
    """
    
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        
        self.attention = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model)
        )
    
    def forward(self, x, src_mask=None, use_dropout=True):
        # F.dropout honors the explicit flag even in eval mode, which
        # nn.Dropout would not (it is a no-op after model.eval())
        p = self.dropout1.p
        train_flag = use_dropout or self.training

        # Self-attention
        attn_out, _ = self.attention(x, x, x, attn_mask=src_mask)
        attn_out = F.dropout(attn_out, p=p, training=train_flag)
        x = self.norm1(x + attn_out)

        # Feed-forward network
        ff_out = self.ff(x)
        ff_out = F.dropout(ff_out, p=p, training=train_flag)
        x = self.norm2(x + ff_out)

        return x
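A quick check (the shapes are assumptions; nn.MultiheadAttention expects (seq_len, batch, d_model) input since batch_first defaults to False):

layer = BayesianTransformerLayer(d_model=64, n_heads=4, d_ff=256)
x = torch.randn(10, 2, 64)
out1 = layer(x, use_dropout=True)  # stochastic pass 1
out2 = layer(x, use_dropout=True)  # pass 2 differs; the spread across
                                   # such passes is the uncertainty signal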

8.3 Uncertainty in Large Language Models

Uncertainty for LLMs

  1. Token-level perplexity → word-level uncertainty
  2. Sequence-level entropy → uncertainty of the generation as a whole
  3. Self-consistency checks → an approximation of epistemic uncertainty

class LLMPromptUncertainty:
    """
    LLM 提示不确定性估计
    
    通过多次采样估计不确定性
    """
    
    def __init__(self, model, tokenizer, device='cuda'):
        self.model = model
        self.tokenizer = tokenizer
        self.device = device
    
    def estimate_uncertainty(self, prompt, n_samples=10, max_length=50):
        """
        Estimate the uncertainty of a generation.

        Returns:
            responses: all sampled responses
            disagreement: fraction of responses that are duplicates
            mean/std of the per-sequence log-probabilities
        """
        responses = []
        log_probs_list = []
        
        for _ in range(n_samples):
            # Generate with sampling
            inputs = self.tokenizer(prompt, return_tensors='pt').to(self.device)
            
            with torch.no_grad():
                outputs = self.model.generate(
                    **inputs,
                    max_length=max_length,
                    do_sample=True,
                    temperature=0.8,
                    top_p=0.9
                )
            
            response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
            responses.append(response)
            
            # Per-sequence log-probability
            log_prob = self.compute_log_prob(inputs, outputs)
            log_probs_list.append(log_prob)
        
        # Summary statistics
        responses_unique = list(set(responses))
        disagreement = 1 - len(responses_unique) / n_samples
        
        mean_log_prob = np.mean(log_probs_list)
        std_log_prob = np.std(log_probs_list)
        
        return {
            'responses': responses,
            'disagreement': disagreement,
            'mean_log_prob': mean_log_prob,
            'std_log_prob': std_log_prob,
            'n_unique': len(responses_unique)
        }
    
    def compute_log_prob(self, inputs, outputs):
        """Average log-probability per token of the generated sequence."""
        with torch.no_grad():
            # Score the full generated sequence against itself; the
            # model's loss is the mean negative log-likelihood per token
            result = self.model(outputs, labels=outputs)
        return -result.loss.item()

9. Summary and Connections

9.1 Key Takeaways

| Topic | Core idea |
| --- | --- |
| BNN definition | A posterior distribution over weights instead of a point estimate |
| Mean field VI | Factorized Gaussian approximation of the posterior |
| Bayes by Backprop | Reparameterization trick + ELBO optimization |
| MC Dropout | Dropout as a variational approximation |
| Epistemic uncertainty | Model uncertainty; reducible with more data |
| Aleatoric uncertainty | Data uncertainty; irreducible |

9.2 Method Comparison at a Glance

| Method | Compute cost | Uncertainty quality | Implementation difficulty |
| --- | --- | --- | --- |
| Bayes by Backprop | Medium | Medium | Medium |
| MC Dropout | Low | Medium | Low |
| Deep Ensembles | Tunable | Medium-high | Low |
| SWAG | Medium | Medium-high | Medium |
| HMC | Very high | Highest | High |

9.3 Related Documents

| Related topic | Connection |
| --- | --- |
| Bayesian networks | Foundations of probabilistic graphical models |
| Advanced variational inference | The theoretical framework behind VI |
| Neural variational inference | NVI methods |
| Bayes by Backprop | Training variational BNNs |
| MC Dropout | Dropout-based uncertainty |
| Probabilistic circuits | Foundations of tractable inference |

Footnotes

[^1]: Jospin, L. V., et al. (2020). "Hands-On Bayesian Neural Networks: A Tutorial for Deep Learning Users". arXiv:2007.06823.

[^2]: Blundell, C., et al. (2015). "Weight Uncertainty in Neural Networks". ICML 2015.

[^3]: Gal, Y., & Ghahramani, Z. (2016). "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning". ICML 2016.