Bayesian Deep Learning and Uncertainty Quantification

Bayesian Deep Learning (BDL) is a framework that combines Bayesian probabilistic inference with deep neural networks, aiming to model and quantify the uncertainty in a neural network's predictions.[^1] In contrast to the point estimates of conventional deep learning, BDL learns a posterior distribution over the network weights, which makes it possible to distinguish different types of uncertainty and to make better-informed predictions.

1. Defining Bayesian Neural Networks

1.1 From Deterministic to Bayesian Networks

The training objective of a conventional neural network:

$$w^* = \arg\max_{w} \log p(\mathcal{D} \mid w) \quad \text{(maximum likelihood; with weight decay, } \arg\max_{w}\big[\log p(\mathcal{D} \mid w) + \log p(w)\big]\text{, i.e. MAP)}$$

This yields a point estimate of the weights and carries no information about model uncertainty.
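As a concrete illustration, a minimal sketch (assuming a generic classifier `net` and a batch `x, y`; the equivalence of L2 weight decay and a Gaussian log-prior is the standard one):

import torch
import torch.nn as nn
import torch.nn.functional as F

def map_loss(net, x, y, prior_std=1.0):
    """MAP objective = NLL + Gaussian negative log-prior (i.e. L2 weight decay)."""
    nll = F.cross_entropy(net(x), y)  # -log p(D|w), averaged over the batch
    l2 = sum((p ** 2).sum() for p in net.parameters())  # sum of w^2
    return nll + l2 / (2 * prior_std ** 2)  # -log p(w) up to an additive constant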
The Bayesian neural network framework:

- Prior: $p(w)$, typically $\mathcal{N}(0, \sigma_p^2 I)$
- Likelihood: $p(\mathcal{D} \mid w) = \prod_i p(y_i \mid x_i, w)$
- Posterior: $p(w \mid \mathcal{D}) = \dfrac{p(\mathcal{D} \mid w)\, p(w)}{p(\mathcal{D})}$

1.2 The Probabilistic Model of a BNN

Hierarchical representation:

$$w \sim p(w), \qquad y_i \mid x_i, w \sim p\big(y_i \mid f_w(x_i)\big)$$

Predictive distribution:

$$p(y^* \mid x^*, \mathcal{D}) = \int p(y^* \mid x^*, w)\, p(w \mid \mathcal{D})\, dw \;\approx\; \frac{1}{K} \sum_{k=1}^{K} p(y^* \mid x^*, w_k), \quad w_k \sim p(w \mid \mathcal{D})$$

This averages over all plausible weight configurations, and therefore carries uncertainty.
1.3 Structural Comparison

| Aspect | Deterministic network | Bayesian network |
|---|---|---|
| Weights | point estimate | distribution |
| Prediction | single output $\hat{y} = f_w(x)$ | predictive distribution $p(y \mid x, \mathcal{D})$ |
| Overfitting | prone to overfit | naturally regularized |
| Uncertainty | not quantified | built in |
| Compute cost | 1 forward pass | $K$ forward passes (sampling) |
1.4 Network Architecture Example
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianNetwork(nn.Module):
    """
    Basic Bayesian neural network.
    Unlike a deterministic network, each layer learns the parameters
    of a weight distribution rather than point weights.
    """
def __init__(self, layer_dims, prior_std=1.0):
super().__init__()
self.layer_dims = layer_dims
self.prior_std = prior_std
        # Create the Bayesian layers
self.layers = nn.ModuleList()
for i in range(len(layer_dims) - 1):
self.layers.append(
BayesianLinear(
layer_dims[i],
layer_dims[i+1],
prior_std=prior_std
)
)
def forward(self, x, sample=True):
"""
前向传播
Args:
x: 输入
sample: 是否从后验采样权重
"""
for layer in self.layers:
if sample:
x = layer.sample_forward(x)
else:
                # Use the mean weights
x = layer.mean_forward(x)
if layer != self.layers[-1]:
x = F.relu(x)
return x
def predict(self, x, n_samples=50):
"""
贝叶斯预测:多次采样获取预测分布
"""
predictions = []
with torch.no_grad():
for _ in range(n_samples):
pred = self(x, sample=True)
predictions.append(pred)
predictions = torch.stack(predictions)
        # Summary statistics
mean = predictions.mean(dim=0)
std = predictions.std(dim=0)
var = predictions.var(dim=0)
return mean, std, var, predictions
class BayesianLinear(nn.Module):
    """Bayesian linear layer with a factorized Gaussian variational posterior."""
def __init__(self, in_features, out_features, prior_std=1.0):
super().__init__()
self.in_features = in_features
self.out_features = out_features
self.prior_std = prior_std
        # Variational parameters: means and log-variances
self.weight_mu = nn.Parameter(
torch.randn(out_features, in_features) * 0.1
)
        self.weight_log_var = nn.Parameter(
            torch.zeros(out_features, in_features) - 6  # log σ² = -6, i.e. σ ≈ 0.05
        )
self.bias_mu = nn.Parameter(torch.zeros(out_features))
self.bias_log_var = nn.Parameter(
torch.zeros(out_features) - 6
)
def sample_weights(self):
"""从变分后验采样"""
std = (0.5 * self.weight_log_var).exp()
eps = torch.randn_like(self.weight_mu)
weight = self.weight_mu + eps * std
std = (0.5 * self.bias_log_var).exp()
eps = torch.randn_like(self.bias_mu)
bias = self.bias_mu + eps * std
return weight, bias
def sample_forward(self, x):
"""使用采样权重的正向传播"""
weight, bias = self.sample_weights()
return F.linear(x, weight, bias)
def mean_forward(self, x):
"""使用均值权重的正向传播"""
return F.linear(x, self.weight_mu, self.bias_mu)
def kl_divergence(self):
"""
计算与先验的 KL 散度
D_KL(N(μ,σ²) || N(0,σ_p²))
"""
prior_var = self.prior_std ** 2
def kl_gaussian(mu, log_var):
var = torch.exp(log_var)
return 0.5 * (
log_var - torch.log(torch.tensor(prior_var, device=mu.device))
+ (var + mu ** 2) / prior_var
- 1.0
).sum()
return kl_gaussian(self.weight_mu, self.weight_log_var) + \
               kl_gaussian(self.bias_mu, self.bias_log_var)

2. The Challenge of Posterior Inference over Parameters
2.1 Computational Complexity

Computing the posterior:

$$p(w \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid w)\, p(w)}{\int p(\mathcal{D} \mid w')\, p(w')\, dw'}$$

Sources of difficulty:

| Difficulty | Explanation |
|---|---|
| High-dimensional parameter space | modern networks have $10^6$–$10^{9+}$ parameters |
| Nonlinear mapping | the likelihood is non-convex in $w$ |
| Normalizing constant | cannot be computed analytically |
| Non-Gaussian posterior | the posterior has a complex shape |
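Note that the numerator is cheap to evaluate at any single $w$; only the normalizer is intractable. A minimal sketch (assuming a generic classifier `net` and a data batch `x, y`):

def log_unnormalized_posterior(net, x, y, prior_std=1.0):
    """log p(D|w) + log p(w), i.e. log p(w|D) up to the unknown log p(D)."""
    log_lik = -F.cross_entropy(net(x), y, reduction='sum')  # log p(D|w)
    log_prior = sum(
        (-0.5 * (p ** 2) / prior_std ** 2).sum() for p in net.parameters()
    )  # log p(w) up to a constant
    return log_lik + log_prior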
2.2 The Impossibility of Exact Inference

Theoretical limitation:

For a network with millions of parameters, computing the posterior exactly is computationally infeasible.

- Number of parameters: $d \sim 10^6$ or more
- Cost of the integral: the normalizer $\int p(\mathcal{D} \mid w)\, p(w)\, dw$ ranges over $\mathbb{R}^d$; any grid-based quadrature needs on the order of $k^d$ evaluations (the curse of dimensionality)
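To make the curse of dimensionality concrete, a tiny sketch of the grid-quadrature count (pure arithmetic, no model involved):

# Even with only 2 quadrature points per dimension, a d-dimensional
# integral needs 2**d evaluations: log10(2**d) = d * log10(2).
for d in [10, 100, 1_000_000]:
    print(f"d={d}: 2^d ≈ 10^{d * 0.30103:.0f} evaluations")
# d = 10^6 (a small modern network) already gives ~10^301030 evaluations.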
2.3 A Spectrum of Approximate Inference Methods

                        Exact inference
                               ↑
                               │
                ┌──────────────┼──────────────┐
                │              │              │
    Variational inference   Monte Carlo    Point estimate
                │            sampling     + uncertainty
                ↓              ↓              ↓
    Mean-field approx.      HMC/NUTS       MC Dropout
    Normalizing flows       SGLD           Ensembles
    Local reparameterization  Particle filtering
2.4 The Approximation-Quality vs. Compute Trade-off

| Method | Compute cost | Approximation quality | Implementation complexity |
|---|---|---|---|
| HMC | very high | highest | high |
| SGLD | high | high | medium |
| VI (mean field) | medium | medium | low |
| VI (normalizing flows) | medium-high | medium-high | medium |
| MC Dropout | low | medium | low |
| Ensembles | tunable | medium-high | low |
3. Variational Inference: The Mean-Field Approximation

3.1 The Mean-Field Assumption

Factorization assumption:

$$q(w) = \prod_{i} q_i(w_i) = \prod_{i} \mathcal{N}(w_i;\, \mu_i, \sigma_i^2)$$

Why this assumption:

- Tractability: only per-weight one-dimensional Gaussian parameters need to be optimized
- Regularization: it encourages (approximate) independence between weights
- Theoretical grounding: within the factorized family, optimization finds the member closest to the true posterior in KL divergence
3.2 The Variational Objective

ELBO:

$$\mathcal{L}(q) = \mathbb{E}_{q(w)}\big[\log p(\mathcal{D} \mid w)\big] - \mathrm{KL}\big(q(w)\,\|\,p(w)\big)$$

Expanded form:

$$\log p(\mathcal{D}) = \mathcal{L}(q) + \mathrm{KL}\big(q(w)\,\|\,p(w \mid \mathcal{D})\big) \;\ge\; \mathcal{L}(q)$$

so maximizing the ELBO minimizes the KL divergence between $q(w)$ and the true posterior.
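In minibatch training the KL term is paid once per pass over the data, so it is usually down-weighted; dividing it by the number of minibatches is one common heuristic (Blundell et al. discuss several). A minimal sketch, assuming a model that exposes a kl_divergence() method like the layers above:

def minibatch_free_energy(model, x, y, num_batches):
    """Per-batch variational free energy, with the KL spread across batches."""
    nll = F.cross_entropy(model(x, sample=True), y, reduction='sum')  # -log p(batch | w)
    return nll + model.kl_divergence() / num_batches  # KL paid once per epoch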
3.3 Computing the KL Divergence

Gaussian-to-Gaussian KL divergence:

def kl_divergence_gaussian_to_prior(mu, log_var, prior_std=1.0):
    """
    KL divergence between a Gaussian variational distribution and a Gaussian prior:
    D_KL(N(μ,σ²) || N(0,σ_p²))
    Args:
        mu: means
        log_var: log-variances log(σ²)
        prior_std: prior standard deviation
    Returns:
        KL divergence (scalar)
    """
prior_var = prior_std ** 2
var = torch.exp(log_var)
kl = 0.5 * (
log_var - torch.log(torch.tensor(prior_var, device=mu.device))
+ (var + mu ** 2) / prior_var
- 1.0
)
    return kl.sum()

3.4 The Optimization Algorithm
Stochastic gradient variational Bayes (SGVB):
class MeanFieldVI:
    """
    Mean-field variational inference,
    optimized with gradients via the reparameterization trick.
    """
def __init__(self, model, prior_std=1.0, kl_weight=1.0):
self.model = model
self.prior_std = prior_std
self.kl_weight = kl_weight
    def elbo(self, x, y, n_samples=1):
        """
        Monte Carlo estimate of the negative ELBO (used as the loss):
        -ELBO = -E_q[log p(y|x,w)] + KL(q(w) || p(w))
        """
batch_size = x.size(0)
total_loss = 0.0
total_recon = 0.0
total_kl = 0.0
for _ in range(n_samples):
            # KL term between the variational posterior and the prior
kl = self.sample_and_compute_kl()
            # Forward pass (weights sampled via reparameterization)
output = self.model(x, sample=True)
            # Reconstruction (negative log-likelihood) term
recon = F.cross_entropy(output, y, reduction='sum')
total_loss += recon + self.kl_weight * kl
total_recon += recon
total_kl += kl
        # Average over samples
loss = total_loss / n_samples
recon_loss = total_recon / n_samples
kl_loss = total_kl / n_samples
return loss, recon_loss, kl_loss
    def sample_and_compute_kl(self):
        """Accumulate the KL divergence over all Bayesian layers."""
        kl = 0.0
        # Iterate over modules: nn.Parameter tensors carry no 'mu' attribute,
        # so the KL has to be collected from the Bayesian layers themselves.
        for module in self.model.modules():
            if isinstance(module, BayesianLinear):
                kl = kl + module.kl_divergence()  # each layer knows its own prior_std
        return kl
def fit(self, dataloader, n_epochs=100, lr=1e-3):
"""训练"""
optimizer = torch.optim.Adam(self.model.parameters(), lr=lr)
for epoch in range(n_epochs):
for x, y in dataloader:
optimizer.zero_grad()
loss, recon, kl = self.elbo(x, y, n_samples=1)
loss.backward()
optimizer.step()
if (epoch + 1) % 10 == 0:
print(f"Epoch {epoch+1}: Loss={loss.item():.4f}, "
f"Recon={recon.item():.4f}, KL={kl.item():.4f}")4. Bayes by Backprop 详解
4.1 Core Algorithm

Bayes by Backprop learns a distribution over the weights via variational inference while using ordinary backpropagation for the gradients.[^2]

Algorithm steps:

- Initialize the variational parameters $\theta = (\mu, \rho)$
- Repeat:
  - Sample $\epsilon \sim \mathcal{N}(0, I)$ and set $w = \mu + \sigma \odot \epsilon$
  - Compute the gradient of the negative ELBO with respect to $\theta$
  - Update $\theta$

4.2 Deriving the Loss

Variational free energy:

$$\mathcal{F}(\mathcal{D}, \theta) = \mathrm{KL}\big(q(w \mid \theta)\,\|\,p(w)\big) - \mathbb{E}_{q(w \mid \theta)}\big[\log p(\mathcal{D} \mid w)\big]$$

In practice it is estimated with Monte Carlo samples:

$$\mathcal{F}(\mathcal{D}, \theta) \approx \frac{1}{K}\sum_{k=1}^{K}\Big[\log q(w^{(k)} \mid \theta) - \log p(w^{(k)}) - \log p(\mathcal{D} \mid w^{(k)})\Big]$$

where $w^{(k)} = \mu + \sigma \odot \epsilon^{(k)}$ and $\epsilon^{(k)} \sim \mathcal{N}(0, I)$.
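In the original paper the standard deviation is parameterized as σ = log(1 + exp(ρ)) (softplus), which keeps it positive without exponentiating a log-variance; the layers in this document use log σ² instead. A minimal sketch of the paper's variant:

class SoftplusGaussianWeight(nn.Module):
    """A variational weight tensor with σ = softplus(ρ), per Blundell et al. (2015)."""
    def __init__(self, *shape):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(*shape) * 0.1)
        self.rho = nn.Parameter(torch.full(shape, -5.0))  # softplus(-5) ≈ 0.0067

    def sample(self):
        sigma = F.softplus(self.rho)      # log(1 + exp(rho)) > 0
        eps = torch.randn_like(self.mu)
        return self.mu + sigma * eps      # reparameterized sample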
4.3 Full Implementation
class BayesByBackpropNet(nn.Module):
    """
    Bayes by Backprop network.
    Core ideas:
    1. Approximate each weight with a Gaussian
    2. Sample via the reparameterization trick
    3. Minimize the variational free energy
    """
def __init__(self, input_dim, hidden_dim, output_dim, prior_std=1.0):
super().__init__()
        # Network layers
self.fc1 = BayesianLinear(input_dim, hidden_dim, prior_std)
self.fc2 = BayesianLinear(hidden_dim, hidden_dim, prior_std)
self.fc3 = BayesianLinear(hidden_dim, output_dim, prior_std)
self.layers = [self.fc1, self.fc2, self.fc3]
def forward(self, x, sample=True):
"""前向传播"""
if sample:
h = F.relu(self.fc1.sample_forward(x))
h = F.relu(self.fc2.sample_forward(h))
logits = self.fc3.sample_forward(h)
else:
h = F.relu(self.fc1.mean_forward(x))
h = F.relu(self.fc2.mean_forward(h))
logits = self.fc3.mean_forward(h)
return logits
def kl_divergence(self):
"""总 KL 散度"""
return sum(layer.kl_divergence() for layer in self.layers)
def predict(self, x, n_samples=50):
"""
贝叶斯预测
Returns:
mean: 预测均值
variance: 预测方差
predictions: 所有采样预测
"""
predictions = []
with torch.no_grad():
for _ in range(n_samples):
logits = self.forward(x, sample=True)
predictions.append(logits)
predictions = torch.stack(predictions)
        # Mean and variance across samples
mean = predictions.mean(dim=0)
variance = predictions.var(dim=0)
return mean, variance, predictions
class BayesByBackpropTrainer:
    """
    Trainer for Bayes by Backprop.
    """
def __init__(
self,
model: BayesByBackpropNet,
lr: float = 1e-3,
kl_weight: float = 1.0,
n_samples: int = 1
):
self.model = model
self.kl_weight = kl_weight
self.n_samples = n_samples
self.optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    def train_step(self, x, y):
        """One training step."""
        self.optimizer.zero_grad()
        # Estimate the expectation with several weight samples
        total_loss = 0.0
        total_recon = 0.0
        total_kl = 0.0
        for _ in range(self.n_samples):
            # Reconstruction (negative log-likelihood) term
            logits = self.model(x, sample=True)
            recon = F.cross_entropy(logits, y, reduction='mean')
            # KL term, scaled down to match the per-sample NLL
            # (dividing by the number of minibatches is the more standard choice)
            kl = self.model.kl_divergence() / len(x)
            loss = recon + self.kl_weight * kl
            # Divide before backward so gradients are averaged, not summed
            (loss / self.n_samples).backward()
            total_loss += loss.item()
            total_recon += recon.item()
            total_kl += kl.item()
        # Averages for logging
        avg_loss = total_loss / self.n_samples
        avg_recon = total_recon / self.n_samples
        avg_kl = total_kl / self.n_samples
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
        self.optimizer.step()
        return avg_loss, avg_recon, avg_kl
def train(self, dataloader, n_epochs):
"""完整训练"""
for epoch in range(n_epochs):
epoch_loss = 0
epoch_recon = 0
epoch_kl = 0
n_batches = 0
for x, y in dataloader:
loss, recon, kl = self.train_step(x, y)
epoch_loss += loss
epoch_recon += recon
epoch_kl += kl
n_batches += 1
if (epoch + 1) % 5 == 0:
print(f"Epoch {epoch+1}: Loss={epoch_loss/n_batches:.4f}, "
f"Recon={epoch_recon/n_batches:.4f}, "
f"KL={epoch_kl/n_batches:.4f}")4.4 局部重参数化技巧
Optimization: sample directly in the output (pre-activation) space, which reduces the variance of the gradient estimator.
class LocalReparamBayesianLinear(nn.Module):
    """
    Bayesian linear layer with local reparameterization.
    Key optimization:
    - do not sample in parameter space
    - sample directly in the output space
    - reduces the variance of the gradient estimates
    """
def __init__(self, in_features, out_features, prior_std=1.0):
super().__init__()
self.in_features = in_features
self.out_features = out_features
self.prior_std = prior_std
        # Weight and bias means
self.weight_mu = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
self.bias_mu = nn.Parameter(torch.zeros(out_features))
        # Log-variances
self.weight_log_var = nn.Parameter(torch.zeros(out_features, in_features) - 6)
self.bias_log_var = nn.Parameter(torch.zeros(out_features) - 6)
    def forward(self, x):
        """
        Forward pass, sampling in the output space:
        E[y] = x @ W_mu + b_mu
        Var[y] = (x² @ Var[W]) + Var[b]
        """
        # Output mean
        mean = F.linear(x, self.weight_mu, self.bias_mu)
        # Output variance (local reparameterization):
        # Var[wx + b] = Var[w]·x² + Var[b]
        weight_var = torch.exp(self.weight_log_var)
        bias_var = torch.exp(self.bias_log_var)
        # (x² @ Var[w]) + Var[b]
        output_var = F.linear(x ** 2, weight_var, bias_var)
        # Sample in the output space
output_std = torch.sqrt(output_var + 1e-8)
eps = torch.randn_like(mean)
output = mean + eps * output_std
return output
def kl_divergence(self):
"""KL 散度"""
prior_var = self.prior_std ** 2
def kl_gaussian(mu, log_var):
var = torch.exp(log_var)
return 0.5 * (
log_var - torch.log(torch.tensor(prior_var, device=mu.device))
+ (var + mu ** 2) / prior_var
- 1.0
).sum()
return kl_gaussian(self.weight_mu, self.weight_log_var) + \
               kl_gaussian(self.bias_mu, self.bias_log_var)

5. The MC Dropout Method
5.1 A Bayesian Interpretation of Dropout

Key insight: dropout can be interpreted as variational inference against an approximate Bayesian posterior.[^3]

The correspondence:

| Dropout operation | Variational-inference interpretation |
|---|---|
| Dropout mask | a (Bernoulli) variational distribution over weights |
| Training with dropout | maximizing a variational lower bound |
| Dropout at test time | sampling from the variational posterior |
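One practical caveat: nn.Dropout is the identity in eval() mode, and calling model.train() at prediction time also switches any BatchNorm layers to batch statistics. A minimal sketch of the usual workaround, which re-enables only the dropout modules:

def enable_mc_dropout(model: nn.Module):
    """Put only Dropout layers in train mode; leave everything else in eval."""
    model.eval()
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()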
5.2 The MC Dropout Algorithm

Prediction: run $T$ stochastic forward passes and average,

$$p(y \mid x) \approx \frac{1}{T} \sum_{t=1}^{T} p\big(y \mid x, \hat{w}_t\big), \qquad \hat{w}_t \sim \text{dropout masks}$$
class MCDropoutModel(nn.Module):
    """
    MC Dropout model:
    uses dropout at test time for uncertainty estimation.
    """
    def __init__(self, input_dim, hidden_dim, output_dim, dropout_rate=0.5):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, output_dim)
        self.p = dropout_rate
        self.activation = nn.ReLU()

    def forward(self, x, dropout=True):
        """
        Forward pass.
        Args:
            x: input
            dropout: apply dropout even outside train mode
        F.dropout is used instead of nn.Dropout modules so that the
        `dropout` flag also works in eval() mode (nn.Dropout would be
        the identity there); the output logits are left un-dropped.
        """
        active = dropout or self.training
        h = self.activation(self.fc1(x))
        h = F.dropout(h, p=self.p, training=active)
        h = self.activation(self.fc2(h))
        h = F.dropout(h, p=self.p, training=active)
        logits = self.fc3(h)
        return logits
    def predict(self, x, n_samples=50):
        """
        MC Dropout prediction:
        estimate uncertainty from multiple stochastic forward passes
        (forward(..., dropout=True) applies dropout even in eval mode).
        """
        predictions = []
with torch.no_grad():
for _ in range(n_samples):
logits = self.forward(x, dropout=True)
predictions.append(logits)
predictions = torch.stack(predictions)
        # Predictive statistics
mean = predictions.mean(dim=0)
variance = predictions.var(dim=0)
std = predictions.std(dim=0)
        # Predicted class
pred_class = mean.argmax(dim=-1)
return {
'mean': mean,
'variance': variance,
'std': std,
'pred_class': pred_class,
'predictions': predictions
}
def mc_dropout_predict(model, x, n_samples=50):
    """
    Generic MC Dropout prediction for any model containing Dropout layers.
    Args:
        model: a model with Dropout layers
        x: input data
        n_samples: number of stochastic forward passes
    Returns:
        mean, variance, std of the predictions, plus all samples
    """
    model.train()  # enable dropout (note: this also affects BatchNorm, if present)
predictions = []
with torch.no_grad():
for _ in range(n_samples):
pred = model(x)
predictions.append(pred)
predictions = torch.stack(predictions)
mean = predictions.mean(dim=0)
variance = predictions.var(dim=0)
std = predictions.std(dim=0)
    return mean, variance, std, predictions

5.3 MC Dropout Compared with Other Methods
| Aspect | MC Dropout | Bayes by Backprop | Ensembles |
|---|---|---|---|
| Implementation complexity | low | medium | low |
| Compute cost | $T$ forward passes | $T$ forward passes with weight sampling | $M$ independently trained models |
| Uncertainty quality | medium | high | medium |
| Training | standard dropout training | variational inference | independent training runs |
| Theoretical grounding | approximate variational inference | variational inference | non-Bayesian |
5.4 Variance Estimation with MC Dropout

Decomposing predictive uncertainty (entropy form, for classification):

$$\underbrace{\mathbb{H}\Big[\tfrac{1}{T}\textstyle\sum_t p(y \mid x, \hat{w}_t)\Big]}_{\text{total}} \;=\; \underbrace{\tfrac{1}{T}\textstyle\sum_t \mathbb{H}\big[p(y \mid x, \hat{w}_t)\big]}_{\text{aleatoric}} \;+\; \underbrace{\mathcal{I}(y; w \mid x)}_{\text{epistemic}}$$
def estimate_uncertainty(model, x, n_samples=100):
    """
    Estimate predictive uncertainty with MC Dropout.
    Entropy-based decomposition:
        total     = H[ E_t p_t ]        (predictive entropy)
        aleatoric = E_t H[ p_t ]        (expected entropy)
        epistemic = total - aleatoric   (mutual information, as in BALD)
    """
    model.train()  # enable dropout (note: also affects BatchNorm, if present)
    probs_list = []
    with torch.no_grad():
        for _ in range(n_samples):
            probs_list.append(F.softmax(model(x), dim=-1))
    probs = torch.stack(probs_list)        # (T, batch, classes)
    mean_probs = probs.mean(dim=0)
    # Total uncertainty: entropy of the averaged prediction
    total_entropy = -(mean_probs * (mean_probs + 1e-8).log()).sum(dim=-1)
    # Aleatoric: average entropy of the individual predictions
    aleatoric_entropy = -(probs * (probs + 1e-8).log()).sum(dim=-1).mean(dim=0)
    # Epistemic: what disagreement between weight samples adds on top
    epistemic_mi = total_entropy - aleatoric_entropy
    return {
        'mean_probs': mean_probs,
        'total_entropy': total_entropy,
        'aleatoric_entropy': aleatoric_entropy,
        'epistemic_mi': epistemic_mi
    }

6. Types of Uncertainty: Epistemic vs. Aleatoric
6.1 Mathematical Definitions

Total uncertainty:

$$\mathbb{H}\big[p(y \mid x, \mathcal{D})\big] = \mathbb{H}\Big[\mathbb{E}_{p(w \mid \mathcal{D})}\, p(y \mid x, w)\Big]$$

Epistemic uncertainty (model uncertainty):

$$\mathcal{I}(y; w \mid x, \mathcal{D}) = \mathbb{H}\big[p(y \mid x, \mathcal{D})\big] - \mathbb{E}_{p(w \mid \mathcal{D})}\,\mathbb{H}\big[p(y \mid x, w)\big]$$

Aleatoric uncertainty (data noise):

$$\mathbb{E}_{p(w \mid \mathcal{D})}\,\mathbb{H}\big[p(y \mid x, w)\big]$$

6.2 Variance Decomposition

Predictive variance (law of total variance):

$$\mathrm{Var}[y \mid x, \mathcal{D}] = \underbrace{\mathbb{E}_{p(w \mid \mathcal{D})}\big[\mathrm{Var}(y \mid x, w)\big]}_{\text{aleatoric}} + \underbrace{\mathrm{Var}_{p(w \mid \mathcal{D})}\big[\mathbb{E}(y \mid x, w)\big]}_{\text{epistemic}}$$
def decompose_uncertainty(model, x, n_samples=100):
    """
    Decompose the per-class predictive variance (law of total variance):
        Var[y_c] = E_w[ p_c (1 - p_c) ]   (aleatoric)
                 + Var_w[ p_c ]           (epistemic)
    Args:
        model: a Bayesian neural network
        x: input
        n_samples: number of weight samples
    """
    model.train()
    probs_list = []
    with torch.no_grad():
        for _ in range(n_samples):
            probs_list.append(F.softmax(model(x, sample=True), dim=-1))
    probs = torch.stack(probs_list)            # (T, batch, classes)
    mean_pred = probs.mean(dim=0)
    # Epistemic: variance of the class probabilities across weight samples
    epistemic_var = probs.var(dim=0)
    # Aleatoric: expected Bernoulli variance of each class under each sample
    aleatoric_var = (probs * (1 - probs)).mean(dim=0)
    # Total
    total_var = epistemic_var + aleatoric_var
    return {
        'epistemic_var': epistemic_var,
        'aleatoric_var': aleatoric_var,
        'total_var': total_var,
        'mean_pred': mean_pred
    }

6.3 Visualizing Uncertainty
def visualize_uncertainty(model, X_train, y_train, X_test, n_samples=100):
    """
    Visualize predictive uncertainty for a regression task.
    """
    import matplotlib.pyplot as plt
    model.eval()
    # Monte Carlo predictions
predictions = []
with torch.no_grad():
for _ in range(n_samples):
pred = model(X_test, sample=True)
predictions.append(pred)
predictions = torch.stack(predictions)
mean = predictions.mean(dim=0)
std = predictions.std(dim=0)
    # Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    # Left: predictive mean and uncertainty band
axes[0].scatter(X_train, y_train, c='blue', alpha=0.5, label='Training data')
axes[0].plot(X_test, mean, 'r-', label='Mean prediction')
axes[0].fill_between(
X_test.flatten(),
(mean - 2*std).flatten(),
(mean + 2*std).flatten(),
alpha=0.3,
color='red',
label='95% CI'
)
axes[0].legend()
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].set_title('Regression with Uncertainty')
    # Right: histogram of predictive standard deviations
    axes[1].hist(std.cpu().numpy().ravel(), bins=50, edgecolor='black')
axes[1].set_xlabel('Predictive Std Dev')
axes[1].set_ylabel('Count')
axes[1].set_title('Uncertainty Distribution')
plt.tight_layout()
    plt.savefig('uncertainty_visualization.png')

6.4 Where Each Type of Uncertainty Matters
| Uncertainty | Typical situation | Response strategy |
|---|---|---|
| Epistemic (high) | out-of-distribution data | collect more data, or abstain |
| Epistemic (low) | in-distribution data | predict normally |
| Aleatoric (high) | inherently noisy data | report the high uncertainty |
| Aleatoric (low) | near-deterministic data | predict with high confidence |
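These response strategies translate directly into a rejection rule. A minimal sketch (the threshold is hypothetical and task-dependent), built on the estimate_uncertainty function above:

def predict_or_abstain(model, x, mi_threshold=0.2, n_samples=100):
    """Predict the argmax class, abstaining where epistemic uncertainty is high."""
    res = estimate_uncertainty(model, x, n_samples=n_samples)
    pred = res['mean_probs'].argmax(dim=-1)
    abstain = res['epistemic_mi'] > mi_threshold  # hypothetical threshold
    return pred, abstain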
7. Practical Applications and PyTorch Implementations

7.1 Uncertainty-Aware Regression
import numpy as np

class HeteroscedasticRegressor(nn.Module):
    """
    Heteroscedastic regressor:
    predicts an input-dependent mean and variance.
    """
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # Mean network
self.mean_net = nn.Sequential(
nn.Linear(input_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, 1)
)
        # Variance network
self.var_net = nn.Sequential(
nn.Linear(input_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, 1),
            nn.Softplus()  # keeps the variance positive
)
def forward(self, x):
mean = self.mean_net(x)
var = self.var_net(x)
return mean, var
    def loss(self, x, y):
        """
        Gaussian negative log-likelihood:
        NLL = 0.5 * (log(2πσ²) + (y-μ)²/σ²)
        """
        mean, var = self.forward(x)
        # Work with the log-variance for numerical stability
        log_var = torch.log(var + 1e-8)
nll = 0.5 * (
np.log(2 * np.pi) +
log_var +
(y - mean) ** 2 / var
)
return nll.mean()
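A brief training sketch for the heteroscedastic regressor (X and Y are hypothetical tensors of shape (N, input_dim) and (N, 1)):

def train_heteroscedastic(model, X, Y, n_epochs=200, lr=1e-3):
    """Full-batch training on the Gaussian NLL defined by the class above."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(n_epochs):
        opt.zero_grad()
        loss = model.loss(X, Y)
        loss.backward()
        opt.step()
    return model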
class BayesianRegressor(nn.Module):
    """
    Bayesian regressor:
    learns weight uncertainty via local-reparameterization layers.
    """
def __init__(self, input_dim, hidden_dim, prior_std=1.0):
super().__init__()
self.fc1 = LocalReparamBayesianLinear(input_dim, hidden_dim, prior_std)
self.fc2 = LocalReparamBayesianLinear(hidden_dim, hidden_dim, prior_std)
self.fc3 = LocalReparamBayesianLinear(hidden_dim, 1, prior_std)
self.activation = nn.ReLU()
    def forward(self, x, sample=True):
        # The local-reparameterization layers always sample, so the
        # `sample` flag is accepted only for interface compatibility.
h = self.activation(self.fc1(x))
h = self.activation(self.fc2(h))
return self.fc3(h)
def predict(self, x, n_samples=50):
"""贝叶斯预测"""
predictions = []
with torch.no_grad():
for _ in range(n_samples):
pred = self(x, sample=True)
predictions.append(pred)
predictions = torch.stack(predictions)
mean = predictions.mean(dim=0)
std = predictions.std(dim=0)
        return mean, std, predictions

7.2 Uncertainty-Aware Classification
class UncertaintyAwareClassifier(nn.Module):
    """
    Uncertainty-aware classifier, combining:
    1. a heteroscedastic noise term (aleatoric uncertainty)
    2. MC Dropout (epistemic uncertainty)
    """
def __init__(self, input_dim, hidden_dim, num_classes, dropout_rate=0.5):
super().__init__()
self.feature_extractor = nn.Sequential(
nn.Linear(input_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(dropout_rate),
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(dropout_rate)
)
self.classifier = nn.Linear(hidden_dim, num_classes)
        # Heteroscedastic noise parameter (placeholder; unused in this sketch)
        self.log_noise = nn.Parameter(torch.zeros(1))
def forward(self, x):
features = self.feature_extractor(x)
logits = self.classifier(features)
return logits
    def predict_with_uncertainty(self, x, n_samples=50):
        """
        Prediction with uncertainty estimates.
        """
        self.train()  # enable dropout (this model has no BatchNorm)
logits_list = []
probs_list = []
with torch.no_grad():
for _ in range(n_samples):
logits = self.forward(x)
probs = F.softmax(logits, dim=-1)
logits_list.append(logits)
probs_list.append(probs)
logits = torch.stack(logits_list)
probs = torch.stack(probs_list)
        # Point estimates
mean_logits = logits.mean(dim=0)
mean_probs = probs.mean(dim=0)
pred_class = mean_logits.argmax(dim=-1)
        # Epistemic uncertainty (variance across dropout samples)
epistemic_var = probs.var(dim=0)
epistemic_std = epistemic_var.sqrt()
        # Predictive entropy
pred_entropy = -(mean_probs * (mean_probs + 1e-8).log()).sum(dim=-1)
        # Confidence (maximum predicted probability)
max_prob, _ = mean_probs.max(dim=-1)
return {
'pred_class': pred_class,
'mean_probs': mean_probs,
'epistemic_std': epistemic_std,
'pred_entropy': pred_entropy,
'confidence': max_prob,
'logits': logits,
'probs': probs
}
    def detect_out_of_distribution(self, x, threshold=0.5):
        """
        Detect out-of-distribution samples,
        using the maximum probability as the confidence score.
        """
        results = self.predict_with_uncertainty(x)
        # Low confidence → possibly OOD
is_ood = results['confidence'] < threshold
        return is_ood, results

7.3 A Complete Training Pipeline
def train_bayesian_model():
    """
    End-to-end example of training an uncertainty-aware model.
    """
    # Device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    # A small synthetic dataset (two moons) keeps the demo self-contained
    torch.manual_seed(42)
    from sklearn.datasets import make_moons
    X, y = make_moons(n_samples=1000, noise=0.1, random_state=42)
    X = torch.tensor(X, dtype=torch.float32)
    y = torch.tensor(y, dtype=torch.long)
    dataset = torch.utils.data.TensorDataset(X, y)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True)
    # Model
model = UncertaintyAwareClassifier(
input_dim=2,
hidden_dim=64,
num_classes=2,
dropout_rate=0.5
).to(device)
    # Optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    # Training loop
n_epochs = 100
for epoch in range(n_epochs):
model.train()
total_loss = 0
n_batches = 0
for x, y in dataloader:
x, y = x.to(device), y.to(device)
optimizer.zero_grad()
            # Forward pass
            logits = model(x)
            # Cross-entropy loss
            loss = F.cross_entropy(logits, y)
loss.backward()
optimizer.step()
total_loss += loss.item()
n_batches += 1
if (epoch + 1) % 20 == 0:
print(f"Epoch {epoch+1}: Loss = {total_loss/n_batches:.4f}")
    # Evaluate the uncertainty estimates
    model.eval()
    # Test inputs spread over (and beyond) the training region
X_test = torch.tensor(
np.random.uniform(-2, 3, size=(200, 2)),
dtype=torch.float32
)
results = model.predict_with_uncertainty(X_test.to(device), n_samples=100)
print("\n不确定性统计:")
print(f" 平均置信度: {results['confidence'].mean():.4f}")
print(f" 平均预测熵: {results['pred_entropy'].mean():.4f}")
print(f" 最大认知不确定性: {results['epistemic_std'].max():.4f}")
return model, results
# Run the example
if __name__ == "__main__":
    model, results = train_bayesian_model()

7.4 Uncertainty Calibration
class UncertaintyCalibrator:
    """
    Calibration of predictive uncertainty:
    checks that predicted confidence matches empirical accuracy.
    """
def __init__(self, n_bins=10):
self.n_bins = n_bins
self.bin_boundaries = np.linspace(0, 1, n_bins + 1)
    def calibrate(self, model, dataloader, n_samples=50):
        """
        Evaluate calibration: compute the ECE (Expected Calibration Error)
        and the data for a reliability diagram.
        """
device = next(model.parameters()).device
model.eval()
all_confidences = []
all_accuracies = []
all_predictions = []
all_true = []
with torch.no_grad():
for x, y in dataloader:
x, y = x.to(device), y.to(device)
results = model.predict_with_uncertainty(x, n_samples=n_samples)
all_confidences.append(results['confidence'].cpu())
all_accuracies.append(
(results['pred_class'] == y).float().cpu()
)
all_predictions.append(results['pred_class'].cpu())
all_true.append(y.cpu())
confidences = torch.cat(all_confidences)
accuracies = torch.cat(all_accuracies)
        # ECE
ece = self.expected_calibration_error(confidences, accuracies)
        # Reliability diagram
reliability_diagram = self.reliability_diagram(confidences, accuracies)
return {
'ece': ece,
'reliability_diagram': reliability_diagram,
'confidences': confidences,
'accuracies': accuracies
}
    def expected_calibration_error(self, confidences, accuracies):
        """
        ECE = Σ_b (|B_b| / n) · |acc(B_b) - conf(B_b)|
        """
        ece = 0.0
for i in range(self.n_bins):
bin_lower = self.bin_boundaries[i]
bin_upper = self.bin_boundaries[i + 1]
in_bin = (confidences > bin_lower) & (confidences <= bin_upper)
if in_bin.sum() > 0:
bin_confidence = confidences[in_bin].mean()
bin_accuracy = accuracies[in_bin].mean()
ece += in_bin.float().mean() * abs(bin_accuracy - bin_confidence)
return ece.item()
    def reliability_diagram(self, confidences, accuracies):
        """
        Collect per-bin accuracy/confidence data for a reliability diagram.
        """
bin_accuracies = []
bin_confidences = []
bin_counts = []
for i in range(self.n_bins):
bin_lower = self.bin_boundaries[i]
bin_upper = self.bin_boundaries[i + 1]
in_bin = (confidences > bin_lower) & (confidences <= bin_upper)
if in_bin.sum() > 0:
bin_accuracies.append(accuracies[in_bin].mean().item())
bin_confidences.append(confidences[in_bin].mean().item())
bin_counts.append(in_bin.sum().item())
else:
bin_accuracies.append(0)
bin_confidences.append((bin_lower + bin_upper) / 2)
bin_counts.append(0)
return {
'accuracies': bin_accuracies,
'confidences': bin_confidences,
'counts': bin_counts
        }

8. Connections to Recent Deep Learning Practice
8.1 Deep Ensembles and Bayesian Methods

Deep ensembles can be viewed as a practical approximation to a BNN:

$$p(y \mid x) \approx \frac{1}{M} \sum_{m=1}^{M} p(y \mid x, \theta_m)$$

where the $\theta_m$ are models trained independently from different random initializations.
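A brief construction sketch (make_model and train_model are hypothetical helpers) that feeds the DeepEnsemble wrapper defined below:

def build_ensemble(make_model, train_model, dataloader, m=5):
    """Train m independent members that differ only in their random seed."""
    models = []
    for seed in range(m):
        torch.manual_seed(seed)           # different initialization per member
        model = make_model()
        train_model(model, dataloader)    # ordinary (non-Bayesian) training
        models.append(model)
    return DeepEnsemble(models)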
class DeepEnsemble:
    """
    Deep ensemble:
    combines several independently trained models.
    """
def __init__(self, models):
self.models = models
def predict(self, x, return_all=False):
"""集成预测"""
predictions = []
for model in self.models:
model.eval()
with torch.no_grad():
pred = model(x)
predictions.append(pred)
predictions = torch.stack(predictions)
mean = predictions.mean(dim=0)
std = predictions.std(dim=0)
if return_all:
return mean, std, predictions
        return mean, std

8.2 Dropout and Attention
Dropout inside a Transformer:
class BayesianTransformerLayer(nn.Module):
    """
    Transformer layer with MC Dropout
    for uncertainty estimation.
    """
def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
super().__init__()
self.attention = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
self.dropout1 = nn.Dropout(dropout)
self.dropout2 = nn.Dropout(dropout)
self.norm1 = nn.LayerNorm(d_model)
self.norm2 = nn.LayerNorm(d_model)
self.ff = nn.Sequential(
nn.Linear(d_model, d_ff),
nn.GELU(),
nn.Dropout(dropout),
nn.Linear(d_ff, d_model)
)
    def forward(self, x, src_mask=None, use_dropout=True):
        # Self-attention
        attn_out, _ = self.attention(x, x, x, attn_mask=src_mask)
        # F.dropout so the flag also works in eval() mode
        # (the nn.Dropout modules would be identity there)
        attn_out = F.dropout(attn_out, p=self.dropout1.p,
                             training=use_dropout or self.training)
        x = self.norm1(x + attn_out)
        # Feed-forward block
        ff_out = self.ff(x)
        ff_out = F.dropout(ff_out, p=self.dropout2.p,
                           training=use_dropout or self.training)
        x = self.norm2(x + ff_out)
return x8.3 不确定性在 LLM 中的应用
大语言模型的不确定性:
- Token 级别的困惑度 → 词级别的不确定性
- 序列级别的熵 → 整体生成的不确定性
- Self-consistency 检验 → 认知不确定性的近似
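For the first two bullets, a minimal sketch that turns the logits of a decoding run into per-token entropies (logits is a hypothetical tensor of shape (seq_len, vocab_size); summing the entropies gives a sequence-level score):

def token_entropies(logits):
    """H_t = -Σ_v p_t(v) log p_t(v), one value per token position."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)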
class LLMPromptUncertainty:
    """
    Prompt-level uncertainty estimation for an LLM,
    via repeated sampling from the same prompt.
    """
def __init__(self, model, tokenizer, device='cuda'):
self.model = model
self.tokenizer = tokenizer
self.device = device
    def estimate_uncertainty(self, prompt, n_samples=10, max_length=50):
        """
        Estimate the uncertainty of a generation.
        Returns:
            responses: all sampled responses
            disagreement: fraction of samples that duplicate another sample
            mean/std of the per-sample log-probabilities
        """
responses = []
log_probs_list = []
for _ in range(n_samples):
            # Generate with sampling
inputs = self.tokenizer(prompt, return_tensors='pt').to(self.device)
with torch.no_grad():
outputs = self.model.generate(
**inputs,
max_length=max_length,
do_sample=True,
temperature=0.8,
top_p=0.9
)
response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
responses.append(response)
            # Score the sampled sequence by its average token log-probability
            log_prob = self.compute_log_prob(outputs)
            log_probs_list.append(log_prob)
        # Summary statistics
        responses_unique = list(set(responses))
        disagreement = 1 - len(responses_unique) / n_samples
mean_log_prob = np.mean(log_probs_list)
std_log_prob = np.std(log_probs_list)
return {
'responses': responses,
'disagreement': disagreement,
'mean_log_prob': mean_log_prob,
'std_log_prob': std_log_prob,
'n_unique': len(responses_unique)
}
    def compute_log_prob(self, outputs):
        """Average log-probability per token of a generated sequence
        (teacher-forcing the model on its own output)."""
        with torch.no_grad():
            result = self.model(outputs, labels=outputs)
        return -result.loss.item()  # loss is the mean token NLL, so negate

9. Summary and Connections
9.1 Key Takeaways

| Topic | Core idea |
|---|---|
| BNN definition | a posterior $p(w \mid \mathcal{D})$ over weights; predictions integrate over it |
| Mean-field VI | factorized Gaussian $q(w) = \prod_i \mathcal{N}(\mu_i, \sigma_i^2)$, fit by maximizing the ELBO |
| Bayes by Backprop | reparameterization + ELBO optimization |
| MC Dropout | dropout as a variational approximation |
| Epistemic uncertainty | model uncertainty; reducible with more data |
| Aleatoric uncertainty | data noise; irreducible |
9.2 Quick Method Comparison

| Method | Compute cost | Uncertainty quality | Implementation difficulty |
|---|---|---|---|
| Bayes by Backprop | high | high | medium |
| MC Dropout | medium | medium | low |
| Deep Ensembles | tunable | medium-high | low |
| SWAG | medium | medium-high | medium |
| HMC | very high | highest | high |
9.3 Links to Related Documents

| Related topic | Connection |
|---|---|
| Bayesian networks | foundations of probabilistic graphical models |
| Advanced variational inference | the theoretical VI framework |
| Neural variational inference | NVI methods |
| Bayes by Backprop | variational BNN training |
| MC Dropout | dropout-based uncertainty |
| Probabilistic circuits | tractable inference |
References

[^1]: Jospin, L. V., et al. (2020). "Hands-On Bayesian Neural Networks: A Tutorial for Deep Learning Users". arXiv:2007.06823.
[^2]: Blundell, C., et al. (2015). "Weight Uncertainty in Neural Networks". ICML 2015.
[^3]: Gal, Y., & Ghahramani, Z. (2016). "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning". ICML 2016.