概述
隐马尔可夫模型(Hidden Markov Model, HMM)是序列建模的经典概率图模型,在语音识别、自然语言处理等领域有广泛应用。近年来,深度学习与HMM的融合成为研究热点,既包括用神经网络参数化HMM组件,也包括探索Transformer与HMM的关系。12
HMM基础理论
模型定义
HMM由以下参数完全描述:
| 参数 | 含义 | 维度 |
|---|---|---|
| 初始状态概率 | ||
| 状态转移矩阵 | ||
| 发射概率矩阵 |
核心假设:
- 齐次马尔可夫性:
- 观测独立性:
HMM的概率计算
给定模型 和观测序列 ,计算观测概率 。
前向算法:
定义前向变量:
递归计算:
初始条件:
终止条件:
import numpy as np
def forward_algorithm(observations, pi, A, B):
"""
前向算法计算观测概率
Args:
observations: 观测序列 (T,)
pi: 初始概率 (N,)
A: 转移矩阵 (N, N)
B: 发射矩阵 (N, M)
Returns:
log_prob: 对数概率
alpha: 前向变量 (T, N)
"""
T = len(observations)
N = len(pi)
alpha = np.zeros((T, N))
log_prob = 0
# 初始化
alpha[0] = pi * B[:, observations[0]]
# 递归
for t in range(1, T):
for j in range(N):
alpha[t, j] = np.sum(alpha[t-1] * A[:, j]) * B[j, observations[t]]
# 终止
log_prob = np.log(np.sum(alpha[-1]) + 1e-300)
return log_prob, alpha后向算法:
定义后向变量:
递归计算:
def backward_algorithm(observations, pi, A, B):
"""
后向算法
"""
T = len(observations)
N = len(pi)
beta = np.zeros((T, N))
# 初始化
beta[-1] = 1
# 递归(反向)
for t in range(T - 2, -1, -1):
for i in range(N):
beta[t, i] = np.sum(
A[i, :] * B[:, observations[t + 1]] * beta[t + 1]
)
# 使用前向和后向计算概率
log_prob = np.log(np.sum(pi * B[:, observations[0]] * beta[0]) + 1e-300)
return log_prob, betaHMM参数学习
Baum-Welch算法(EM算法)
E步:计算在当前参数下,每个时刻处于各状态的概率
M步:更新参数
def baum_welch(observations, n_states, n_iter=100):
"""
Baum-Welch算法训练HMM
"""
T = len(observations)
# 随机初始化
pi = np.random.rand(n_states)
pi = pi / pi.sum()
A = np.random.rand(n_states, n_states)
A = A / A.sum(axis=1, keepdims=True)
# 发射概率(离散观测)
n_obs = max(observations) + 1
B = np.random.rand(n_states, n_obs)
B = B / B.sum(axis=1, keepdims=True)
for iteration in range(n_iter):
# E步
log_prob_fwd, alpha = forward_algorithm(observations, pi, A, B)
log_prob_bwd, beta = backward_algorithm(observations, pi, A, B)
# 计算gamma和xi
gamma = alpha * beta / (np.sum(alpha * beta, axis=1, keepdims=True) + 1e-300)
xi = np.zeros((T - 1, n_states, n_states))
for t in range(T - 1):
for i in range(n_states):
for j in range(n_states):
xi[t, i, j] = alpha[t, i] * A[i, j] * B[j, observations[t + 1]] * beta[t + 1, j]
xi[t] = xi[t] / (np.sum(xi[t]) + 1e-300)
# M步
pi = gamma[0]
A = np.sum(xi, axis=0) / (np.sum(gamma[:-1], axis=0, keepdims=True).T + 1e-300)
# 更新发射概率
B_new = np.zeros_like(B)
for j in range(n_states):
for k in range(n_obs):
mask = (np.array(observations) == k)
B_new[j, k] = np.sum(gamma[mask, j])
B = B_new / (np.sum(gamma, axis=0, keepdims=True).T + 1e-300)
return pi, A, BViterbi算法:最优路径求解
目标:找到最可能产生观测序列的隐状态序列
def viterbi(observations, pi, A, B):
"""
Viterbi算法:求解最优隐状态序列
Returns:
best_path: 最优路径
best_prob: 最优概率
"""
T = len(observations)
N = len(pi)
# 动态规划表
delta = np.zeros((T, N))
psi = np.zeros((T, N), dtype=int)
# 初始化
delta[0] = pi * B[:, observations[0]]
# 递归
for t in range(1, T):
for j in range(N):
trans_probs = delta[t - 1] * A[:, j]
psi[t, j] = np.argmax(trans_probs)
delta[t, j] = np.max(trans_probs) * B[j, observations[t]]
# 回溯
best_path = np.zeros(T, dtype=int)
best_path[T - 1] = np.argmax(delta[T - 1])
for t in range(T - 2, -1, -1):
best_path[t] = psi[t + 1, best_path[t + 1]]
best_prob = np.max(delta[T - 1])
return best_path, best_prob神经HMM:神经网络参数化
基本思想
用神经网络替代HMM的发射概率分布:
神经前向-后向算法
class NeuralHMM(nn.Module):
"""神经HMM:发射概率由神经网络参数化"""
def __init__(self, n_states, obs_dim, hidden_dim=128):
super().__init__()
self.n_states = n_states
# 转移矩阵(可学习)
self.transition = nn.Parameter(torch.randn(n_states, n_states))
self.transition_bias = nn.Parameter(torch.zeros(n_states))
# 发射网络
self.emission_net = nn.Sequential(
nn.Linear(obs_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, n_states)
)
# 初始概率
self.initial = nn.Parameter(torch.randn(n_states))
def forward(self, observations):
"""
计算对数似然
observations: (T, obs_dim)
"""
T = observations.shape[0]
# 发射概率
log_emission = F.log_softmax(self.emission_net(observations), dim=-1)
# 转移概率(对数域)
log_trans = F.log_softmax(self.transition, dim=-1)
# 初始概率
log_pi = F.log_softmax(self.initial, dim=-1)
# 前向算法
alpha = torch.zeros(T, self.n_states)
alpha[0] = log_pi + log_emission[0]
for t in range(1, T):
# log-sum-exp技巧
log_alpha_prev = alpha[t - 1].unsqueeze(1) # (N, 1)
log_transposed = log_trans.t() # (N, N)
sum_log = log_alpha_prev + log_transposed + log_emission[t]
alpha[t] = torch.logsumexp(sum_log, dim=0)
log_likelihood = torch.logsumexp(alpha[-1], dim=-1)
return log_likelihood, alpha连续发射分布
对于连续观测,使用混合高斯或神经网络解码:
class GaussianNeuralHMM(nn.Module):
"""高斯发射的神经HMM"""
def __init__(self, n_states, obs_dim):
super().__init__()
self.n_states = n_states
# 每个状态的高斯参数
self.means = nn.Parameter(torch.randn(n_states, obs_dim))
self.log_vars = nn.Parameter(torch.zeros(n_states, obs_dim))
# 转移矩阵
self.transition = nn.Parameter(torch.randn(n_states, n_states))
def emission_log_prob(self, obs, state):
"""计算单个状态下的发射对数概率"""
mean = self.means[state]
log_var = self.log_vars[state]
var = torch.exp(log_var)
log_prob = -0.5 * (
torch.log(2 * np.pi * var) +
(obs - mean) ** 2 / var
).sum(dim=-1)
return log_prob
def forward(self, observations):
"""使用动态规划计算对数似然"""
T = observations.shape[0]
# 发射概率矩阵
log_B = torch.zeros(T, self.n_states)
for t in range(T):
for s in range(self.n_states):
log_B[t, s] = self.emission_log_prob(observations[t], s)
# 前向传递
log_trans = F.log_softmax(self.transition, dim=-1)
log_pi = F.log_softmax(self.transition[:, 0], dim=0) # 简化的初始概率
alpha = torch.zeros(T, self.n_states)
alpha[0] = log_pi + log_B[0]
for t in range(1, T):
sum_log = alpha[t - 1].unsqueeze(1) + log_trans.t()
alpha[t] = torch.logsumexp(sum_log, dim=0) + log_B[t]
return torch.logsumexp(alpha[-1], dim=-1)Transformer vs HMM:表达能力对比
关键发现3
研究问题:Transformer能否学习基本的序列模型(如HMM)?
实验设计:在各类HMM数据和变体上测试Transformer与RNN的表现。
主要结论
| 发现 | 描述 |
|---|---|
| 深度需求 | Transformer需要至少对数深度的层数才能有效学习HMM |
| 训练速度 | RNN在所有测试的HMM上始终优于Transformer |
| 长记忆 | Transformer在长混合时间和无法访问中间隐状态时表现下降 |
| 困难案例 | 存在Transformer难以学习但RNN能成功学习的HMM |
理论分析
表达能力定理:
给定序列长度 ,深度为 的Transformer可以以 的深度逼近任意HMM的转移概率。
Transformer的归纳偏置:
- 全局注意力:适合建模长程依赖
- 并行计算:训练高效但需要更多深度
- 位置编码:需要显式注入位置信息
RNN/HMM的归纳偏置:
- 马尔可夫假设:天然的局部依赖建模
- 递归结构:参数共享,适合长序列
- 隐状态:压缩的历史表示
Block CoT改进
为改进Transformer对HMM的学习,研究者提出Block Chain-of-Thought:
class BlockCoTTransformer(nn.Module):
"""Block CoT增强的Transformer"""
def __init__(self, d_model, n_heads, n_layers):
super().__init__()
self.transformer = nn.TransformerEncoder(
nn.TransformerEncoderLayer(d_model, n_heads),
num_layers=n_layers
)
def forward_with_block_cot(self, x, block_size=8):
"""
Block CoT: 将序列分成块,在训练时提供块级别的中间状态
"""
T = x.shape[0]
outputs = []
for start in range(0, T, block_size):
end = min(start + block_size, T)
block = x[start:end]
# 处理当前块
out = self.transformer(block)
outputs.append(out)
# 为下一个块提供上下文
if end < T:
x[end:end+block_size] = x[end:end+block_size] + out[-1].unsqueeze(0)
return torch.cat(outputs, dim=0)Belief Net:滤波即神经网络
核心思想4
Belief Net提出将HMM的滤波算法框架化为神经网络:
模型参数化:
- 初始分布的对数
- 转移矩阵的对数
- 发射矩阵的对数
神经网络滤波:
class BeliefNet(nn.Module):
"""
Belief Net: 将HMM滤波转化为神经网络
权重直接对应HMM参数
"""
def __init__(self, n_states, obs_dim):
super().__init__()
self.n_states = n_states
# HMM参数作为可学习权重
self.log_pi = nn.Parameter(torch.zeros(n_states))
self.log_A = nn.Parameter(torch.zeros(n_states, n_states))
self.log_B = nn.Parameter(torch.randn(n_states, obs_dim))
def filter(self, observations):
"""
神经网络形式的滤波
observations: (T, obs_dim)
"""
T = observations.shape[0]
beliefs = []
# 初始化
log_alpha = self.log_pi + self._emission_log_prob(observations[0])
beliefs.append(F.softmax(log_alpha, dim=-1))
# 递归
for t in range(1, T):
# 预测步
log_pred = torch.logsumexp(
log_alpha.unsqueeze(1) + self.log_A.t(), dim=0
)
# 更新步
log_unnorm = log_pred + self._emission_log_prob(observations[t])
log_alpha = log_unnorm - torch.logsumexp(log_unnorm, dim=-1)
beliefs.append(F.softmax(log_alpha, dim=-1))
return torch.stack(beliefs)
def _emission_log_prob(self, obs):
"""发射概率的对数"""
return F.log_softmax(self.log_B, dim=-1) @ obs
def decode(self, observations):
"""Viterbi解码"""
T = observations.shape[0]
delta = torch.zeros(T, self.n_states)
psi = torch.zeros(T, self.n_states, dtype=torch.long)
# 初始化
delta[0] = self.log_pi + self._emission_log_prob(observations[0])
# 递归
for t in range(1, T):
trans = delta[t-1].unsqueeze(1) + self.log_A.t()
psi[t] = torch.argmax(trans, dim=1)
delta[t] = torch.max(trans, dim=1)[0] + self._emission_log_prob(observations[t])
# 回溯
path = torch.zeros(T, dtype=torch.long)
path[-1] = torch.argmax(delta[-1])
for t in range(T-2, -1, -1):
path[t] = psi[t+1, path[t+1]]
return path优势:
- 参数可解释:权重直接对应HMM参数
- 端到端训练:可结合下游任务
- 梯度优化:避免EM算法的局部最优
隐马尔可夫语言模型
扩展到语言建模
将HMM扩展到词级别序列:
参数:
- 词嵌入层替代发射矩阵
- 神经网络预测发射概率
class HMM LanguageModel(nn.Module):
"""HMM语言模型"""
def __init__(self, vocab_size, n_states, embed_dim=256):
super().__init__()
self.n_states = n_states
self.embed = nn.Embedding(vocab_size, embed_dim)
# 状态转移(可学习)
self.transition = nn.Parameter(torch.randn(n_states, n_states))
# 发射网络
self.emission = nn.Sequential(
nn.Linear(embed_dim, embed_dim),
nn.ReLU(),
nn.Linear(embed_dim, vocab_size)
)
# 初始状态嵌入
self.initial_proj = nn.Linear(embed_dim, n_states)
def forward(self, input_ids, target_ids):
"""
计算语言模型损失
input_ids: (B, T)
target_ids: (B, T)
"""
B, T = input_ids.shape
# 嵌入
embeddings = self.embed(input_ids) # (B, T, D)
# 简化的前向计算
log_probs = []
for t in range(T - 1):
# 发射概率
emit = F.log_softmax(self.emission(embeddings[:, t]), dim=-1)
log_probs.append(emit)
return -torch.stack(log_probs, dim=1)HMM在深度学习时代的应用
1. 序列标注
class HMMForSequenceLabeling(nn.Module):
"""HMM用于序列标注(POS tagging, NER等)"""
def __init__(self, n_tags, embed_dim=256):
super().__init__()
self.n_tags = n_tags
self.embed = nn.Embedding(vocab_size, embed_dim)
# HMM参数
self.transition = nn.Parameter(torch.randn(n_tags, n_tags))
self.emission = nn.Parameter(torch.randn(n_tags, embed_dim))
def viterbi_decode(self, emissions):
"""Viterbi解码"""
B, T, N = emissions.shape
# 动态规划
best_paths = []
for b in range(B):
viterbi = torch.zeros(T, N)
backpointers = torch.zeros(T, N, dtype=torch.long)
# 初始化
viterbi[0] = emissions[b, 0]
# 递归
for t in range(1, T):
trans = viterbi[t-1].unsqueeze(1) + self.transition
backpointers[t] = torch.argmax(trans, dim=1)
viterbi[t] = torch.max(trans, dim=1)[0] + emissions[b, t]
# 回溯
best_path = [torch.argmax(viterbi[-1])]
for t in range(T-1, 0, -1):
best_path.insert(0, backpointers[t, best_path[0]])
best_paths.append(torch.stack(best_path))
return torch.stack(best_paths)2. 语音识别
HMM在传统语音识别中占主导地位,现代系统将HMM与深度学习结合:
class DeepSpeechHybrid(nn.Module):
"""深度学习混合语音识别"""
def __init__(self, n_phones):
super().__init__()
# 声学模型(深度学习)
self.acoustic = nn.Sequential(
nn.Conv1d(40, 512, 5, padding=2),
nn.ReLU(),
nn.BatchNorm1d(512),
nn.GRU(512, 512, num_layers=3, batch_first=True),
nn.Linear(512, n_phones * 3) # 三音素状态
)
def forward(self, mel_features):
"""提取发射概率"""
# CNN + RNN处理声学特征
x = self.acoustic(mel_features)
# 输出HMM状态级别的发射概率
return F.log_softmax(x, dim=-1)总结
| 方法 | 优势 | 劣势 |
|---|---|---|
| 经典HMM | 可解释、参数少 | 表达能力有限 |
| 神经HMM | 灵活、端到端 | 失去可解释性 |
| Transformer | 全局建模、高并行 | 需要更多深度 |
| RNN/LSTM | 序列压缩、局部建模 | 训练困难 |
发展趋势:
- 混合架构:HMM结构 + 神经网络参数化
- 层次化HMM:多尺度序列建模
- 神经-符号融合:HMM的离散结构与神经网络结合
参考资料
相关阅读
Footnotes
-
Tran, K., et al. (2016). “Unsupervised Neural Hidden Markov Models”. Workshop on Structured Prediction for NLP. ↩
-
Deng, R., et al. (2024). “On Limitation of Transformer for Learning HMMs”. ICLR 2025. ↩
-
OpenReview. “On Limitation of Transformer for Learning HMMs”. https://openreview.net/forum?id=b5lXUwZiD3 ↩
-
arXiv:2511.10571. “Belief Net: A Filter-Based Framework for Learning Hidden Markov Models from Observations”. ↩