概述

隐马尔可夫模型(Hidden Markov Model, HMM)是序列建模的经典概率图模型,在语音识别、自然语言处理等领域有广泛应用。近年来,深度学习与HMM的融合成为研究热点,既包括用神经网络参数化HMM组件,也包括探索Transformer与HMM的关系。12


HMM基础理论

模型定义

HMM由以下参数完全描述:

参数含义维度
初始状态概率
状态转移矩阵
发射概率矩阵

核心假设

  1. 齐次马尔可夫性
  2. 观测独立性

HMM的概率计算

给定模型 和观测序列 ,计算观测概率

前向算法

定义前向变量:

递归计算

初始条件

终止条件

import numpy as np
 
def forward_algorithm(observations, pi, A, B):
    """
    前向算法计算观测概率
    
    Args:
        observations: 观测序列 (T,)
        pi: 初始概率 (N,)
        A: 转移矩阵 (N, N)
        B: 发射矩阵 (N, M)
    Returns:
        log_prob: 对数概率
        alpha: 前向变量 (T, N)
    """
    T = len(observations)
    N = len(pi)
    
    alpha = np.zeros((T, N))
    log_prob = 0
    
    # 初始化
    alpha[0] = pi * B[:, observations[0]]
    
    # 递归
    for t in range(1, T):
        for j in range(N):
            alpha[t, j] = np.sum(alpha[t-1] * A[:, j]) * B[j, observations[t]]
    
    # 终止
    log_prob = np.log(np.sum(alpha[-1]) + 1e-300)
    
    return log_prob, alpha

后向算法

定义后向变量:

递归计算

def backward_algorithm(observations, pi, A, B):
    """
    后向算法
    """
    T = len(observations)
    N = len(pi)
    
    beta = np.zeros((T, N))
    
    # 初始化
    beta[-1] = 1
    
    # 递归(反向)
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t, i] = np.sum(
                A[i, :] * B[:, observations[t + 1]] * beta[t + 1]
            )
    
    # 使用前向和后向计算概率
    log_prob = np.log(np.sum(pi * B[:, observations[0]] * beta[0]) + 1e-300)
    
    return log_prob, beta

HMM参数学习

Baum-Welch算法(EM算法)

E步:计算在当前参数下,每个时刻处于各状态的概率

M步:更新参数

def baum_welch(observations, n_states, n_iter=100):
    """
    Baum-Welch算法训练HMM
    """
    T = len(observations)
    
    # 随机初始化
    pi = np.random.rand(n_states)
    pi = pi / pi.sum()
    
    A = np.random.rand(n_states, n_states)
    A = A / A.sum(axis=1, keepdims=True)
    
    # 发射概率(离散观测)
    n_obs = max(observations) + 1
    B = np.random.rand(n_states, n_obs)
    B = B / B.sum(axis=1, keepdims=True)
    
    for iteration in range(n_iter):
        # E步
        log_prob_fwd, alpha = forward_algorithm(observations, pi, A, B)
        log_prob_bwd, beta = backward_algorithm(observations, pi, A, B)
        
        # 计算gamma和xi
        gamma = alpha * beta / (np.sum(alpha * beta, axis=1, keepdims=True) + 1e-300)
        
        xi = np.zeros((T - 1, n_states, n_states))
        for t in range(T - 1):
            for i in range(n_states):
                for j in range(n_states):
                    xi[t, i, j] = alpha[t, i] * A[i, j] * B[j, observations[t + 1]] * beta[t + 1, j]
            xi[t] = xi[t] / (np.sum(xi[t]) + 1e-300)
        
        # M步
        pi = gamma[0]
        A = np.sum(xi, axis=0) / (np.sum(gamma[:-1], axis=0, keepdims=True).T + 1e-300)
        
        # 更新发射概率
        B_new = np.zeros_like(B)
        for j in range(n_states):
            for k in range(n_obs):
                mask = (np.array(observations) == k)
                B_new[j, k] = np.sum(gamma[mask, j])
        B = B_new / (np.sum(gamma, axis=0, keepdims=True).T + 1e-300)
    
    return pi, A, B

Viterbi算法:最优路径求解

目标:找到最可能产生观测序列的隐状态序列

def viterbi(observations, pi, A, B):
    """
    Viterbi算法:求解最优隐状态序列
    
    Returns:
        best_path: 最优路径
        best_prob: 最优概率
    """
    T = len(observations)
    N = len(pi)
    
    # 动态规划表
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    
    # 初始化
    delta[0] = pi * B[:, observations[0]]
    
    # 递归
    for t in range(1, T):
        for j in range(N):
            trans_probs = delta[t - 1] * A[:, j]
            psi[t, j] = np.argmax(trans_probs)
            delta[t, j] = np.max(trans_probs) * B[j, observations[t]]
    
    # 回溯
    best_path = np.zeros(T, dtype=int)
    best_path[T - 1] = np.argmax(delta[T - 1])
    
    for t in range(T - 2, -1, -1):
        best_path[t] = psi[t + 1, best_path[t + 1]]
    
    best_prob = np.max(delta[T - 1])
    
    return best_path, best_prob

神经HMM:神经网络参数化

基本思想

用神经网络替代HMM的发射概率分布:

神经前向-后向算法

class NeuralHMM(nn.Module):
    """神经HMM:发射概率由神经网络参数化"""
    def __init__(self, n_states, obs_dim, hidden_dim=128):
        super().__init__()
        self.n_states = n_states
        
        # 转移矩阵(可学习)
        self.transition = nn.Parameter(torch.randn(n_states, n_states))
        self.transition_bias = nn.Parameter(torch.zeros(n_states))
        
        # 发射网络
        self.emission_net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_states)
        )
        
        # 初始概率
        self.initial = nn.Parameter(torch.randn(n_states))
    
    def forward(self, observations):
        """
        计算对数似然
        observations: (T, obs_dim)
        """
        T = observations.shape[0]
        
        # 发射概率
        log_emission = F.log_softmax(self.emission_net(observations), dim=-1)
        
        # 转移概率(对数域)
        log_trans = F.log_softmax(self.transition, dim=-1)
        
        # 初始概率
        log_pi = F.log_softmax(self.initial, dim=-1)
        
        # 前向算法
        alpha = torch.zeros(T, self.n_states)
        alpha[0] = log_pi + log_emission[0]
        
        for t in range(1, T):
            # log-sum-exp技巧
            log_alpha_prev = alpha[t - 1].unsqueeze(1)  # (N, 1)
            log_transposed = log_trans.t()  # (N, N)
            sum_log = log_alpha_prev + log_transposed + log_emission[t]
            alpha[t] = torch.logsumexp(sum_log, dim=0)
        
        log_likelihood = torch.logsumexp(alpha[-1], dim=-1)
        return log_likelihood, alpha

连续发射分布

对于连续观测,使用混合高斯或神经网络解码:

class GaussianNeuralHMM(nn.Module):
    """高斯发射的神经HMM"""
    def __init__(self, n_states, obs_dim):
        super().__init__()
        self.n_states = n_states
        
        # 每个状态的高斯参数
        self.means = nn.Parameter(torch.randn(n_states, obs_dim))
        self.log_vars = nn.Parameter(torch.zeros(n_states, obs_dim))
        
        # 转移矩阵
        self.transition = nn.Parameter(torch.randn(n_states, n_states))
    
    def emission_log_prob(self, obs, state):
        """计算单个状态下的发射对数概率"""
        mean = self.means[state]
        log_var = self.log_vars[state]
        var = torch.exp(log_var)
        
        log_prob = -0.5 * (
            torch.log(2 * np.pi * var) + 
            (obs - mean) ** 2 / var
        ).sum(dim=-1)
        return log_prob
    
    def forward(self, observations):
        """使用动态规划计算对数似然"""
        T = observations.shape[0]
        
        # 发射概率矩阵
        log_B = torch.zeros(T, self.n_states)
        for t in range(T):
            for s in range(self.n_states):
                log_B[t, s] = self.emission_log_prob(observations[t], s)
        
        # 前向传递
        log_trans = F.log_softmax(self.transition, dim=-1)
        log_pi = F.log_softmax(self.transition[:, 0], dim=0)  # 简化的初始概率
        
        alpha = torch.zeros(T, self.n_states)
        alpha[0] = log_pi + log_B[0]
        
        for t in range(1, T):
            sum_log = alpha[t - 1].unsqueeze(1) + log_trans.t()
            alpha[t] = torch.logsumexp(sum_log, dim=0) + log_B[t]
        
        return torch.logsumexp(alpha[-1], dim=-1)

Transformer vs HMM:表达能力对比

关键发现3

研究问题:Transformer能否学习基本的序列模型(如HMM)?

实验设计:在各类HMM数据和变体上测试Transformer与RNN的表现。

主要结论

发现描述
深度需求Transformer需要至少对数深度的层数才能有效学习HMM
训练速度RNN在所有测试的HMM上始终优于Transformer
长记忆Transformer在长混合时间和无法访问中间隐状态时表现下降
困难案例存在Transformer难以学习但RNN能成功学习的HMM

理论分析

表达能力定理

给定序列长度 ,深度为 的Transformer可以以 的深度逼近任意HMM的转移概率。

Transformer的归纳偏置

  1. 全局注意力:适合建模长程依赖
  2. 并行计算:训练高效但需要更多深度
  3. 位置编码:需要显式注入位置信息

RNN/HMM的归纳偏置

  1. 马尔可夫假设:天然的局部依赖建模
  2. 递归结构:参数共享,适合长序列
  3. 隐状态:压缩的历史表示

Block CoT改进

为改进Transformer对HMM的学习,研究者提出Block Chain-of-Thought:

class BlockCoTTransformer(nn.Module):
    """Block CoT增强的Transformer"""
    def __init__(self, d_model, n_heads, n_layers):
        super().__init__()
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads),
            num_layers=n_layers
        )
    
    def forward_with_block_cot(self, x, block_size=8):
        """
        Block CoT: 将序列分成块,在训练时提供块级别的中间状态
        """
        T = x.shape[0]
        outputs = []
        
        for start in range(0, T, block_size):
            end = min(start + block_size, T)
            block = x[start:end]
            
            # 处理当前块
            out = self.transformer(block)
            outputs.append(out)
            
            # 为下一个块提供上下文
            if end < T:
                x[end:end+block_size] = x[end:end+block_size] + out[-1].unsqueeze(0)
        
        return torch.cat(outputs, dim=0)

Belief Net:滤波即神经网络

核心思想4

Belief Net提出将HMM的滤波算法框架化为神经网络:

模型参数化

  • 初始分布的对数
  • 转移矩阵的对数
  • 发射矩阵的对数

神经网络滤波

class BeliefNet(nn.Module):
    """
    Belief Net: 将HMM滤波转化为神经网络
    权重直接对应HMM参数
    """
    def __init__(self, n_states, obs_dim):
        super().__init__()
        self.n_states = n_states
        
        # HMM参数作为可学习权重
        self.log_pi = nn.Parameter(torch.zeros(n_states))
        self.log_A = nn.Parameter(torch.zeros(n_states, n_states))
        self.log_B = nn.Parameter(torch.randn(n_states, obs_dim))
    
    def filter(self, observations):
        """
        神经网络形式的滤波
        observations: (T, obs_dim)
        """
        T = observations.shape[0]
        beliefs = []
        
        # 初始化
        log_alpha = self.log_pi + self._emission_log_prob(observations[0])
        beliefs.append(F.softmax(log_alpha, dim=-1))
        
        # 递归
        for t in range(1, T):
            # 预测步
            log_pred = torch.logsumexp(
                log_alpha.unsqueeze(1) + self.log_A.t(), dim=0
            )
            
            # 更新步
            log_unnorm = log_pred + self._emission_log_prob(observations[t])
            log_alpha = log_unnorm - torch.logsumexp(log_unnorm, dim=-1)
            
            beliefs.append(F.softmax(log_alpha, dim=-1))
        
        return torch.stack(beliefs)
    
    def _emission_log_prob(self, obs):
        """发射概率的对数"""
        return F.log_softmax(self.log_B, dim=-1) @ obs
    
    def decode(self, observations):
        """Viterbi解码"""
        T = observations.shape[0]
        delta = torch.zeros(T, self.n_states)
        psi = torch.zeros(T, self.n_states, dtype=torch.long)
        
        # 初始化
        delta[0] = self.log_pi + self._emission_log_prob(observations[0])
        
        # 递归
        for t in range(1, T):
            trans = delta[t-1].unsqueeze(1) + self.log_A.t()
            psi[t] = torch.argmax(trans, dim=1)
            delta[t] = torch.max(trans, dim=1)[0] + self._emission_log_prob(observations[t])
        
        # 回溯
        path = torch.zeros(T, dtype=torch.long)
        path[-1] = torch.argmax(delta[-1])
        for t in range(T-2, -1, -1):
            path[t] = psi[t+1, path[t+1]]
        
        return path

优势

  1. 参数可解释:权重直接对应HMM参数
  2. 端到端训练:可结合下游任务
  3. 梯度优化:避免EM算法的局部最优

隐马尔可夫语言模型

扩展到语言建模

将HMM扩展到词级别序列:

参数

  • 词嵌入层替代发射矩阵
  • 神经网络预测发射概率
class HMM LanguageModel(nn.Module):
    """HMM语言模型"""
    def __init__(self, vocab_size, n_states, embed_dim=256):
        super().__init__()
        self.n_states = n_states
        self.embed = nn.Embedding(vocab_size, embed_dim)
        
        # 状态转移(可学习)
        self.transition = nn.Parameter(torch.randn(n_states, n_states))
        
        # 发射网络
        self.emission = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, vocab_size)
        )
        
        # 初始状态嵌入
        self.initial_proj = nn.Linear(embed_dim, n_states)
    
    def forward(self, input_ids, target_ids):
        """
        计算语言模型损失
        input_ids: (B, T)
        target_ids: (B, T)
        """
        B, T = input_ids.shape
        
        # 嵌入
        embeddings = self.embed(input_ids)  # (B, T, D)
        
        # 简化的前向计算
        log_probs = []
        for t in range(T - 1):
            # 发射概率
            emit = F.log_softmax(self.emission(embeddings[:, t]), dim=-1)
            log_probs.append(emit)
        
        return -torch.stack(log_probs, dim=1)

HMM在深度学习时代的应用

1. 序列标注

class HMMForSequenceLabeling(nn.Module):
    """HMM用于序列标注(POS tagging, NER等)"""
    def __init__(self, n_tags, embed_dim=256):
        super().__init__()
        self.n_tags = n_tags
        self.embed = nn.Embedding(vocab_size, embed_dim)
        
        # HMM参数
        self.transition = nn.Parameter(torch.randn(n_tags, n_tags))
        self.emission = nn.Parameter(torch.randn(n_tags, embed_dim))
    
    def viterbi_decode(self, emissions):
        """Viterbi解码"""
        B, T, N = emissions.shape
        
        # 动态规划
        best_paths = []
        for b in range(B):
            viterbi = torch.zeros(T, N)
            backpointers = torch.zeros(T, N, dtype=torch.long)
            
            # 初始化
            viterbi[0] = emissions[b, 0]
            
            # 递归
            for t in range(1, T):
                trans = viterbi[t-1].unsqueeze(1) + self.transition
                backpointers[t] = torch.argmax(trans, dim=1)
                viterbi[t] = torch.max(trans, dim=1)[0] + emissions[b, t]
            
            # 回溯
            best_path = [torch.argmax(viterbi[-1])]
            for t in range(T-1, 0, -1):
                best_path.insert(0, backpointers[t, best_path[0]])
            
            best_paths.append(torch.stack(best_path))
        
        return torch.stack(best_paths)

2. 语音识别

HMM在传统语音识别中占主导地位,现代系统将HMM与深度学习结合:

class DeepSpeechHybrid(nn.Module):
    """深度学习混合语音识别"""
    def __init__(self, n_phones):
        super().__init__()
        # 声学模型(深度学习)
        self.acoustic = nn.Sequential(
            nn.Conv1d(40, 512, 5, padding=2),
            nn.ReLU(),
            nn.BatchNorm1d(512),
            nn.GRU(512, 512, num_layers=3, batch_first=True),
            nn.Linear(512, n_phones * 3)  # 三音素状态
        )
    
    def forward(self, mel_features):
        """提取发射概率"""
        # CNN + RNN处理声学特征
        x = self.acoustic(mel_features)
        # 输出HMM状态级别的发射概率
        return F.log_softmax(x, dim=-1)

总结

方法优势劣势
经典HMM可解释、参数少表达能力有限
神经HMM灵活、端到端失去可解释性
Transformer全局建模、高并行需要更多深度
RNN/LSTM序列压缩、局部建模训练困难

发展趋势

  1. 混合架构:HMM结构 + 神经网络参数化
  2. 层次化HMM:多尺度序列建模
  3. 神经-符号融合:HMM的离散结构与神经网络结合

参考资料


相关阅读

Footnotes

  1. Tran, K., et al. (2016). “Unsupervised Neural Hidden Markov Models”. Workshop on Structured Prediction for NLP.

  2. Deng, R., et al. (2024). “On Limitation of Transformer for Learning HMMs”. ICLR 2025.

  3. OpenReview. “On Limitation of Transformer for Learning HMMs”. https://openreview.net/forum?id=b5lXUwZiD3

  4. arXiv:2511.10571. “Belief Net: A Filter-Based Framework for Learning Hidden Markov Models from Observations”.