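Vanilla RNN: In-Depth Theory and Gradient Analysis

Overview

Recurrent neural networks (RNNs) are a classic architecture for sequence data. Their training, however, has been troubled by vanishing and exploding gradients for more than thirty years. This document covers:

The rigorous mathematical form of the Elman RNN
The full derivation of BPTT
The formal theory of vanishing and exploding gradients: Jacobian spectral analysis
New results from 2024-2025:
- Zucchet & Orvieto: vanishing and exploding gradients are not the end of the story
- Livi: learnability-window theory for gated RNNs
- Cayci & Eryilmaz: nonasymptotic convergence analysis of gradient descent
- Local Representation Alignment RNNs
The attractor perspective (Ribeiro et al.)
Practical recommendations

Understanding the gradient behavior of RNNs is the basis for understanding LSTM, GRU, and modern SSMs such as Mamba.¹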

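1. Mathematical Form of the Elman RNN

1.1 Basic Equations

For an input sequence $x_{1}, x_{2}, \dots, x_{T}$, the vanilla RNN update rule is:

$$h_{t} = \sigma_{h}(W_{hh} h_{t-1} + W_{xh} x_{t} + b_{h})$$

$$y_{t} = \sigma_{y}(W_{hy} h_{t} + b_{y})$$

where:

$h_{t} \in \mathbb{R}^{n}$: hidden state at step $t$
$x_{t} \in \mathbb{R}^{d}$: input at step $t$
$y_{t} \in \mathbb{R}^{k}$: output at step $t$
$W_{hh} \in \mathbb{R}^{n \times n}$, $W_{xh} \in \mathbb{R}^{n \times d}$, $W_{hy} \in \mathbb{R}^{k \times n}$: weight matrices
$b_{h} \in \mathbb{R}^{n}$, $b_{y} \in \mathbb{R}^{k}$: bias vectors
$\sigma_{h}$, $\sigma_{y}$: activation functions (typically $\sigma_{h} = \tanh$, $\sigma_{y} = \mathrm{softmax}$)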
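As a minimal sketch of the update rule above, the following NumPy snippet runs a few steps with $\sigma_{h} = \tanh$ and $\sigma_{y} = \mathrm{softmax}$. The sizes, the random weights, and the helper names `softmax` and `elman_step` are assumptions chosen for illustration only.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over the last axis."""
    e = np.exp(v - v.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def elman_step(h_prev, x_t, W_hh, W_xh, W_hy, b_h, b_y):
    """One step of the update rule: returns (h_t, y_t)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    y_t = softmax(W_hy @ h_t + b_y)
    return h_t, y_t

# Illustrative sizes (assumptions): n = 4 hidden units, d = 3 inputs, k = 2 outputs.
n, d, k = 4, 3, 2
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.5, size=(n, n))
W_xh = rng.normal(scale=0.5, size=(n, d))
W_hy = rng.normal(scale=0.5, size=(k, n))
b_h, b_y = np.zeros(n), np.zeros(k)

h = np.zeros(n)                      # h_0
for x_t in rng.normal(size=(5, d)):  # a length-5 input sequence
    h, y = elman_step(h, x_t, W_hh, W_xh, W_hy, b_h, b_y)

print(h.shape, y.shape, y.sum())
```

The same weights are reused at every step, which is what later makes the gradient with respect to $W_{hh}$ a sum over all time steps.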

1.2 Simplified Form

Let $u_{t} = W_{xh} x_{t} + b_{h}$. Then:

$$h_{t} = \sigma_{h}(W_{hh} h_{t-1} + u_{t})$$

Unrolling:

$$h_{1} = \sigma_{h}(W_{hh} h_{0} + u_{1})$$

$$h_{2} = \sigma_{h}(W_{hh} \sigma_{h}(W_{hh} h_{0} + u_{1}) + u_{2}) \approx \sigma_{h}(W_{hh}^{2} h_{0} + W_{hh} u_{1} + u_{2})$$

More generally:

$$h_{t} \approx \sigma_{h}\left(W_{hh}^{t} h_{0} + \sum_{i=1}^{t} W_{hh}^{t-i} u_{i}\right)$$

(The approximation ignores the activation functions applied at the inner steps.)
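The unrolled expression can be checked numerically for the linear part, that is, for the argument of the outer $\sigma_{h}$ with all inner activations dropped. The sketch below compares the step-by-step recurrence with the closed-form sum; the sizes and random values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, steps = 3, 6
W = rng.normal(scale=0.4, size=(n, n))
h0 = rng.normal(size=n)
u = rng.normal(size=(steps, n))      # u[i - 1] plays the role of u_i

# Step-by-step recurrence with the activation dropped: h_t = W h_{t-1} + u_t
h = h0.copy()
for t in range(steps):
    h = W @ h + u[t]

# Closed form: W^steps h_0 + sum_{i=1}^{steps} W^(steps - i) u_i
closed = np.linalg.matrix_power(W, steps) @ h0
for i in range(1, steps + 1):
    closed = closed + np.linalg.matrix_power(W, steps - i) @ u[i - 1]

max_abs_diff = np.abs(h - closed).max()
print(max_abs_diff)
```

The powers of $W_{hh}$ in the closed form are the same powers that reappear in the gradient, which is why the spectrum of $W_{hh}$ controls how far back information and gradients reach.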

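1.3 State-Space View

A vanilla RNN is essentially a nonlinear state-space model:

$$h_{t} = f(h_{t-1}, x_{t}), \qquad y_{t} = g(h_{t})$$

where $f$ is the nonlinear state-transition function and $g$ is the readout. In the notation of 1.1, $f(h, x) = \sigma_{h}(W_{hh} h + W_{xh} x + b_{h})$ and $g(h) = \sigma_{y}(W_{hy} h + b_{y})$.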

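2. Backpropagation Through Time (BPTT)

2.1 Loss Function

For sequence-to-sequence tasks, the total loss is the sum of the per-step losses:

$$L = \sum_{t=1}^{T} L_{t}(y_{t}, \hat{y}_{t})$$

2.2 Gradient Derivation

The key difficulty: $\frac{\partial L}{\partial W_{hh}}$ depends on every time step, because $W_{hh}$ is reused at each step.

Chain rule:

$$\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \sum_{k=1}^{t} \frac{\partial L_{t}}{\partial h_{t}} \frac{\partial h_{t}}{\partial h_{k}} \frac{\partial^{+} h_{k}}{\partial W_{hh}}$$

where $\frac{\partial^{+} h_{k}}{\partial W_{hh}}$ is the immediate partial derivative: $h_{k-1}$ is treated as a constant, so the dependence of earlier states on $W_{hh}$ is not propagated through it.

2.3 Gradients Between Time Steps

$$\frac{\partial h_{t}}{\partial h_{k}} = \prod_{i=k+1}^{t} \frac{\partial h_{i}}{\partial h_{i-1}} = \prod_{i=k+1}^{t} W_{hh}^{\top} \mathrm{diag}(\sigma_{h}'(z_{i}))$$

where $z_{i} = W_{hh} h_{i-1} + W_{xh} x_{i} + b_{h}$. This uses the transposed convention of Pascanu et al. (2013); in the numerator layout each factor reads $\mathrm{diag}(\sigma_{h}'(z_{i})) W_{hh}$, and the norm bounds below are the same in either convention.

Simplification (assume $\sigma_{h}'$ is bounded, $|\sigma_{h}'| \leq \gamma$; for $\tanh$, $\gamma = 1$). Taking norms and using submultiplicativity:

$$\left\| \frac{\partial h_{t}}{\partial h_{k}} \right\| \leq (\gamma \| W_{hh} \|)^{t-k}$$

This exponential dependence on the gap $t - k$ is the source of vanishing and exploding gradients.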
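The bound can be observed numerically. The sketch below accumulates the product of one-step Jacobians of a $\tanh$ RNN and compares its spectral norm with $(\gamma \| W_{hh} \|)^{t}$ for $k = 0$. It assumes $\gamma = 1$, a random $W_{hh}$ rescaled to spectral norm 0.9, and Gaussian inputs; all of these are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 8, 30
W = rng.normal(size=(n, n))
W *= 0.9 / np.linalg.norm(W, 2)      # rescale so that ||W||_2 = 0.9
gamma = 1.0                          # |tanh'(z)| <= 1

h = np.zeros(n)
J_prod = np.eye(n)                   # accumulates dh_t / dh_0
norms, bounds = [], []
for t in range(1, steps + 1):
    h = np.tanh(W @ h + rng.normal(size=n))
    J_t = np.diag(1.0 - h**2) @ W    # dh_t / dh_{t-1} (numerator layout)
    J_prod = J_t @ J_prod
    norms.append(np.linalg.norm(J_prod, 2))
    bounds.append((gamma * 0.9) ** t)

print(norms[-1], bounds[-1])
```

In this setting the measured norm typically sits well below the bound, because $\tanh'$ is strictly smaller than 1 whenever the units are away from zero.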

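2.4 Formal Statement

Theorem (vanishing and exploding gradients):

Let $\rho_{t} = \prod_{i=1}^{t} W_{hh}^{\top} \mathrm{diag}(\sigma_{h}'(z_{i}))$ and let $\| \cdot \|$ denote the spectral norm. As $t \to \infty$:

$$\| \rho_{t} \| \to 0 \quad \text{if } \gamma \| W_{hh} \| < 1 \quad \text{(vanishing gradients; sufficient condition)}$$

$$\| \rho_{t} \| \to \infty \quad \text{only if } \gamma \| W_{hh} \| > 1 \quad \text{(exploding gradients; necessary condition)}$$

Because the bound in 2.3 is only an upper bound, $\gamma \| W_{hh} \| > 1$ does not by itself guarantee explosion. Gradients propagate over long spans without vanishing or exploding only when the per-step Jacobians have norm close to 1, roughly $\gamma \| W_{hh} \| \approx 1$.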
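A small example helps separate the two directions of this statement. With an identity activation ($\gamma = 1$), the Jacobian product is just $W_{hh}^{t}$. The hand-picked non-normal matrix below (an assumption for illustration) has spectral norm above 1 but spectral radius 0.5, so the gradient norm first grows and then still vanishes.

```python
import numpy as np

W = np.array([[0.5, 2.0],
              [0.0, 0.5]])
spectral_norm = np.linalg.norm(W, 2)                   # largest singular value
spectral_radius = np.abs(np.linalg.eigvals(W)).max()   # largest |eigenvalue|

# With identity activation (gamma = 1), dh_t/dh_0 is W^t.
norms = [np.linalg.norm(np.linalg.matrix_power(W, t), 2) for t in range(1, 61)]
print(spectral_norm, spectral_radius, norms[0], norms[-1])
```

So a norm condition above 1 permits explosion but does not force it; the long-run rate is governed by the spectral radius, while the norm controls transient growth.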

2.5 BPTT Implementation

import torch
import torch.nn as nn


class VanillaRNN(nn.Module):
    """Vanilla (Elman) RNN."""

    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.hidden_dim = hidden_dim

        # Weights; the hidden bias b_h lives in W_hh, so W_xh has no bias
        self.W_xh = nn.Linear(input_dim, hidden_dim, bias=False)
        self.W_hh = nn.Linear(hidden_dim, hidden_dim)
        self.W_hy = nn.Linear(hidden_dim, output_dim)

        # Initialization
        self._init_weights()

    def _init_weights(self):
        """Xavier initialization for weights, zeros for biases."""
        for m in [self.W_xh, self.W_hh, self.W_hy]:
            nn.init.xavier_normal_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)

    def forward(self, x, h0=None):
        """
        x: (batch, seq_len, input_dim)
        Returns:
            outputs: (batch, seq_len, output_dim) logits (no softmax applied)
            h_n:     (batch, hidden_dim) final hidden state
            hs:      list of seq_len hidden states, each (batch, hidden_dim)
        """
        batch_size, seq_len, _ = x.shape
        if h0 is None:
            h0 = torch.zeros(batch_size, self.hidden_dim, device=x.device)

        h = h0
        outputs = []
        hs = []  # keep every hidden state; BPTT needs them

        for t in range(seq_len):
            h = torch.tanh(self.W_hh(h) + self.W_xh(x[:, t, :]))
            y = self.W_hy(h)
            outputs.append(y)
            hs.append(h)

        outputs = torch.stack(outputs, dim=1)  # (batch, seq_len, output_dim)
        return outputs, h, hs
 
 
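def bptt_manual(model, x, y_true, loss_fn, seq_len):
    """Manual BPTT for VanillaRNN (for understanding, not for speed).

    Assumes loss_fn is nn.CrossEntropyLoss() with the default mean reduction,
    x has shape (batch, seq_len, input_dim), y_true holds class indices of
    shape (batch, seq_len), and the model starts from h0 = 0.
    """
    # Forward pass
    outputs, h_n, hs = model(x)

    # Loss averaged over all batch * seq_len predictions
    loss = loss_fn(outputs.reshape(-1, outputs.size(-1)), y_true.reshape(-1))

    # Backward pass: BPTT
    # dL/dW_hy: local to each step (omitted here)
    # dL/dW_hh, dL/dW_xh, dL/db_h: accumulated across time steps
    dW_hh = torch.zeros_like(model.W_hh.weight)
    dW_xh = torch.zeros_like(model.W_xh.weight)
    db_hh = torch.zeros_like(model.W_hh.bias)

    batch_size = x.size(0)
    num_terms = batch_size * seq_len  # number of terms the loss averages over
    rows = torch.arange(batch_size, device=x.device)
    dh_next = torch.zeros(batch_size, model.hidden_dim, device=x.device)

    with torch.no_grad():
        for t in reversed(range(seq_len)):
            # Gradient of the mean cross-entropy w.r.t. the logits at step t:
            # (softmax(logits) - one_hot(target)) / num_terms
            dy = torch.softmax(outputs[:, t], dim=-1)
            dy[rows, y_true[:, t]] -= 1.0
            dy = dy / num_terms

            # Gradient w.r.t. h_t: from the output at step t plus from step t+1
            dh = dy @ model.W_hy.weight + dh_next

            # Through tanh: tanh'(z) = 1 - tanh(z)^2
            dh_pre = dh * (1 - hs[t] ** 2)

            # Weight gradients (at t = 0 the previous state is h0 = 0)
            if t > 0:
                dW_hh += dh_pre.t() @ hs[t - 1]
            dW_xh += dh_pre.t() @ x[:, t, :]
            db_hh += dh_pre.sum(dim=0)

            # Pass the gradient back to the previous time step
            dh_next = dh_pre @ model.W_hh.weight

    return loss, dW_hh, dW_xh, db_hh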
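A manual BPTT routine is easiest to trust when it is checked against finite differences. The dependency-free sketch below does this for a scalar RNN $h_{t} = \tanh(w h_{t-1} + u_{t})$ with a squared-error loss. The scalar model, the loss, and the names `forward`, `loss`, and `bptt_grad_w` are assumptions made for the illustration; this is not the PyTorch model above.

```python
import math

def forward(w, inputs, h0=0.0):
    """Scalar RNN h_t = tanh(w * h_{t-1} + u_t); returns [h_0, h_1, ..., h_T]."""
    hs = [h0]
    for u in inputs:
        hs.append(math.tanh(w * hs[-1] + u))
    return hs

def loss(w, inputs, targets):
    """Sum over time of 0.5 * (h_t - target_t)^2."""
    hs = forward(w, inputs)
    return sum(0.5 * (h - y) ** 2 for h, y in zip(hs[1:], targets))

def bptt_grad_w(w, inputs, targets):
    """dL/dw via backpropagation through time."""
    hs = forward(w, inputs)
    grad_w = 0.0
    dh_next = 0.0                                  # gradient arriving from step t+1
    for t in range(len(inputs), 0, -1):
        dh = (hs[t] - targets[t - 1]) + dh_next    # local loss term + future term
        dz = dh * (1.0 - hs[t] ** 2)               # through tanh
        grad_w += dz * hs[t - 1]                   # immediate derivative at step t
        dh_next = dz * w                           # pass back to h_{t-1}
    return grad_w

inputs = [0.5, -0.3, 0.8, 0.1, -0.6]
targets = [0.2, 0.0, 0.4, 0.3, -0.1]
w, eps = 0.7, 1e-6
analytic = bptt_grad_w(w, inputs, targets)
numeric = (loss(w + eps, inputs, targets) - loss(w - eps, inputs, targets)) / (2 * eps)
print(analytic, numeric)
```

The backward loop mirrors the double sum in 2.2: `dh_next` carries the product of Jacobians, and `grad_w` adds one immediate-derivative term per step.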

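3. Modern Theory of Vanishing and Exploding Gradients (2024-2025)

3.1 The Classical View and Its Challenges

The classical theory of Pascanu et al. (2013):

Vanishing and exploding gradients arise when the spectral radius of the Jacobian deviates from 1:

$$\rho(W_{hh}^{\top} \mathrm{diag}(\sigma_{h}')) \neq 1$$

where $\rho(\cdot)$ here denotes the spectral radius.

The consensus of the past 30 years: the problem must be handled through gating (LSTM/GRU) or careful initialization.

3.2 The Shift Proposed by Zucchet & Orvieto (2024)

Zucchet & Orvieto (arXiv 2405.21064, 2024) argue that vanishing and exploding gradients are not the whole story.²

Core findings:

Controlling the Jacobian spectrum alone does not settle whether an RNN can learn long-range dependencies. Even when the spectral radius is below 1 and gradients do not explode, the paper reports that as a network's memory grows, small parameter changes produce increasingly large output changes, which makes gradient-based learning highly sensitive. Contributing factors:

Hidden-state evolution: $h_{t}$ can be very sensitive along some subspaces even when the overall gradient norm is small
The Jacobian is not isotropic: a spectral radius below 1 fixes only the asymptotic decay rate, and individual singular values can still exceed 1
Nonlinearity of the activation: it makes the spectral structure of the Jacobian depend on the current state

Framework:

$$\left\| \frac{\partial h_{t}}{\partial h_{k}} \right\| \leq \prod_{i=k+1}^{t} \sigma_{\max}^{(i)}$$

where $\sigma_{\max}^{(i)}$ is the largest singular value of the Jacobian at step $i$.

Key insight: if $\sigma_{\max}^{(i)} < 1$ for all $i$, the product cannot explode, but it can either stay at a usable magnitude over the horizon of interest (when the $\sigma_{\max}^{(i)}$ stay close to 1) or vanish exponentially (when they are bounded away from 1).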
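One way to see why a well-behaved state Jacobian does not end the story is to look at sensitivity to a parameter rather than to the state. The sketch below uses a scalar linear recurrence $h_{t} = \lambda h_{t-1} + 1$, which is a toy assumption and not the setting analyzed in the paper. The state-to-state derivative $\lambda^{t}$ vanishes for every $|\lambda| < 1$, yet the derivative of the state with respect to $\lambda$ approaches $1/(1-\lambda)^{2}$, which grows quickly as the memory gets longer.

```python
def sensitivity(lam, steps=20000):
    """Return (h_T, dh_T/dlam, dh_T/dh_0) for h_t = lam * h_{t-1} + 1, h_0 = 0."""
    h, dh_dlam, dh_dh0 = 0.0, 0.0, 1.0
    for _ in range(steps):
        dh_dlam = h + lam * dh_dlam   # product rule applied to lam * h_{t-1}
        h = lam * h + 1.0
        dh_dh0 *= lam                 # state-to-state Jacobian, below 1 here
    return h, dh_dlam, dh_dh0

results = {lam: sensitivity(lam) for lam in (0.5, 0.9, 0.99)}
for lam, (h, dh_dlam, dh_dh0) in results.items():
    print(f"lam={lam}: h={h:.2f}  dh/dlam={dh_dlam:.1f}  dh/dh0={dh_dh0:.1e}")
```

Moving $\lambda$ from 0.9 to 0.99 multiplies the parameter sensitivity by roughly 100 while the state Jacobian stays below 1 throughout, which is the kind of effect that norm-based arguments alone do not capture.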

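3.3 Hidden-State Dynamics: Contraction and Forgetting

Proposition (contraction regime; cf. Zucchet & Orvieto 2024):

When $\sigma_{\max}^{(i)} \leq c < 1$ uniformly over states and steps, the state update is a contraction in $h$. For a constant input, the trajectory of the hidden state $h_{t}$ converges to a fixed point:

$$h_{t} \to h^{*} \quad \text{as } t \to \infty$$

More generally, a small perturbation $\Delta h_{0}$ of the initial condition is forgotten exponentially fast:

$$\| h_{t}(h_{0}) - h_{t}(h_{0} + \Delta h_{0}) \| \leq c^{t} \| \Delta h_{0} \| \to 0$$

The RNN therefore keeps no memory of its initial condition and, by the same argument, of inputs far in the past, which makes learning long-term dependencies difficult.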
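The forgetting effect is easy to reproduce. The sketch below runs a scalar RNN $h_{t} = \tanh(w h_{t-1} + u_{t})$ with $w = 0.8$ from two different initial states under the same input. Since $|\tanh'| \leq 1$, the gap between the two trajectories shrinks at least as fast as $0.8^{t}$. The scalar model and the sinusoidal input are illustrative assumptions.

```python
import math

def run(h0, inputs, w=0.8):
    """Iterate h_t = tanh(w * h_{t-1} + u_t) and return the whole trajectory."""
    hs = [h0]
    for u in inputs:
        hs.append(math.tanh(w * hs[-1] + u))
    return hs

inputs = [math.sin(0.3 * t) for t in range(60)]   # an arbitrary bounded input
a = run(0.0, inputs)
b = run(0.5, inputs)                              # perturbed initial state
gaps = [abs(x - y) for x, y in zip(a, b)]
print(gaps[0], gaps[10], gaps[-1])
```

With a time-varying input the states do not settle to a fixed point, but the two trajectories still merge, so the initial condition stops mattering.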

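3.4 Livi's Learnability-Window Theory (2025)

Livi (arXiv 2512.05790, 2025) introduces the learnability window $H_{N}$.³

Definition:

$$H_{N} = \max \left\{ t : \mathbb{E}\left[ \left\| \frac{\partial h_{t}}{\partial h_{0}} \right\| \right] \geq \delta \right\}$$

that is, the longest time span over which the gradient signal remains statistically recoverable, for a detection threshold $\delta > 0$.

Core theorem (Livi 2025):

For gated RNNs (including GRU and LSTM), the learnability window scales as:

$$H_{N} \sim \log(N) \cdot \left(1 - \frac{1}{g}\right)^{-1}$$

where $g$ is a parameter measuring the effectiveness of the gating mechanism.

Practical implications:

Gates should sit close to 1 so that information can pass through
Too many gates close to 0 cause information to be lost entirely
Best configuration: some gates open, some gates closed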
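To make the role of the gate value concrete, consider the simplest gated recurrence $h_{t} = f h_{t-1} + (1 - f) z_{t}$ with a constant forget gate $0 < f < 1$. Then $\partial h_{t} / \partial h_{0} = f^{t}$, and the largest $t$ with $f^{t} \geq \delta$ is $\lfloor \ln \delta / \ln f \rfloor$. This is a back-of-the-envelope sketch under that constant-gate assumption, not the estimator from the paper; `gate_horizon` is a hypothetical helper.

```python
import math

def gate_horizon(f, delta=1e-3):
    """Largest t with f**t >= delta for a constant forget gate 0 < f < 1."""
    return math.floor(math.log(delta) / math.log(f))

horizons = {f: gate_horizon(f) for f in (0.5, 0.9, 0.99, 0.999)}
print(horizons)
```

The horizon grows roughly like $1/(1-f)$ as the gate approaches 1, which matches the first practical implication above.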

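3.5 Nonasymptotic Analysis of Cayci & Eryilmaz

Cayci & Eryilmaz (arXiv 2402.12241, 2024) give a nonasymptotic convergence analysis of gradient descent for RNNs with diagonal recurrent weights.⁴

Model setup:

$$h_{t+1} = \sigma(\mathrm{diag}(w) h_{t} + W_{x} x_{t})$$

Key theorem:

For width $m$ and sequence length $T$, the number of gradient-descent iterations needed to reach an $\epsilon$-optimal solution is:

$$T_{\mathrm{iter}} = \tilde{O}\left(\frac{T^{2}}{\epsilon^{2} m^{2}}\right)$$

This indicates that polynomial growth in width is enough to guarantee convergence, in contrast to earlier pessimistic results that suggested exponential width was required.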

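3.6 Ribeiro's Attractor Perspective

Ribeiro et al. (AISTATS 2020, PMLR 108) analyze RNN training through attractor dynamics.⁵

Core idea:

RNN training can be viewed as shaping the attractors of the hidden-state dynamics while moving through the loss landscape. Each attractor corresponds to one mode of hidden-state dynamics.

Key insights:

Vanishing gradients go with "dead attractors": training stalls at a fixed-point attractor
Exploding gradients go with "chaotic attractors": training oscillates
Stable training: sits in the "periodic attractor" regime

Theoretical significance: vanishing and exploding gradients are not bugs but signatures of different attractor regimes.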

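3.7 Manchev's Local Representation Alignment (2025)

Manchev & Garcia-Peraza-Herrera (arXiv 2504.13531, 2025) study Local Representation Alignment (LRA) RNNs.⁶

Core idea:

A traditional RNN combines the input and the previous state only through the additive pre-activation. An LRA-RNN additionally aligns the input representation with the history representation at every time step.

Architecture:

$$h_{t} = \sigma(W_{h} h_{t-1} + W_{x} x_{t} + \alpha \cdot \mathrm{align}(x_{t}, h_{t-1}))$$

with the alignment function:

$$\mathrm{align}(x, h) = W_{a} (x - \bar{h}) \cdot \mathbb{1}[\mathrm{match}]$$

Advantages:

Better local temporal modeling
Fewer parameters
Well suited to streaming data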

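4. Engineering Solutions to the Gradient Problem

4.1 Gradient Clipping

Clip the gradient norm:

def clip_grad_norm(parameters, max_norm):
    """Rescale gradients in place so their global L2 norm is at most max_norm."""
    # Materialize the iterable: model.parameters() is a generator and
    # would be exhausted after the first loop.
    params = [p for p in parameters if p.grad is not None]

    total_norm = 0.0
    for p in params:
        total_norm += p.grad.norm(2).item() ** 2
    total_norm = total_norm ** 0.5

    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        for p in params:
            p.grad.mul_(clip_coef)

    return total_norm  # the norm before clipping
 
 
# PyTorch built-in equivalent
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)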
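The arithmetic of norm clipping can be followed on plain numbers. The sketch below applies the same rule to a list of gradient vectors without any framework; `clip_by_global_norm` is a hypothetical helper written for this illustration.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is <= max_norm."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [[g * scale for g in vec] for vec in grads], total

grads = [[3.0, 4.0], [12.0]]          # joint norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)
```

All components are scaled by the same factor, so clipping shortens the update without changing its direction; gradients already below `max_norm` are left untouched.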

4.2 Orthogonal Initialization

Key idea: initialize $W_{hh}$ as an orthogonal matrix, so that all of its singular values, and hence its spectral radius, equal 1 at initialization.

def orthogonal_init(linear_layer, gain=1.0):
    """Orthogonal initialization for a single nn.Linear layer."""
    nn.init.orthogonal_(linear_layer.weight, gain=gain)
    if linear_layer.bias is not None:
        nn.init.zeros_(linear_layer.bias)
 
 
def init_rnn_orthogonal(rnn):
    """Orthogonal recurrent weights, Xavier input weights, zero biases."""
    for name, param in rnn.named_parameters():
        if 'bias' in name:
            # Check biases first: names such as 'W_hh.bias' also contain
            # 'W_hh', and nn.init.orthogonal_ rejects 1-D tensors.
            nn.init.zeros_(param)
        elif 'weight_hh' in name or 'W_hh' in name:
            nn.init.orthogonal_(param, gain=1.0)
        elif 'weight_ih' in name or 'W_xh' in name:
            nn.init.xavier_normal_(param)
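The effect of an orthogonal recurrent matrix can be seen by pushing a vector through many linear steps. The NumPy sketch below builds an orthogonal matrix from a QR decomposition and, for contrast, a Gaussian matrix scaled so that its spectral radius is roughly 0.5. The size, the scale, and the number of steps are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))        # Q is orthogonal
G = 0.5 * rng.normal(size=(n, n)) / np.sqrt(n)      # contrast: spectral radius near 0.5

v = rng.normal(size=n)
v /= np.linalg.norm(v)
v_q, v_g = v.copy(), v.copy()
for _ in range(100):                                # 100 linear "time steps"
    v_q = Q @ v_q
    v_g = G @ v_g

singular_values = np.linalg.svd(Q, compute_uv=False)
norm_q, norm_g = np.linalg.norm(v_q), np.linalg.norm(v_g)
print(norm_q, norm_g)
```

The orthogonal matrix preserves the norm exactly (up to rounding), while the contracting matrix drives it to zero. With a $\tanh$ nonlinearity the per-step Jacobian also includes $\mathrm{diag}(\sigma_{h}')$, so orthogonality only guarantees norm preservation at initialization in the linear regime.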

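4.3 Gating in LSTM/GRU

See machine-learning/lstm.md and machine-learning/gru-gated-recurrent-unit.md.

4.4 Skip Connections

class SkipRNN(nn.Module):
    """RNN with skip connections across time."""
    
    def __init__(self, input_dim, hidden_dim, output_dim, skip_steps=5):
        super().__init__()
        self.cell = nn.RNNCell(input_dim, hidden_dim)
        self.skip_steps = skip_steps
        self.fc = nn.Linear(hidden_dim, output_dim)
    
    def forward(self, x):
        # x: (batch, seq_len, input_dim)
        batch_size, seq_len, _ = x.shape
        h = torch.zeros(batch_size, self.cell.hidden_size, device=x.device)
        history = []  # past hidden states, needed for the skip path
        outputs = []
        
        for t in range(seq_len):
            h = self.cell(x[:, t, :], h)
            # Skip connection: mix in the state from skip_steps steps ago, which
            # gives gradients a path that is skip_steps times shorter.
            # Averaging is one simple choice that keeps h inside [-1, 1].
            if t >= self.skip_steps:
                h = 0.5 * (h + history[t - self.skip_steps])
            history.append(h)
            outputs.append(self.fc(h))
        
        return torch.stack(outputs, dim=1)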

5. Modern Variants of the Vanilla RNN

5.1 QRNN (Quasi-Recurrent Neural Networks)

QRNNs compute all gates with convolutions that run in parallel across time steps, leaving only a lightweight elementwise recurrence:

class QRNNLayer(nn.Module):
    """QRNN layer: gates computed in parallel across time steps."""
    
    def __init__(self, input_dim, hidden_dim, kernel_size=2):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.kernel_size = kernel_size
        
        # A single convolution produces the z, f, and o gates together
        self.conv = nn.Conv1d(
            input_dim, 3 * hidden_dim,
            kernel_size, padding=kernel_size - 1
        )
    
    def forward(self, x):
        # x: (batch, seq_len, input_dim)
        x_t = x.transpose(1, 2)  # (batch, input_dim, seq_len)
        # Keep the first seq_len outputs so the convolution is causal
        # (this slice also works when kernel_size == 1)
        gates = self.conv(x_t)[:, :, :x.size(1)]
        gates = gates.transpose(1, 2)  # (batch, seq_len, 3*hidden)
        
        z, f, o = gates.chunk(3, dim=-1)
        z = torch.tanh(z)
        f = torch.sigmoid(f)
        o = torch.sigmoid(o)
        
        # Recurrence: c_t = f_t * c_{t-1} + (1 - f_t) * z_t, output h_t = o_t * c_t
        h = self._quasi_rnn_step(z, f, o)
        return h
    
    def _quasi_rnn_step(self, z, f, o):
        """QRNN pooling. The recurrence is elementwise and could be evaluated
        with a parallel scan; a plain loop is used here for clarity."""
        seq_len = z.size(1)
        h = torch.zeros_like(z[:, 0])
        hs = []
        
        for t in range(seq_len):
            h = f[:, t] * h + (1 - f[:, t]) * z[:, t]
            hs.append(h)
        
        h = torch.stack(hs, dim=1)
        return o * h
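Because the pooling recurrence is linear in the state, it unrolls into a weighted sum of past candidates, and this is what makes a parallel (scan-style) evaluation possible. The sketch below checks the unrolled form $h_{t} = \sum_{i \leq t} \left( \prod_{j=i+1}^{t} f_{j} \right) (1 - f_{i}) z_{i}$ against the sequential loop on scalar values. The quadratic-time double loop is only there to show the equivalence, and the numbers are arbitrary.

```python
import math

def pool_sequential(z, f):
    """h_t = f_t * h_{t-1} + (1 - f_t) * z_t with h_0 = 0."""
    h, out = 0.0, []
    for z_t, f_t in zip(z, f):
        h = f_t * h + (1.0 - f_t) * z_t
        out.append(h)
    return out

def pool_closed_form(z, f):
    """Same values from the unrolled sum; each h_t is computed independently."""
    out = []
    for t in range(len(z)):
        total = 0.0
        for i in range(t + 1):
            decay = math.prod(f[i + 1 : t + 1])   # product of f_j for j = i+1..t
            total += decay * (1.0 - f[i]) * z[i]
        out.append(total)
    return out

z = [0.5, -1.0, 0.25, 0.8]
f = [0.9, 0.1, 0.5, 0.7]
seq, closed = pool_sequential(z, f), pool_closed_form(z, f)
print(seq)
print(closed)
```

The decay factors are products of forget gates, so a run of gates near 1 carries information (and gradients) across many steps.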

5.2 Echo State Networks (ESN)

Reservoir computing: the random hidden layer is fixed, and only the output layer is trained.

class EchoStateNetwork(nn.Module):
    """Echo state network: fixed random reservoir, trainable linear readout."""
    
    def __init__(self, input_dim, reservoir_dim, output_dim, 
                 spectral_radius=0.9, sparsity=0.1, ridge_lambda=1e-6):
        super().__init__()
        self.reservoir_dim = reservoir_dim
        self.ridge_lambda = ridge_lambda  # for a ridge-regression fit of W_out (not implemented here)
        
        # Fixed random input weights
        W_in = torch.randn(reservoir_dim, input_dim) * 0.1
        
        # Sparse random recurrent weights
        W_res = torch.randn(reservoir_dim, reservoir_dim)
        mask = (torch.rand_like(W_res) < sparsity).float()
        W_res = W_res * mask
        
        # Rescale to the requested spectral radius. Eigenvalues are complex,
        # so use their modulus rather than the real part.
        eigvals = torch.linalg.eigvals(W_res)
        current_radius = eigvals.abs().max()
        W_res = W_res * (spectral_radius / current_radius)
        
        # Buffers are saved with the model but never updated by the optimizer
        self.register_buffer('W_in', W_in)
        self.register_buffer('W_res', W_res)
        
        # Output weights (trainable)
        self.W_out = nn.Linear(reservoir_dim, output_dim, bias=False)
    
    def forward(self, x):
        """Collect reservoir states and apply the readout."""
        batch_size, seq_len, _ = x.shape
        states = []
        h = torch.zeros(batch_size, self.reservoir_dim, device=x.device)
        
        for t in range(seq_len):
            h = torch.tanh(x[:, t, :] @ self.W_in.t() + h @ self.W_res.t())
            states.append(h)
        
        states = torch.stack(states, dim=1)  # (batch, seq_len, reservoir_dim)
        return self.W_out(states)
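Since only the readout is trained, it is commonly fitted in closed form with ridge regression instead of gradient descent. The NumPy sketch below shows that fit on a toy next-step prediction task. The reservoir size, scales, sine input, and the helper name `ridge_readout` are assumptions for illustration and are independent of the class above.

```python
import numpy as np

def ridge_readout(states, targets, ridge_lambda=1e-6):
    """Solve min_W ||states @ W - targets||^2 + ridge_lambda * ||W||^2 in closed form.

    states:  (num_samples, reservoir_dim) collected reservoir states
    targets: (num_samples, output_dim)
    returns: (reservoir_dim, output_dim) readout matrix
    """
    dim = states.shape[1]
    gram = states.T @ states + ridge_lambda * np.eye(dim)
    return np.linalg.solve(gram, states.T @ targets)

# Toy reservoir (all sizes and scales are illustrative assumptions).
rng = np.random.default_rng(0)
dim, steps = 50, 500
W_res = rng.normal(size=(dim, dim))
W_res *= 0.9 / np.abs(np.linalg.eigvals(W_res)).max()   # spectral radius 0.9
W_in = rng.normal(scale=0.5, size=dim)

u = np.sin(0.2 * np.arange(steps + 1))                  # scalar input signal
h, states = np.zeros(dim), []
for t in range(steps):
    h = np.tanh(W_res @ h + W_in * u[t])
    states.append(h)
states = np.array(states)
targets = u[1:].reshape(-1, 1)                          # predict the next input

W_out = ridge_readout(states, targets)
mse = float(np.mean((states @ W_out - targets) ** 2))
print(W_out.shape, mse)
```

No gradient ever flows through the recurrent weights in this scheme, which is why echo state networks avoid the vanishing and exploding gradient problem altogether, at the cost of not adapting the reservoir.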

5.3 IndRNN (Independently Recurrent Neural Networks)

Each neuron in a layer has its own independent recurrence, which allows much deeper RNNs:

class IndRNNCell(nn.Module):
    """Independently recurrent unit."""
    
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.hidden_dim = hidden_dim
        
        # One recurrent weight per neuron (a vector, not a matrix)
        self.W_rec = nn.Parameter(torch.zeros(hidden_dim))
        nn.init.uniform_(self.W_rec, 0, 1)
        
        self.W_in = nn.Linear(input_dim, hidden_dim)
        self.W_out = nn.Linear(hidden_dim, hidden_dim)  # projection to the next layer (not applied in forward)
    
    def forward(self, x, h0=None):
        # x: (batch, seq_len, input_dim)
        batch_size, seq_len, _ = x.shape
        if h0 is None:
            h0 = torch.zeros(batch_size, self.hidden_dim, device=x.device)
        
        h = h0
        outputs = []
        for t in range(seq_len):
            u = self.W_in(x[:, t, :])
            # Independent recurrence: elementwise product, no mixing between neurons
            h = torch.tanh(u + self.W_rec.unsqueeze(0) * h)
            outputs.append(h)
        
        return torch.stack(outputs, dim=1), h

Advantages:

Mitigates vanishing and exploding gradients
Allows very deep RNNs (100+ layers)
Can be combined with convolutions
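With one recurrent weight per neuron, the gradient through $T$ steps of a single neuron is $w^{T}$ times a product of activation slopes. Taking the slope as 1 (for example an active ReLU unit), the gradient stays inside $[\epsilon, \gamma]$ exactly when $\epsilon^{1/T} \leq |w| \leq \gamma^{1/T}$, which suggests constraining each weight to that range. The thresholds below and the helper name `indrnn_weight_range` are illustrative assumptions.

```python
def indrnn_weight_range(steps, eps=1e-3, gamma=1e3):
    """Range of |w| for which eps <= |w|**steps <= gamma (activation slope taken as 1)."""
    return eps ** (1.0 / steps), gamma ** (1.0 / steps)

ranges = {T: indrnn_weight_range(T) for T in (10, 100, 1000)}
for T, (lo, hi) in ranges.items():
    print(f"T={T}: {lo:.4f} <= |w| <= {hi:.4f}")
```

The admissible interval tightens around 1 as the sequence gets longer, but it can be enforced per neuron by simple clipping, which is not possible for a full recurrent matrix.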

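6. Practical Guide

6.1 When to Use a Vanilla RNN

Good fits:

Short-sequence tasks (< 50 steps)
Simple pattern recognition
Teaching and prototyping
Baseline comparisons against LSTM/GRU

Poor fits:

Long-sequence modeling (> 100 steps)
Learning long-term dependencies
Large-scale production deployment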

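6.2 Hyperparameter Selection

def get_rnn_hyperparams():
    """Starting-point hyperparameters for a vanilla RNN."""
    return {
        # Architecture
        'hidden_dim': 128,  # heuristic: scale with the complexity of the data
        'num_layers': 1,    # vanilla RNNs are usually single-layer
        
        # Training
        'optimizer': 'adam',
        'learning_rate': 1e-3,
        'gradient_clip': 1.0,  # key hyperparameter
        'batch_size': 64,
        'epochs': 50,
        
        # Initialization
        'init': 'orthogonal',
        'init_gain': 1.0,
        
        # Regularization
        'dropout': 0.1,  # on the input/output layers
        'weight_decay': 1e-5,
    }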

6.3 Debugging Tips

def diagnose_rnn(model, x, y, criterion):
    """Training diagnostics for an RNN.

    Call after loss.backward() so that param.grad is populated. Expects a
    model whose forward returns (outputs, h_n, hs), as VanillaRNN does.
    """
    
    # 1. Monitor gradient norms
    grad_norms = []
    for name, param in model.named_parameters():
        if param.grad is not None:
            grad_norms.append((name, param.grad.norm().item()))
    
    print("Gradient norms:")
    for name, norm in grad_norms:
        flag = "🚨" if norm > 10 else "⚠️" if norm > 1 else "✓"
        print(f"  {flag} {name}: {norm:.4f}")
    
    # 2. Monitor activations (restore the previous train/eval mode afterwards)
    was_training = model.training
    model.eval()
    with torch.no_grad():
        outputs, h_n, hs = model(x)
    model.train(was_training)
    
    print("\nActivation statistics:")
    print(f"  hidden-state mean: {torch.stack(hs).mean():.4f}")
    print(f"  hidden-state std: {torch.stack(hs).std():.4f}")
    print(f"  final-state norm: {h_n.norm():.4f}")
    
    # 3. Loss broken down by time step
    seq_len = y.size(1)
    losses_per_step = []
    for t in range(seq_len):
        loss_t = criterion(outputs[:, t], y[:, t])
        losses_per_step.append(loss_t.item())
    
    print("\nPer-step loss:")
    for t in [0, seq_len // 4, seq_len // 2, seq_len - 1]:
        print(f"  t={t}: loss={losses_per_step[t]:.4f}")

6.4 Vanilla RNN vs LSTM/GRU

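Property	Vanilla RNN	LSTM	GRU
Parameter count	Low	High (about 4x)	Medium (about 3x)
Training speed	Fast	Slow	Medium
Long-term dependencies	Poor	Good	Good
Memory efficiency	High	Low	Medium
Usable sequence length (steps, rule of thumb)	< 50	< 1000	< 500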
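The 4x and 3x factors in the table follow from counting weight blocks: an LSTM has four blocks of the vanilla RNN's shape (three gates plus the candidate), and a GRU has three. The sketch below counts parameters under the assumption of one bias vector per block; PyTorch's `nn.RNN`, `nn.GRU`, and `nn.LSTM` keep two bias vectors per block, so their totals differ slightly while the ratios stay the same. The dimensions are arbitrary.

```python
def recurrent_param_count(input_dim, hidden_dim, gates):
    """Parameters of `gates` blocks, each with W_x (n x d), W_h (n x n) and one bias (n)."""
    per_block = hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim
    return gates * per_block

d, n = 64, 128
counts = {
    "vanilla": recurrent_param_count(d, n, gates=1),
    "gru": recurrent_param_count(d, n, gates=3),
    "lstm": recurrent_param_count(d, n, gates=4),
}
print(counts)
```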

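7. Python Implementations of the Formal Analysis

7.1 Jacobian Spectral Analysis

def analyze_jacobian_spectrum(rnn, x):
    """Print and return the spectral radius of dh_t/dh_{t-1} at every step.

    Expects a VanillaRNN-style model (attributes hidden_dim, W_hh, W_xh) and
    a single sequence x of shape (1, seq_len, input_dim).
    """
    h = torch.zeros(1, rnn.hidden_dim, device=x.device)
    radii = []
    
    print("Jacobian spectral analysis:")
    for t in range(x.size(1)):
        x_t = x[:, t, :]
        
        def step(h_prev):
            # One forward step as a function of the previous state only
            return torch.tanh(rnn.W_hh(h_prev) + rnn.W_xh(x_t))
        
        # The Jacobian has shape (1, n, 1, n); index the batch axes to get (n, n)
        jac = torch.autograd.functional.jacobian(step, h)[0, :, 0, :]
        h = step(h).detach()
        
        # Spectral radius = largest eigenvalue modulus
        spectral_radius = torch.linalg.eigvals(jac).abs().max().item()
        radii.append(spectral_radius)
        print(f"  t={t}: spectral radius = {spectral_radius:.4f}")
        
        # Heuristic thresholds
        if spectral_radius < 0.5:
            print("    -> severe vanishing gradients")
        elif spectral_radius > 2.0:
            print("    -> risk of exploding gradients")
        else:
            print("    -> healthy range")
    
    return radii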

7.2 Measuring the Learnability Window

def measure_learnability_window(model, x, threshold=1e-3):
    """Estimate how many steps the gradient w.r.t. h_0 stays above a threshold.

    Returns (H, learnability), where learnability[t - 1] is the norm of
    d(sum of h_t) / d h_0 and H counts the leading steps whose value exceeds
    threshold * learnability[0].
    """
    batch_size = x.size(0)
    seq_len = x.size(1)
    
    # The initial state must require grad so we can differentiate w.r.t. it
    h0 = torch.zeros(batch_size, model.hidden_dim, device=x.device,
                     requires_grad=True)
    
    # Forward pass, keeping every hidden state in the graph
    states = [h0]
    h = h0
    for t in range(seq_len):
        h = torch.tanh(model.W_hh(h) + model.W_xh(x[:, t, :]))
        states.append(h)
    
    # Gradient of each h_t with respect to the initial state
    learnability = []
    for t in range(1, seq_len + 1):
        grad = torch.autograd.grad(
            states[t].sum(),
            h0,
            retain_graph=True
        )[0]
        learnability.append(grad.norm().item())
    
    # Learnability window: leading steps above the relative threshold
    H = 0
    for g in learnability:
        if g > threshold * learnability[0]:
            H += 1
        else:
            break
    
    return H, learnability

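8. Future Directions

8.1 Theoretical Foundations

Open problems:

A complete theory of gradient behavior: factors beyond the spectral radius
Finite-width effects: how width influences scaling laws
Trainability of deep RNNs: when depth helps

8.2 Relationship to Transformers

Research question:

An RNN is sometimes described informally as an "infinitely deep but finite-width" Transformer. The theoretical connection between the two is an active research topic.

8.3 The Neural-Dynamics Perspective

This perspective treats RNN training as the evolution of a dynamical system rather than as a plain optimization problem, and it may lead to new breakthroughs.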

9. References

Last updated: 2026-06-21

1. Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. ICML 2013.
2. Zucchet, N. & Orvieto, A. (2024). Recurrent neural networks: vanishing and exploding gradients are not the end of the story. arXiv:2405.21064. https://arxiv.org/html/2405.21064v2
3. Livi, L. (2025). Learnability Window in Gated Recurrent Neural Networks. arXiv:2512.05790. https://www.arxiv.org/pdf/2512.05790v2
4. Cayci, S. & Eryilmaz, A. (2024). Convergence of Gradient Descent for Recurrent Neural Networks: A Nonasymptotic Analysis. arXiv:2402.12241. https://arxiv.org/pdf/2402.12241
5. Ribeiro, A.H., Tiels, K., Aguirre, L.A., & Schön, T.B. (2020). Beyond exploding and vanishing gradients: analysing RNN training using attractors and smoothness. AISTATS 2020, PMLR 108. https://proceedings.mlr.press/v108/ribeiro20a/ribeiro20a.pdf
6. Manchev, N. & Garcia-Peraza-Herrera, L.C. (2025). Can Local Representation Alignment RNNs Solve Temporal Tasks? arXiv:2504.13531. https://arxiv.org/pdf/2504.13531
