反向传播完整推导：从链式法则到通用自动微分

引言

反向传播（Backpropagation, BP）是深度学习的核心算法，由 Rumelhart, Hinton & Williams 在 1986 年的开创性论文中系统化。其本质是链式法则在计算图上的高效实现。

本文将建立从基础数学到通用自动微分的完整推导链：

单变量/多变量链式法则
计算图视角（节点、边、拓扑排序）
Jacobian 矩阵（多元导数的紧凑表示）
Hessian 矩阵（二阶导数）
自动微分（前向模式、反向模式、混合模式）
特殊结构的反向传播（卷积、循环、注意力）
数值稳定性（log-sum-exp、梯度检查）

最终，我们将在纯 NumPy 中实现一个通用反向传播引擎，可以处理任意计算图。

一、链式法则基础

1.1 单变量链式法则

对于复合函数 $y = f (g (x))$ ：

\frac{d y}{d x} = \frac{d y}{d g} \cdot \frac{d g}{d x} = f^{'} (g (x)) \cdot g^{'} (x)

多重复合：

y = f_{n} \circ f_{n - 1} \circ \dots \circ f_{1} (x)

\frac{d y}{d x} = f_{n}^{'} (f_{n - 1} (\dots)) \cdot f_{n - 1}^{'} (\dots) \cdot \dots \cdot f_{1}^{'} (x) = i = 1 \prod n f_{i}^{'} (x_{i - 1})

1.2 多元链式法则

情形1： $z = f (y)$ ， $y = g (x)$ ，其中 $x \in R^{n}$ ， $y \in R^{m}$ ， $z \in R$ 。

\frac{\partial z}{\partial x _{i}} = j = 1 \sum m \frac{\partial z}{\partial y _{j}} \cdot \frac{\partial y _{j}}{\partial x _{i}}

情形2： $z = f (y)$ ， $y = g (x)$ ，其中 $x \in R^{n}$ ， $y \in R^{m}$ ， $z \in R^{p}$ 。

Jacobian 链式法则：

\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \cdot \frac{\partial y}{\partial x}

其中 $\frac{\partial z}{\partial x} \in R^{p \times n}$ 是 Jacobian 矩阵：

(\frac{\partial z}{\partial x})_{ij} = \frac{\partial z _{i}}{\partial x _{j}}

二、计算图视角

2.1 计算图定义

计算图是一个有向无环图（DAG），其中：

节点：表示变量（输入、中间值、参数、输出）
边：表示函数依赖关系

# 简单计算图示例
# f(x, y, z) = (x + y) * z
# 计算图：
#   x, y, z (输入)
#    ↓   ↓
#    add  →  a = x + y
#     ↓
#    mul  →  b = a * z
#     ↓
#    output

2.2 前向传播与反向传播

前向传播（Forward Pass）：从输入到输出，按拓扑序计算每个节点。

反向传播（Backward Pass）：从输出到输入，按反向拓扑序计算梯度。

关键观察：

前向传播中，每个节点保存其值（用于反向传播）
反向传播中，每个节点计算其对输出损失的梯度
反向传播的时间复杂度 = 前向传播的常数倍

2.3 计算图的分解

任意复杂函数都可以分解为基本操作的组合：

算术： $+$ , $-$ , $\times$ , $\div$
激活函数： $σ$ , $tanh$ , $ReLU$
矩阵运算：matmul, transpose
归约：sum, mean, max
广播：broadcasting

class Node:
    """计算图节点"""
    def __init__(self, value, parents=(), op=''):
        self.value = value          # 前向值
        self.parents = parents      # 父节点
        self.op = op                # 生成此节点的操作
        self.grad = None            # 反向梯度（∂L/∂self）
        self._backward = lambda: None  # 反向传播函数
    
    def __repr__(self):
        return f"Node({self.value}, op={self.op})"

三、Jacobian 矩阵

3.1 定义与几何意义

设 $f : R^{n} \to R^{m}$ ，其 Jacobian 矩阵是：

J_{f} (x) = \frac{\partial f _{1}}{\partial x _{1}} ⋮ \frac{\partial f _{m}}{\partial x _{1}} \dots ⋱ \dots \frac{\partial f _{1}}{\partial x _{n}} ⋮ \frac{\partial f _{m}}{\partial x _{n}} \in R^{m \times n}

几何意义：

Jacobian 是线性变换 $f$ 在 $x$ 处的局部近似
$f (x + Δ x) \approx f (x) + J_{f} (x) \cdot Δ x$
矩阵的秩是局部维度

3.2 雅可比链式法则

定理：若 $z = f (y), y = g (x)$ ，则

\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \cdot \frac{\partial y}{\partial x}

矩阵乘法自然地实现了链式法则。

3.3 反向传播的 Jacobian 视角

反向传播的每一步实际上是：

\nabla_{x} L = J_{f (x)}^{T} \nabla_{y} L

即：上游梯度 $\nabla_{y} L$ 通过 Jacobian 的转置流向下游。

示例：

$y = Wx$ ， $W \in R^{m \times n}$ ， $x \in R^{n}$
$J_{f} = W$ （常数）
$\nabla_{x} L = W^{T} \nabla_{y} L$

3.4 神经网络层的 Jacobian

线性层 $y = Wx + b$ ：

\frac{\partial y}{\partial x} = W, \frac{\partial y}{\partial W} = x^{T} \otimes I, \frac{\partial y}{\partial b} = I

（实际上是用 $\frac{\partial L}{\partial W} = \nabla_{y} L \cdot x^{T}$ ）

ReLU $y = max (0, x)$ ：

\frac{\partial y}{\partial x} = 1 [x > 0]

Softmax $y_{i} = \frac{e ^{x_{i}}}{\sum _{j} e ^{x_{j}}}$ ：

\frac{\partial y _{i}}{\partial x _{j}} = y_{i} (δ_{ij} - y_{j})

即 Jacobian = $diag (y) - y y^{T}$ 。

四、Hessian 矩阵

4.1 定义

设 $L : R^{n} \to R$ 是标量函数，Hessian 矩阵是：

H = \nabla^{2} L = \frac{\partial ^{2} L}{\partial x _{1}^{2}} ⋮ \frac{\partial ^{2} L}{\partial x _{n} \partial x _{1}} \dots ⋱ \dots \frac{\partial ^{2} L}{\partial x _{1} \partial x _{n}} ⋮ \frac{\partial ^{2} L}{\partial x _{n}^{2}} \in R^{n \times n}

关键性质：

对称（ $L$ 二阶连续可微时）
特征值是局部曲率

4.2 Hessian 与优化

牛顿法：

x_{t + 1} = x_{t} - H^{- 1} \nabla L (x_{t})

二次型：

L (x + Δ x) \approx L (x) + \nabla L \cdot Δ x + \frac{1}{2} Δ x^{T} H Δ x

Hessian 条件数：

κ (H) = \frac{λ _{m a x} ( H )}{λ _{m i n} ( H )}

$κ$ 大 → 优化难（病态）。

4.3 Hessian-vector product (HVP)

Hessian 完整计算太昂贵（ $O (n^{2})$ 空间），但 Hessian-vector product 是高效的（ $O (n)$ 时间）。

关键观察：

Hv = \nabla (\nabla L \cdot v)

实现：用 reverse-on-forward 自动微分。

def hessian_vector_product(loss_fn, params, v):
    """
    Hessian-vector product: Hv
    loss_fn: 计算损失的函数
    params: 参数列表
    v: 向量
    """
    # 1. 前向计算损失
    loss = loss_fn()
    
    # 2. 计算梯度 ∇L
    grads = torch.autograd.grad(loss, params, create_graph=True)
    
    # 3. 计算 ∇L · v
    grad_v = sum((g * v_i).sum() for g, v_i in zip(grads, v))
    
    # 4. 计算 ∇(∇L · v) = Hv
    Hv = torch.autograd.grad(grad_v, params)
    return Hv

五、自动微分

5.1 三种自动微分

模式	适用场景	计算复杂度
数值微分	简单	$O (n)$ 调用，每次 $O (1)$
符号微分	复杂表达式	表达式膨胀
前向模式 AD	$n ≪ m$ （输入少）	$O (n)$ 次前向
反向模式 AD	$m ≪ n$ （输出少）	$O (m)$ 次反向

反向模式 AD = 反向传播（用于深度学习）。

5.2 前向模式 AD

核心思想：对每个变量同时跟踪值和梯度（种子）。

class Dual:
    """前向模式 AD 的对偶数"""
    def __init__(self, real, dual=0):
        self.real = real  # 实际值
        self.dual = dual  # 导数（种子）
    
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.real + other.real, self.dual + other.dual)
    
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.real * other.real, 
                    self.real * other.dual + self.dual * other.real)
    
    def sin(self):
        return Dual(np.sin(self.real), np.cos(self.real) * self.dual)
 
def f(x):
    return x * x + Dual(3) * x + Dual(5)
 
# 计算 df/dx 在 x=2 的值
x = Dual(2.0, dual=1.0)  # 种子 = 1
y = f(x)
print(f"f(2) = {y.real}, f'(2) = {y.dual}")

复杂度：每个操作做 2 个浮点运算，因此计算梯度是 $O (n)$ 步（前向传播的常数倍）。

5.3 反向模式 AD

核心思想：先前向计算所有中间值，然后反向计算梯度。

操作（以乘法为例）：

前向： $c = a \cdot b$
反向： $\nabla a + = b \cdot \nabla c$ ， $\nabla b + = a \cdot \nabla c$

def backward_mode_example():
    """反向模式 AD 示例：f(x, y) = x*y + sin(x)"""
    # 前向
    x = 2.0
    y = 3.0
    a = x * y         # 6
    b = np.sin(x)     # sin(2)
    z = a + b         # 6 + sin(2)
    
    # 反向
    grad_z = 1.0
    grad_a = grad_z  # ∂z/∂a = 1
    grad_b = grad_z  # ∂z/∂b = 1
    grad_x = grad_a * y + grad_b * np.cos(x)  # ∂z/∂x = y + cos(x)
    grad_y = grad_a * x  # ∂z/∂y = x
    
    print(f"z = {z:.4f}")
    print(f"∂z/∂x = {grad_x:.4f} (expected: {y + np.cos(x):.4f})")
    print(f"∂z/∂y = {grad_y:.4f} (expected: {x:.4f})")

复杂度：对每个中间节点访问常数次，总复杂度 = $O (m)$ 次反向（与输入维度无关）。

这就是为什么反向模式 AD适合深度学习：输入维度高（百万参数），输出维度低（标量损失）。

5.4 混合模式 AD

前向-反向混合：对部分变量用前向模式，对其他用反向模式。

实现：嵌套的 AD 调用。

应用：Hessian-vector product、计算特定参数的梯度。

六、特殊结构的反向传播

6.1 卷积层的反向传播

卷积操作： $Y = X * W$ ，其中 $X \in R^{H \times W \times C_{in}}$ ， $W \in R^{k \times k \times C_{in} \times C_{o u t}}$ 。

反向传播推导：

设 $L$ 是损失， $\frac{\partial L}{\partial Y}$ 已知。

对 $W$ 的梯度：

\frac{\partial L}{\partial W _{ij}} = X * \frac{\partial L}{\partial Y}_{相应位置}

对 $X$ 的梯度（转置卷积 / 反卷积）：

\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} *_{T} W

其中 $*_{T}$ 表示转置卷积（用翻转的核做卷积）。

关键洞察：卷积的 Jacobian 是 Toeplitz 矩阵（每个对角线元素相同）。

def conv2d_backward(X, W, dY):
    """
    简化的 2D 卷积反向传播
    X: (B, C_in, H, W) 输入
    W: (C_out, C_in, kH, kW) 卷积核
    dY: (B, C_out, H_out, W_out) 上游梯度
    """
    B, C_in, H, W_ = X.shape
    C_out, _, kH, kW = W.shape
    
    # 1. 对 W 的梯度：dL/dW = X * dL/dY
    dW = np.zeros_like(W)
    for b in range(B):
        for co in range(C_out):
            for ci in range(C_in):
                dW[co, ci] += conv2d(X[b, ci], dY[b, co])
    
    # 2. 对 X 的梯度：dL/dX = 转置卷积 dL/dY with W
    dX = np.zeros_like(X)
    for b in range(B):
        for ci in range(C_in):
            for co in range(C_out):
                dX[b, ci] += conv_transpose2d(dY[b, co], W[co, ci])
    
    # 3. 对 bias 的梯度
    db = dY.sum(axis=(0, 2, 3))
    return dX, dW, db

6.2 RNN 的 BPTT

循环神经网络（RNN）： $h_{t} = f (h_{t - 1}, x_{t})$ ， $y_{t} = g (h_{t})$ 。

BPTT (Backpropagation Through Time)：将 RNN “展开”为前向网络。

梯度推导：

\frac{\partial L}{\partial h _{t}} = \frac{\partial L}{\partial y _{t}} \frac{\partial y _{t}}{\partial h _{t}} + \frac{\partial L}{\partial h _{t + 1}} \frac{\partial h _{t + 1}}{\partial h _{t}}

梯度消失/爆炸：

\frac{\partial L}{\partial h _{0}} = t = 1 \sum T \frac{\partial L}{\partial h _{t}} k = 1 \prod t \frac{\partial h _{k}}{\partial h _{k - 1}}

如果 $\frac{\partial h _{k}}{\partial h _{k - 1}} > 1$ ，梯度爆炸；如果 $< 1$ ，梯度消失。

LSTM 的解决方案：用加性更新 $\tilde{c}_{t} = f_{t} ⊙ c_{t - 1} + i_{t} ⊙ \tilde{c}_{t}$ ，使梯度沿时间稳定传播。

def bptt(rnn, X, Y, T):
    """
    BPTT 实现
    X: (B, T, d_in) 输入序列
    Y: (B, T, d_out) 目标序列
    """
    B, T, d_in = X.shape
    h = rnn.initial_hidden(B)
    
    # 前向
    hs, ys = [], []
    for t in range(T):
        h, y = rnn.step(X[:, t], h)
        hs.append(h)
        ys.append(y)
    
    # 计算损失
    loss = ((torch.stack(ys) - Y) ** 2).mean()
    
    # 反向（PyTorch 自动）
    loss.backward()
    
    return loss

6.3 注意力层的反向传播

注意力 $Y = softmax (Q K^{T} / d_{k}) V$ 。

设 $A = softmax (S)$ ， $S = Q K^{T} / d_{k}$ ， $Y = AV$ 。

反向传播：

给定 $G_{Y} = \frac{\partial L}{\partial Y}$ ：

G_{V} = A^{T} G_{Y}

G_{A} = G_{Y} V^{T}

Softmax 反向（关键步骤）：

G_{S} = A ⊙ (G_{A} - i \sum G_{A_{ij}})

即： $\frac{\partial L}{\partial S _{ij}} = A_{ij} (G_{A_{ij}} - \sum_{k} G_{A_{ik}})$

def softmax_backward(dA, A):
    """softmax 反向传播"""
    # dA: 上游梯度
    # A: softmax 输出
    # 返回：dS（softmax 输入的梯度）
    sum_dA = dA.sum(dim=-1, keepdim=True)  # (..., 1)
    dS = A * (dA - sum_dA)
    return dS

最后：

G_{Q} = G_{S} K / d_{k}, G_{K} = G_{S}^{T} Q / d_{k}

七、数值稳定性

7.1 梯度检查

梯度检查（Gradient Check）：用数值微分验证解析梯度。

\frac{\partial L}{\partial θ _{i}} \approx \frac{L ( θ + ϵ e _{i} ) - L ( θ - ϵ e _{i} )}{2 ϵ}

def gradient_check(loss_fn, params, epsilon=1e-5, tol=1e-5):
    """梯度检查"""
    loss = loss_fn()
    loss.backward()
    
    for i, p in enumerate(params):
        analytical_grad = p.grad.clone()
        
        # 数值梯度
        p.data[i] += epsilon
        loss_plus = loss_fn().item()
        p.data[i] -= 2 * epsilon
        loss_minus = loss_fn().item()
        p.data[i] += epsilon
        numerical_grad = (loss_plus - loss_minus) / (2 * epsilon)
        
        # 比较
        diff = abs(analytical_grad[i] - numerical_grad)
        rel = diff / (abs(analytical_grad[i]) + abs(numerical_grad) + 1e-8)
        if rel > tol:
            print(f"参数 {i}: 解析={analytical_grad[i]:.6e}, "
                  f"数值={numerical_grad:.6e}, 相对误差={rel:.6e}")
        else:
            print(f"参数 {i}: OK")

7.2 数值问题

1. 梯度消失/爆炸：

解决：归一化（BatchNorm、LayerNorm）、残差连接、梯度裁剪

2. log(0) = -∞：

解决：log-sum-exp 技巧：

lo g i \sum e^{x_{i}} = m + lo g i \sum e^{x_{i} - m}, m = i max x_{i}

3. Softmax 溢出：

解决：减去最大值（数值等价）：

softmax (x_{i}) = \frac{e ^{x_{i} - m}}{\sum _{j} e ^{x_{j} - m}}

4. 除以零：

解决： $ϵ$ 平滑，例如 $1/ x^{2} + ϵ$

def log_softmax(x):
    """数值稳定的 log-softmax"""
    m = x.max(dim=-1, keepdim=True).values
    return x - m - torch.log(torch.exp(x - m).sum(dim=-1, keepdim=True))
 
def log_sum_exp(x):
    """数值稳定的 log-sum-exp"""
    m = x.max()
    return m + torch.log(torch.exp(x - m).sum())

八、通用反向传播引擎（NumPy 实现）

import numpy as np
from collections import defaultdict
 
class Tensor:
    """支持自动微分的张量"""
    def __init__(self, data, requires_grad=False, _children=(), _op=''):
        self.data = np.asarray(data, dtype=np.float32)
        self.requires_grad = requires_grad
        self.grad = None
        self._backward = lambda: None
        self._children = set(_children)
        self._op = _op
    
    def __repr__(self):
        return f"Tensor(shape={self.data.shape}, op={self._op})"
    
    # === 基本运算 ===
    def __add__(self, other):
        other = other if isinstance(other, Tensor) else Tensor(other)
        out = Tensor(self.data + other.data, 
                     requires_grad=self.requires_grad or other.requires_grad,
                     _children=(self, other), _op='+')
        
        def _backward():
            if self.requires_grad:
                self.grad = self._accumulate_grad(self.grad, out.grad)
            if other.requires_grad:
                other.grad = other._accumulate_grad(other.grad, out.grad)
        out._backward = _backward
        return out
    
    def __mul__(self, other):
        other = other if isinstance(other, Tensor) else Tensor(other)
        out = Tensor(self.data * other.data,
                     requires_grad=self.requires_grad or other.requires_grad,
                     _children=(self, other), _op='*')
        
        def _backward():
            if self.requires_grad:
                self.grad = self._accumulate_grad(self.grad, other.data * out.grad)
            if other.requires_grad:
                other.grad = other._accumulate_grad(other.grad, self.data * out.grad)
        out._backward = _backward
        return out
    
    def __matmul__(self, other):
        other = other if isinstance(other, Tensor) else Tensor(other)
        out = Tensor(self.data @ other.data,
                     requires_grad=self.requires_grad or other.requires_grad,
                     _children=(self, other), _op='@')
        
        def _backward():
            if self.requires_grad:
                self.grad = self._accumulate_grad(self.grad, out.grad @ other.data.T)
            if other.requires_grad:
                other.grad = other._accumulate_grad(other.grad, self.data.T @ out.grad)
        out._backward = _backward
        return out
    
    def relu(self):
        out = Tensor(np.maximum(0, self.data),
                     requires_grad=self.requires_grad,
                     _children=(self,), _op='relu')
        def _backward():
            if self.requires_grad:
                mask = (self.data > 0).astype(np.float32)
                self.grad = self._accumulate_grad(self.grad, mask * out.grad)
        out._backward = _backward
        return out
    
    def sum(self):
        out = Tensor(self.data.sum(),
                     requires_grad=self.requires_grad,
                     _children=(self,), _op='sum')
        def _backward():
            if self.requires_grad:
                self.grad = self._accumulate_grad(self.grad, 
                    np.broadcast_to(out.grad, self.data.shape))
        out._backward = _backward
        return out
    
    def _accumulate_grad(self, existing, new):
        if existing is None:
            return new
        return existing + new
    
    # === 反向传播 ===
    def backward(self, grad=None):
        if grad is None:
            grad = np.ones_like(self.data)
        self.grad = grad
        
        # 拓扑排序
        topo = []
        visited = set()
        def build_topo(node):
            if node not in visited:
                visited.add(node)
                for child in node._children:
                    build_topo(child)
                topo.append(node)
        build_topo(self)
        
        # 反向传播
        for node in reversed(topo):
            node._backward()
 
 
# === 测试 ===
def test_simple_function():
    """测试简单函数：f(x, y) = (x + y) * (x - y)"""
    x = Tensor([2.0], requires_grad=True)
    y = Tensor([3.0], requires_grad=True)
    
    a = x + y       # 5
    b = x - y       # -1
    f = a * b       # -5
    
    f.backward()
    
    print(f"x.grad = {x.grad} (expected: 2x = 4)")
    print(f"y.grad = {y.grad} (expected: -2y = -6)")
 
 
def test_mlp():
    """测试两层 MLP"""
    np.random.seed(42)
    X = Tensor(np.random.randn(4, 3), requires_grad=False)
    W1 = Tensor(np.random.randn(3, 5) * 0.1, requires_grad=True)
    b1 = Tensor(np.zeros(5), requires_grad=True)
    W2 = Tensor(np.random.randn(5, 2) * 0.1, requires_grad=True)
    b2 = Tensor(np.zeros(2), requires_grad=True)
    
    # 前向
    h = (X @ W1 + b1).relu()
    out = h @ W2 + b2
    loss = (out * out).sum()  # 平方损失（伪目标为 0）
    
    # 反向
    loss.backward()
    
    print(f"Loss = {loss.data[0]:.4f}")
    print(f"W1.grad shape = {W1.grad.shape}")
    print(f"W2.grad shape = {W2.grad.shape}")

九、关键洞察总结

9.1 三大数学对象

链式法则：反向传播的理论基础
Jacobian 矩阵：多元导数的紧凑表示
Hessian 矩阵：二阶优化、曲率分析

9.2 三大计算范式

前向模式 AD：对输出求导高效
反向模式 AD：对输入求导高效（深度学习用这个）
混合模式 AD：结合两者

9.3 三大数值挑战

梯度消失/爆炸：深层网络的根本问题
数值溢出：softmax、log 等函数
病态 Hessian：二阶优化困难

十、与其他专题的连接

自动微分库：PyTorch autograd、JAX grad、TensorFlow GradientTape
神经 ODE：连续时间 BP
Transformer 数学：注意力矩阵的反向传播
优化器：SGD、Adam、Muon 都依赖 BP

参考资料

Rumelhart, D.E. et al. (1986). Learning representations by back-propagating errors. Nature 323:533-536.
Baydin, A.G. et al. (2017). Automatic Differentiation in Machine Learning: a Survey. JMLR 18:1-43. arXiv:1502.05767
Griewank, A. & Walther, A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM.
Goodfellow, I. et al. (2016). Deep Learning. MIT Press. Chapter 6.
Paszke, A. et al. (2017). Automatic Differentiation in PyTorch. NeurIPS Autodiff Workshop.
Chen, T.Q. et al. (2018). Neural Ordinary Differential Equations. NeurIPS 2018.
Werbos, P.J. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard.
Linnainmaa, S. (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. MSc thesis (in Finnish), University of Helsinki.

最后更新：2026-06-22

Metaphor

探索

反向传播完整推导：从链式法则到通用自动微分

反向传播完整推导：从链式法则到通用自动微分

引言

一、链式法则基础

1.1 单变量链式法则

1.2 多元链式法则

二、计算图视角

2.1 计算图定义

2.2 前向传播与反向传播

2.3 计算图的分解

三、Jacobian 矩阵

3.1 定义与几何意义

3.2 雅可比链式法则

3.3 反向传播的 Jacobian 视角

3.4 神经网络层的 Jacobian

四、Hessian 矩阵

4.1 定义

4.2 Hessian 与优化

4.3 Hessian-vector product (HVP)

五、自动微分

5.1 三种自动微分

5.2 前向模式 AD

5.3 反向模式 AD

5.4 混合模式 AD

六、特殊结构的反向传播

6.1 卷积层的反向传播

6.2 RNN 的 BPTT

6.3 注意力层的反向传播

七、数值稳定性

7.1 梯度检查

7.2 数值问题

八、通用反向传播引擎（NumPy 实现）

九、关键洞察总结

9.1 三大数学对象

9.2 三大计算范式

9.3 三大数值挑战

十、与其他专题的连接

参考资料

关系图谱

目录

反向链接