Implementing an MLP from Scratch

Overview

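The multilayer perceptron (MLP) is the most basic and one of the most important architectures in deep learning. This document is a complete from-scratch implementation guide covering:

NumPy implementation from scratch: understanding every mathematical detail
Full PyTorch implementation: a production-grade training workflow
Initialization schemes: Xavier, He, LSUV, orthogonal initialization
Regularization techniques: Dropout, BatchNorm, Weight Decay, LayerNorm
Optimizers: SGD, Adam, AdamW
Learning-rate schedules: Step, Cosine, Warmup
End-to-end training pipeline: MNIST/Fashion-MNIST classification

The MLP is the "atom" of every deep learning architecture; a solid grasp of its implementation details is essential for understanding CNNs, RNNs, and Transformers.[^1]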

1. Mathematical Foundations of the MLP

1.1 Network Structure

An $L$-layer MLP is defined, for $l = 1, \dots, L$, by:

$$ z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = \sigma_l(z^{(l)}) $$

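where:

$a^{(0)} = x$ is the input
$W^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$ is the weight matrix of layer $l$
$b^{(l)} \in \mathbb{R}^{d_l}$ is the bias vector
$\sigma_l$ is the activation function of layer $l$
$a^{(L)}$ is the network output
$\hat{y} = \mathrm{softmax}(a^{(L)})$ is the vector of predicted class probabilities (for classification, with $\sigma_L$ taken as the identity so that $a^{(L)}$ holds the logits)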
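
The recursion above can be sketched in a few lines of NumPy. This is a minimal illustration in the column-vector convention of the equations (so each $W^{(l)}$ has shape $d_l \times d_{l-1}$); the layer sizes, the ReLU hidden activation, and the helper names `forward` and `softmax` are assumptions made for the example.

```python
import numpy as np


def softmax(v):
    """Numerically stable softmax for a 1-D vector."""
    e = np.exp(v - np.max(v))
    return e / e.sum()


def forward(x, weights, biases):
    """Apply z = W a + b, a = sigma(z) layer by layer.

    Hidden layers use ReLU; the last layer uses the identity so that
    a^(L) holds logits (an assumption made for this sketch).
    """
    a = x
    for l, (W, b) in enumerate(zip(weights, biases), start=1):
        z = W @ a + b
        a = z if l == len(weights) else np.maximum(0.0, z)
    return a


rng = np.random.default_rng(0)
dims = [4, 5, 3]  # d_0, d_1, d_2 (hypothetical sizes)
weights = [rng.standard_normal((dims[l], dims[l - 1])) for l in range(1, len(dims))]
biases = [np.zeros(dims[l]) for l in range(1, len(dims))]

x = rng.standard_normal(dims[0])
logits = forward(x, weights, biases)
y_hat = softmax(logits)
print(logits.shape, y_hat.sum())
```

The output has one entry per class, and the softmax turns it into a probability vector that sums to 1.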

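1.2 Loss Functions

Cross-entropy loss (multi-class classification), for $N$ samples and $K$ classes:

$$ L(\hat{y}, y) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik} $$

Mean squared error (regression):

$$ L(\hat{y}, y) = \frac{1}{N} \sum_{i=1}^{N} \lVert \hat{y}_i - y_i \rVert^2 $$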
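
A small worked example may make the two formulas concrete. The sketch below evaluates both losses on a hand-picked batch of two samples; the function names and the numbers are illustrative assumptions. The cross-entropy is the summed form written above; implementations commonly divide it by $N$ as well.

```python
import numpy as np


def cross_entropy(y_hat, y, eps=1e-15):
    """Summed cross-entropy: -sum_i sum_k y_ik * log(y_hat_ik)."""
    return -np.sum(y * np.log(np.clip(y_hat, eps, 1.0)))


def mse(y_hat, y):
    """(1/N) * sum_i ||y_hat_i - y_i||^2."""
    return np.mean(np.sum((y_hat - y) ** 2, axis=1))


y = np.array([[1.0, 0.0], [0.0, 1.0]])          # one-hot targets
y_hat = np.array([[0.5, 0.5], [0.25, 0.75]])    # predicted probabilities

ce = cross_entropy(y_hat, y)   # equals -(log 0.5 + log 0.75)
err = mse(y_hat, y)            # equals (0.5 + 0.125) / 2
print(ce, err)
```

Only the probability assigned to the true class enters the cross-entropy, whereas the squared error penalizes every output component.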

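1.3 Backpropagation

By the chain rule, the gradient of the loss with respect to the weights is:

$$ \frac{\partial L}{\partial W^{(l)}} = \frac{\partial L}{\partial z^{(l)}} (a^{(l-1)})^{T} = \delta^{(l)} (a^{(l-1)})^{T} $$

where $\delta^{(l)} = \frac{\partial L}{\partial z^{(l)}}$ is the error signal of layer $l$; the bias gradient is simply $\frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}$.

Recurrence relation:

$$ \delta^{(l)} = \left( (W^{(l+1)})^{T} \delta^{(l+1)} \right) \odot \sigma_l'(z^{(l)}) $$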
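
One way to gain confidence in the recurrence is to compare it against finite differences. The sketch below does this for a two-layer network with a tanh hidden layer, an identity output, and the loss $\frac{1}{2}\lVert a^{(2)} - y \rVert^2$; these choices, the layer sizes, and the variable names are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3)                      # a^(0)
y = rng.standard_normal(2)                      # target
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)


def loss(W1_):
    """Loss as a function of the first-layer weights only."""
    a1 = np.tanh(W1_ @ x + b1)
    a2 = W2 @ a1 + b2                           # identity output activation
    return 0.5 * np.sum((a2 - y) ** 2)


# Backward pass using the recurrence
z1 = W1 @ x + b1
a1 = np.tanh(z1)
a2 = W2 @ a1 + b2
delta2 = a2 - y                                 # dL/dz^(2) for this loss
delta1 = (W2.T @ delta2) * (1.0 - np.tanh(z1) ** 2)
grad_W1 = np.outer(delta1, x)                   # delta^(1) (a^(0))^T

# Central finite differences, one weight at a time
eps = 1e-6
num_grad = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num_grad[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

max_err = np.max(np.abs(grad_W1 - num_grad))
print(f"max abs difference: {max_err:.2e}")
```

The same check is a useful debugging tool for any hand-written backward pass: a large discrepancy usually points to a missing transpose or a forgotten activation derivative.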

2. NumPy Implementation from Scratch

2.1 Full Implementation

import numpy as np
 
 
class MLP:
    """A multilayer perceptron implemented from scratch in NumPy."""

    def __init__(self, layer_dims, activations, init_method='he',
                 dropout_rate=0.0, weight_decay=0.0):
        """
        Args:
            layer_dims: size of each layer, e.g. [784, 256, 128, 10]
            activations: activation per layer, e.g. ['relu', 'relu', 'softmax']
            init_method: weight initialization method
            dropout_rate: probability of dropping a unit
            weight_decay: L2 regularization coefficient
        """
        self.num_layers = len(layer_dims) - 1
        self.layer_dims = layer_dims
        self.activations = activations
        self.dropout_rate = dropout_rate
        self.weight_decay = weight_decay
        
        # Initialize the parameters
        self.parameters = {}
        self._initialize_parameters(init_method)

        # Training-mode flag (controls dropout)
        self.training = True
    
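    def _initialize_parameters(self, method):
        """Initialize weights and biases."""
        for l in range(1, self.num_layers + 1):
            fan_in = self.layer_dims[l - 1]
            fan_out = self.layer_dims[l]

            if method == 'xavier':
                # Xavier/Glorot: Var(W) = 2 / (fan_in + fan_out)
                scale = np.sqrt(2.0 / (fan_in + fan_out))
            elif method == 'he':
                # He/Kaiming: Var(W) = 2 / fan_in (for ReLU)
                scale = np.sqrt(2.0 / fan_in)
            elif method == 'xavier_normal':
                # Note: Var(W) = 1 / fan_in is the LeCun-style scaling;
                # the Glorot formula is the 'xavier' branch above
                scale = np.sqrt(1.0 / fan_in)
            else:
                # Fallback: small fixed standard deviation
                scale = 0.01

            # Weights are stored as (fan_in, fan_out), i.e. the transpose of
            # W^{(l)} in the equations, because inputs are row vectors (A @ W)
            self.parameters[f'W{l}'] = np.random.randn(fan_in, fan_out) * scale
            self.parameters[f'b{l}'] = np.zeros((1, fan_out))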
    
    def _activate(self, z, activation):
        """Apply the named activation function."""
        if activation == 'relu':
            return np.maximum(0, z)
        elif activation == 'sigmoid':
            return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))
        elif activation == 'tanh':
            return np.tanh(z)
        elif activation == 'softmax':
            z_shifted = z - np.max(z, axis=1, keepdims=True)
            exp_z = np.exp(z_shifted)
            return exp_z / np.sum(exp_z, axis=1, keepdims=True)
        elif activation == 'leaky_relu':
            return np.where(z > 0, z, 0.01 * z)
        elif activation == 'identity':
            return z
        else:
            raise ValueError(f"Unknown activation function: {activation}")
    
    def _activate_derivative(self, z, activation):
        """Derivative of the named activation function with respect to z."""
        if activation == 'relu':
            return (z > 0).astype(float)
        elif activation == 'sigmoid':
            s = self._activate(z, 'sigmoid')
            return s * (1 - s)
        elif activation == 'tanh':
            return 1 - np.tanh(z) ** 2
        elif activation == 'leaky_relu':
            return np.where(z > 0, 1.0, 0.01)
        elif activation in ('softmax', 'identity'):
            # The softmax derivative is folded into the cross-entropy gradient
            return np.ones_like(z)
        else:
            raise ValueError(f"Unknown activation function: {activation}")
    
    def forward(self, X):
        """Forward pass; caches Z and A of every layer for backward()."""
        self.cache = {'A0': X}
        self.dropout_masks = {}
        
        A = X
        for l in range(1, self.num_layers + 1):
            W = self.parameters[f'W{l}']
            b = self.parameters[f'b{l}']
            activation = self.activations[l - 1]
            
            # Linear transform
            Z = A @ W + b
            self.cache[f'Z{l}'] = Z

            # Activation
            A = self._activate(Z, activation)

            # Inverted dropout (skipped on the output layer)
            if self.training and self.dropout_rate > 0 and l < self.num_layers:
                mask = (np.random.rand(*A.shape) > self.dropout_rate).astype(float)
                A = A * mask / (1 - self.dropout_rate)
                self.dropout_masks[f'D{l}'] = mask
            
            self.cache[f'A{l}'] = A
        
        return A
    
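    def compute_loss(self, Y_pred, Y_true):
        """Compute the loss (cross-entropy or MSE) plus the L2 penalty."""
        m = Y_true.shape[0]

        if self.activations[-1] == 'softmax':
            # Cross-entropy, averaged over the batch
            epsilon = 1e-15
            Y_pred_clipped = np.clip(Y_pred, epsilon, 1 - epsilon)
            loss = -np.mean(np.sum(Y_true * np.log(Y_pred_clipped), axis=1))
        else:
            # Mean squared error: squared norm per sample, averaged over the
            # batch, which matches the 2 * (Y_pred - Y_true) / m gradient
            # used in backward()
            loss = np.mean(np.sum((Y_pred - Y_true) ** 2, axis=1))

        # L2 regularization
        if self.weight_decay > 0:
            reg_loss = 0
            for l in range(1, self.num_layers + 1):
                reg_loss += np.sum(self.parameters[f'W{l}'] ** 2)
            loss += 0.5 * self.weight_decay * reg_loss / m

        return loss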
    
    def backward(self, Y_pred, Y_true):
        """Backward pass; returns a dict of gradients dW{l}, db{l}."""
        m = Y_true.shape[0]
        gradients = {}

        # Error at the output layer (same branching as compute_loss)
        if self.activations[-1] == 'softmax':
            # Combined softmax + cross-entropy gradient with respect to Z
            dA = (Y_pred - Y_true) / m
        else:
            # Gradient of the MSE loss with respect to the output activation
            dA = 2 * (Y_pred - Y_true) / m

        for l in reversed(range(1, self.num_layers + 1)):
            W = self.parameters[f'W{l}']
            Z = self.cache[f'Z{l}']
            A_prev = self.cache[f'A{l-1}']
            activation = self.activations[l - 1]

            # Apply the same dropout mask (and scaling) used in forward()
            if f'D{l}' in self.dropout_masks:
                dA = dA * self.dropout_masks[f'D{l}'] / (1 - self.dropout_rate)

            # Gradient through the activation
            if activation == 'softmax':
                # Already folded into the cross-entropy gradient above
                dZ = dA
            else:
                dZ = dA * self._activate_derivative(Z, activation)

            # Parameter gradients (the L2 term matches compute_loss)
            gradients[f'dW{l}'] = A_prev.T @ dZ + self.weight_decay * W / m
            gradients[f'db{l}'] = np.sum(dZ, axis=0, keepdims=True)

            # Propagate to the previous layer
            dA = dZ @ W.T

        return gradients
    
    def update_parameters(self, gradients, learning_rate):
        """Plain gradient-descent update."""
        for l in range(1, self.num_layers + 1):
            self.parameters[f'W{l}'] -= learning_rate * gradients[f'dW{l}']
            self.parameters[f'b{l}'] -= learning_rate * gradients[f'db{l}']
    
    def train(self, X, Y, X_val=None, Y_val=None, epochs=100, batch_size=32,
              learning_rate=0.001, verbose=True):
        """Mini-batch SGD training loop; returns the loss/accuracy history."""
        history = {'train_loss': [], 'val_loss': [], 'val_acc': []}
        n_samples = X.shape[0]
        
        for epoch in range(epochs):
            # Shuffle
            indices = np.random.permutation(n_samples)
            X_shuffled = X[indices]
            Y_shuffled = Y[indices]
            
            epoch_loss = 0
            n_batches = 0
            
            for i in range(0, n_samples, batch_size):
                X_batch = X_shuffled[i:i+batch_size]
                Y_batch = Y_shuffled[i:i+batch_size]
                
                # Forward pass
                Y_pred = self.forward(X_batch)

                # Compute the loss
                loss = self.compute_loss(Y_pred, Y_batch)
                epoch_loss += loss
                n_batches += 1

                # Backward pass
                gradients = self.backward(Y_pred, Y_batch)

                # Update the parameters
                self.update_parameters(gradients, learning_rate)
            
            avg_loss = epoch_loss / n_batches
            history['train_loss'].append(avg_loss)
            
            # Validation (dropout disabled)
            if X_val is not None:
                self.training = False
                Y_val_pred = self.forward(X_val)
                val_loss = self.compute_loss(Y_val_pred, Y_val)
                val_acc = np.mean(np.argmax(Y_val_pred, axis=1) == np.argmax(Y_val, axis=1))
                history['val_loss'].append(val_loss)
                history['val_acc'].append(val_acc)
                self.training = True
                
                if verbose and (epoch + 1) % 10 == 0:
                    print(f"Epoch {epoch+1}/{epochs} - "
                          f"train_loss: {avg_loss:.4f} - "
                          f"val_loss: {val_loss:.4f} - "
                          f"val_acc: {val_acc:.4f}")
            elif verbose and (epoch + 1) % 10 == 0:
                print(f"Epoch {epoch+1}/{epochs} - train_loss: {avg_loss:.4f}")
        
        return history
    
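    def predict(self, X):
        """Predict class indices with dropout disabled."""
        was_training = self.training
        self.training = False
        Y_pred = self.forward(X)
        self.training = was_training  # restore the previous mode
        return np.argmax(Y_pred, axis=1)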
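
The `'he'` and `'xavier'` scales used in `_initialize_parameters` can be compared empirically. The sketch below pushes a random batch through a stack of ReLU layers and measures the standard deviation of the final activations under each scale; the depth, width, batch size, and the helper name `final_activation_std` are assumptions chosen for the illustration, and biases are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1024, 512))  # hypothetical input batch


def final_activation_std(scale_fn, depth=10, width=512):
    """Push x through `depth` random ReLU layers and return the output std."""
    a = x
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * scale_fn(width, width)
        a = np.maximum(0.0, a @ W)
    return a.std()


he_std = final_activation_std(lambda fan_in, fan_out: np.sqrt(2.0 / fan_in))
xavier_std = final_activation_std(
    lambda fan_in, fan_out: np.sqrt(2.0 / (fan_in + fan_out))
)
print(f"He: {he_std:.4f}  Xavier: {xavier_std:.4f}")
```

With He scaling the activation scale should stay roughly constant with depth. With Xavier scaling and ReLU, the second moment is roughly halved at every layer when `fan_in == fan_out`, so the signal shrinks quickly in deep stacks.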

2.2 Training and Testing on MNIST

def load_mnist_simple():
    """Simplified MNIST loader (requires torchvision)."""
    from torchvision import datasets, transforms
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,)),
        transforms.Lambda(lambda x: x.view(-1))  # flatten to a 784-vector
    ])
    train_data = datasets.MNIST('./data', train=True, download=True, transform=transform)
    test_data = datasets.MNIST('./data', train=False, transform=transform)
    
    X_train = np.stack([d[0].numpy() for d in train_data])
    Y_train = np.eye(10)[np.array([d[1] for d in train_data])]
    X_test = np.stack([d[0].numpy() for d in test_data])
    Y_test = np.eye(10)[np.array([d[1] for d in test_data])]
    
    return X_train, Y_train, X_test, Y_test
 
 
# Train the NumPy MLP
X_train, Y_train, X_test, Y_test = load_mnist_simple()
 
mlp = MLP(
    layer_dims=[784, 256, 128, 10],
    activations=['relu', 'relu', 'softmax'],
    init_method='he',
    dropout_rate=0.2,
    weight_decay=1e-4
)
 
history = mlp.train(
    X_train[:50000], Y_train[:50000],
    X_val=X_train[50000:], Y_val=Y_train[50000:],
    epochs=30,
    batch_size=64,
    learning_rate=0.001
)
 
# Evaluate on the test set
test_acc = np.mean(mlp.predict(X_test) == np.argmax(Y_test, axis=1))
print(f"Test accuracy: {test_acc:.4f}")

3. Full PyTorch Implementation

3.1 A Modular MLP

import torch
import torch.nn as nn
import torch.nn.functional as F
 
 
class LinearBlock(nn.Module):
    """Linear layer + normalization + activation + dropout."""
    
    def __init__(self, in_features, out_features, activation='relu',
                 use_batchnorm=True, dropout_rate=0.0, init_method='he'):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.use_bn = use_batchnorm
        
        if use_batchnorm:
            self.bn = nn.BatchNorm1d(out_features)
        
        # Activation function
        if activation == 'relu':
            self.activation = nn.ReLU(inplace=True)
        elif activation == 'leaky_relu':
            self.activation = nn.LeakyReLU(0.01, inplace=True)
        elif activation == 'gelu':
            self.activation = nn.GELU()
        elif activation == 'silu':
            self.activation = nn.SiLU()
        elif activation == 'tanh':
            self.activation = nn.Tanh()
        elif activation == 'identity':
            self.activation = nn.Identity()
        else:
            raise ValueError(f"Unknown activation: {activation}")
        
        # Dropout
        self.dropout = nn.Dropout(dropout_rate) if dropout_rate > 0 else nn.Identity()
        
        # Initialization
        self._init_weights(init_method, activation)
    
    def _init_weights(self, method, activation):
        """Initialize the linear layer's weights and bias."""
        if method == 'he':
            if activation in ('relu', 'leaky_relu', 'gelu', 'silu'):
                # He/Kaiming initialization; GELU/SiLU behave similarly to
                # ReLU, so the same gain is used for them
                nn.init.kaiming_normal_(self.linear.weight, nonlinearity='relu')
            else:
                # Fall back to Xavier for tanh/identity
                nn.init.xavier_normal_(self.linear.weight)
        elif method == 'xavier':
            nn.init.xavier_normal_(self.linear.weight)
        elif method == 'orthogonal':
            nn.init.orthogonal_(self.linear.weight, gain=1.0)
        elif method == 'lsuv':
            # LSUV is data-dependent and must be run later on a real batch;
            # this flag only marks the block, and the weights keep the
            # nn.Linear default initialization until then
            self._lsuv = True

        # Biases start at zero
        if self.linear.bias is not None:
            nn.init.zeros_(self.linear.bias)
    
    def forward(self, x):
        out = self.linear(x)
        if self.use_bn:
            out = self.bn(out)
        out = self.activation(out)
        out = self.dropout(out)
        return out
 
 
class MLP(nn.Module):
    """A complete multilayer perceptron."""
    
    def __init__(self, input_dim, hidden_dims, output_dim,
                 activation='relu', output_activation='identity',
                 use_batchnorm=True, dropout_rate=0.0,
                 init_method='he', weight_decay=0.0):
        super().__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.weight_decay = weight_decay  # not applied here; pass it to the optimizer

        # Build the hidden layers
        layers = []
        prev_dim = input_dim
        
        for hidden_dim in hidden_dims:
            layers.append(LinearBlock(
                in_features=prev_dim,
                out_features=hidden_dim,
                activation=activation,
                use_batchnorm=use_batchnorm,
                dropout_rate=dropout_rate,
                init_method=init_method
            ))
            prev_dim = hidden_dim
        
        # Output layer
        output_layer = nn.Linear(prev_dim, output_dim)
        if init_method == 'he':
            nn.init.kaiming_normal_(output_layer.weight)
        else:
            nn.init.xavier_normal_(output_layer.weight)
        nn.init.zeros_(output_layer.bias)
        layers.append(output_layer)
        
        # Output activation
        if output_activation == 'softmax':
            layers.append(nn.Softmax(dim=-1))
        elif output_activation == 'log_softmax':
            layers.append(nn.LogSoftmax(dim=-1))
        elif output_activation == 'sigmoid':
            layers.append(nn.Sigmoid())
        
        self.network = nn.Sequential(*layers)
    
    def forward(self, x):
        # Flatten the input
        if x.dim() > 2:
            x = x.view(x.size(0), -1)
        return self.network(x)
 
 
# Example
model = MLP(
    input_dim=784,
    hidden_dims=[512, 256, 128],
    output_dim=10,
    activation='relu',
    output_activation='log_softmax',
    use_batchnorm=True,
    dropout_rate=0.3,
    init_method='he'
)
print(model)
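
Because the example model ends in a `log_softmax` layer, its outputs are log-probabilities and would normally be paired with `F.nll_loss`. The sketch below checks, on made-up logits and targets, that this route gives the same value as applying `F.cross_entropy` to raw logits; the tensor shapes are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 10)             # hypothetical raw scores for 8 samples
targets = torch.randint(0, 10, (8,))    # hypothetical class labels

# Route 1: the model outputs log-probabilities, the loss is NLL
log_probs = F.log_softmax(logits, dim=-1)
loss_nll = F.nll_loss(log_probs, targets)

# Route 2: the model outputs raw logits, the loss applies log-softmax itself
loss_ce = F.cross_entropy(logits, targets)

print(loss_nll.item(), loss_ce.item())
```

If the model instead ended in a plain `softmax`, feeding its probabilities to `F.cross_entropy` would apply a second softmax and distort the loss, which is one reason to keep the output layer and the loss function in sync.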

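3.2 LSUV Initialization

Layer-Sequential Unit-Variance (LSUV) is a practical, data-dependent initialization method proposed by Mishkin & Matas (2015). Starting from a pre-initialized network, it rescales the weights of each layer in turn until the layer's output on a data batch has unit variance: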

def lsuv_init(model, data_batch, target_std=1.0, target_mean=0.0, max_iter=10, tol=1e-3):
    """
    LSUV initialization: rescale each layer's weights in order so that its
    output has the target standard deviation (unit variance by default).
    """
    was_training = model.training
    model.eval()

    # Register forward hooks that record each Linear/Conv2d output
    outputs = {}
    handles = []

    def hook(name):
        def fn(module, input, output):
            outputs[name] = output.detach()
        return fn

    for name, module in model.named_modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            handles.append(module.register_forward_hook(hook(name)))

    with torch.no_grad():
        # Initial forward pass to populate `outputs`
        model(data_batch)

        # Adjust the layers one by one, in registration order
        for name, module in model.named_modules():
            if isinstance(module, (nn.Conv2d, nn.Linear)) and name in outputs:
                for _ in range(max_iter):
                    # Statistics of the most recent output of this layer
                    out = outputs[name]
                    current_std = out.std().item()
                    current_mean = out.mean().item()

                    std_ok = abs(current_std - target_std) < tol
                    mean_ok = abs(current_mean - target_mean) < tol
                    if std_ok and (mean_ok or module.bias is None):
                        break

                    # Rescale the weights toward the target standard deviation
                    if not std_ok:
                        module.weight.mul_(target_std / (current_std + 1e-8))

                    # Shift the bias toward the target mean
                    if not mean_ok and module.bias is not None:
                        module.bias.add_(target_mean - current_mean)

                    # Re-run the forward pass to refresh the recorded outputs
                    model(data_batch)

    # Remove the hooks
    for handle in handles:
        handle.remove()

    # Restore the mode the model was in before the call
    model.train(was_training)
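
The core of the procedure, stripped of hooks and modules, can be sketched in NumPy. The example below rescales two stacked weight matrices one after the other so that each pre-activation has unit standard deviation on a fixed batch; the layer sizes, the deliberately bad initial scales, and the ReLU between the layers are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 64)) * 3.0            # hypothetical input batch
weights = [rng.standard_normal((64, 32)) * 0.01,    # deliberately too small
           rng.standard_normal((32, 16)) * 5.0]     # deliberately too large

a = x
stds = []
for W in weights:
    # Rescale this layer until its output std is close to 1
    for _ in range(10):
        std = (a @ W).std()
        if abs(std - 1.0) < 1e-3:
            break
        W /= std                     # in-place rescale of this layer only
    z = a @ W
    stds.append(z.std())
    a = np.maximum(0.0, z)           # ReLU output feeds the next layer

print([round(float(s), 4) for s in stds])
```

For a purely linear layer without bias a single rescaling is enough, since the output standard deviation is proportional to the weight scale. The loop matters once the mean is adjusted as well, as in `lsuv_init`.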

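4. Regularization Techniques in Detail

4.1 Dropout

Principle: during training, randomly zero out a fraction of the neuron outputs; at inference, use all neurons, with a rescaling (applied at training time in the inverted form shown here) that keeps the expected activation unchanged.

Mathematical form:

During training, with drop probability $p$:

$$ \tilde{a}^{(l)} = \frac{m^{(l)} \odot a^{(l)}}{1 - p}, \qquad m_i \sim \mathrm{Bernoulli}(1 - p) $$

During inference: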

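$$ \tilde{a}^{(l)} = a^{(l)} $$

**Inverted Dropout** (used by PyTorch): activations are divided by $(1-p)$ during training, so no rescaling is needed at inference.

```python
import torch


class DropoutAnalysis:
    """Small experiments on dropout."""

    @staticmethod
    def expected_value_test():
        """Check that inverted dropout leaves the expected value unchanged."""
        p = 0.5
        x = torch.ones(10000, 100)
        # Keep each unit with probability 1 - p, then rescale by 1 / (1 - p)
        mask = (torch.rand_like(x) > p).float() / (1 - p)
        output = (x * mask).mean()
        # The mean should be close to 1
        print(f"Dropout(p={p}) output mean: {output.item():.4f} (expected 1)")

    @staticmethod
    def variance_test():
        """Effect of dropout on gradient variance (placeholder)."""
        # Training: gradients have higher variance, which acts as regularization
        # Inference: no randomness, so the output is stable
        pass


# Run the check
DropoutAnalysis.expected_value_test()
```

### 4.2 Batch Normalization

**Principle**: normalize each feature over the mini-batch to zero mean and unit variance, then learn a scale and a shift.

**During training**: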

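$$ \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad y_i = \gamma \hat{x}_i + \beta $$

where $\mu_B$ and $\sigma_B^2$ are the mean and variance of the feature over the mini-batch.

**During inference**: use the moving averages of $\mu$ and $\sigma^2$:

$$ \hat{x} = \frac{x - \mu_{\text{moving}}}{\sqrt{\sigma_{\text{moving}}^2 + \epsilon}} $$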

**Advantages of BatchNorm**:

1. Allows larger learning rates
2. Reduces sensitivity to initialization
3. Has a regularizing effect

```python
import torch
import torch.nn as nn


class BatchNorm1dManual(nn.Module):
    """Manual BatchNorm implementation, written out to expose the details."""

    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        super().__init__()
        self.eps = eps
        self.momentum = momentum
        self.gamma = nn.Parameter(torch.ones(num_features))
        self.beta = nn.Parameter(torch.zeros(num_features))
        self.register_buffer('running_mean', torch.zeros(num_features))
        self.register_buffer('running_var', torch.ones(num_features))

    def forward(self, x):
        if self.training:
            # Batch statistics
            mean = x.mean(dim=0)
            var = x.var(dim=0, unbiased=False)
            # Update the moving averages in place (no gradient tracking).
            # Note: nn.BatchNorm1d stores the unbiased variance here instead.
            with torch.no_grad():
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * mean)
                self.running_var.mul_(1 - self.momentum).add_(self.momentum * var)
        else:
            mean = self.running_mean
            var = self.running_var
        # Normalize, then scale and shift
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```

### 4.3 Layer Normalization

**Difference from BatchNorm**: normalization is over the **feature dimension** rather than the batch dimension.

$$ \hat{x}_i = \frac{x_i - \mu_L}{\sqrt{\sigma_L^2 + \epsilon}}, \quad y_i = \gamma \hat{x}_i + \beta $$

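where $\mu_L = \frac{1}{d} \sum_{j=1}^{d} x_{ij}$ is the mean over the $d$ features of sample $i$, and $\sigma_L^2$ is the corresponding variance.

**Advantages**:

- Insensitive to batch size
- Suited to variable-length sequences (RNN/Transformer)
- Identical behavior in training and inference

```python
import torch
import torch.nn as nn


class LayerNormManual(nn.Module):
    """Manual LayerNorm implementation."""

    def __init__(self, normalized_shape, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(normalized_shape))
        self.beta = nn.Parameter(torch.zeros(normalized_shape))

    def forward(self, x):
        # Statistics over the last (feature) dimension of each sample
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```

### 4.4 Weight Decay

**L2 regularization**: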

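$$ \mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{data}} + \frac{\lambda}{2} \lVert \mathbf{W} \rVert_2^2 $$

Gradient:

$$ \frac{\partial \mathcal{L}_{\text{reg}}}{\partial W} = \frac{\partial \mathcal{L}_{\text{data}}}{\partial W} + \lambda W $$

**Decoupled weight decay (AdamW)**:

$$ W \leftarrow W - \eta \left( \nabla_W \mathcal{L} + \lambda W \right) $$

For plain SGD this update coincides with a gradient step on the L2-regularized loss. The two differ for adaptive optimizers, where AdamW applies the $\lambda W$ term directly to the weights instead of passing it through the adaptive gradient scaling.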
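
The difference between coupled L2 regularization and decoupled weight decay only shows up with an adaptive optimizer. The NumPy sketch below runs a single Adam step in three variants; `adam_step`, its arguments, and the numbers are assumptions written for this illustration and do not reproduce any particular library's implementation.

```python
import numpy as np


def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    """One Adam update.

    `l2` adds lambda * w to the gradient (coupled L2 regularization);
    `decoupled_wd` shrinks w directly, outside the adaptive scaling
    (AdamW-style decay).
    """
    g = grad + l2 * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * decoupled_wd * w
    return w, m, v


w0 = np.array([1.0, -2.0])
grad = np.array([0.1, 10.0])           # very different gradient scales
zeros = np.zeros_like(w0)
lam = 0.1

w_plain, _, _ = adam_step(w0, grad, zeros, zeros, t=1)
w_l2, _, _ = adam_step(w0, grad, zeros, zeros, t=1, l2=lam)
w_adamw, _, _ = adam_step(w0, grad, zeros, zeros, t=1, decoupled_wd=lam)

print("plain :", w_plain)
print("L2    :", w_l2)
print("AdamW :", w_adamw)
```

On this first step Adam normalizes the gradient to roughly its sign, so the coupled L2 term is almost entirely absorbed by the normalization, while the decoupled term still shrinks each weight in proportion to its magnitude.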
