Multilayer Perceptrons and the Universal Approximation Theorem

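The multilayer perceptron (MLP) is a basic building block of deep learning. This chapter examines its theoretical foundations, including the universal approximation theorem and the main ideas behind its proof.¹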

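1. MLP Basics

1.1 Network Structure

An MLP consists of an input layer, one or more hidden layers, and an output layer:

Input layer   Hidden layer 1   Hidden layer 2   Output layer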
x₁ ──────○────○────┐
              │    │
x₂ ──────○──○────○──┼────○────○────── y₁
              │    │    │    │
x₃ ──────○──○────○──┼────○────○────── y₂
              │    │    │    │
              └────○────┴────○────── y₃

Mathematical form, for layers $l = 1, \dots, L$:

$$a^{(0)} = x \in \mathbb{R}^{d_{0}}$$

$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)} \in \mathbb{R}^{d_{l}}$$

$$a^{(l)} = σ(z^{(l)}) \in \mathbb{R}^{d_{l}}$$

$$\hat{y} = a^{(L)} \in \mathbb{R}^{d_{L}}$$

where $W^{(l)} \in \mathbb{R}^{d_{l} \times d_{l-1}}$ and $b^{(l)} \in \mathbb{R}^{d_{l}}$.
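As a minimal sketch of the recursion above, the following pure-Python forward pass stores each $W^{(l)}$ as a list of rows and each $b^{(l)}$ as a list. The helper name `mlp_forward`, the choice of tanh for $σ$, and the example weights are illustrative assumptions, not part of the original text. The activation is applied at every layer, including the last, exactly as in the formulas.

```python
import math

def mlp_forward(x, weights, biases, sigma=math.tanh):
    """Compute a^(L) from a^(0) = x via a^(l) = sigma(W^(l) a^(l-1) + b^(l))."""
    a = list(x)
    for W, b in zip(weights, biases):
        # z^(l) = W^(l) a^(l-1) + b^(l); W has d_l rows and d_{l-1} columns
        z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i
             for row, b_i in zip(W, b)]
        a = [sigma(z_i) for z_i in z]
    return a

# Hypothetical 2 -> 3 -> 1 network
weights = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],   # W^(1) in R^{3x2}
    [[1.0, -1.0, 0.5]],                      # W^(2) in R^{1x3}
]
biases = [[0.0, 0.0, 0.0], [0.0]]
y_hat = mlp_forward([0.0, 0.0], weights, biases)
```

The shapes follow the convention above: layer $l$ maps a vector of length $d_{l-1}$ to one of length $d_{l}$, so the number of columns of each matrix must equal the number of rows of the previous one.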

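1.2 Activation Functions

Common activation functions:

Function	Expression	Derivative	Notes
Sigmoid	$σ(x) = \frac{1}{1 + e^{-x}}$	$σ(x)(1 - σ(x))$	Output in (0, 1); prone to vanishing gradients
Tanh	$\tanh(x)$	$1 - \tanh^{2}(x)$	Output in (-1, 1); zero-centered
ReLU	$\max(0, x)$	$\mathbf{1}_{x > 0}$	Cheap to compute; sparse activations
Leaky ReLU	$\max(0.01x, x)$	$\mathbf{1}_{x > 0} + 0.01 \cdot \mathbf{1}_{x \leq 0}$	Avoids dead neurons
GELU	$x Φ(x)$	$Φ(x) + x ϕ(x)$	Standard in Transformers
Swish	$x σ(x)$	$σ(x)(1 + x(1 - σ(x)))$	Self-gated

Here $Φ$ and $ϕ$ denote the standard normal CDF and PDF.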

# Activation function implementations
import numpy as np


class Activations:
    @staticmethod
    def relu(x):
        return np.maximum(0, x)

    @staticmethod
    def relu_grad(x):
        # Convention: the derivative is taken to be 0 at x == 0
        return (np.asarray(x) > 0).astype(float)

    @staticmethod
    def sigmoid(x):
        # Numerically stable: exp is only evaluated at non-positive arguments
        e = np.exp(-np.abs(x))
        return np.where(np.asarray(x) >= 0, 1 / (1 + e), e / (1 + e))

    @staticmethod
    def tanh(x):
        return np.tanh(x)

    @staticmethod
    def gelu(x):
        # tanh approximation of x * Phi(x)
        return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

    @staticmethod
    def swish(x, beta=1):
        return x * Activations.sigmoid(beta * x)
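The derivative column of the table can be sanity-checked numerically. The sketch below compares each closed-form derivative with a central finite difference, using only the standard library. The helper names (`Phi`, `phi`, `max_derivative_error`) and the sample points are illustrative choices, and GELU is taken in its exact form $x Φ(x)$ rather than the tanh approximation.

```python
import math

def sigmoid(x):
    # Stable for large |x|: exp is only evaluated at non-positive arguments
    e = math.exp(-abs(x))
    return 1 / (1 + e) if x >= 0 else e / (1 + e)

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def phi(x):
    """Standard normal PDF."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

# name -> (function, derivative as listed in the table)
PAIRS = {
    "sigmoid": (sigmoid, lambda x: sigmoid(x) * (1 - sigmoid(x))),
    "tanh": (math.tanh, lambda x: 1 - math.tanh(x) ** 2),
    "gelu": (lambda x: x * Phi(x), lambda x: Phi(x) + x * phi(x)),
    "swish": (lambda x: x * sigmoid(x),
              lambda x: sigmoid(x) * (1 + x * (1 - sigmoid(x)))),
}

def max_derivative_error(f, df, points, h=1e-5):
    """Largest gap between df and a central finite difference of f."""
    return max(abs((f(x + h) - f(x - h)) / (2 * h) - df(x)) for x in points)

points = [-3.0, -1.0, -0.1, 0.2, 1.5, 4.0]
errors = {name: max_derivative_error(f, df, points)
          for name, (f, df) in PAIRS.items()}
```

ReLU and Leaky ReLU are left out because they are not differentiable at zero, so a finite difference across the kink would not match either one-sided derivative.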

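2. The Universal Approximation Theorem (UAT)

2.1 The Classical Theorem: Cybenko (1989)

Theorem (Cybenko):

Let $σ$ be a continuous sigmoidal function, that is, $σ(t) \to 1$ as $t \to +\infty$ and $σ(t) \to 0$ as $t \to -\infty$ (the logistic sigmoid is one example). For any compact set $K \subset \mathbb{R}^{n}$, any continuous function $f : K \to \mathbb{R}$, and any $ϵ > 0$, there exist an integer $N$, real numbers $v_{i}, b_{i} \in \mathbb{R}$, and vectors $w_{i} \in \mathbb{R}^{n}$ ($i = 1, \dots, N$) such that

$$\left| f(x) - \sum_{i=1}^{N} v_{i} σ(w_{i}^{T} x + b_{i}) \right| < ϵ$$

for all $x \in K$.

Equivalently: an MLP with a single hidden layer and sigmoid activation is a universal approximator, provided the width $N$ may grow with $f$ and $ϵ$. The theorem asserts existence only; it gives neither a bound on $N$ nor a procedure for finding the weights.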
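In one dimension the statement can be made concrete with a hand-built network. The sketch below is one possible construction, not Cybenko's proof: it places a steep sigmoid at the midpoint of each grid cell on $[0, 1]$, so that the sum $\sum_i v_i σ(w_i x + b_i)$ forms a smoothed staircase following $f$. The function names, the target $\sin(2πx)$, and the steepness factor are all illustrative assumptions.

```python
import math

def sigmoid(z):
    # Stable logistic function: exp only sees non-positive arguments
    e = math.exp(-abs(z))
    return 1 / (1 + e) if z >= 0 else e / (1 + e)

def build_staircase(f, n_steps, steepness=20.0):
    """Return hidden units (v, w, b) so that sum v*sigmoid(w*x + b) tracks f on [0, 1]."""
    k = steepness * n_steps                      # sharper steps on finer grids
    units = [(2 * f(0.0), 0.0, 0.0)]             # constant f(0), since sigmoid(0) = 1/2
    for i in range(1, n_steps + 1):
        jump = f(i / n_steps) - f((i - 1) / n_steps)
        midpoint = (i - 0.5) / n_steps
        units.append((jump, k, -k * midpoint))   # a step of height `jump` at `midpoint`
    return units

def network(x, units):
    """Single-hidden-layer output: sum_i v_i * sigmoid(w_i * x + b_i)."""
    return sum(v * sigmoid(w * x + b) for v, w, b in units)

def max_error(f, units, n_eval=2000):
    """Maximum absolute error on a uniform evaluation grid over [0, 1]."""
    return max(abs(f(i / n_eval) - network(i / n_eval, units))
               for i in range(n_eval + 1))

def target(x):
    return math.sin(2 * math.pi * x)

err_50 = max_error(target, build_staircase(target, 50))
err_200 = max_error(target, build_staircase(target, 200))
```

For a target with Lipschitz constant $L$, each step has height at most $L/N$, so the error of this construction shrinks roughly like $1/N$. That is far from optimal, but it shows why adding hidden units can drive the error below any $ϵ$.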

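2.2 Hornik's Theorem (1989)

Hornik, Stinchcombe, and White relaxed the conditions on the activation function, and later work relaxed them further:

Theorem (Hornik; the non-polynomial characterization below is due to Leshno, Lin, Pinkus, and Schocken, 1993):

For a broad class of activation functions $σ$ (monotonicity is not required, and continuity can be weakened to local boundedness with piecewise continuity), consider the functions computed by single-hidden-layer MLPs of width $N$:

$$F_{N} = \left\{ \sum_{i=1}^{N} v_{i} σ(w_{i}^{T} x + b_{i}) : v_{i} \in \mathbb{R},\ w_{i} \in \mathbb{R}^{n},\ b_{i} \in \mathbb{R} \right\}$$

The union $\bigcup_{N \geq 1} F_{N}$ is dense in $C(K)$ (the space of continuous functions on the compact set $K$) if and only if $σ$ is not a polynomial.

Key insight: universal approximation is not a special property of one particular activation function; it emerges from the network structure combined with almost any nonlinearity.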
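The non-polynomial condition is easy to see in a small experiment. With the polynomial activation $σ(z) = z^{2}$, a single-hidden-layer network in one variable is itself a polynomial of degree at most 2, whatever its width, so its third finite difference vanishes. The sketch below checks this for 100 random hidden units and compares with the target $|x|$; the function names and the random weights are illustrative assumptions.

```python
import random

def square_net(x, units):
    """One hidden layer with the polynomial activation sigma(z) = z**2."""
    return sum(v * (w * x + b) ** 2 for v, w, b in units)

def third_difference(g, x, h=0.5):
    """Third forward difference; it vanishes for every polynomial of degree <= 2."""
    return g(x + 3 * h) - 3 * g(x + 2 * h) + 3 * g(x + h) - g(x)

random.seed(0)
units = [(random.uniform(-1, 1), random.uniform(-1, 1), random.uniform(-1, 1))
         for _ in range(100)]

net_d3 = third_difference(lambda x: square_net(x, units), -1.0)   # ~0 up to rounding
target_d3 = third_difference(abs, -1.0)                           # uses |x| at -1, -0.5, 0, 0.5
```

The third difference combines four function values with coefficients whose absolute values sum to 8. If a network were within $ϵ$ of $|x|$ at those four points, the two third differences could differ by at most $8ϵ$. Since one is zero and the other is 1, no width can push the uniform error below $1/8$ on this interval.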

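2.3 Overview of Proof Strategies

Typical approaches to proving the UAT:

1. Stone-Weierstrass approach

This approach relies on the uniform approximation of continuous functions by polynomials on compact sets, and approximates an arbitrary continuous function through **sigmoidal polynomials**.

2. Convolution-kernel approach

The activation function is treated as a "wavelet" or "convolution kernel", and the target is built by superposition:

$$f_{N}(x) = \sum_{i=1}^{N} c_{i} σ(w_{i}^{T} x + b_{i})$$

As the grid is refined (the $w_{i}, b_{i}$ become denser), $f_{N} \to f$.

3. Probabilistic approach

This approach starts from an integral representation of the activation function:

$$σ(x) = \int_{-\infty}^{\infty} ϕ(t) e^{x t} \, dt$$

The MLP output is then written as a mean over hidden units, and the law of large numbers gives convergence.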
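The law-of-large-numbers step can be written out as follows. This is a sketch under an added assumption: that the target admits a representation as an expectation over hidden-unit parameters $(w, b)$ drawn from some probability measure $μ$, with a coefficient function $c$ such that $|c(w,b)\,σ(w^{T}x+b)| \le C$. The symbols $μ$, $c$, and $C$ are introduced here for illustration and do not appear in the original text.

```latex
f(x) = \mathbb{E}_{(w,b)\sim\mu}\bigl[c(w,b)\,\sigma(w^{T}x+b)\bigr],
\qquad
f_{N}(x) = \frac{1}{N}\sum_{i=1}^{N} c(w_{i},b_{i})\,\sigma(w_{i}^{T}x+b_{i}),
\quad (w_{i},b_{i}) \overset{\text{i.i.d.}}{\sim} \mu

\mathbb{E}\bigl[(f_{N}(x)-f(x))^{2}\bigr]
  = \frac{1}{N}\operatorname{Var}\bigl[c(w,b)\,\sigma(w^{T}x+b)\bigr]
  \le \frac{C^{2}}{N}
```

Here $f_{N}$ is a single-hidden-layer network with output weights $v_{i} = c(w_{i}, b_{i})/N$, and its mean squared error at each point decays like $1/N$ regardless of the input dimension.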

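3. Depth vs. Width

3.1 Width First: the Classical UAT

Width theorem: arbitrary width + fixed depth = universal approximation

Let the activation function $σ$ satisfy mild conditions (Section 2). For any continuous function $f$ on a compact set, there exists a single-hidden-layer network that approximates $f$ to any prescribed accuracy; its width $N$ depends on $f$, the input dimension, and that accuracy.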

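# Width requirement example
# For n-dimensional input, the number of hidden units N depends on:
# 1. the complexity (smoothness) of the function
# 2. the size of the compact set
# 3. the approximation accuracy ε

import numpy as np

def required_width(n, epsilon, smoothness='bounded', C=1.0, L=1.0):
    """
    Rough, order-of-magnitude estimate of the required width.
    C and L are problem-dependent constants (placeholder defaults).
    """
    if smoothness == 'bounded':
        # Heuristic estimate (attributed to a Whitney-type theorem)
        return int((C / epsilon) ** (1/n))
    elif smoothness == 'Lipschitz':
        # Function with Lipschitz constant L: grid-covering count
        return int((L * np.sqrt(n) / epsilon) ** n)
    raise ValueError(f"unknown smoothness: {smoothness!r}")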
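The Lipschitz branch of the estimate above can be motivated by a grid-covering argument. The derivation below is a sketch under assumptions not stated in the original: the domain is the unit cube $[0,1]^{n}$, $f$ is $L$-Lipschitz in the Euclidean norm, and the approximant is constant on each grid cell. The cell count is only a proxy for the number of hidden units.

```latex
\text{Split } [0,1]^{n} \text{ into cubes of side } h;\ \text{each cube has diameter } h\sqrt{n}.

|f(x) - f(x')| \le L\,\lVert x - x' \rVert \le L\,h\sqrt{n}
\quad \text{for } x, x' \text{ in the same cube.}

L\,h\sqrt{n} \le \epsilon
\;\Longrightarrow\;
h = \frac{\epsilon}{L\sqrt{n}},
\qquad
\#\text{cells} = \left(\frac{1}{h}\right)^{n} = \left(\frac{L\sqrt{n}}{\epsilon}\right)^{n}
```

The exponent $n$ is the curse of dimensionality: with $L = 1$ and $ϵ = 0.1$, the count is 10 for $n = 1$ but already about 200 for $n = 2$.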

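3.2 Depth First: the Depth Theorem

Depth theorem (Kidger & Lyons, 2020):

Universal approximation also holds for networks that are narrow but arbitrarily deep, provided that:

the network is "narrow and deep": the width is bounded (width $n + m + 2$ suffices for input dimension $n$ and output dimension $m$)
the activation function is continuous and nonaffine
the activation function is continuously differentiable at at least one point, with nonzero derivative there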

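# Heuristic depth requirement for a narrow, deep network
import numpy as np


class DepthTheorem:
    """
    The depth-first theorem says that a deep network of bounded width m
    can approximate any continuous function. The exponential growth of
    depth with function complexity used below is a heuristic, not part
    of the theorem.
    """

    @staticmethod
    def depth_requirement(function_complexity, width):
        """
        Rough estimate:
        function complexity ↑ → required depth ↑
        network width ↑ → required depth ↓
        """
        return max(1, int(np.exp(function_complexity) / width))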

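3.3 The Depth-Width Trade-off in Expressive Power

Aspect	Wide and shallow	Narrow and deep
Parameter efficiency	Lower	Higher
Training difficulty	Lower	Higher (gradient problems)
Function class (at a fixed parameter budget)	Limited	Potentially richer
Optimization	Easier	Needs techniques such as residual connections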
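The parameter-efficiency row can be illustrated with a classic construction: composing a width-2 ReLU "tent" layer with itself. Each extra layer doubles the number of linear pieces, while a single-hidden-layer ReLU network in one dimension with $N$ units has at most $N + 1$ pieces. The sketch below counts pieces exactly on a dyadic grid; the function names are illustrative assumptions.

```python
def relu(z):
    return max(0.0, z)

def tent(x):
    """Width-2 ReLU layer: 2*relu(x) - 4*relu(x - 0.5), a tent on [0, 1]."""
    return 2 * relu(x) - 4 * relu(x - 0.5)

def deep_tent(x, depth):
    """Compose the tent layer `depth` times (a narrow, deep ReLU network)."""
    for _ in range(depth):
        x = tent(x)
    return x

def count_linear_pieces(depth):
    """Count linear pieces on [0, 1] from slope changes on a dyadic grid."""
    m = 2 ** (depth + 1)                      # grid fine enough to contain every kink
    ys = [deep_tent(i / m, depth) for i in range(m + 1)]
    slopes = [ys[i + 1] - ys[i] for i in range(m)]
    changes = sum(1 for s, t in zip(slopes, slopes[1:]) if s != t)
    return changes + 1

pieces = {k: count_linear_pieces(k) for k in range(1, 7)}
```

A depth-$k$ network of this kind uses $2k$ hidden units and produces $2^{k}$ pieces, whereas a shallow network needs at least $2^{k} - 1$ units to match it. All grid points and values are dyadic rationals, so the floating-point comparisons of slopes are exact.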

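4. Representation Theorem

4.1 Statement

Representation theorem: for any fixed input $x$ and a smooth activation function (sigmoid, tanh, GELU), the MLP output is a smooth (infinitely differentiable) function of the parameters. With ReLU the output is differentiable only almost everywhere, because of the kink at zero.

This means:

$$\hat{y}(x; W, b) \in C^{\infty}(\mathbb{R}^{P})$$

where $P$ is the total number of parameters.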

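4.2 Practical Implications

# The representation theorem guarantees that:
# 1. gradients exist and are well defined
# 2. gradient-based optimizers can work
# 3. automatic differentiation is meaningful

# Example: computing gradients (uses the MLP class from Section 6.1)
import torch

x = torch.randn(1, 10)
model = MLP(input_dim=10, hidden_dims=[64], output_dim=1)
output = model(x)

# The representation theorem guarantees that this gradient exists
output.sum().backward()
print(model.network[0].weight.grad.shape)  # torch.Size([64, 10])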
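The same point can be checked without any framework. The sketch below takes a hypothetical scalar network $y = w_2 \tanh(w_1 x + b_1) + b_2$, differentiates it by hand with respect to $w_1$, and compares the result with a central finite difference. The network, the parameter values, and the step size are illustrative assumptions.

```python
import math

def net(x, w1, b1, w2, b2):
    """Scalar network: y = w2 * tanh(w1 * x + b1) + b2."""
    return w2 * math.tanh(w1 * x + b1) + b2

def grad_w1(x, w1, b1, w2, b2):
    """Analytic dy/dw1 via the chain rule: w2 * (1 - tanh(.)^2) * x."""
    return w2 * (1 - math.tanh(w1 * x + b1) ** 2) * x

x, w1, b1, w2, b2 = 0.7, -1.3, 0.4, 2.0, 0.1
h = 1e-6
numeric = (net(x, w1 + h, b1, w2, b2) - net(x, w1 - h, b1, w2, b2)) / (2 * h)
analytic = grad_w1(x, w1, b1, w2, b2)
```

Agreement between the two values is what automatic differentiation relies on: backpropagation evaluates the analytic chain-rule expression, and smoothness in the parameters ensures it matches the true local rate of change.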

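5. Expressive Power Analysis

5.1 Partition Complexity

A ReLU MLP partitions the input space into linear regions:

import math

def count_linear_regions(n_hidden, input_dim):
    """
    Maximum number of linear regions of a single-hidden-layer ReLU MLP.

    With N = n_hidden hidden units and input dimension n = input_dim:
    - each hidden unit defines one hyperplane
    - N hyperplanes in general position split R^n into
      R(N, n) = sum_{k=0}^{n} C(N, k) regions

    For fixed n this grows like N^n: polynomial in N, exponential in n.
    """
    return sum(math.comb(n_hidden, k) for k in range(input_dim + 1))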
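The region count can be verified on a small case by enumerating activation patterns. The sketch below uses three hypothetical ReLU units in two dimensions whose hyperplanes are in general position, samples a grid, and collects the distinct on/off patterns. The classical count for $N$ hyperplanes in general position in $\mathbb{R}^{n}$ is $\sum_{k=0}^{n} \binom{N}{k}$, which is 7 for this case. The unit weights and the grid are illustrative assumptions.

```python
import math

def max_regions(num_hyperplanes, dim):
    """Maximum number of regions cut out by hyperplanes in general position."""
    return sum(math.comb(num_hyperplanes, k) for k in range(dim + 1))

def observed_patterns(units, grid):
    """Distinct on/off patterns of ReLU units (w, b) over the sample points."""
    return {
        tuple(sum(wi * xi for wi, xi in zip(w, x)) + b > 0 for w, b in units)
        for x in grid
    }

# Three hypothetical hidden units in 2D: x1 > 0, x2 > 0, x1 + x2 > 1
units = [((1.0, 0.0), 0.0), ((0.0, 1.0), 0.0), ((1.0, 1.0), -1.0)]
grid = [(i * 0.5, j * 0.5) for i in range(-6, 7) for j in range(-6, 7)]
patterns = observed_patterns(units, grid)
```

Of the $2^{3} = 8$ conceivable patterns, only the one with both coordinates non-positive and $x_1 + x_2 > 1$ is geometrically impossible, which leaves the 7 regions predicted by the formula.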

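5.2 Growth of the Function Class

Network type	Parameter count	Representable function complexity
Linear model	$O(d)$	Linear functions
Single-hidden-layer MLP (width $N$)	$O(N d)$	Piecewise-linear functions of limited complexity (with ReLU)
Deep MLP (width $N$, depth $L$)	$O(N^{2} L)$	Complex functions built by layer-wise composition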
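Exact parameter counts for fully connected networks are easy to compute from the layer sizes. The helper below is a small sketch; its name and the example sizes (784 inputs, hidden width 256, 10 outputs, matching the `DeepMLP` defaults used later in this chapter) are illustrative choices.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum(d_in * d_out + d_out
               for d_in, d_out in zip(layer_sizes, layer_sizes[1:]))

linear = count_parameters([784, 10])                    # linear model
shallow = count_parameters([784, 256, 10])              # one hidden layer
deep = count_parameters([784, 256, 256, 256, 10])       # three hidden layers
```

Each additional hidden layer of width $N$ adds $N^{2} + N$ parameters, which is why the count of a deep MLP grows quadratically in width and linearly in depth.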

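5.3 Regions of ReLU Networks

import math

def relu_network_regions(layers):
    """
    Upper bound on the number of linear regions of a ReLU network.

    layers: neurons per layer, e.g. [d, n1, n2, ..., output]

    Each hidden layer can split every existing region into at most
    sum_{k=0}^{r} C(n_l, k) pieces, where r is the smallest width seen
    so far (input included), so the bound multiplies across layers.
    """
    rank = layers[0]
    max_regions = 1
    for n_l in layers[1:-1]:
        max_regions *= sum(math.comb(n_l, k) for k in range(rank + 1))
        rank = min(rank, n_l)
    return max_regions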

6. Implementing an MLP

6.1 Basic Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
 
class MLP(nn.Module):
    """Multilayer perceptron."""
    
    def __init__(self, input_dim, hidden_dims, output_dim, 
                 activation='relu', dropout=0.0):
        super().__init__()
        
        # Build the layers (hidden_dims may be any sequence of ints)
        dims = [input_dim] + list(hidden_dims) + [output_dim]
        layers = []
        
        for i in range(len(dims) - 1):
            layers.append(nn.Linear(dims[i], dims[i+1]))
            
            # Add an activation after every layer except the last
            if i < len(dims) - 2:
                if activation == 'relu':
                    layers.append(nn.ReLU(inplace=True))
                elif activation == 'gelu':
                    layers.append(nn.GELU())
                elif activation == 'tanh':
                    layers.append(nn.Tanh())
                elif activation == 'swish':
                    layers.append(nn.SiLU())
                else:
                    raise ValueError(f"unknown activation: {activation!r}")
                
                if dropout > 0:
                    layers.append(nn.Dropout(dropout))
        
        self.network = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.network(x)
 
 
class DeepMLP(nn.Module):
    """Deep MLP (for depth-width trade-off experiments)."""
    
    def __init__(self, input_dim=784, hidden_dim=256, 
                 num_layers=4, output_dim=10):
        super().__init__()
        
        self.layers = nn.ModuleList()
        self.norms = nn.ModuleList()
        
        for i in range(num_layers):
            in_dim = input_dim if i == 0 else hidden_dim
            out_dim = output_dim if i == num_layers - 1 else hidden_dim
            
            self.layers.append(nn.Linear(in_dim, out_dim))
            
            # BatchNorm on every layer except the last
            if i < num_layers - 1:
                self.norms.append(nn.BatchNorm1d(out_dim))
            else:
                self.norms.append(nn.Identity())
        
        self.activation = nn.ReLU()
    
    def forward(self, x):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            x = self.norms[i](x)
            if i < len(self.layers) - 1:
                x = self.activation(x)
        return x

6.2 Training Loop

def train_mlp(model, train_loader, test_loader,
              epochs=100, lr=0.001, device='cuda'):
    """Standard training loop for an MLP classifier."""
    
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs
    )
    
    history = {'train_loss': [], 'test_acc': []}
    
    for epoch in range(epochs):
        # Training phase
        model.train()
        train_loss = 0
        
        for batch_x, batch_y in train_loader:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)
            
            optimizer.zero_grad()
            output = model(batch_x)
            loss = criterion(output, batch_y)
            loss.backward()
            optimizer.step()
            
            train_loss += loss.item()
        
        scheduler.step()
        
        # Evaluation phase
        model.eval()
        correct = 0
        total = 0
        
        with torch.no_grad():
            for batch_x, batch_y in test_loader:
                batch_x, batch_y = batch_x.to(device), batch_y.to(device)
                output = model(batch_x)
                _, predicted = torch.max(output, 1)
                total += batch_y.size(0)
                correct += (predicted == batch_y).sum().item()
        
        test_acc = 100 * correct / total
        history['train_loss'].append(train_loss / len(train_loader))
        history['test_acc'].append(test_acc)
        
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}: Loss={history['train_loss'][-1]:.4f}, "
                  f"Test Acc={test_acc:.2f}%")
    
    return history
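The learning-rate schedule used in the loop can be written in closed form. The sketch below mirrors the cosine-annealing expression $η_t = η_{min} + \tfrac{1}{2}(η_0 - η_{min})(1 + \cos(π t / T_{max}))$ with the loop's defaults (`lr=0.001`, `T_max=epochs=100`). The function and parameter names are illustrative, and this is a standalone formula, not a call into PyTorch.

```python
import math

def cosine_annealing_lr(step, base_lr=0.001, t_max=100, eta_min=0.0):
    """Cosine schedule: starts at base_lr and decays to eta_min at step t_max."""
    cosine = math.cos(math.pi * step / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + cosine)

# One value per epoch, epochs 0..100
schedule = [cosine_annealing_lr(epoch) for epoch in range(101)]
```

The rate changes slowly at the start and end of training and fastest in the middle, which is the usual motivation for choosing it over a linear decay.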

7. Modern MLP Variants

7.1 MLP-Mixer

With ViT, the Transformer architecture from NLP was successfully brought to vision. MLP-Mixer goes the other way and replaces attention with plain MLPs:

import torch.nn as nn


class MLPMixer(nn.Module):
    """MLP-Mixer: replaces attention with MLPs."""

    def __init__(self, num_patches, hidden_dim, num_layers,
                 tokens_mlp_dim, channels_mlp_dim, num_classes):
        super().__init__()

        self.layers = nn.ModuleList([
            MixerLayer(num_patches, tokens_mlp_dim, channels_mlp_dim, hidden_dim)
            for _ in range(num_layers)
        ])

        self.norm = nn.LayerNorm(hidden_dim)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # x: (B, N, D) where N = num_patches and D = hidden_dim
        for layer in self.layers:
            x = layer(x)

        x = self.norm(x)
        x = x.mean(dim=1)  # global average pooling over patches
        return self.head(x)


class MixerLayer(nn.Module):
    """Mixer layer: applies MLPs alternately along the token and channel axes."""

    def __init__(self, num_patches, tokens_mlp_dim, channels_mlp_dim, hidden_dim):
        super().__init__()

        self.norm1 = nn.LayerNorm(hidden_dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(num_patches, tokens_mlp_dim),
            nn.GELU(),
            nn.Linear(tokens_mlp_dim, num_patches)
        )

        self.norm2 = nn.LayerNorm(hidden_dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(hidden_dim, channels_mlp_dim),
            nn.GELU(),
            nn.Linear(channels_mlp_dim, hidden_dim)
        )

    def forward(self, x):
        # Token mixing: move the patch axis last, (B, N, D) -> (B, D, N)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # Channel mixing
        x = x + self.channel_mlp(self.norm2(x))
        return x
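The difference between token mixing and channel mixing comes down to which axis the linear map acts on. The pure-Python sketch below works on a single example $X$ of shape $N \times D$ (patches by channels) and uses permutation matrices so that the effect is visible by eye. All names and matrices are illustrative assumptions; a linear layer is modelled as right-multiplication by the transposed weight matrix, as in `nn.Linear`.

```python
def matmul(A, B):
    """Plain-Python matrix product of nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(X):
    return [list(col) for col in zip(*X)]

def token_mix(X, W_tok):
    """Mix across patches: transpose to (D x N), map the last axis, transpose back."""
    return transpose(matmul(transpose(X), transpose(W_tok)))

def channel_mix(X, W_ch):
    """Mix across channels: map the last axis of X (N x D) directly."""
    return matmul(X, transpose(W_ch))

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]                        # N = 3 patches, D = 2 channels
W_tok = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]     # swaps patches 0 and 1
W_ch = [[0.0, 1.0], [1.0, 0.0]]                                 # swaps the two channels

mixed_tokens = token_mix(X, W_tok)
mixed_channels = channel_mix(X, W_ch)
```

Token mixing reorders rows (patches) and leaves each channel's set of values intact, while channel mixing reorders entries within each row. In a batched tensor of shape (B, N, D), this is why the token MLP needs the patch and channel axes swapped before it is applied.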

7.2 Sparse Mixture of Experts (Sparse MoE)

import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoE(nn.Module):
    """Sparse mixture of experts."""

    def __init__(self, input_dim, hidden_dim, output_dim, num_experts, k=2):
        super().__init__()

        self.num_experts = num_experts
        self.k = k  # number of experts activated per token
        self.output_dim = output_dim

        # Gating network
        self.gate = nn.Linear(input_dim, num_experts)

        # Expert networks
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(input_dim, hidden_dim),
                nn.GELU(),
                nn.Linear(hidden_dim, output_dim)
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # Gating weights
        gate_logits = self.gate(x)
        gate_weights = F.softmax(gate_logits, dim=-1)

        # Select the top-k experts and renormalize their weights
        top_k_weights, top_k_indices = torch.topk(
            gate_weights, self.k, dim=-1
        )
        top_k_weights = top_k_weights / top_k_weights.sum(dim=-1, keepdim=True)

        # Expert outputs; the output width may differ from the input width
        output = x.new_zeros(*x.shape[:-1], self.output_dim)
        for i, expert in enumerate(self.experts):
            # Tokens routed to expert i
            mask = (top_k_indices == i).any(dim=-1)
            if mask.any():
                expert_output = expert(x[mask])
                # Weighted accumulation (each token selects expert i at most once)
                expert_weight = top_k_weights[mask][top_k_indices[mask] == i].unsqueeze(-1)
                output[mask] += expert_output * expert_weight

        return output
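The gating step is the part of the layer that is easiest to get wrong, so here it is for a single token in plain Python: softmax over the gate logits, selection of the $k$ largest probabilities, and renormalization so the selected weights sum to one. The function names and the example logits are illustrative assumptions.

```python
import math

def softmax(logits):
    m = max(logits)                              # subtract the max for stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(logits, k=2):
    """Return (expert indices, renormalized weights) for one token."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return top, [probs[i] / mass for i in top]

# Hypothetical gate logits for four experts
experts, weights = top_k_gate([2.0, 0.0, 1.0, -1.0], k=2)
```

After renormalization the weights depend only on the logits of the selected experts, so the token's output is a convex combination of $k$ expert outputs and the remaining experts are never evaluated for it.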

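8. Practical Recommendations

8.1 Choosing an Activation Function

Scenario	Recommended activation
Default choice	GELU
Fast inference needed	ReLU
Avoiding dead neurons	Leaky ReLU / PReLU
Maximizing accuracy	Swish / GELU
Generative models	Tanh

8.2 Initialization Strategies

Activation	Initialization
ReLU	He initialization ( $W \sim N(0, 2 / n_{in})$ )
Sigmoid/Tanh	Xavier/Glorot ( $W \sim N(0, 2 / (n_{in} + n_{out}))$ )
GELU	Approximately He initialization
General	Kaiming normal / Xavier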

# Initialization in PyTorch
import torch.nn as nn

def init_weights(model, activation='relu'):
    for m in model.modules():
        if isinstance(m, nn.Linear):
            if activation in ('relu', 'gelu'):
                # He initialization; for GELU this is an approximation
                nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
            else:
                # Sigmoid/Tanh and other activations: Xavier/Glorot
                nn.init.xavier_normal_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
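The factor 2 in He initialization can be checked with a small Monte Carlo experiment. For a fixed input $x$ and weights $w_i \sim N(0, s^{2})$, the pre-activation $z = w^{T}x$ is Gaussian with variance $s^{2}\lVert x \rVert^{2}$, and $\mathbb{E}[\mathrm{ReLU}(z)^{2}]$ is half of that. Choosing $s^{2} = 2/n_{in}$ therefore keeps the mean squared activation equal to that of the input. The sketch below, with illustrative names and sizes, compares this with the naive choice $s^{2} = 1/n_{in}$.

```python
import math
import random

def relu_layer_second_moment(x, n_out, weight_std, rng):
    """Mean of relu(w . x)**2 over n_out units with w_i ~ N(0, weight_std**2)."""
    total = 0.0
    for _ in range(n_out):
        z = sum(rng.gauss(0.0, weight_std) * xi for xi in x)
        total += max(0.0, z) ** 2
    return total / n_out

rng = random.Random(0)
n_in, n_out = 128, 4000
x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]
input_moment = sum(xi * xi for xi in x) / n_in

# Ratio of output to input mean square: about 1 for He, about 0.5 for the naive scale
he_ratio = relu_layer_second_moment(x, n_out, math.sqrt(2.0 / n_in), rng) / input_moment
naive_ratio = relu_layer_second_moment(x, n_out, math.sqrt(1.0 / n_in), rng) / input_moment
```

With the naive scale the signal shrinks by roughly half at every ReLU layer, so after many layers activations and gradients vanish; the He scale compensates for the half of the pre-activations that ReLU zeroes out.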

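9. Summary

The universal approximation theorem guarantees that a single-hidden-layer MLP can approximate any continuous function on a compact set, given enough hidden units
The depth-width trade-off: deep networks are more parameter-efficient but harder to train
The choice of activation function affects learning dynamics and convergence speed
The representation theorem guarantees that gradients exist, which makes gradient-based optimization possible
Modern variants such as MLP-Mixer and Sparse MoE extend the capabilities of the classical MLP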

References

1. Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359-366.
