The Evolution of Classic CNN Architectures: From LeNet to ResNet

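Overview

The history of convolutional neural networks (CNNs) mirrors the deep learning revolution in miniature. From LeNet-5, proposed by LeCun in 1998, through AlexNet's breakout win in the 2012 ImageNet competition, to ResNet reaching 152 layers in 2015, CNNs moved "from the lab to industry" in under two decades.¹

This article surveys the key CNN architectures of the ILSVRC era, covering each one's core innovations, the problem it solved, and the lessons it left behind, with PyTorch implementations of the main models.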

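1. Background: The ILSVRC Era

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was the most important proving ground for deep learning in computer vision:

Dataset: about 1.28M training images (roughly 1.4M including validation and test), 1000 classes
Metric: top-5 error rate (a prediction counts as correct if the true label is among the model's five highest-scoring classes)
Years: 2010-2017
Impact: each step in CNN architecture design shows up as a drop in top-5 error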
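The top-5 metric can be made concrete with a few lines of plain Python. This is a minimal sketch, not the official ILSVRC evaluation code; the function name `top_k_error` and the toy scores are made up for illustration.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is not among the k highest scores."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Toy example: 3 samples, 6 classes (made-up scores).
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],  # true class 1 is ranked first
    [0.30, 0.10, 0.10, 0.20, 0.20, 0.10],  # true class 3 is ranked second
    [0.00, 0.10, 0.20, 0.30, 0.25, 0.15],  # true class 0 is ranked last
]
labels = [1, 3, 0]

top1 = top_k_error(scores, labels, k=1)
top5 = top_k_error(scores, labels, k=5)
print(f"top-1 error: {top1:.2f}, top-5 error: {top5:.2f}")
```

The second sample is wrong under top-1 but right under top-5, which is why top-5 error is always the lower of the two numbers.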

Key figures: the best non-deep method in 2011 had a 25.8% error rate; in 2015 ResNet reached 3.57%, below the estimated human error rate (~5%).

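2. LeNet-5 (1998): The Foundation of CNNs

2.1 Background

LeCun et al. proposed LeNet-5² in 1998 for handwritten digit recognition (MNIST). It is often cited as the first CNN deployed in industry, reading handwritten digits in ATMs and bank check processing.

2.2 Architecture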

Input (32×32×1)
  ↓
C1: Conv 5×5, 6 filters, output 28×28×6
  ↓
S2: AvgPool 2×2, output 14×14×6
  ↓
C3: Conv 5×5, 16 filters, output 10×10×16
  ↓
S4: AvgPool 2×2, output 5×5×16
  ↓
C5: Conv 5×5, 120 filters, output 1×1×120
  ↓
F6: FC 84
  ↓
Output: 10 classes (softmax)

Parameters: ~60K (tiny by modern standards)
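The sizes in the diagram and the ~60K figure can be checked with a little arithmetic. The sketch below assumes dense connectivity everywhere, parameter-free pooling, and a plain linear output layer, so its total differs slightly from the original LeNet-5, which used a sparse C3 connection table and trainable subsampling layers. The helper names are illustrative.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_params(in_ch, out_ch, kernel):
    """Weights plus one bias per output channel, assuming dense connectivity."""
    return (kernel * kernel * in_ch + 1) * out_ch


def fc_params(in_features, out_features):
    """Weights plus one bias per output unit."""
    return (in_features + 1) * out_features


# Spatial sizes through the network, starting from a 32x32 input.
c1 = conv_out(32, 5)
s2 = conv_out(c1, 2, stride=2)
c3 = conv_out(s2, 5)
s4 = conv_out(c3, 2, stride=2)
c5 = conv_out(s4, 5)

lenet_params = (
    conv_params(1, 6, 5)       # C1
    + conv_params(6, 16, 5)    # C3 (dense here; the original is sparse)
    + conv_params(16, 120, 5)  # C5
    + fc_params(120, 84)       # F6
    + fc_params(84, 10)        # output layer
)
print(c1, s2, c3, s4, c5, lenet_params)
```

Almost four fifths of the parameters sit in C5, the layer that collapses the 5×5×16 feature maps into a 120-dimensional vector.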

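2.3 Key Design Choices

Alternating convolution and pooling: the template for modern CNNs
Local receptive fields: each neuron connects only to a small patch of the previous layer
Weight sharing: the same kernel is applied at every spatial position
sigmoid/tanh activations: the standard choice at the time
Radial basis function (RBF) output: the original output layer used Euclidean RBF units; modern reimplementations (and the diagram above) substitute softmax
Sparse connectivity: each C3 feature map connects to only a subset of the S2 maps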

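2.4 Limitations

Limited compute and data kept it from large-scale use
Saturating sigmoid/tanh activations after each pooling stage make vanishing gradients severe as depth grows
No effective regularization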

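3. AlexNet (2012): The Clarion Call of the Deep Learning Revival

3.1 A Historic Moment

In 2012, AlexNet³, by Krizhevsky, Sutskever, and Hinton, won ILSVRC with a 15.3% top-5 error rate, 10.9 percentage points below the runner-up (26.2%, a traditional pipeline), which stunned the field.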

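3.2 Key Innovations

Innovation	Description	Impact
ReLU activation	$\max(0, x)$ replaces sigmoid/tanh	Eases vanishing gradients; about 6× faster to train than tanh in the paper's CIFAR-10 experiment
Dropout	Randomly zeroes 50% of units during training	Effective against overfitting
GPU training	2× GTX 580, 5-6 days of training	Ushered in the GPU era
Data augmentation	Horizontal flips, random crops, PCA color jitter	Effectively enlarges the dataset
LRN	Local response normalization	Mimics lateral inhibition in biological neurons
Overlapping pooling	Stride smaller than the kernel size	Less information loss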

3.3 Architecture

Input (224×224×3) [effectively 227×227]
  ↓
Conv1: 11×11, stride 4, 96 filters → 55×55×96
  ↓
LRN + MaxPool 3×3, stride 2 → 27×27×96
  ↓
Conv2: 5×5, pad 2, 256 filters → 27×27×256
  ↓
LRN + MaxPool 3×3, stride 2 → 13×13×256
  ↓
Conv3: 3×3, pad 1, 384 filters → 13×13×384
  ↓
Conv4: 3×3, pad 1, 384 filters → 13×13×384
  ↓
Conv5: 3×3, pad 1, 256 filters → 13×13×256
  ↓
MaxPool 3×3, stride 2 → 6×6×256
  ↓
FC6: 4096 + Dropout
  ↓
FC7: 4096 + Dropout
  ↓
FC8: 1000 (softmax)

Parameters: ~60M
FLOPs: ~0.7G (per image)
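The spatial sizes in the diagram follow from the standard output-size formula, floor((n + 2p - k) / s) + 1. The sketch below traces them for the unpadded layout in the diagram; the layer list is transcribed from the diagram and the helper names are illustrative.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding) for each layer in the diagram.
LAYERS = [
    ("conv1", 11, 4, 0), ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2), ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1), ("conv4", 3, 1, 1), ("conv5", 3, 1, 1),
    ("pool5", 3, 2, 0),
]


def trace(input_size):
    """Return the spatial size after every layer for a square input."""
    sizes = {}
    size = input_size
    for name, kernel, stride, padding in LAYERS:
        size = conv_out(size, kernel, stride, padding)
        sizes[name] = size
    return sizes


sizes_227 = trace(227)
sizes_224 = trace(224)
print(sizes_227)
print("conv1 output for a 224 input without padding:", sizes_224["conv1"])
```

A 227×227 input reproduces the 55, 27, 13, 6 sequence, while an unpadded 224×224 input yields 54 after Conv1, which is why the diagram notes 227 as the effective input size.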

3.4 PyTorch Implementation

import torch
import torch.nn as nn
 
class AlexNet(nn.Module):
    """AlexNet, simplified (adapted to modern input sizes)."""
    
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            # Conv1: large kernel, large stride
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),
            # Conv2
            nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=1),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),
            # Conv3-5: stacked small kernels
            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )
    
    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), 256 * 6 * 6)
        x = self.classifier(x)
        return x

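3.5 Limitations and Lessons

Large parameter count (60M), mostly in the FC layers
The large 11×11 kernel is expensive to compute
Lesson: ReLU + Dropout + GPU training were the ingredients that made the breakthrough possible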
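How lopsided the parameter split is can be counted directly. The sketch below uses the layer widths from the single-GPU code in section 3.4 (no grouped convolutions), so its total comes out slightly above the paper's ~60M; LRN, pooling, and dropout contribute no parameters.

```python
def conv_params(in_ch, out_ch, kernel):
    """Weights plus one bias per output channel."""
    return (kernel * kernel * in_ch + 1) * out_ch


def fc_params(in_features, out_features):
    """Weights plus one bias per output unit."""
    return (in_features + 1) * out_features


conv_total = (
    conv_params(3, 96, 11)
    + conv_params(96, 256, 5)
    + conv_params(256, 384, 3)
    + conv_params(384, 384, 3)
    + conv_params(384, 256, 3)
)
fc_total = (
    fc_params(256 * 6 * 6, 4096)
    + fc_params(4096, 4096)
    + fc_params(4096, 1000)
)
fc_share = fc_total / (conv_total + fc_total)
print(f"conv: {conv_total:,}  fc: {fc_total:,}  fc share: {fc_share:.1%}")
```

Under these assumptions the three FC layers hold well over 90% of the weights, with the first one (9216 → 4096) alone accounting for more than half.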

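4. ZFNet (2013): Visualization-Driven Design

Zeiler & Fergus used deconvolutional networks to visualize a CNN's internal representations, and used what they saw to refine AlexNet:

Conv1 kernel: 11×11 → 7×7 (retains more fine detail)
Conv1 stride: 4 → 2 (retains more spatial information)
top-5 error: 15.3% → 11.7%

Core contribution: CNNs stopped being a black box. The visualizations showed shallow layers learning edges and textures, and deeper layers learning object parts and whole objects.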

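5. VGG (2014): A Win for Depth and Simplicity

5.1 Core Idea

The VGG (Visual Geometry Group, Oxford)⁴ team proposed a minimalist principle:

Stack small 3×3 kernels instead of using large ones

5.2 Why 3×3?

Two stacked 3×3 convolutions cover the same receptive field as one 5×5 convolution, but with:

Fewer parameters: $2 \times 3^{2} = 18$ vs $5^{2} = 25$ weights per input-output channel pair (28% fewer)
More nonlinearity: 2 ReLUs vs 1
Greater expressive power

Three stacked 3×3 convolutions match a 7×7: $3 \times 3^{2} = 27$ vs $49$ weights (45% fewer).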
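Both claims, the matching receptive field and the parameter savings, reduce to simple arithmetic. A minimal sketch, assuming stride-1 convolutions and counting weights per input-output channel pair (biases ignored); the function names are illustrative.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stride-1 convolutions stacked in sequence."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel - 1  # each layer widens the field by (kernel - 1)
    return rf


def weights_per_channel_pair(num_layers, kernel):
    """Weights for one input-output channel pair, summed over the stack."""
    return num_layers * kernel * kernel


for n in (2, 3):
    rf = stacked_receptive_field(n)
    small = weights_per_channel_pair(n, 3)
    large = weights_per_channel_pair(1, rf)
    saving = 1 - small / large
    print(f"{n} x 3x3: receptive field {rf}x{rf}, "
          f"{small} vs {large} weights, {saving:.0%} fewer")
```

The same recurrence shows why the trick keeps paying off: each extra 3×3 layer adds 9 weights and 2 pixels of receptive field, while a single large kernel grows quadratically.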

5.3 VGG-16 Architecture

Input (224×224×3)
  ↓
Block 1: 2× [Conv 3×3, 64] + MaxPool 2×2 → 112×112×64
  ↓
Block 2: 2× [Conv 3×3, 128] + MaxPool 2×2 → 56×56×128
  ↓
Block 3: 3× [Conv 3×3, 256] + MaxPool 2×2 → 28×28×256
  ↓
Block 4: 3× [Conv 3×3, 512] + MaxPool 2×2 → 14×14×512
  ↓
Block 5: 3× [Conv 3×3, 512] + MaxPool 2×2 → 7×7×512
  ↓
FC: 4096 → 4096 → 1000

Parameters: 138M (VGG-16)
top-5 error: 7.3%
FLOPs: 15.5G

5.4 PyTorch Implementation

class VGG(nn.Module):
    """VGG-16. cfg lists each conv layer's output channels; 'M' marks a max-pool."""
    cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 
           512, 512, 512, 'M', 512, 512, 512, 'M']
    
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = self._make_layers()
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.ReLU(True),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(True),
            nn.Dropout(0.5),
            nn.Linear(4096, num_classes),
        )
        self._initialize_weights()
    
    def _make_layers(self):
        layers = []
        in_channels = 3
        for v in self.cfg:
            if v == 'M':
                layers.append(nn.MaxPool2d(2, 2))
            else:
                layers += [
                    nn.Conv2d(in_channels, v, 3, padding=1),
                    nn.BatchNorm2d(v),  # modern addition: BN (not in the original VGG)
                    nn.ReLU(inplace=True),
                ]
                in_channels = v
        return nn.Sequential(*layers)
    
    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.zeros_(m.bias)
    
    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

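5.5 Limitations

Most of the 138M parameters sit in the FC layers: about 124M
Heavy compute: roughly 20× the FLOPs of AlexNet
Fixed input size: the FC layers tie the network to 224×224 inputs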
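The split between convolutional and FC parameters can be counted from the same `cfg` list the implementation uses. This sketch counts the original VGG-16 (weights and biases, no BatchNorm), so it ignores the small number of BN parameters added in the code above.

```python
CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
       512, 512, 512, "M", 512, 512, 512, "M"]

conv_total = 0
in_ch = 3
for v in CFG:
    if v == "M":
        continue  # max-pool layers have no parameters
    conv_total += (3 * 3 * in_ch + 1) * v  # 3x3 weights plus one bias per filter
    in_ch = v

fc_shapes = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]
fc_total = sum((n_in + 1) * n_out for n_in, n_out in fc_shapes)

total = conv_total + fc_total
print(f"conv: {conv_total:,}  fc: {fc_total:,}  total: {total:,}")
```

Thirteen convolutional layers account for under 15M parameters, while the first FC layer alone (25088 → 4096) holds over 100M.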

Still, VGG established one thing: a deep, simple network beats a shallow, complicated one.

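6. GoogLeNet / Inception (2014): Exploring Width

6.1 Core Idea

GoogLeNet⁵ (Szegedy et al., Google) asked a pointed question:

Must convolutions be stacked serially? Could several kernel sizes be tried in parallel?

The answer is the Inception module.

6.2 Inception Module (v1)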

        Input
       /  |  \  \
      ↓   ↓   ↓   ↓
   1×1  3×3  5×5  3×3
                   pool
      ↓   ↓   ↓   ↓
       \  |  /  /
        Concat
          ↓
        Output

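Four parallel branches capture features at different scales:

1×1: per-pixel (pointwise) features
3×3: local features
5×5: features over a larger neighborhood
Pool: preserves lower-level information

The branch outputs are concatenated along the channel dimension.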
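Concatenation only works if every branch keeps the same height and width, so each branch pads its input accordingly, and the module's output width is simply the sum of the branch widths. A minimal sketch; the branch widths are the ones passed to `inception3a` and `inception3b` in the implementation in section 6.6, and the helper names are illustrative.

```python
def same_padding(kernel):
    """Padding that preserves H and W for an odd kernel at stride 1."""
    return (kernel - 1) // 2


def concat_channels(branch_channels):
    """Channel count after concatenating the branch outputs."""
    return sum(branch_channels)


# Output widths of the 1x1, 3x3, 5x5 and pool branches.
out_3a = concat_channels([64, 128, 32, 32])
out_3b = concat_channels([128, 192, 96, 64])

print("padding for 1x1, 3x3, 5x5:", [same_padding(k) for k in (1, 3, 5)])
print("inception3a output channels:", out_3a)
print("inception3b output channels:", out_3b)
```

The result for 3a is exactly the input width that 3b expects, which is how consecutive modules chain together.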

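6.3 The Key Role of 1×1 Convolutions

In a naive Inception module the 5×5 convolution is very expensive. The fix is a 1×1 convolution that reduces the channel count before the 3×3 and 5×5 convolutions, plus a 1×1 projection after the pooling branch:

        Input
       /  |   \    \
      |  1×1  1×1  3×3
      |   ↓    ↓   pool
     1×1 3×3  5×5   ↓
      |   ↓    ↓   1×1
       \  |   /    /
        Concat

A 1×1 convolution is a linear combination across channels, so it can cut the channel count sharply. For example:

If a 1×1 convolution compresses 256 channels to 32 before the 5×5, the 5×5 convolution costs 1/8 of what it did (plus the small cost of the 1×1 itself)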
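The saving can be checked by counting multiply-accumulates (MACs). The text fixes only the 256 → 32 reduction; the 28×28 feature map and the 64 output channels below are assumptions chosen for illustration.

```python
def conv_macs(in_ch, out_ch, kernel, height, width):
    """Multiply-accumulates of a stride-1, same-padded convolution (bias ignored)."""
    return kernel * kernel * in_ch * out_ch * height * width


H = W = 28                         # assumed feature-map size
C_IN, C_RED, C_OUT = 256, 32, 64   # C_OUT is an assumed output width

direct = conv_macs(C_IN, C_OUT, 5, H, W)        # 5x5 applied to all 256 channels
reduce_1x1 = conv_macs(C_IN, C_RED, 1, H, W)    # 1x1 reduction, 256 -> 32
conv_5x5 = conv_macs(C_RED, C_OUT, 5, H, W)     # 5x5 on the reduced tensor

ratio_5x5_only = conv_5x5 / direct
ratio_with_reduce = (reduce_1x1 + conv_5x5) / direct
print(f"5x5 alone: {ratio_5x5_only:.3f} of the direct cost")
print(f"1x1 + 5x5: {ratio_with_reduce:.3f} of the direct cost")
```

The 5×5 itself drops to exactly 32/256 = 1/8 of the direct cost regardless of the output width; including the 1×1 reduction brings the branch to roughly 1/7 under these assumptions.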

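6.4 Auxiliary Classifiers

GoogLeNet attaches two extra classification heads (auxiliary classifiers) to intermediate layers:

They inject extra gradient during training, easing vanishing gradients in the deep network
They are discarded at test time
Their losses are added with weight 0.3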
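In code, the three points above amount to one weighted sum that is used only in training. A minimal sketch with plain floats standing in for loss values; the function name is illustrative, and in a real training loop the inputs would be loss tensors.

```python
def googlenet_training_loss(main_loss, aux_losses, aux_weight=0.3):
    """Main loss plus the down-weighted auxiliary losses (training only)."""
    return main_loss + aux_weight * sum(aux_losses)


# Made-up loss values for one training step.
train_loss = googlenet_training_loss(1.0, [2.0, 3.0])
print(f"training loss: {train_loss:.2f}")

# At test time the auxiliary heads are dropped, so only the main loss remains.
eval_loss = googlenet_training_loss(1.0, [])
print(f"evaluation loss: {eval_loss:.2f}")
```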

6.5 Architecture Overview

GoogLeNet has 22 layers but only about 5M parameters (12× fewer than AlexNet).

Stage	Output size	Layers
Conv1	112×112	7×7, stride 2
Conv2	56×56	max pool + 1×1 + 3×3
Inception (3a, 3b)	28×28	max pool + 2 modules
Inception (4a-4e)	14×14	max pool + 5 modules, aux heads after 4a and 4d
Inception (5a, 5b)	7×7	max pool + 2 modules
Pool + Linear	1×1	avg pool + FC

top-5 error: 6.7%

6.6 PyTorch Implementation

class InceptionModule(nn.Module):
    """Inception module with 1×1 dimension reduction (ReLU after every conv)."""
    
    def __init__(self, in_channels, ch_1x1, ch_3x3_red, ch_3x3, 
                 ch_5x5_red, ch_5x5, pool_proj):
        super().__init__()
        # 1×1 branch
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_channels, ch_1x1, 1),
            nn.ReLU(inplace=True),
        )
        # 3×3 branch (1×1 reduction, then 3×3)
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, ch_3x3_red, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch_3x3_red, ch_3x3, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        # 5×5 branch (1×1 reduction, then 5×5)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, ch_5x5_red, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch_5x5_red, ch_5x5, 5, padding=2),
            nn.ReLU(inplace=True),
        )
        # Pool branch (3×3 max pool, then 1×1 projection)
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_channels, pool_proj, 1),
            nn.ReLU(inplace=True),
        )
    
    def forward(self, x):
        return torch.cat([
            self.branch1(x),
            self.branch2(x),
            self.branch3(x),
            self.branch4(x),
        ], dim=1)
 
 
class GoogLeNet(nn.Module):
    """GoogLeNet, simplified skeleton (intermediate layers omitted)."""
    
    def __init__(self, num_classes=1000, aux_logits=True):
        super().__init__()
        self.aux_logits = aux_logits
        
        self.conv1 = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3),
            nn.ReLU(True),
            nn.MaxPool2d(3, stride=2, padding=1),
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(64, 192, 3, padding=1),
            nn.ReLU(True),
            nn.MaxPool2d(3, stride=2, padding=1),
        )
        
        # Inception 3a, 3b
        self.inception3a = InceptionModule(192, 64, 96, 128, 16, 32, 32)
        self.inception3b = InceptionModule(256, 128, 128, 192, 32, 96, 64)
        self.maxpool3 = nn.MaxPool2d(3, stride=2, padding=1)
        
        # Inception 4a-4e
        self.inception4a = InceptionModule(480, 192, 96, 208, 16, 48, 64)
        # ... intermediate layers omitted ...
        
        # Auxiliary classifiers
        if self.aux_logits:
            self.aux1 = self._aux_head(512, num_classes)
            self.aux2 = self._aux_head(528, num_classes)
    
    def _aux_head(self, in_channels, num_classes):
        return nn.Sequential(
            nn.AdaptiveAvgPool2d((4, 4)),
            nn.Conv2d(in_channels, 128, 1),
            nn.ReLU(True),
            nn.Flatten(),
            nn.Linear(128 * 16, 1024),
            nn.ReLU(True),
            nn.Dropout(0.7),
            nn.Linear(1024, num_classes),
        )
    
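    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.maxpool3(self.inception3b(self.inception3a(x)))
        x = self.inception4a(x)  # 512 channels, the width aux1 expects
        if self.training and self.aux_logits:
            aux1 = self.aux1(x)
        # ... remaining layers and the main classifier omitted ...
        return x  # main output (plus aux1, aux2 during training)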

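6.7 Later Inception Versions

Version	Year	Changes
v1	2014	Original Inception
v2	2015	Adds BatchNorm; 5×5 → two 3×3
v3	2015	7×7 → three 3×3; BN added to the auxiliary classifier
v4 / Inception-ResNet	2016	Streamlined design; the Inception-ResNet variants add residual connections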

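7. ResNet (2015): The Breakthrough to Very Deep Networks

7.1 The Problem: Degradation with Depth

An experimental observation from 2014-2015:

Beyond a certain depth, deeper plain networks have higher training error (this is not overfitting!)

A 56-layer network had higher training and test error than a 20-layer one. That points to an optimization difficulty rather than a generalization problem.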

7.2 Residual Learning

Kaiming He et al. proposed residual learning⁶ to address this:

$$y = F(x, \{W_{i}\}) + x$$

That is, the network learns the residual $F$ rather than learning $y$ directly.

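7.3 Why Does It Work?

Core insight: fitting an identity mapping $y = x$ with a stack of nonlinear layers is hard, but driving the residual $F$ to 0 is easy.

During backpropagation the gradient can flow straight back along the identity path. With identity shortcuts, $x_{L} = x_{l} + \sum_{i=l}^{L-1} F(x_{i}, W_{i})$, so (the $L$ in $\partial L$ is the loss, while the subscript $L$ indexes a deeper layer):

$$\frac{\partial L}{\partial x_{l}} = \frac{\partial L}{\partial x_{L}} \left(1 + \frac{\partial}{\partial x_{l}} \sum_{i=l}^{L-1} F(x_{i}, W_{i})\right)$$

The "1" on the right-hand side gives the gradient a direct path that bypasses every weight layer, so it is unlikely to vanish.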
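The effect of that "1" can be seen in a toy scalar model. The sketch below is an illustration under strong simplifying assumptions, not a model of a real ResNet: every layer is the scalar function F(x) = w·tanh(x) with the same small weight w, and the gradient through the stack is accumulated by the chain rule, so each plain layer contributes a factor F'(x) and each residual layer a factor 1 + F'(x).

```python
import math


def gradient_through_stack(depth, weight, residual, x0=0.5):
    """Return d x_depth / d x_0 for a stack of scalar layers F(x) = weight * tanh(x)."""
    x, grad = x0, 1.0
    for _ in range(depth):
        f = weight * math.tanh(x)
        df = weight * (1.0 - math.tanh(x) ** 2)  # F'(x)
        if residual:
            x = x + f                # x_{l+1} = x_l + F(x_l)
            grad *= 1.0 + df         # the "1" comes from the identity path
        else:
            x = f                    # x_{l+1} = F(x_l)
            grad *= df
    return grad


plain_grad = gradient_through_stack(depth=50, weight=0.1, residual=False)
residual_grad = gradient_through_stack(depth=50, weight=0.1, residual=True)
print(f"plain stack:    {plain_grad:.3e}")
print(f"residual stack: {residual_grad:.3e}")
```

With 50 layers the plain gradient shrinks by at least a factor of 10 per layer and becomes numerically negligible, while the residual gradient stays at or above 1.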

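7.4 Bottleneck Design

To cut computation, the deeper ResNets use a three-layer bottleneck block:

        Input
          ↓
        1×1 Conv (reduce)
          ↓
        3×3 Conv
          ↓
        1×1 Conv (expand)
          ↓
        + Input (skip)
          ↓
        Output

For example, reducing 256 channels to 64 for the 3×3 convolution and then expanding back to 256 makes that 3×3 convolution cost $(64/256)^{2} = 1/16$ of a 256-channel one.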
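The cost of the bottleneck can be worked out by counting weights (MACs scale the same way at a fixed resolution). A minimal sketch for the 256 → 64 → 256 example, compared against a pair of plain 3×3 convolutions at 256 channels; biases and BatchNorm are ignored, and the comparison baseline is an assumption chosen for illustration.

```python
def conv_weights(in_ch, out_ch, kernel):
    """Number of weights in a convolution (bias ignored)."""
    return kernel * kernel * in_ch * out_ch


bottleneck = (
    conv_weights(256, 64, 1)     # 1x1 reduce
    + conv_weights(64, 64, 3)    # 3x3 at the reduced width
    + conv_weights(64, 256, 1)   # 1x1 expand
)
plain_pair = 2 * conv_weights(256, 256, 3)  # two 3x3 convs at full width

ratio_3x3 = conv_weights(64, 64, 3) / conv_weights(256, 256, 3)
ratio_block = bottleneck / plain_pair
print(f"3x3 conv alone: {ratio_3x3:.4f} of the full-width 3x3")
print(f"whole block:    {bottleneck:,} vs {plain_pair:,} weights ({ratio_block:.3f})")
```

Cutting the width by 4 cuts the 3×3 convolution by 16, because its cost is proportional to the product of input and output channels.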

7.5 ResNet-50 Architecture

Stage	Output	Layers
Conv1	112×112	7×7, 64, stride 2
Conv2_x	56×56	3×3 max pool, stride 2; [1×1, 64; 3×3, 64; 1×1, 256] × 3
Conv3_x	28×28	[1×1, 128; 3×3, 128; 1×1, 512] × 4
Conv4_x	14×14	[1×1, 256; 3×3, 256; 1×1, 1024] × 6
Conv5_x	7×7	[1×1, 512; 3×3, 512; 1×1, 2048] × 3
Pool + FC	1×1	avg pool + FC

Parameters: 25.6M
top-5 error: 5.25%
FLOPs: 4.1G
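The "50" in ResNet-50 and the output sizes in the table both follow from the per-stage block counts. A minimal sketch: [3, 4, 6, 3] and [3, 4, 23, 3] are the configurations used elsewhere in this article, while [3, 8, 36, 3] for ResNet-152 is taken from the ResNet paper rather than from this article.

```python
def bottleneck_resnet_depth(blocks_per_stage):
    """Weighted layers: stem conv + 3 convs per bottleneck block + final FC."""
    return 1 + 3 * sum(blocks_per_stage) + 1


def stage_output_sizes(input_size=224):
    """Spatial size at the output of conv2_x .. conv5_x."""
    size = input_size // 2   # conv1, stride 2
    size //= 2               # 3x3 max pool, stride 2
    sizes = [size]           # conv2_x keeps this resolution
    for _ in range(3):       # conv3_x, conv4_x, conv5_x each halve it
        size //= 2
        sizes.append(size)
    return sizes


for name, blocks in [("ResNet-50", [3, 4, 6, 3]),
                     ("ResNet-101", [3, 4, 23, 3]),
                     ("ResNet-152", [3, 8, 36, 3])]:
    print(name, bottleneck_resnet_depth(blocks))
print("stage sizes:", stage_output_sizes(224))
```

Projection shortcuts and BatchNorm layers are not counted, which matches the usual convention for naming these networks.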

7.6 PyTorch Implementation

class Bottleneck(nn.Module):
    """ResNet bottleneck block (ResNet-50/101/152); the stride sits on the 3×3 conv."""
    expansion = 4
    
    def __init__(self, in_channels, mid_channels, stride=1, downsample=None):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, mid_channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, 3, 
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        self.conv3 = nn.Conv2d(mid_channels, mid_channels * 4, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(mid_channels * 4)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
    
    def forward(self, x):
        identity = x
        
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        
        if self.downsample is not None:
            identity = self.downsample(x)
        
        out += identity
        return self.relu(out)
 
 
class ResNet(nn.Module):
    """ResNet backbone; resnet50() below builds the 50-layer variant."""
    
    def __init__(self, block, num_blocks, num_classes=1000):
        super().__init__()
        self.in_channels = 64
        
        self.conv1 = nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(3, stride=2, padding=1)
        
        # Four stages of residual blocks
        self.layer1 = self._make_layer(block, 64, num_blocks[0], stride=1)
        self.layer2 = self._make_layer(block, 128, num_blocks[1], stride=2)
        self.layer3 = self._make_layer(block, 256, num_blocks[2], stride=2)
        self.layer4 = self._make_layer(block, 512, num_blocks[3], stride=2)
        
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)
        
        # Weight initialization
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)
    
    def _make_layer(self, block, mid_channels, num_blocks, stride):
        downsample = None
        if stride != 1 or self.in_channels != mid_channels * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.in_channels, mid_channels * block.expansion, 1, 
                          stride=stride, bias=False),
                nn.BatchNorm2d(mid_channels * block.expansion),
            )
        
        layers = [block(self.in_channels, mid_channels, stride, downsample)]
        self.in_channels = mid_channels * block.expansion
        for _ in range(1, num_blocks):
            layers.append(block(self.in_channels, mid_channels))
        
        return nn.Sequential(*layers)
    
    def forward(self, x):
        x = self.relu(self.bn1(self.conv1(x)))
        x = self.maxpool(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        return self.fc(x)
 
 
def resnet50(num_classes=1000):
    return ResNet(Bottleneck, [3, 4, 6, 3], num_classes)
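The model factory in the appendix builds ResNet-18 from a `BasicBlock` class that the code above does not define. The following is a minimal sketch of such a block, assuming the same constructor signature as `Bottleneck` so that it can be passed to `ResNet._make_layer`; it follows the two-layer residual block described in the ResNet paper and should be read as an illustration rather than as the article's own code.

```python
import torch
import torch.nn as nn


class BasicBlock(nn.Module):
    """Two-layer residual block (sketch), the kind used by ResNet-18/34."""
    expansion = 1  # output channels = mid_channels * expansion

    def __init__(self, in_channels, mid_channels, stride=1, downsample=None):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, mid_channels, 3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, 3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_channels)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample  # projection for the skip path, or None

    def forward(self, x):
        # Use the projection only when the shape changes across the block.
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)


# Shape check on a random batch: an identity-shortcut block keeps the shape.
block = BasicBlock(64, 64)
y = block(torch.randn(2, 64, 56, 56))
print(tuple(y.shape))
```

With `expansion = 1`, the final `nn.Linear(512 * block.expansion, num_classes)` in `ResNet` receives 512 features, as ResNet-18 and ResNet-34 require.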

7.7 ResNet Variants

Model	Depth	top-5 error
ResNet-18	18	10.63%
ResNet-34	34	8.58%
ResNet-50	50	5.25%
ResNet-101	101	4.60%
ResNet-152	152	4.49%

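8. Side-by-Side Comparison: The Logic of Architectural Evolution

8.1 ImageNet Error Over Time

Year	Model	top-5 error	Key innovation
2010	Traditional methods	28.2%	-
2011	Traditional methods	25.8%	-
2012	AlexNet	15.3%	ReLU + Dropout + GPU
2013	ZFNet	11.7%	Visualization-guided tuning
2014	VGG	7.3%	Deeper + small kernels
2014	GoogLeNet	6.7%	Inception module
2015	ResNet	3.57%	Residual connections
2016	Ensemble	3.0%	Model ensembling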

8.2 Parameters vs Accuracy

Model	Parameters	top-5 error	Accuracy per parameter
AlexNet	60M	15.3%	Low
VGG-16	138M	7.3%	Medium
GoogLeNet	5M	6.7%	High
ResNet-50	25M	5.25%	High

8.3 Three Design Philosophies

VGG: deep and simple (depth first)
GoogLeNet: multi-scale and wide (structural priors)
ResNet: residual and very deep (optimization-friendly)

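9. Lessons from History

9.1 Motivation-Driven Architecture Design

Each successful architecture grew out of an insight into its predecessor's limits:

LeNet → AlexNet: ReLU, Dropout, and GPUs, answering "deep networks can't be trained"
AlexNet → VGG: stacked small kernels, answering "large kernels are parameter-hungry"
VGG → GoogLeNet: parallel multi-scale branches, answering "the parameter count is still too large"
GoogLeNet → ResNet: residual connections, answering "depth has hit a wall"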

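9.2 Shared Design Principles

Small kernels over large ones: 3×3 became the near-universal default
BN as standard equipment: training stability
Bottleneck design: 1×1 reduce + 3×3 + 1×1 expand
Global average pooling instead of FC layers: fewer parameters
Residual connections: remove the depth barrier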

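9.3 After the Classic CNNs

Architectures kept evolving after ResNet:

Direction	Examples
Efficiency	MobileNet, EfficientNet, ShuffleNet
Attention	SENet, CBAM, ECA-Net
Neural architecture search	NASNet, DARTS, EfficientNet
Vision Transformers (and ConvNets redesigned in their wake)	ViT, Swin, ConvNeXt

All of this later work builds on the foundation laid from LeNet to ResNet.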

10. References

Appendix: A PyTorch Model Factory

def build_classic_cnn(name, num_classes=1000):
    """Build a classic CNN by name ('resnet18' requires a two-layer BasicBlock class in scope)."""
    if name == 'alexnet':
        return AlexNet(num_classes)
    elif name == 'vgg16':
        return VGG(num_classes)
    elif name == 'googlenet':
        return GoogLeNet(num_classes, aux_logits=True)
    elif name == 'resnet18':
        return ResNet(BasicBlock, [2, 2, 2, 2], num_classes)
    elif name == 'resnet50':
        return ResNet(Bottleneck, [3, 4, 6, 3], num_classes)
    elif name == 'resnet101':
        return ResNet(Bottleneck, [3, 4, 23, 3], num_classes)
    else:
        raise ValueError(f"Unknown model: {name}")

Last updated: 2026-06-22

¹ Russakovsky, O., Deng, J., Su, H., et al. (2015). ImageNet Large Scale Visual Recognition Challenge. IJCV.
² LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE.
³ Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS.
⁴ Simonyan, K., & Zisserman, A. (2014). Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556.
⁵ Szegedy, C., et al. (2014). Going Deeper with Convolutions. arXiv:1409.4842.
⁶ He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. arXiv:1512.03385.
