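Modern Convolutional Neural Network Architectures

Since AlexNet set off the deep learning boom in 2012, convolutional neural networks (CNNs) have gone through several major redesigns. This article traces the evolution of the classic architectures, from AlexNet and VGGNet through ConvNeXt and RepLKNet. [1]

1. Early Foundations: AlexNet and VGGNet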

1.1 AlexNet (2012)

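Winner of the ImageNet competition (ILSVRC 2012); it opened the deep learning era.

Architectural features:

  • 5 convolutional layers + 3 fully connected layers
  • ReLU activation
  • Dropout regularization
  • Parallel training on two GPUs
  • Local Response Normalization (LRN)

# Simplified AlexNet implementation (LRN and the two-GPU split are omitted)
import torch.nn as nn
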
class AlexNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            # Conv1: 96 kernels, 11×11, stride=4
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),  # 55×55 → 27×27
            
            # Conv2: 256 kernels, 5×5
            nn.Conv2d(96, 256, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),  # 27×27 → 13×13
            
            # Conv3-5: 384, 384, 256
            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),  # 13×13 → 6×6
        )
        
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
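        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)   # fixes the spatial size at 6×6 for any input resolution
        x = x.flatten(1)      # (B, 256, 6, 6) → (B, 9216)
        return self.classifier(x)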

1.2 VGGNet (2014)

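Deeper networks with a simpler filter design.

Key contributions:

  • Uses 3×3 convolution kernels throughout
  • Builds large receptive fields by stacking small convolutions
  • Two stacked 3×3 convolutions ≈ one 5×5 receptive field (three ≈ 7×7)
  • Fewer parameters and more nonlinearities, hence greater expressive power

Receptive field calculation

$$RF_l = RF_{l-1} + (k_l - 1) \times \prod_{i=1}^{l-1} s_i$$

where $k_l$ is the kernel size of layer $l$, $s_i$ is the stride of layer $i$, and $RF_0 = 1$.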
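
As a rough check of the receptive-field calculation described above, the following sketch walks a stack of layers and accumulates the receptive field. The function name `receptive_field` and the `(kernel_size, stride)` tuple layout are illustrative choices, not part of any library.

```python
def receptive_field(layers):
    """Receptive field of a stack of conv/pool layers.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    rf = 1      # a single input pixel sees only itself
    jump = 1    # product of the strides of all earlier layers
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


print(receptive_field([(3, 1), (3, 1)]))          # 5: two 3×3 convs cover 5×5
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7: three 3×3 convs cover 7×7
# Weights per input/output channel pair: three 3×3 kernels vs one 7×7 kernel
print(3 * 3 * 3, "vs", 7 * 7)                     # 27 vs 49
```

The last line is the parameter argument for small kernels: three 3×3 layers reach the same 7×7 receptive field with 27 weights per channel pair instead of 49.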

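VGG-16 architecture

| Layer type | Output size (H = W) | Channels |
|---|---|---|
| Conv1_1-2 | 224 | 64 |
| MaxPool | 112 | 64 |
| Conv2_1-2 | 112 | 128 |
| MaxPool | 56 | 128 |
| Conv3_1-3 | 56 | 256 |
| MaxPool | 28 | 256 |
| Conv4_1-3 | 28 | 512 |
| MaxPool | 14 | 512 |
| Conv5_1-3 | 14 | 512 |
| MaxPool | 7 | 512 |
| FC | 1 | 4096 |
| FC | 1 | 4096 |
| FC | 1 | 1000 |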

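2. GoogLeNet and the Inception Module

2.1 Motivation: Sparse Connectivity

Conventional dense connectivity has two problems:

  • Parameter explosion: a k×k convolution needs C_in × C_out × k² weights
  • Prone to overfitting

Hebbian principle: "neurons that fire together, wire together."

Solution: exploit sparsity, but approximate the sparse connections with dense matrix operations, which hardware executes efficiently.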

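2.2 The Inception Module

Core idea: parallel multi-scale convolutions plus a parallel pooling branch, concatenated along the channel dimension

Input
    ├── 1×1 Conv ─────────────────────┐
    ├── 1×1 Conv → 3×3 Conv ─────────┤
    ├── 1×1 Conv → 5×5 Conv ─────────┤→ Concat → Output
    └── 3×3 MaxPool → 1×1 Conv ──────┘

Inception v1 module

import torch
import torch.nn as nn


class InceptionModule(nn.Module):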
    def __init__(self, in_channels, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj):
        super().__init__()
        
        # 1×1 convolution branch
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_channels, ch1x1, kernel_size=1),
            nn.ReLU(inplace=True)
        )
        
        # 1×1 → 3×3 branch
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, ch3x3red, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch3x3red, ch3x3, kernel_size=3, padding=1),
            nn.ReLU(inplace=True)
        )
        
        # 1×1 → 5×5 branch
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, ch5x5red, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch5x5red, ch5x5, kernel_size=5, padding=2),
            nn.ReLU(inplace=True)
        )
        
        # MaxPool → 1×1 branch
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, pool_proj, kernel_size=1),
            nn.ReLU(inplace=True)
        )
    
    def forward(self, x):
        b1 = self.branch1(x)
        b2 = self.branch2(x)
        b3 = self.branch3(x)
        b4 = self.branch4(x)
        return torch.cat([b1, b2, b3, b4], dim=1)

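2.3 Dimensionality Reduction

The 1×1 convolutions in the Inception module serve two purposes:

  1. Dimensionality reduction: fewer channels, hence less computation
  2. Nonlinearity: each 1×1 convolution is followed by a ReLU, adding depth

Input: 192 channels
    ├── 1×1, 64              (12.3K params)  → 64 channels
    ├── 1×1, 96 → 3×3, 128   (129.0K params) → 128 channels
    └── 1×1, 16 → 5×5, 32    (15.9K params)  → 32 channels

Total: 224 channels, ≈157.2K params, vs ≈387.1K if the 3×3 and 5×5 were applied directly to all 192 channels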
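
The weight counts for the 192-channel example can be recomputed with a few lines of arithmetic. This is a minimal sketch that counts convolution weights only (biases and the pooling branch are left out); `conv_weights` is a helper introduced here for illustration.

```python
def conv_weights(c_in, c_out, k):
    """Weight count of a k×k convolution, ignoring biases."""
    return c_in * c_out * k * k


C_IN = 192  # input channels of the example

# With 1×1 reductions in front of the 3×3 and 5×5 convolutions
reduced = {
    "1x1": conv_weights(C_IN, 64, 1),
    "3x3": conv_weights(C_IN, 96, 1) + conv_weights(96, 128, 3),
    "5x5": conv_weights(C_IN, 16, 1) + conv_weights(16, 32, 5),
}
# Same output widths, but 3×3 and 5×5 applied directly to all 192 channels
direct = {
    "1x1": conv_weights(C_IN, 64, 1),
    "3x3": conv_weights(C_IN, 128, 3),
    "5x5": conv_weights(C_IN, 32, 5),
}

print(reduced)                # {'1x1': 12288, '3x3': 129024, '5x5': 15872}
print(sum(reduced.values()))  # 157184
print(sum(direct.values()))   # 387072
```

Under these assumptions the 1×1 reductions cut the weight count of the three convolutional branches by roughly 2.5×.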

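3. ResNet: The Residual Learning Revolution

(See resnet-deep-residual-learning for details.)

4. ResNeXt: Multi-Branch Aggregation

4.1 Core Idea

ResNeXt combines ideas from ResNet and Inception:

  • Grouped convolution, with the number of groups called cardinality
  • Keeps ResNet's residual structure
  • Repeats one block template, as VGG does

4.2 Grouped Convolution

Standard convolution: every output channel is computed from all input channels.

# Input:  (C_in, H, W)
# Kernel: (C_out, C_in, k, k)
# Output: (C_out, H', W')

Grouped convolution: split the input channels into g groups, convolve each group separately, then concatenate the results.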

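import torch
import torch.nn as nn


def grouped_convolution(x, in_channels, out_channels, groups, kernel_size):
    """Illustrative grouped convolution: split channels, convolve each group, concatenate.

    Every call creates fresh, randomly initialized convolutions, so this only
    shows the data flow. In practice a single layer does the same job:
    nn.Conv2d(in_channels, out_channels, kernel_size, groups=groups).
    """
    # x: (B, C_in, H, W)
    assert in_channels % groups == 0 and out_channels % groups == 0
    in_per_group = in_channels // groups
    out_per_group = out_channels // groups

    outputs = []
    for i in range(groups):
        # Each group is convolved independently
        group_input = x[:, i * in_per_group:(i + 1) * in_per_group, :, :]
        group_conv = nn.Conv2d(in_per_group, out_per_group, kernel_size)
        outputs.append(group_conv(group_input))

    # Concatenate along the channel dimension
    return torch.cat(outputs, dim=1)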

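4.3 ResNeXt Bottleneck

Input (256-d)
    │
    ├── 1×1, 128 → 3×3 grouped conv, 128 channels, groups=32 → 1×1, 256 ──┐
    │                                                                      ├→ Add → ReLU
    └── Shortcut (projected if the dimensions must be matched) ───────────┘

Compute comparison

  • Standard ResNet (3×3 layer): C_in × C_out × 3 × 3 weights
  • ResNeXt (g=32, 3×3 layer): C_in × C_out × 3 × 3 / g weights

At equal width, grouping cuts the cost of the 3×3 layer by a factor of g = 32. ResNeXt spends that saving on a wider bottleneck (128 channels instead of the 64 in the matching ResNet-50 block), so the block as a whole costs about the same as its ResNet counterpart.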
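
To make the comparison concrete, the sketch below counts weights for the 3×3 layer alone and for a whole bottleneck with a 256-d input. The helper names are illustrative, biases are ignored, and the ResNet-style width of 64 is the first-stage bottleneck width of ResNet-50, assumed here as the baseline.

```python
def conv_weights(c_in, c_out, k, groups=1):
    """Weight count of a k×k convolution with `groups` groups, ignoring biases."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k * k


def bottleneck_weights(c_io, width, groups=1):
    """1×1 reduce -> 3×3 (optionally grouped) -> 1×1 expand."""
    return (conv_weights(c_io, width, 1)
            + conv_weights(width, width, 3, groups)
            + conv_weights(width, c_io, 1))


# 3×3 layer alone, 128 -> 128 channels
print(conv_weights(128, 128, 3))             # 147456
print(conv_weights(128, 128, 3, groups=32))  # 4608, i.e. 32× fewer

# Whole block with 256-d input and output
print(bottleneck_weights(256, 64))              # 69632 (ResNet-style, width 64)
print(bottleneck_weights(256, 128, groups=32))  # 70144 (ResNeXt 32×4d)
```

The two blocks land within about 1% of each other, which is the sense in which ResNeXt raises cardinality at roughly constant complexity.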

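4.4 Architecture Variants

| Configuration | Cardinality | Channels | Parameters | ImageNet Top-1 |
|---|---|---|---|---|
| ResNet-50 | 1 | 64→256 | 25.6M | 76.0% |
| ResNeXt-50_32×4d | 32 | 4→256 | 27.3M | 78.8% |
| ResNeXt-101_32×8d | 32 | 8→2048 | 46.5M | 80.9% |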

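5. DenseNet: Dense Connectivity

5.1 Core Idea

Where ResNet adds (Add), DenseNet concatenates (Concat):

$$\text{ResNet: } x_l = H_l(x_{l-1}) + x_{l-1} \qquad \text{DenseNet: } x_l = H_l([x_0, x_1, \dots, x_{l-1}])$$

5.2 Dense Block Structure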

Dense Block:
    
Layer 0: Input (H×W×C_0)
    ↓
Layer 1: H_1([x_0]) → H×W×C_1
    ↓ concat: H×W×(C_0 + C_1)
Layer 2: H_2([x_0, x_1]) → H×W×C_1
    ↓ concat: H×W×(C_0 + 2C_1)
Layer 3: H_3([x_0, x_1, x_2]) → H×W×C_1
    ↓ concat: H×W×(C_0 + 3C_1)
    ...

5.3 DenseNet Implementation

import torch
import torch.nn as nn


class DenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # Input channels grow by growth_rate with every layer
            layer_in_channels = in_channels + i * growth_rate
            self.layers.append(self._make_layer(layer_in_channels, growth_rate))
    
    def _make_layer(self, in_channels, growth_rate):
        return nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False)
        )
    
    def forward(self, x):
        features = [x]
        for layer in self.layers:
            new_feature = layer(torch.cat(features, dim=1))
            features.append(new_feature)
        return torch.cat(features, dim=1)
 
 
class TransitionLayer(nn.Module):
    """Transition layer: reduces the channel count and the spatial size."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.AvgPool2d(kernel_size=2, stride=2)
        )
    
    def forward(self, x):
        return self.conv(x)
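
The channel bookkeeping of dense blocks and transition layers can be traced without building the network. The sketch below assumes DenseNet-121-style settings (64 initial channels, growth rate 32, block sizes 6/12/24/16) and a transition compression of 0.5; these values and the function name are assumptions made for illustration, not taken from the code above.

```python
def densenet_channels(init_channels, growth_rate, block_layers, compression=0.5):
    """Trace channel counts through alternating dense blocks and transition layers."""
    channels = init_channels
    trace = []
    for i, num_layers in enumerate(block_layers):
        channels += num_layers * growth_rate   # each layer appends growth_rate maps
        trace.append(("block", channels))
        if i != len(block_layers) - 1:         # no transition after the last block
            channels = int(channels * compression)
            trace.append(("transition", channels))
    return trace


# Assumed DenseNet-121-style settings
for name, channels in densenet_channels(64, 32, [6, 12, 24, 16]):
    print(name, channels)
```

Each dense block adds `num_layers × growth_rate` channels by concatenation, and each transition layer halves the count again, which keeps the width from growing without bound.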

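5.4 Feature Comparison

| Feature | ResNet | DenseNet |
|---|---|---|
| Connection type | Add | Concat |
| Parameters | More | Fewer (features are reused) |
| Memory footprint | Lower | Higher (all feature maps must be kept) |
| Gradient flow | Direct propagation | Multi-path propagation |
| Feature reuse | Implicit | Explicit |

6. EfficientNet: Compound Scaling

6.1 Core Idea

Conventional ways to scale a network:

  • Depth scaling: add more layers
  • Width scaling: add more channels per layer
  • Resolution scaling: increase the input resolution

Problem: scaling a single dimension in isolation yields diminishing accuracy gains as the model grows.

6.2 Compound Scaling Formula

EfficientNet scales all three dimensions jointly with a single coefficient $\phi$:

$$d = \alpha^{\phi}, \quad w = \beta^{\phi}, \quad r = \gamma^{\phi}$$

Constraints:

$$\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2, \quad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1$$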
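
A small numeric sketch of compound scaling, where depth, width and resolution are multiplied by α^φ, β^φ and γ^φ respectively. The coefficients α = 1.2, β = 1.1, γ = 1.15 are the values reported in the EfficientNet paper; the function names and the base resolution of 224 are choices made here, and the released B1-B7 models use hand-rounded settings rather than these raw values.

```python
# Coefficients reported for EfficientNet (found by a small grid search at phi = 1)
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi, base_depth=1.0, base_width=1.0, base_resolution=224):
    """Scaled depth multiplier, width multiplier and input resolution for a given phi."""
    depth = base_depth * ALPHA ** phi
    width = base_width * BETA ** phi
    resolution = base_resolution * GAMMA ** phi
    return depth, width, resolution


def flops_factor(phi):
    """FLOPs grow roughly as depth × width² × resolution²."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


print(round(flops_factor(1), 3))  # 1.92, close to the target of 2
for phi in range(4):
    d, w, r = compound_scale(phi)
    print(phi, round(d, 2), round(w, 2), round(r))
```

Because α · β² · γ² is close to 2, each unit increase in φ roughly doubles the FLOPs budget.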

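6.3 Mobile Inverted Bottleneck (MBConv)

EfficientNet builds on the MBConv block from MobileNetV2:

Standard ResNet Bottleneck:     MBConv:
    Input                           Input
       ↓                               ↓
    1×1 Conv (↓)                    1×1 Conv (↑) ← expand
       ↓                               ↓
    3×3 Conv                        Depthwise 3×3
       ↓                               ↓
    1×1 Conv (↑)                    SE ← Squeeze-and-Excitation
       ↓                               ↓
    Add + ReLU                      1×1 Conv (↓), linear
       ↓                               ↓
    Output                          Add
                                       ↓
                                    Output

Key properties:

  1. Depthwise separable convolution: sharply reduces computation
  2. Expand then project: channels are widened first and narrowed afterwards, so the nonlinearity acts in the wider space and the final projection stays linear
  3. Skip connection: used only when stride = 1 and the input and output channel counts match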
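
How much a depthwise separable convolution saves can be estimated by counting multiply-accumulates (MACs). This is a back-of-the-envelope sketch; the layer sizes (128 channels on a 56×56 map) are arbitrary example values, and the function names are illustrative.

```python
def standard_conv_macs(c_in, c_out, k, h, w):
    """MACs of a standard k×k convolution producing an h×w output."""
    return c_in * c_out * k * k * h * w


def separable_conv_macs(c_in, c_out, k, h, w):
    """Depthwise k×k (one filter per channel) followed by a pointwise 1×1."""
    depthwise = c_in * k * k * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise


std = standard_conv_macs(128, 128, 3, 56, 56)
sep = separable_conv_macs(128, 128, 3, 56, 56)
print(std, sep, round(std / sep, 2))  # 462422016 54992896 8.41
```

The ratio equals 1 / (1/C_out + 1/k²), so for a 3×3 kernel it approaches 9× as the channel count grows.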

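6.4 EfficientNet-B0 Baseline

import torch.nn as nn

# MBConv is assumed to be defined elsewhere; it is not shown in this article.
class EfficientNetB0(nn.Module):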
    def __init__(self, num_classes=1000):
        super().__init__()
        
        # Stem
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32),
            nn.SiLU(inplace=True)
        )
        
        # MBConv blocks (simplified)
        self.blocks = nn.Sequential(
            # Block 1: 112×112, 16 channels, expand=1
            MBConv(32, 16, kernel_size=3, stride=1, expand_ratio=1),
            # Block 2-9: ... (elided; they must end at 320 channels to match the head)
        )
        
        # Head
        self.head = nn.Sequential(
            nn.Conv2d(320, 1280, kernel_size=1, bias=False),
            nn.BatchNorm2d(1280),
            nn.SiLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Dropout(0.2),
            nn.Linear(1280, num_classes)
        )
    
    def forward(self, x):
        x = self.stem(x)
        x = self.blocks(x)
        x = self.head(x)
        return x

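6.5 Performance Comparison

| Model | Top-1 | Parameters | FLOPs |
|---|---|---|---|
| ResNet-50 | 76.0% | 26M | 4.1G |
| DenseNet-169 | 76.2% | 14M | 3.4G |
| EfficientNet-B0 | 77.1% | 5.3M | 0.39G |
| EfficientNet-B4 | 82.9% | 19M | 4.2G |
| EfficientNet-B7 | 84.3% | 66M | 37G |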

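7. ConvNeXt: A Modern ConvNet

7.1 Background

Vision Transformers (ViT) overtook CNNs on image classification benchmarks in 2020-2021. ConvNeXt (2022) showed that a modernized ConvNet can compete with Transformers.

7.2 Design Principles

ConvNeXt systematically "modernizes" ResNet until it matches or even surpasses ViT-style models:

| Design | ResNet (baseline) | ConvNeXt |
|---|---|---|
| Activation | ReLU | GELU |
| Normalization | BN after each conv | LN after the depthwise conv |
| Number of norm layers | One BN per conv layer | Fewer (one LN per block) |
| Kernel size | 3×3 | 7×7 (large-kernel, depthwise) |
| Block structure | Bottleneck | Inverted bottleneck |
| Downsampling | Stride-2 conv inside the residual blocks | Patchify stem plus separate downsampling layers |
| Stage ratio | ≈[1,1,2,1] (3,4,6,3 blocks) | [1,1,3,1] (3,3,9,3 blocks) |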

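7.3 Key Improvements

1. Macro Design

| Aspect | ResNet | ConvNeXt-T |
|---|---|---|
| Blocks per stage | [3,4,6,3] | [3,3,9,3] |
| Channels | [64,128,256,512] | [96,192,384,768] |
| Stem | 7×7 stride-2 conv + 3×3 stride-2 max pool | 4×4, stride=4 conv (patchify) |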
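
Both stems reduce a 224×224 input by 4×, which can be checked with the usual output-size formula. The sketch assumes the standard ResNet stem (7×7 stride-2 convolution with padding 3, then a 3×3 stride-2 max pool with padding 1) and a 2×2 stride-2 downsampling convolution between ConvNeXt stages; `conv_out` is a helper introduced here.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output side length of a conv/pool layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


# ResNet stem: 7×7 stride-2 conv, then 3×3 stride-2 max pool
resnet = conv_out(conv_out(224, 7, 2, padding=3), 3, 2, padding=1)

# ConvNeXt stem: 4×4 stride-4 "patchify" conv (non-overlapping patches)
convnext = conv_out(224, 4, 4)

print(resnet, convnext)  # 56 56

# Later stages halve the resolution with a 2×2 stride-2 conv between stages
sizes = [convnext]
for _ in range(3):
    sizes.append(conv_out(sizes[-1], 2, 2))
print(sizes)  # [56, 28, 14, 7]
```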

2. Inverted Bottleneck

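# ResNet Bottleneck:     ConvNeXt Block:
#     1×1 ↓                   7×7 depthwise
#     3×3                     1×1 ↑
#     1×1 ↑                   1×1 ↓

import torch.nn as nn


class DropPath(nn.Module):
    """Stochastic depth: drops the residual branch per sample during training."""
    def __init__(self, drop_prob=0.0):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.drop_prob == 0.0 or not self.training:
            return x
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, broadcast over C, H, W
        mask = x.new_empty(x.shape[0], 1, 1, 1).bernoulli_(keep_prob)
        return x * mask / keep_prob


class ConvNeXtBlock(nn.Module):
    def __init__(self, dim, kernel_size=7, drop_path_rate=0.0):
        super().__init__()
        # Depthwise convolution (groups=dim: one filter per channel)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.LayerNorm(dim, eps=1e-6)
        # Pointwise (1×1) convolutions, written as Linear layers on channels-last tensors
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # expand
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)  # project back

        # Stochastic depth on the residual branch
        self.drop_path = DropPath(drop_path_rate)

    def forward(self, x):
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # (B, C, H, W) → (B, H, W, C)
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        x = x.permute(0, 3, 1, 2)  # back to (B, C, H, W)

        return shortcut + self.drop_path(x)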

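3. Large-Kernel Convolution

Key observations:

  • ViT's global attention can be viewed as the limiting case of a large-kernel convolution
  • A depthwise convolution costs k² × C × H × W, a factor of C less than a standard convolution, so a large kernel stays affordable

# Convolution cost comparison (multiply-accumulates, C input and C output channels)
# Standard convolution:
# 3×3: 9 × C² × H × W
# 7×7: 49 × C² × H × W  (≈ 5.4× the 3×3)
 
# Depthwise version (each filter sees a single channel):
# 3×3: 9 × C × H × W
# 7×7: 49 × C × H × W
# Ratio is still ≈ 5.4×, but the 7×7 depthwise is about C/5.4 times cheaper than a standard 3×3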
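
To see why the larger kernel is affordable inside a ConvNeXt-style block, the sketch below splits the per-pixel cost between the depthwise convolution and the two pointwise layers. The width of 96 is the first-stage width of ConvNeXt-T listed earlier; the 4× expansion follows the block shown above, and the function name is illustrative.

```python
def convnext_block_macs(dim, kernel_size=7, expansion=4):
    """Per-pixel multiply-accumulates of the layers in a ConvNeXt-style block."""
    depthwise = kernel_size * kernel_size * dim   # k×k depthwise conv
    expand = dim * expansion * dim                # 1×1: dim -> 4·dim
    project = expansion * dim * dim               # 1×1: 4·dim -> dim
    return depthwise, expand + project


for k in (3, 7):
    dw, pw = convnext_block_macs(96, kernel_size=k)
    print(k, dw, pw, round(dw / (dw + pw), 3))
# 3 864 73728 0.012
# 7 4704 73728 0.06
```

Even at 7×7 the depthwise layer accounts for only about 6% of the block's multiply-accumulates at this width; the pointwise layers dominate.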

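7.4 Performance Comparison

| Model | ImageNet Top-1 | Parameters | FLOPs |
|---|---|---|---|
| Swin-T | 81.3% | 28M | 4.5G |
| ConvNeXt-T | 82.1% | 28M | 4.5G |
| Swin-S | 83.1% | 50M | 8.7G |
| ConvNeXt-S | 83.1% | 50M | 8.7G |
| Swin-B | 83.9% | 88M | 15.4G |
| ConvNeXt-B | 83.8% | 89M | 15.4G |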

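8. RepLKNet: Large-Kernel Convolution

8.1 Core Idea

RepLKNet: "Scaling Up Your Kernels to 31×31"

Key findings:

  • ViT's success is partly attributed to its global receptive field
  • CNNs can achieve a similar effect by enlarging their convolution kernels
  • A 31×31 kernel is reported as the best accuracy-efficiency trade-off on ImageNet

8.2 Architecture Design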

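import torch.nn as nn


class RepLKNetBlock(nn.Module):
    """Simplified large-kernel block (re-parameterization and normalization omitted)."""
    def __init__(self, in_channels, out_channels, kernel_size=31, stride=1):
        super().__init__()

        # Large-kernel depthwise convolution; the stride is applied here so the
        # main branch and the shortcut produce the same spatial size
        self.dwconv = nn.Conv2d(
            in_channels, in_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=kernel_size // 2,
            groups=in_channels
        )

        # Pointwise convolution
        self.pwconv = nn.Conv2d(in_channels, out_channels, kernel_size=1)

        # Shortcut connection
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                # ceil_mode matches the strided conv's output size for odd inputs
                nn.AvgPool2d(stride, ceil_mode=True) if stride > 1 else nn.Identity(),
                nn.Conv2d(in_channels, out_channels, 1)
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        return self.shortcut(x) + self.pwconv(self.dwconv(x))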

8.3 Performance

| Model | ImageNet Top-1 | Inference speed |
|---|---|---|
| Swin-B | 83.9% | 1.0× |
| RepLKNet-31B | 84.8% | 1.2× (faster) |

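9. Architecture Comparison Summary

| Architecture | Core innovation | Strengths | Weaknesses |
|---|---|---|---|
| VGGNet | Deep network + small kernels | Simple, regular | Large parameter count |
| GoogLeNet | Inception module | Multi-scale features | Complex structure |
| ResNet | Residual connections | Stable training, easy to scale | Leaves room for optimization |
| DenseNet | Dense connectivity | Feature reuse | High memory footprint |
| EfficientNet | Compound scaling | High efficiency | Design depends on NAS |
| ConvNeXt | Modernized ConvNet | Best overall balance | Relatively new design |

10. Design Trends

  1. Large kernels: ConvNeXt and RepLKNet show that 7×7 or larger kernels are more effective
  2. Depthwise separable convolution: the standard way to cut computation
  3. Inverted bottleneck: expand first, then project back down
  4. Fewer normalization layers, with LayerNorm taking the place of BatchNorm in ConvNeXt-style blocks
  5. GELU activation: replaces ReLU as the default choice in recent designs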

References

  1. He, K., et al. (2016). Deep Residual Learning for Image Recognition. CVPR.