Modern Convolutional Neural Network Architectures

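Since AlexNet set off the deep learning boom in 2012, convolutional neural networks (CNNs) have gone through several major redesigns. This article surveys the evolution of the classic architectures, from AlexNet and VGGNet through ConvNeXt and RepLKNet.¹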

1. Early foundations: AlexNet and VGGNet

1.1 AlexNet (2012)

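Winner of the 2012 ImageNet competition (ILSVRC); it opened the deep learning era.

Architecture highlights:

5 convolutional layers + 3 fully connected layers
ReLU activation
Dropout regularization
Parallel training on two GPUs
Local Response Normalization (LRN)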

# Simplified AlexNet (LRN and the two-GPU split are omitted)
import torch
import torch.nn as nn


class AlexNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            # Conv1: 96 kernels, 11×11, stride=4
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),  # 55×55 → 27×27

            # Conv2: 256 kernels, 5×5
            nn.Conv2d(96, 256, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),  # 27×27 → 13×13

            # Conv3-5: 384, 384, 256
            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),  # 13×13 → 6×6
        )

        # For 224×224 inputs the feature map is already 6×6; adaptive pooling
        # keeps the classifier input size fixed for other resolutions.
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)  # (B, 256, 6, 6) → (B, 9216)
        return self.classifier(x)
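The spatial sizes quoted in the comments above (55 → 27 → 13 → 6) can be checked with the usual output-size formula, floor((n + 2p - k) / s) + 1. The following is a small sketch in plain Python; the 224×224 input size and the helper names `conv_out` and `trace_sizes` are assumptions made for illustration, not part of the original code.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling layer along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding), copied from the layer settings above
ALEXNET_LAYERS = [
    ("conv1", 11, 4, 2),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool3", 3, 2, 0),
]


def trace_sizes(input_size, layers):
    """Return the spatial size before the first layer and after every layer."""
    sizes = [input_size]
    for _name, kernel, stride, padding in layers:
        sizes.append(conv_out(sizes[-1], kernel, stride, padding))
    return sizes


sizes = trace_sizes(224, ALEXNET_LAYERS)  # assumed 224×224 input
print(sizes)  # [224, 55, 27, 27, 13, 13, 13, 13, 6]
```

The final 6×6 map with 256 channels is what gives the first fully connected layer its 256 × 6 × 6 inputs.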

1.2 VGGNet (2014)

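Deeper networks with a simpler filter design

Key contributions:

Uses $3 \times 3$ kernels throughout
Builds large receptive fields by stacking small convolutions
Three stacked $3 \times 3$ convolutions cover the same $7 \times 7$ receptive field as a single $7 \times 7$ convolution
Fewer parameters ($27C^{2}$ versus $49C^{2}$ weights for $C$ input and output channels) and more nonlinearities in between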

Receptive field calculation:

$$RF_{n} = l_{1} + \sum_{i=1}^{n-1} (l_{i+1} - 1) \prod_{j=1}^{i} s_{j}$$

where $l_{i}$ is the kernel size of layer $i$, $s_{j}$ is the stride of layer $j$, and $n$ is the number of stacked layers.
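As a quick check of the receptive field calculation, the sketch below accumulates the receptive field layer by layer: each new layer adds (kernel size - 1) times the product of the strides of all earlier layers. The function name `receptive_field` is hypothetical and chosen only for this illustration.

```python
def receptive_field(kernels, strides):
    """Receptive field of the last layer in a stack of conv/pool layers.

    kernels[i] and strides[i] are the kernel size and stride of layer i + 1.
    """
    if len(kernels) != len(strides):
        raise ValueError("kernels and strides must have the same length")
    rf = kernels[0]
    jump = 1  # product of the strides of all earlier layers
    for i in range(1, len(kernels)):
        jump *= strides[i - 1]
        rf += (kernels[i] - 1) * jump
    return rf


# Three stacked 3×3 convolutions with stride 1 see a 7×7 window.
print(receptive_field([3, 3, 3], [1, 1, 1]))  # 7

# Two 3×3 convs, a 2×2 stride-2 pooling layer, then one more 3×3 conv.
print(receptive_field([3, 3, 2, 3], [1, 1, 2, 1]))  # 10
```

The second call shows why pooling matters: after a stride-2 layer, every additional 3×3 convolution widens the receptive field by 4 input pixels instead of 2.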

VGG-16 architecture:

Layer	Output size	Channels
Conv1_1-2	224	64
MaxPool	112	64
Conv2_1-2	112	128
MaxPool	56	128
Conv3_1-3	56	256
MaxPool	28	256
Conv4_1-3	28	512
MaxPool	14	512
Conv5_1-3	14	512
MaxPool	7	512
FC	1	4096
FC	1	4096
FC	1	1000
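To see where VGG-16's parameters sit, the table can be turned into a rough count. This is a sketch that assumes 3×3 convolutions with biases, a 224×224 RGB input, and the three fully connected layers from the table; `VGG16_STAGES`, `conv_params`, and `vgg16_param_count` are illustrative names.

```python
# Conv stages of VGG-16 as (number of 3×3 conv layers, output channels),
# following the table above.
VGG16_STAGES = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def conv_params(c_in, c_out, k):
    """Weights plus biases of a k×k convolution."""
    return c_in * c_out * k * k + c_out


def vgg16_param_count(num_classes=1000, in_channels=3):
    """Return (conv parameters, total parameters)."""
    conv_total, c_in = 0, in_channels
    for n_layers, c_out in VGG16_STAGES:
        for _ in range(n_layers):
            conv_total += conv_params(c_in, c_out, 3)
            c_in = c_out
    # Five 2×2 poolings reduce 224×224 to 7×7, with 512 channels.
    fc_dims = [7 * 7 * 512, 4096, 4096, num_classes]
    fc_total = sum(d_in * d_out + d_out for d_in, d_out in zip(fc_dims, fc_dims[1:]))
    return conv_total, conv_total + fc_total


conv_total, total = vgg16_param_count()
print(conv_total)  # 14714688
print(total)       # 138357544
```

Roughly 124M of the 138M parameters belong to the fully connected layers, which is the main reason VGG is considered parameter-heavy.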

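2. GoogLeNet and the Inception module

2.1 Motivation: sparse connectivity

Problems with conventional dense connections:

Parameter explosion: a single dense layer between two 1024-unit layers already has $1024 \times 1024 \approx 1\text{M}$ weights
Prone to overfitting

Hebbian principle: “neurons that fire together, wire together”

Solution: exploit sparsity, but approximate the sparse structure with dense matrix operations that hardware handles efficiently.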

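2.2 Inception module

Core idea: parallel convolutions at multiple scales plus a parallel pooling branch, concatenated along the channel dimension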

Input
    ├── 1×1 Conv ─────────────────────┐
    ├── 1×1 Conv → 3×3 Conv ─────────┤
    ├── 1×1 Conv → 5×5 Conv ─────────┤→ Concat → Output
    ├── 3×3 MaxPool → 1×1 Conv ──────┘

Inception v1 module:

import torch
import torch.nn as nn


class InceptionModule(nn.Module):
    def __init__(self, in_channels, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj):
        super().__init__()

        # 1×1 convolution branch
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_channels, ch1x1, kernel_size=1),
            nn.ReLU(inplace=True)
        )

        # 1×1 → 3×3 branch (the 1×1 reduces channels first)
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, ch3x3red, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch3x3red, ch3x3, kernel_size=3, padding=1),
            nn.ReLU(inplace=True)
        )

        # 1×1 → 5×5 branch (the 1×1 reduces channels first)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, ch5x5red, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch5x5red, ch5x5, kernel_size=5, padding=2),
            nn.ReLU(inplace=True)
        )

        # MaxPool → 1×1 branch (stride 1 and padding 1 keep the spatial size)
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, pool_proj, kernel_size=1),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        b1 = self.branch1(x)
        b2 = self.branch2(x)
        b3 = self.branch3(x)
        b4 = self.branch4(x)
        # All branches share the same H×W, so they can be concatenated on channels
        return torch.cat([b1, b2, b3, b4], dim=1)

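2.3 Dimensionality reduction

The $1 \times 1$ convolutions in the Inception module serve two purposes:

Dimensionality reduction: fewer channels, lower computational cost
Nonlinearity: each one adds an extra ReLU and increases the effective depth

Input: 192 channels
    ├── 1×1, 64              (12.3K weights)  → 64 channels
    ├── 1×1, 96 → 3×3, 128   (129.0K weights) → 128 channels
    └── 1×1, 16 → 5×5, 32    (15.9K weights)  → 32 channels

Total: 224 channels with about 157K weights, versus about 387K if the 3×3 and 5×5 convolutions ran directly on the 192-channel input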
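The effect of the $1 \times 1$ reduction layers can be recomputed directly from the branch configuration listed above (192 input channels; 64, 96 → 128, and 16 → 32 output channels). The sketch below counts convolution weights only, ignoring biases and the pooling branch; `conv_weights` is a helper introduced here for illustration.

```python
def conv_weights(c_in, c_out, k):
    """Number of weights in a k×k convolution (biases ignored)."""
    return c_in * c_out * k * k


C_IN = 192

# Branches as listed above, with 1×1 reduction before the 3×3 and 5×5 convs.
with_reduction = {
    "1x1": conv_weights(C_IN, 64, 1),
    "1x1->3x3": conv_weights(C_IN, 96, 1) + conv_weights(96, 128, 3),
    "1x1->5x5": conv_weights(C_IN, 16, 1) + conv_weights(16, 32, 5),
}

# Same output channels, but 3×3 and 5×5 applied directly to all 192 channels.
without_reduction = {
    "1x1": conv_weights(C_IN, 64, 1),
    "3x3": conv_weights(C_IN, 128, 3),
    "5x5": conv_weights(C_IN, 32, 5),
}

print(with_reduction)                   # {'1x1': 12288, '1x1->3x3': 129024, '1x1->5x5': 15872}
print(sum(with_reduction.values()))     # 157184
print(sum(without_reduction.values()))  # 387072
```

Under these assumptions the reduced module needs about 2.5× fewer weights for the same 224 output channels, and the saving grows with the number of input channels.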

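3. ResNet: the residual learning revolution

(See resnet-deep-residual-learning for details.)

4. ResNeXt: aggregated multi-branch transformations

4.1 Core idea

ResNeXt combines ideas from ResNet and Inception:

Grouped convolution (cardinality)
Keeps ResNet's residual structure
Repeats the same block design throughout, as VGG does

4.2 Grouped convolution

Standard convolution: every input channel contributes to every output channel

# Input: (C_in, H, W)
# Kernel: (C_out, C_in, k, k)
# Output: (C_out, H', W')

Grouped convolution: split the input channels into $g$ groups, convolve each group separately, then concatenate the results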

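import torch
import torch.nn as nn


def grouped_convolution(x, in_channels, out_channels, groups, kernel_size):
    """Illustrative grouped convolution; structurally equivalent to
    nn.Conv2d(in_channels, out_channels, kernel_size, groups=groups)."""
    # x: (B, C_in, H, W); both channel counts must be divisible by groups
    assert in_channels % groups == 0 and out_channels % groups == 0
    in_per_group = in_channels // groups
    out_per_group = out_channels // groups

    outputs = []
    for i in range(groups):
        # Each group is convolved independently. A fresh, randomly initialized
        # layer is created here for illustration only; a real module would
        # register its weights once in __init__.
        group_input = x[:, i*in_per_group:(i+1)*in_per_group, :, :]
        group_conv = nn.Conv2d(in_per_group, out_per_group, kernel_size)
        outputs.append(group_conv(group_input))

    # Concatenate along the channel dimension
    return torch.cat(outputs, dim=1)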
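Because each group only sees C_in / g input channels and produces C_out / g output channels, the weight count of a grouped convolution is the standard count divided by g. A minimal sketch of that arithmetic follows (biases ignored; `grouped_conv_weights` is a name made up for this example):

```python
def grouped_conv_weights(c_in, c_out, k, groups=1):
    """Weights of a k×k convolution split into `groups` independent groups."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    per_group = (c_in // groups) * (c_out // groups) * k * k
    return groups * per_group


dense = grouped_conv_weights(128, 128, 3, groups=1)
grouped = grouped_conv_weights(128, 128, 3, groups=32)
depthwise = grouped_conv_weights(128, 128, 3, groups=128)

print(dense)      # 147456
print(grouped)    # 4608  (32× fewer)
print(depthwise)  # 1152  (one 3×3 filter per channel)
```

Setting `groups` equal to the channel count gives the depthwise convolution used later by MBConv and ConvNeXt.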

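4.3 ResNeXt Bottleneck

Input (256-d)
    │
    ├── 1×1 conv, 128 → grouped 3×3 conv, 128 (groups=32) → 1×1 conv, 256 ──┐
    │                                                                        ├→ Add → ReLU
    └── Shortcut (projection only when dimensions must be matched) ─────────┘

Weight-count comparison:

Plain $3 \times 3$ convolution on 256 channels: $256 \times 3 \times 3 \times 256 = 589,824$
ResNeXt (g=32), $1 \times 1$ reduction plus grouped $3 \times 3$: $256 \times 1 \times 1 \times 128 + 128 \times 3 \times 3 \times 4 = 32,768 + 4,608 = 37,376$

That is roughly 16× fewer weights than a plain 256-channel $3 \times 3$ convolution. A real ResNet bottleneck also reduces channels before its $3 \times 3$ convolution, so the difference between complete ResNet and ResNeXt blocks is much smaller.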
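For a block-to-block view, the sketch below counts the weights of a complete bottleneck (1×1 reduce, 3×3, 1×1 expand). It assumes the standard ResNet-50 bottleneck width of 64 for the 256-channel stage, which this article does not spell out, and the 32×4d ResNeXt setting (width 128, 32 groups) from the diagram above. Helper names are illustrative and biases are ignored.

```python
def conv_weights(c_in, c_out, k, groups=1):
    """Weights of a k×k convolution with the given number of groups."""
    return (c_in // groups) * c_out * k * k


def bottleneck_weights(c_io, width, groups=1):
    """1×1 reduce → 3×3 (optionally grouped) → 1×1 expand."""
    return (
        conv_weights(c_io, width, 1)
        + conv_weights(width, width, 3, groups)
        + conv_weights(width, c_io, 1)
    )


resnet_block = bottleneck_weights(256, 64)                # assumed ResNet-50 width
resnext_block = bottleneck_weights(256, 128, groups=32)   # 32×4d

print(resnet_block)   # 69632
print(resnext_block)  # 70144
```

Under these assumptions the two blocks cost almost the same, which suggests the benefit of grouping is that it lets the block be twice as wide at a similar budget, rather than making the block itself cheaper.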

4.4 Architecture variants

Configuration	Cardinality	Channels	Parameters	ImageNet Top-1
ResNet-50	1	64→256	25.6M	76.0%
ResNeXt-50_32×4d	32	4→256	25.0M	77.6%
ResNeXt-101_32×8d	32	8→2048	88.8M	79.3%

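5. DenseNet: dense connectivity

5.1 Core idea

Where ResNet adds feature maps (Add), DenseNet concatenates them (Concat):

$$x_{l} = H_{l}([x_{0}, x_{1}, \dots, x_{l-1}])$$

Here $[\cdot]$ denotes channel-wise concatenation of all earlier feature maps and $H_{l}$ is the transform applied by layer $l$.

5.2 Dense block structure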

Dense Block:
    
Layer 0: Input (H×W×C_0)
    ↓
Layer 1: H_1([x_0]) → H×W×C_1
    ↓ concat: H×W×(C_0 + C_1)
Layer 2: H_2([x_0, x_1]) → H×W×C_1
    ↓ concat: H×W×(C_0 + 2C_1)
Layer 3: H_3([x_0, x_1, x_2]) → H×W×C_1
    ↓ concat: H×W×(C_0 + 3C_1)
    ...
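The channel bookkeeping in the diagram follows a simple rule: layer i receives C_0 + i·k channels, where k is the growth rate (written C_1 above). A small sketch, with illustrative values for `c0`, `growth_rate`, and `num_layers` that are not taken from the article:

```python
def dense_block_channels(c0, growth_rate, num_layers):
    """Input channels seen by each layer, and the block's output channels."""
    inputs = [c0 + i * growth_rate for i in range(num_layers)]
    output = c0 + num_layers * growth_rate
    return inputs, output


inputs, output = dense_block_channels(c0=64, growth_rate=32, num_layers=6)
print(inputs)  # [64, 96, 128, 160, 192, 224]
print(output)  # 256
```

Because the input width grows linearly with depth, a transition layer is typically placed between dense blocks to reduce the channel count again.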

5.3 DenseNet implementation

import torch
import torch.nn as nn


class DenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # The input channel count grows by growth_rate with every layer
            layer_in_channels = in_channels + i * growth_rate
            self.layers.append(self._make_layer(layer_in_channels, growth_rate))

    def _make_layer(self, in_channels, growth_rate):
        # BN → ReLU → 3×3 conv, producing growth_rate new feature maps
        return nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Each layer sees the concatenation of all earlier feature maps
            new_feature = layer(torch.cat(features, dim=1))
            features.append(new_feature)
        return torch.cat(features, dim=1)


class TransitionLayer(nn.Module):
    """Transition layer: reduces the channel count and halves the spatial size."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.AvgPool2d(kernel_size=2, stride=2)
        )

    def forward(self, x):
        return self.conv(x)

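5.4 Feature comparison

Property	ResNet	DenseNet
Connection	Add	Concat
Parameters	More	Fewer (features are reused)
Memory footprint	Lower	Higher (all earlier feature maps must be kept)
Gradient flow	Direct via identity shortcuts	Multiple paths to every earlier layer
Feature reuse	Implicit	Explicit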

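6. EfficientNet: compound scaling

6.1 Core idea

Conventional scaling approaches:

Depth scaling: add more layers
Width scaling: add more channels per layer
Resolution scaling: increase the input resolution

Problem: scaling only one dimension yields diminishing accuracy gains as the model grows.

6.2 Compound scaling formula

EfficientNet scales all three dimensions jointly with a single coefficient $\phi$:

$$\text{depth: } d = \alpha^{\phi}$$

$$\text{width: } w = \beta^{\phi}$$

$$\text{resolution: } r = \gamma^{\phi}$$

Constraint:

$$\alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2$$

with $\alpha \geq 1, \beta \geq 1, \gamma \geq 1$. FLOPs grow roughly in proportion to $d \cdot w^{2} \cdot r^{2}$, so under this constraint the total cost scales by about $2^{\phi}$.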
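The scaling rule can be tried out numerically. The sketch below uses $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$, the coefficients reported in the EfficientNet paper for the B0 baseline; the function name `compound_scale` is hypothetical, and the published B1 to B7 models round these multipliers to practical layer counts, widths, and resolutions.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases


def compound_scale(phi):
    """Depth, width and resolution multipliers, plus the implied FLOPs factor."""
    d = ALPHA ** phi
    w = BETA ** phi
    r = GAMMA ** phi
    flops_factor = d * w ** 2 * r ** 2  # FLOPs ∝ d · w² · r²
    return d, w, r, flops_factor


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth×{d:.2f} width×{w:.2f} resolution×{r:.2f} FLOPs×{f:.2f}")
```

With these coefficients $\alpha \cdot \beta^{2} \cdot \gamma^{2}$ comes out to about 1.92, so each unit increase of $\phi$ roughly doubles the compute budget.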

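6.3 Mobile Inverted Bottleneck (MBConv)

EfficientNet is built from the MBConv block of MobileNetV2, extended with Squeeze-and-Excitation (SE):

Standard ResNet Bottleneck:     MBConv:
    Input                           Input
       ↓                               ↓
    1×1 Conv (↓)                    1×1 Conv (↑) ← expand
       ↓                               ↓
    3×3 Conv                       Depthwise 3×3
       ↓                               ↓
    1×1 Conv (↑)                    SE ← Squeeze-and-Excitation
       ↓                               ↓
    Add + ReLU                     1×1 Conv (↓), linear
       ↓                               ↓
    Output                         Add
                                       ↓
                                    Output

Key properties:

Depthwise separable convolution: greatly reduces computation
Expand then project: channels are widened before the depthwise convolution and narrowed afterwards, with no activation after the final projection
Skip connection: used only when the stride is 1 and the input and output channel counts match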
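How much a depthwise separable convolution saves can be estimated by counting multiply-accumulates (MACs). The sketch below compares a standard k×k convolution with a depthwise k×k followed by a 1×1 pointwise convolution; the layer sizes are illustrative and the function names are made up for this example.

```python
def standard_conv_macs(c_in, c_out, k, h, w):
    """MACs of a dense k×k convolution on an h×w output map."""
    return k * k * c_in * c_out * h * w


def depthwise_separable_macs(c_in, c_out, k, h, w):
    """MACs of a depthwise k×k conv followed by a 1×1 pointwise conv."""
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise


std = standard_conv_macs(128, 128, 3, 56, 56)
sep = depthwise_separable_macs(128, 128, 3, 56, 56)

print(std)                  # 462422016
print(sep)                  # 54992896
print(round(std / sep, 2))  # 8.41
```

The ratio approaches $k^{2}$ (9× for a 3×3 kernel) as the number of output channels grows, since the cost ratio is $1/C_{out} + 1/k^{2}$.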

6.4 EfficientNet-B0 baseline

import torch.nn as nn


class EfficientNetB0(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()

        # Stem: 3×3 stride-2 conv, 224×224 → 112×112
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32),
            nn.SiLU(inplace=True)
        )

        # MBConv blocks (simplified). MBConv is assumed to be defined elsewhere
        # and is not shown in this article. The remaining blocks are elided, so
        # the head's 320 input channels only line up once the full stack is filled in.
        self.blocks = nn.Sequential(
            # Block 1: 112×112, 16 channels, expand=1
            MBConv(32, 16, kernel_size=3, stride=1, expand_ratio=1),
            # Block 2-9: ...
        )

        # Head: 1×1 conv to 1280 channels, global pooling, classifier
        self.head = nn.Sequential(
            nn.Conv2d(320, 1280, kernel_size=1, bias=False),
            nn.BatchNorm2d(1280),
            nn.SiLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Dropout(0.2),
            nn.Linear(1280, num_classes)
        )

    def forward(self, x):
        x = self.stem(x)
        x = self.blocks(x)
        x = self.head(x)
        return x

6.5 Performance comparison

Model	Top-1	Parameters	FLOPs
ResNet-50	76.0%	26M	4.1G
DenseNet-169	76.2%	14M	3.4G
EfficientNet-B0	77.1%	5.3M	0.39G
EfficientNet-B4	82.9%	19M	4.2G
EfficientNet-B7	84.3%	66M	37G

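7. ConvNeXt: a modern ConvNet

7.1 Background

Vision Transformers (ViT) overtook CNNs on image classification benchmarks in 2020-2021. ConvNeXt showed that a modernized ConvNet can match them.

7.2 Design principles

ConvNeXt systematically "modernizes" a ResNet until it matches or surpasses comparable Transformers:

Design choice	ResNet (baseline)	ConvNeXt
Activation	ReLU	GELU
Normalization	BatchNorm (BN)	LayerNorm (LN)
Number of normalization layers	One BN after every conv	Fewer (one LN per block)
Kernel size	3×3	7×7 (large depthwise kernel)
Block structure	Bottleneck	Inverted Bottleneck
Stem	7×7 stride-2 conv + max pooling	Patchify (4×4 conv, stride 4)
Stage compute ratio	(3,4,6,3)	(3,3,9,3), i.e. 1:1:3:1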

7.3 Key improvements

1. Macro design

Item	ResNet-50	ConvNeXt-T
Blocks per stage	[3,4,6,3]	[3,3,9,3]
Channels	[64,128,256,512] (bottleneck widths)	[96,192,384,768]
Stem	7×7 stride-2 conv + 3×3 stride-2 max pool	4×4 conv, stride 4 (patchify)

2. Inverted Bottleneck

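# ResNet Bottleneck:     ConvNeXt Block:
#     1×1 ↓                   7×7 depthwise
#     3×3                     1×1 ↑
#     1×1 ↑                   1×1 ↓

import torch
import torch.nn as nn


class DropPath(nn.Module):
    """Minimal stochastic depth: drops the residual branch per sample during training."""
    def __init__(self, drop_prob=0.0):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.drop_prob == 0.0 or not self.training:
            return x
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, broadcast over all other dimensions
        mask = x.new_empty((x.shape[0],) + (1,) * (x.dim() - 1)).bernoulli_(keep_prob)
        return x * mask / keep_prob


class ConvNeXtBlock(nn.Module):
    def __init__(self, dim, kernel_size=7, drop_path_rate=0.0):
        super().__init__()
        # Depthwise convolution (groups=dim: one filter per channel)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=kernel_size,
                                 padding=kernel_size//2, groups=dim)
        self.norm = nn.LayerNorm(dim, eps=1e-6)
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # expand (1×1 conv as Linear)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)   # project back

        # Stochastic depth on the residual branch
        self.drop_path = DropPath(drop_path_rate)

    def forward(self, x):
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # (B, C, H, W) → (B, H, W, C)
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        x = x.permute(0, 3, 1, 2)  # back to (B, C, H, W)

        return shortcut + self.drop_path(x)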

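3. Large kernels

Key observations:

ViT's global self-attention can be seen as the limiting case of an ever larger kernel
Moving from $3 \times 3$ to $7 \times 7$ depthwise convolution adds little to the block's total computation, because the $1 \times 1$ layers dominate the cost

# Cost per block (multiply-accumulates), C channels, H×W feature map
# Standard convolution (C → C):
#   3×3: 9 × C² × H × W
#   7×7: 49 × C² × H × W   (≈ 5.4× more)

# Depthwise convolution (one filter per channel):
#   3×3: 9 × C × H × W
#   7×7: 49 × C × H × W    (still ≈ 5.4× more, but a factor of C cheaper)
# The two 1×1 layers with 4× expansion cost 8 × C² × H × W,
# so for typical C the depthwise term is a small share of the block.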
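To put numbers on this comparison, the sketch below counts multiply-accumulates for one block made of a depthwise k×k convolution and two 1×1 layers with 4× expansion. The values C = 96 and a 56×56 feature map are illustrative choices for this example; `convnext_block_macs` is a hypothetical helper.

```python
def convnext_block_macs(dim, kernel_size, h, w, expansion=4):
    """Return (depthwise MACs, pointwise MACs) for one block."""
    depthwise = kernel_size ** 2 * dim * h * w
    pointwise = 2 * expansion * dim * dim * h * w  # expand + project
    return depthwise, pointwise


dw3, pw = convnext_block_macs(96, 3, 56, 56)
dw7, _ = convnext_block_macs(96, 7, 56, 56)

kernel_ratio = dw7 / dw3            # depthwise cost alone
block_ratio = (dw7 + pw) / (dw3 + pw)  # whole block

print(round(kernel_ratio, 2))  # 5.44
print(round(block_ratio, 3))   # 1.051
```

Under these assumptions the depthwise term grows 5.4×, yet the whole block becomes only about 5% more expensive, because the pointwise layers account for most of the cost.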

7.4 Performance comparison

Model	ImageNet Top-1	Parameters	FLOPs
Swin-T	81.3%	28M	4.5G
ConvNeXt-T	82.1%	28M	4.5G
Swin-S	83.0%	50M	8.7G
ConvNeXt-S	83.1%	50M	8.7G
Swin-B	83.5%	88M	15.4G
ConvNeXt-B	83.8%	89M	15.4G

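8. RepLKNet: very large kernels

8.1 Core idea

RepLKNet: “Scaling Up Your Kernels to 31×31”

Key findings:

Part of ViT's success is attributed to its large (global) receptive field
CNNs can obtain a similar effect by enlarging their convolution kernels
Kernels as large as 31×31 remain practical on ImageNet when implemented as depthwise convolutions, giving a good accuracy-efficiency trade-off

8.2 Architecture design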

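import torch.nn as nn


class RepLKNetBlock(nn.Module):
    """Simplified large-kernel block (the re-parameterized small-kernel branch is omitted)."""
    def __init__(self, in_channels, out_channels, kernel_size=31, stride=1):
        super().__init__()

        # Large-kernel depthwise convolution. The stride is applied here so that
        # the main path and the shortcut produce the same spatial size.
        self.dwconv = nn.Conv2d(
            in_channels, in_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=kernel_size // 2,
            groups=in_channels
        )

        # Pointwise convolution
        self.pwconv = nn.Conv2d(in_channels, out_channels, kernel_size=1)

        # Shortcut: project only when the shape changes
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.AvgPool2d(stride, ceil_mode=True) if stride > 1 else nn.Identity(),
                nn.Conv2d(in_channels, out_channels, 1)
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        return self.shortcut(x) + self.pwconv(self.dwconv(x))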

8.3 Performance

Model	ImageNet Top-1	Inference speed
Swin-B	83.9%	1.0×
RepLKNet-31B	84.8%	1.2× (faster)

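9. Architecture comparison summary

Architecture	Key innovation	Strengths	Weaknesses
VGGNet	Deep network + small kernels	Simple, regular	Many parameters
GoogLeNet	Inception module	Multi-scale features	Complex structure
ResNet	Residual connections	Stable training, easy to scale	Still room for optimization
DenseNet	Dense connectivity	Feature reuse	High memory use
EfficientNet	Compound scaling	High efficiency	Design relies on NAS
ConvNeXt	Modernized ConvNet	Strong overall trade-off	Relatively new design

10. Design trends

Large kernels: ConvNeXt and RepLKNet show that $7 \times 7$ or larger kernels can be more effective
Depthwise separable convolution: the standard way to cut computation
Inverted Bottleneck: expand first, then project back down
Fewer normalization layers, with LayerNorm replacing BatchNorm in recent designs
GELU activation: increasingly used in place of ReLU in recent designs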

References

1. He, K., et al. (2016). Deep Residual Learning for Image Recognition. CVPR.
