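Modern Convolutional Neural Network Architectures
Since AlexNet set off the deep learning boom in 2012, convolutional neural networks (CNNs) have gone through several major redesigns. This article traces the evolution of the classic architectures, from AlexNet and VGGNet through ConvNeXt. [1]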
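1. Early foundations: AlexNet and VGGNet
1.1 AlexNet (2012)
Winner of the 2012 ImageNet competition; it opened the deep learning era.
Architecture highlights:
- 5 convolutional layers + 3 fully connected layers
- ReLU activation
- Dropout regularization
- Parallel training on two GPUs
- Local Response Normalization (LRN)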
# Simplified AlexNet implementation (LRN layers omitted)
import torch
import torch.nn as nn

class AlexNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            # Conv1: 96 kernels, 11×11, stride=4 (224×224 → 55×55)
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),  # 55×55 → 27×27
            # Conv2: 256 kernels, 5×5
            nn.Conv2d(96, 256, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),  # 27×27 → 13×13
            # Conv3-5: 384, 384, 256 kernels, 3×3
            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),  # 13×13 → 6×6
        )
        # Keeps the classifier input at 6×6 for other input resolutions
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)  # (B, 256, 6, 6) → (B, 9216)
        return self.classifier(x)
1.2 VGGNet (2014)
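Deeper networks with a simpler filter design
Key contributions:
- Uses 3×3 kernels throughout
- Builds large receptive fields by stacking small convolutions
- Two stacked 3×3 convolutions cover the same receptive field as one 5×5 convolution (three cover 7×7)
- Fewer parameters and more nonlinearities, hence more expressive power
Receptive field calculation:
$$RF_l = RF_{l-1} + (k_l - 1) \prod_{i=1}^{l-1} s_i, \qquad RF_0 = 1$$
where $k_l$ is the kernel size of layer $l$ and $s_i$ is the stride of layer $i$.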
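The recursion above can be checked with a few lines of Python. This is a minimal sketch: the function name `receptive_field` and the `(kernel_size, stride)` layer encoding are illustrative choices, not part of any library.

```python
def receptive_field(layers):
    """Receptive field of a stack of layers.

    `layers` is a list of (kernel_size, stride) pairs, ordered input to output.
    """
    rf = 1      # RF_0: a single input pixel
    jump = 1    # product of the strides of all earlier layers
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

print(receptive_field([(3, 1), (3, 1)]))          # 5
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
# Two 3×3 convs, a 2×2 stride-2 pool, then two more 3×3 convs
print(receptive_field([(3, 1), (3, 1), (2, 2), (3, 1), (3, 1)]))  # 14
```

The last call shows why pooling matters: after a stride-2 layer, every later 3×3 convolution widens the receptive field by 4 pixels instead of 2.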
VGG-16 architecture:
| Layer | Output size (H = W) | Channels |
|---|---|---|
| Conv1_1-2 | 224 | 64 |
| MaxPool | 112 | 64 |
| Conv2_1-2 | 112 | 128 |
| MaxPool | 56 | 128 |
| Conv3_1-3 | 56 | 256 |
| MaxPool | 28 | 256 |
| Conv4_1-3 | 28 | 512 |
| MaxPool | 14 | 512 |
| Conv5_1-3 | 14 | 512 |
| MaxPool | 7 | 512 |
| FC | 1 | 4096 |
| FC | 1 | 4096 |
| FC | 1 | 1000 |
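To see where VGG-16's parameters sit, the table can be turned into a short count. The sketch below assumes 3×3 convolutions with biases, a 224×224 RGB input and 1000 classes; `CFG` and `vgg16_param_count` are illustrative names.

```python
# VGG-16 configuration: output channels per 3×3 conv layer, "M" = 2×2 max-pool
CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
       512, 512, 512, "M", 512, 512, 512, "M"]

def vgg16_param_count(num_classes=1000, in_channels=3, image_size=224):
    conv_params = 0
    channels, size = in_channels, image_size
    for item in CFG:
        if item == "M":
            size //= 2                                     # pooling halves H and W
        else:
            conv_params += channels * item * 3 * 3 + item  # weights + biases
            channels = item
    fc_params = 0
    features = channels * size * size                      # 512 * 7 * 7 = 25088
    for out_features in (4096, 4096, num_classes):
        fc_params += features * out_features + out_features
        features = out_features
    return conv_params, fc_params

conv_params, fc_params = vgg16_param_count()
print(conv_params, fc_params, conv_params + fc_params)
# 14714688 123642856 138357544
```

Roughly 89% of the 138M parameters are in the three fully connected layers, not in the convolutions.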
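2. GoogLeNet and the Inception module
2.1 Motivation: sparse connectivity
Problems with conventional dense connectivity:
- Parameter count explodes: a single k×k convolution has $C_{in} \times C_{out} \times k^2$ weights
- Prone to overfitting
Hebbian principle: "neurons that fire together, wire together"
Solution: exploit sparsity, but approximate the sparse connections with dense matrix operations.
2.2 Inception module
Core idea: multi-scale parallel convolutions plus a parallel pooling branch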
Input
├── 1×1 Conv ─────────────────────┐
├── 1×1 Conv → 3×3 Conv ─────────┤
├── 1×1 Conv → 5×5 Conv ─────────┤→ Concat → Output
├── 3×3 MaxPool → 1×1 Conv ──────┘
Inception v1 module:
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_channels, ch1x1, ch3x3red, ch3x3, ch5x5red, ch5x5, pool_proj):
        super().__init__()
        # 1×1 convolution branch
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_channels, ch1x1, kernel_size=1),
            nn.ReLU(inplace=True)
        )
        # 1×1 → 3×3 branch
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, ch3x3red, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch3x3red, ch3x3, kernel_size=3, padding=1),
            nn.ReLU(inplace=True)
        )
        # 1×1 → 5×5 branch
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, ch5x5red, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch5x5red, ch5x5, kernel_size=5, padding=2),
            nn.ReLU(inplace=True)
        )
        # MaxPool → 1×1 branch
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, pool_proj, kernel_size=1),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        b1 = self.branch1(x)
        b2 = self.branch2(x)
        b3 = self.branch3(x)
        b4 = self.branch4(x)
        # All branches keep H×W, so they can be concatenated along channels
        return torch.cat([b1, b2, b3, b4], dim=1)
2.3 Dimensionality reduction
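The 1×1 convolutions in the Inception module serve two purposes:
- Dimensionality reduction: fewer channels, hence less computation
- Nonlinearity: each one adds an activation and extra depth
Input: 192 channels
├── 1×1, 64             (12.3K weights)  → 64 channels
├── 1×1, 96 → 3×3, 128  (129.0K weights) → 128 channels
└── 1×1, 16 → 5×5, 32   (15.9K weights)  → 32 channels
Total: 224 channels, ≈157K weights vs. a direct 3×3 conv to 224 channels: 192×224×9 ≈ 387K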
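The weight counts for this example can be reproduced directly. A minimal sketch, counting weights only (biases ignored); `conv_weights` is an illustrative helper, not a library function.

```python
def conv_weights(c_in, c_out, k):
    """Weight count of a k×k convolution (biases ignored)."""
    return c_in * c_out * k * k

c_in = 192
branch_1x1 = conv_weights(c_in, 64, 1)
branch_3x3 = conv_weights(c_in, 96, 1) + conv_weights(96, 128, 3)
branch_5x5 = conv_weights(c_in, 16, 1) + conv_weights(16, 32, 5)
total = branch_1x1 + branch_3x3 + branch_5x5
# A single 3×3 convolution producing the same 224 output channels
direct = conv_weights(c_in, 64 + 128 + 32, 3)

print(branch_1x1, branch_3x3, branch_5x5)  # 12288 129024 15872
print(total, direct)                        # 157184 387072
```

With the 1×1 reductions, the three branches together use about 40% of the weights of one direct 3×3 layer, while also covering 1×1 and 5×5 scales.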
3. ResNet: the residual learning revolution
(See resnet-deep-residual-learning for details.)
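4. ResNeXt: aggregated multi-branch transformations
4.1 Core idea
ResNeXt combines ideas from ResNet and Inception:
- Grouped convolution (the number of groups is called cardinality)
- Keeps ResNet's residual structure
- Repeats one block design throughout, as in VGG
4.2 Grouped convolution
Standard convolution: every input channel contributes to every output channel
# Input:  (C_in, H, W)
# Kernel: (C_out, C_in, k, k)
# Output: (C_out, H', W')
Grouped convolution: split the input channels into g groups, convolve each group separately, then concatenate the results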
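import torch
import torch.nn as nn

def grouped_convolution(x, in_channels, out_channels, groups, kernel_size):
    # x: (B, C_in, H, W); both channel counts must be divisible by groups
    in_per_group = in_channels // groups
    out_per_group = out_channels // groups
    outputs = []
    for i in range(groups):
        # Each group is convolved independently
        group_input = x[:, i * in_per_group:(i + 1) * in_per_group, :, :]
        # Illustration only: a new, randomly initialized layer is created on every call.
        # In practice use nn.Conv2d(in_channels, out_channels, kernel_size, groups=groups).
        group_conv = nn.Conv2d(in_per_group, out_per_group,
                               kernel_size, groups=1)
        outputs.append(group_conv(group_input))
    # Concatenate along the channel dimension
    return torch.cat(outputs, dim=1)
4.3 ResNeXt Bottleneck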
Input (256-d)
 │
 ├── 1×1, 128 → grouped 3×3, 128, groups=32 → 1×1, 256 ──┐
 │                                                       ├→ Add → ReLU
 └── Shortcut (projection if dimensions must be matched) ┘
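Cost comparison (weights of the 3×3 layer in one block, biases ignored):
- Plain ResNet (width 64): 64 × 64 × 9 ≈ 36.9K
- ResNeXt (width 128, g=32): 128 × (128/32) × 9 ≈ 4.6K
The grouped 3×3 layer needs about 8× fewer weights despite being twice as wide. Including the two 1×1 layers, both blocks cost roughly 70K weights, so ResNeXt gains accuracy at about the same complexity.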
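A short calculation makes the effect of grouping concrete. This sketch counts weights only (no biases, no normalization parameters) and assumes the 256-d bottleneck drawn above, with ResNet's usual width of 64 as the baseline; the helper names are illustrative.

```python
def conv_weights(c_in, c_out, k, groups=1):
    """Weight count of a k×k convolution with `groups` groups (biases ignored)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * k * k

def bottleneck_weights(c_io, width, groups=1):
    """1×1 reduce → 3×3 (optionally grouped) → 1×1 expand."""
    return (conv_weights(c_io, width, 1)
            + conv_weights(width, width, 3, groups)
            + conv_weights(width, c_io, 1))

resnet_3x3 = conv_weights(64, 64, 3)                # 36864
resnext_3x3 = conv_weights(128, 128, 3, groups=32)  # 4608
print(resnet_3x3 / resnext_3x3)                     # 8.0
print(bottleneck_weights(256, 64))                  # 69632
print(bottleneck_weights(256, 128, groups=32))      # 70144
```

Grouping makes the 3×3 layer 8× smaller even at twice the width, and the whole ResNeXt block ends up at nearly the same weight count as the ResNet block.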
4.4 Architecture variants
| Configuration | Cardinality | Group width → block output channels (first stage) | Parameters | ImageNet Top-1 |
|---|---|---|---|---|
| ResNet-50 | 1 | 64→256 | 25.6M | 76.0% |
| ResNeXt-50_32×4d | 32 | 4→256 | 25.0M | 78.8% |
| ResNeXt-101_32×8d | 32 | 8→256 | 88.8M | 80.9% |
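5. DenseNet: dense connectivity
5.1 Core idea
Unlike ResNet, which adds (Add) a layer's input to its output, DenseNet concatenates (Concat) the outputs of all preceding layers:
$$\text{ResNet: } x_l = H_l(x_{l-1}) + x_{l-1} \qquad \text{DenseNet: } x_l = H_l([x_0, x_1, \dots, x_{l-1}])$$
5.2 Dense block structure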
Dense Block:
Layer 0: Input (H×W×C_0)
↓
Layer 1: H_1([x_0]) → H×W×C_1
↓ concat: H×W×(C_0 + C_1)
Layer 2: H_2([x_0, x_1]) → H×W×C_1
↓ concat: H×W×(C_0 + 2C_1)
Layer 3: H_3([x_0, x_1, x_2]) → H×W×C_1
↓ concat: H×W×(C_0 + 3C_1)
...
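The channel bookkeeping in the diagram (C_0 plus one growth step C_1 per layer) can be traced across a whole network. The sketch below assumes the DenseNet-121 setting: blocks of 6, 12, 24 and 16 layers, growth rate 32, 64 initial channels, and transition layers that halve the channel count. The function name is illustrative.

```python
def densenet_channels(block_config=(6, 12, 24, 16), growth_rate=32,
                      init_channels=64, compression=0.5):
    """Channel count after each dense block and transition layer."""
    channels = init_channels
    trace = []
    for i, num_layers in enumerate(block_config):
        channels += num_layers * growth_rate   # C_0 + L * C_1 after a dense block
        trace.append(("dense", channels))
        if i != len(block_config) - 1:         # no transition after the last block
            channels = int(channels * compression)
            trace.append(("transition", channels))
    return trace

for stage, channels in densenet_channels():
    print(stage, channels)
# dense 256, transition 128, dense 512, transition 256,
# dense 1024, transition 512, dense 1024 (one pair per line)
```

Each layer adds only 32 channels, but the concatenated input keeps growing, which is why the transition layers are needed to keep width in check.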
5.3 DenseNet implementation
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # The input channel count grows by growth_rate with every layer
            layer_in_channels = in_channels + i * growth_rate
            self.layers.append(self._make_layer(layer_in_channels, growth_rate))

    def _make_layer(self, in_channels, growth_rate):
        # BN → ReLU → 3×3 conv (pre-activation order)
        return nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Every layer sees the concatenation of all earlier feature maps
            new_feature = layer(torch.cat(features, dim=1))
            features.append(new_feature)
        return torch.cat(features, dim=1)

class TransitionLayer(nn.Module):
    """Transition layer: reduces the channel count and the spatial size."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.AvgPool2d(kernel_size=2, stride=2)
        )

    def forward(self, x):
        return self.conv(x)
5.4 Feature comparison
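| Property | ResNet | DenseNet |
|---|---|---|
| Connection | Add | Concat |
| Parameters | More | Fewer (features are reused) |
| Memory footprint | Lower | Higher (all earlier feature maps must be kept) |
| Gradient flow | Direct, through the identity shortcut | Multiple paths |
| Feature reuse | Implicit | Explicit |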
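6. EfficientNet: compound scaling
6.1 Core idea
Conventional ways to scale a network:
- Depth scaling: add more layers
- Width scaling: add more channels per layer
- Resolution scaling: increase the input resolution
Problem: scaling only one dimension gives quickly diminishing accuracy gains for the extra compute.
6.2 Compound scaling formula
EfficientNet scales depth $d$, width $w$ and resolution $r$ jointly with a single coefficient $\phi$:
$$d = \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi}$$
Constraints:
$$\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2 \quad \text{and} \quad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1$$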
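A minimal sketch of how the scaling rule behaves. The base values α = 1.2, β = 1.1, γ = 1.15 are the ones reported for the EfficientNet-B0 grid search; they are not stated in this article, so treat them as an assumption here. `compound_scale` is an illustrative helper.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # depth, width, resolution bases (assumed)

def compound_scale(phi):
    """Return (depth, width, resolution, FLOPs) multipliers for coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs grow linearly with depth and quadratically with width and resolution
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops

print(round(ALPHA * BETA ** 2 * GAMMA ** 2, 3))   # 1.92, close to 2
for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth×{d:.2f} width×{w:.2f} resolution×{r:.2f} FLOPs×{f:.2f}")
```

Because the constraint pins the product near 2, each increment of φ roughly doubles the FLOPs budget while keeping the three dimensions in balance.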
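6.3 Mobile Inverted Bottleneck (MBConv)
EfficientNet is built from the MBConv block of MobileNetV2, extended with Squeeze-and-Excitation (SE):
Standard ResNet bottleneck:      MBConv:
Input                            Input
  ↓                                ↓
1×1 Conv (↓)                     1×1 Conv (↑)          ← expand
  ↓                                ↓
3×3 Conv                         Depthwise 3×3 → SE    ← Squeeze-and-Excitation
  ↓                                ↓
1×1 Conv (↑)                     1×1 Conv (↓), linear  ← project, no activation
  ↓                                ↓
Add + ReLU                       Add
  ↓                                ↓
Output                           Output
Key features:
- Depthwise separable convolution: greatly reduces computation
- Expand then project: widen the channels first, filter in the wider space, then project back down with a linear (activation-free) 1×1 convolution
- Skip connection: used only when stride = 1 and the input and output channel counts match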
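The block described above can be sketched in PyTorch as follows. This is a simplified illustration, not the reference implementation: the `SqueezeExcite` helper, the `se_ratio` parameter and its default of 0.25 are assumptions, and stochastic depth is left out.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Channel attention: global pool → reduce → expand → sigmoid gate."""
    def __init__(self, channels, reduced_channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, reduced_channels, kernel_size=1),
            nn.SiLU(),
            nn.Conv2d(reduced_channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)

class MBConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1,
                 expand_ratio=6, se_ratio=0.25):
        super().__init__()
        hidden = in_channels * expand_ratio
        # Residual connection only when input and output shapes match
        self.use_skip = stride == 1 and in_channels == out_channels
        layers = []
        if expand_ratio != 1:
            # 1×1 expansion
            layers += [nn.Conv2d(in_channels, hidden, 1, bias=False),
                       nn.BatchNorm2d(hidden), nn.SiLU()]
        layers += [
            # Depthwise k×k convolution (one filter per channel)
            nn.Conv2d(hidden, hidden, kernel_size, stride=stride,
                      padding=kernel_size // 2, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU(),
            SqueezeExcite(hidden, max(1, int(in_channels * se_ratio))),
            # Linear 1×1 projection: no activation afterwards
            nn.Conv2d(hidden, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
        ]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out

x = torch.randn(2, 32, 56, 56)
print(MBConv(32, 32, stride=1)(x).shape)                 # torch.Size([2, 32, 56, 56])
print(MBConv(32, 64, kernel_size=5, stride=2)(x).shape)  # torch.Size([2, 64, 28, 28])
```

The first call keeps the shape and therefore uses the skip connection; the second changes both resolution and width, so the block output is returned without a residual.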
6.4 EfficientNet-B0 baseline
import torch.nn as nn

# MBConv is assumed to be defined elsewhere; it is not part of this listing.
class EfficientNetB0(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # Stem
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32),
            nn.SiLU(inplace=True)
        )
        # MBConv blocks (simplified)
        self.blocks = nn.Sequential(
            # Block 1: 112×112, 16 channels, expand=1
            MBConv(32, 16, kernel_size=3, stride=1, expand_ratio=1),
            # Block 2-9: ... (omitted; the last block must output 320 channels
            # for the head below to apply)
        )
        # Head
        self.head = nn.Sequential(
            nn.Conv2d(320, 1280, kernel_size=1, bias=False),
            nn.BatchNorm2d(1280),
            nn.SiLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Dropout(0.2),
            nn.Linear(1280, num_classes)
        )

    def forward(self, x):
        x = self.stem(x)
        x = self.blocks(x)
        x = self.head(x)
        return x
6.5 Performance comparison
| Model | Top-1 | Parameters | FLOPs |
|---|---|---|---|
| ResNet-50 | 76.0% | 26M | 4.1G |
| DenseNet-169 | 76.2% | 14M | 3.4G |
| EfficientNet-B0 | 77.1% | 5.3M | 0.39G |
| EfficientNet-B4 | 82.9% | 19M | 4.2G |
| EfficientNet-B7 | 84.3% | 66M | 37G |
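7. ConvNeXt: a modern ConvNet
7.1 Background
Vision Transformers (ViT) overtook CNNs in 2020-2021. ConvNeXt (2022) showed that a modernized ConvNet can match Transformers.
7.2 Design principles
ConvNeXt systematically "modernizes" ResNet to catch up with, and in places surpass, vision Transformers:
| Design | ResNet (baseline) | ConvNeXt |
|---|---|---|
| Activation | ReLU | GELU |
| Normalization | BN after each conv | LN after the depthwise conv |
| Number of normalization layers | One BN per conv layer | One LN per block |
| Kernel size | 3×3 | 7×7 depthwise (large kernel) |
| Block structure | Bottleneck | Inverted bottleneck |
| Stem and downsampling | 7×7 stride-2 conv + max-pool stem; stride-2 conv inside blocks | 4×4 stride-4 "patchify" stem; separate downsampling layers |
| Blocks per stage | (3, 4, 6, 3) | (3, 3, 9, 3), i.e. a 1:1:3:1 ratio |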
7.3 Key improvements
1. Macro design
| Item | ResNet-50 | ConvNeXt-T |
|---|---|---|
| Blocks per stage | [3,4,6,3] | [3,3,9,3] |
| Channels | [64,128,256,512] (bottleneck widths) | [96,192,384,768] |
| Stem | 7×7, stride=2 conv + 3×3, stride=2 max-pool | 4×4, stride=4 conv (patchify) |
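Both stems reduce a 224×224 image by the same overall factor of 4, which a quick output-size calculation confirms. The paddings used for the ResNet stem (3 for the 7×7 conv, 1 for the max-pool) follow the common implementation and are an assumption here; `out_size` is an illustrative helper.

```python
def out_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

image = 224
# ResNet stem: 7×7 stride-2 conv (padding 3), then 3×3 stride-2 max-pool (padding 1)
resnet = out_size(out_size(image, 7, 2, 3), 3, 2, 1)
# ConvNeXt stem: one non-overlapping 4×4 stride-4 conv
convnext = out_size(image, 4, 4)
print(resnet, convnext)   # 56 56
```

The patchify stem reaches the same 56×56 resolution with a single non-overlapping convolution, mirroring the patch embedding of a vision Transformer.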
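2. Inverted Bottleneck
# ResNet bottleneck:      ConvNeXt block:
#   1×1 ↓                   7×7 depthwise
#   3×3                     1×1 ↑ (4× expansion)
#   1×1 ↑                   1×1 ↓
import torch.nn as nn
from torchvision.ops import StochasticDepth  # stands in for the undefined DropPath

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim, kernel_size=7, drop_path_rate=0.0):
        super().__init__()
        # Depthwise convolution (groups == channels)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.LayerNorm(dim, eps=1e-6)
        # Pointwise (1×1) convs written as Linear layers on channels-last tensors
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # expand
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)  # project
        # Stochastic depth on the residual branch
        self.drop_path = (StochasticDepth(drop_path_rate, mode="row")
                          if drop_path_rate > 0 else nn.Identity())

    def forward(self, x):
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # (B, C, H, W) → (B, H, W, C)
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        x = x.permute(0, 3, 1, 2)  # back to (B, C, H, W)
        return shortcut + self.drop_path(x)
3. Large-kernel convolution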
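Key observations:
- ViT's global self-attention can be seen as the limiting case of a very large kernel
- The cost of a depthwise convolution grows only as k² × C × H × W, without the extra factor of C that a dense convolution pays, so a 7×7 depthwise kernel stays cheap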
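The cost claim can be checked with a short calculation in multiply-accumulates (MACs). The sizes below (96 channels on a 56×56 map, 4× expansion in the pointwise layers) are assumed to resemble the first stage of a ConvNeXt-T-like network; `conv_macs` is an illustrative helper.

```python
def conv_macs(c_in, c_out, k, h, w, groups=1):
    """Multiply-accumulates of a k×k convolution on an h×w output map."""
    return (c_in // groups) * c_out * k * k * h * w

C, H, W = 96, 56, 56   # assumed first-stage size
dense_3x3 = conv_macs(C, C, 3, H, W)
dense_7x7 = conv_macs(C, C, 7, H, W)
dw_7x7 = conv_macs(C, C, 7, H, W, groups=C)
# The two 1×1 layers of an inverted bottleneck with 4× expansion
pointwise = conv_macs(C, 4 * C, 1, H, W) + conv_macs(4 * C, C, 1, H, W)

print(dense_7x7 / dense_3x3)   # 49/9 ≈ 5.44
print(dense_7x7 / dw_7x7)      # 96.0: depthwise drops the factor C
print(dw_7x7 / pointwise)      # ≈ 0.064: small next to the two 1×1 layers
```

Going from 3×3 to 7×7 multiplies the depthwise cost by about 5.4, but that layer is only a few percent of the block's total, so the large kernel is nearly free.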
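# Cost comparison (multiply-accumulates)
# Dense convolution:
#   3×3: 9 × C² × H × W
#   7×7: 49 × C² × H × W   (≈ 5.4× the 3×3 cost)
# Depthwise convolution (each filter sees a single channel):
#   3×3: 9 × C × H × W
#   7×7: 49 × C × H × W    (same ≈ 5.4× ratio, but C times cheaper than dense)
7.4 Performance comparison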
| Model | ImageNet Top-1 | Parameters | FLOPs |
|---|---|---|---|
| Swin-T | 81.3% | 28M | 4.5G |
| ConvNeXt-T | 82.1% | 28M | 4.5G |
| Swin-S | 83.1% | 50M | 8.7G |
| ConvNeXt-S | 83.1% | 50M | 8.7G |
| Swin-B | 83.9% | 88M | 15.4G |
| ConvNeXt-B | 83.8% | 89M | 15.4G |
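8. RepLKNet: large-kernel convolution
8.1 Core idea
RepLKNet: "Scaling Up Your Kernels to 31×31"
Key findings:
- Part of ViT's success is attributed to its global receptive field
- A CNN can achieve a similar effect by enlarging its convolution kernels
- A 31×31 kernel is reported to give the best accuracy-efficiency trade-off on ImageNet
8.2 Architecture design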
import torch.nn as nn

class RepLKNetBlock(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=31, stride=1):
        super().__init__()
        # Large-kernel depthwise convolution; the stride is applied here so the
        # main path and the shortcut produce the same spatial size
        self.dwconv = nn.Conv2d(
            in_channels, in_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=kernel_size // 2,
            groups=in_channels
        )
        # Pointwise convolution
        self.pwconv = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        # Shortcut connection (assumes H and W are divisible by stride)
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.AvgPool2d(stride) if stride > 1 else nn.Identity(),
                nn.Conv2d(in_channels, out_channels, 1)
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        return self.shortcut(x) + self.pwconv(self.dwconv(x))
8.3 Performance
| Model | ImageNet Top-1 | Inference speed |
|---|---|---|
| Swin-B | 83.9% | 1.0× |
| RepLKNet-31B | 84.8% | 1.2× (faster) |
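9. Architecture comparison summary
| Architecture | Core innovation | Strengths | Weaknesses |
|---|---|---|---|
| VGGNet | Deep network + small kernels | Simple, regular | Large parameter count |
| GoogLeNet | Inception module | Multi-scale features | Complex structure |
| ResNet | Residual connections | Stable training, easy to scale | Still room for optimization |
| DenseNet | Dense connectivity | Feature reuse | High memory footprint |
| EfficientNet | Compound scaling | High efficiency | Design depends on NAS |
| ConvNeXt | Modernized ConvNet | Strongest overall accuracy-efficiency balance | Relatively new design |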
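10. Design trends
- Large-kernel convolution: ConvNeXt and RepLKNet show that kernels of 7×7 or larger are more effective
- Depthwise separable convolution: the standard way to reduce computation
- Inverted bottleneck: a block design that expands first and then projects back down
- Fewer normalization layers: recent designs such as ConvNeXt use LayerNorm in place of BatchNorm
- GELU activation: replaces ReLU as the default choice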
References
[1] He, K., et al. (2016). Deep Residual Learning for Image Recognition. CVPR.