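Overview
As deep learning spreads across mobile devices and edge computing, lightweight CNNs have become a central topic in both research and engineering. This document covers:
- Depthwise separable convolution: factorizes a standard convolution into a depthwise and a pointwise convolution, sharply reducing parameters and computation
- The MobileNet family: evolution and design philosophy of v1/v2/v3
- The EfficientNet family: compound scaling
- Other lightweight architectures: GhostNet, ShuffleNet, RegNet
- CNN-Transformer hybrids: EfficientFormer, MobileViT, PoolFormer
- Recent advances (2025): Wavelet Convolutions and others
The core goal of a lightweight CNN is the best achievable trade-off between accuracy and efficiency, which is critical for real-world deployment. [1]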
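1. Theoretical Foundations of Depthwise Separable Convolution
1.1 Computational Complexity of Standard Convolution
For an input $X \in \mathbb{R}^{H \times W \times C_{in}}$, an output $Y \in \mathbb{R}^{H \times W \times C_{out}}$, and a kernel $K \in \mathbb{R}^{k \times k \times C_{in} \times C_{out}}$ (stride 1, "same" padding):
$$\text{Params}_{std} = k^2 C_{in} C_{out}, \qquad \text{FLOPs}_{std} = H W k^2 C_{in} C_{out}$$
Here FLOPs counts multiply-accumulate operations, matching the counting functions in Section 1.5.
Example ($3 \times 3$ convolution, with $C_{in} = 128$, $C_{out} = 256$, $H = W = 56$ as in the Section 1.5 example):
$$\text{Params}_{std} = 9 \cdot 128 \cdot 256 = 294{,}912, \qquad \text{FLOPs}_{std} = 56 \cdot 56 \cdot 294{,}912 \approx 9.25 \times 10^{8}$$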
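1.2 Depthwise Separable Factorization
The factorization has two steps.
Step 1 - Depthwise convolution:
Each input channel is filtered independently by its own $k \times k$ kernel:
$$\hat{Y}_{:,:,c} = X_{:,:,c} * K^{dw}_{c}, \qquad c = 1, \dots, C_{in}$$
Parameters: $k^2 C_{in}$
Step 2 - Pointwise convolution:
A standard $1 \times 1$ convolution mixes the channels:
$$Y_{:,:,j} = \sum_{c=1}^{C_{in}} W^{pw}_{c,j}\, \hat{Y}_{:,:,c}, \qquad j = 1, \dots, C_{out}$$
Parameters: $C_{in} C_{out}$
1.3 Complexity Comparison
Total FLOPs:
$$\text{FLOPs}_{dsc} = H W C_{in} \left(k^2 + C_{out}\right)$$
Compression ratio:
$$\frac{\text{FLOPs}_{dsc}}{\text{FLOPs}_{std}} = \frac{H W C_{in} (k^2 + C_{out})}{H W k^2 C_{in} C_{out}} = \frac{1}{C_{out}} + \frac{1}{k^2}$$
For $k = 3$, $C_{out} = 256$:
$$\frac{1}{256} + \frac{1}{9} \approx 0.115$$
That is roughly an 8.7× reduction in FLOPs.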
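The closed-form ratio $1/C_{out} + 1/k^2$ can be checked numerically. The sketch below is a minimal, framework-free illustration; the function names `dsc_ratio` and `dsc_ratio_from_counts` are hypothetical helpers introduced here, not part of any library.

```python
def dsc_ratio(kernel_size: int, c_out: int) -> float:
    """Closed-form FLOPs ratio (depthwise separable / standard): 1/C_out + 1/k^2."""
    return 1.0 / c_out + 1.0 / (kernel_size ** 2)


def dsc_ratio_from_counts(h: int, w: int, c_in: int, c_out: int, kernel_size: int) -> float:
    """Same ratio, computed from explicit multiply-accumulate counts."""
    standard = h * w * c_in * c_out * kernel_size ** 2
    depthwise = h * w * c_in * kernel_size ** 2
    pointwise = h * w * c_in * c_out
    return (depthwise + pointwise) / standard


# Larger kernels benefit more, because the 1/k^2 term dominates the ratio.
for k in (3, 5, 7):
    r = dsc_ratio(k, 256)
    print(f"k={k}: ratio={r:.4f}, speed-up={1 / r:.1f}x")
```

Note that the ratio does not depend on $H$, $W$, or $C_{in}$: those factors cancel, so only the kernel size and the output width determine the saving.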
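1.4 Geometric Interpretation
A standard convolution models two things at once:
- Spatial correlation (the $k \times k$ kernel)
- Cross-channel correlation (a $C_{in} \to C_{out}$ linear map)
Depthwise separable convolution decouples the two:
- The depthwise convolution models spatial correlation only
- The pointwise convolution models cross-channel correlation only
The decoupling assumes that spatial and cross-channel correlations can be learned separately, an assumption that has held up well in practice.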
1.5 PyTorch Implementation
```python
import torch
import torch.nn as nn


class DepthwiseSeparableConv2d(nn.Module):
    """Depthwise separable convolution."""

    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False):
        super().__init__()
        # Depthwise: one k x k filter per input channel (groups == in_channels)
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size,
            stride=stride, padding=padding, groups=in_channels, bias=bias
        )
        # Pointwise: 1 x 1 convolution that mixes channels
        self.pointwise = nn.Conv2d(
            in_channels, out_channels, 1, bias=bias
        )

    def forward(self, x):
        x = self.depthwise(x)
        x = self.pointwise(x)
        return x


def count_flops_dsc(in_channels, out_channels, kernel_size, H, W):
    """FLOPs (multiply-accumulates) of a depthwise separable convolution."""
    dw = H * W * in_channels * kernel_size * kernel_size
    pw = H * W * in_channels * out_channels
    return dw + pw


def count_flops_standard(in_channels, out_channels, kernel_size, H, W):
    """FLOPs (multiply-accumulates) of a standard convolution."""
    return H * W * in_channels * out_channels * kernel_size * kernel_size


# Example
H, W = 56, 56
c_in, c_out = 128, 256
ks = 3
flops_std = count_flops_standard(c_in, c_out, ks, H, W)
flops_dsc = count_flops_dsc(c_in, c_out, ks, H, W)
print(f"Standard conv FLOPs: {flops_std:,}")
print(f"Depthwise separable FLOPs: {flops_dsc:,}")
print(f"Compression ratio: {flops_dsc/flops_std:.3f}")
```
2. Evolution of the MobileNet Family
2.1 MobileNet v1 (2017)
Core innovation: depthwise separable convolution as the basic building block.
Architecture:
Input (224×224×3)
↓
Conv 3×3, stride=2 → 112×112×32
↓
DepthwiseSeparable × 13
↓
AvgPool 7×7
↓
FC 1000
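Key hyperparameters:
- Width multiplier $\alpha \in (0, 1]$: scales the number of channels
- Resolution multiplier $\rho \in (0, 1]$: scales the input resolution
Per-layer cost after scaling:
$$\text{FLOPs} = \rho^2 H W \cdot \alpha C_{in} \left(k^2 + \alpha C_{out}\right)$$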
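To make the effect of the two multipliers concrete, the following sketch evaluates the cost of one depthwise separable layer under different settings. It is only an illustration: the helper name `mobilenet_v1_layer_macs` and the example layer (14×14 resolution, 512 channels) are assumptions chosen for the demonstration, and scaled sizes are truncated to integers for simplicity.

```python
def mobilenet_v1_layer_macs(h, w, c_in, c_out, k=3, alpha=1.0, rho=1.0):
    """Multiply-accumulates of one depthwise separable layer after scaling."""
    h_s, w_s = int(rho * h), int(rho * w)                    # scaled resolution
    c_in_s, c_out_s = int(alpha * c_in), int(alpha * c_out)  # scaled widths
    depthwise = h_s * w_s * c_in_s * k * k
    pointwise = h_s * w_s * c_in_s * c_out_s
    return depthwise + pointwise


base = mobilenet_v1_layer_macs(14, 14, 512, 512)
for alpha, rho in [(1.0, 1.0), (0.75, 1.0), (0.5, 1.0), (1.0, 0.5)]:
    macs = mobilenet_v1_layer_macs(14, 14, 512, 512, alpha=alpha, rho=rho)
    print(f"alpha={alpha}, rho={rho}: {macs:,} MACs ({macs / base:.2f}x)")
```

Because the pointwise term dominates, the cost falls roughly with $\alpha^2$, while $\rho$ scales it by exactly $\rho^2$ whenever the scaled resolution is an integer.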
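2.2 MobileNet v2 (2018)
Core innovation: linear bottlenecks and inverted residuals.
Design motivation:
ReLU discards more information in low-dimensional (bottleneck) spaces than in high-dimensional ones. MobileNet v2 therefore applies nonlinearities in the expanded high-dimensional space and keeps the low-dimensional bottleneck linear.
Inverted residual block (the reverse of a classic ResNet bottleneck):
Classic ResNet bottleneck: wide → narrow → wide (compress, then restore)
MobileNet v2: narrow → wide → narrow (expand, then compress)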
```python
class InvertedResidual(nn.Module):
    """Inverted residual block of MobileNet v2."""

    def __init__(self, in_channels, out_channels, stride, expansion=6):
        super().__init__()
        hidden = in_channels * expansion
        # The skip connection is only valid when input and output shapes match
        self.use_residual = (stride == 1 and in_channels == out_channels)
        layers = []
        # 1. Expansion (pointwise convolution)
        if expansion != 1:
            layers.append(nn.Conv2d(in_channels, hidden, 1, bias=False))
            layers.append(nn.BatchNorm2d(hidden))
            layers.append(nn.ReLU6(inplace=True))
        # 2. Depthwise convolution
        layers.extend([
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True)
        ])
        # 3. Projection (linear pointwise convolution, no activation)
        layers.extend([
            nn.Conv2d(hidden, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels)
        ])
        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_residual:
            return x + self.conv(x)
        return self.conv(x)
```
2.3 MobileNet v3 (2019)
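Core innovation: combines neural architecture search (NAS), Squeeze-and-Excitation (SE) attention, and the h-swish activation. [2]
h-swish activation:
$$\text{h-swish}(x) = x \cdot \frac{\text{ReLU6}(x + 3)}{6}$$
Compared with swish, $\text{swish}(x) = x \cdot \sigma(x)$, h-swish avoids computing a sigmoid and is better suited to hardware acceleration.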
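As a quick numerical illustration of how closely h-swish, $x \cdot \text{ReLU6}(x + 3) / 6$, tracks swish, $x \cdot \sigma(x)$, the following framework-free sketch evaluates both on a few points. The scalar helpers are written for this example only; in PyTorch the corresponding modules are `nn.Hardswish` and `nn.SiLU`.

```python
import math


def relu6(x: float) -> float:
    return min(max(x, 0.0), 6.0)


def h_swish(x: float) -> float:
    """Piecewise-linear approximation of swish: x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0


def swish(x: float) -> float:
    """Reference swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  swish={swish(x):+.4f}  h-swish={h_swish(x):+.4f}")
```

The two functions agree exactly at 0 and differ most near $x = \pm 3$, where h-swish switches between its linear pieces.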
SE attention module:
$$s = \sigma\big(W_2\, \delta(W_1\, \text{GAP}(x))\big), \qquad \tilde{x} = s \odot x$$
where GAP is global average pooling, $\delta$ is ReLU, and the gate $\sigma$ is a hard sigmoid in MobileNetV3.
MobileNetV3 bneck block (as used in MobileNetV3-Large):
```python
class MobileNetV3Block(nn.Module):
    """bneck block of MobileNet v3."""

    def __init__(self, in_channels, out_channels, kernel_size, stride,
                 use_se=True, use_hs=True, expansion=6):
        super().__init__()
        self.use_residual = (stride == 1 and in_channels == out_channels)
        hidden = in_channels * expansion
        # Stateless activation, so one instance can be reused in several places
        activation = nn.Hardswish() if use_hs else nn.ReLU()
        layers = []
        # Expansion
        if expansion != 1:
            layers.extend([
                nn.Conv2d(in_channels, hidden, 1, bias=False),
                nn.BatchNorm2d(hidden),
                activation
            ])
        # Depthwise convolution
        padding = (kernel_size - 1) // 2
        layers.extend([
            nn.Conv2d(hidden, hidden, kernel_size, stride=stride,
                      padding=padding, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            activation
        ])
        # SE attention
        if use_se:
            layers.append(SqueezeExcite(hidden))
        # Projection (linear, no activation)
        layers.extend([
            nn.Conv2d(hidden, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels)
        ])
        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_residual:
            return x + self.conv(x)
        return self.conv(x)


class SqueezeExcite(nn.Module):
    """Squeeze-and-Excitation module."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = max(1, channels // reduction)
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, hidden, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1),
            # The gate must lie in [0, 1]: hard sigmoid, not Hardswish (which is unbounded)
            nn.Hardsigmoid()
        )

    def forward(self, x):
        return x * self.fc(x)
```
2.4 MobileNet Version Comparison
| Property | v1 (2017) | v2 (2018) | v3 (2019) |
|---|---|---|---|
| Core module | Depthwise separable conv | Linear bottleneck + inverted residual | NAS + SE + h-swish |
| Activation | ReLU6 | ReLU6 | h-swish / ReLU |
| ImageNet Top-1 | 70.6% | 72.0% | 75.2% |
| Multiply-adds (1.0× width, 224×224; v3 is Large) | 569M | 300M | 219M |
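3. The EfficientNet Family
3.1 Compound Scaling
Core question: given a FLOPs budget, how should a network's depth, width, and input resolution be scaled together?
Compound scaling rule (Tan & Le, ICML 2019):
$$d = \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi}$$
Constraint:
$$\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2, \qquad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1$$
where $d$, $w$, $r$ are the depth, width, and resolution multipliers and $\phi$ is the user-specified compound coefficient. Because FLOPs grow roughly as $d \cdot w^2 \cdot r^2$, each unit increase in $\phi$ roughly doubles the cost.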
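The scaling rule $d = \alpha^\phi$, $w = \beta^\phi$, $r = \gamma^\phi$ can be tabulated directly. The sketch below assumes the coefficients reported in the EfficientNet paper for the B0 baseline ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$); the function name `compound_scale` is a hypothetical helper, and the released B1-B7 models do not follow these idealized multipliers exactly.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # grid-searched values reported for EfficientNet-B0


def compound_scale(phi: float):
    """Return (depth, width, resolution, FLOPs) multipliers for a given phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    flops = depth * width ** 2 * resolution ** 2  # FLOPs scale as d * w^2 * r^2
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs x{f:.2f}")
```

With these coefficients $\alpha \beta^2 \gamma^2 \approx 1.92$, so the FLOPs multiplier is close to, but slightly below, $2^\phi$.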
3.2 EfficientNet-B0 Baseline
```python
class MBConvBlock(nn.Module):
    """EfficientNet MBConv block (MobileNet v2 inverted residual + SE)."""

    def __init__(self, in_channels, out_channels, kernel_size, stride,
                 expand_ratio, se_ratio=0.25, drop_rate=0.0):
        super().__init__()
        self.use_residual = (stride == 1 and in_channels == out_channels)
        hidden = in_channels * expand_ratio
        self.drop_rate = drop_rate
        layers = []
        # Expansion (only when expand_ratio != 1)
        if expand_ratio != 1:
            layers.extend([
                nn.Conv2d(in_channels, hidden, 1, bias=False),
                nn.BatchNorm2d(hidden),
                nn.SiLU()  # Swish
            ])
        # Depthwise convolution
        padding = (kernel_size - 1) // 2
        layers.extend([
            nn.Conv2d(hidden, hidden, kernel_size, stride=stride,
                      padding=padding, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU()
        ])
        # SE: the squeeze width is derived from the block's input channels.
        # SqueezeExcite (Section 2.3) takes a reduction factor as its second
        # argument, so the target width is converted to a factor here.
        if se_ratio > 0:
            se_hidden = max(1, int(in_channels * se_ratio))
            layers.append(SqueezeExcite(hidden, reduction=max(1, hidden // se_hidden)))
        # Projection (linear, no activation)
        layers.extend([
            nn.Conv2d(hidden, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels)
        ])
        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        out = self.conv(x)
        if self.use_residual:
            if self.drop_rate > 0 and self.training:
                out = self._drop_connect(out)
            out = out + x
        return out

    def _drop_connect(self, x):
        """DropConnect (stochastic depth): drop the residual branch per sample."""
        keep_prob = 1.0 - self.drop_rate
        # One Bernoulli draw per sample, broadcast over C, H, W
        mask = torch.empty((x.size(0), 1, 1, 1), dtype=x.dtype, device=x.device).bernoulli_(keep_prob)
        return x * mask / keep_prob
```
3.3 The Eight EfficientNet Variants
| Model | $\phi$ | Resolution | FLOPs | Top-1 |
|---|---|---|---|---|
| B0 | 0 | 224 | 390M | 77.3% |
| B1 | 1 | 240 | 700M | 79.1% |
| B2 | 2 | 260 | 1.0G | 80.1% |
| B3 | 3 | 300 | 1.8G | 81.6% |
| B4 | 4 | 380 | 4.2G | 82.9% |
| B5 | 5 | 456 | 9.9G | 83.6% |
| B6 | 6 | 528 | 19.0G | 84.0% |
| B7 | 7 | 600 | 37.0G | 84.3% |
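3.4 EfficientNetV2 (2021)
Improvements:
- Fused-MBConv: in early stages, a single standard 3×3 convolution replaces the 1×1 expansion plus depthwise 3×3 convolution
- Progressive learning: image size is increased gradually during training
- Adaptive regularization: regularization strength is increased together with image size
```python
class FusedMBConv(nn.Module):
    """EfficientNetV2 Fused-MBConv (used in early stages)."""

    def __init__(self, in_channels, out_channels, stride, expand_ratio=1):
        super().__init__()
        hidden = in_channels * expand_ratio
        self.use_residual = (stride == 1 and in_channels == out_channels)
        if expand_ratio != 1:
            layers = [
                # Fused 3x3 expansion
                nn.Conv2d(in_channels, hidden, 3, stride=stride,
                          padding=1, bias=False),
                nn.BatchNorm2d(hidden),
                nn.SiLU(),
                # Linear 1x1 projection
                nn.Conv2d(hidden, out_channels, 1, bias=False),
                nn.BatchNorm2d(out_channels)
            ]
        else:
            # No expansion: a single 3x3 convolution, so the block still
            # mixes spatially and honours the stride
            layers = [
                nn.Conv2d(in_channels, out_channels, 3, stride=stride,
                          padding=1, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.SiLU()
            ]
        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        out = self.conv(x)
        if self.use_residual:
            return out + x
        return out
```
4. Other Efficient Architectures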
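4.1 GhostNet (2020)
Core observation: CNN feature maps contain substantial redundancy (many maps are near-copies of each other). GhostNet generates these redundant maps with cheap operations instead of full convolutions.
Ghost module:
```python
import math


class GhostModule(nn.Module):
    """Ghost module: cheap operations generate the redundant feature maps."""

    def __init__(self, in_channels, out_channels, kernel_size=1, ratio=2, dw_size=3):
        super().__init__()
        self.out_channels = out_channels
        # Round up so the depthwise step stays valid for any out_channels (ratio >= 2)
        init_channels = math.ceil(out_channels / ratio)
        new_channels = init_channels * (ratio - 1)
        # Step 1: ordinary convolution produces the intrinsic features
        self.primary_conv = nn.Sequential(
            nn.Conv2d(in_channels, init_channels, kernel_size,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(init_channels),
            nn.ReLU(inplace=True)
        )
        # Step 2: cheap linear transform (depthwise convolution) produces the ghost features
        self.cheap_operation = nn.Sequential(
            nn.Conv2d(init_channels, new_channels, dw_size, padding=dw_size // 2,
                      groups=init_channels, bias=False),
            nn.BatchNorm2d(new_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        x1 = self.primary_conv(x)
        x2 = self.cheap_operation(x1)
        out = torch.cat([x1, x2], dim=1)
        # Trim surplus channels introduced by rounding up
        return out[:, :self.out_channels]
```
Complexity comparison for an output of size $H \times W \times C_{out}$ (standard convolution vs. Ghost module with ratio $s$ and cheap-operation kernel size $d$):
$$\text{FLOPs}_{std} = H W C_{in} C_{out} k^2, \qquad \text{FLOPs}_{ghost} = H W \frac{C_{out}}{s} C_{in} k^2 + H W (s - 1) \frac{C_{out}}{s} d^2$$
so $\text{FLOPs}_{std} / \text{FLOPs}_{ghost} \approx s$ when $d^2 \ll C_{in} k^2$.
Typically $s = 2$, which cuts FLOPs by roughly 50%.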
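The roughly 50% saving for a ratio of 2 can be reproduced with a few lines of arithmetic. This is a minimal sketch under the counting convention used in Section 1.5 (multiply-accumulates, stride 1); the helper names and the example layer shape are assumptions made for illustration.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates of an ordinary k x k convolution."""
    return h * w * c_in * c_out * k * k


def ghost_module_macs(h, w, c_in, c_out, k=1, s=2, d=3):
    """Multiply-accumulates of a Ghost module with ratio s and d x d cheap ops."""
    intrinsic = c_out // s                        # channels from the primary conv
    primary = h * w * c_in * intrinsic * k * k
    cheap = h * w * (s - 1) * intrinsic * d * d   # depthwise "ghost" maps
    return primary + cheap


std = standard_conv_macs(56, 56, 128, 256, k=1)
ghost = ghost_module_macs(56, 56, 128, 256, k=1, s=2, d=3)
print(f"standard: {std:,}  ghost: {ghost:,}  ratio: {ghost / std:.3f}")
```

The ratio lands slightly above 0.5 because the depthwise step is cheap but not free; its share shrinks as the input channel count grows.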
4.2 ShuffleNet (2018)
Core idea: channel shuffle lets information flow across the groups of a grouped convolution, which are otherwise isolated from each other.
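The shuffle itself is just a fixed permutation: reshape the channels to (groups, channels per group), transpose, and flatten. The sketch below applies that permutation to a plain list of channel labels so the effect is visible without tensors; `channel_shuffle` here is an illustrative stand-in for the tensor version used in a real network.

```python
def channel_shuffle(channels, groups):
    """Permute a flat channel list: reshape to (groups, n), transpose, flatten."""
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    n = len(channels) // groups
    grouped = [channels[g * n:(g + 1) * n] for g in range(groups)]   # (groups, n)
    return [grouped[g][i] for i in range(n) for g in range(groups)]  # transpose + flatten


labels = ["a0", "a1", "b0", "b1", "c0", "c1"]  # three groups: a, b, c
print(channel_shuffle(labels, groups=3))
# -> ['a0', 'b0', 'c0', 'a1', 'b1', 'c1']
```

After the shuffle, each contiguous group handed to the next grouped convolution contains channels that originated in different groups, which is what removes the isolation.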
```python
import torch.nn.functional as F


class ShuffleBlock(nn.Module):
    """ShuffleNet unit. Channel counts must be divisible by `groups`."""

    def __init__(self, in_channels, out_channels, stride, groups=3):
        super().__init__()
        self.stride = stride
        self.groups = groups
        if stride == 2:
            out_channels -= in_channels  # the shortcut is concatenated, not added
        bottleneck = out_channels // 4
        # Grouped 1x1 convolutions; the shuffle in between mixes the groups
        self.conv1 = nn.Conv2d(in_channels, bottleneck, 1, groups=groups, bias=False)
        self.bn1 = nn.BatchNorm2d(bottleneck)
        self.dwconv = nn.Conv2d(bottleneck, bottleneck, 3,
                                stride=stride, padding=1,
                                groups=bottleneck, bias=False)
        self.bn2 = nn.BatchNorm2d(bottleneck)
        self.conv3 = nn.Conv2d(bottleneck, out_channels, 1, groups=groups, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        # Downsample the shortcut so it matches the stride-2 branch spatially
        self.shortcut_pool = nn.AvgPool2d(3, stride=2, padding=1)

    def channel_shuffle(self, x, groups):
        """Channel shuffle: (B, g, C/g, H, W) -> transpose -> (B, C, H, W)."""
        B, C, H, W = x.shape
        x = x.view(B, groups, C // groups, H, W)
        x = x.transpose(1, 2).contiguous()
        return x.view(B, C, H, W)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.channel_shuffle(out, self.groups)
        out = self.bn2(self.dwconv(out))
        out = self.bn3(self.conv3(out))
        if self.stride == 1:
            out = out + x
        else:
            out = torch.cat([self.shortcut_pool(x), out], dim=1)
        return F.relu(out)
```
4.3 RegNet (2020)
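Core contribution: derives network design principles by analyzing entire network design spaces rather than individual networks.
Key design constraints (from large-scale experiments):
- Shared bottleneck ratio $b = 1$ (bottleneck width = input width)
- Shared group width $g$
- Depth of about 20 blocks
- Shared linear width parameterization $u_j = w_0 + w_a \cdot j$ (where $j$ is the block index)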
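A linear width rule of the form $u_j = w_0 + w_a \cdot j$ gives a different width for every block, so in practice it is quantized into a few stages of constant width. The sketch below illustrates one way to do that, in the spirit of the quantization step described in the RegNet paper: snap each $u_j$ to the nearest power of a multiplier $w_m$ and round to a multiple of 8. The function name, the divisor, and the parameter values are illustrative assumptions, not a published configuration.

```python
import math


def regnet_widths(w0: int, wa: float, wm: float, depth: int, divisor: int = 8):
    """Quantize linearly growing block widths into piecewise-constant stages."""
    widths = []
    for j in range(depth):
        u = w0 + wa * j                               # linear target width
        s = round(math.log(u / w0) / math.log(wm))    # nearest power of wm
        w = w0 * wm ** s                              # quantized width
        widths.append(int(round(w / divisor) * divisor))
    return widths


# Hypothetical parameters chosen only for the demonstration
widths = regnet_widths(w0=24, wa=36.0, wm=2.5, depth=13)
print(widths)
print("stage widths:", sorted(set(widths)))
```

Consecutive blocks that share a width form one stage, so the three parameters $w_0$, $w_a$, $w_m$ plus the depth determine both the stage widths and the number of blocks per stage.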
```python
import torch.nn.functional as F


class RegNetBlock(nn.Module):
    """RegNet block (bottleneck ratio 1, grouped 3x3 convolution)."""

    def __init__(self, in_channels, out_channels, stride, group_width):
        super().__init__()
        bottleneck = out_channels  # bottleneck ratio of 1
        groups = bottleneck // group_width  # bottleneck must be divisible by group_width
        self.conv1 = nn.Conv2d(in_channels, bottleneck, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(bottleneck)
        self.conv2 = nn.Conv2d(bottleneck, bottleneck, 3, stride=stride,
                               padding=1, groups=groups, bias=False)
        self.bn2 = nn.BatchNorm2d(bottleneck)
        self.conv3 = nn.Conv2d(bottleneck, out_channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        # Identity skip only when shapes match (no projection shortcut here)
        self.use_skip = (stride == 1 and in_channels == out_channels)

    def forward(self, x):
        identity = x
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        if self.use_skip:
            out = out + identity
        # Activation after the residual sum, as in ResNet-style blocks
        return F.relu(out)
```
5. CNN-Transformer Hybrid Architectures
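5.1 Design Motivation
CNNs and Transformers have complementary strengths:
- CNNs: locality, inductive bias, computational efficiency
- Transformers: global dependencies, long-range relationships, scalability
Hybrid architectures try to combine the two.
5.2 MobileViT (2021)
Core idea: use a Transformer as the "global" complement to convolution.
MobileViT block: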
```python
class MobileViTBlock(nn.Module):
    """MobileViT block: local convolution + global Transformer."""

    def __init__(self, in_channels, transformer_dim, patch_size=(2, 2)):
        super().__init__()
        self.patch_size = patch_size
        # Local representation
        self.local_rep = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        # Project to the Transformer width
        self.proj_in = nn.Conv2d(in_channels, transformer_dim, 1)
        # Transformer encoder (transformer_dim must be divisible by nhead)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                d_model=transformer_dim, nhead=4, dim_feedforward=transformer_dim * 2,
                batch_first=True
            ),
            num_layers=2
        )
        # Project back to the input width
        self.proj_out = nn.Conv2d(transformer_dim, in_channels, 1)
        # Fusion
        self.fuse = nn.Conv2d(2 * in_channels, in_channels, 1)

    def forward(self, x):
        local = self.local_rep(x)
        x_t = self.proj_in(local)  # (B, D, H, W)
        B, D, H, W = x_t.shape
        ph, pw = self.patch_size   # H and W must be divisible by ph and pw
        nh, nw = H // ph, W // pw
        # Unfold: pixels at the same offset inside each patch form one sequence
        # (B, D, H, W) -> (B * ph * pw, nh * nw, D)
        x_t = x_t.reshape(B, D, nh, ph, nw, pw).permute(0, 3, 5, 2, 4, 1)
        x_t = x_t.reshape(B * ph * pw, nh * nw, D)
        # Global mixing across patches
        x_t = self.transformer(x_t)
        # Fold back: (B * ph * pw, nh * nw, D) -> (B, D, H, W)
        x_t = x_t.reshape(B, ph, pw, nh, nw, D).permute(0, 5, 3, 1, 4, 2)
        x_t = x_t.reshape(B, D, H, W)
        x_t = self.proj_out(x_t)
        # Fuse local and global features
        return self.fuse(torch.cat([local, x_t], dim=1))
```
5.3 EfficientFormer (2022)
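Core idea: design an efficient Vision Transformer by optimizing for measured latency rather than FLOPs.
Key design choices:
- Dimension-consistent design: avoids reshape operations (saving memory)
- Latency-driven slimming: prunes the architecture according to on-device latency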
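Optimizing for latency presupposes a way to measure it. The sketch below shows a generic wall-clock benchmark harness (warm-up runs followed by a median over repeated timings). It is an assumption about how such measurements are commonly taken, not EfficientFormer's tooling; `measure_latency_ms` and `toy_workload` are hypothetical names, and the workload stands in for a model's forward pass.

```python
import statistics
import time


def measure_latency_ms(fn, warmup: int = 5, runs: int = 30) -> float:
    """Median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):          # warm caches and lazy initialization
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)  # median is robust to occasional spikes


def toy_workload():
    # Placeholder for model(input); any callable works
    return sum(i * i for i in range(10_000))


print(f"median latency: {measure_latency_ms(toy_workload):.3f} ms")
```

When timing GPU code, the device would also need to be synchronized before each timestamp, since kernel launches are asynchronous. Two models with equal FLOPs can differ noticeably under such a measurement, which is the motivation for latency-driven design.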
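5.4 PoolFormer (2022)
Surprising finding: replacing attention with a simple pooling operation still yields competitive performance.
PoolFormer block:
```python
class PoolFormerBlock(nn.Module):
    """PoolFormer block: average pooling as the token mixer."""

    def __init__(self, dim, pool_size=3):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)
        # Pooling mixes tokens along the spatial dimensions;
        # count_include_pad=False keeps border averages unbiased
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                 count_include_pad=False)
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, 4 * dim, 1),
            nn.GELU(),
            nn.Conv2d(4 * dim, dim, 1)
        )

    def forward(self, x):
        # x: (B, C, H, W); GroupNorm, AvgPool2d and Conv2d all expect NCHW
        # Token mixing: pooling in place of attention
        residual = x
        x = self.norm1(x)
        x = self.pool(x) - x  # subtract the input because the residual adds it back
        x = residual + x
        # Channel MLP
        residual = x
        x = self.norm2(x)
        x = self.mlp(x)
        return residual + x
```
Insight: the success of Transformers may stem mainly from the general architecture (residual connections + MLP) rather than from the attention mechanism itself.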
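6. Recent Advances (2025)
6.1 Wavelet Convolutions
Finder et al. (2024) propose wavelet convolutions, which construct convolution kernels with a global receptive field but sparse activation. [3]
Mathematical construction:
The wavelet kernel is defined across multiple scales:
$$K = \sum_{j} \sum_{n} c_{j,n}\, \psi_{j,n}$$
where $\psi_{j,n}$ is the wavelet basis function at scale $j$ and translation $n$, and $c_{j,n}$ are the corresponding coefficients.
Advantages:
- Global receptive field (as in ViT)
- Sparse activation (computationally efficient)
- Multi-scale analysis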
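To illustrate the multi-scale idea, the sketch below performs one level of a 1D orthonormal Haar decomposition and its inverse. It assumes a Haar basis purely for illustration and is not the authors' implementation; the helper names are hypothetical. The point is that a small kernel applied to the half-length low-frequency band covers twice as many original samples, so stacking levels grows the receptive field quickly.

```python
import math


def haar_forward(signal):
    """One level of the orthonormal Haar transform on an even-length list."""
    s = 1.0 / math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]   # averages
    high = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]  # details
    return low, high


def haar_inverse(low, high):
    """Exact inverse of haar_forward."""
    s = 1.0 / math.sqrt(2.0)
    out = []
    for a, d in zip(low, high):
        out.extend([(a + d) * s, (a - d) * s])
    return out


x = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 0.0, 8.0]
low, high = haar_forward(x)
print("low band :", [round(v, 3) for v in low])
print("high band:", [round(v, 3) for v in high])
print("reconstructed:", [round(v, 3) for v in haar_inverse(low, high)])
```

A size-3 kernel applied to `low` spans six samples of `x`; after a second decomposition level the same kernel would span twelve.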
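6.2 GhostNetV2 (2022)
GhostNetV2 introduces **decoupled fully connected (DFC) attention** to capture long-range dependencies:
```python
class GhostNetV2Block(nn.Module):
    """GhostNetV2 block: DFC attention + Ghost modules.

    DFCAttention is not defined in this document; it is assumed to be a module
    that maps a (B, C, H, W) tensor to an attention map of the same shape.
    `stride` only decides whether the shortcut is used, and `kernel_size` is
    unused: no downsampling layer is applied in this simplified block.
    """

    def __init__(self, in_channels, out_channels, stride, kernel_size=3, ratio=2):
        super().__init__()
        # Ghost module
        self.ghost1 = GhostModule(in_channels, out_channels, ratio=ratio)
        # DFC attention
        self.dfc = DFCAttention(out_channels)
        self.ghost2 = GhostModule(out_channels, out_channels, ratio=ratio)
        self.shortcut = (stride == 1 and in_channels == out_channels)

    def forward(self, x):
        residual = x
        x = self.ghost1(x)
        x = self.dfc(x) * x  # attention-weighted features
        x = self.ghost2(x)
        if self.shortcut:
            x = x + residual
        return x
```
6.3 Lightweight Foundation Models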
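Lightweight foundation models are an emerging trend in 2025:
| Model | Parameters | Task |
|---|---|---|
| MobileCLIP | 50M | Image-text retrieval |
| EfficientSAM | 9M | Image segmentation |
| MobileVLM v2 | 1.7B | Visual question answering |
These models inherit capabilities from large foundation models through distillation, pruning, and quantization.
6.4 Evolution of Neural Architecture Search (NAS)
NAS trends as of 2025:
- Zero-cost NAS: training-free proxy metrics
- Differentiable NAS: the DARTS family
- One-shot NAS: weight-sharing search over a supernet
See the machine-learning/neural-architecture-search/ directory for details.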
7. Practical Guide to Lightweight CNNs
7.1 Model Selection Decision Tree
Deployment platform?
├── Mobile / edge
│   ├── Ultra-light (< 3M parameters)
│   │   └── MobileNetV3-Small / ShuffleNetV2
│   ├── Medium (3-10M)
│   │   └── MobileNetV3-Large / EfficientNet-B0
│   └── High accuracy (10-50M)
│       └── EfficientNet-B3 / ConvNeXt-Tiny
├── Server GPU
│   └── ResNet / ConvNeXt / EfficientNet-B7
└── Long sequences / global dependencies
    └── Transformer / MobileViT / EfficientFormer
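7.2 Training Tips
1. Learning rate:
Lightweight models are usually trained with a somewhat lower learning rate than large ones. Indicative starting points (the right value depends on the optimizer, batch size, and schedule):
- MobileNet: lr = 0.05
- ResNet-50: lr = 0.1
- EfficientNet: lr = 0.01
2. Progressive learning (EfficientNetV2):
Gradually increase the image size during training: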
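```python
def progressive_resize(epoch, initial_size=128, max_size=380, total_epochs=300):
    """Linearly ramp the training resolution from initial_size to max_size."""
    progress = min(max(epoch / total_epochs, 0.0), 1.0)  # clamp to [0, 1]
    return int(initial_size + (max_size - initial_size) * progress)
```
3. Distillation:
Distill a lightweight student from a large teacher:
```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Knowledge distillation loss: soft (teacher) term plus hard (label) term."""
    # KL divergence between temperature-softened distributions;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction='batchmean'
    ) * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```
7.3 Quantization and Deployment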
Post-training quantization (PTQ):
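As background, post-training quantization is generally built on an affine mapping between floating-point values and 8-bit integers, with a scale and zero point estimated from calibration data. The framework-free sketch below illustrates that mapping for a single tensor range; the helper names and the calibration samples are hypothetical, and real toolkits add per-channel parameters and more careful range estimation.

```python
def choose_qparams(x_min: float, x_max: float, qmin: int = 0, qmax: int = 255):
    """Scale and zero point that map [x_min, x_max] onto [qmin, qmax]."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)   # the range must contain zero
    scale = (x_max - x_min) / (qmax - qmin) or 1.0    # avoid a zero scale
    zero_point = int(round(qmin - x_min / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Float -> clamped unsigned 8-bit integer."""
    return max(qmin, min(qmax, int(round(x / scale)) + zero_point))


def dequantize(q, scale, zero_point):
    """Integer -> approximate float."""
    return (q - zero_point) * scale


calibration = [-1.0, -0.2, 0.0, 0.7, 3.0]  # hypothetical activation samples
scale, zp = choose_qparams(min(calibration), max(calibration))
for x in calibration:
    q = quantize(x, scale, zp)
    print(f"x={x:+.2f} -> q={q:3d} -> x_hat={dequantize(q, scale, zp):+.4f}")
```

Zero is represented exactly, which matters for zero padding and ReLU outputs, and the round-trip error of in-range values stays within one quantization step. PyTorch's eager-mode workflow automates the same steps per layer: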
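```python
import copy

import torch
import torch.ao.quantization as quant

# Assumes `model` is a ResNet-style network exposing conv1 / bn1 / relu / fc.
# Eager-mode quantization also needs QuantStub / DeQuantStub at the
# boundaries of the quantized region.
torch.backends.quantized.engine = 'qnnpack'
model.eval()
model_fp32 = copy.deepcopy(model)

# 1. Fuse Conv + BN + ReLU first (fusion requires eval mode)
model_fused = quant.fuse_modules(model_fp32, [['conv1', 'bn1', 'relu']])

# 2. Attach the quantization config
model_fused.qconfig = quant.get_default_qconfig('qnnpack')

# 3. Keep quantization-sensitive layers in floating point
model_fused.conv1.qconfig = None  # skip the first layer
model_fused.fc.qconfig = None     # skip the last layer

# 4. Insert observers
model_quant = quant.prepare(model_fused, inplace=False)

# 5. Calibrate with a small batch of representative data
# ...

# 6. Convert to int8
model_int8 = quant.convert(model_quant, inplace=False)
```
TensorRT optimization: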
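```python
import torch
import torch_tensorrt

# Assumes `model` is a trained nn.Module; TensorRT compilation needs a CUDA device
model = model.eval().cuda()

# Compile to TensorRT with a dynamic batch dimension
trt_model = torch_tensorrt.compile(
    model,
    inputs=[
        torch_tensorrt.Input(
            min_shape=[1, 3, 224, 224],
            opt_shape=[16, 3, 224, 224],
            max_shape=[32, 3, 224, 224]
        )
    ],
    enabled_precisions={torch.float16}  # FP16 acceleration
)
```
8. References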
Last updated: 2026-06-21
Footnotes
[1] Howard, A.G., et al. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861.
[2] Howard, A., et al. (2019). Searching for MobileNetV3. ICCV 2019. https://arxiv.org/pdf/1905.02244
[3] Finder, S.E., Amoyal, R., Treister, E., & Freifeld, O. (2024). Wavelet Convolutions for Large Receptive Fields. arXiv:2407.05848. https://arxiv.org/pdf/2407.05848