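Overview

As deep learning spreads across mobile devices and edge computing, lightweight CNNs have become a central topic in both research and engineering. This document surveys the following:

  1. Depthwise separable convolution: factorizes a standard convolution into a depthwise convolution and a pointwise convolution, sharply reducing parameters and computation
  2. The MobileNet family: the evolution and design philosophy of v1/v2/v3
  3. The EfficientNet family: compound scaling
  4. Other lightweight architectures: GhostNet, ShuffleNet, RegNet
  5. CNN-Transformer hybrids: EfficientFormer, MobileViT, PoolFormer
  6. Recent progress as of 2025: Wavelet Convolutions and others

The core goal of a lightweight CNN is the best achievable trade-off between accuracy and efficiency, which is what makes real-world deployment practical. [1]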


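1. Theoretical Foundations of Depthwise Separable Convolution

1.1 Computational complexity of a standard convolution

For an input $X \in \mathbb{R}^{H \times W \times C_{in}}$, an output $Y \in \mathbb{R}^{H \times W \times C_{out}}$, and a kernel $K \in \mathbb{R}^{k \times k \times C_{in} \times C_{out}}$ (stride 1, same padding, one multiply-accumulate counted as one FLOP):

$$\text{Params}_{std} = k^2 C_{in} C_{out}, \qquad \text{FLOPs}_{std} = H W k^2 C_{in} C_{out}$$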

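Example ($3 \times 3$ convolution with $H = W = 56$, $C_{in} = 128$, $C_{out} = 256$, the configuration used in Section 1.5): $9 \cdot 128 \cdot 256 = 294{,}912$ parameters and $56 \cdot 56 \cdot 294{,}912 \approx 9.25 \times 10^8$ FLOPs.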

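1.2 Depthwise separable factorization

The factorization has two steps.

Step 1 - Depthwise convolution

Each input channel is filtered independently by its own $k \times k$ kernel:

$$\hat{Y}_c = K^{dw}_c * X_c, \qquad c = 1, \dots, C_{in}$$

Parameters: $k^2 C_{in}$

Step 2 - Pointwise convolution

A standard $1 \times 1$ convolution mixes the channels:

$$Y_o = \sum_{c=1}^{C_{in}} K^{pw}_{o,c}\, \hat{Y}_c, \qquad o = 1, \dots, C_{out}$$

Parameters: $C_{in} C_{out}$

1.3 Complexity comparison

Total FLOPs:

$$\text{FLOPs}_{dsc} = H W C_{in} k^2 + H W C_{in} C_{out} = H W C_{in} (k^2 + C_{out})$$

Compression ratio relative to a standard convolution:

$$\frac{H W C_{in} (k^2 + C_{out})}{H W k^2 C_{in} C_{out}} = \frac{1}{C_{out}} + \frac{1}{k^2}$$

For $k = 3$ and $C_{out} = 256$:

$$\frac{1}{256} + \frac{1}{9} \approx 0.115$$

That is roughly an 8.7x reduction in FLOPs.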

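1.4 Geometric interpretation

A standard convolution does two things at once:

  1. Models spatial correlations (the $k \times k$ kernel)
  2. Models cross-channel correlations (a $C_{in} \to C_{out}$ linear map)

Depthwise separable convolution decouples the two tasks:

  • The depthwise convolution models only spatial correlations
  • The pointwise convolution models only cross-channel correlations

The decoupling rests on the assumption that spatial and cross-channel correlations can be learned separately, which has held up well in practice.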
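One way to make the decoupling concrete: with no nonlinearity in between, a depthwise step followed by a pointwise step is the same as a standard convolution whose kernel factorizes as K[o, c, i, j] = P[o, c] · D[c, i, j]. The sketch below checks this numerically; the tensor sizes and variable names are illustrative choices, not values from this document.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
c_in, c_out, k = 4, 6, 3
x = torch.randn(2, c_in, 8, 8)

# Depthwise kernel: one k x k filter per input channel.
dw_weight = torch.randn(c_in, 1, k, k)
# Pointwise kernel: a C_out x C_in mixing matrix stored as a 1x1 convolution.
pw_weight = torch.randn(c_out, c_in, 1, 1)

# Two-step path: per-channel spatial filtering, then channel mixing.
y_separable = F.conv2d(F.conv2d(x, dw_weight, padding=1, groups=c_in), pw_weight)

# Single standard convolution with the factorized kernel
# K[o, c, i, j] = P[o, c] * D[c, i, j] (built by broadcasting).
full_weight = pw_weight * dw_weight.view(1, c_in, k, k)
y_standard = F.conv2d(x, full_weight, padding=1)

max_abs_diff = (y_separable - y_standard).abs().max().item()
print(f"max abs difference: {max_abs_diff:.2e}")
```

The difference should sit at floating-point noise level, which shows that the separable layer spans only the subset of standard kernels that factorize this way.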

1.5 PyTorch implementation

import torch
import torch.nn as nn


class DepthwiseSeparableConv2d(nn.Module):
    """Depthwise separable convolution: depthwise k x k, then pointwise 1 x 1."""
    
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False):
        super().__init__()
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size,
            stride=stride, padding=padding, groups=in_channels, bias=bias
        )
        self.pointwise = nn.Conv2d(
            in_channels, out_channels, 1, bias=bias
        )
    
    def forward(self, x):
        x = self.depthwise(x)
        x = self.pointwise(x)
        return x
 
 
def count_flops_dsc(in_channels, out_channels, kernel_size, H, W):
    """FLOPs (multiply-accumulates) of a depthwise separable convolution."""
    dw = H * W * in_channels * kernel_size * kernel_size
    pw = H * W * in_channels * out_channels
    return dw + pw


def count_flops_standard(in_channels, out_channels, kernel_size, H, W):
    """FLOPs (multiply-accumulates) of a standard convolution."""
    return H * W * in_channels * out_channels * kernel_size * kernel_size


# Example
H, W = 56, 56
c_in, c_out = 128, 256
ks = 3

flops_std = count_flops_standard(c_in, c_out, ks, H, W)
flops_dsc = count_flops_dsc(c_in, c_out, ks, H, W)
print(f"Standard convolution FLOPs: {flops_std:,}")
print(f"Depthwise separable FLOPs: {flops_dsc:,}")
print(f"Cost ratio (separable / standard): {flops_dsc/flops_std:.3f}")

2. Evolution of the MobileNet Family

2.1 MobileNet v1 (2017)

Core innovation: depthwise separable convolution as the basic building block.

Architecture:

Input (224×224×3)
  ↓
Conv 3×3, stride=2 → 112×112×32
  ↓
DepthwiseSeparable × 13
  ↓
AvgPool 7×7
  ↓
FC 1000

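Key hyperparameters:

  • Width multiplier $\alpha \in (0, 1]$: scales the number of channels
  • Resolution multiplier $\rho \in (0, 1]$: scales the input resolution

After scaling, the cost of one depthwise separable layer becomes:

$$\text{FLOPs} = (\rho H)(\rho W)\,\alpha C_{in}\,(k^2 + \alpha C_{out})$$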
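To see how the two multipliers act on cost, the sketch below evaluates one depthwise separable layer with channels scaled by the width multiplier and spatial size scaled by the resolution multiplier. The function name and the layer sizes are illustrative assumptions.

```python
def scaled_dsc_flops(h, w, c_in, c_out, k=3, alpha=1.0, rho=1.0):
    """Multiply-accumulates of one depthwise separable layer after scaling.

    alpha scales both channel counts, rho scales both spatial dimensions.
    """
    h_s, w_s = int(rho * h), int(rho * w)
    cin_s, cout_s = int(alpha * c_in), int(alpha * c_out)
    depthwise = h_s * w_s * cin_s * k * k
    pointwise = h_s * w_s * cin_s * cout_s
    return depthwise + pointwise


base = scaled_dsc_flops(56, 56, 128, 256)
half_width = scaled_dsc_flops(56, 56, 128, 256, alpha=0.5)
low_res = scaled_dsc_flops(56, 56, 128, 256, rho=0.5)

print(f"alpha=0.5 keeps {half_width / base:.3f} of the cost")
print(f"rho=0.5   keeps {low_res / base:.3f} of the cost")
```

Halving the resolution cuts cost by exactly 4x, while halving the width cuts it by slightly less than 4x because the depthwise term scales only linearly in the width multiplier.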

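2.2 MobileNet v2 (2018)

Core innovation: linear bottlenecks and inverted residuals.

Design motivation:

ReLU destroys more information in a low-dimensional (bottleneck) space than in a high-dimensional one. MobileNet v2 therefore applies nonlinearities in the high-dimensional space and keeps the low-dimensional space linear.

Inverted residual block (the reverse of the classic ResNet bottleneck):

Classic ResNet: wide → narrow → wide (compress, then restore)
MobileNet v2: narrow → wide → narrow (expand, then compress)

class InvertedResidual(nn.Module):
    """Inverted residual block from MobileNet v2."""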
    
    def __init__(self, in_channels, out_channels, stride, expansion=6):
        super().__init__()
        hidden = in_channels * expansion
        self.use_residual = (stride == 1 and in_channels == out_channels)
        
        layers = []
        # 1. Expansion (pointwise convolution)
        if expansion != 1:
            layers.append(nn.Conv2d(in_channels, hidden, 1, bias=False))
            layers.append(nn.BatchNorm2d(hidden))
            layers.append(nn.ReLU6(inplace=True))
        
        # 2. Depthwise convolution
        layers.extend([
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True)
        ])
        
        # 3. Projection (linear pointwise convolution, no activation)
        layers.extend([
            nn.Conv2d(hidden, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels)
        ])
        
        self.conv = nn.Sequential(*layers)
    
    def forward(self, x):
        if self.use_residual:
            return x + self.conv(x)
        return self.conv(x)

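2.3 MobileNet v3 (2019)

Core innovation: combines neural architecture search (NAS), Squeeze-and-Excitation (SE) attention, and the h-swish activation. [2]

h-swish activation:

$$\text{h-swish}(x) = x \cdot \frac{\mathrm{ReLU6}(x + 3)}{6}$$

Compared with swish, $\text{swish}(x) = x \cdot \sigma(x)$, h-swish avoids computing a sigmoid and is better suited to hardware acceleration.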
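A quick numerical comparison helps show how closely the piecewise-linear h-swish, x · ReLU6(x + 3) / 6, tracks swish, x · sigmoid(x). This is a plain-Python sketch with hypothetical helper names; in PyTorch the corresponding modules are nn.Hardswish and nn.SiLU.

```python
import math


def swish(x):
    """swish(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def h_swish(x):
    """h-swish(x) = x * ReLU6(x + 3) / 6, with no exponential involved."""
    relu6 = min(max(x + 3.0, 0.0), 6.0)
    return x * relu6 / 6.0


for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  swish={swish(x):+.4f}  h-swish={h_swish(x):+.4f}")
```

Outside the interval [-3, 3] h-swish is exactly 0 or exactly x, which is what makes it cheap on integer and fixed-point hardware.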

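SE attention module:

$$s = \sigma\big(W_2\, \delta(W_1\, \mathrm{GAP}(x))\big), \qquad \tilde{x}_c = s_c \cdot x_c$$

where GAP is global average pooling, $\delta$ is ReLU, and $\sigma$ is the gating function (a hard sigmoid in MobileNet v3).

MobileNetV3-Large block:

class MobileNetV3Block(nn.Module):
    """The bneck block of MobileNet v3."""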
    
    def __init__(self, in_channels, out_channels, kernel_size, stride,
                 use_se=True, use_hs=True, expansion=6):
        super().__init__()
        self.use_residual = (stride == 1 and in_channels == out_channels)
        hidden = in_channels * expansion
        activation = nn.Hardswish() if use_hs else nn.ReLU()
        
        layers = []
        # Expansion
        if expansion != 1:
            layers.extend([
                nn.Conv2d(in_channels, hidden, 1, bias=False),
                nn.BatchNorm2d(hidden),
                activation
            ])
        
        # Depthwise convolution
        padding = (kernel_size - 1) // 2
        layers.extend([
            nn.Conv2d(hidden, hidden, kernel_size, stride=stride,
                      padding=padding, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            activation
        ])
        
        # SE attention
        if use_se:
            layers.append(SqueezeExcite(hidden))
        
        # Projection
        layers.extend([
            nn.Conv2d(hidden, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels)
        ])
        
        self.conv = nn.Sequential(*layers)
    
    def forward(self, x):
        if self.use_residual:
            return x + self.conv(x)
        return self.conv(x)
 
 
class SqueezeExcite(nn.Module):
    """Squeeze-and-Excitation module."""
    
    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = max(1, channels // reduction)
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, hidden, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1),
            nn.Hardsigmoid()  # gate in [0, 1]; Hardswish would not be a valid gate
        )
    
    def forward(self, x):
        # Rescale each channel by its gate
        return x * self.fc(x)

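2.4 MobileNet version comparison

| Feature | v1 (2017) | v2 (2018) | v3 (2019) |
|---|---|---|---|
| Core module | Depthwise separable convolution | Linear bottleneck + inverted residual | NAS + SE + h-swish |
| Activation | ReLU6 | ReLU6 | h-swish / ReLU |
| ImageNet Top-1 | 70.6% | 72.0% | 75.2% |
| FLOPs (Large) | 569M | 300M | 219M |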

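3. The EfficientNet Family

3.1 Compound scaling

Core question: given a FLOPs budget, how should a network's depth, width, and input resolution be scaled together?

Compound scaling rule (Tan & Le, ICML 2019):

$$d = \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi}$$

Constraint:

$$\alpha \cdot \beta^2 \cdot \gamma^2 \approx 2, \qquad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1$$

where $d$, $w$, and $r$ are the depth, width, and resolution multipliers and $\phi$ is the user-specified compound coefficient. Because FLOPs grow roughly in proportion to $d \cdot w^2 \cdot r^2$, each unit increase in $\phi$ roughly doubles the cost.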
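The compound scaling rule can be sketched in a few lines. The base coefficients below (1.2, 1.1, 1.15 for depth, width, resolution) are the grid-search values reported by Tan & Le; the function name is a hypothetical helper, and real EfficientNet variants additionally round channel counts and resolutions, which this sketch omits.

```python
# Base coefficients reported in the EfficientNet paper (depth, width, resolution).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return (depth, width, resolution, FLOPs) multipliers for coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs scale roughly with depth * width^2 * resolution^2.
    flops = (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs x{f:.2f}")
```

The per-step FLOPs factor is about 1.92 rather than exactly 2, which is why the constraint is stated as an approximation.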

3.2 The EfficientNet-B0 baseline

class MBConvBlock(nn.Module):
    """EfficientNet's MBConv block (MobileNet v2 inverted residual + SE)."""
    
    def __init__(self, in_channels, out_channels, kernel_size, stride,
                 expand_ratio, se_ratio=0.25, drop_rate=0.0):
        super().__init__()
        self.use_residual = (stride == 1 and in_channels == out_channels)
        hidden = in_channels * expand_ratio
        self.drop_rate = drop_rate
        
        layers = []
        # Expansion (only when expand_ratio != 1)
        if expand_ratio != 1:
            layers.extend([
                nn.Conv2d(in_channels, hidden, 1, bias=False),
                nn.BatchNorm2d(hidden),
                nn.SiLU()  # Swish
            ])
        
        # Depthwise convolution
        padding = (kernel_size - 1) // 2
        layers.extend([
            nn.Conv2d(hidden, hidden, kernel_size, stride=stride,
                      padding=padding, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU()
        ])
        
        # SE: the squeeze width follows the block's input channels. SqueezeExcite
        # takes a reduction factor, so the target width is converted into one.
        if se_ratio > 0:
            se_hidden = max(1, int(in_channels * se_ratio))
            layers.append(SqueezeExcite(hidden, reduction=max(1, hidden // se_hidden)))
        
        # Projection
        layers.extend([
            nn.Conv2d(hidden, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels)
        ])
        
        self.conv = nn.Sequential(*layers)
    
    def forward(self, x):
        out = self.conv(x)
        if self.use_residual:
            if self.drop_rate > 0 and self.training:
                out = self._drop_connect(out)
            out = out + x
        return out
    
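    def _drop_connect(self, x):
        """Stochastic depth (named drop connect in the EfficientNet code):
        zero the whole residual branch for randomly chosen samples."""
        keep_prob = 1.0 - self.drop_rate
        mask = torch.empty((x.size(0), 1, 1, 1), dtype=x.dtype,
                           device=x.device).bernoulli_(keep_prob)
        return x * mask / keep_prob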

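3.3 The eight EfficientNet variants

| Model | φ | Resolution | FLOPs | Top-1 |
|---|---|---|---|---|
| B0 | 0 | 224 | 390M | 77.3% |
| B1 | 1 | 240 | 700M | 79.1% |
| B2 | 2 | 260 | 1.0G | 80.1% |
| B3 | 3 | 300 | 1.8G | 81.6% |
| B4 | 4 | 380 | 4.2G | 82.9% |
| B5 | 5 | 456 | 9.9G | 83.6% |
| B6 | 6 | 528 | 19.0G | 84.0% |
| B7 | 7 | 600 | 37.0G | 84.3% |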

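3.4 EfficientNetV2 (2021)

Improvements:

  • Fused-MBConv: in early stages, a regular 3×3 convolution replaces the 1×1 expansion plus depthwise convolution
  • Progressive learning: the image size grows gradually during training
  • Adaptive regularization: regularization strength is increased together with the image size

class FusedMBConv(nn.Module):
    """Fused-MBConv from EfficientNetV2 (used in early stages)."""
    
    def __init__(self, in_channels, out_channels, stride, expand_ratio=1):
        super().__init__()
        hidden = in_channels * expand_ratio
        self.use_residual = (stride == 1 and in_channels == out_channels)
        
        if expand_ratio != 1:
            # Fused 3x3 expansion, then a linear 1x1 projection
            layers = [
                nn.Conv2d(in_channels, hidden, 3, stride=stride,
                          padding=1, bias=False),
                nn.BatchNorm2d(hidden),
                nn.SiLU(),
                nn.Conv2d(hidden, out_channels, 1, bias=False),
                nn.BatchNorm2d(out_channels)
            ]
        else:
            # No expansion: a single 3x3 convolution does all the work,
            # so the stride is still applied
            layers = [
                nn.Conv2d(in_channels, out_channels, 3, stride=stride,
                          padding=1, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.SiLU()
            ]
        
        self.conv = nn.Sequential(*layers)
    
    def forward(self, x):
        out = self.conv(x)
        if self.use_residual:
            return out + x
        return out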

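4. Other Efficient Architectures

4.1 GhostNet (2020)

Key observation: CNN feature maps contain substantial redundancy (many maps are near-copies of one another). GhostNet generates these redundant maps with cheap operations instead of full convolutions.

Ghost module: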

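class GhostModule(nn.Module):
    """Ghost module: cheap operations generate the redundant feature maps."""
    
    def __init__(self, in_channels, out_channels, kernel_size=1, ratio=2, dw_size=3):
        super().__init__()
        self.out_channels = out_channels
        # Round up so that init_channels * ratio >= out_channels
        init_channels = -(-out_channels // ratio)
        # A multiple of init_channels keeps the grouped convolution valid
        new_channels = init_channels * (ratio - 1)
        
        # Step 1: a regular convolution (1x1 by default) produces the intrinsic features
        self.primary_conv = nn.Sequential(
            nn.Conv2d(in_channels, init_channels, kernel_size,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(init_channels),
            nn.ReLU(inplace=True)
        )
        
        # Step 2: a cheap linear transform (depthwise convolution) produces the ghost features
        self.cheap_operation = nn.Sequential(
            nn.Conv2d(init_channels, new_channels, dw_size, padding=dw_size//2,
                      groups=init_channels, bias=False),
            nn.BatchNorm2d(new_channels),
            nn.ReLU(inplace=True)
        )
    
    def forward(self, x):
        x1 = self.primary_conv(x)
        x2 = self.cheap_operation(x1)
        out = torch.cat([x1, x2], dim=1)
        # Drop surplus channels introduced by rounding up
        return out[:, :self.out_channels]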

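Complexity comparison for $C_{out}$ output channels (regular convolution vs Ghost module, where $s$ is the ratio and $d$ is the depthwise kernel size):

$$\frac{\text{FLOPs}_{std}}{\text{FLOPs}_{ghost}} = \frac{H W C_{in} C_{out} k^2}{H W C_{in} \frac{C_{out}}{s} k^2 + H W (s-1) \frac{C_{out}}{s} d^2} \approx s \quad (d \approx k,\ s \ll C_{in})$$

Typically $s = 2$, which cuts FLOPs by roughly 50%.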
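The speed-up can be checked with a small cost calculator: the Ghost module pays for a full convolution on only 1/s of the output channels, plus a depthwise d × d pass for the remaining ghost channels. The function names and layer sizes below are illustrative assumptions, and s is assumed to divide the output channel count.

```python
def standard_conv_flops(h, w, c_in, c_out, k):
    """Multiply-accumulates of a regular k x k convolution."""
    return h * w * c_in * c_out * k * k


def ghost_module_flops(h, w, c_in, c_out, k, s=2, d=3):
    """Multiply-accumulates of a Ghost module (assumes s divides c_out)."""
    intrinsic = c_out // s
    primary = h * w * c_in * intrinsic * k * k     # regular conv on 1/s of the channels
    cheap = h * w * intrinsic * (s - 1) * d * d    # depthwise conv for the ghost channels
    return primary + cheap


std = standard_conv_flops(28, 28, 128, 256, k=1)
ghost = ghost_module_flops(28, 28, 128, 256, k=1, s=2, d=3)
speedup = std / ghost
print(f"standard: {std:,}  ghost: {ghost:,}  speed-up: {speedup:.2f}x")
```

With a 1×1 primary convolution and a 3×3 cheap operation the speed-up lands a little below s, because the depthwise term is not negligible at this channel count.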

4.2 ShuffleNet (2018)

Core idea: use channel shuffle to break the information isolation between groups that grouped convolution creates.

class ShuffleBlock(nn.Module):
    """ShuffleNet v1 unit. All channel counts must be divisible by `groups`."""
    
    def __init__(self, in_channels, out_channels, stride, groups=3):
        super().__init__()
        self.stride = stride
        self.groups = groups
        if stride == 2:
            out_channels -= in_channels  # the branch is concatenated with the shortcut, not added
        
        bottleneck = out_channels // 4
        # Grouped 1x1 convolution; channel shuffle later mixes its groups
        self.conv1 = nn.Conv2d(in_channels, bottleneck, 1, groups=groups, bias=False)
        self.bn1 = nn.BatchNorm2d(bottleneck)
        
        self.dwconv = nn.Conv2d(bottleneck, bottleneck, 3,
                                stride=stride, padding=1,
                                groups=bottleneck, bias=False)
        self.bn2 = nn.BatchNorm2d(bottleneck)
        
        self.conv3 = nn.Conv2d(bottleneck, out_channels, 1, groups=groups, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        
        # With stride 2 the shortcut is downsampled so both branches share a resolution
        self.shortcut = nn.AvgPool2d(3, stride=2, padding=1) if stride == 2 else nn.Identity()
    
    def channel_shuffle(self, x, groups):
        """Interleave channels across groups."""
        B, C, H, W = x.shape
        x = x.view(B, groups, C // groups, H, W)
        x = x.transpose(1, 2).contiguous()
        return x.view(B, C, H, W)
    
    def forward(self, x):
        identity = self.shortcut(x)
        
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.channel_shuffle(out, self.groups)
        out = self.bn2(self.dwconv(out))
        out = self.bn3(self.conv3(out))
        
        if self.stride == 1:
            out = out + identity  # requires in_channels == out_channels
        else:
            out = torch.cat([identity, out], dim=1)
        return torch.relu(out)
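The reshape-transpose-flatten trick inside channel_shuffle is easier to follow on plain channel indices. The sketch below reproduces the same permutation without tensors; the helper name is hypothetical.

```python
def channel_shuffle_indices(num_channels, groups):
    """Return the channel order produced by view(groups, n) -> transpose -> flatten."""
    assert num_channels % groups == 0, "channels must be divisible by groups"
    per_group = num_channels // groups
    # Output position (i, g) reads the i-th channel of group g.
    return [g * per_group + i for i in range(per_group) for g in range(groups)]


order = channel_shuffle_indices(6, 3)
print(order)  # [0, 2, 4, 1, 3, 5]
```

After the shuffle, every block of consecutive channels contains one channel from each original group, so the next grouped convolution sees information from all groups.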

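4.3 RegNet (2020)

Core contribution: derives network design principles by analyzing entire design spaces rather than individual networks.

Key design constraints (from large-scale experiments):

  1. Shared bottleneck ratio $b = 1$ (bottleneck width = block width)
  2. Shared group width $g$
  3. Depth of about 20 blocks
  4. Linearly parameterized widths $u_j = w_0 + w_a \cdot j$ (where $j$ is the block index)

class RegNetBlock(nn.Module):
    """RegNet block parameterized by the design-space variables."""
    
    def __init__(self, in_channels, out_channels, stride, group_width):
        super().__init__()
        bottleneck = out_channels  # bottleneck ratio of 1
        
        # The width must be a multiple of the group width
        assert bottleneck % group_width == 0
        groups = bottleneck // group_width
        
        self.conv1 = nn.Conv2d(in_channels, bottleneck, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(bottleneck)
        self.conv2 = nn.Conv2d(bottleneck, bottleneck, 3, stride=stride,
                               padding=1, groups=groups, bias=False)
        self.bn2 = nn.BatchNorm2d(bottleneck)
        self.conv3 = nn.Conv2d(bottleneck, out_channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.use_skip = (stride == 1 and in_channels == out_channels)
    
    def forward(self, x):
        identity = x
        out = torch.relu(self.bn1(self.conv1(x)))
        out = torch.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        if self.use_skip:
            out = out + identity
        # ResNet-style activation after the residual sum
        return torch.relu(out)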
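The linear width rule gives a different width for every block, so it is followed by a quantization step that snaps widths to a few shared values. The sketch below follows the procedure described in the RegNet paper (snap to powers of a multiplier w_m, then round to a multiple of 8); that quantization step and the parameter values used here are not stated in this document and should be read as illustrative assumptions.

```python
import math


def regnet_widths(w0, wa, wm, depth, multiple=8):
    """Per-block widths from the linear rule u_j = w0 + wa * j, then quantized."""
    widths = []
    for j in range(depth):
        u_j = w0 + wa * j                                 # linear parameterization
        s_j = round(math.log(u_j / w0) / math.log(wm))    # nearest power of wm
        w_j = w0 * wm ** s_j                              # piecewise-constant width
        widths.append(int(round(w_j / multiple) * multiple))
    return widths


widths = regnet_widths(w0=24, wa=36.44, wm=2.49, depth=13)
print(widths)
print("stage widths:", sorted(set(widths)))
```

Consecutive blocks that share a width form one stage, so the number of distinct values is the number of stages and the run lengths are the stage depths.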

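5. CNN-Transformer Hybrid Architectures

5.1 Design motivation

CNNs and Transformers have complementary strengths:

  • CNN: locality, strong inductive bias, computational efficiency
  • Transformer: global dependencies, long-range relations, scalability

Hybrid architectures try to combine the strengths of both.

5.2 MobileViT (2021)

Core idea: use a Transformer as the "global" complement to convolution.

MobileViT block: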

class MobileViTBlock(nn.Module):
    """MobileViT block: local convolution + global Transformer."""
    
    def __init__(self, in_channels, transformer_dim, patch_size=(2, 2)):
        super().__init__()
        self.patch_size = patch_size
        # Local representation
        self.local_rep = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        # Project to the Transformer dimension (must be divisible by nhead)
        self.proj_in = nn.Conv2d(in_channels, transformer_dim, 1)
        # Transformer layers
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                d_model=transformer_dim, nhead=4, dim_feedforward=transformer_dim*2,
                batch_first=True
            ),
            num_layers=2
        )
        # Project back to the input width
        self.proj_out = nn.Conv2d(transformer_dim, in_channels, 1)
        # Fusion
        self.fuse = nn.Conv2d(2 * in_channels, in_channels, 1)
    
    def forward(self, x):
        local = self.local_rep(x)
        B, _, H, W = x.shape  # H and W must be divisible by the patch size
        ph, pw = self.patch_size
        nh, nw = H // ph, W // pw
        x_t = self.proj_in(local)  # (B, D, H, W)
        D = x_t.shape[1]
        # Unfold: pixels at the same offset inside every patch form one sequence,
        # (B, D, H, W) -> (B * ph * pw, nh * nw, D), so the last dim matches d_model
        x_t = x_t.reshape(B, D, nh, ph, nw, pw).permute(0, 3, 5, 2, 4, 1)
        x_t = x_t.reshape(B * ph * pw, nh * nw, D)
        # Attention runs across patches
        x_t = self.transformer(x_t)
        # Fold back to the spatial layout: (B * ph * pw, nh * nw, D) -> (B, D, H, W)
        x_t = x_t.reshape(B, ph, pw, nh, nw, D).permute(0, 5, 3, 1, 4, 2)
        x_t = x_t.reshape(B, D, H, W)
        x_t = self.proj_out(x_t)
        # Fuse local and global features
        return self.fuse(torch.cat([local, x_t], dim=1))

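5.3 EfficientFormer (2022)

Core idea: design an efficient Vision Transformer that is optimized for measured latency rather than for FLOPs.

Key design choices:

  • Dimension-consistent design: avoids reshape operations (saving memory)
  • Latency-driven slimming: prunes the architecture according to on-device latency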
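Latency-driven design presupposes a way to measure latency on the target device. The sketch below is a generic timing harness built on the standard library, not EfficientFormer's own tooling; the function names and the dummy workload are hypothetical. For a GPU model, the callable would additionally need to synchronize the device before the timer is read.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=5, runs=20):
    """Median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):      # warm-up runs absorb caching and lazy-init effects
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)  # the median is robust to occasional stalls


def dummy_workload():
    """Stand-in for a model forward pass."""
    return sum(i * i for i in range(10_000))


latency = measure_latency_ms(dummy_workload)
print(f"median latency: {latency:.3f} ms")
```

Two candidates with equal FLOPs can rank differently under such a measurement, which is the motivation for optimizing latency directly.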

5.4 PoolFormer (2022)

Surprising finding: replacing attention with a simple pooling operation still yields competitive performance.

PoolFormer block:

class PoolFormerBlock(nn.Module):
    """PoolFormer block: pooling as the token mixer."""
    
    def __init__(self, dim, pool_size=3):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)
        # Average pooling mixes tokens along the spatial dimensions only
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size//2,
                                 count_include_pad=False)
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, 4 * dim, 1),
            nn.GELU(),
            nn.Conv2d(4 * dim, dim, 1)
        )
    
    def forward(self, x):
        # x: (B, C, H, W); GroupNorm, AvgPool2d and Conv2d all expect channels first
        # Pooling takes the place of attention
        residual = x
        x = self.norm1(x)
        x = self.pool(x) - x  # subtract the input because the residual adds it back
        x = residual + x
        
        # Channel MLP
        residual = x
        x = self.norm2(x)
        x = self.mlp(x)
        return residual + x

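Insight: the success of Transformers may stem mainly from the general architecture (residual connections + MLP) rather than from the attention mechanism itself.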


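6. Frontier Developments as of 2025

6.1 Wavelet Convolutions

Finder et al. (2024) propose wavelet convolutions, which build convolution kernels that have a global receptive field yet are sparsely activated. [3]

Mathematical construction: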

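The wavelet kernel is defined across multiple scales:

$$K = \sum_{j} \sum_{n} a_{j,n}\, \psi_{j,n}$$

where $\psi_{j,n}$ is the wavelet basis function at scale $j$ and translation $n$, and $a_{j,n}$ are the corresponding coefficients.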

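Advantages:

  • Global receptive field (as in ViT)
  • Sparse activation (computationally efficient)
  • Multi-scale analysis

6.2 GhostNetV2 (2022)

GhostNetV2 introduces decoupled fully connected (DFC) attention to capture long-range dependencies: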

class GhostNetV2Block(nn.Module):
    """GhostNetV2 block: DFC attention + Ghost modules."""
    
    def __init__(self, in_channels, out_channels, stride, kernel_size=3, ratio=2):
        super().__init__()
        # Ghost module
        self.ghost1 = GhostModule(in_channels, out_channels, ratio=ratio)
        # DFC attention (DFCAttention must be defined separately)
        self.dfc = DFCAttention(out_channels)
        self.ghost2 = GhostModule(out_channels, out_channels, ratio=ratio)
        # This simplified block does not downsample; `stride` only gates the shortcut
        self.shortcut = (stride == 1 and in_channels == out_channels)
    
    def forward(self, x):
        residual = x
        x = self.ghost1(x)
        x = self.dfc(x) * x  # attention weighting
        x = self.ghost2(x)
        if self.shortcut:
            x = x + residual
        return x
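GhostNetV2Block refers to a DFCAttention class that this document does not define. The following is a minimal sketch of one plausible version, assuming it should return a gate in [0, 1] with the same shape as its input. It loosely follows the idea described for GhostNetV2 (downsample, aggregate along rows and then columns with depthwise convolutions, sigmoid, upsample); the kernel size of 5 and the exact layer layout are assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DFCAttention(nn.Module):
    """Sketch of decoupled fully connected attention (layer layout is an assumption)."""

    def __init__(self, channels, kernel_size=5):
        super().__init__()
        pad = kernel_size // 2
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            # Horizontal aggregation: depthwise 1 x K convolution
            nn.Conv2d(channels, channels, (1, kernel_size), padding=(0, pad),
                      groups=channels, bias=False),
            nn.BatchNorm2d(channels),
            # Vertical aggregation: depthwise K x 1 convolution
            nn.Conv2d(channels, channels, (kernel_size, 1), padding=(pad, 0),
                      groups=channels, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        # Compute the attention map at half resolution to keep it cheap
        reduced = F.avg_pool2d(x, kernel_size=2, stride=2, ceil_mode=True)
        gate = torch.sigmoid(self.body(reduced))
        # Upsample the gate back to the input resolution
        return F.interpolate(gate, size=(h, w), mode="nearest")


attn = DFCAttention(16).eval()
x = torch.randn(2, 16, 14, 14)
with torch.no_grad():
    mask = attn(x)
print(mask.shape)
```

Splitting the aggregation into a horizontal and a vertical pass is what keeps the cost linear in the feature-map side length, in contrast to full self-attention.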

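6.3 Lightweight foundation models

By 2025, a trend toward lightweight foundation models had emerged:

| Model | Parameters | Task |
|---|---|---|
| MobileCLIP | 50M | Image-text retrieval |
| EfficientSAM | 9M | Image segmentation |
| MobileVLM v2 | 1.7B | Visual question answering |

These models inherit capabilities from large foundation models through distillation, pruning, and quantization.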

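6.4 Evolution of neural architecture search (NAS)

NAS trends as of 2025:

  1. Zero-cost NAS: training-free proxy metrics
  2. Differentiable NAS: the DARTS family
  3. One-shot NAS (supernet-based): weight-sharing search

See the machine-learning/neural-architecture-search/ directory for details.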


7. Practical Guide to Lightweight CNNs

7.1 Model selection decision tree

Deployment platform?
├── Mobile / edge
│   ├── Ultra-light (< 3M parameters)
│   │   └── MobileNetV3-Small / ShuffleNetV2
│   ├── Medium (3-10M)
│   │   └── MobileNetV3-Large / EfficientNet-B0
│   └── High accuracy (10-50M)
│       └── EfficientNet-B3 / ConvNeXt-Tiny
├── Server GPU
│   └── ResNet / ConvNeXt / EfficientNet-B7
└── Long sequences / global dependencies
    └── Transformer / MobileViT / EfficientFormer

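7.2 Training tips

1. Learning rate

Lightweight models are usually trained with a somewhat lower learning rate than large models. Typical initial values (they depend on the optimizer and batch size, so treat them as starting points):

  • MobileNet: lr = 0.05
  • ResNet-50: lr = 0.1
  • EfficientNet: lr = 0.01

2. Progressive learning (EfficientNetV2)

Gradually increase the image size during training:

def progressive_resize(epoch, initial_size=128, max_size=380, total_epochs=300):
    """Linearly grow the training resolution from initial_size to max_size."""
    # Clamp progress so the size never leaves [initial_size, max_size]
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    size = initial_size + (max_size - initial_size) * progress
    return int(size)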

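3. Distillation

Distill a lightweight student from a large teacher:

import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Knowledge distillation loss: temperature-softened KL term plus hard-label cross-entropy."""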
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction='batchmean'
    ) * (T * T)
    
    hard_loss = F.cross_entropy(student_logits, labels)
    
    return alpha * soft_loss + (1 - alpha) * hard_loss

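7.3 Quantization and deployment

Post-training quantization (PTQ):

import copy

import torch
import torch.ao.quantization as quant

# Assumes `model` is a ResNet-style network with conv1 / bn1 / relu / fc attributes
# and QuantStub / DeQuantStub placed around the region to be quantized
torch.backends.quantized.engine = 'qnnpack'
model.eval()
model_fp32 = copy.deepcopy(model)

# Fuse Conv + BN + ReLU first (fusion for inference requires eval mode)
model_fused = quant.fuse_modules(model_fp32, [['conv1', 'bn1', 'relu']])

# Attach the quantization config
model_fused.qconfig = quant.get_default_qconfig('qnnpack')

# Keep quantization-sensitive layers in FP32
model_fused.conv1.qconfig = None  # skip the first layer
model_fused.fc.qconfig = None     # skip the last layer

# Insert observers
model_quant = quant.prepare(model_fused, inplace=False)
# Calibrate with a small batch of representative data
# ... 
model_int8 = quant.convert(model_quant, inplace=False)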
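What calibration and conversion do per tensor can be illustrated with plain affine quantization: pick a scale and zero point from the observed value range, round to 8-bit integers, and map back. This is a conceptual sketch with hypothetical helper names, not the code path PyTorch uses internally.

```python
def choose_qparams(values, qmin=0, qmax=255):
    """Derive scale and zero point from the observed range (always including 0)."""
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0   # fall back to 1.0 for an all-zero tensor
    zero_point = round(qmin - lo / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map floats to clamped integers: q = round(v / scale) + zero_point."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to floats: v = (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in q_values]


activations = [-1.0, -0.25, 0.0, 0.5, 2.0]
scale, zero_point = choose_qparams(activations)
q = quantize(activations, scale, zero_point)
restored = dequantize(q, scale, zero_point)
max_error = max(abs(a - r) for a, r in zip(activations, restored))
print(f"scale={scale:.5f} zero_point={zero_point} q={q} max_error={max_error:.5f}")
```

The round-trip error is bounded by about half the scale, which is why layers with a wide dynamic range (often the first and last) are the usual candidates for staying in FP32.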

TensorRT optimization:

import torch
import torch_tensorrt
 
# Compile to TensorRT (the model must be in eval mode and on the GPU)
trt_model = torch_tensorrt.compile(
    model.eval().cuda(),
    inputs=[
        torch_tensorrt.Input(
            min_shape=[1, 3, 224, 224],
            opt_shape=[16, 3, 224, 224],
            max_shape=[32, 3, 224, 224]
        )
    ],
    enabled_precisions={torch.float16}  # FP16 acceleration
)

8. References

Last updated: 2026-06-21

Footnotes

  1. Howard, A.G., et al. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861.

  2. Howard, A., et al. (2019). Searching for MobileNetV3. ICCV 2019. https://arxiv.org/pdf/1905.02244

  3. Finder, S.E., Amoyal, R., Treister, E., & Freifeld, O. (2024). Wavelet Convolutions for Large Receptive Fields. arXiv:2407.05848. https://arxiv.org/pdf/2407.05848