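Lightweight CNN Architectures

Overview

As deep learning spreads to mobile devices and edge computing, lightweight CNNs have become a central topic in both research and engineering. This document covers:

Depthwise separable convolution: factorizes a standard convolution into a depthwise convolution and a pointwise convolution, sharply reducing parameter count and computation
MobileNet family: the evolution and design philosophy of v1/v2/v3
EfficientNet family: the compound scaling strategy
Other lightweight architectures: GhostNet, ShuffleNet, RegNet
CNN-Transformer hybrids: EfficientFormer, MobileViT, PoolFormer
Progress as of 2025: Wavelet Convolutions and related work

The core goal of a lightweight CNN is the best achievable trade-off between accuracy and efficiency, which largely decides whether a model can be deployed in practice.¹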

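I. Theoretical Foundations of Depthwise Separable Convolution

1.1 Computational Cost of a Standard Convolution

For an input of size $H \times W \times C_{in}$, an output of size $H \times W \times C_{out}$, and a $K \times K$ kernel (FLOPs are counted here as multiply-accumulate operations):

FLOPs_{standard} = H \times W \times C_{in} \times C_{out} \times K^{2}

Example ($3 \times 3$ convolution, $C_{in} = C_{out} = 256$):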

FLOPs = H \times W \times 256 \times 256 \times 9 = H \times W \times 589824

1.2 Depthwise Separable Factorization

The factorization has two steps.

Step 1 - Depthwise convolution:

Each input channel is filtered independently by its own $K \times K$ kernel:

Y_{dw} [i, j, c] = \sum_{u, v} W_{dw} [u, v, c] \cdot X [i + u, j + v, c]

Parameters: $K^{2} \times C_{in}$

Step 2 - Pointwise convolution:

A standard $1 \times 1$ convolution mixes the channels:

Y_{pw} [i, j, c_{out}] = \sum_{c_{in}} W_{pw} [1, 1, c_{in}, c_{out}] \cdot Y_{dw} [i, j, c_{in}]

Parameters: $1 \times 1 \times C_{in} \times C_{out} = C_{in} \times C_{out}$

1.3 Cost Comparison

Total FLOPs:

FLOPs_{dsc} = H \times W \times C_{in} \times K^{2} + H \times W \times C_{in} \times C_{out}

Cost ratio:

\frac{FLOPs _{dsc}}{FLOPs _{standard}} = \frac{1}{C _{out}} + \frac{1}{K ^{2}}

For $K = 3, C_{out} = 256$:

\frac{1}{256} + \frac{1}{9} \approx 0.115

That is roughly 8.7 times fewer FLOPs.
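The ratio follows directly from dividing the two FLOP counts of Sections 1.1 and 1.3; the common factor $H \times W \times C_{in}$ cancels. Written out step by step, using only the symbols already defined:

```latex
\frac{FLOPs_{dsc}}{FLOPs_{standard}}
  = \frac{H W C_{in} K^{2} + H W C_{in} C_{out}}{H W C_{in} C_{out} K^{2}}
  = \frac{K^{2} + C_{out}}{C_{out} K^{2}}
  = \frac{1}{C_{out}} + \frac{1}{K^{2}}
```

For wide layers the $1 / K^{2}$ term dominates, so a $3 \times 3$ kernel can save at most a factor of 9 regardless of the channel count.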

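1.4 Geometric Interpretation

A standard convolution does two things at once:

Models spatial correlations (the $K \times K$ kernel)
Models cross-channel correlations (the linear map $C_{in} \to C_{out}$)

Depthwise separable convolution decouples the two tasks:

The depthwise convolution models spatial correlations only
The pointwise convolution models cross-channel correlations only

The decoupling assumes that spatial and cross-channel correlations can be learned separately, an assumption that has held up well in practice.

1.5 PyTorch Implementation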

import torch
import torch.nn as nn
 
 
class DepthwiseSeparableConv2d(nn.Module):
    """Depthwise separable convolution: depthwise KxK conv followed by pointwise 1x1 conv."""
    
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False):
        super().__init__()
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size,
            stride=stride, padding=padding, groups=in_channels, bias=bias
        )
        self.pointwise = nn.Conv2d(
            in_channels, out_channels, 1, bias=bias
        )
    
    def forward(self, x):
        x = self.depthwise(x)
        x = self.pointwise(x)
        return x
 
 
def count_flops_dsc(in_channels, out_channels, kernel_size, H, W):
    """FLOPs (multiply-accumulates) of a depthwise separable convolution."""
    dw = H * W * in_channels * kernel_size * kernel_size
    pw = H * W * in_channels * out_channels
    return dw + pw


def count_flops_standard(in_channels, out_channels, kernel_size, H, W):
    """FLOPs (multiply-accumulates) of a standard convolution."""
    return H * W * in_channels * out_channels * kernel_size * kernel_size


# Example
H, W = 56, 56
c_in, c_out = 128, 256
ks = 3

flops_std = count_flops_standard(c_in, c_out, ks, H, W)
flops_dsc = count_flops_dsc(c_in, c_out, ks, H, W)
print(f"Standard conv FLOPs: {flops_std:,}")
print(f"Depthwise separable FLOPs: {flops_dsc:,}")
print(f"Cost ratio: {flops_dsc/flops_std:.3f}")

II. Evolution of the MobileNet Family

2.1 MobileNet v1 (2017)

Key idea: use depthwise separable convolution as the basic building block.

Architecture:

Input (224×224×3)
  ↓
Conv 3×3, stride=2 → 112×112×32
  ↓
DepthwiseSeparable × 13
  ↓
AvgPool 7×7
  ↓
FC 1000

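Key hyperparameters:

Width multiplier $α \in (0, 1]$: scales the number of channels
Resolution multiplier $ρ \in (0, 1]$: scales the input size

After scaling (the dominant pointwise term has both its input and output channels scaled by $α$, so the cost falls roughly quadratically in $α$):

FLOPs_{α, ρ} \approx α^{2} ρ^{2} \cdot FLOPs_{base}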
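The effect of the two multipliers can be checked numerically. The sketch below counts multiply-accumulates for a single depthwise separable layer; the function name `dsc_layer_macs` and the layer sizes are illustrative assumptions, not values from the MobileNet paper. Because the pointwise term dominates and both its input and output channels are thinned, the cost falls roughly with $α^{2} ρ^{2}$.

```python
def dsc_layer_macs(c_in, c_out, k, feat, alpha=1.0, rho=1.0):
    """MACs of one depthwise separable layer under width/resolution multipliers."""
    c_in_s = int(alpha * c_in)    # thinned input channels
    c_out_s = int(alpha * c_out)  # thinned output channels
    feat_s = int(rho * feat)      # reduced feature-map side length
    depthwise = feat_s * feat_s * c_in_s * k * k
    pointwise = feat_s * feat_s * c_in_s * c_out_s
    return depthwise + pointwise


# Hypothetical layer: 512 -> 512 channels, 3x3 kernel, 14x14 feature map
base = dsc_layer_macs(512, 512, 3, 14)
scaled = dsc_layer_macs(512, 512, 3, 14, alpha=0.5, rho=0.5)
ratio = scaled / base
print(f"measured ratio: {ratio:.4f}, alpha^2 * rho^2 = {0.5**2 * 0.5**2:.4f}")
```

The measured ratio sits slightly above $α^{2} ρ^{2}$ because the depthwise term scales only linearly in $α$.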

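2.2 MobileNet v2 (2018)

Key idea: linear bottlenecks and inverted residuals.

Design motivation:

ReLU destroys more information in low-dimensional (bottleneck) spaces. MobileNet v2 therefore applies non-linearities in the high-dimensional space and keeps the low-dimensional space linear.

Inverted residual block (the reverse of the classic ResNet bottleneck):

Classic ResNet: $C \to C /4 \to C$ (compress, then restore)
MobileNet v2: $C \to 6 C \to C$ (expand, then compress)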

class InvertedResidual(nn.Module):
    """Inverted residual block from MobileNet v2."""
    
    def __init__(self, in_channels, out_channels, stride, expansion=6):
        super().__init__()
        hidden = in_channels * expansion
        self.use_residual = (stride == 1 and in_channels == out_channels)
        
        layers = []
        # 1. Expansion (pointwise convolution)
        if expansion != 1:
            layers.append(nn.Conv2d(in_channels, hidden, 1, bias=False))
            layers.append(nn.BatchNorm2d(hidden))
            layers.append(nn.ReLU6(inplace=True))
        
        # 2. Depthwise convolution
        layers.extend([
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True)
        ])
        
        # 3. Projection (linear pointwise convolution, no activation)
        layers.extend([
            nn.Conv2d(hidden, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels)
        ])
        
        self.conv = nn.Sequential(*layers)
    
    def forward(self, x):
        if self.use_residual:
            return x + self.conv(x)
        return self.conv(x)

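2.3 MobileNet v3 (2019)

Key idea: combine neural architecture search (NAS), Squeeze-and-Excitation (SE) attention, and the h-swish activation.²

h-swish activation:

h-swish (x) = x \cdot \frac{ReLU6 ( x + 3 )}{6}

Compared with swish, $σ (x) \cdot x$, h-swish avoids evaluating a sigmoid, which makes it cheaper on mobile hardware and friendlier to quantization.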
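To make the approximation concrete, the following sketch evaluates h-swish and swish on a few scalar inputs using only the standard library. The helper names are illustrative; in PyTorch the corresponding modules are `nn.Hardswish` and `nn.SiLU`.

```python
import math


def relu6(x):
    """ReLU clipped at 6."""
    return min(max(x, 0.0), 6.0)


def h_swish(x):
    """Piecewise-linear approximation of swish: x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0


def swish(x):
    """swish(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  h-swish={h_swish(x):+.4f}  swish={swish(x):+.4f}")
```

h-swish is exactly zero for $x \leq -3$ and exactly $x$ for $x \geq 3$; between those points it tracks swish closely while needing only an add, a clamp, and a multiply.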

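SE attention module:

F_{se} = σ (W_{2} \cdot δ (W_{1} \cdot GAP (F))) ⊙ F

Here $GAP$ is global average pooling, $δ$ is ReLU, and $σ$ is the gating function (a hard sigmoid in MobileNet v3).

A full MobileNetV3 bneck block: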

class MobileNetV3Block(nn.Module):
    """The bneck block of MobileNet v3."""
    
    def __init__(self, in_channels, out_channels, kernel_size, stride,
                 use_se=True, use_hs=True, expansion=6):
        super().__init__()
        self.use_residual = (stride == 1 and in_channels == out_channels)
        hidden = in_channels * expansion
        activation = nn.Hardswish() if use_hs else nn.ReLU()
        
        layers = []
        # Expansion
        if expansion != 1:
            layers.extend([
                nn.Conv2d(in_channels, hidden, 1, bias=False),
                nn.BatchNorm2d(hidden),
                activation
            ])
        
        # Depthwise convolution
        padding = (kernel_size - 1) // 2
        layers.extend([
            nn.Conv2d(hidden, hidden, kernel_size, stride=stride,
                      padding=padding, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            activation
        ])
        
        # SE attention
        if use_se:
            layers.append(SqueezeExcite(hidden))
        
        # Projection
        layers.extend([
            nn.Conv2d(hidden, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels)
        ])
        
        self.conv = nn.Sequential(*layers)
    
    def forward(self, x):
        if self.use_residual:
            return x + self.conv(x)
        return self.conv(x)
 
 
class SqueezeExcite(nn.Module):
    """Squeeze-and-Excitation module."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = max(1, channels // reduction)
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, hidden, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1),
            # The gate must lie in [0, 1]: use hard sigmoid, not h-swish (which is unbounded)
            nn.Hardsigmoid()
        )

    def forward(self, x):
        # Rescale each channel by its learned gate
        return x * self.fc(x)

2.4 MobileNet Version Comparison

Feature	v1 (2017)	v2 (2018)	v3 (2019)
Core module	Depthwise separable conv	Linear bottleneck + inverted residual	NAS + SE + h-swish
Activation	ReLU6	ReLU6	h-swish/ReLU
ImageNet Top-1	70.6%	72.0%	75.2%
FLOPs (Large)	569M	300M	219M

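III. The EfficientNet Family

3.1 Compound Scaling

Core question: given a FLOPs budget, how should a network's depth, width, and resolution be scaled together?

Compound scaling rule (Tan & Le, ICML 2019):

depth: d = α^{ϕ}, \quad width: w = β^{ϕ}, \quad resolution: r = γ^{ϕ}

Constraint:

α \cdot β^{2} \cdot γ^{2} \approx 2, \quad α, β, γ \geq 1

Here $ϕ$ is a user-specified compound coefficient. Since FLOPs grow linearly with depth and quadratically with width and resolution, the constraint means total FLOPs grow by roughly $2^{ϕ}$.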
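As a rough illustration of the rule, the sketch below computes the three multipliers and the resulting FLOPs multiplier for a given $ϕ$. The constants $α = 1.2$, $β = 1.1$, $γ = 1.15$ are the grid-search values reported by Tan & Le for the B0 baseline; the function name `compound_scale` is an assumption made for this example.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases (Tan & Le)


def compound_scale(phi):
    """Return (depth, width, resolution, FLOPs) multipliers for coefficient phi."""
    d = ALPHA ** phi
    w = BETA ** phi
    r = GAMMA ** phi
    # FLOPs scale linearly with depth and quadratically with width and resolution
    flops_mult = d * w ** 2 * r ** 2
    return d, w, r, flops_mult


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs x{f:.2f}")
```

With these constants $α \cdot β^{2} \cdot γ^{2} \approx 1.92$, so each unit increase of $ϕ$ roughly doubles the compute. The published B1 to B7 models use rounded, hand-adjusted coefficients, so they only approximately follow this formula.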

3.2 The EfficientNet-B0 Baseline

class MBConvBlock(nn.Module):
    """EfficientNet MBConv block (MobileNet v2 inverted residual + SE)."""
    
    def __init__(self, in_channels, out_channels, kernel_size, stride,
                 expand_ratio, se_ratio=0.25, drop_rate=0.0):
        super().__init__()
        self.use_residual = (stride == 1 and in_channels == out_channels)
        hidden = in_channels * expand_ratio
        self.drop_rate = drop_rate
        
        layers = []
        # Expansion (only when expand_ratio != 1)
        if expand_ratio != 1:
            layers.extend([
                nn.Conv2d(in_channels, hidden, 1, bias=False),
                nn.BatchNorm2d(hidden),
                nn.SiLU()  # Swish
            ])
        
        # Depthwise convolution
        padding = (kernel_size - 1) // 2
        layers.extend([
            nn.Conv2d(hidden, hidden, kernel_size, stride=stride,
                      padding=padding, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU()
        ])
        
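        # SE: the squeezed width is taken relative to the block's input channels
        if se_ratio > 0:
            se_hidden = max(1, int(in_channels * se_ratio))
            # SqueezeExcite expects a reduction factor, not the squeezed width
            layers.append(SqueezeExcite(hidden, reduction=max(1, hidden // se_hidden)))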
        
        # Projection
        layers.extend([
            nn.Conv2d(hidden, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels)
        ])
        
        self.conv = nn.Sequential(*layers)
    
    def forward(self, x):
        out = self.conv(x)
        if self.use_residual:
            if self.drop_rate > 0 and self.training:
                out = self._drop_connect(out)
            out = out + x
        return out
    
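    def _drop_connect(self, x):
        """Drop-connect (stochastic depth): drop the residual branch per sample."""
        keep_prob = 1.0 - self.drop_rate
        # One Bernoulli draw per sample, broadcast over channels and space
        mask = torch.empty((x.size(0), 1, 1, 1), dtype=x.dtype, device=x.device).bernoulli_(keep_prob)
        return x * mask / keep_prob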

3.3 The Eight EfficientNet Variants

Model	$ϕ$	Resolution	FLOPs	Top-1
B0	0	224	390M	77.1%
B1	1	240	700M	79.1%
B2	2	260	1.0G	80.1%
B3	3	300	1.8G	81.6%
B4	4	380	4.2G	82.9%
B5	5	456	9.9G	83.6%
B6	6	528	19.0G	84.0%
B7	7	600	37.0G	84.3%

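3.4 EfficientNetV2 (2021)

Improvements:

Fused-MBConv: in the early stages, a single regular 3×3 convolution replaces the 1×1 expansion plus depthwise 3×3 pair
Progressive learning: image size is increased gradually during training
Adaptive regularization: regularization strength is increased together with image size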

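class FusedMBConv(nn.Module):
    """Fused-MBConv from EfficientNetV2 (used in the early stages)."""

    def __init__(self, in_channels, out_channels, stride, expand_ratio=1):
        super().__init__()
        hidden = in_channels * expand_ratio
        self.use_residual = (stride == 1 and in_channels == out_channels)

        if expand_ratio != 1:
            layers = [
                # Fused expansion: one regular 3x3 conv replaces 1x1 expand + 3x3 depthwise
                nn.Conv2d(in_channels, hidden, 3, stride=stride,
                          padding=1, bias=False),
                nn.BatchNorm2d(hidden),
                nn.SiLU(),
                # Linear 1x1 projection (no activation)
                nn.Conv2d(hidden, out_channels, 1, bias=False),
                nn.BatchNorm2d(out_channels)
            ]
        else:
            # Without expansion the block is a single 3x3 conv,
            # which must still apply the spatial filtering and the stride
            layers = [
                nn.Conv2d(in_channels, out_channels, 3, stride=stride,
                          padding=1, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.SiLU()
            ]

        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        out = self.conv(x)
        if self.use_residual:
            return out + x
        return out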

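IV. Other Efficient Architectures

4.1 GhostNet (2020)

Key observation: CNN feature maps are highly redundant (many maps look alike). GhostNet generates these redundant maps with cheap operations.

Ghost module: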

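import math


class GhostModule(nn.Module):
    """Ghost module: cheap operations generate the redundant feature maps."""

    def __init__(self, in_channels, out_channels, kernel_size=1, ratio=2, dw_size=3):
        super().__init__()
        self.out_channels = out_channels
        # Round up so that init_channels * ratio >= out_channels; the
        # depthwise conv below needs new_channels to be a multiple of init_channels
        init_channels = math.ceil(out_channels / ratio)
        new_channels = init_channels * (ratio - 1)

        # Step 1: a regular (1x1 by default) conv produces the intrinsic features
        self.primary_conv = nn.Sequential(
            nn.Conv2d(in_channels, init_channels, kernel_size,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(init_channels),
            nn.ReLU(inplace=True)
        )

        # Step 2: a cheap linear op (depthwise conv) produces the ghost features
        self.cheap_operation = nn.Sequential(
            nn.Conv2d(init_channels, new_channels, dw_size, padding=dw_size//2,
                      groups=init_channels, bias=False),
            nn.BatchNorm2d(new_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        x1 = self.primary_conv(x)
        x2 = self.cheap_operation(x1)
        out = torch.cat([x1, x2], dim=1)
        # Drop surplus channels created by rounding up
        return out[:, :self.out_channels]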

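Cost comparison for producing $C_{out}$ output channels (standard convolution with a $K \times K$ kernel vs. Ghost module with a $d \times d$ cheap operation):

\frac{FLOPs _{Ghost}}{FLOPs _{standard}} = \frac{1}{r} + \frac{( r - 1 ) \cdot d ^{2}}{r \cdot C _{in} \cdot K ^{2}} \approx \frac{1}{r}

With the usual $r = 2$, FLOPs drop by roughly 50%.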
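The saving can be verified by counting multiply-accumulates directly from the layer shapes. This is a minimal sketch: the helper names and the example layer (128 to 256 channels on a 28×28 map) are assumptions, and it assumes the output channel count is divisible by the ratio.

```python
def conv_macs(h, w, c_in, c_out, k, groups=1):
    """Multiply-accumulates of a (possibly grouped) k x k convolution."""
    return h * w * (c_in // groups) * c_out * k * k


def ghost_macs(h, w, c_in, c_out, k=1, ratio=2, dw_size=3):
    """MACs of a Ghost module: primary conv plus cheap depthwise op."""
    init = c_out // ratio  # intrinsic maps from the primary convolution
    new = c_out - init     # ghost maps from the cheap depthwise operation
    primary = conv_macs(h, w, c_in, init, k)
    cheap = conv_macs(h, w, init, new, dw_size, groups=init)
    return primary + cheap


std = conv_macs(28, 28, 128, 256, 1)
ghost = ghost_macs(28, 28, 128, 256)
ratio = ghost / std
print(f"standard: {std:,} MACs, ghost: {ghost:,} MACs, ratio: {ratio:.3f}")
```

The ratio lands slightly above $1 / r$ because the depthwise operation is cheap but not free; its share shrinks as $C_{in}$ grows.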

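4.2 ShuffleNet (2018)

Key idea: channel shuffle lets information flow between the groups of a grouped convolution, which would otherwise stay isolated from each other.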

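import torch
import torch.nn as nn
import torch.nn.functional as F


class ShuffleBlock(nn.Module):
    """ShuffleNet v1 unit: grouped 1x1 convs with a channel shuffle in between."""

    def __init__(self, in_channels, out_channels, stride, groups=3):
        super().__init__()
        self.stride = stride
        self.groups = groups
        if stride == 2:
            # The shortcut is concatenated rather than added, so the
            # residual branch only produces the remaining channels
            out_channels -= in_channels

        # in_channels, bottleneck and out_channels must be divisible by groups
        bottleneck = out_channels // 4
        self.conv1 = nn.Conv2d(in_channels, bottleneck, 1, groups=groups, bias=False)
        self.bn1 = nn.BatchNorm2d(bottleneck)

        self.dwconv = nn.Conv2d(bottleneck, bottleneck, 3,
                                stride=stride, padding=1,
                                groups=bottleneck, bias=False)
        self.bn2 = nn.BatchNorm2d(bottleneck)

        self.conv3 = nn.Conv2d(bottleneck, out_channels, 1, groups=groups, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)

        # Stride-2 shortcut: average pooling so the spatial sizes match
        self.shortcut = nn.AvgPool2d(3, stride=2, padding=1) if stride == 2 else nn.Identity()

    def channel_shuffle(self, x, groups):
        """Interleave channels so each group sees inputs from every other group."""
        B, C, H, W = x.shape
        x = x.view(B, groups, C // groups, H, W)
        x = x.transpose(1, 2).contiguous()
        return x.view(B, C, H, W)

    def forward(self, x):
        identity = self.shortcut(x)

        out = F.relu(self.bn1(self.conv1(x)))
        out = self.channel_shuffle(out, self.groups)
        out = self.bn2(self.dwconv(out))
        out = self.bn3(self.conv3(out))

        if self.stride == 1:
            out = out + identity  # requires in_channels == out_channels
        else:
            out = torch.cat([identity, out], dim=1)
        return F.relu(out)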
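The reshape-transpose-flatten trick in `channel_shuffle` is easier to see on a plain list of channel indices. The sketch below reproduces the same permutation without tensors; it is an illustration only, and the function operates on a Python list rather than a feature map.

```python
def channel_shuffle(channels, groups):
    """Permute a flat channel list the way view -> transpose -> view does."""
    n = len(channels)
    assert n % groups == 0, "channel count must be divisible by groups"
    per_group = n // groups
    # Read the (groups, per_group) grid column by column
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


# Two groups of three channels: [0, 1, 2] and [3, 4, 5]
print(channel_shuffle(list(range(6)), groups=2))
```

After the shuffle, every consecutive block of channels contains one channel from each original group, so the next grouped convolution mixes information across groups.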

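4.3 RegNet (2020)

Key contribution: derives network design principles by analyzing populations of models sampled from a design space, rather than searching for a single best network.

Key design constraints (from large-scale experiments):

Shared bottleneck ratio $b = 1$ (bottleneck width = block width)
Shared group width, $g_{w} = 16$ or $g_{w} > 16$
Depth of about 20 blocks
Linearly parameterized widths $w_{j} = w_{0} + w_{a} \cdot j$ ($j$ is the block index)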
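The linear width rule can be sketched in a few lines. This is a simplification under stated assumptions: the function name, the example values of $w_{0}$ and $w_{a}$, and the rounding to a multiple of 8 are all illustrative. The published RegNet additionally quantizes the widths into piecewise-constant stages, which is omitted here.

```python
def regnet_widths(w0, wa, depth, multiple=8):
    """Per-block widths from w_j = w0 + wa * j, rounded to a multiple of `multiple`."""
    widths = []
    for j in range(depth):
        w = w0 + wa * j
        # Round to a hardware-friendly channel count (assumption, not from the paper)
        widths.append(int(round(w / multiple)) * multiple)
    return widths


widths = regnet_widths(w0=32, wa=26.5, depth=5)
print(widths)
```

In a real model these widths would also need to be divisible by the group width $g_{w}$ so that the grouped 3×3 convolution is well defined.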

import torch.nn.functional as F


class RegNetBlock(nn.Module):
    """RegNet block following the design-space parameterization."""
    
    def __init__(self, in_channels, out_channels, stride, group_width):
        super().__init__()
        bottleneck = out_channels  # bottleneck ratio of 1
        
        groups = bottleneck // group_width
        
        self.conv1 = nn.Conv2d(in_channels, bottleneck, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(bottleneck)
        self.conv2 = nn.Conv2d(bottleneck, bottleneck, 3, stride=stride,
                               padding=1, groups=groups, bias=False)
        self.bn2 = nn.BatchNorm2d(bottleneck)
        self.conv3 = nn.Conv2d(bottleneck, out_channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.use_skip = (stride == 1 and in_channels == out_channels)
    
    def forward(self, x):
        identity = x
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        if self.use_skip:
            out = out + identity
        # Final activation after the (optional) residual addition
        return F.relu(out)

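V. CNN-Transformer Hybrid Architectures

5.1 Design Motivation

CNNs and Transformers have complementary strengths:

CNN: locality, strong inductive bias, computational efficiency
Transformer: global dependencies, long-range relations, scalability

Hybrid architectures try to combine the two.

5.2 MobileViT (2021)

Key idea: use a Transformer as the "global" complement to convolution.

MobileViT block: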

class MobileViTBlock(nn.Module):
    """MobileViT block: local convolution + global Transformer."""

    def __init__(self, in_channels, transformer_dim, patch_size=(2, 2)):
        super().__init__()
        self.patch_size = patch_size
        # Local representation
        self.local_rep = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        # Project to the Transformer width (must be divisible by nhead)
        self.proj_in = nn.Conv2d(in_channels, transformer_dim, 1)
        # Transformer encoder
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                d_model=transformer_dim, nhead=4, dim_feedforward=transformer_dim*2,
                batch_first=True
            ),
            num_layers=2
        )
        # Project back to the input width
        self.proj_out = nn.Conv2d(transformer_dim, in_channels, 1)
        # Fusion
        self.fuse = nn.Conv2d(2 * in_channels, in_channels, 1)

    def forward(self, x):
        local = self.local_rep(x)
        x_t = self.proj_in(local)  # (B, D, H, W)
        B, D, H, W = x_t.shape
        ph, pw = self.patch_size   # H and W must be divisible by ph and pw
        nh, nw = H // ph, W // pw
        # Unfold: (B, D, H, W) -> (B * ph*pw, nh*nw, D), so attention runs
        # across patches for each pixel position inside a patch
        x_t = x_t.reshape(B, D, nh, ph, nw, pw)
        x_t = x_t.permute(0, 3, 5, 2, 4, 1).reshape(B * ph * pw, nh * nw, D)
        # Transformer over the patch sequence
        x_t = self.transformer(x_t)
        # Fold: exact inverse of the unfold above, back to (B, D, H, W)
        x_t = x_t.reshape(B, ph, pw, nh, nw, D)
        x_t = x_t.permute(0, 5, 3, 1, 4, 2).reshape(B, D, H, W)
        x_t = self.proj_out(x_t)
        # Fuse local and global features
        return self.fuse(torch.cat([local, x_t], dim=1))

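5.3 EfficientFormer (2022)

Key idea: design an efficient Vision Transformer by optimizing for measured latency rather than FLOPs.

Key designs:

Dimension-consistent design: avoids reshape operations (saves memory)
Latency-driven slimming: prunes the network according to on-device latency

5.4 PoolFormer (2022)

Surprising finding: replacing attention with a simple pooling operation still preserves performance.

PoolFormer block: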

class PoolFormerBlock(nn.Module):
    """PoolFormer block: pooling as the token mixer."""

    def __init__(self, dim, pool_size=3):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)
        # Pooling mixes tokens along the spatial dimensions; padded zeros are
        # excluded from the average so border tokens are not biased
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size//2,
                                 count_include_pad=False)
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, 4 * dim, 1),
            nn.GELU(),
            nn.Conv2d(4 * dim, dim, 1)
        )

    def forward(self, x):
        # x: (B, C, H, W); GroupNorm, AvgPool2d and Conv2d all expect NCHW
        # Token mixing: pooling in place of attention
        residual = x
        x = self.norm1(x)
        x = self.pool(x) - x  # subtract the input because the residual adds it back
        x = residual + x

        # Channel MLP
        residual = x
        x = self.norm2(x)
        x = self.mlp(x)
        return residual + x

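Insight: the success of Transformers may come mainly from the general architecture (a token mixer plus residual connections and an MLP, which the PoolFormer authors call MetaFormer) rather than from the attention mechanism itself.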

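VI. Frontier Progress as of 2025

6.1 Wavelet Convolutions

Finder et al. (2024) propose wavelet convolutions, which build convolution kernels with a very large receptive field but sparse activation.³

Mathematical construction:

The wavelet kernel is defined across multiple scales:

K_{wavelet} (x) = \sum_{j = 1}^{J} \sum_{k \in Z^{d}} α_{jk} \cdot ψ_{jk} (x)

where $ψ_{jk}$ is the wavelet basis function at scale $j$ and translation $k$, and $α_{jk}$ is its learned coefficient.

Advantages:

Very large, near-global receptive field (approaching that of a ViT)
Sparse activation (computationally efficient)
Multi-scale analysis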
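The multi-scale idea can be illustrated with a one-dimensional Haar transform, the simplest wavelet. This is a sketch of the decomposition only, not the WTConv implementation from the paper; all function names are assumptions. Each level halves the signal length, so a coefficient at level $j$ summarizes $2^{j}$ input samples, which is how a small filter applied at a coarse level covers a wide span of the input.

```python
import math

S = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling


def haar_step(signal):
    """One Haar level: pairwise averages (approx) and differences (detail)."""
    approx = [(signal[i] + signal[i + 1]) * S for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * S for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert one Haar level exactly."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) * S, (a - d) * S])
    return out


def haar_multilevel(signal, levels):
    """Repeatedly decompose the approximation band; length must be divisible by 2**levels."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append(d)
    return approx, details


x = [4.0, 2.0, 5.0, 7.0, 1.0, 1.0, 3.0, 9.0]
approx, details = haar_multilevel(x, levels=2)
print(len(approx), [len(d) for d in details])
```

With two levels on eight samples, each of the two remaining approximation coefficients covers four input samples.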

6.2 GhostNetV2 (2022)

GhostNetV2 introduces decoupled fully connected (DFC) attention to capture long-range dependencies:

class GhostNetV2Block(nn.Module):
    """GhostNetV2: DFC attention + Ghost modules."""

    def __init__(self, in_channels, out_channels, stride, kernel_size=3, ratio=2):
        super().__init__()
        # Ghost module
        self.ghost1 = GhostModule(in_channels, out_channels, ratio=ratio)
        # DFC attention (DFCAttention is not defined in this document and is
        # assumed to return a per-position gate with out_channels channels)
        self.dfc = DFCAttention(out_channels)
        self.ghost2 = GhostModule(out_channels, out_channels, ratio=ratio)
        self.shortcut = (stride == 1 and in_channels == out_channels)

    def forward(self, x):
        residual = x
        x = self.ghost1(x)
        x = self.dfc(x) * x  # attention re-weighting
        x = self.ghost2(x)
        if self.shortcut:
            x = x + residual
        return x

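6.3 Lightweight Foundation Models

By 2025, lightweight foundation models had become a clear trend:

Model	Parameters	Task
MobileCLIP	50M	Image-text retrieval
EfficientSAM	9M	Image segmentation
MobileVLM v2	1.7B	Visual question answering

These models inherit capabilities from large foundation models through distillation, pruning, and quantization.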

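6.4 Evolution of Neural Architecture Search (NAS)

NAS trends as of 2025:

Zero-cost NAS: proxy metrics that require no training
Differentiable NAS: the DARTS family
One-shot NAS: supernet search with weight sharing

See the machine-learning/neural-architecture-search/ directory for details.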

VII. Practical Guide to Lightweight CNNs

7.1 Model Selection Decision Tree

Deployment target?
├── Mobile / edge
│   ├── Ultra-light (< 3M params)
│   │   └── MobileNetV3-Small / ShuffleNetV2
│   ├── Medium (3-10M)
│   │   └── MobileNetV3-Large / EfficientNet-B0
│   └── High accuracy (10-50M)
│       └── EfficientNet-B3 / ConvNeXt-Tiny
├── Server GPU
│   └── ResNet / ConvNeXt / EfficientNet-B7
└── Long sequences / global dependencies
    └── Transformer / MobileViT / EfficientFormer

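7.2 Training Tips

1. Learning rate:

Lightweight models are often trained with a somewhat lower peak learning rate than large models. The values below are illustrative starting points only; the right value depends on batch size, optimizer, and schedule:

MobileNet: lr = 0.05
ResNet-50: lr = 0.1
EfficientNet: lr = 0.01

2. Progressive learning (EfficientNetV2):

Image size is increased gradually during training: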

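def progressive_resize(epoch, initial_size=128, max_size=380, total_epochs=300):
    """Linearly increase the training resolution from initial_size to max_size."""
    progress = min(max(epoch / total_epochs, 0.0), 1.0)  # clamp to [0, 1]
    size = initial_size + (max_size - initial_size) * progress
    return int(size)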

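3. Distillation:

Distill a lightweight student from a large teacher model:

import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Knowledge distillation loss: temperature-softened KL term plus hard-label cross-entropy."""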
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction='batchmean'
    ) * (T * T)
    
    hard_loss = F.cross_entropy(student_logits, labels)
    
    return alpha * soft_loss + (1 - alpha) * hard_loss
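The role of the temperature `T` is easiest to see on a single logit vector. The sketch below is a plain-Python softmax with temperature; the helper name `softmax_t` and the example logits are hypothetical. A higher temperature flattens the teacher distribution, so the student also receives a signal about the relative likelihood of the wrong classes.

```python
import math


def softmax_t(logits, T=1.0):
    """Softmax of logits / T, computed stably by subtracting the maximum."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [8.0, 4.0, 1.0]  # hypothetical teacher outputs for 3 classes
hard = softmax_t(teacher_logits, T=1.0)
soft = softmax_t(teacher_logits, T=4.0)
print("T=1:", [round(p, 3) for p in hard])
print("T=4:", [round(p, 3) for p in soft])
```

Because dividing the logits by `T` shrinks the gradients of the soft term by about $1 / T^{2}$, the loss above multiplies that term by `T * T` to keep the two terms on a comparable scale.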

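7.3 Quantization and Deployment

Post-training quantization (PTQ):

import copy

import torch
import torch.ao.quantization as quant

# Assumes an eager-mode model with submodules conv1, bn1, relu and fc,
# and QuantStub/DeQuantStub placed around the region to be quantized
model.eval()
model_fp32 = copy.deepcopy(model)

# Fuse Conv+BN+ReLU first (fusion runs in eval mode, before prepare)
model_fused = quant.fuse_modules(model_fp32, [['conv1', 'bn1', 'relu']])

# Select the mobile (ARM) backend for both the qconfig and the runtime engine
torch.backends.quantized.engine = 'qnnpack'
model_fused.qconfig = quant.get_default_qconfig('qnnpack')

# Keep quantization-sensitive layers in FP32
model_fused.conv1.qconfig = None  # skip the first layer
model_fused.fc.qconfig = None     # skip the last layer

# Insert observers, calibrate, then convert to INT8
model_quant = quant.prepare(model_fused, inplace=False)
# Calibrate (run a small batch of calibration data through model_quant)
# ... 
model_int8 = quant.convert(model_quant, inplace=False)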
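What calibration and conversion compute can be shown with scalar arithmetic. The sketch below implements 8-bit affine quantization: a scale and zero point are derived from an observed value range, then used to map floats to integers and back. The function names and the example range are assumptions for illustration; the actual observers in PyTorch offer several range-estimation strategies.

```python
def quant_params(x_min, x_max, qmin=0, qmax=255):
    """Scale and zero point mapping [x_min, x_max] onto [qmin, qmax]."""
    # The range must contain 0 so that zero is exactly representable
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)  # assumes x_max > x_min
    zero_point = int(round(qmin - x_min / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Float -> integer code, clamped to the representable range."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Integer code -> approximate float."""
    return (q - zero_point) * scale


# Hypothetical activation range observed during calibration
scale, zp = quant_params(-1.0, 3.0)
q = quantize(1.234, scale, zp)
x_hat = dequantize(q, scale, zp)
print(f"scale={scale:.5f}, zero_point={zp}, q={q}, reconstructed={x_hat:.4f}")
```

For values inside the calibrated range the round-trip error is at most half a quantization step (`scale / 2`); values outside the range are clamped, which is why representative calibration data matters.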

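TensorRT optimization:

import torch
import torch_tensorrt

# Compile to TensorRT (the model must be in eval mode on a CUDA device)
model = model.eval().cuda()
trt_model = torch_tensorrt.compile(
    model,
    inputs=[
        torch_tensorrt.Input(
            min_shape=[1, 3, 224, 224],
            opt_shape=[16, 3, 224, 224],
            max_shape=[32, 3, 224, 224]
        )
    ],
    enabled_precisions={torch.float16}  # allow FP16 kernels for speed
)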

VIII. References

Last updated: 2026-06-21

1. Howard, A.G., et al. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861.
2. Howard, A., et al. (2019). Searching for MobileNetV3. ICCV 2019. https://arxiv.org/pdf/1905.02244
3. Finder, S.E., Amoyal, R., Treister, E., & Freifeld, O. (2024). Wavelet Convolutions for Large Receptive Fields. arXiv:2407.05848. https://arxiv.org/pdf/2407.05848
