深度学习之图像分类（十一）-- MobileNetV2 网络结构-木卯 Blog

深度学习之图像分类（十一）MobileNetV2 网络结构

本节学习 MobileNetV2 网络结构。学习视频源于 Bilibili，部分参考描述源自知乎详解MobileNetV2。

1. 前言

MobileNetV2 是由google团队在 2018 年提出的，相比于 MobileNetV1 而言准确率更高，模型更小。其原始论文为 MobileNetV2: Inverted Residuals and Linear Bottlenecks。

我们在 MobileNetV1 中发现很多 DW 卷积权重为 0，其实是无效的。很重要的一个原因是因为 ReLU 激活函数对 0 值的梯度是 0，后续无论怎么迭代这个节点的值都不会恢复了，所以如果节点的值变为 0 就会“死掉”。而 ResNet 的残差结构可以很大程度上缓解这种特征退化问题。所以很自然的，MobileNetV2 尝试引入 Residuals 模块并针对 ReLU 激活函数进行研究，动机相当明确，故事也很清晰。

MobileNetV2 网络中的亮点（正好就是论文的名称）包括：

Inverted Residuals 倒残差结构
Linear Bottlenecks

2. Inverted Residuals 倒残差结构

首先来看看什么是 Inverted Residuals 结构。在 ResNet 提出的 Residuals 结构中，先使用 $1 \times 1$ 卷积实现降维，然后通过 $3 \times 3$ 卷积，最后通过 $1 \times 1$ 卷积实现升维，即两头大中间小。在 MobileNetV2 中，将降维和升维的顺序进行了调换，并且将 $3 \times 3$ 卷积换为 $3 \times 3$ DW卷积，即两头小中间大。

此外，激活函数也不一样，在 Inverted Residuals 中使用的是 ReLU6 激活函数，其表达式为 $y = \text{ReLU}(6) = min(max(x,0), 6)$。

在 MobileNetV2 中的 Inverted Residuals 结构图如下所示，与 ResNet 不同的是，只有当 stride = 1 且输入特征矩阵与输出特征矩阵 shape 相同的时候才有 shortcut 连接。

Inverted Residuals 结构的核心问题在于：为什么要升维再降维？

MobileNetV1 网络主要思路就是深度可分离卷积的堆叠，这和 VGG 的堆叠思想基本一致。Inverted Residuals 结构包含 DW 卷积结构之外，还使用了Expansion layer和 Projection layer。这个 Projection layer 就是使用 $1 \times 1$ 卷积网络结构，其目的是希望把高维特征映射到低维空间去。补充说一句，使用 $1 \times 1$ 卷积网络结构将高维空间映射到低维空间的设计有的时候我们也称之为 Bottleneck layer。Expansion layer 的功能正相反，使用 $1 \times 1$ 卷积网络结构，其目的是将低维空间映射到高维空间。这里 Expansion 有一个超参数是维度扩展几倍。可以根据实际情况来做调整的，默认值是 6，也就是扩展 6 倍。

我们知道：如果 Tensor 维度越低，卷积层的乘/加法计算量就越小。那么如果整个网络都是低维的 Tensor，那么整体计算速度就会很快(这个和 MobileNet 的初心一致呀)。然而，如果只是使用低维的 Tensor 效果并不会好（在 ICIP2021 的文章 MEPDNet 中提到了倒三角，不是增加维度，而是增加尺寸）。如果卷积层的 filter 都是使用低维的 Tensor 来提取特征的话，那么就没有办法提取到整体的足够多的信息。所以，如果提取特征数据的话，我们可能更希望有高维的 Tensor 来做这个事情。总结而言就是：在没有办法确定这些特征是否充足或者完备的情况下，那就多来点特征你自己去选来用，韩信点兵多多益善。MobileNetV2 便设计这样一个结构来达到平衡。

先通过 Expansion layer 来扩展维度，之后在用深度可分离卷积来提取特征，之后使用 Projection layer 来压缩数据，让网络从新变小。因为 Expansion layer 和 Projection layer 都是有可以学习的参数，所以整个网络结构可以学习到如何更好的扩展数据和从新压缩数据。哪为什么原来的结构不先升再降呢？简单，因为他们中间 $3 \times 3$ 卷积计算量太大。MobileNetV2 之所以敢这么做，还是源自于 DW 卷积计算量小，如果中间是普通卷积，计算量太大了，就需要先降维再升维来降低计算量和参数量。

3. Linear Bottlenecks

在原文中，作者针对 Inverted Residuals 结构的最后一个 $1 \times 1$ 卷积层使用线性激活函数而非 ReLU 激活函数。作者做了一个实验来说明这一点，首先输入是一个二维的矩阵，channel = 1。我们采用不同的 matrix $T$ 把它进行升维到更高的维度上，使用激活函数 ReLU，然后再使用 $T$ 的逆矩阵 $T^{-1}$ 对其进行还原。当 $T$ 的维度为 2 或者 3 时，还原回 2 维特征矩阵后丢失了很多信息。但是随着维度的增加，丢失的信息越来越少。可见，ReLU 激活函数对低维特征信息会造成大量损失。因为 Inverted Residuals 结构两头小中间大，在最后输出的时候是一个低维的特征了，所以改为线性激活函数而非 ReLU 激活函数。实验证明，使用 linear bottleneck 可以防止非线性破坏太多信息。

4. MobileNetV2 网络结构

下图展示了 MobileNetV2 网络结构配置图。其中 t 表示在 Inverted Residuals 结构中 $1 \times 1$ 卷积升维的倍率 (相较于输入通道而言的)，c 是输出特征矩阵的深度 channel。n 表示 bottleneck (即 Inverted Residuals 结构) 重复的次数。s 表示步距，但是只表示第一个 bottleneck 中 DW 卷积的步距，后面重复 bottleneck 的 stride 都是等于 1 的。

值得指出的是，在 pytorch 和 tensorflow 的官方实现中，第一个 bottleneck 是没有用 $1 \times 1$ 卷积升维的，而是直接接了 DW 卷积。因为原论文中这个的扩展因子 $t = 1$，也就是哪怕使用了 $1 \times 1$ 卷积也没有升维。所以干脆没有用了。

我们来看一下最终的性能对比，相比于 MobileNetV1 而言，无论从参数量还是性能上面都有提升。在 CPU 上预测一张图用了不到 0.1s，可以说是实现了实时。但是在目标检测上，比 MobileNetV1 的性能略差，就有点迷。

5. 代码

MobileNetV2 实现代码如下所示：

from torch import nn
import torch


def _make_divisible(ch, divisor=8, min_ch=None):
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    """
    if min_ch is None:
        min_ch = divisor
    new_ch = max(min_ch, int(ch + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_ch < 0.9 * ch:
        new_ch += divisor
    return new_ch


class ConvBNReLU(nn.Sequential):
    def __init__(self, in_channel, out_channel, kernel_size=3, stride=1, groups=1):
        padding = (kernel_size - 1) // 2
        super(ConvBNReLU, self).__init__(
            nn.Conv2d(in_channel, out_channel, kernel_size, stride, padding, groups=groups, bias=False),
            nn.BatchNorm2d(out_channel),
            nn.ReLU6(inplace=True)
        )


class InvertedResidual(nn.Module):
    def __init__(self, in_channel, out_channel, stride, expand_ratio):
        super(InvertedResidual, self).__init__()
        hidden_channel = in_channel * expand_ratio
        self.use_shortcut = stride == 1 and in_channel == out_channel

        layers = []
        if expand_ratio != 1:
            # 1x1 pointwise conv
            layers.append(ConvBNReLU(in_channel, hidden_channel, kernel_size=1))
        layers.extend([
            # 3x3 depthwise conv
            ConvBNReLU(hidden_channel, hidden_channel, stride=stride, groups=hidden_channel),
            # 1x1 pointwise conv(linear)
            nn.Conv2d(hidden_channel, out_channel, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channel),
        ])

        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_shortcut:
            return x + self.conv(x)
        else:
            return self.conv(x)


class MobileNetV2(nn.Module):
    def __init__(self, num_classes=1000, alpha=1.0, round_nearest=8):
        super(MobileNetV2, self).__init__()
        block = InvertedResidual
        input_channel = _make_divisible(32 * alpha, round_nearest)
        last_channel = _make_divisible(1280 * alpha, round_nearest)

        inverted_residual_setting = [
            # t, c, n, s
            [1, 16, 1, 1],
            [6, 24, 2, 2],
            [6, 32, 3, 2],
            [6, 64, 4, 2],
            [6, 96, 3, 1],
            [6, 160, 3, 2],
            [6, 320, 1, 1],
        ]

        features = []
        # conv1 layer
        features.append(ConvBNReLU(3, input_channel, stride=2))
        # building inverted residual residual blockes
        for t, c, n, s in inverted_residual_setting:
            output_channel = _make_divisible(c * alpha, round_nearest)
            for i in range(n):
                stride = s if i == 0 else 1
                features.append(block(input_channel, output_channel, stride, expand_ratio=t))
                input_channel = output_channel
        # building last several layers
        features.append(ConvBNReLU(input_channel, last_channel, 1))
        # combine feature layers
        self.features = nn.Sequential(*features)

        # building classifier
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.classifier = nn.Sequential(
            nn.Dropout(0.2),
            nn.Linear(last_channel, num_classes)
        )

        # weight initialization
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, 0, 0.01)
                nn.init.zeros_(m.bias)

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x