# A Summary of the Evolution of Lightweight Network Model Optimization: Inception V1-V4, ResNet, Xception, ResNeXt, MobileNet, ShuffleNet, DenseNet

Abstract

As networks evolved from the lightweight classic LeNet to deeper structures such as AlexNet and VGG, accuracy improved greatly for both recognition and classification, but many problems came with it: vanishing or exploding gradients during training, overfitting, underfitting, poor generalization, and accuracy degradation. As networks deepen, the time and space costs of computation also soar. Although GPU performance keeps improving, the most effective approach is to address the problem at the level of the algorithm itself. A current research direction is to deploy deep learning algorithms on mobile devices, which requires the academic community to change the structure of the network block. The two main directions of network compression and optimization are transfer learning (for example, MobileID) and network sparsity. Beyond these, many excellent networks have been created, such as ResNet, Inception V2-V4, Xception, ResNeXt, MobileNet, ShuffleNet, and DenseNet. The main ideas of these networks are introduced in this article.

Keywords: Inception V1-V4, ResNet, Xception, ResNeXt, MobileNet, ShuffleNet, DenseNet

I. Introduction

LeNet5, born in 1994, is one of the earliest convolutional neural networks and helped drive the development of deep learning. Today, networks are built less by simply stacking layers and more by changing the basic structure. Generally speaking, model simplification rests on compressing the large amount of redundancy inside Conv layers. The methods commonly used in academia are parameter sparsity, matrix decomposition, depthwise convolution, and group convolution. Starting from ResNet, the networks that followed (Inception V4, Xception, ResNeXt, MobileNet, ShuffleNet, and DenseNet, the CVPR 2017 best paper) all borrow ideas from ResNet, which can fairly be called a stroke of genius. This literature review takes ResNet as its starting point and reads through a series of recent, excellent papers on network structure optimization. The main body describes the main contributions, core ideas, and structural changes of these networks in detail and sorts out their evolutionary history.

II. Evolution of the optimized models:

(1) Inception V1:

This is the network known as GoogLeNet. Its core idea is to increase the depth and width of the network to improve CNN performance. However, this means a large number of parameters, which easily leads to overfitting and greatly increases the amount of computation.

GoogLeNet holds that the fundamental way to address these two shortcomings is to convert fully connected layers, and even ordinary convolutions, into sparse connections. On the one hand, connections in biological nervous systems are sparse. On the other hand, the literature shows that for large-scale sparse neural networks, an optimal network can be constructed layer by layer by analyzing the statistical properties of the activations and clustering highly correlated outputs. This suggests that a bloated sparse network may be simplified without loss of performance.

Although the mathematical proof holds only under strict conditions, the Hebbian principle strongly supports this: neurons that fire together, wire together.

Earlier, traditional networks used random sparse connections to break network symmetry and improve learning ability. However, computer hardware and software are very inefficient at computing on non-uniform sparse data, so AlexNet re-enabled fully connected layers in order to better optimize parallel computation. GoogLeNet looks for a method that keeps the sparsity of the network structure while still exploiting the high computational performance of dense matrices.

The central idea of the Inception module is to approximate the sparse structure with several dense submatrices, which reduces parameters and uses computing resources more efficiently at the same time.

Within the same layer, convolution kernels of different sizes (1×1, 3×3, and 5×5) extract features over receptive fields of different sizes. The amount of computation across the whole network grows, but the network does not become deeper.

Specifically, a 1×1 convolution is applied before the 3×3 and 5×5 convolutions to reduce the number of input channels, so the 1×1 kernel acts as dimensionality reduction. In addition, as the features extracted deeper in the network become more abstract and the receptive field involved becomes larger, the proportion of 3×3 and 5×5 convolutions also increases.
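
To make the effect of the 1×1 reduction concrete, the following minimal sketch counts the weights of a 5×5 branch with and without a 1×1 convolution in front of it. The channel counts (192 in, 16 after reduction, 32 out) and the helper `conv_params` are assumptions chosen for illustration only, and bias terms are ignored.

```python
def conv_params(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * in_channels * out_channels


# Hypothetical branch sizes, chosen only for illustration.
in_channels, reduced_channels, out_channels = 192, 16, 32

# 5x5 convolution applied directly to the input.
direct = conv_params(5, in_channels, out_channels)

# 1x1 reduction first, then the 5x5 convolution on the reduced channels.
reduced = (conv_params(1, in_channels, reduced_channels)
           + conv_params(5, reduced_channels, out_channels))

print(direct, reduced)  # prints "153600 15872"
```

Under these assumed sizes the reduced branch needs roughly a tenth of the weights, which is the saving the 1×1 kernel is meant to provide.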

The core structure of the network is shown in the figure below:

In the end, GoogLeNet uses about 12 times fewer parameters than AlexNet and is about 3 times smaller than VGG-16. It was a very good network at the time, but research did not stop there.

(2) Inception V2:

The main contribution of Inception V2 is batch normalization, whose main purpose is to speed up training. During training, the continual change of parameters changes the input distribution of every layer, and each layer must keep adapting to its new input distribution. As a result, the learning rate has to be lowered and initialization has to be done carefully. The authors call this change in distribution internal covariate shift.
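
As a rough illustration of what batch normalization computes at training time, the sketch below normalizes a single feature over a mini-batch and then applies a scale and shift. The function name `batch_norm`, the sample values, and the default `eps` are assumptions made for this example. A real layer would learn `gamma` and `beta` and track running statistics for inference.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift it."""
    n = len(batch)
    mean = sum(batch) / n
    # Biased (population) variance of the mini-batch.
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


# Hypothetical activations of one feature across a mini-batch of four samples.
activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)

norm_mean = sum(normalized) / len(normalized)
norm_var = sum((x - norm_mean) ** 2 for x in normalized) / len(normalized)
print(norm_mean, norm_var)
```

Whatever the scale of the incoming activations, the output has approximately zero mean and unit variance, so the next layer sees a more stable input distribution.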

The network structure also changed: two stacked 3×3 convolutions are used in place of a 5×5 convolution. Compared with V1 there are fewer parameters and less computation, while the number of layers increases and the results are better. The structure is as follows:

(3) Inception V3:

Inception V3 studies how to scale up the network while keeping computation efficient. The paper proposes several empirical rules for tuning CNNs:

1. Avoid representational bottlenecks. The feature representation is the activation of a given CNN layer for an image, and its size should decrease gradually through the CNN.

2. High-dimensional features are easier to process. Training on high-dimensional features is faster and converges more easily.

3. Spatial aggregation over a low-dimensional embedding loses little. The explanation is that adjacent units are strongly correlated, so the information is redundant.

4. Balance the depth and width of the network. With suitable width and depth, the computational budget is distributed across the network in a balanced way.

The biggest change in network structure is the use of a 1×n convolution combined with an n×1 convolution in place of an n×n convolution. The structure is as follows:
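
The factorizations discussed for Inception V2 and V3 can be checked with a small weight count. This is only a sketch: the helper `conv_params`, the channel count of 64, and the choice n = 7 are assumptions for illustration, with equal input and output channels and no bias.

```python
def conv_params(kernel_h, kernel_w, channels):
    """Weights of a kernel_h x kernel_w convolution with `channels` inputs and outputs."""
    return kernel_h * kernel_w * channels * channels


channels = 64  # hypothetical channel count

# One 5x5 convolution versus two stacked 3x3 convolutions (same receptive field).
five_by_five = conv_params(5, 5, channels)
two_three_by_three = 2 * conv_params(3, 3, channels)

# One n x n convolution versus a 1 x n convolution followed by an n x 1 convolution.
n = 7
n_by_n = conv_params(n, n, channels)
asymmetric = conv_params(1, n, channels) + conv_params(n, 1, channels)

print(five_by_five, two_three_by_three)  # prints "102400 73728"
print(n_by_n, asymmetric)                # prints "200704 57344"
```

The saving from the asymmetric factorization grows with n, since the weight count drops from n² to 2n per channel pair.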

(4) ResNet:

The main problem ResNet solves is degradation in deep networks. The authors state clearly in the paper that, in deep learning, a conventional network does not keep getting better as it gets deeper. Beyond a certain depth, accuracy begins to decline, and because accuracy on the training set declines as well, the cause is not overfitting.

ResNet adds an identity mapping, which converts the original function to be learned, H(x), into F(x)+x. The authors argue that the two formulations are equally expressive but differ in optimization difficulty, and they hypothesize that F(x) is much easier to optimize than H(x). The idea also derives from residual vector encoding in image processing: a reformulation decomposes a problem into residual problems across multiple scales, which helps optimization during training. Detailed notes on the paper are given in the paper-notes post on this blog.
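
The F(x)+x formulation can be sketched in a few lines. The toy function below is an assumption standing in for the stacked convolution layers of a real block: it only scales each element and applies ReLU. The names `residual_block` and `make_scale_relu` are hypothetical.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


def make_scale_relu(weight):
    """Toy F: element-wise scaling followed by ReLU (stand-in for conv layers)."""
    def f(x):
        return [max(0.0, weight * xi) for xi in x]
    return f


x = [1.0, -2.0, 3.0]

# With all weights at zero, F(x) = 0 and the block reduces to the identity mapping.
identity_out = residual_block(x, make_scale_relu(0.0))

# With non-zero weights, the block learns a correction on top of the input.
scaled_out = residual_block(x, make_scale_relu(0.5))

print(identity_out)  # prints "[1.0, -2.0, 3.0]"
print(scaled_out)    # prints "[1.5, -2.0, 4.5]"
```

The first case shows why the residual form may be easier to optimize: if the identity is close to optimal, the layers only need to push F toward zero rather than learn an identity function from scratch.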

The core block structure is as follows:

(5) Inception V4:

Inception V4 mainly uses residual connections, that is, the ResNet idea, to improve the V3 structure. The paper shows that combining Inception modules with residual connections greatly speeds up training and also improves performance, yielding the Inception-ResNet V2 network. It also designs a deeper and more optimized Inception V4 model that reaches performance comparable to Inception-ResNet V2.

(6) ResNeXt:

ResNeXt is the ultimate version of ResNet, and the name stands for the "next dimension". The ResNeXt paper shows that increasing cardinality (the number of parallel paths in a ResNeXt module) works better than increasing width or depth. Compared with ResNet, it has fewer parameters, better results, a simple structure, and a convenient design.
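
One common way to realize parallel paths of this kind is a grouped convolution, where the number of groups plays the role of cardinality. The sketch below counts weights for a dense and a grouped 3×3 convolution. The function `grouped_conv_params` and the sizes (128 channels, 32 groups) are assumptions for illustration, and bias is ignored.

```python
def grouped_conv_params(kernel, in_channels, out_channels, groups):
    """Weights of a grouped convolution: each group maps in/groups to out/groups channels."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    per_group = kernel * kernel * (in_channels // groups) * (out_channels // groups)
    return groups * per_group


# Hypothetical layer: 3x3 convolution, 128 channels in and out.
dense = grouped_conv_params(3, 128, 128, groups=1)
grouped = grouped_conv_params(3, 128, 128, groups=32)

print(dense, grouped)  # prints "147456 4608"
```

For a fixed width, the weight count falls by a factor equal to the number of groups, which leaves room to widen the layer or add paths within the same budget.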

(7) Xception:

Xception is the extreme version of the Inception family. The most important method the author proposes is the depthwise separable convolution, which later also appears in MobileNet. The core idea is to separate the spatial transformation from the channel transformation. The difference from Inception V3 is the order: Inception V3 does the 1×1 convolution first and the 3×3 convolution second, so channels are merged first (channel convolution) and spatial convolution follows. Xception does the opposite: the 3×3 spatial convolution comes first and the 1×1 channel convolution second. The differences are as follows:

(8) MobileNet:

MobileNets are in fact an application of the Xception idea. The difference is that Xception focuses on improving accuracy, while MobileNets focus on compressing the model while maintaining accuracy. The idea of depthwise separable convolutions is to decompose a standard convolution into a depthwise convolution and a pointwise convolution. It can be understood simply as a factorization of a matrix. The specific steps are shown on the left of the figure below.

The difference between the depthwise separable convolution block and the traditional convolution block is shown on the right of the figure below:

Suppose the input feature map has size $D_F \times D_F$ with $M$ channels, the filter has size $D_K \times D_K$ with $N$ output channels, padding is 1, and stride is 1. The original convolution then requires $D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$ multiply-add operations, and the convolution kernel has $D_K \cdot D_K \cdot M \cdot N$ parameters.

Depthwise separable convolutions require $D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$ operations, and the kernels have $D_K \cdot D_K \cdot M + M \cdot N$ parameters.

Convolution is mainly a process in which the spatial dimensions shrink and the channel dimension grows, that is, $N > M$, so $N$ is typically large. The ratio of the two costs is $\frac{1}{N} + \frac{1}{D_K^2}$, which is well below 1, and therefore $D_K \cdot D_K \cdot M \cdot N > D_K \cdot D_K \cdot M + M \cdot N$.

Depthwise separable convolutions therefore compress both model size and computation substantially, making the model fast and cheap to compute while keeping good accuracy.
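
The operation counts above can be evaluated directly. In this sketch the function names and the layer sizes (a 3×3 kernel, 64 input channels, 128 output channels, a 56×56 feature map) are assumptions picked for illustration.

```python
def standard_conv_cost(dk, m, n, df):
    """Multiply-adds of a standard convolution: DK*DK*M*N*DF*DF."""
    return dk * dk * m * n * df * df


def separable_conv_cost(dk, m, n, df):
    """Multiply-adds of a depthwise convolution plus a 1x1 pointwise convolution."""
    depthwise = dk * dk * m * df * df
    pointwise = m * n * df * df
    return depthwise + pointwise


# Hypothetical layer sizes.
dk, m, n, df = 3, 64, 128, 56

standard = standard_conv_cost(dk, m, n, df)
separable = separable_conv_cost(dk, m, n, df)
ratio = separable / standard

print(standard, separable)  # prints "231211008 27496448"
print(round(ratio, 4))      # prints "0.1189"
```

The measured ratio matches 1/N + 1/DK², so with a 3×3 kernel the separable form costs roughly 8 to 9 times less computation.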

(9) ShuffleNet:

This paper makes one improvement on MobileNet. MobileNet applies depthwise convolution only to the 3×3 convolution, while the 1×1 convolution is still a traditional convolution and carries a lot of redundancy. Building on this, ShuffleNet applies shuffle and group operations to the 1×1 convolution, implementing channel shuffle and pointwise group convolution, and ends up improving on MobileNet in both speed and accuracy.

The specific structure is shown in the figure below:

(a) is the original structure before shuffling, in which no information is exchanged between groups.

(b) applies the shuffle operation to the feature map.

(c) is the result after channel shuffle.
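
The channel shuffle step can be sketched on a plain list of channel labels. The function `channel_shuffle` and the labels are hypothetical. The reordering is equivalent to reshaping the channel axis to (groups, channels per group), transposing, and flattening.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so the next grouped layer sees channels from every group."""
    if len(channels) % groups:
        raise ValueError("number of channels must be divisible by groups")
    per_group = len(channels) // groups
    # Take the i-th channel of every group in turn (a transpose of the group layout).
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


# Two groups, "a" and "b", with three channels each.
labels = ["a0", "a1", "a2", "b0", "b1", "b2"]
shuffled = channel_shuffle(labels, groups=2)

print(shuffled)  # prints "['a0', 'b0', 'a1', 'b1', 'a2', 'b2']"
```

After the shuffle, any contiguous block of channels fed to the next group convolution contains channels that originated in different groups, which restores information exchange between groups at no extra parameter cost.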

ShuffleNet also uses the idea of group convolution, and it is very effective. Indirectly, this suggests that an efficient neural network structure should be grouped rather than fully connected the way an ordinary Conv or InnerProduct layer is: similar information should be shared instead of being extracted repeatedly, different groups serve different functions, and those functions can be trained. This tells us that information needs to be condensed.

(10) DenseNet:

DenseNet is the best paper of CVPR 2017. Although it builds on ResNet, the difference is that, to maximize the flow of information between all layers, the authors connect all layers in the network so that each layer receives the features of all preceding layers as input. Because the network contains a large number of dense connections, the authors call this structure DenseNet. The structure is shown on the left in the figure below:

It has two main features:

1. To some extent it alleviates vanishing gradients during training. As the left figure shows, during backpropagation each layer receives gradient signals from all subsequent layers, so the gradient near the input layer does not keep shrinking as the network gets deeper.

2. Because a large number of features are reused, a small number of convolution kernels can generate a large number of features, and the final model is relatively small.

A complete DenseNet structure is as follows:

The main points of the network design are as follows:

1. For feature reuse, cross-layer connections use the Concatenate operation along the feature dimension instead of the Element-wise Addition operation.

2. Because no Element-wise operation is needed, no 1×1 convolution is required at the end of each unit module to raise the number of feature layers to match the input feature dimension.

3. The unit is designed with Pre-activation, moving the BN operation from the main branch to before the branch (BN->ReLU->1x1Conv->BN->ReLU->3x3Conv).

4. Because each layer accepts the features of all previous layers as input, the feature dimension of later layers would grow too fast as layers are added. To avoid this, when subsampling after each stage, a convolution layer first compresses the feature dimension to half of the current input, and the Pooling operation follows.

5. Setting the growth rate. The growth rate is the number of convolution kernels in the last 3×3 convolution of each unit module, denoted k. Because each unit module ends with a Concatenate connection, the feature dimension of the next layer grows by k after every unit module. The larger k is, the more information flows through the network and the more powerful the network is, but the size and computation of the whole model also increase. The authors use two settings, k=32 and k=48.
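
The interaction between the growth rate and the halving at each subsampling stage can be traced with a short calculation. In this sketch the initial channel count (64), the number of unit modules per stage (6, 12, 24, 16), and both function names are assumptions for illustration. Only the growth by k per module and the compression to half come from the points above.

```python
def dense_block_channels(in_channels, num_layers, growth_rate):
    """Each unit module concatenates growth_rate new feature maps onto its input."""
    return in_channels + num_layers * growth_rate


def densenet_channel_trace(init_channels, block_layers, growth_rate):
    """Return the channel count at the end of each dense block."""
    channels = init_channels
    trace = []
    for index, num_layers in enumerate(block_layers):
        channels = dense_block_channels(channels, num_layers, growth_rate)
        trace.append(channels)
        if index < len(block_layers) - 1:
            # Subsampling stage: compress the features to half before pooling.
            channels //= 2
    return trace


# Hypothetical configuration with growth rate k = 32.
trace = densenet_channel_trace(64, (6, 12, 24, 16), growth_rate=32)

print(trace)  # prints "[256, 512, 1024, 1024]"
```

Without the halving step the same configuration would end at 64 + 58 × 32 = 1920 channels, which shows how the compression keeps the feature dimension of later layers under control.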

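III. Summary and Outlook:

This article starts from ResNet, sorts through the excellent network structure designs before and after it, and summarizes their core points. Overall, hand-designed networks have appeared at conferences more and more in recent years, gradually replacing the traditional idea that a network should simply be made deeper. More and more attention goes to model compression and optimization methods for recognition and classification, the two most important problems in computer vision. The goal is not only higher accuracy, higher mAP, and a better convergence curve, but also lower space and compute cost. MobileNet also shows that, as more such architectures are built, reducing the internal redundancy of convolution layers improves both computational and network performance, which in industry makes it possible to bring deep learning to mobile devices.

As hand-designed network structures keep evolving and papers on network structure optimization are published in a flood, the hand-designed neural network structures discussed today will soon become obsolete. Even so, we should absorb the creativity and ideas of earlier papers and look for new methods. Looking ahead, as networks continue to evolve, hand-designed structures may eventually be replaced by "better-suited" structures learned automatically from the training data. Only the basic structure of the network would be fixed, while the topology of the whole network would be discovered and designed automatically during training rather than by hand. Such a network could keep upgrading and evolving in its deployment scenario.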

IV. References (in the order of Section II):

[1] Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions[C]// IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2015:1-9.

[2] Ioffe S, Szegedy C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift[J]. 2015:448-456.

[3] Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the Inception Architecture for Computer Vision[J]. 2015:2818-2826.

[4] He K, Zhang X, Ren S, et al. Deep Residual Learning for Image Recognition[J]. 2015:770-778.

[5] Szegedy C, Ioffe S, Vanhoucke V, et al. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning[J]. 2016.

[6] Xie S, Girshick R, Dollar P, et al. Aggregated Residual Transformations for Deep Neural Networks[J]. 2016.

[7] Chollet F. Xception: Deep Learning with Depthwise Separable Convolutions[C]// IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2017:1800-1807.

[8] Howard A G, Zhu M, Chen B, et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications[J]. 2017.

[9] Zhang X, Zhou X, Lin M, et al. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices[J]. 2017.

[10] Huang G, Liu Z, Maaten L V D, et al. Densely Connected Convolutional Networks[J]. 2016.