Summary of the evolutionary history of lightweight network model optimization — Inception V1-4, ResNet, Xception, ResNeXt, MobileNet, ShuffleNet, DenseNet
From the lightweight classic LeNet to deeper structures such as AlexNet and VGG, network architectures have greatly improved accuracy on both recognition and classification tasks, but they have also brought many problems: gradient vanishing/explosion during training, overfitting, underfitting, poor generalization, accuracy degradation, and time and space costs that skyrocket as networks deepen. Although graphics card performance keeps improving, the most effective solution is at the level of the underlying algorithms. A current research direction is to bring deep learning to mobile devices, which requires the academic community to rethink the basic network building blocks. The two main directions of network compression and optimization are transfer learning (for example MobileID) and network sparsification. In practice these ideas have produced a series of excellent networks, such as ResNet, Inception V2-4, Xception, ResNeXt, MobileNet, ShuffleNet, and DenseNet, whose main ideas are introduced in this article.
Keywords: Inception V1-4, ResNet, Xception, ResNeXt, MobileNet, ShuffleNet, DenseNet
1. Introduction
LeNet5, born in 1994, is one of the earliest convolutional neural networks and helped drive the development of deep learning. Today, network structures are built less by simply stacking layers and more by modifying basic structural blocks. In general, the idea of model simplification is to compress the large amount of redundancy inside convolutional layers; the methods commonly used in academia are parameter sparsification, matrix decomposition, depthwise convolution, and group convolution. Starting from ResNet, the networks Inception V4, Xception, ResNeXt, MobileNet, ShuffleNet, and DenseNet (best paper of CVPR 2017) all borrow the idea of ResNet, the so-called shortcut-connection idea. This literature review takes ResNet as its starting point and works through a series of recent excellent papers on network structure optimization; the main body describes in detail the main contributions, core ideas, and structural changes of these networks, and sorts out their evolutionary history.
2. Evolution of the optimized models
As we know, the core idea of GoogLeNet is to increase the depth and width of the network to improve CNN performance; but this means a large number of parameters, which easily causes overfitting and greatly increases the amount of computation.
GoogLeNet argues that the fundamental way to address both shortcomings is to convert fully connected, and even ordinary convolutional, layers into sparse connections. On the one hand, connections in biological nervous systems are sparse; on the other hand, the literature shows that for a large-scale sparse neural network, an optimal network can be constructed layer by layer by analyzing the statistical characteristics of activations and clustering highly correlated outputs. This suggests that a bloated network may be sparsified without loss of performance.
Although the mathematical proof requires strict conditions, Hebbian theory strongly supports this: neurons that fire together, wire together.
Earlier, in order to break network symmetry and improve learning ability, traditional networks used random sparse connections. However, computer hardware and software compute very inefficiently on non-uniform sparse data, so AlexNet re-enabled fully connected layers, the purpose being to better exploit parallel computation. GoogLeNet therefore looked for a method that maintains the sparsity of the network structure while exploiting the high computing performance of dense matrices.
The central idea of the Inception module is to approximate a sparse structure with several dense sub-matrices, thereby reducing parameters while using computing resources more efficiently.
Within the same layer of the network, 1x1, 3x3, and 5x5 convolution templates extract features over receptive fields of different sizes; the total computation of the network grows, but the network does not get deeper. The specific operation is to apply a 1x1 convolution before the 3x3 and 5x5 convolutions to reduce the number of input channels, so the 1x1 kernels play a dimensionality-reduction role. As the features extracted by the network become more abstract and the receptive fields involved become larger, the proportion of 3x3 and 5x5 convolutions also increases. The core structure of the network is shown in the figure below:
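To make the effect of the 1x1 bottleneck concrete, the following sketch counts convolution weights for hypothetical channel sizes (192 input channels, a 32-channel bottleneck, 128 output channels; the numbers are for illustration only, not taken from the paper):

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

# Hypothetical channel sizes, chosen only to illustrate the idea.
c_in, c_mid, c_out = 192, 32, 128

direct = conv_params(5, c_in, c_out)                        # plain 5x5 conv
bottleneck = conv_params(1, c_in, c_mid) + conv_params(5, c_mid, c_out)

print(direct, bottleneck)  # the 1x1 reduction cuts parameters several-fold
```

The same counting applies to the 3x3 branch; the saving grows with the input channel count.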
In the end, GoogLeNet has about 12 times fewer parameters than AlexNet and about 3 times fewer than VGG-16. It was an excellent network at the time, but research went far beyond that.
The main contribution of Inception V2 is batch normalization, whose main purpose is to speed up training. During training, the continuous change of parameters causes the input distribution of each layer to shift, and the learning process must let each layer adapt to this shifting input distribution, which forces a reduced learning rate and careful initialization. The authors call this change of distribution internal covariate shift.
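As a minimal illustration of what batch normalization computes, the sketch below applies the per-batch transform to a 1-D batch of activations (training-time running statistics and the learned gamma/beta updates are omitted):

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a 1-D batch to zero mean / unit variance, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

out = batch_norm([1.0, 2.0, 3.0, 4.0])
# the normalized batch has (approximately) zero mean and unit variance
```

Because each layer then sees inputs with a stable distribution, a larger learning rate and less careful initialization become possible.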
The network structure also changed: two stacked 3x3 convolutions replace a 5x5 convolution. Compared with V1 there are fewer parameters and less computation, while the number of layers increases and the results improve, as follows:
Inception V3 studies how to increase the network scale while keeping computation efficient, and proposes some empirical rules for CNN design:
1. Avoid representational bottlenecks. A feature representation is the activation of some CNN layer; the size of the representation should decrease only slowly through the network.
2. Higher-dimensional representations are easier to process; training on them is faster and converges more easily.
3. Spatial aggregation over a lower-dimensional embedding loses little. The explanation is that adjacent neural units are strongly correlated and carry redundant information.
4. Balance the depth and width of the network. With appropriate width and depth, the network distributes the computational budget in a balanced way.
The biggest change in the network structure is the use of a 1xn convolution followed by an nx1 convolution to replace an nxn convolution. The structure is as follows:
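The parameter saving of this factorization can be checked with a short sketch; the channel count and kernel size below are hypothetical, and the intermediate channel count is assumed equal to the output count:

```python
def conv_params(kh, kw, c_in, c_out):
    """Weight count of a kh x kw convolution (biases ignored)."""
    return kh * kw * c_in * c_out

c, n = 64, 7  # hypothetical channel count and kernel size
full = conv_params(n, n, c, c)                                # one n x n conv
factored = conv_params(1, n, c, c) + conv_params(n, 1, c, c)  # 1xn then nx1

print(full, factored)  # the factored pair uses 2/n of the parameters
```

For n = 7 the pair costs 2/7 of the full convolution, which is why the paper applies this mainly to larger kernels on mid-size feature maps.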
The main problem ResNet solves is the degradation problem in deep networks. The authors state clearly in the paper that, with conventional networks, deeper is not always better: beyond a certain depth, accuracy begins to decline, and since accuracy on the training set also decreases, this is demonstrably not caused by overfitting.
ResNet adds an identity mapping: the original function to be learned, H(x), is converted to F(x)+x. The authors consider the two expressions equally expressive but not equally difficult to optimize; their hypothesis is that optimizing the residual F(x) is much simpler than optimizing H(x). The idea also derives from residual vector coding in image processing: through a reformulation, decomposing a problem into several residual problems at different scales can greatly help optimization and training. Detailed notes on the paper are given separately in a blog post. The core block structure is as follows:
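A minimal sketch of the residual computation y = ReLU(F(x) + x), with a hypothetical elementwise function F standing in for the block's convolutions (in the real block F is two or three conv layers):

```python
def relu(v):
    return [max(0.0, x) for x in v]

def residual_block(x, f):
    """y = ReLU(F(x) + x): the block learns only the residual F,
    while the identity shortcut carries x straight through."""
    return relu([fi + xi for fi, xi in zip(f(x), x)])

# Hypothetical residual function, for illustration only.
out = residual_block([1.0, -2.0, 3.0], lambda v: [0.5 * t for t in v])

# If F collapses to zero, the block passes non-negative inputs through unchanged,
# which is why adding such blocks cannot make optimization harder.
same = residual_block([1.0, 2.0, 3.0], lambda v: [0.0] * len(v))
```

Pushing F toward zero is easier for a solver than learning an identity mapping from scratch, which is the paper's explanation for why degradation disappears.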
Inception V4 mainly uses residual connections, i.e. the ResNet idea, to improve the V3 structure. It demonstrates that combining Inception modules with residual connections can greatly speed up training while also improving performance, yielding the Inception-ResNet V2 network; at the same time a deeper and more optimized pure Inception V4 model is designed, which reaches performance comparable to Inception-ResNet V2.
ResNeXt is an extreme version of ResNet; the name stands for "the next dimension". The ResNeXt paper demonstrates that increasing the cardinality (the number of parallel modules inside a ResNeXt block) works better than increasing width or depth; compared with ResNet it has fewer parameters, better results, a simpler structure, and is more convenient to design.
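The parameter saving from grouping can be sketched as follows; the kernel size and channel counts are hypothetical, chosen only to show how grouping divides the cost:

```python
def grouped_conv_params(k, c_in, c_out, groups=1):
    """Weight count of a k x k convolution split into `groups` groups:
    each group maps c_in/groups input channels to c_out/groups outputs."""
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * k * k * (c_in // groups) * (c_out // groups)

dense = grouped_conv_params(3, 256, 256, groups=1)
grouped = grouped_conv_params(3, 256, 256, groups=32)  # cardinality 32

print(dense, grouped)  # grouping by g divides the parameter count by g
```

This is how ResNeXt raises cardinality without raising the parameter budget: the freed parameters can be spent on more parallel paths.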
Xception is the extreme version of the Inception family of networks. The most important method the author proposes is the depthwise separable convolution, which is also reflected later in MobileNet; the core idea is to separate the spatial transformation from the channel transformation. The difference from Inception V3 lies in the order: Inception V3 first applies the 1x1 convolution and then the 3x3 convolution, i.e. it merges channels first (channel convolution) and then performs the spatial convolution; Xception does the opposite, applying the 3x3 spatial convolution first and then the 1x1 channel convolution. The differences are as follows:
MobileNet is in fact an application of the Xception idea. The difference is that Xception focuses on improving accuracy, while MobileNet focuses on compressing the model while preserving accuracy. The idea of the depthwise separable convolution is to decompose a standard convolution into a depthwise convolution and a pointwise convolution; it can be simply understood as a factorization of the convolution. The specific steps are shown on the left of the figure below, and the difference between the depthwise separable convolution block and the traditional convolution block is shown on the right:
Suppose the input feature map has size DF x DF with M channels, the filters have size DK x DK with N output channels, and assume stride 1 with padding so the output keeps size DF x DF. Then the original convolution requires DK*DK*M*N*DF*DF multiply-accumulate operations, and its kernels have DK*DK*M*N parameters. A depthwise separable convolution requires DK*DK*M*DF*DF + M*N*DF*DF operations, and its kernels have DK*DK*M + M*N parameters. Because the convolution process mainly reduces the spatial dimensions while increasing the channel dimension (N > M, and N is typically large), DK*DK*M*N is far larger than DK*DK*M + M*N; the cost reduction factor is 1/N + 1/(DK*DK).
Therefore, depthwise separable convolutions greatly compress both the model size and the amount of computation, making the model fast, cheap to compute, and still accurate.
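The reduction can be computed directly from the formulas above; the layer shape below is hypothetical:

```python
def standard_cost(dk, m, n, df):
    """Multiply-adds of a standard dk x dk convolution (stride 1, same padding)."""
    return dk * dk * m * n * df * df

def separable_cost(dk, m, n, df):
    """Depthwise (dk x dk per channel) plus pointwise (1x1) multiply-adds."""
    return dk * dk * m * df * df + m * n * df * df

# Hypothetical layer shape, for illustration only.
dk, m, n, df = 3, 64, 128, 56
ratio = separable_cost(dk, m, n, df) / standard_cost(dk, m, n, df)

print(ratio)  # equals 1/n + 1/dk**2: roughly an 8-9x saving for 3x3 kernels
```

With 3x3 kernels the 1/(DK*DK) term dominates, so the saving approaches 9x regardless of the channel counts.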
ShuffleNet makes one improvement on the basis of MobileNet: MobileNet only makes the 3x3 convolution depthwise, while the 1x1 convolutions are still traditional convolutions and therefore still contain a lot of redundancy. ShuffleNet applies shuffle and group operations to the 1x1 convolutions, realizing channel shuffle and pointwise group convolution; in the end both speed and accuracy improve over MobileNet. The specific structure is shown in the figure below:
(a) is the original MobileNet-style framework; there is no information exchange between the groups.
(b) applies a shuffle operation to the feature map.
(c) is the result after channel shuffle.
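The channel shuffle itself is just a reshape-transpose-flatten of the channel index, which can be sketched in a few lines (channels are represented here by their indices):

```python
def channel_shuffle(channels, groups):
    """Reorder a flat list of channels: view it as (groups, n_per_group),
    transpose, and flatten, so the next group convolution sees channels
    drawn from every group."""
    n = len(channels) // groups
    grid = [channels[g * n:(g + 1) * n] for g in range(groups)]
    return [grid[g][i] for i in range(n) for g in range(groups)]

print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))  # [0, 3, 1, 4, 2, 5]
```

Because it is a pure permutation, the operation adds no parameters and negligible computation, yet restores information flow between groups.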
ShuffleNet also uses the idea of group convolution very effectively. Indirectly, it suggests that an efficient neural network design should be grouped, rather than a fully connected Conv or InnerProduct layer: similar information should be shared rather than extracted repeatedly, while different groups take on different functions, and these functions can be learned. The lesson is that information needs to be condensed.
DenseNet is the best paper of the latest CVPR 2017. Although it is based on ResNet, the difference is that it maximizes the flow of information between all layers in the network: the authors connect all layers, so that each layer receives the features of all preceding layers as input. Because of the large number of dense connections, the authors call this network structure DenseNet; it is shown on the left in the figure below:
It has two main features:
1. To some extent it alleviates the problem of vanishing gradients during training. As the left figure shows, during back-propagation each layer receives gradient signals from all subsequent layers, so the gradients near the input layer no longer shrink as the network deepens.
2. Because a large number of features are reused, a small number of convolution kernels can generate a large number of features, and the final model is also relatively small.
A complete DenseNet structure is as follows:
The main points of the network design are described as follows:
1. For feature reuse, the cross-layer connections use a Concatenate operation along the feature dimension, rather than an element-wise addition.
2. Because no element-wise operation is needed, there is no need for a 1x1 convolution at the end of each unit module to make the number of feature layers consistent with the input feature dimension.
3. The units are designed with pre-activation, moving the BN operation from the main branch to before the branch: BN -> ReLU -> 1x1 Conv -> BN -> ReLU -> 3x3 Conv.
4. Because each layer receives the features of all preceding layers as input, the feature dimension of the later layers would otherwise grow too fast as layers accumulate. Therefore, when subsampling after each stage, a convolution layer first compresses the feature dimension to half of the current input before the pooling operation.
5. The setting of the growth rate. The growth rate k is the number of kernels of the last 3x3 convolution in each module. Because the unit modules are connected by Concatenate, each unit module increases the feature dimension of the next stage by k. The larger k is, the more information circulates in the network and the more powerful the network is, but the size and computation of the whole model also increase. The authors use the two settings k=32 and k=48.
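The growth of feature dimensions inside a dense block follows directly from the concatenation rule; a small sketch with hypothetical numbers (64 input channels, growth rate k=32, 4 layers):

```python
def dense_block_dims(c0, k, layers):
    """Channel count seen at the input of each layer in a dense block:
    every layer concatenates all previous outputs, so the width grows
    by the growth rate k per layer."""
    return [c0 + k * i for i in range(layers + 1)]

print(dense_block_dims(64, 32, 4))  # [64, 96, 128, 160, 192]
```

This linear growth is exactly why point 4 above compresses the dimension by half at each transition between stages.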
3. Summary and outlook
This paper starts from ResNet, combs through the excellent network structure designs before and after it, and summarizes their core points. Overall, more and more hand-designed networks have appeared at conferences in recent years, step by step replacing the traditional "simple and deep" thinking, and more and more attention is paid to model compression and optimization for recognition and classification, the two most important problems in computer vision. The goal is not only lower error, higher mAP, and better convergence curves, but also less computation in space and cost. From MobileNet we can also see that, by building better blocks that reduce the internal redundancy of convolutional layers and improve both computational and network performance, the industry has made it possible for deep learning to move toward mobile devices.
References:
Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions[C]// IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2015.
Ioffe S, Szegedy C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift[J]. 2015:448-456.
Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the Inception Architecture for Computer Vision[J]. 2015:2818-2826.
He K, Zhang X, Ren S, et al. Deep Residual Learning for Image Recognition[J]. 2015.
Szegedy C, Ioffe S, Vanhoucke V, et al. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning[J]. 2016.
Xie S, Girshick R, Dollar P, et al. Aggregated Residual Transformations for Deep Neural Networks[J]. 2016.
Chollet F. Xception: Deep Learning with Depthwise Separable Convolutions[C]// IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2017:1800-1807.
Howard A G, Zhu M, Chen B, et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications[J]. 2017.
Zhang X, Zhou X, Lin M, et al. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices[J]. 2017.
Huang G, Liu Z, Maaten L V D, et al. Densely Connected Convolutional Networks[J]. 2017.