A summary of the evolutionary history of lightweight network model optimization: Inception V1-4, ResNet, Xception, ResNeXt, MobileNet, ShuffleNet, DenseNet
From the lightweight classic LeNet to deeper structures such as AlexNet and VGG, networks have greatly improved accuracy on both recognition and classification, but depth has also brought many problems: vanishing or exploding gradients during training, overfitting, underfitting, poor generalization, and accuracy degradation. Computation time and memory cost also climb sharply as networks deepen. Although GPU performance keeps improving, the most effective fix is at the level of the algorithm itself. In addition, one current research direction is to bring deep learning to mobile devices, which requires the academic community to rethink the structure of the network block. The two main directions of network compression are transfer learning (for example MobileID) and network sparsification. In practice, many excellent networks have since been created, such as ResNet, Inception V2-4, Xception, ResNeXt, MobileNet, ShuffleNet, and DenseNet. This article introduces the main ideas behind these networks.
Keywords: Inception V1-4, ResNet, Xception, ResNeXt, MobileNet, ShuffleNet, DenseNet
LeNet5, born in 1994, is one of the earliest convolutional neural networks and helped drive the development of deep learning. Today, some network structures are built by simply stacking layers, while more are derived by modifying basic structures. Generally speaking, model simplification rests on the observation that Conv layers contain a large amount of internal redundancy that can be compressed. The methods commonly used in academia are mainly sparse parameters, matrix decomposition, depthwise convolution, and group convolution. Starting from ResNet, the networks Inception V4, Xception, ResNeXt, MobileNet, and ShuffleNet, as well as DenseNet (the CVPR 2017 best paper), all borrow ResNet's idea of a shortcut path, described as an "inspired passage". This literature review takes ResNet as its starting point and reads through a series of recent, excellent papers on network structure optimization. The main body describes in detail each network's main contributions, core ideas, and structural changes, and traces their evolutionary history.
Two, An introduction to the evolutionary history of optimized models:
Inception V1 is the network we know as GoogLeNet. Its core idea is to increase the depth and width of the network to improve CNN performance. However, this means a large number of parameters, which easily leads to overfitting and greatly increases the amount of computation.
GoogLeNet holds that the fundamental way to address these two shortcomings is to convert fully connected layers, and even general convolutions, into sparse connections. On the one hand, connections in biological nervous systems are sparse. On the other hand, the literature shows that for large-scale sparse neural networks, an optimal network can be constructed layer by layer by analyzing the statistical characteristics of activation values and clustering highly correlated outputs. This suggests that a bloated sparse network may be simplified without loss of performance.
Although the mathematical proof holds only under strict conditions, the Hebbian principle strongly supports this: fire together, wire together.
Earlier, to break network symmetry and improve learning ability, traditional networks used random sparse connections. However, computer software and hardware handle non-uniform sparse data very inefficiently, so AlexNet re-enabled fully connected layers in order to better exploit parallel computation. GoogLeNet set out to find a method that maintains the sparsity of the network structure while still exploiting the high computing performance of dense matrices.
The central idea of the Inception module is to approximate the sparse structure with several dense submatrices, reducing parameters while using computing resources more efficiently.
Within the same layer, convolution templates of different sizes (1*1, 3*3, 5*5) extract features over receptive fields of different sizes. The whole network becomes wider and requires more computation, but it does not get deeper.
Specifically, a 1*1 convolution is applied before the 3*3 and 5*5 convolutions to reduce the number of input channels, so the 1*1 convolution kernel acts as dimensionality reduction. In addition, as the features extracted higher in the network become more abstract and involve larger receptive fields, the proportion of 3*3 and 5*5 convolutions is increased.
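To make the dimensionality-reduction effect concrete, the following sketch counts the weights of a 5*5 convolution with and without a preceding 1*1 convolution. The channel sizes (192 in, 16 after reduction, 32 out) and the helper `conv_params` are assumptions chosen for illustration, and biases are ignored.

```python
def conv_params(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


# Hypothetical channel sizes, chosen only for illustration.
c_in, c_mid, c_out = 192, 16, 32

# 5*5 convolution applied directly to all input channels.
direct = conv_params(5, c_in, c_out)

# 1*1 convolution reduces the channels first, then the 5*5 convolution runs
# on the reduced feature map.
reduced = conv_params(1, c_in, c_mid) + conv_params(5, c_mid, c_out)

print("direct 5*5:", direct)
print("1*1 then 5*5:", reduced)
print("ratio:", reduced / direct)
```

With these assumed sizes the reduced path needs roughly a tenth of the weights of the direct 5*5 convolution.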
The core structure of the network is shown in the figure below:
In the end, GoogLeNet is 12 times smaller than AlexNet and 3 times smaller than VGG-16. It was a very good network at the time, but research did not stop there.
The main contribution of Inception V2 is batch normalization, whose main purpose is to speed up training. During training, the continual change of parameters changes the input distribution of each layer, and learning must make each layer adapt to its input distribution, so the learning rate has to be lowered and the initialization chosen carefully. The authors call this change of distribution internal covariate shift.
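A minimal sketch of the normalization step may help here. It normalizes one feature over a mini-batch to zero mean and unit variance and then applies a scale and a shift. The function name `batch_norm`, the default `eps`, and the sample batch are illustrative assumptions, and the running statistics used at inference time are left out.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch (training-mode statistics only)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    # gamma and beta stand in for the learnable scale and shift.
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


batch = [1.0, 2.0, 3.0, 4.0]  # hypothetical activations of one feature
normalized = batch_norm(batch)

out_mean = sum(normalized) / len(normalized)
out_var = sum((v - out_mean) ** 2 for v in normalized) / len(normalized)
print(normalized)
print("mean:", out_mean, "variance:", out_var)
```

Whatever the scale of the incoming activations, the next layer sees inputs with a stable mean and variance, which is what allows a larger learning rate.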
The network structure also changed: two stacked 3*3 layers replace a 5*5 convolution. Compared with V1, there are fewer parameters and less computation, but more layers and better results, as follows:
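As a rough check on the claim above, this sketch compares one 5*5 convolution with two stacked 3*3 convolutions. The channel count of 64 and the helper names are assumptions for illustration, and biases are ignored.

```python
def conv_weights(kh, kw, c_in, c_out):
    """Number of weights in a kh x kw convolution (bias ignored)."""
    return kh * kw * c_in * c_out


def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions."""
    return 1 + sum(k - 1 for k in kernel_sizes)


channels = 64  # hypothetical, same width in and out

single_5x5 = conv_weights(5, 5, channels, channels)
stacked_3x3 = 2 * conv_weights(3, 3, channels, channels)

print("5*5:", single_5x5, "two 3*3:", stacked_3x3)
print("receptive field of two 3*3 layers:", receptive_field([3, 3]))
```

Two 3*3 layers cover the same 5*5 receptive field with 18/25 of the weights, at the price of one extra layer.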
Inception V3 studies how to scale up the network while keeping computation efficient, and the paper puts forward some empirical rules for designing CNNs.
1, Avoid representational bottlenecks. A feature representation is the activation of an image at a given CNN layer, and its size should shrink slowly through the CNN.
2, High-dimensional features are easier to process: training on high-dimensional features is faster and converges more easily.
3, Spatial aggregation can be done over a low-dimensional embedding without much loss. The explanation is that adjacent units are strongly correlated, so their information is redundant.
4, Balance the depth and width of the network. With appropriate width and depth, the computational budget is distributed across the network in a balanced way.
The biggest change in network structure is the use of a 1*n convolution combined with an n*1 convolution to replace an n*n convolution. The structure is as follows:
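The saving from this factorization can be estimated with a short count. The values n = 7 and 64 channels, and the helper `conv_weights`, are assumptions chosen for illustration, and biases are ignored.

```python
def conv_weights(kh, kw, c_in, c_out):
    """Number of weights in a kh x kw convolution (bias ignored)."""
    return kh * kw * c_in * c_out


n = 7          # hypothetical kernel size
channels = 64  # hypothetical, same width in and out

full = conv_weights(n, n, channels, channels)

# A 1*n convolution followed by an n*1 convolution.
factorized = conv_weights(1, n, channels, channels) + conv_weights(n, 1, channels, channels)

print("n*n:", full)
print("1*n + n*1:", factorized)
print("ratio:", factorized / full)
```

The ratio is 2n / n^2 = 2/n, so the saving grows with the kernel size.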
The main problem ResNet solves is degradation in deep networks. The authors state clearly in the paper that, in deep learning, a conventional network is not simply better the deeper it is: beyond a certain depth accuracy begins to decline, and accuracy on the training set declines too, which shows that overfitting is not the cause.
ResNet adds an identity mapping, converting the original function to be learned, H(x), into F(x)+x. The authors argue that the two formulations are equally expressive but differ in optimization difficulty, and they hypothesize that F(x) is much easier to optimize than H(x). The idea derives from residual vector encoding in image processing: through a reformulation, a problem is decomposed into residual problems across several scales, which helps optimization. Detailed notes on the paper are given in the blog's paper notes.
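The reformulation H(x) = F(x) + x can be sketched in a few lines. The function names and the toy residual functions below are hypothetical, and a real block would use convolution layers for F.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where residual_fn plays the role of F."""
    fx = residual_fn(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_residual(x):
    """F(x) = 0: the block reduces to the identity mapping."""
    return [0.0 for _ in x]


def small_residual(x):
    """A toy F(x) that applies a small correction to the input."""
    return [0.5 * xi for xi in x]


x = [1.0, -2.0, 4.0]
identity_out = residual_block(x, zero_residual)
corrected_out = residual_block(x, small_residual)

print(identity_out)
print(corrected_out)
```

If the best mapping for a layer is close to the identity, the block only has to push F(x) toward zero, which is the intuition behind the hypothesis that F(x) is easier to optimize than H(x).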
The core block structure is as follows:
Inception V4 mainly uses residual connections, that is, ResNet's idea, to improve the V3 structure. The paper shows that combining Inception modules with residual connections greatly speeds up training and also improves performance, yielding the Inception-ResNet V2 network. It also designs a deeper and more optimized Inception v4 model that achieves performance comparable to Inception-ResNet V2.
ResNeXt is the ultimate version of ResNet, and the name stands for the next dimension. The ResNeXt paper shows that increasing cardinality (the number of parallel paths in a ResNeXt module) is more effective than increasing width or depth. It achieves better results than ResNet at a comparable parameter count, with a simple structure that is convenient to design.
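To illustrate how cardinality trades against width at a similar parameter budget, the sketch below counts the weights of a bottleneck block and of a block split into parallel paths. The sizes used (256 input channels, a 64-wide bottleneck, 32 paths of width 4) and the function names are assumptions for illustration and are not taken from the text above. Biases are ignored.

```python
def bottleneck_weights(c_in, width):
    """1*1 reduce, 3*3, 1*1 expand: weights of one bottleneck path."""
    return c_in * width + 3 * 3 * width * width + width * c_in


def aggregated_weights(c_in, cardinality, path_width):
    """Weights of `cardinality` parallel bottleneck paths, each path_width wide."""
    return cardinality * bottleneck_weights(c_in, path_width)


# Hypothetical sizes for illustration.
resnet_block = bottleneck_weights(256, 64)
resnext_block = aggregated_weights(256, 32, 4)

print("single wide path:", resnet_block)
print("32 narrow paths:", resnext_block)
```

Under these assumptions the two blocks have nearly the same number of weights, so any accuracy difference comes from how the capacity is arranged rather than from its amount.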
Xception is the ultimate version of the Inception family. The most important method the author proposes is depthwise separable convolution, which also appears later in MobileNet. The core idea is to separate the spatial transformation from the channel transformation. The difference from Inception V3 is the order: Inception V3 first does a 1*1 convolution and then a 3*3 convolution, that is, it merges channels first (channel convolution) and then does spatial convolution. Xception is the opposite: it first does a spatial 3*3 convolution and then a 1*1 convolution across channels. The difference is as follows:
MobileNets are in fact an application of the Xception idea. The difference is that the Xception paper focuses on improving accuracy, while MobileNets focus on compressing the model while preserving accuracy. The idea of depthwise separable convolutions is to decompose a standard convolution into a depthwise convolution and a pointwise convolution. A simple way to understand it is as a factorization of a matrix. The specific steps are shown on the left of the figure below.
The difference between a depthwise separable convolution block and a traditional convolution block is shown on the right of the figure below:
Suppose the input feature map has size DF * DF with M channels, the filter has size DK * DK with N output channels, and assume a padding of 1 and a stride of 1. Then the original convolution requires DK*DK*M*N*DF*DF multiply-accumulate operations, and its kernel has DK*DK*M*N parameters.
Depthwise separable convolutions require DK*DK*M*DF*DF + M*N*DF*DF operations, and their kernels have DK*DK*M + M*N parameters.
Convolution is mainly a process in which the spatial dimensions shrink and the channel dimension grows, that is, N>M. Therefore DK*DK*M*N > DK*DK*M + M*N.
Therefore, depthwise separable convolutions greatly compress both model size and computation, making the model fast, with low computing overhead and good accuracy.
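The operation and parameter counts of the two convolution types can be compared numerically. The sketch below uses the same symbols (DK, M, N, DF), with hypothetical values chosen only for illustration, and the function names are assumptions.

```python
def standard_conv_cost(dk, m, n, df):
    """Return (operations, parameters) of a standard convolution."""
    return dk * dk * m * n * df * df, dk * dk * m * n


def separable_conv_cost(dk, m, n, df):
    """Return (operations, parameters) of depthwise plus pointwise convolution."""
    ops = dk * dk * m * df * df + m * n * df * df
    params = dk * dk * m + m * n
    return ops, params


# Hypothetical layer: 3*3 kernel, 32 -> 64 channels, 112*112 feature map.
dk, m, n, df = 3, 32, 64, 112

std_ops, std_params = standard_conv_cost(dk, m, n, df)
sep_ops, sep_params = separable_conv_cost(dk, m, n, df)

print("standard:", std_ops, std_params)
print("separable:", sep_ops, sep_params)
print("ratio:", sep_ops / std_ops)
```

The ratio simplifies to 1/N + 1/(DK*DK), so with a 3*3 kernel the separable form costs a little over one ninth of the standard convolution.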
ShuffleNet makes one improvement on top of MobileNet. MobileNet only applies depthwise convolution to the 3*3 convolution, while the 1*1 convolution remains a traditional convolution and still contains a lot of redundancy. ShuffleNet therefore applies shuffle and group operations to the 1*1 convolution, implementing channel shuffle and pointwise group convolution, and ends up improving on MobileNet in both speed and accuracy.
The specific structure is shown in the figure below:
(a) The original MobileNet framework, in which no information is exchanged between groups.
(b) The feature map after a shuffle operation is applied.
(c) The result after channel shuffle.
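The channel shuffle step itself is a simple reordering, sketched here on a list of channel labels. The function name `channel_shuffle` and the labels are hypothetical; in a real network the same index pattern is applied to the channel axis of a feature map.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so each output group draws from every input group.

    Equivalent to reshaping to (groups, per_group), transposing, and flattening.
    """
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    return [
        channels[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]


# Two groups of three channels each (labels are illustrative).
channels = ["a0", "a1", "a2", "b0", "b1", "b2"]
shuffled = channel_shuffle(channels, groups=2)
print(shuffled)
```

After the shuffle, a following group convolution with two groups sees channels that originated in both input groups, which restores the exchange of information between groups.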
ShuffleNet also uses the idea of group convolution, to great effect. This indirectly suggests that an efficient neural network design should be grouped, rather than fully connected like a general Conv or InnerProduct layer: similar information should be shared and need not be extracted repeatedly, different groups serve different functions, and those functions can be learned. The lesson is that information needs to be condensed.
DenseNet is the CVPR 2017 best paper. Although it builds on ResNet, the difference is that, to maximize the flow of information between all layers, the authors connect all layers in the network so that each layer takes the features of all preceding layers as input. Because the network contains a large number of dense connections, the authors call this structure DenseNet. The structure is shown on the left in the figure below:
It has two main features:
1, To some extent it alleviates the vanishing-gradient problem during training. As the figure on the left shows, in back propagation each layer receives gradient signals from all subsequent layers, so the gradient near the input layer does not become smaller and smaller as the network gets deeper.
2, Because a large number of features are reused, a small number of convolution kernels can generate a large number of features, and the final model is relatively small.
The structure of a complete DenseNet is as follows:
The main points of the network design are as follows:
1, For feature reuse, cross-layer connections use a Concatenate operation along the feature dimension instead of an element-wise addition.
2, Because no element-wise operation is needed, there is no need for a 1x1 convolution at the end of each unit module to raise the number of feature channels to match the input feature dimension.
3, The unit is designed with pre-activation, moving the BN operation from after the main branch to before the branch (BN->ReLU->1x1Conv->BN->ReLU->3x3Conv).
4, Because each layer accepts the features of all previous layers as input, the feature dimension of later layers would grow too fast as layers are added. To avoid this, when subsampling after each stage, a convolution layer first compresses the feature dimension to half of the current input, and then the subsampling is performed.
5, Setting of the growth rate. The growth rate is the number of kernels in the last 3x3 convolution of each unit module, denoted k. Because each unit module ends with a Concatenate connection, the feature dimension of the next level grows by k after every unit module. The larger k is, the more information circulates in the network and the more powerful the network is, but the size and computation of the whole model also increase. The authors use two settings, k=32 and k=48.
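Points 4 and 5 can be traced with a small channel count. In this sketch the growth rate k=32 comes from the text, while the 64 input channels, the 6 unit modules, and the function names are assumptions for illustration.

```python
def dense_block_channels(c_in, growth_rate, num_layers):
    """Channel count seen after each unit module when outputs are concatenated."""
    return [c_in + i * growth_rate for i in range(num_layers + 1)]


def transition_channels(channels, compression=0.5):
    """Channels left after the compressing convolution between stages."""
    return int(channels * compression)


# k=32 is from the text; 64 input channels and 6 modules are hypothetical.
widths = dense_block_channels(c_in=64, growth_rate=32, num_layers=6)
after_transition = transition_channels(widths[-1])

print(widths)
print(after_transition)
```

Each unit module adds exactly k channels, so the width grows linearly inside a stage, and halving it between stages keeps the growth in check.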
Three, Summary and Prospect:
This article starts from ResNet, sorts through the excellent network designs that came before and after it, and summarizes their core points. In general, conferences in recent years have featured more and more hand-designed networks, which are gradually replacing the traditional idea that a network only needs to be simple and deep. More and more attention is being paid to model compression and optimization methods for recognition and classification, the two most important problems in computer vision. The goal is not only higher accuracy, higher mAP, and better convergence curves, but also lower cost in space and computation. As MobileNet also shows, as more frameworks are built, reducing the internal redundancy of convolution layers improves both computing performance and network performance, which in industry also makes it possible for deep learning to move to mobile devices.
Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions[C]// IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society,
Ioffe S, Szegedy C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift[J]. 2015:448-456.
Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the Inception Architecture for Computer Vision[J]. 2015:2818-2826.
He K, Zhang X, Ren S, et al. Deep Residual Learning for Image
Szegedy C, Ioffe S, Vanhoucke V, et al. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning[J]. 2016.
Xie S, Girshick R, Dollar P, et al. Aggregated Residual Transformations for Deep Neural Networks[J]. 2016.
Chollet F. Xception: Deep Learning with Depthwise Separable Convolutions[C]// IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2017:1800-1807.
Howard A G, Zhu M, Chen B, et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications[J]. 2017.
Zhang X, Zhou X, Lin M, et al. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices[J]. 2017.
Huang G, Liu Z, Maaten L V D, et al. Densely Connected Convolutional