DenseNet Learning Notes
Paper: "Densely Connected Convolutional Networks"
Caffe version, GitHub link: https://github.com/shicai/DenseNet-Caffe
The paper won the CVPR 2017 Best Paper award. It was written by Dr. Gao Huang, a postdoctoral researcher at Cornell University, and Zhuang Liu, an undergraduate at Tsinghua University; Zhuang Liu has open-sourced the DenseNet code on GitHub. Building on ResNet and several ResNet-derived networks, the paper innovates on the network model and achieves good results, and it is also easy to read. As the saying goes: "We all stand on the shoulders of giants."
2. Source of the innovative idea
DenseNet's idea is largely inspired by the analysis of stochastic depth networks (Deep Networks with Stochastic Depth). That work found that randomly dropping some layers at each training step can significantly improve ResNet's generalization performance. The success of this approach has at least two implications:
(1) A neural network does not have to be a strictly sequential hierarchy; that is, a layer can depend not only on the features of the immediately preceding layer, but also on features learned by much earlier layers.
(2) Randomly dropping many layers during training does not destroy the convergence of the algorithm, which shows that ResNet has obvious redundancy: each layer extracts only a few features (the so-called residual). In fact, randomly removing several layers from a trained ResNet has little impact on its predictions.
3. DenseNet's advantages:
(1) Alleviates the vanishing-gradient problem
(2) Strengthens feature propagation
(3) Makes more efficient use of features through reuse
(4) Reduces the number of parameters to some extent
4. DenseNet's core idea
4.1 ResNet introduction
In ResNet, the current output feature map is simply added element-wise to an earlier feature map, so the number of channels stays the same.
4.2 Dense block
Figure 1 shows a schematic of the classic dense block structure in DenseNet. The input of each layer in a dense block is the output of all previous layers; that is, each layer's output feature maps serve as input to all later layers. In formula (2), x_l = H_l([x0, x1, …, x_{l-1}]), where [x0, x1, …, x_{l-1}] denotes the concatenation of the feature maps output by layers 0 through l-1. Concatenation combines channels, as in Inception: the channels are stacked, so the channel count grows. The deeper a network is, the more likely gradients are to vanish, because input information and gradient information must pass through many layers; in DenseNet each layer is in effect directly connected to the input and to the loss, which alleviates gradient vanishing, so much deeper networks are no longer a problem.
Figure 1: part of the DenseNet network structure
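The difference between ResNet-style addition and DenseNet-style concatenation can be sketched with NumPy (the shapes here are toy values chosen for illustration):

```python
import numpy as np

# Two feature maps with the same shape: 64 channels, 8x8 spatial size
a = np.random.randn(64, 8, 8)
b = np.random.randn(64, 8, 8)

# ResNet-style shortcut: element-wise addition, channel count unchanged
res_out = a + b
print(res_out.shape)    # (64, 8, 8)

# DenseNet-style concatenation: channels are stacked, channel count grows
dense_out = np.concatenate([a, b], axis=0)
print(dense_out.shape)  # (128, 8, 8)
```

This is why a dense block's channel count keeps growing with depth, while a ResNet block's does not.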
4.3 Composite function
Adopting common CNN practice, the function H in formula (2) uses three consecutive operations: BN -> ReLU -> Conv(3×3). This is called a composite function.
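A minimal sketch of the composite function, assuming a single channel and a simplified batch norm with no learned scale/shift (NumPy only, not the paper's implementation):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # simplified BN: normalize to zero mean, unit variance (no learned params)
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def relu(x):
    return np.maximum(x, 0.0)

def conv3x3(x, w):
    # naive 3x3 convolution, zero padding, stride 1, single channel
    h, wd = x.shape
    padded = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(wd):
            out[i, j] = np.sum(padded[i:i+3, j:j+3] * w)
    return out

def composite_fn(x, w):
    # H in formula (2): BN -> ReLU -> Conv(3x3)
    return conv3x3(relu(batch_norm(x)), w)

x = np.random.randn(8, 8)
w = np.ones((3, 3)) / 9.0   # toy averaging kernel
y = composite_fn(x, w)
print(y.shape)  # (8, 8): padding keeps the spatial size unchanged
```

Note the pre-activation order (BN and ReLU before the convolution), which is what lets the concatenated inputs be normalized before each layer uses them.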
4.4 Transition layer
The concatenation in formula (2) changes the number of channels, and in a CNN the conv and pooling layers downsample for dimensionality reduction. As shown in Figure 2, the paper introduces a transition layer to cascade multiple dense blocks, so that the feature map size inside each dense block stays uniform and concatenation never runs into mismatched sizes. The transition layer consists of a 1×1 conv and a 2×2 pooling layer.
Figure 2: overall DenseNet network structure
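A NumPy sketch of the transition layer's two operations, with assumed channel counts (64 in, 32 out) purely for illustration:

```python
import numpy as np

def conv1x1(x, w):
    # x: (C_in, H, W); w: (C_out, C_in). A 1x1 conv only mixes channels.
    return np.tensordot(w, x, axes=([1], [0]))

def avg_pool2x2(x):
    # x: (C, H, W) with even H, W; 2x2 average pooling halves spatial size
    c, h, wd = x.shape
    return x.reshape(c, h // 2, 2, wd // 2, 2).mean(axis=(2, 4))

x = np.random.randn(64, 32, 32)    # 64 feature maps of size 32x32
w = np.random.randn(32, 64) * 0.1  # 1x1 conv reducing 64 -> 32 channels
y = avg_pool2x2(conv1x1(x, w))
print(y.shape)  # (32, 16, 16): fewer channels, half the spatial size
```

The 1×1 conv handles the channel reduction and the pooling handles the spatial downsampling, so the next dense block starts from a smaller tensor on both axes.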
4.5 Growth rate
If each composite function H outputs k feature maps, then layer ℓ has k0 + k×(ℓ−1) input feature maps, where k0 is the number of channels of the input layer. In the paper, k is called the growth rate. One advantage of DenseNet over ResNet is that k can be set very small, making the network narrower with fewer parameters: in a dense block, the number k of feature maps output by each convolutional layer is very small (less than 100), instead of the hundreds or thousands used as layer widths in other networks. In addition, the dense block has a regularizing effect and therefore suppresses overfitting to some extent; I think the main reason is that the narrow channels reduce the parameter count, which in turn reduces overfitting.
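The growth-rate formula can be checked directly with a few lines (k0 = 64 is an assumed input width, not a value from the paper):

```python
def dense_block_input_channels(k0, k, layer):
    # input feature maps to layer l of a dense block: k0 + k * (l - 1)
    return k0 + k * (layer - 1)

# assumed example: k0 = 64 input channels, growth rate k = 32
for l in (1, 2, 3, 32):
    print(l, dense_block_input_channels(64, 32, l))
# layer 32 sees 64 + 32*31 = 1056 input feature maps
```

Even with a small k, the input width grows linearly with depth, which is exactly what motivates the bottleneck and compression tricks in the next section.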
4.6 Innovations to reduce parameters
In DenseNet, the concatenation links in a dense block make each layer's input, as well as the feature maps passed between dense blocks, very large, leading to too many parameters. To address this, the paper proposes bottleneck layers and compression (applied in the transition layer); the resulting network structure is shown in Figure 3.
Figure 3: DenseNet-BC network structure
(1) Bottleneck layer (bottleneck layers)
In each dense block, the paper adds a 1×1 conv layer before every 3×3 conv layer; this is called a bottleneck layer. The goal is to reduce the number of input feature maps: it lowers the dimensionality and the computation, and also fuses the features across channels, forming the structure BN -> ReLU -> Conv(1×1) -> BN -> ReLU -> Conv(3×3). This variant is called DenseNet-B.
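A rough count of multiplications per output pixel shows why the 1×1 bottleneck helps (the 1000 input channels are an assumed round figure for a deep position in a dense block):

```python
def conv_mults(c_in, c_out, k):
    # multiplications per output pixel for a k x k convolution layer
    return c_in * c_out * k * k

growth = 32
c_in = 1000  # assumed channel count deep inside a dense block

direct = conv_mults(c_in, growth, 3)            # 3x3 conv applied directly
with_bneck = (conv_mults(c_in, 4 * growth, 1)   # 1x1 conv down to 4k = 128
              + conv_mults(4 * growth, growth, 3))
print(direct, with_bneck)  # 288000 164864
```

The bottleneck version does less work even though it adds a layer, because the expensive 3×3 conv now sees only 4k channels instead of ~1000.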
(2) Compression (compression)
To improve model compactness, the paper reduces the number of feature maps in the transition layer by introducing a parameter θ. Suppose a dense block outputs m feature maps; the transition layer then outputs ⌊θm⌋ feature maps, where 0 < θ ≤ 1. The paper takes θ = 0.5; when θ < 1 the structure is called DenseNet-C.
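The compression step is just a floor of θ·m; a tiny sketch (the m values are assumed examples):

```python
import math

def transition_channels(m, theta=0.5):
    # DenseNet-C compression: the transition layer emits floor(theta * m) maps
    return math.floor(theta * m)

print(transition_channels(1024))       # 512
print(transition_channels(1280, 0.5))  # 640
```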
Note: a few more words here on the bottleneck and transition layer operations.
Each dense block contains many substructures. Take Dense Block (3) of DenseNet-169 as an example: it contains 32 pairs of 1×1 and 3×3 conv operations, so the input to the 32nd substructure is the output of the previous 31 layers. Each layer outputs 32 channels (the growth rate). Without the bottleneck (the added 1×1 conv), the input to the 32nd layer's 3×3 conv would be 31×32 channels plus the channels output by the previous dense block, nearly 1000 in total. With the 1×1 conv introduced, its channel count in the paper is growth rate × 4, i.e., 128, which then serves as the input to the 3×3 conv. This greatly reduces the amount of computation; that is the bottleneck.
As for the transition layer, it sits between two dense blocks: each dense block ends with a large number of output channels, which need a 1×1 conv for dimensionality reduction. Again take Dense Block (3) of DenseNet-169 as an example. Although the 32nd layer's 3×3 conv outputs only 32 channels (the growth rate), that output is then concatenated with all the previous layers' channels, i.e., the 32nd layer's output is concatenated with the 32nd layer's input. As noted above, the 32nd layer's input is around 1000 channels, so the final output of the dense block is 1000-odd channels. The transition layer therefore has a parameter θ (ranging from 0 to 1) that indicates by how much these outputs are reduced; the paper uses 0.5, so the channel count is halved before being passed to the next dense block. That is the role of the transition layer.
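The channel bookkeeping in this DenseNet-169 example can be reproduced in a few lines (the 256 channels assumed to enter Dense Block (3) are an illustrative figure):

```python
def block_output_channels(c_in, k, n_layers):
    # each of the n_layers contributes k feature maps via concatenation
    return c_in + k * n_layers

# assumed: 256 channels entering Dense Block (3), growth rate k = 32, 32 layers
last_layer_in = block_output_channels(256, 32, 31)  # input to the 32nd layer
block_out = block_output_channels(256, 32, 32)      # output of the whole block

print(last_layer_in)   # 1248 -- the "around 1000" channels mentioned above
print(block_out)       # 1280
print(block_out // 2)  # 640 after the transition layer with theta = 0.5
```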
4.7 Experimental comparison
Figure 4 shows the comparison of DenseNet-BC and ResNet on the ImageNet dataset. The left plot compares parameter complexity against error rate: at the same error rate DenseNet needs far fewer parameters, and at the same parameter complexity it achieves a lower error rate, so the improvement is clear. The right plot compares flops (which can be understood as computational complexity) against error rate, with the same conclusion.
Figure 4: comparison of experimental results