# DenseNet Learning Notes

A summary of DenseNet.

Paper: "Densely Connected Convolutional Networks"

Paper link: <https://arxiv.org/pdf/1608.06993.pdf>

Code on GitHub: <https://github.com/liuzhuang13/DenseNet>

Caffe version on GitHub: <https://github.com/shicai/DenseNet-Caffe>

1. Summary

The paper won the CVPR 2017 Best Paper award. It was written by Gao Huang, a postdoctoral researcher at Cornell University, and Zhuang Liu, then an undergraduate at Tsinghua University; Zhuang Liu has open-sourced the DenseNet code on GitHub. Building on ResNet and several of its derivative networks, the paper innovates on the network architecture and achieves strong results, and it is easy to follow. It is a good illustration of the saying "we all stand on the shoulders of giants."

2. Where the idea came from

The idea behind DenseNet is largely derived from an analysis of stochastic depth ("Deep Networks with Stochastic Depth"), which found that randomly dropping some layers at each training step can significantly improve the generalization of ResNet. The success of that approach suggests at least two things:

(1) A neural network does not have to be a strictly hierarchical structure: a layer can depend not only on the features of the immediately preceding layer, but also on features learned much earlier.

(2) Randomly discarding many layers during training does not break the convergence of the algorithm, which shows that ResNet has considerable redundancy: each layer extracts only a few new features (the so-called residual). In fact, randomly removing several layers from a trained ResNet has little effect on its predictions.

3. DenseNet advantages:

(1) Alleviates the vanishing-gradient problem

(2) Strengthens feature propagation

(3) Encourages feature reuse

(4) Reduces the number of parameters to some extent

4. DenseNet core ideas

4.1 ResNet recap

ResNet simply adds the current layer's output feature map to an earlier layer's output feature map element-wise, so the number of channels stays the same.
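The difference between ResNet's addition and DenseNet's concatenation can be seen in a small NumPy shape sketch (illustrative only; the channel counts are made up):

```python
import numpy as np

# Toy feature maps: (channels, height, width). Shapes are illustrative only.
prev = np.random.randn(64, 8, 8)   # output of an earlier layer
curr = np.random.randn(64, 8, 8)   # output of the current layer

# ResNet: element-wise addition -- channel count stays the same.
res_out = prev + curr
print(res_out.shape)               # (64, 8, 8)

# DenseNet: channel-wise concatenation -- channel count grows.
dense_out = np.concatenate([prev, curr], axis=0)
print(dense_out.shape)             # (128, 8, 8)
```

Because addition requires matching shapes, ResNet keeps the channel count fixed, while concatenation lets DenseNet's channel count grow with depth.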

4.2 Dense block

Figure 1 shows the classic dense block structure in DenseNet. The input to each layer in a dense block is the output of all preceding layers; that is, every layer's output feature maps are fed as input to all later layers. In Equation 2 of the paper,

x_l = H_l([x0, x1, ..., x_{l-1}])

[x0, x1, ..., x_{l-1}] denotes the concatenation of the feature maps output by layers 0 through l-1. Concatenation here means stacking along the channel dimension, as in Inception, so the number of channels grows with depth. The deeper a network is, the more prone it is to vanishing gradients, because input and gradient information must pass through many layers; in DenseNet every layer is effectively connected directly to the input and to the loss, which mitigates the vanishing-gradient problem and makes much deeper networks unproblematic.

Figure 1: Dense block structure in DenseNet
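A minimal NumPy sketch of this connectivity pattern (H here is a stand-in random projection, not the paper's composite function; all sizes are made up):

```python
import numpy as np

def H(x, k):
    """Stand-in for the composite function H_l: maps any number of input
    channels to k output channels (here just a random 1x1 projection + ReLU)."""
    w = np.random.randn(k, x.shape[0])
    return np.maximum(np.tensordot(w, x, axes=1), 0.0)

k0, k, L = 16, 4, 5                    # input channels, growth rate, layers
outputs = [np.random.randn(k0, 8, 8)]  # x0: the block's input

for l in range(1, L + 1):
    x = np.concatenate(outputs, axis=0)  # [x0, x1, ..., x_{l-1}]
    outputs.append(H(x, k))              # x_l = H_l([x0, ..., x_{l-1}])

block_out = np.concatenate(outputs, axis=0)
print(block_out.shape[0])              # k0 + L*k = 16 + 5*4 = 36
```

Each iteration concatenates everything produced so far, so layer l sees k0 + k*(l-1) input channels; the block's final output carries all layers' features.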

4.3 Composite function (composite function)

Following common practice in CNNs, the function H in Equation 2 is defined as three consecutive operations: BN -> ReLU -> Conv(3x3). This is called the composite function.
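As a toy illustration of the operation order, here is the same BN -> ReLU -> Conv chain on a 1-D NumPy signal (the paper uses 2-D 3x3 convolutions and learned BN parameters; only the ordering is the point here):

```python
import numpy as np

def composite(x, kernel):
    """BN -> ReLU -> Conv on a 1-D signal (single-feature toy version)."""
    x = (x - x.mean()) / (x.std() + 1e-5)       # BatchNorm: normalize
    x = np.maximum(x, 0.0)                      # ReLU
    return np.convolve(x, kernel, mode="same")  # 'same' padding keeps length

x = np.array([1.0, -2.0, 3.0, 0.5, -1.0, 2.0])
y = composite(x, kernel=np.ones(3) / 3.0)       # size-3 averaging kernel
print(y.shape)                                  # (6,)
```

Note that the 'same' padding keeps the spatial size unchanged, which is what allows the later concatenations inside a dense block.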

4.4 Transition layer (Transition Layer)

The concatenation in Equation 2 changes the number of channels, while in a CNN the conv and pooling layers can downsample for dimensionality reduction. As shown in Figure 2, the paper introduces transition layers to cascade multiple dense blocks: the feature maps inside one dense block must share the same spatial size so that concatenation is well defined, so downsampling is done between blocks instead. A transition layer consists of a 1x1 conv followed by a 2x2 pooling layer.

Figure 2: Overall DenseNet architecture
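A NumPy sketch of the transition layer's shape bookkeeping (random weights and made-up sizes; this only demonstrates how channels and spatial size change):

```python
import numpy as np

def transition(x, out_channels):
    """1x1 conv (channel mixing) followed by 2x2 average pooling."""
    m, h, w = x.shape
    w1x1 = np.random.randn(out_channels, m)
    x = np.tensordot(w1x1, x, axes=1)     # 1x1 conv -> (out_channels, h, w)
    # 2x2 average pooling: halve each spatial dimension.
    return x.reshape(out_channels, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

x = np.random.randn(36, 8, 8)       # output of a dense block
y = transition(x, out_channels=18)  # e.g. halving the channels
print(y.shape)                      # (18, 4, 4)
```

The 1x1 conv controls the channel count handed to the next block, and the pooling halves the spatial resolution.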

4.5 Growth rate (Growth rate)

If each composite function H outputs k feature maps, then layer l receives k0 + k×(l−1) input feature maps, where k0 is the number of channels of the block's input. The paper calls k the growth rate. One advantage of DenseNet over ResNet is that k can be set very small, making the network narrower with fewer parameters: the number of feature maps k output by each conv layer inside a dense block is small (under 100), rather than the hundreds or thousands of channels other networks use. This dense-block design also has a regularizing effect and therefore suppresses overfitting to some degree; a likely reason is that the narrow channels reduce the parameter count, which in turn reduces overfitting.
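The channel arithmetic can be checked directly (k0 and k here are illustrative values, not the paper's settings):

```python
k0, k = 24, 12  # illustrative: 24 block-input channels, growth rate k = 12

# Layer l of a dense block receives k0 + k*(l-1) input feature maps.
inputs = [k0 + k * (l - 1) for l in range(1, 6)]
print(inputs)   # [24, 36, 48, 60, 72]
```

Input width grows linearly in the depth of the block, with slope k, which is why a small growth rate keeps the network narrow.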

4.6 Innovations for reducing parameters

Inside a dense block, the concatenation links make each layer's input, and the feature maps passed between dense blocks, grow large, so the parameter count becomes excessive. To address this, the paper proposes bottleneck layers and compression (applied in the transition layers); the final network structure is shown in Figure 3.

Figure 3: DenseNet-BC network structure

(1) Bottleneck layer (Bottleneck layers)

In each dense block, the paper inserts a 1x1 conv (convolution) layer before every 3x3 conv layer; this is called a bottleneck layer. Its goal is to reduce the number of input feature maps, which cuts both dimensionality and computation while also fusing the features of each channel. This yields the structure BN -> ReLU -> Conv(1x1) -> BN -> ReLU -> Conv(3x3), referred to as DenseNet-B.
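A quick weight count shows why the bottleneck helps, assuming an input of 1000 channels and growth rate k = 32 (illustrative figures; BN parameters and biases are ignored):

```python
c_in, k = 1000, 32

# 3x3 conv applied directly to the full concatenated input:
direct = c_in * k * 3 * 3
# Bottleneck: 1x1 conv down to 4k channels, then the 3x3 conv:
bottleneck = c_in * (4 * k) * 1 * 1 + (4 * k) * k * 3 * 3

print(direct)      # 288000
print(bottleneck)  # 164864
```

The saving grows with the input width, and the 3x3 conv's cost becomes independent of how many layers came before it.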

(2) Compression (Compression)

To improve the compactness of the model, the paper also reduces the number of feature maps in the transition layers. A parameter θ is introduced: if a dense block outputs m feature maps, the following transition layer outputs ⌊θm⌋ of them, with 0 < θ ≤ 1. The paper uses θ = 0.5; when θ < 1 the structure is called DenseNet-C.

Note: a closer look at the bottleneck and transition layer operations.

Each dense block contains many substructures. Take Dense Block (3) of DenseNet-169 as an example: it contains 32 pairs of 1x1 and 3x3 conv operations, so the input to the 32nd substructure is the output of the preceding 31 layers, and each layer outputs 32 channels (the growth rate). Without the bottleneck (i.e. without the 1x1 conv), the input to the 32nd layer's 3x3 conv would be 31×32 channels plus the channels carried over from the previous dense block, which is close to 1000. With the 1x1 conv inserted, its output channel count is set to 4× the growth rate, i.e. 128, and that becomes the input to the 3x3 conv. This greatly reduces the amount of computation; that is the bottleneck.
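The arithmetic in this note, sketched in Python (c_prev, the channel count carried in from the previous block, is an assumed illustrative value, not taken from the paper's table):

```python
k = 32        # growth rate of DenseNet-169
layers = 32   # substructures in Dense Block (3)

# Without a bottleneck, layer 32's 3x3 conv sees the concatenation of the
# previous 31 layers plus the channels from the previous block.
c_prev = 128  # assumed value for illustration
layer32_in = (layers - 1) * k + c_prev
print(layer32_in)   # 1120 -- "close to 1000", as the note says

# With the bottleneck, the 1x1 conv first reduces this to 4*k channels:
print(4 * k)        # 128
```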

As for the transition layer, it sits between two dense blocks. Because each dense block ends with a large number of channels, a 1x1 conv is needed to reduce the dimensionality. Again taking Dense Block (3) of DenseNet-169 as an example: although the 32nd layer's 3x3 conv outputs only 32 channels (the growth rate), its output is then concatenated with the outputs of all previous layers, i.e. the 32nd layer's output is concatenated with the 32nd layer's input. Since, as noted above, that input is around 1000 channels, the final output of the Dense Block is also over 1000 channels. The transition layer therefore has a parameter θ (ranging from 0 to 1) that specifies by how much these outputs are reduced; the paper uses 0.5, so the channel count is halved before being passed to the next Dense Block. That is the role of the transition layer.

4.7 Experimental comparison

Figure 4 shows the comparison between DenseNet-BC and ResNet on the ImageNet dataset. The left plot compares parameter count against error rate: at the same error rate DenseNet-BC uses noticeably fewer parameters, and at the same parameter count it reaches a lower error rate, a clear improvement. The right plot compares FLOPs (roughly, computational cost) against error rate, with a similarly clear advantage.

Figure 4: Comparison of experimental results