Author: Zhang Hao
[Editor's introduction]
The author is from the LAMDA group (Institute of Machine Learning and Data Mining), Department of Computer Science, Nanjing University. This article reviews the application of deep learning to four basic computer vision tasks: image classification, localization, detection, and semantic and instance segmentation.

This article introduces the application of deep learning to four basic computer vision tasks: classification (Figure a), localization and detection (Figure b), semantic segmentation (Figure c), and instance segmentation (Figure d).

Image classification

Given an input image, the image classification task is to determine which category the image belongs to.

(1) Common datasets for image classification

Here are some common classification datasets, in increasing order of difficulty. Performance rankings of published algorithms exist for each of these datasets.

MNIST: 60k training images, 10k test images, 10 classes, image size 1×28×28; the content is handwritten digits 0-9.
CIFAR-10: 50k training images, 10k test images, 10 classes, image size 3×32×32.
CIFAR-100: 50k training images, 10k test images, 100 classes, image size 3×32×32.
ImageNet: 1.2M training images, 50k validation images, 1k classes. Up to and including 2017, the ILSVRC competition based on the ImageNet dataset was held every year; it is the computer vision equivalent of the Olympic Games.

(2) Classic network structures for image classification

Basic structure: we use conv for a convolutional layer, bn for a batch normalization layer, and pool for a pooling layer. The most common layer order is conv -> bn -> relu -> pool, where the convolutional layer extracts features and the pooling layer reduces the spatial size. As the network gets deeper, the spatial size of the feature maps becomes smaller and smaller while the number of channels increases.

How should you design a network for your own task?
When facing a practical task, if your goal is to solve the task rather than to invent a new algorithm, do not try to design a new network architecture of your own, and do not try to reimplement an existing architecture from scratch. Instead, take a published implementation with a pretrained model and fine-tune it. Remove the last fully connected layer and its softmax, add a new fully connected layer and softmax that match your task, then freeze the earlier layers and train only the part you added. If you have more training data, you can fine-tune the last few layers, or even all layers.

LeNet-5: 60k parameters. The basic architecture is: conv1 (6) -> pool1 -> conv2 (16) -> pool2 -> fc3 (120) -> fc4 (84) -> fc5 (10) -> softmax. The numbers in parentheses are the numbers of channels, and the 5 in the name indicates that the network has 5 conv/fc layers. At the time, LeNet-5 was successfully used in ATMs to recognize handwritten digits on checks. LeNet is named after its author, LeCun.

AlexNet: 60M parameters, the winning network of ILSVRC 2012. The basic architecture is: conv1 (96) -> pool1 -> conv2 (256) -> pool2 -> conv3 (384) -> conv4 (384) -> conv5 (256) -> pool5 -> fc6 (4096) -> fc7 (4096) -> fc8 (1000) -> softmax. AlexNet has a structure similar to LeNet-5, but it is deeper and has more parameters. conv1 uses 11×11 filters with a stride of 4 to reduce the spatial size quickly (227×227 -> 55×55). The key points of AlexNet are: (1) it uses the ReLU activation function, which has better gradient properties and trains faster; (2) it uses dropout; (3) it makes extensive use of data augmentation. The significance of AlexNet is that it won ILSVRC with a lead of about 10% over the runner-up, which made people aware of the advantages of convolutional neural networks. AlexNet also showed that GPUs can be used to accelerate the training of convolutional neural networks. AlexNet is named after its author, Alex.
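
The 227×227 -> 55×55 reduction quoted for conv1 follows from the standard output-size formula for a convolution. The sketch below checks it. The helper name conv_output_size, the zero padding for conv1, and the 3×3 stride-2 pooling for pool1 are assumptions made for illustration and are not stated in the text.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial size after a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

# conv1 as described in the text: 11x11 filters, stride 4 (no padding assumed).
conv1 = conv_output_size(227, kernel=11, stride=4)

# pool1 is assumed here to be 3x3 pooling with stride 2.
pool1 = conv_output_size(conv1, kernel=3, stride=2)

print(conv1, pool1)
```

The same helper can be applied layer by layer to see how quickly the spatial size shrinks as the network gets deeper.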

VGG-16/VGG-19: 138M parameters, the runner-up network of ILSVRC 2014. The basic structure of VGG-16 is: conv1^2 (64) -> pool1 -> conv2^2 (128) -> pool2 -> conv3^3 (256) -> pool3 -> conv4^3 (512) -> pool4 -> conv5^3 (512) -> pool5 -> fc6 (4096) -> fc7 (4096) -> fc8 (1000) -> softmax. Here ^3 denotes three repetitions, and the 16 in the name indicates that the network has 16 conv/fc layers. The key points of VGG are: (1) a simple structure, with only two configurations (3×3 convolution and 2×2 pooling) and the same module combination stacked repeatedly; convolutional layers do not change the spatial size, and each pooling layer halves it; (2) a large number of parameters, most of them concentrated in the fully connected layers; (3) proper network initialization and the use of batch normalization layers are important for training deep networks. VGG-19 has a structure similar to VGG-16 and performs slightly better, but it consumes more resources, so VGG-16 is used more often in practice. Because the VGG-16 structure is very simple and well suited to transfer learning, it is still widely used today. The names VGG-16 and VGG-19 come from the authors' research group (Visual Geometry Group).
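
To make the "138M parameters, mostly in the fully connected layers" claim concrete, the following sketch counts weights and biases for the VGG-16 layout listed above. It assumes an RGB input of 224×224 (so that five poolings leave a 7×7×512 feature map) and one bias per output channel; these details and the helper names are assumptions of the sketch, not taken from the text.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases of a k x k convolution."""
    return k * k * c_in * c_out + c_out

def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# (number of conv layers, output channels) per stage, as in the layout above.
stages = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

conv_total, c_in = 0, 3  # assumed RGB input
for repeats, c_out in stages:
    for _ in range(repeats):
        conv_total += conv_params(c_in, c_out)
        c_in = c_out

# Assumed 224x224 input: five 2x2 poolings leave a 7x7x512 feature map.
fc_total = (fc_params(7 * 7 * 512, 4096)
            + fc_params(4096, 4096)
            + fc_params(4096, 1000))

total = conv_total + fc_total
print(f"conv: {conv_total:,}  fc: {fc_total:,}  total: {total:,}")
print(f"share of parameters in fc layers: {fc_total / total:.1%}")
```

Under these assumptions the first fully connected layer alone accounts for most of the model, which is why later architectures replace it with global average pooling.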

GoogLeNet: 5M parameters, the winning network of ILSVRC 2014. GoogLeNet tries to answer the question of which convolution size to choose when designing a network, or whether to choose a pooling layer instead. It proposes the Inception module, which applies 1×1, 3×3, and 5×5 convolutions and 3×3 pooling in parallel and keeps all the results. The basic architecture is: conv1 (64) -> pool1 -> conv2^2 (64, 192) -> pool2 -> inc3 (256, 480) -> pool3 -> inc4^5 (512, 512, 512, 528, 832) -> pool4 -> inc5^2 (832, 1024) -> pool5 -> fc (1000). The key points of GoogLeNet are: (1) multi-branch processing with concatenated results; (2) 1×1 convolutions for dimensionality reduction, to cut the amount of computation. GoogLeNet uses global average pooling instead of fully connected layers, which greatly reduces the number of parameters. The name GoogLeNet comes from the authors' affiliation (Google), with the capital L as a tribute to LeNet; the name Inception comes from the "we need to go deeper" meme.
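
The effect of 1×1 dimensionality reduction can be seen by counting weights for a single 5×5 branch with and without a 1×1 convolution in front of it. The channel sizes below are illustrative values chosen for this sketch, and conv_weights is a hypothetical helper; the number of multiply-adds per output position scales the same way as the weight count.

```python
def conv_weights(c_in, c_out, k):
    """Number of weights in a k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

# Illustrative channel sizes, chosen for this example only.
c_in, c_reduced, c_out = 192, 16, 32

# 5x5 convolution applied directly to all input channels.
direct = conv_weights(c_in, c_out, k=5)

# 1x1 convolution that first reduces the channels, followed by the 5x5.
reduced = conv_weights(c_in, c_reduced, k=1) + conv_weights(c_reduced, c_out, k=5)

print(direct, reduced, round(direct / reduced, 1))
```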

Inception v3/v4
These networks further reduce the number of parameters relative to GoogLeNet. They use Inception modules similar to those of GoogLeNet, but decompose 7×7 and 5×5 convolutions into stacks of 3×3 convolutions with the same receptive field, and in the later part of the network decompose 3×3 convolutions into 1×3 and 3×1 convolutions. This allows the network to be deepened to 42 layers. In addition, Inception v3 uses batch normalization. Inception v3 requires about 2.5 times the computation of GoogLeNet, and its error rate is 3% lower. Inception v4 combines the Inception module with the residual module (see below), reducing the error rate by a further 0.4%.
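
A short sketch of why these decompositions save parameters: stacked 3×3 convolutions cover the same receptive field as one larger filter with fewer weights, and a 1×3 plus 3×1 pair is cheaper than one 3×3. The channel count and the helper names are assumptions for illustration; "same receptive field" does not mean the stacked layers compute the identical function.

```python
def weights(kh, kw, channels):
    """Weights of a kh x kw convolution with `channels` inputs and outputs."""
    return kh * kw * channels * channels

def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions along one axis."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

c = 64  # illustrative channel count

five = weights(5, 5, c)
two_threes = 2 * weights(3, 3, c)                  # replaces one 5x5
three = weights(3, 3, c)
asymmetric = weights(1, 3, c) + weights(3, 1, c)   # replaces one 3x3

print(receptive_field([3, 3]), receptive_field([3, 3, 3]))
print(two_threes / five, asymmetric / three)
```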

ResNet: the winning network of ILSVRC 2015. ResNet aims to solve the problem that networks become harder to train as they get deeper. It proposes the residual module, which contains two 3×3 convolutions and a shortcut connection (left). The shortcut connection effectively alleviates the vanishing gradients that excessive depth causes during backpropagation, so that performance does not degrade when the network is deepened. The shortcut connection is another important idea in deep learning; besides computer vision, it is also used in machine translation and in speech recognition and synthesis. In addition, a ResNet with shortcut connections can be viewed as an ensemble of many networks of different depths that share parameters, with the number of networks growing with the number of layers. The key points of ResNet are: (1) shortcut connections make deep networks easier to train, and the same module combination is stacked repeatedly; (2) ResNet makes heavy use of batch normalization; (3) for very deep networks (more than 50 layers), ResNet uses the more efficient bottleneck structure (bottom right). ResNet achieved accuracy surpassing human level on ImageNet.
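
The residual module can be sketched in a few lines. In this toy version a matrix-vector product stands in for each 3×3 convolution and batch normalization is omitted; both simplifications, and all names, are assumptions of the sketch. It shows that when the parametric branch outputs zero, the module simply passes its input along the shortcut.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(weights, v):
    """Matrix-vector product; stands in for a 3x3 convolution in this sketch."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]

def residual_block(v, w1, w2):
    """y = relu(x + F(x)), with F = two weight layers and a ReLU in between."""
    branch = linear(w2, relu(linear(w1, v)))
    return relu([x + f for x, f in zip(v, branch)])

x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With all-zero weights the parametric branch contributes nothing, so the
# block passes its (non-negative) input through the shortcut unchanged.
print(residual_block(x, zeros, zeros))
```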

The following table compares the above network structures .

preResNet: an improvement of ResNet. preResNet adjusts the order of the layers in the residual module. Compared with the classic residual module (a): (b) sharing BN further impedes the flow of information along the shortcut, making the network harder to train and its performance worse; (c) moving ReLU directly after BN makes the output of this branch always non-negative, which reduces the representational capacity of the network; (d) moving ReLU forward solves the non-negativity problem of (c), but ReLU cannot benefit from BN; (e) moving both ReLU and BN forward solves the problem of (d). The shortcut connection of preResNet (e) passes information more directly and therefore performs better than ResNet.

ResNeXt: another improvement of ResNet. The traditional way to improve performance is to deepen or widen the network, but the computational cost increases as well. ResNeXt aims to improve performance without changing model complexity. Inspired by the compact and efficient Inception module, ResNeXt turns the non-shortcut branch of ResNet into multiple branches. Unlike Inception, every branch has the same structure. The key points of ResNeXt are: (1) it keeps the shortcut connection of ResNet and repeatedly stacks the same module combination; (2) multi-branch processing; (3) 1×1 convolutions to reduce the amount of computation. It combines the advantages of ResNet and Inception. In addition, ResNeXt is cleverly implemented with group convolution. ResNeXt finds that increasing the number of branches is a more effective way to improve network performance than deepening or widening. The name ResNeXt signifies the next generation (next) of ResNet.

Stochastic depth
An improvement of ResNet that aims to alleviate vanishing gradients and speed up training. Similar to dropout, it randomly deactivates residual modules. A deactivated module passes its input directly through the shortcut branch, bypassing the branch with parameters. At test time, the input is fed forward through all modules. Stochastic depth shows that residual modules are redundant.
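
A minimal sketch of this idea on scalar inputs, assuming each residual module computes x + branch(x). The function name and the single shared survival_prob are hypothetical. Scaling the branch by the survival probability at test time follows common practice for this technique and is an assumption here; the text above only says that all modules are used at test time.

```python
import random

def stochastic_depth_forward(x, branches, survival_prob, training, rng=random):
    """Run a stack of residual modules, each computing x + branch(x)."""
    for branch in branches:
        if training:
            if rng.random() < survival_prob:
                x = x + branch(x)
            # Otherwise the module is dropped: only the shortcut is used.
        else:
            # Test time: every module is active. Scaling by the survival
            # probability is an assumption of this sketch.
            x = x + survival_prob * branch(x)
    return x

# Ten toy residual branches, each adding 10% of its input.
branches = [lambda v: 0.1 * v for _ in range(10)]

train_out = stochastic_depth_forward(1.0, branches, 0.8, training=True,
                                     rng=random.Random(0))
test_out = stochastic_depth_forward(1.0, branches, 0.8, training=False)
print(train_out, test_out)
```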

DenseNet: the goal is also to avoid vanishing gradients. Unlike the residual module, the dense module has a shortcut connection between any two layers. In other words, the input of each layer is the concatenation of the results of all previous layers, so it contains features at all levels from low to high. Unlike previous methods, the convolutional layers in DenseNet have very few filters. DenseNet needs only half the parameters of ResNet to reach the performance of ResNet. On the implementation side, the authors pointed out in their conference talk that concatenating the outputs directly consumes a lot of GPU memory. Later, by sharing memory, deeper DenseNets could be trained within the same GPU memory budget. However, because some intermediate results have to be recomputed, this implementation increases training time.
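
The concatenation pattern can be illustrated with a toy dense block. The first helper shows how the number of input channels grows when every layer adds a small, fixed number of new feature maps (called growth_rate here, an assumed name); the second runs the concatenation on plain Python lists. All values are illustrative.

```python
def dense_block_input_channels(c0, growth_rate, num_layers):
    """Input channels of each layer when all previous outputs are concatenated."""
    return [c0 + growth_rate * i for i in range(num_layers)]

def dense_block_forward(x, layers):
    """Each layer sees the concatenation of the block input and all earlier
    layer outputs, and appends its own (small) output to that list."""
    features = list(x)
    for layer in layers:
        features = features + layer(features)
    return features

print(dense_block_input_channels(c0=64, growth_rate=32, num_layers=6))

# Toy layers that each produce a single new feature: the sum of their inputs.
toy_layers = [lambda f: [sum(f)] for _ in range(3)]
print(dense_block_forward([1.0, 2.0], toy_layers))
```

Because each layer only adds a few feature maps, the layers can stay narrow while later layers still have access to every earlier feature.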

Object localization

On top of image classification, we also want to know where the object is in the image, usually in the form of a bounding box.

Basic idea

Multi-task learning: the network has two output branches. One branch performs image classification, using fully connected layers + softmax to determine the object category; unlike plain image classification, an extra "background" class is needed here. The other branch determines the object location: it performs a regression task that outputs four numbers marking the position of the bounding box (for example, the x and y coordinates of the center point and the width and height of the box). The output of this branch is used only when the classification branch does not predict "background".
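
The two-branch setup can be sketched as follows. The text does not specify the loss, so the cross-entropy plus squared-error combination, the box_weight parameter, the background index 0, and the function names are all assumptions of this sketch. The predict function mirrors the rule that the box output is used only for non-background predictions.

```python
import math

BACKGROUND = 0  # assumed index of the "background" class

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def localization_loss(class_logits, pred_box, true_class, true_box, box_weight=1.0):
    """Cross-entropy on the class branch plus squared error on the box branch.
    The box term is skipped when the ground truth is background."""
    probs = softmax(class_logits)
    cls_loss = -math.log(probs[true_class])
    if true_class == BACKGROUND:
        return cls_loss
    box_loss = sum((p - t) ** 2 for p, t in zip(pred_box, true_box))
    return cls_loss + box_weight * box_loss

def predict(class_logits, pred_box):
    """Return (class, box); the box is reported only for non-background classes."""
    probs = softmax(class_logits)
    cls = max(range(len(probs)), key=probs.__getitem__)
    return (cls, pred_box if cls != BACKGROUND else None)

box = (0.5, 0.5, 0.2, 0.3)  # center x, center y, width, height
print(predict([0.1, 2.0, 0.3], box))
print(predict([3.0, 0.2, 0.1], box))
```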

Human pose localization / face localization

The idea of object localization can also be applied to human pose localization or face localization. Both require regressing a set of keypoints, of human joints or of the face respectively.

Weakly supervised localization

Because object localization is a relatively simple task, recent research focuses on localizing objects when only class label information is available. The basic idea is to find salient regions with high responses in the convolutional features and to assume that these regions correspond to the object in the image.

Object detection

In object localization there is usually only one object, or a fixed number of objects. Object detection is more general: the categories and the number of objects in the image are not known in advance. Object detection is therefore a more challenging task than object localization.

(1) Common datasets for object detection

PASCAL VOC: contains 20 classes. Usually the union of the trainval sets of VOC07 and VOC12 is used for training, and the VOC07 test set is used for testing.
MS COCO: COCO is harder than VOC. It contains 80k training images, 40k validation images, and 20k test images without public annotations (test-dev), with 80 classes and an average of 7.2 objects per image. Usually the union of the 80k training images and 35k of the validation images is used for training, the remaining 5k images for validation, and the 20k test images for online testing.
mAP (mean average precision): a common evaluation metric in object detection, computed as follows. A prediction is considered correct when the intersection over union between the predicted bounding box and the ground-truth bounding box exceeds a threshold (usually 0.5). For each class, we draw its precision-recall curve; the average precision is the area under this curve. Averaging the average precision over all classes gives mAP, which takes values in [0, 100%].
Intersection over union (IoU): the area of the intersection of the predicted bounding box and the ground-truth bounding box, divided by the area of their union, with values in [0, 1]. IoU measures how close the predicted bounding box is to the ground-truth bounding box; the larger the IoU, the higher the overlap between the two boxes.

(2) Object detection algorithms based on region proposals
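
As a sketch of the intersection over union used by the mAP metric described in part (1), the functions below compute IoU for two axis-aligned boxes and apply the usual 0.5 threshold. The (x1, y1, x2, y2) corner format and the function names are assumptions of this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def is_correct(pred_box, true_box, threshold=0.5):
    """A prediction counts as correct when its IoU exceeds the threshold."""
    return iou(pred_box, true_box) > threshold

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(is_correct((0, 0, 2, 2), (0, 0, 2, 3)))
```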

Basic idea

Slide windows of different sizes over the image and perform object localization on the region inside each window. That is, the region in each window is fed forward through the network; the classification branch determines the category of the region and the regression branch outputs the bounding box. The motivation for sliding-window detection is that, although the original image may contain multiple objects, the local region covered by a sliding window usually contains only one object (or none). We can therefore process the windowed regions one by one using the idea of object localization. However, because this method has to slide over every region of the image, with windows of various sizes, it incurs a large computational cost.
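
To give a feel for that cost, the sketch below enumerates sliding windows for one image. The image size, window sizes, and stride are illustrative values chosen for this example; each generated window would require its own forward pass through the network.

```python
def sliding_windows(image_w, image_h, window_sizes, stride):
    """Yield (x, y, w, h) for every window position and window size."""
    for w, h in window_sizes:
        for y in range(0, image_h - h + 1, stride):
            for x in range(0, image_w - w + 1, stride):
                yield (x, y, w, h)

# Illustrative settings, not taken from the text.
sizes = [(64, 64), (128, 128), (256, 256)]
windows = list(sliding_windows(512, 512, sizes, stride=16))

print(len(windows))  # one network forward pass per window
```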


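R-CNN first uses unsupervised, non-deep-learning methods to find candidate regions in the image that may contain objects. Each candidate region is then fed forward through the network for object localization, that is, a two-branch (classification + regression) output. The regression branch is still needed because a candidate region is only a rough estimate of the region containing the object, and the supervised regression branch gives a more precise bounding box prediction. R-CNN is important because object detection had nearly reached a plateau at the time, and R-CNN, by fine-tuning a model pretrained on ImageNet, raised the mAP on VOC from 35.1% to 53.7% in one stroke, establishing the basic approach to object detection with deep learning. An interesting detail is that the first sentence of the R-CNN paper has only two words, "Features matter." This points to the core of deep learning methods.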

Region proposal


Fast R-CNN


Region of interest pooling (RoI pooling)


Faster R-CNN

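At test time, Fast R-CNN needs only 0.2 seconds per image for the forward pass through the network, but the bottleneck is that extracting region proposals takes 2 seconds. Faster R-CNN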

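Region proposal networks (RPN)

Why use anchor boxes?

An anchor box is a bounding box with a predefined shape and size. The reasons for using anchor boxes include: (1) Candidate regions in an image differ in size and aspect ratio, and regressing boxes directly is harder to train than regressing corrections to anchor box coordinates. (2).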
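
A minimal sketch of anchor boxes and of "correcting anchor coordinates" rather than regressing a box directly. The scales, aspect ratios, the (cx, cy, w, h) format, and the offset parameterization (shifts relative to the anchor size, log-scale width and height) are assumptions following common practice and are not given in the text.

```python
import math

def make_anchors(cx, cy, scales, ratios):
    """Anchors (cx, cy, w, h) centered at one location.
    Each anchor has area scale**2 and aspect ratio w / h = ratio."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * math.sqrt(r)
            h = s / math.sqrt(r)
            anchors.append((cx, cy, w, h))
    return anchors

def apply_deltas(anchor, deltas):
    """Turn predicted corrections (tx, ty, tw, th) into a box (cx, cy, w, h)."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = deltas
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))

# Illustrative scales and aspect ratios.
anchors = make_anchors(100, 100, scales=[128, 256, 512], ratios=[0.5, 1.0, 2.0])
print(len(anchors))
print(apply_deltas(anchors[1], (0.1, -0.1, 0.0, 0.0)))
```

With this parameterization the network only has to predict small corrections around zero, whatever the absolute size of the box.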


Faster R-CNN, at the RoI


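The evolution of Fast R-CNN, Faster R-CNN, and R-FCN follows the idea of gradually increasing the proportion of image-level computation in the network while reducing the proportion of region-level computation.

(3) Object detection algorithms based on direct regression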





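The advantages of YOLO are: (1) The receptive field of region-proposal-based methods is a local region of the image, whereas YOLO can use information from the whole image. (2) It generalizes better.

The limitations of YOLO are: (1) It cannot handle well the case where the number of objects in a grid cell exceeds the preset fixed value, or where several objects in a grid cell belong to the same anchor box. (2) Its ability to detect small objects is not good enough. (3) Its ability to detect bounding boxes with unusual aspect ratios is weak. (4) The loss does not take the size of the bounding box into account. A small offset in a large bounding box and a small offset in a small bounding box should have different effects.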







(4) Common techniques for object detection

Non-max suppression (NMS)
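
The text here only names the technique, so the following is a minimal sketch of greedy non-max suppression as it is commonly described: keep the highest-scoring box, discard the remaining boxes that overlap it too much, and repeat. The (x1, y1, x2, y2) box format, the 0.5 threshold, and the function names are assumptions of this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS; returns the indices of the boxes that are kept."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```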


Online hard example mining (OHEM)




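Semantic segmentation


(1) Common datasets for semantic segmentation

PASCAL VOC 2012: 1.5k training images, 1.5k validation images, 20 classes (including background).

MS COCO: COCO is harder than VOC. It has 83k training images, 41k validation images, 80k test images, and 80 classes.

(2) Basic ideas of semantic segmentation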





Deconvolution / transposed convolution




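(3) Common techniques for semantic segmentation

Dilated convolution


Conditional random field (CRF)

neural networks) to approximate the conditional random field. Another drawback of conditional random fields is that they consider the relationship between every pair of pixels, which makes them inefficient to run.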



Instance segmentation




Mask R-CNN

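R-CNN has three output branches (classification, coordinate regression, and segmentation). In addition, the other improvements of Mask R-CNN are: (1) It improves RoI pooling, using bilinear interpolation so that the alignment between region proposals and convolutional features does not lose information through quantization. (2) During segmentation, Mask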
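
The bilinear interpolation mentioned in improvement (1) can be sketched on a tiny feature map. Instead of rounding a fractional coordinate to the nearest grid cell (quantization), the value is interpolated from the four surrounding cells. The function name and the clamping at the border are assumptions of this sketch, and it covers only the sampling step, not a full RoI operation.

```python
import math

def bilinear_sample(feature, x, y):
    """Sample a 2-D feature map (list of rows) at a fractional (x, y) location."""
    h, w = len(feature), len(feature[0])
    # Clamp to the valid range (a border-handling choice made for this sketch).
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

feature = [[0.0, 1.0],
           [2.0, 3.0]]

exact = bilinear_sample(feature, 0.5, 0.5)   # interpolated value
quantized = feature[int(0.5)][int(0.5)]      # coordinate rounded down first
print(exact, quantized)
```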