author: Zhang Hao
【New Wisdom Guide】
The author of this article is from the LAMDA Group (Institute of Machine Learning and Data Mining), Department of Computer Science, Nanjing University. The article reviews the application of deep learning to four basic tasks in computer vision: image classification, localization, detection, semantic segmentation, and instance segmentation.

The purpose of this article is to introduce the application of deep learning to four basic tasks in computer vision: classification (Figure a), localization, detection (Figure b), semantic segmentation (Figure c), and instance segmentation (Figure d).

Image classification (image classification)

Given an input image, the goal of the image classification task is to determine the category of the image.

(1) Common datasets for image classification

Below are some common classification datasets, listed in order of increasing difficulty. Performance rankings of algorithms on each dataset are listed.

MNIST: 60k training images, 10k test images, 10 categories, image size 1×28×28; the content is handwritten digits 0-9.

CIFAR-10: 50k training images, 10k test images, 10 categories, image size 3×32×32.

CIFAR-100: 50k training images, 10k test images, 100 categories, image size 3×32×32.

ImageNet: 1.2M training images, 50k validation images, 1k categories. Until around 2017, the annual ILSVRC competition based on the ImageNet dataset was held; it was the equivalent of the Olympic Games of computer vision.

(2) Classic network architectures for image classification

Basic framework  We use conv to denote a convolutional layer, bn a batch normalization layer, and pool a pooling layer. The most common ordering is conv -> bn -> relu -> pool, where the convolutional layer extracts features and the pooling layer reduces the spatial size. As the network deepens, the spatial size of the feature maps becomes smaller and smaller, while the number of channels increases.
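The spatial-size arithmetic behind this follows the standard formula out = floor((W − K + 2P) / S) + 1 for both convolution and pooling layers. A minimal sketch (illustrative, not from the original article):

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv/pool layer: floor((W - K + 2P) / S) + 1."""
    return (size - kernel + 2 * padding) // stride + 1

# A 3x3 conv with stride 1 and padding 1 preserves spatial size;
# a 2x2 pool with stride 2 halves it -- hence feature maps shrink with depth.
size = 32
size = conv_output_size(size, kernel=3, stride=1, padding=1)  # conv: still 32
size = conv_output_size(size, kernel=2, stride=2)             # pool: 16
print(size)  # 16
```

The same formula explains, for instance, AlexNet's conv1 below: (227 − 11) / 4 + 1 = 55.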

How should you design a network for your own task?
  When facing an actual task, if your goal is to solve the task rather than to invent a new algorithm, do not try to design a brand-new network architecture, and do not try to reproduce an existing architecture from scratch. Instead, fine-tune a published implementation and its pre-trained model: remove the last fully-connected layer and its softmax, add a fully-connected layer and softmax suited to your task, freeze the earlier layers, and train only the layers you added. If you have more training data, you can fine-tune several layers, or even all of them.

LeNet-5  60k parameters. The basic architecture is: conv1 (6) -> pool1 -> conv2 (16) -> pool2 -> fc3 (120) -> fc4 (84) -> fc5 (10) -> softmax. The numbers in parentheses are channel counts, and the 5 in the name indicates that it has 5 conv/fc layers. At the time, LeNet-5 was successfully used in ATMs to recognize handwritten digits on checks. LeNet is named after its author's surname, LeCun.

AlexNet  60M parameters; the champion network of ILSVRC 2012. The basic architecture is: conv1 (96) -> pool1 -> conv2 (256) -> pool2 -> conv3 (384) -> conv4 (384) -> conv5 (256) -> pool5 -> fc6 (4096) -> fc7 (4096) -> fc8 (1000) -> softmax. AlexNet has a structure similar to LeNet-5, but it is deeper and has more parameters. conv1 uses an 11×11 filter with stride 4 to rapidly shrink the spatial size (227×227 -> 55×55). The key points of AlexNet are: (1) it uses the ReLU activation function, which gives better gradient properties and faster training; (2) it uses dropout; (3) it makes extensive use of data augmentation. AlexNet's significance is that it won the ILSVRC competition with performance 10% above the runner-up, which made people realize the advantages of convolutional neural networks; in addition, AlexNet showed that GPUs can be used to accelerate the training of convolutional neural networks. AlexNet is named after its author, Alex.

VGG-16/VGG-19  138M parameters; the runner-up network of ILSVRC 2014. The basic structure of VGG-16 is: conv1^2 (64) -> pool1 -> conv2^2 (128) -> pool2 -> conv3^3 (256) -> pool3 -> conv4^3 (512) -> pool4 -> conv5^3 (512) -> pool5 -> fc6 (4096) -> fc7 (4096) -> fc8 (1000) -> softmax, where ^3 means the module is repeated 3 times. The key points of VGG are: (1) a simple structure: only two configurations, 3×3 convolution and 2×2 pooling, with the same module combination stacked repeatedly; convolutional layers do not change the spatial size, and each pooling layer halves it. (2) many parameters, most of them concentrated in the fully-connected layers; the 16 in the name indicates that it has 16 conv/fc layers. (3) proper network initialization and the use of batch normalization layers are important for training deep networks. VGG-19 is similar in structure to VGG-16 and performs slightly better, but it consumes more resources, so VGG-16 is used more in practice. Because its structure is very simple and it suits transfer learning well, VGG-16 is still widely used today. The names VGG-16 and VGG-19 come from the authors' research group, the Visual Geometry Group.

GoogLeNet  5M parameters; the champion network of ILSVRC 2014. GoogLeNet tries to answer how large the convolutions should be when designing a network, and whether a pooling layer should be chosen instead. It proposes the Inception module, which uses 1×1, 3×3, and 5×5 convolutions and 3×3 pooling simultaneously and keeps all the results. The basic architecture is: conv1 (64) -> pool1 -> conv2^2 (64, 192) -> pool2 -> inc3 (256, 480) -> pool3 -> inc4^5 (512, 512, 512, 528, 832) -> pool4 -> inc5^2 (832, 1024) -> pool5 -> fc (1000). The key points of GoogLeNet are: (1) multi-branch processing, with the results concatenated; (2) 1×1 convolutions are used for dimensionality reduction to cut the amount of computation. GoogLeNet uses global average pooling instead of fully-connected layers, greatly reducing the number of network parameters. The name comes from the author's affiliation (Google); the capital L is a salute to LeNet, and the name Inception comes from the "we need to go deeper" meme.
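To make the global-average-pooling idea concrete (a toy sketch, not GoogLeNet's actual code): each channel's H×W grid collapses to a single mean, turning a C×H×W feature map into a C-dimensional vector with zero extra parameters, unlike a flatten-plus-fc head.

```python
def global_average_pool(feature_map):
    """Collapse each channel (a list of rows) to its spatial mean."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]

fmap = [[[1.0, 3.0], [5.0, 7.0]],   # channel 0 -> mean 4.0
        [[2.0, 2.0], [2.0, 2.0]]]   # channel 1 -> mean 2.0
print(global_average_pool(fmap))    # [4.0, 2.0]
```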

Inception v3/v4
  These further reduce the parameters on the basis of GoogLeNet. Their Inception module is similar to GoogLeNet's, but 7×7 and 5×5 convolutions are decomposed into several equivalent 3×3 convolutions, and in the later part of the network 3×3 convolutions are decomposed into 1×3 and 3×1 convolutions. This allows the network to reach 42 layers. In addition, Inception v3 uses batch normalization. Inception v3 costs 2.5 times the computation of GoogLeNet, with an error rate 3% lower. Inception v4 combines the Inception module with the residual module (see below), further reducing the error rate by 0.4%.

ResNet  The champion network of ILSVRC 2015. ResNet aims to solve the problem that training becomes harder as the network deepens. It proposes the residual module, which contains two 3×3 convolutions and a shortcut connection (left figure). During backpropagation, the shortcut connection effectively alleviates the vanishing-gradient phenomenon caused by excessive depth, so performance does not degrade after the network is deepened. The shortcut connection is another important idea in deep learning; beyond computer vision, shortcut connections are also used in machine translation and in speech recognition/synthesis. In addition, a ResNet with shortcut connections can be seen as an ensemble of many networks of different depths that share parameters, with the number of such networks growing with the number of layers. The key points of ResNet are: (1) shortcut connections make deep networks easier to train, and the same module combination is stacked repeatedly; (2) ResNet uses batch normalization extensively; (3) for deep networks (over 50 layers), ResNet uses a more efficient bottleneck structure (lower-right figure). ResNet achieved accuracy surpassing humans on ImageNet.
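The effect of the shortcut connection can be sketched numerically (a toy illustration, not a real convolutional branch): the module computes y = F(x) + x, so even a branch that learns nothing (outputs zeros) leaves the input intact, which is why deepening the network cannot easily make it worse.

```python
def residual_block(x, branch):
    """y = F(x) + x: add the parameterized branch's output to the shortcut path."""
    return [f + s for f, s in zip(branch(x), x)]

# A zero branch degenerates to the identity mapping:
out = residual_block([1.0, 2.0, 3.0], lambda v: [0.0] * len(v))
print(out)  # [1.0, 2.0, 3.0]
```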

The following table compares the above network structures.

preResNet  An improvement of ResNet. preResNet reorders the layers inside the residual module. Compared with the classic residual module (a): (b) placing BN on the shortcut path hinders the information carried by the shortcut, making the network harder to train and performance worse; (c) moving ReLU directly after the addition makes this branch's output always non-negative, reducing the network's representational power; (d) moving ReLU earlier solves the non-negativity problem of (c), but then ReLU cannot enjoy the effect of BN; (e) moving both ReLU and BN earlier solves the problem of (d). The shortcut connection of preResNet's variant (e) passes information more directly, and it thus achieves better performance than ResNet.

ResNeXt  Another improvement of ResNet. The traditional way to improve performance is to deepen or widen the network, but the computational cost increases as well. ResNeXt aims to improve performance without changing the model complexity. Inspired by the simple and efficient Inception module, ResNeXt replaces ResNet's single non-shortcut branch with multiple branches; unlike Inception, every branch has the same structure. The key points of ResNeXt are: (1) it keeps ResNet's shortcut connections and repeatedly stacks the same module combination; (2) multi-branch processing; (3) 1×1 convolutions are used to reduce computation. It combines the advantages of ResNet and Inception. In addition, ResNeXt is implemented elegantly with grouped convolutions. ResNeXt found that increasing the number of branches is a more effective way to improve network performance than deepening or widening. The name ResNeXt means the next generation (next) of ResNet.

Stochastic depth
  An improvement of ResNet intended to alleviate vanishing gradients and speed up training. Similar to dropout, it randomly deactivates residual modules during training: a deactivated module outputs directly through its shortcut branch, bypassing the branch with parameters. At test time, the network feeds forward through all modules. Stochastic depth shows that residual modules are redundant.

DenseNet  Its goal is also to avoid vanishing gradients. Unlike the residual module, there is a shortcut connection between any two layers inside a dense module. In other words, the input of each layer is the concatenation of the results of all preceding layers, i.e. it contains features of all levels from low to high. Unlike earlier methods, DenseNet uses very few filters in its intermediate convolutional layers; DenseNet needs only half the parameters of ResNet to reach ResNet-level performance. On the implementation side, the authors pointed out in their conference talk that directly concatenating the outputs takes a lot of GPU memory. Later, by sharing memory, a deeper DenseNet can be trained under the same GPU memory budget, but some intermediate results must be recomputed, so this implementation increases training time.

Object localization (object localization)

On the basis of image classification, we also want to know where the target is in the image, usually in the form of a bounding box.

Basic idea

Multi-task learning: the network has two output branches. One branch performs image classification, i.e. fully-connected layers plus softmax to judge the target category; the difference from plain image classification is that an extra "background" class is needed here. The other branch determines the target's location, i.e. it performs a regression task, outputting four numbers that mark the position of the bounding box (for example, the horizontal and vertical coordinates of the center point plus the bounding box's width and height); this branch's output is used only when the region is judged not to be "background".
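The two common parameterizations of those four numbers, corner coordinates (x1, y1, x2, y2) and center-plus-size (cx, cy, w, h), are interchangeable; a small conversion sketch (illustrative only):

```python
def corners_to_center(x1, y1, x2, y2):
    """(x1, y1, x2, y2) corners -> (cx, cy, w, h) center and size."""
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)

def center_to_corners(cx, cy, w, h):
    """Inverse conversion back to corner coordinates."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

print(corners_to_center(10, 20, 50, 60))      # (30.0, 40.0, 40, 40)
print(center_to_corners(30.0, 40.0, 40, 40))  # (10.0, 20.0, 50.0, 60.0)
```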

Human pose estimation / face localization

The idea of object localization can also be applied to human pose estimation or face localization. Both require regressing a series of key points of the human joints or the face.

Weakly-supervised localization

Because object localization is a relatively simple task, the recent research focus is to localize the target given only image-level labels. The basic idea is to find salient regions with high responses in the convolutional results and to take those regions as corresponding to the target in the image.

Object detection (object detection)

In object localization there is usually only one target, or a fixed number of targets, while object detection is more general: both the type and the number of targets in the image are uncertain. Therefore, object detection is a more challenging task than object localization.

(1) Common datasets for object detection

PASCAL VOC  Contains 20 categories. Usually, the union of the VOC07 and VOC12 trainval sets is used for training, and the VOC07 test set is used for testing.

MS COCO  COCO is more difficult than VOC. COCO contains 80k training images, 40k validation images, and 20k test images without public annotations (test-dev), with 80 categories and an average of 7.2 targets per image. Usually, the union of the 80k training images and 35k of the validation images is used for training, the remaining 5k validation images for validation, and the 20k test images for online testing.
mAP (mean average precision)  The common evaluation metric in object detection, computed as follows. A prediction is considered correct when the intersection-over-union between the predicted bounding box and the ground-truth bounding box exceeds a threshold (usually 0.5). For each category, we plot its precision-recall curve; the average precision (AP) is the area under that curve. Averaging the APs over all categories gives mAP, whose value lies in [0, 100%]. Intersection-over-union (IoU) is the area of the intersection of the predicted and ground-truth bounding boxes divided by the area of their union, with values in [0, 1]. IoU measures how close the predicted bounding box is to the ground truth: the larger the IoU, the greater the overlap between the two boxes.

(2) Object detection algorithms based on region proposals

Basic idea

Slide windows of different sizes over the image and perform object localization on the region inside each window. That is, the region in each window is fed forward through the network, whose classification branch judges the region's category and whose regression branch outputs a bounding box. The motivation of sliding-window detection is that, although the original image may contain multiple targets, the local region of the image corresponding to one window usually contains only a single target (or none), so the object-localization approach can be applied window by window. However, because this method has to slide over every region of the image with windows of varying sizes, it brings an enormous computational cost.


R-CNN  First, an unsupervised, non-deep-learning method is used to find candidate regions in the image that may contain targets. Then each candidate region is fed forward through the network for object localization, i.e. the two branches (classification + regression) produce outputs. The reason we still need the regression branch is that a candidate region is only a rough estimate of the region containing the target; we need the supervised regression branch to obtain a more precise bounding-box prediction. The importance of R-CNN is that, at a time when object detection was approaching a bottleneck, its approach of fine-tuning a model pre-trained on ImageNet raised mAP on VOC from 35.1% to 53.7% in one stroke, establishing the basic paradigm of deep-learning-based object detection. An interesting detail is that the opening sentence of the R-CNN paper is only two words, "Features matter.", which points straight at the core of deep learning methods.

Region proposals (region proposal)


Fast R-CNN


Region-of-interest pooling (region of interest pooling, RoI pooling)
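As a sketch of the idea named in this heading (assuming the RoI divides evenly into the output grid; real implementations quantize the cell boundaries): the variable-size region on the feature map is split into a fixed out_h × out_w grid and each cell is max-pooled, producing a fixed-size input for the fully-connected layers.

```python
def roi_max_pool(feature, x1, y1, x2, y2, out_h, out_w):
    """Max-pool the RoI [y1:y2, x1:x2] of a 2-D feature map into out_h x out_w cells."""
    ch, cw = (y2 - y1) // out_h, (x2 - x1) // out_w
    return [[max(feature[y][x]
                 for y in range(y1 + i * ch, y1 + (i + 1) * ch)
                 for x in range(x1 + j * cw, x1 + (j + 1) * cw))
             for j in range(out_w)]
            for i in range(out_h)]

feat = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 map, values 0..15
print(roi_max_pool(feat, 0, 0, 4, 4, 2, 2))  # [[5, 7], [13, 15]]
```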


Faster R-CNN

At test time, Fast R-CNN needs only 0.2 seconds per image for the network's forward pass, but the bottleneck is that extracting region proposals takes 2 seconds. Faster R-CNN therefore replaces this step with a region proposal network.

Region proposal network (region proposal networks, RPN)

Why use anchor boxes (anchor box)?

Anchor boxes are bounding boxes with predefined shapes and sizes. The reasons for using anchor boxes include: (1) candidate regions in an image vary in size and aspect ratio, and directly regressing coordinates is harder to train than regressing corrections to anchor-box coordinates. (2)
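The parameterization used by Faster R-CNN (assumed here for illustration) regresses offsets relative to an anchor rather than raw coordinates, with width/height corrections in log space so that scaling up and scaling down are symmetric:

```python
import math

def encode_box(box, anchor):
    """Regression targets (tx, ty, tw, th) that correct an anchor toward a box.
    Both box and anchor are given as (cx, cy, w, h)."""
    bx, by, bw, bh = box
    ax, ay, aw, ah = anchor
    return ((bx - ax) / aw, (by - ay) / ah,
            math.log(bw / aw), math.log(bh / ah))

def decode_box(t, anchor):
    """Apply predicted targets to an anchor to recover a box."""
    tx, ty, tw, th = t
    ax, ay, aw, ah = anchor
    return (ax + tx * aw, ay + ty * ah, aw * math.exp(tw), ah * math.exp(th))

anchor = (50.0, 50.0, 20.0, 40.0)
box = (54.0, 46.0, 30.0, 40.0)
print(tuple(round(v, 3) for v in encode_box(box, anchor)))  # (0.2, -0.1, 0.405, 0.0)
```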


Faster R-CNN, at the RoI


The evolution from Fast R-CNN to Faster R-CNN to R-FCN follows the idea of gradually increasing the proportion of image-level computation in the network while decreasing the proportion of region-level computation.

(3) Object detection algorithms based on direct regression





The advantages of YOLO are: (1) the receptive field of region-proposal-based methods is a local region of the image, whereas YOLO can use the information of the whole image. (2) it has better generalization ability.

The limitations of YOLO are: (1) it cannot handle well the cases where the number of targets in a grid cell exceeds the preset fixed value, or where several targets in a cell match the same anchor box. (2) its ability to detect small targets is not good enough. (3) it is weak at detecting bounding boxes with uncommon aspect ratios. (4) the loss does not take bounding-box size into account: a small offset in a large bounding box and a small offset in a small bounding box should have different effects.
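One standard remedy for point (4), used in the original YOLO loss (sketched here as an assumption, for illustration), is to regress the square roots of width and height, so the same absolute offset is penalized more heavily for a small box:

```python
import math

def size_loss(pred, true):
    """Squared error on the square root of a box dimension."""
    return (math.sqrt(pred) - math.sqrt(true)) ** 2

small = size_loss(15, 10)    # a 10-px-wide box predicted 5 px too wide
large = size_loss(105, 100)  # a 100-px-wide box predicted 5 px too wide
print(small > large)  # True: the small box's error counts more
```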







(4) Common techniques in object detection

Non-maximum suppression (non-max suppression, NMS)
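Greedy NMS removes duplicate detections of the same target: sort boxes by score, keep the top one, discard every remaining box whose IoU with a kept box exceeds a threshold, and repeat. A minimal sketch (illustrative, not a reference implementation):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the top-scoring box and drop overlapping ones."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] -- box 1 overlaps box 0 too much
```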


Online hard example mining (online hard example mining, OHEM)




Semantic segmentation (semantic segmentation)


(1) Common datasets for semantic segmentation

PASCAL VOC 2012  1.5k training images, 1.5k validation images, 20 categories (including background).

MS COCO  COCO is more difficult than VOC, with 83k training images, 41k validation images, 80k test images, and 80 categories.

(2) Basic approach to semantic segmentation





Deconvolution (deconvolution) / transposed convolution (transpose convolution)
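A transposed convolution upsamples a feature map, and its output size is the inverse of the conv formula: out = (in − 1) × S − 2P + K (ignoring the output_padding option some frameworks add). An illustrative helper, not tied to any particular library:

```python
def transpose_conv_output_size(size, kernel, stride=1, padding=0):
    """Inverse of the conv size formula: out = (in - 1) * S - 2P + K.
    (output_padding, used by some frameworks, is omitted here.)"""
    return (size - 1) * stride - 2 * padding + kernel

# A common 2x-upsampling decoder layer uses kernel 4, stride 2, padding 1:
print(transpose_conv_output_size(16, kernel=4, stride=2, padding=1))  # 32
```

Note the symmetry with the forward pass: a stride-4, 11×11 conv maps 227 -> 55, and its transpose maps 55 back to (55 − 1) × 4 + 11 = 227.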




(3) Common techniques in semantic segmentation

Dilated convolution (dilated convolution)
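A dilated convolution inserts (d − 1) gaps between the kernel taps, enlarging the receptive field without extra parameters; the effective kernel span is d × (k − 1) + 1. A small sketch:

```python
def effective_kernel_size(kernel, dilation):
    """Effective span of a dilated kernel: dilation * (kernel - 1) + 1."""
    return dilation * (kernel - 1) + 1

# A 3x3 kernel with dilation 2 covers a 5x5 area using only 9 weights.
print(effective_kernel_size(3, 2))  # 5
```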


Conditional random field (conditional random field, CRF)

neural networks) to approximate the conditional random field. Another drawback of the CRF is that it considers the relations between every pair of pixels, which makes it inefficient to run.



Instance segmentation (instance segmentation)




Mask R-CNN

Mask R-CNN has three output branches (classification, coordinate regression, and segmentation). In addition, the other improvements of Mask R-CNN include: (1) improved RoI pooling: through bilinear interpolation, the candidate region and the convolutional features are aligned so that no information is lost to quantization. (2) during segmentation, Mask