author: Zhang Hao
【New Wisdom Guide】
The author of this article is from the Institute of Machine Learning and Data Mining (LAMDA), Department of Computer Science, Nanjing University. The article surveys the application of deep learning to four basic tasks of computer vision: image classification, localization, detection, and semantic and instance segmentation.

This article introduces the application of deep learning to four basic tasks of computer vision: classification (Figure a), localization, detection (Figure b), semantic segmentation (Figure c), and instance segmentation (Figure d).

Image classification (image classification)

Given an input image, the image classification task aims to determine the image's category.

(1) Common image classification datasets


Below are some common classification datasets, in increasing order of difficulty. http://rodrigob.github.io/are_we_there_yet/build/ lists the performance rankings of algorithms on each dataset.

MNIST  60k training images, 10k test images, 10 classes, image size 1×28×28; the content is the handwritten digits 0-9.

CIFAR-10  50k training images, 10k test images, 10 classes, image size 3×32×32.

CIFAR-100  50k training images, 10k test images, 100 classes, image size 3×32×32.

ImageNet  1.2M training images, 50k validation images, 1k classes. Until around 2017, the annual ILSVRC competition based on the ImageNet dataset was held; it was the equivalent of the Olympic Games for computer vision.

(2) Classic network architectures for image classification

Basic framework  We use conv to denote a convolution layer, bn a batch normalization layer, and pool a pooling layer. The most common ordering is conv -> bn -> relu -> pool, where convolution layers extract features and pooling layers reduce the spatial size. As the network deepens, the spatial size of the image becomes smaller and smaller while the number of channels grows.
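As a concrete illustration, the conv -> bn -> relu -> pool ordering can be sketched with numpy. This is a minimal sketch: the random tensor stands in for a convolution output, the batch norm omits the learned scale/shift parameters, and all shapes are illustrative only.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each channel over the batch and spatial dimensions
    # (inference-style, without the learned scale/shift parameters).
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relu(x):
    return np.maximum(x, 0)

def max_pool2x2(x):
    # 2x2 max pooling with stride 2: halves each spatial dimension.
    n, c, h, w = x.shape
    return x.reshape(n, c, h // 2, 2, w // 2, 2).max(axis=(3, 5))

# conv -> bn -> relu -> pool: a random tensor stands in for the conv output.
feat = np.random.randn(8, 16, 32, 32)   # (batch, channels, height, width)
out = max_pool2x2(relu(batch_norm(feat)))
print(out.shape)                        # (8, 16, 16, 16)
```

Note how the pooling halves each spatial dimension while leaving the channel count unchanged, matching the shrinking-space, growing-channels pattern described above.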

How should you design a network for your own task?
  When facing a practical task, if your goal is to solve the task rather than to invent a new algorithm, do not try to design your own new network architecture, and do not try to reproduce an existing architecture from scratch. Instead, fine-tune a published implementation and pre-trained model: remove the last fully-connected layer and its softmax, add a new fully-connected layer and softmax matching your classes, then freeze the earlier layers and train only the layers you added. If you have more training data, you can fine-tune several layers, or even all of them.

LeNet-5  60k parameters. The basic architecture is: conv1 (6) -> pool1 -> conv2 (16) -> pool2 -> fc3 (120) -> fc4 (84) -> fc5 (10) -> softmax. The numbers in parentheses are channel counts, and the "5" in the name indicates 5 conv/fc layers. At the time, LeNet-5 was successfully used in ATMs to recognize the handwritten digits on checks. LeNet is named after its author's surname, LeCun.

AlexNet  60M parameters; the champion network of ILSVRC 2012. The basic architecture is: conv1 (96) -> pool1 -> conv2 (256) -> pool2 -> conv3 (384) -> conv4 (384) -> conv5 (256) -> pool5 -> fc6 (4096) -> fc7 (4096) -> fc8 (1000) -> softmax. AlexNet has a structure similar to LeNet-5 but is deeper and has more parameters. conv1 uses 11×11 filters with stride 4 to reduce the spatial size rapidly (227×227 -> 55×55). AlexNet's key points: (1) it used the ReLU activation function, which has better gradient properties and trains faster; (2) it used dropout; (3) it made extensive use of data augmentation. AlexNet's significance is that it won ILSVRC with performance 10% above the runner-up, which made people realize the advantage of convolutional neural networks; it also made people realize that GPUs could be used to accelerate the training of convolutional neural networks. AlexNet is named after its author, Alex Krizhevsky.

VGG-16/VGG-19  138M parameters; the runner-up network of ILSVRC 2014. The basic structure of VGG-16 is: conv1^2 (64) -> pool1 -> conv2^2 (128) -> pool2 -> conv3^3 (256) -> pool3 -> conv4^3 (512) -> pool4 -> conv5^3 (512) -> pool5 -> fc6 (4096) -> fc7 (4096) -> fc8 (1000) -> softmax, where ^3 means the layer is repeated 3 times. VGG's key points: (1) A simple structure: only two configurations, 3×3 convolution and 2×2 pooling, with the same module combination stacked repeatedly; convolution layers do not change the spatial size, and each pooling layer halves it. (2) A large number of parameters, most of them concentrated in the fully-connected layers; the "16" in the name indicates 16 conv/fc layers. (3) Proper network initialization and the use of batch normalization layers matter for training deep networks. VGG-19's structure is similar to VGG-16 with slightly better performance, but VGG-19 consumes more resources, so VGG-16 is used more in practice. Because VGG-16's structure is very simple and well suited to transfer learning, it is still widely used today. The names VGG-16 and VGG-19 come from the research group the authors belong to (Visual Geometry Group).
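The claim that most of VGG-16's parameters sit in the fully-connected layers is easy to verify by counting. The layer shapes below follow the architecture listed above (all 3×3 convolutions; fc6 takes the 7×7×512 pooled feature as input):

```python
# Parameters: a 3x3 conv layer has 9*c_in*c_out weights + c_out biases;
# a fully-connected layer has n_in*n_out weights + n_out biases.
convs = [(3, 64), (64, 64), (64, 128), (128, 128),
         (128, 256), (256, 256), (256, 256),
         (256, 512), (512, 512), (512, 512),
         (512, 512), (512, 512), (512, 512)]
fcs = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

conv_params = sum(9 * cin * cout + cout for cin, cout in convs)
fc_params = sum(nin * nout + nout for nin, nout in fcs)
total = conv_params + fc_params
print(total, fc_params / total)  # ~138.4M total, ~89% in the fc layers
```

fc6 alone contributes about 103M of the roughly 138M parameters, which is why removing or shrinking the fully-connected layers is the first step in reducing model size.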

GoogLeNet  5M parameters; the champion network of ILSVRC 2014. GoogLeNet tries to answer what convolution size to choose when designing a network, and whether to choose a pooling layer instead. It proposed the Inception module, which uses 1×1, 3×3, and 5×5 convolutions and 3×3 pooling simultaneously and keeps all the results. The basic architecture is: conv1 (64) -> pool1 -> conv2^2 (64, 192) -> pool2 -> inc3 (256, 480) -> pool3 -> inc4^5 (512, 512, 512, 528, 832) -> pool4 -> inc5^2 (832, 1024) -> pool5 -> fc (1000). GoogLeNet's key points: (1) multi-branch processing, with the results concatenated; (2) 1×1 convolutions for dimensionality reduction, to cut the amount of computation. GoogLeNet uses global average pooling in place of fully-connected layers, greatly reducing the number of network parameters. The name GoogLeNet comes from the authors' affiliation (Google), with the capital L a salute to LeNet, and the name Inception comes from the "we need to go deeper" meme.
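The computational saving from 1×1 dimensionality reduction can be seen with a back-of-the-envelope multiply-accumulate count. The 28×28 map and the 192 -> 16 -> 32 channel numbers below are illustrative, not GoogLeNet's exact configuration:

```python
# Multiply-accumulate (MAC) count of a conv layer on an HxW output map:
# H * W * C_out * K * K * C_in
def conv_macs(h, w, c_in, c_out, k):
    return h * w * c_out * k * k * c_in

# Direct 5x5 convolution: 192 -> 32 channels on a 28x28 map.
direct = conv_macs(28, 28, 192, 32, 5)

# 1x1 bottleneck to 16 channels first, then the 5x5 convolution.
reduced = conv_macs(28, 28, 192, 16, 1) + conv_macs(28, 28, 16, 32, 5)

print(direct, reduced, direct / reduced)  # the bottleneck is ~10x cheaper
```

Here the 1×1 reduction cuts the cost of the 5×5 convolution by nearly a factor of ten, at the price of a small information bottleneck.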

Inception v3/v4
  These networks further reduce the parameter count on the basis of GoogLeNet. They use Inception modules similar to GoogLeNet's, but decompose 7×7 and 5×5 convolutions into equivalent stacks of 3×3 convolutions and, in the later part of the network, decompose 3×3 convolutions into 1×3 and 3×1 convolutions. This allows the network to deepen to 42 layers. In addition, Inception v3 uses batch normalization. Inception v3 requires 2.5 times the computation of GoogLeNet with a 3% lower error rate. Inception v4 combines the Inception module with the residual module (see below), further reducing the error rate by 0.4%.

ResNet  The champion network of ILSVRC 2015. ResNet aims to solve the problem that networks become harder to train as they deepen. It proposed the residual module, which contains two 3×3 convolutions and a shortcut connection (left figure). The shortcut connection effectively alleviates the vanishing-gradient phenomenon that great depth causes during back-propagation, so that performance does not degrade as the network deepens. The shortcut connection is another important idea in deep learning; beyond computer vision, shortcut connections are also used in machine translation and in speech recognition/synthesis. Moreover, a ResNet with shortcut connections can be seen as an ensemble of many networks of different depths with shared parameters, where the number of networks grows exponentially with the number of layers. ResNet's key points: (1) shortcut connections, which make deep networks easier to train, with the same module combination stacked repeatedly; (2) extensive use of batch normalization; (3) for deep networks (over 50 layers), ResNet uses a more efficient bottleneck structure (lower-right figure). ResNet achieved accuracy surpassing human-level performance on ImageNet.
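A minimal numpy sketch of the residual idea, with shape-preserving linear layers standing in for the two 3×3 convolutions. This is a simplification: real residual modules use convolutions and batch normalization.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0)

# Stand-ins for the two conv layers: any shape-preserving transforms work
# for illustrating the shortcut; small weights make F(x) start near zero.
W1 = rng.standard_normal((64, 64)) * 0.01
W2 = rng.standard_normal((64, 64)) * 0.01

def residual_block(x):
    f = relu(x @ W1) @ W2   # the parameterized branch F(x)
    return relu(f + x)      # shortcut: add the input back, then ReLU

x = rng.standard_normal((10, 64))
y = residual_block(x)
print(y.shape)  # (10, 64)
```

With small weights the parameterized branch is near zero, so the block starts out close to an identity mapping; the network only has to learn the residual F(x), which is what makes very deep stacks trainable.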

The following table compares the above network structures.

preResNet
  An improvement of ResNet. preResNet adjusts the order of the layers inside the residual module. Compared with the classic residual module (a): (b) moving BN after the addition interferes with the information passed along the shortcut, making the network harder to train and its performance worse; (c) moving ReLU directly before the addition makes the output of that branch always nonnegative, reducing the network's representational power; (d) moving ReLU earlier solves the nonnegativity problem of (c), but the ReLU can no longer enjoy the effect of BN; (e) moving both ReLU and BN earlier solves the problem of (d). The shortcut connection in preResNet's (e) passes information more directly, and thus achieves better performance than ResNet.

ResNeXt
  Another improvement of ResNet. The traditional ways to improve performance are to deepen or widen the network, but the computational cost grows as well. ResNeXt aims to improve performance without changing the model's complexity. Inspired by the simple and efficient Inception module, ResNeXt turns the non-shortcut branch of ResNet into multiple branches; unlike Inception, every branch has the same structure. ResNeXt's key points: (1) it continues to use ResNet's shortcut connections, with the same module combination stacked repeatedly; (2) multi-branch processing; (3) 1×1 convolutions to reduce computation. It combines the advantages of ResNet and Inception. In addition, ResNeXt is implemented elegantly with grouped convolutions. ResNeXt found that increasing the number of branches is a more effective way to improve network performance than deepening or widening. The name ResNeXt means the next-generation (next) ResNet.

Stochastic depth
  An improvement of ResNet, intended to alleviate vanishing gradients and speed up training. Similar to dropout, it randomly deactivates residual modules during training; a deactivated module outputs directly through the shortcut branch, without passing through the parameterized branch. At test time, the forward pass goes through all modules. Stochastic depth shows that residual modules are redundant.

DenseNet
  Its goal is also to avoid vanishing gradients. Unlike the residual module, a dense module has shortcut connections between every pair of layers. In other words, each layer's input is the concatenation of the outputs of all preceding layers, and thus contains features of all levels from low to high. Unlike earlier methods, the number of filters in DenseNet's intermediate convolution layers is very small. DenseNet needs only half the parameters of ResNet to reach ResNet's performance. On the implementation side, the authors pointed out in their conference talk that directly concatenating the outputs takes up a great deal of GPU memory; later, through shared memory, a deeper DenseNet can be trained under the same GPU memory budget, but some intermediate results must be recomputed, so this implementation lengthens training time.

Object localization (object localization)

On top of image classification, we also want to know where the target is in the image, usually in the form of a bounding box.

Basic idea

Multi-task learning: the network has two output branches. One branch performs image classification, with a fully-connected layer plus softmax judging the target category; the difference from plain image classification is that an extra "background" class is needed here. The other branch determines the target's position, i.e., a regression task outputting four numbers that mark the bounding box (for example, the coordinates of its center plus its width and height); this branch's output is used only when the classification branch judges the input not to be "background".

Human pose localization / face localization

The idea of target localization can also be used for human pose localization or face localization. Both require regressing a series of keypoints: the body's joints or the face's landmarks.

Weakly supervised localization

Since target localization is a relatively simple task, recent research has focused on localizing targets with only image-level classification labels. The basic idea is to find salient regions with high responses in the convolutional features and treat those regions as corresponding to the target in the image.

Object detection (object detection)

In target localization there is usually only one target, or a fixed number of targets. Object detection is more general: the kinds and number of targets in the image are uncertain. Object detection is therefore a more challenging task than target localization.

(1) Common object detection datasets

PASCAL VOC  Contains 20 classes. Usually the union of the VOC07 and VOC12 trainval sets is used for training, and the VOC07 test set for testing.

MS COCO  COCO is harder than VOC. COCO contains 80k training images, 40k validation images, and 20k test images without public annotations (test-dev), with 80 classes and an average of 7.2 targets per image. Usually the union of the 80k training and 35k validation images is used for training, the remaining 5k images for validation, and the 20k test images for online testing.
mAP (mean average precision)  The common evaluation metric in object detection, computed as follows. A prediction is considered correct when the intersection over union of the predicted and ground-truth bounding boxes exceeds a threshold (usually 0.5). For each class, we draw a precision-recall curve; average precision is the area under that curve. Averaging the average precision over all classes gives mAP, whose value lies in [0, 100%]. The intersection over union (IoU) is the area of the intersection of the predicted and ground-truth boxes divided by the area of their union, with value in [0, 1]; it measures how close the predicted box is to the ground truth: the larger the IoU, the greater the overlap between the two boxes.

(2) Detection algorithms based on region proposals
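Before turning to the algorithms, the IoU computation defined above can be sketched as follows (boxes are assumed to be (x1, y1, x2, y2) corner coordinates):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)   # 0 if the boxes are disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))    # 50 / 150 ≈ 0.333
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # disjoint boxes -> 0.0
```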

Basic idea

Slide windows of different sizes over the image and perform target localization on the region inside each window: feed the region in each window through the network, with the classification branch judging the region's category and the regression branch outputting a bounding box. The motivation for sliding-window detection is that, although the whole image may contain several targets, the local region inside a window usually contains only one target (or none), so we can apply the target-localization idea window by window. However, this method must slide over every region of the image, with windows of varying size, which brings an enormous computational cost.

R-CNN

First, an unsupervised, non-deep-learning method is used to find candidate regions in the image that may contain targets. Each candidate region is then fed through the network for target localization, i.e., the two-branch (classification + regression) output. We still need the regression branch because the candidate region is only a rough estimate of the region containing the target; the regression branch is trained with supervision to obtain a more precise bounding-box prediction. R-CNN's importance lies in its timing: object detection was approaching a plateau, and R-CNN's approach of fine-tuning a model pre-trained on ImageNet raised mAP on VOC from 35.1% to 53.7% in one stroke, establishing the basic pipeline for detection with deep learning. An interesting detail is that the first sentence of the R-CNN paper consists of just two words, "Features matter.", which points to the core of deep learning methods.

Region proposals (region proposal)

Proposal-generation algorithms typically merge similar pixels based on the image's color, texture, area, position, and so on, eventually producing a set of candidate rectangular regions. These algorithms, such as selective search or EdgeBoxes, usually need only a few seconds of CPU time, and a typical number of proposals is 2k; compared with sliding a window over every region of the image, proposal-based methods are highly efficient. On the other hand, these proposal-generation algorithms have mediocre precision but usually high recall, so we are unlikely to miss targets in the image.

Fast R-CNN

R-CNN's drawback is that it requires many forward passes through the network, making it inefficient: predicting one image takes 47 seconds. Fast R-CNN likewise detects from region proposals, but, inspired by SPPNet, it shares the convolutional feature extraction across proposals. That is, the whole image is first fed through the network to extract conv5 features; then features are sampled from the convolutional feature map for each proposal produced by the proposal algorithm, a step called region of interest pooling; finally, each proposal goes through target localization, i.e., the two-branch (classification + regression) output.

Region of interest pooling (RoI pooling)

RoI pooling extracts a fixed-size feature from the local convolutional features corresponding to an arbitrarily sized proposal; this is needed because the subsequent two-branch network contains fully-connected layers, which require a fixed input size. The procedure: project the proposal onto the convolutional feature map, divide the corresponding feature region spatially into a fixed number of grid cells (the number matches the input size expected by the next stage, e.g., a 7×7 grid for VGGNet), and max-pool within each cell to obtain a fixed-size pooled result. As with classic max pooling, RoI pooling operates independently per channel.

Faster R-CNN

At test time Fast R-CNN needs only 0.2 seconds per image for the forward pass, but the bottleneck is the 2 seconds needed to extract proposals. Faster R-CNN abandons the existing unsupervised proposal-generation algorithms and instead uses a region proposal network to generate proposals from the conv5 features, integrating the proposal network into the whole network for end-to-end training. Faster R-CNN's test time is 0.2 seconds, close to real-time. Later research found that using fewer proposals can raise speed further with little loss in performance.

Region proposal network (RPN)
  The RPN applies two convolution layers (3×3 then 1×1) on the convolutional features and outputs two branches: one judges whether each anchor box contains a target, and the other outputs the 4 coordinates of the proposal for each anchor box. The RPN actually continues the sliding-window approach to localization; the difference is that the RPN slides over the convolutional features rather than the original image. Because the convolutional features have a small spatial size but a large receptive field, even a 3×3 sliding window corresponds to a large region of the original image. Faster R-CNN uses 3 sizes (128×128, 256×256, 512×512) and 3 aspect ratios (1:1, 1:2, 2:1), for a total of 9 anchor boxes; these anchor sizes exceed the receptive field of the conv5 features. A 1000×600 image yields about 20k anchor boxes.
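Anchor generation at a single feature-map position can be sketched as follows. Here ratios are taken as height/width, and each anchor keeps the area scale², a common convention though not the only possible one:

```python
import numpy as np

def make_anchors(center, scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Generate one anchor per (scale, ratio) pair around center = (cx, cy).
    ratio is height/width; each anchor keeps the area scale * scale."""
    cx, cy = center
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors((300, 300))
print(len(anchors))  # 9 anchors: 3 scales x 3 aspect ratios
```

Repeating this at every position of the feature map is what produces the ~20k anchors mentioned above for a 1000×600 image.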

Why use anchor boxes (anchor box)?

Anchor boxes are bounding boxes with predefined shapes and sizes. The reasons for using them include: (1) Candidate regions in an image vary in size and aspect ratio; direct regression is harder to train than regressing corrections to anchor coordinates. (2) The conv5 receptive field is large and may well contain more than one target; using multiple anchors allows predicting several targets appearing within the same receptive field. (3) Using anchors can also be seen as a way of injecting prior knowledge into the network: we can choose a set of anchors according to the shapes and sizes that bounding boxes typically take in the data. Anchors are independent of one another, and different anchors correspond to different targets, e.g., tall thin anchors for people and short wide anchors for vehicles.

R-FCN

After RoI pooling, Faster R-CNN must run the two-branch prediction separately for every proposal. R-FCN aims to share almost all of the computation, to gain further speed. Because image classification does not care exactly where the target is in the image, the network is translation-invariant; but in detection the target's position must be regressed, so the network's output should be affected by target translation. To reconcile the two, R-FCN explicitly assigns positional meaning to the channels of the deep convolutional features. During RoI pooling, the proposal is first divided into a 3×3 grid; different grid cells are then matched to different channels of the proposal's convolutional features, and each cell is average-pooled separately. R-FCN likewise uses the two-branch (classification + regression) output.

Summary

Proposal-based detection algorithms usually take two steps: the first extracts deep features from the image, and the second localizes (classifies and regresses) each proposal. The first step is image-level computation, requiring one forward pass per image; the second is region-level computation, requiring one forward pass per proposal. The second step therefore dominates the overall cost. The evolution from R-CNN through Fast R-CNN and Faster R-CNN to R-FCN gradually raises the proportion of image-level computation in the network while lowering the proportion of region-level computation: in R-CNN nearly all computation is region-level, whereas in R-FCN nearly all of it is image-level.

(3) Detection algorithms based on direct regression

Basic idea

Proposal-based methods, with their two-step procedure, detect well but remain some distance from real-time speed. Direct-regression methods need no proposals and output classification/regression results directly. Because the image is fed through the network only once, these methods are usually faster and can reach real-time speed.

YOLO

YOLO divides the image into a 7×7 grid, where each ground-truth target in the image is assigned to the grid cell containing its center and to the closest anchor box. For each grid cell, the network must predict: the probability that each anchor box contains a target (which should be 0 when there is no target, and otherwise the IoU between the anchor and the ground-truth box), the 4 coordinates of each anchor box, and the class probability distribution of the cell. The class probability distribution of each anchor equals the probability that the anchor contains a target times the cell's class distribution. Compared with proposal-based methods, YOLO must predict the probability of containing a target because most of the image contains no target, and during training the coordinates and class distribution are updated only where a target exists.

The advantages of YOLO: (1) the receptive field of proposal-based methods is a local region of the image, whereas YOLO can use information from the whole image; (2) better generalization ability.

The limitations of YOLO: (1) it cannot handle cases where the number of targets in a cell exceeds the preset fixed value, or where several targets in a cell fall to the same anchor; (2) its detection of small targets is not good enough; (3) its detection of boxes with uncommon aspect ratios is weak; (4) the loss ignores bounding-box size: a small offset in a large box and the same offset in a small box should have different effects.

SSD

Compared with YOLO, SSD appends several convolution layers after the backbone features to shrink the spatial size, and combines detection results from multiple convolution layers to detect targets of different sizes. In addition, similar to Faster R-CNN's RPN, SSD replaces YOLO's fully-connected layers with 3×3 convolutions to classify/regress anchor boxes of different sizes and aspect ratios. SSD is faster than YOLO, with detection performance close to Faster R-CNN. Later research found that, compared with other methods, SSD is relatively less affected by the performance of its base model.

FPN

The previous methods all use the high-level convolutional features, but high-level features lose some detail. FPN fuses features from multiple levels, combining high-level, low-resolution, semantically strong information with low-level, high-resolution, semantically weak information to strengthen the network's handling of small targets. Moreover, unlike the usual approach of predicting from a single multi-level fused result, FPN predicts independently at each level. FPN can be combined with both proposal-based and direct-regression methods. Combined with Faster R-CNN, FPN substantially improves small-target detection with almost no increase in the original model's computation.

RetinaNet

RetinaNet argues that direct-regression methods usually underperform proposal-based ones because the former face extreme class imbalance: proposal-based methods filter out most background regions through the proposals, while direct-regression methods must confront the class imbalance directly. RetinaNet therefore modifies the classic cross-entropy loss to lower the loss contributed by examples that are already classified well, proposing the focal loss, so that training focuses more on hard examples. RetinaNet achieves speed close to direct-regression methods while exceeding the performance of proposal-based methods.
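A numpy sketch of the binary focal loss, using the commonly quoted settings gamma = 2 and alpha = 0.25; p is the predicted probability of the positive class:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss. p: predicted probability of the positive class,
    y: 1 for positive, 0 for negative. The (1 - p_t)**gamma factor shrinks
    the loss of examples that are already classified well."""
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# A well-classified easy negative (p = 0.01) contributes orders of magnitude
# less loss than a hard negative (p = 0.9).
easy = focal_loss(np.array([0.01]), np.array([0]))
hard = focal_loss(np.array([0.9]), np.array([0]))
print(float(easy), float(hard))
```

Because the abundant easy backgrounds are down-weighted so strongly, the hard examples dominate the gradient, which is exactly the rebalancing RetinaNet needs.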

(4) Common tricks in object detection

Non-maximum suppression (non-max suppression, NMS)

One problem that may arise in object detection is that the model makes multiple predictions for the same target, producing several bounding boxes. NMS aims to keep the prediction closest to the ground-truth box and suppress the rest. The procedure: first, for each class, NMS collects each prediction's output probability of belonging to that class and sorts the predictions by that probability from high to low. Next, NMS considers predictions with very low probability not to have found a target, and suppresses them. Then, among the remaining predictions, NMS takes the one with the highest probability, outputs it, and suppresses every other box that overlaps it heavily (e.g., IoU above 0.3). The previous step is repeated until all predictions have been processed.
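Greedy NMS for a single class can be sketched as follows (the 0.3 IoU threshold follows the example above):

```python
def box_iou(a, b):
    # Intersection over union of two (x1, y1, x2, y2) boxes.
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy NMS for one class: keep the highest-scoring box, suppress
    boxes overlapping it by more than iou_thresh, and repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if box_iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 too heavily
```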

Online hard example mining (OHEM)

Another problem in object detection is class imbalance: most of the image contains no target, while only a small part does. Moreover, detection difficulty varies greatly across targets: the vast majority are easy to detect, while a small fraction are very hard. OHEM, in a spirit similar to boosting, sorts all proposals by their loss value and selects the subset with the highest losses for optimization, making the network focus more on the harder targets in the image. In addition, to avoid selecting proposals that heavily overlap one another, OHEM applies NMS to the proposals according to their loss values.

Regression in log space

Regression is much harder to optimize than classification. The ℓ2 loss is sensitive to outliers: because of the square, an outlier incurs a large loss and a very large gradient, so gradient explosion happens easily during training; the ℓ1 loss, on the other hand, has a discontinuous gradient. In log space the dynamic range of the values is much smaller, and regression becomes much easier to train. In addition, some work optimizes a smooth ℓ1 loss instead, and normalizing the regression targets in advance also helps training.
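A sketch of the smooth ℓ1 loss mentioned above: quadratic near zero, so the gradient is continuous (unlike ℓ1), and linear beyond |x| = 1, so outliers cannot blow up the gradient (unlike ℓ2):

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 (Huber-like) loss on the regression residual x:
    0.5 * x**2 for |x| < 1, and |x| - 0.5 otherwise."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

residuals = np.array([0.0, 0.5, 1.0, 10.0])
print(smooth_l1(residuals))  # [0.    0.125 0.5   9.5  ]
```

At the outlier residual 10, the smooth ℓ1 gradient is 1, whereas the ℓ2 gradient would be 20, which is the robustness the text refers to.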

Semantic segmentation (semantic segmentation)

Semantic segmentation is a further step beyond object detection: detection only needs to box each target, while semantic segmentation must additionally decide which pixels of the image belong to which target.

(1) Common semantic segmentation datasets

PASCAL VOC 2012  1.5k training images, 1.5k validation images, 20 classes (including background).

MS COCO  COCO is harder than VOC: 83k training images, 41k validation images, 80k test images, 80 classes.

(2) The basic approach to semantic segmentation

Basic idea

Classify the image pixel by pixel. We feed the whole image into the network and make the output's spatial size match the input's, with the number of channels equal to the number of classes, representing the per-class probabilities at each spatial position; the image can then be classified pixel by pixel.

Fully convolutional network + deconvolution network

To give the output this three-dimensional structure, a fully convolutional network has no fully-connected layers, only convolution and pooling layers. But as convolution and pooling proceed, the number of channels grows while the spatial size shrinks. To make the output's spatial size match the input's, the fully convolutional network uses deconvolution and unpooling to enlarge the spatial size.

Deconvolution (deconvolution) / transposed convolution (transpose convolution)

A standard convolution filter slides over the input image, taking a dot product with a local input region at each step to produce one output; a deconvolution filter slides over the output image, with each single input neuron multiplied by the filter to produce a local output region. The forward pass of deconvolution performs the same mathematical operation as the backward pass of convolution. Like standard convolution filters, deconvolution filters are learned from data.
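A 1D numpy sketch makes the relationship concrete: the transposed convolution scatters a scaled copy of the filter per input neuron, and with stride 1 it is exactly the adjoint (the backward pass) of the convolution:

```python
import numpy as np

def conv1d(x, w, stride=1):
    # Standard 1D convolution (correlation): dot product per window.
    k = len(w)
    return np.array([x[i:i + k] @ w
                     for i in range(0, len(x) - k + 1, stride)])

def transpose_conv1d(x, w, stride=1):
    # Each input neuron scatters a copy of the filter into the output,
    # scaled by its value; overlapping contributions are summed.
    k = len(w)
    out = np.zeros(stride * (len(x) - 1) + k)
    for i, v in enumerate(x):
        out[i * stride:i * stride + k] += v * w
    return out

w = np.array([1.0, 2.0, 1.0])
x = np.array([1.0, 0.0, 2.0])
print(transpose_conv1d(x, w, stride=2))  # length 2*(3-1)+3 = 7
```

With stride 2 the output is longer than the input, which is how fully convolutional networks use transposed convolution to enlarge the spatial size.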

Max unpooling (max-unpooling)

Fully convolutional networks usually have a symmetric structure. During max pooling we record the position of the maximum within each local region; in the corresponding max unpooling we place the input value at that recorded position and fill the remaining positions with zeros. Max unpooling can recover some of the spatial information lost during max pooling. The forward pass of max unpooling performs the same mathematical operation as the backward pass of max pooling.

(3) Common tricks in semantic segmentation

Dilated convolution (dilated convolution)

A trick frequently used in segmentation tasks to enlarge the effective receptive field. In a standard convolution, the local input region corresponding to each output neuron is contiguous; in a dilated convolution, the corresponding input region is not contiguous in space. Dilated convolution keeps the number of convolution parameters unchanged but has a larger effective receptive field.
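A 1D numpy sketch of dilation: the kernel's taps are spread `dilation` apart, so the same three parameters cover a wider input span:

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """1D convolution whose taps are `dilation` apart: a kernel of size k
    covers a span of dilation*(k-1)+1 inputs with the same k parameters."""
    k = len(w)
    span = dilation * (k - 1) + 1
    return np.array([x[i:i + span:dilation] @ w
                     for i in range(len(x) - span + 1)])

x = np.arange(10.0)
w = np.array([1.0, 1.0, 1.0])
print(len(dilated_conv1d(x, w, dilation=1)))  # span 3 -> 8 outputs
print(len(dilated_conv1d(x, w, dilation=2)))  # span 5 -> 6 outputs
```

With dilation 2, each output sees a span of 5 inputs instead of 3, yet the parameter count stays at 3: a larger effective receptive field at no parameter cost.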

Conditional random field (conditional random field, CRF)

The conditional random field is a probabilistic graphical model often used to refine the output of a fully convolutional network, improving its detail. The motivation is that pixels close to each other, or pixels with similar values, are more likely to belong to the same class. In addition, some research work approximates the conditional random field with recurrent neural networks. Another drawback of the conditional random field is that it considers pairwise relations between pixels, which makes it slow to run.

Using low-level information

Incorporating low-level results can compensate for the detail and edge information lost as the network deepens.

Instance segmentation (instance segmentation)

Semantic segmentation does not distinguish different instances of the same class. For example, when there are several cats in an image, semantic segmentation predicts all the pixels of both cats as the single class "cat". Instance segmentation, by contrast, must distinguish which pixels belong to the first cat and which pixels belong to the second.

Basic idea

Object detection + semantic segmentation: first box the different instances in the image with a detection method, then label pixels inside each bounding box with a segmentation method.

Mask R-CNN

Mask R-CNN uses FPN for object detection and adds an extra branch for semantic segmentation (the extra segmentation branch does not share parameters with the original detection branches), so Mask R-CNN has three output branches (classification, coordinate regression, and segmentation). Its other improvements include: (1) An improved RoI pooling: using bilinear interpolation, the alignment between proposals and convolutional features loses no information to quantization. (2) At segmentation time, Mask R-CNN decouples the two tasks of judging the class and outputting the mask, handling each class's mask separately with a sigmoid and a logistic loss; this works better than the classic segmentation approach of letting all classes compete with one another through a softmax.