author : Chen Taihong
Paper address : <>

1 Overview of object detection algorithms

CornerNet is a paper by Hei Law et al. of the University of Michigan, published at ECCV 2018, whose main subject is object detection. Before introducing the CornerNet paper, we first review the mainstream algorithms in object detection, because the algorithm proposed by the authors differs from them.

Deep learning detectors fall into two main types: one-stage (e.g. SSD, YOLO) and two-stage (e.g. the R-CNN series). A one-stage detector produces detections directly from a single computation over the image. A two-stage detector first extracts proposals and then refines them in a second step. Relatively speaking, one-stage detectors are fast but less accurate, while two-stage detectors are more accurate but slower.

In 2012, AlexNet, a CNN based on deep learning, shone in the ILSVRC ImageNet competition. In 2014, Ross Girshick used a CNN to replace hand-crafted feature extraction such as HOG and DPM. He divided object detection into three steps: first, extract detection proposals from the image, i.e. regions that may contain an object; then use a CNN to extract features from each proposal; finally, classify the extracted features with an SVM to complete detection. This method, R-CNN, is the ancestor of two-stage object detectors.

From R-CNN and SPPNet to Fast R-CNN and then Faster R-CNN, the three steps of object detection (region selection, feature extraction, classification and regression) were unified into a single deep network, and running speed improved greatly. Techniques such as FCN, FPN, RoI Align and the mask branch pushed Faster R-CNN a great step forward. In turn, FCN, IoU, NMS, ION, FPN, RoI Align, the mask branch and related techniques fed into the evolution of YOLO, SSD, AttractioNet, G-CNN, R-FCN, Mask R-CNN, Mask^X R-CNN and others.

Figure 1 Faster R-CNN algorithm framework

One-stage detection algorithms do not need a region proposal stage. They directly produce the class probabilities and position coordinates of objects, so the final result is obtained in a single pass and detection is faster. Typical algorithms include YOLO, SSD and RetinaNet. YOLO uses a divide-and-conquer idea: it divides the input image into an S×S grid, and each grid cell is classified separately. SSD combines the ideas of YOLO and anchors and makes innovative use of a feature pyramid structure. YOLO, YOLOv2, YOLOv3, SSD and DSSD introduced real-time models and made object detection faster.

2 Motivation

CornerNet argues that the most obvious drawback of mainstream detection is its reliance on anchor boxes, whether they are extracted in the region proposal stage of a two-stage detector or used directly by a one-stage detector. (1) The number of anchor boxes is large: DSSD uses 40k and RetinaNet uses 100k. So many anchor boxes cause an imbalance between positive and negative samples. (2) Anchor boxes bring many hyperparameters that need tuning, such as the number, sizes and aspect ratios of the boxes, which affects the training and inference speed of the model.

The paper proposes a one-stage detection method that abandons the traditional anchor-box approach. The proposed CornerNet model predicts a pair of corners, the top-left and the bottom-right, of each object's bounding box. A single convolutional network produces heatmaps and embedding vectors: one heatmap for the top-left corners of all objects, one heatmap for the bottom-right corners of all objects, and an embedding vector for each corner.

Figure 2 CornerNet framework

The authors' idea actually comes from a paper on multi-person pose estimation [1]. CNN-based 2D multi-person pose estimation methods usually follow one of 2 approaches (bottom-up and top-down):

Top-down framework: first detect people to get their bounding boxes, then detect the body keypoints inside each bounding box and connect them into each person's pose. The drawback is that the result depends heavily on the person detection boxes. A representative algorithm is RMPE.

Bottom-up framework: first detect the body keypoints of everyone in the whole image, then assemble the detected body parts into each person's pose. The drawback is that it may ... A representative method is OpenPose.

The first innovation of the paper is at the level of methodology. Following the bottom-up idea from multi-person pose estimation, the model simultaneously predicts heatmaps and embedding vectors for the corner pairs (top-left and bottom-right) of the bounding boxes, and then groups the corners according to their embedding vectors.

The second innovation of the paper is corner pooling, which is used to locate corners. Most objects in nature have no bounding box and no rectangular corners, so a corner often has no local visual evidence. Taking top-left corner pooling as an example: for each channel, take the maximum value looking right along the horizontal direction and the maximum value looking down along the vertical direction from each location, then sum the two.

Figure 3 How corner pooling is computed
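The top-left corner pooling described above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the function name `top_left_corner_pool` is hypothetical, the two inputs stand for the two feature maps of a single channel, and nested lists are used instead of tensors. It is not the authors' implementation.

```python
def top_left_corner_pool(f_top, f_left):
    """Top-left corner pooling for one channel (illustrative sketch).

    f_top and f_left are H x W nested lists. For each location (i, j):
      - take the max of f_top over rows i..H-1 in column j (looking down),
      - take the max of f_left over columns j..W-1 in row i (looking right),
      - return the sum of the two maxima.
    """
    h, w = len(f_top), len(f_top[0])
    t = [row[:] for row in f_top]   # running max, scanned bottom to top
    l = [row[:] for row in f_left]  # running max, scanned right to left

    for i in range(h - 2, -1, -1):
        for j in range(w):
            t[i][j] = max(t[i][j], t[i + 1][j])

    for i in range(h):
        for j in range(w - 2, -1, -1):
            l[i][j] = max(l[i][j], l[i][j + 1])

    return [[t[i][j] + l[i][j] for j in range(w)] for i in range(h)]


f_top = [[1, 0, 2], [0, 3, 1], [4, 1, 0]]
f_left = [[2, 1, 0], [0, 1, 3], [1, 0, 2]]
pooled = top_left_corner_pool(f_top, f_left)
```

The running-max scan makes the cost linear in the number of pixels, instead of rescanning a whole row and column for every location.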

The paper gives two reasons why detecting corners with corner pooling works. (1) The center of a bounding box is hard to localize because it depends on all 4 edges of the box, whereas each corner depends on only two edges, so corners are easier to extract. (2) Corners discretize the space of boxes more efficiently: O(wh) corners are enough to represent O(w^2 h^2) possible anchor boxes.
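To make reason (2) concrete, here is a small worked count. It rests on an assumption of mine rather than on the paper: corners live on a w × h grid of cells, and a box is any pair of a top-left cell and a bottom-right cell with strictly increasing coordinates.

```latex
\begin{aligned}
\#\,\text{corner locations} &= wh \\
\#\,\text{boxes} &= \binom{w}{2}\binom{h}{2}
  = \frac{w(w-1)}{2}\cdot\frac{h(h-1)}{2} = O(w^{2}h^{2}) \\
w = h = 128:\quad wh &= 16{,}384, \qquad
  \binom{128}{2}^{2} = 8128^{2} = 66{,}064{,}384
\end{aligned}
```

So on a 128 × 128 grid, about sixteen thousand corner locations cover roughly sixty-six million candidate boxes.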

The third innovation of the paper is the model itself: it is built on the hourglass architecture and trained with a variant of focal loss [5].

The paper validates CornerNet on MS COCO, where it achieves 42.1% AP and beats all previous one-stage object detectors. The authors also released PyTorch-based source code on GitHub.

3 Architecture

3.1 Overview

Figure 4 CornerNet model architecture

As shown in Figure 4, the CornerNet architecture consists of three parts: the Hourglass [7] network, the top-left and bottom-right corner heatmaps, and the prediction module.

The hourglass network is a typical architecture for human pose estimation. The paper stacks two hourglass networks to produce the top-left and bottom-right corners. Each corner branch contains corner pooling and outputs the corresponding heatmaps, embedding vectors and offsets. The embedding vectors are trained so that the two corners of the same object (top-left and bottom-right) are as close to each other as possible, and the offsets adjust the corner positions to produce tighter bounding boxes.

3.2 Detecting Corners
The heatmaps produced by the model contain C channels (C is the number of object categories; there is no background channel). Each channel is a binary mask that marks the corner locations for the corresponding category.

For each corner there is only one ground-truth positive location; all other locations are negative. During training, the model reduces the penalty given to negative locations that lie within a radius r of a ground-truth corner, because a corner that falls within radius r can still produce a valid bounding box. The radius is chosen so that such a box keeps a sufficient overlap with the ground-truth box; the paper sets the IoU threshold to 0.7.
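A minimal sketch of this penalty reduction, assuming (as I recall from the CornerNet paper) an unnormalized 2D Gaussian centered on the ground-truth corner with sigma = r / 3. The function name `gaussian_targets` and the nested-list heatmap are illustrative choices of mine, not the authors' code, and the radius is taken as a given positive integer rather than derived from the IoU threshold.

```python
import math


def gaussian_targets(height, width, corner, radius):
    """Build a soft ground-truth heatmap for one corner of one category.

    corner is (x, y) in heatmap coordinates; radius must be a positive int.
    The corner itself gets 1.0; cells within the radius get a value in
    (0, 1) that decays with distance; all other cells stay at 0.0.
    """
    cx, cy = corner
    sigma = radius / 3.0  # assumption: sigma is one third of the radius
    heat = [[0.0] * width for _ in range(height)]
    for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            if d2 <= radius ** 2:  # keep only cells inside the circle
                heat[y][x] = math.exp(-d2 / (2.0 * sigma ** 2))
    return heat


heat = gaussian_targets(9, 9, (4, 4), 3)
```

Cells close to the true corner get a target near 1, so a prediction there is penalized less than a prediction far away.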

p_cij denotes the score at location (i, j) of the predicted heatmap for category c, and y_cij is the ground truth at the corresponding location. The paper uses a variant of focal loss as the loss function for corner detection.
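For reference, the detection loss as I recall it from the CornerNet paper has the form below. N is the number of objects in the image, and α and β here are exponents (2 and 4 in the paper, to my recollection); they are distinct from the loss weights used in Section 4. Treat this as a sketch and check it against the original paper.

```latex
L_{det} = \frac{-1}{N} \sum_{c=1}^{C} \sum_{i=1}^{H} \sum_{j=1}^{W}
\begin{cases}
(1 - p_{cij})^{\alpha} \log(p_{cij}) & \text{if } y_{cij} = 1 \\
(1 - y_{cij})^{\beta} \, (p_{cij})^{\alpha} \log(1 - p_{cij}) & \text{otherwise}
\end{cases}
```

The factor (1 - y_cij)^β is what lowers the penalty for negative locations near a ground-truth corner, where y_cij is close to 1.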

Because of downsampling, the heatmaps produced by the model have a lower resolution than the input image. The paper therefore adds an offset loss, used to fine-tune the deviation between the predicted corner and the ground truth.
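A small sketch of what the offset captures, under my own assumptions: a corner at image coordinates (x, y) is mapped to a heatmap cell by integer division with a downsampling factor (`stride`, a hypothetical parameter name), and the fractional part that is lost becomes the regression target. The smooth L1 form of the penalty is my recollection of the paper, not something stated in this article.

```python
def corner_offset(x, y, stride):
    """Return the heatmap cell of a corner and the fraction lost by rounding."""
    cell = (x // stride, y // stride)
    offset = (x / stride - cell[0], y / stride - cell[1])
    return cell, offset


def smooth_l1(pred, target):
    """Smooth L1 penalty on one scalar (quadratic near 0, linear beyond 1)."""
    d = abs(pred - target)
    return 0.5 * d * d if d < 1.0 else d - 0.5


cell, offset = corner_offset(103, 58, 4)
# Map a heatmap cell plus a predicted offset back to image coordinates.
restored = ((cell[0] + offset[0]) * 4, (cell[1] + offset[1]) * 4)
```

Without the offset, the corner could only be restored to (100, 56), an error of up to stride - 1 pixels per axis, which matters most for small objects.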

3.3 Grouping Corners
An input image may contain multiple objects, so multiple top-left and bottom-right corners are produced. To group the corners, the paper borrows the associative embedding idea from [1]. During training, the model predicts an embedding vector for each corner and learns to make the distance between the embeddings of the two corners of the same object as small as possible. The trained model can then group the corners by their embedding vectors.
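The grouping step at inference time can be sketched as follows. Everything here is an assumption for illustration: the function name `group_corners`, the tuple layout of a corner, the use of 1-D embeddings, the distance threshold `max_dist=0.5`, and averaging the two corner scores. The real pipeline also restricts pairs to the same category and works on top-k heatmap peaks, which this sketch leaves out.

```python
def group_corners(top_lefts, bottom_rights, max_dist=0.5):
    """Pair top-left and bottom-right corners whose embeddings are close.

    Each corner is a tuple (x, y, score, embedding) with a scalar embedding.
    Returns boxes (x1, y1, x2, y2, score) sorted by descending score.
    """
    boxes = []
    for x1, y1, s1, e1 in top_lefts:
        for x2, y2, s2, e2 in bottom_rights:
            # A bottom-right corner must lie below and to the right.
            if x2 <= x1 or y2 <= y1:
                continue
            # Corners of different objects should have distant embeddings.
            if abs(e1 - e2) > max_dist:
                continue
            boxes.append((x1, y1, x2, y2, (s1 + s2) / 2.0))
    boxes.sort(key=lambda b: b[4], reverse=True)
    return boxes


tls = [(10, 10, 0.9, 0.1), (50, 40, 0.8, 2.0)]
brs = [(30, 35, 0.7, 0.2), (90, 80, 0.6, 2.1)]
boxes = group_corners(tls, brs)
```

Here the corner at (10, 10) is paired only with (30, 35), because the embedding of (90, 80) is too far away.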

The model is trained with the L_pull loss to group the corners of the same object and the L_push loss to separate the corners of different objects.
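A minimal sketch of the two losses, written from my recollection of the paper: each object k has a top-left embedding and a bottom-right embedding, their mean is the object's reference embedding, L_pull pulls both corners toward that mean, and L_push pushes the means of different objects at least a margin apart. The scalar embeddings, the margin `delta=1.0` and the function name `pull_push_loss` are assumptions for illustration.

```python
def pull_push_loss(tl_embeds, br_embeds, delta=1.0):
    """Return (pull, push) for N >= 1 objects with scalar embeddings.

    tl_embeds[k] and br_embeds[k] belong to the same object k.
    """
    n = len(tl_embeds)
    means = [(t + b) / 2.0 for t, b in zip(tl_embeds, br_embeds)]

    # Pull: squared distance of each corner embedding to its object's mean.
    pull = sum((t - m) ** 2 + (b - m) ** 2
               for t, b, m in zip(tl_embeds, br_embeds, means)) / n

    # Push: hinge on the distance between the means of different objects.
    push = 0.0
    if n > 1:
        for k in range(n):
            for j in range(n):
                if j != k:
                    push += max(0.0, delta - abs(means[k] - means[j]))
        push /= n * (n - 1)
    return pull, push


pull, push = pull_push_loss([0.0, 0.5], [0.0, 0.5])
```

In this example both objects have perfectly matched corners, so the pull term is zero, but their means are only 0.5 apart, so the push term is still active.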

3.4 Hourglass Network
The hourglass network also contains a bottom-up path (from high resolution to low resolution) and a top-down path (from low resolution to high resolution), and these bottom-up and top-down passes are repeated several times. The purpose of this design is to capture information at all scales. For the object detection task, the paper adjusts some details of the hourglass design.

4 Experiments

The training loss of the paper combines the 4 loss functions introduced in Section 3; α, β and γ adjust the weights of the corresponding terms.
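Written out, the combined loss as I recall it from the CornerNet paper is sketched below. The weight values in the comment (α = β = 0.1, γ = 1) are also from memory and should be checked against the original paper.

```latex
% Assumed values from the paper, to my recollection: \alpha = \beta = 0.1, \gamma = 1
L = L_{det} + \alpha L_{pull} + \beta L_{push} + \gamma L_{off}
```

The detection term carries no weight of its own, so the other three weights set the importance of grouping and offset refinement relative to corner detection.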

Training used 10 Titan X (PASCAL) GPUs; see the original paper for the detailed training parameters. The inference time of the model is 244 ms per image (Titan ...).

Compared with other one-stage object detectors, CornerNet shows a significant AP improvement on the MS COCO dataset. Its accuracy is close to that of two-stage detectors, but its inference time offers no obvious advantage.

Table 4 Performance comparison on the MS COCO test-dev dataset

5 Discussion

Personal opinion: CornerNet's innovation comes from the bottom-up idea in multi-person pose estimation. It predicts corner heatmaps and groups the corners by their embedding vectors. Its backbone, the hourglass network, also comes from pose estimation. The source code of the model has been released on GitHub, so readers can study and test it with confidence.

Many CV tasks are interrelated, and the CVPR 2018 best paper [8] supports this view. Looking for similarities across subfields and transferring algorithms between them is a trend in the CV community.

The hourglass network used in multi-person pose estimation is itself still being improved. In fact, the inference speed of the paper's model is limited by feature extraction in the hourglass network, so ambitious readers can follow this direction to achieve better performance.

The above is only my understanding, summary and reflection after reading the paper. Errors are inevitable, so please read with a skeptical and critical attitude. Exchanges and corrections are welcome.

6 References

[1] Newell, A., Huang, Z., Deng, J.: Associative embedding: End-to-end learning for joint detection and grouping. In: Advances in Neural Information Processing Systems. pp. 2274-2284 (2017)

[2] Law, H., Deng, J.: CornerNet: Detecting Objects as Paired Keypoints. ECCV 2018

[3] Girshick, R.: Fast R-CNN. arXiv preprint arXiv:1504.08083 (2015)

[4] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 580-587 (2014)

[5] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. arXiv preprint arXiv:1708.02002 (2017)

[6] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. In: European Conference on Computer Vision. pp. 21-37. Springer (2016)

[7] Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: European Conference on Computer Vision. pp. 483-499. Springer

[8] Zamir, A.R., Sax, A.: Taskonomy: Disentangling Task Transfer