Image Retrieval: A Summary of Current Methods

This post summarizes the current state of image retrieval, for my own and others' reference. The material was collected from the web; if anything infringes your rights, please contact me for removal.

 

End-to-end feature learning methods

NetVLAD: CNN architecture for weakly supervised place recognition (CVPR 2016)

This paper is the work of Relja Arandjelović et al. from INRIA. It focuses on a specific application of instance search: place recognition. In place recognition, given a query image, a large-scale location-labeled dataset is searched, and the locations of the similar images retrieved are used to estimate the location of the query image. The authors first build a large-scale location-labeled dataset using Google Street View Time Machine, and then propose a convolutional neural network architecture, NetVLAD, which embeds the VLAD method into a CNN network and enables end-to-end learning. The method is shown in the figure below:

[Figure: the NetVLAD architecture]


The hard-assignment operation in the original VLAD method (assigning each local feature to its nearest center) is non-differentiable, so it cannot be embedded directly into a CNN network to participate in error backpropagation. The solution in this paper is to use a softmax function to convert the hard assignment into a soft assignment: a 1x1 convolution followed by a softmax yields the probability/weight of a local feature belonging to each center, and the feature is assigned to the center with the maximum probability/weight. NetVLAD therefore contains three sets of learnable parameters: the 1x1 convolution weights and biases used to predict the soft assignment, and the cluster centers themselves. The accumulated-residual operation corresponding to the VLAD core layer in the figure above completes the aggregation. A code sketch follows; the figure after it illustrates NetVLAD's advantage over the original VLAD (greater flexibility: it learns better cluster centers).
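To make the soft-assignment idea concrete, here is a minimal PyTorch sketch of a NetVLAD layer. It is an illustrative reconstruction based on the description above, not the authors' code; the intra-normalization at the end is an assumption borrowed from common implementations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    """Soft-assignment VLAD: a 1x1 conv + softmax predicts cluster weights,
    then residuals to learnable centers are accumulated per cluster."""
    def __init__(self, num_clusters=64, dim=512):
        super().__init__()
        self.assign = nn.Conv2d(dim, num_clusters, kernel_size=1)  # soft-assignment predictor
        self.centers = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x):                          # x: (B, D, H, W) local features
        B, D, H, W = x.shape
        a = F.softmax(self.assign(x).view(B, -1, H * W), dim=1)    # (B, K, N) weights
        f = x.view(B, D, H * W)                                    # (B, D, N)
        # residual of every local feature to every center, weighted and summed
        r = f.unsqueeze(1) - self.centers.view(1, -1, D, 1)        # (B, K, D, N)
        vlad = (r * a.unsqueeze(2)).sum(dim=-1)                    # (B, K, D)
        vlad = F.normalize(vlad, dim=2)            # intra-normalization per cluster
        return F.normalize(vlad.flatten(1), dim=1) # final L2-normalized descriptor
```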

[Figure: NetVLAD learns better cluster centers than the original VLAD]

Another improvement in this paper is the weakly supervised triplet ranking loss. To cope with possible noise in the training data, the positive and negative samples of the triplet ranking loss are replaced by a set of potential positives (containing at least one true positive, though we do not know which) and a set of definite negatives. During training, the feature distance between the query image and the most likely positive in the positive set is constrained to be smaller than the distance between the query image and every image in the negative set.
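A minimal sketch of this weakly supervised loss, assuming L2-normalized feature vectors; the function and variable names are mine, not the paper's:

```python
import torch.nn.functional as F

def weak_triplet_loss(q, potential_pos, definite_neg, margin=0.1):
    """q: (D,) query feature; potential_pos: (P, D) candidates containing at
    least one true positive; definite_neg: (N, D) definite negatives.
    The closest candidate stands in for the positive, and it must be closer
    to the query than every definite negative by the margin."""
    d_pos = ((q - potential_pos) ** 2).sum(dim=1).min()   # best potential positive
    d_neg = ((q - definite_neg) ** 2).sum(dim=1)          # all definite negatives
    return F.relu(margin + d_pos - d_neg).sum()
```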

Deep Relative Distance Learning: Tell the Difference Between Similar Vehicles
(CVPR 2016)

The next article focuses on the vehicle re-identification/search problem and is the work of Hongye Liu et al. from Peking University. As shown in the figure below, this problem can also be regarded as an instance search task.

[Figure: vehicle search posed as an instance search task]


Like many supervised deep instance search methods, this paper aims to map the original images into a Euclidean feature space in which images of the same vehicle are close together and images of different vehicles are far apart. The common way to achieve this is to train a CNN by optimizing a triplet ranking loss. However, the authors found that the original triplet ranking loss has problems, as shown in the figure below:

[Figure: a failure mode of the original triplet ranking loss]
For the same samples, the triplet on the left is adjusted by the loss function while the triplet on the right is ignored. The difference between the two is the choice of anchor, which makes training unstable. To overcome this, the authors replace the triplet ranking loss with a coupled clusters loss (CCL). The characteristic of this loss is that it turns the triplet into a positive sample set and a negative sample set, pulling the positive samples together while pushing the negatives away from them, thereby avoiding the negative impact of a randomly chosen anchor. A sketch follows; the figure after it shows the effect of this loss function:
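A minimal sketch of the coupled clusters loss under my reading of the description above (the center is taken as the mean of the positive set; the names and margin value are assumptions):

```python
import torch.nn.functional as F

def coupled_clusters_loss(pos, neg, margin=0.5):
    """pos: (P, D) features of one vehicle; neg: (N, D) features of others.
    Positives should gather around their mean center, and every positive
    should be closer to that center than the nearest negative, so no random
    anchor choice is involved."""
    center = pos.mean(dim=0, keepdim=True)              # cluster center of positives
    d_pos = ((pos - center) ** 2).sum(dim=1)            # positives to center
    d_neg_min = ((neg - center) ** 2).sum(dim=1).min()  # nearest negative to center
    return F.relu(d_pos + margin - d_neg_min).sum()
```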

[Figure: effect of the coupled clusters loss]

Finally, to address the particularities of the vehicle problem, the paper combines the coupled clusters loss designed above with a hybrid network architecture, and builds a dedicated vehicle database to provide the required training samples.

DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich
Annotations (CVPR 2016)

The final article, also published at CVPR 2016, addresses clothing recognition and retrieval, again a task related to instance search, and is the work of Ziwei Liu et al. from the Chinese University of Hong Kong. First, the article introduces a clothing database named DeepFashion. The database contains over 800K clothing images with 50 fine-grained categories and 1,000 attributes, and also provides clothing landmarks and cross-pose/cross-domain pair correspondences. Some examples are shown in the figure below:

[Figure: example images and annotations from DeepFashion]


Then, to demonstrate the value of the database, the authors propose a novel deep network, FashionNet, which learns more discriminative features by jointly predicting clothing landmarks and attributes. The overall framework of the network is as follows:

[Figure: the FashionNet framework]


The forward computation of FashionNet has three stages. In the first stage, a clothing image is fed into the blue branch of the network to predict the visibility and location of clothing landmarks. In the second stage, based on the landmark locations predicted in the previous step, a landmark pooling layer extracts local features of the clothes. In the third stage, the "fc6 global" global features and the "fc6 local" local features are fused into "fc7_fusion" as the final image feature. FashionNet introduces four loss functions and is optimized with an iterative training strategy. The losses are: a regression loss for landmark localization, softmax losses for landmark visibility and clothing category, a cross-entropy loss for attribute prediction, and a triplet loss for learning similarity between clothes. The authors compare FashionNet with other methods on clothing classification, attribute prediction, and clothing retrieval, achieving significantly better results on all of them.
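A rough sketch of the third-stage fusion (layer sizes are assumptions, and the landmark pooling that produces the local branch is omitted):

```python
import torch
import torch.nn as nn

class FashionFusion(nn.Module):
    """Concatenate the "fc6 global" and landmark-pooled "fc6 local" features,
    then map them through "fc7_fusion" to get the final image feature."""
    def __init__(self, d_global=4096, d_local=4096, d_out=4096):
        super().__init__()
        self.fc7_fusion = nn.Linear(d_global + d_local, d_out)

    def forward(self, fc6_global, fc6_local):
        fused = torch.cat([fc6_global, fc6_local], dim=1)  # (B, d_global + d_local)
        return self.fc7_fusion(fused)                      # final image feature
```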


Summary: when there is enough labeled data, deep learning can learn the image features and the metric function simultaneously. The underlying idea is that, given a metric function, features are learned so as to be maximally discriminative in that metric space. The main research directions for end-to-end feature learning are therefore how to construct better feature representations and better forms of loss function.

Feature encoding methods based on CNN features


The deep instance search algorithms introduced above focus on data-driven end-to-end feature learning and the corresponding image retrieval datasets. Next, this article turns to another problem: how to extract effective image features when such retrieval datasets are unavailable. To overcome the shortage of domain-specific data, a feasible strategy is to start from a pre-trained CNN model (a CNN trained on another task's dataset, e.g., the ImageNet image classification dataset), extract the feature map of some layer, and encode it into image features suited to the instance search task. Based on relevant papers from recent years, this part introduces some of the main methods (in particular, all the CNN models below are pre-trained on the ImageNet classification dataset).

Multi-Scale Orderless Pooling of Deep Convolutional Activation Features (ECCV
2014)

This article was published at ECCV 2014 and is the work of Yunchao Gong from the University of North Carolina at Chapel Hill, Liwei Wang from the University of Illinois at Urbana-Champaign, and others. Because global CNN features lack geometric invariance, they limit the classification and matching of variable scenes. The authors attribute the problem to global CNN features containing too much spatial information, and therefore propose multi-scale orderless pooling (MOP-CNN), which combines CNN features with the orderless VLAD encoding method.


The main steps of MOP-CNN are: first, use the CNN network as a “local feature” extractor and extract the image's “local features” at each scale; then use VLAD to encode these “local features” into an image feature at that scale; finally, concatenate the image features of all scales to form the final image feature. A code sketch of these steps follows; the feature extraction framework figure is below it.
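In this sketch, `cnn` and the per-scale `vlad_encoders[s].encode` are assumed interfaces standing in for the pre-trained network and fitted VLAD codebooks:

```python
import numpy as np

def mop_cnn(image, cnn, vlad_encoders, scales=(256, 128, 64), stride=32):
    """At each scale, slide a window over the image, treat the CNN activation
    of each patch as a "local feature", VLAD-encode the features of that
    scale, and finally concatenate the per-scale codes."""
    H, W = image.shape[:2]
    codes = []
    for s in scales:
        feats = [cnn(image[y:y + s, x:x + s])             # CNN feature per patch
                 for y in range(0, H - s + 1, stride)
                 for x in range(0, W - s + 1, stride)]
        codes.append(vlad_encoders[s].encode(np.stack(feats)))  # scale-level code
    return np.concatenate(codes)                          # multi-scale descriptor
```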

[Figure: the MOP-CNN feature extraction framework]

The authors evaluate on two tasks, classification and instance retrieval, and show, as in the figure below, that MOP-CNN achieves better classification and retrieval results than ordinary global CNN features.

[Figure: classification and retrieval results of MOP-CNN vs. global CNN features]

Exploiting Local Features from Deep Networks for Image Retrieval (CVPR 2015
workshop)

This article was published at the CVPR 2015 workshop and is the work of Joe Yue-Hei Ng et al. from the University of Maryland, College Park. Many recent studies have shown that, compared with the output of fully connected layers, the feature maps of convolutional layers are better suited to instance search. This paper describes how to turn convolutional feature maps into “local features” and encode them into an image feature with VLAD. In addition, a series of experiments examines how feature maps from different convolutional layers affect instance search accuracy.



Aggregating Deep Convolutional Features for Image Retrieval (ICCV 2015)

The next article was published at ICCV 2015 and is the work of Artem Babenko from the Moscow Institute of Physics and Technology and Victor Lempitsky from the Skolkovo Institute of Science and Technology. As the two articles above show, many deep instance search methods use orderless encoding methods, but encodings such as VLAD and Fisher Vector are usually computationally expensive. To overcome this, this article designs a simpler and more efficient encoding method, sum pooling, defined as follows:
ψ(I) = Σ_{y=1..H} Σ_{x=1..W} f_{(x,y)}
where f_{(x,y)} is the local feature of the convolutional layer at spatial position (x, y) (the local features are extracted in the same way as in the previous article). After sum pooling, PCA and L2 normalization are applied to the global feature to obtain the final descriptor. The authors compare against Fisher Vector, triangulation embedding, and max pooling, demonstrating that sum pooling is not only computationally simple but also more effective.
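A minimal sketch of the whole descriptor pipeline, assuming a fitted sklearn-style `pca` object (the exact placement of the normalizations follows my reading; treat it as an assumption):

```python
import numpy as np

def sum_pooled_descriptor(fmap, pca):
    """fmap: (K, H, W) conv feature map, i.e. a K-dim local feature f_(x,y)
    at every spatial position. Sum over positions, then PCA + L2-normalize."""
    psi = fmap.reshape(fmap.shape[0], -1).sum(axis=1)  # sum pooling over (x, y)
    psi = psi / np.linalg.norm(psi)                    # L2 normalization
    psi = pca.transform(psi[None, :])[0]               # PCA compression
    return psi / np.linalg.norm(psi)                   # final L2 normalization
```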

Source: Deep Learning Lecture, https://zhuanlan.zhihu.com/p/22265265


Where to Focus: Query Adaptive Matching for Instance Retrieval Using
Convolutional Feature Maps (arXiv 1606.6811)

 

This paper builds on 《Particular object retrieval with integral max-pooling of CNN activations》 and proposes a new reranking method.

Before describing the method, let's first look at convolutional feature maps.

[Figure: visualizations of feature maps from different convolutional layers]

The image above visualizes different convolutional layers. As we can see, early convolutional layers capture basic visual patterns, while late convolutional layers respond more to object outlines.

My organization of the reranking process in this paper:

I. Method introduction

1. Generating base regions; there are two ways:

1.1 Feature Map Pooling (FMP)

For a given convolutional layer with D convolution kernels, the network produces D feature maps (FM). For each FM, we select its non-zero responses as a base region (BR), so the number of BRs equals the number of FMs. Sum-pooling the response values inside each BR gives one value f_d per FM. However, for a given image, many FMs overlap heavily, so their pooled features f_d are largely the same; the f_d values are therefore clustered, with the number of cluster centers set to K (this can be understood as clustering the D BRs into K BRs).





[Figure: sum-pooling, i.e., adding up the response values]
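A small sketch of FMP as I understand it (clustering the D pooled values into K groups with k-means; the interface names are mine, not the paper's):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def fmp_base_regions(fmaps, K=10):
    """fmaps: (D, H, W) feature maps. Each map's non-zero responses form a
    base region; sum-pooling gives one value f_d per map, and the D values
    are clustered into K groups since overlapping maps give near-equal f_d."""
    f = np.array([fm[fm > 0].sum() for fm in fmaps])   # sum-pool non-zero responses
    centroids, labels = kmeans2(f[:, None], K, seed=0) # cluster D values into K
    return f, labels
```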

1.2 Overlapped Spatial Pyramid Pooling (OSPP)

 

The OSPP method extracts regions in the same way as the R-MAC paper: for each scale l, we extract l × (l + m − 1) regions, each of width 2·min(W, H)/(l + 1), and then uniformly sample m regions (the BRs).
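A simplified sketch of the region sampling (the exact per-axis counts in R-MAC are chosen to give a fixed overlap between neighbours; here I just take uniform positions as an illustration):

```python
def ospp_regions(W, H, scales=(1, 2, 3)):
    """At scale l the square region side is 2*min(W, H)/(l + 1); top-left
    corners are sampled uniformly so that neighbouring regions overlap.
    Returns (x, y, side) triples on the feature-map grid."""
    regions = []
    for l in scales:
        side = 2 * min(W, H) / (l + 1)                # region width at scale l
        n = l + 1                                     # positions per axis (simplified)
        xs = [i * (W - side) / max(n - 1, 1) for i in range(n)]
        ys = [j * (H - side) / max(n - 1, 1) for j in range(n)]
        regions += [(int(x), int(y), int(side)) for x in xs for y in ys]
    return regions
```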

2. Reranking process

The paper proposes Query Adaptive Matching (QAM), a reranking method in which the BRs are merged into a single merged region, and this selection process is cast as an optimization problem. Through this process, for each image, the merged region most similar to the query is selected.

[Figure: the QAM optimization problem]

Through the above optimization (in the end it is an ordinary quadratic programming problem), we obtain a merged region for each image. The computed similarity between the query and the merged region serves as the reranking score, which gives the final ranking.

Now let me share my understanding of the base region generation process combined with QAM:

For the FMP method: every feature map yields one base region, so with FMP the number of base regions we end up with equals the number of convolution kernels in that layer. For the final base-region representation, the paper does use sum-pooling, so each base region ultimately yields a single value; in the optimization, the merged region's representation would then also become a single value, which cannot form an inner product with the query vector. This is what I keep wondering about in this paper; if any reader sees through the problem, please enlighten me. (It could also be a mistake in the paper.)

For the OSPP method: because base regions are selected at different scales on different FMs, different base regions have different vector representations, and we can easily apply QAM to select among them.


Deep Image Retrieval: Learning Global Representations for Image Search (ECCV 2016)

Paper: https://arxiv.org/abs/1604.01325

Extended version: End-to-end Learning of Deep Visual Representations for Image Retrieval, arXiv, Oct. 2016: https://arxiv.org/pdf/1610.07940v1.pdf

=====

Starting from the overall framework figure:

[Figure: overall framework]

As can be seen from the figure, the overall framework of this paper is:

1. Start from a model pre-trained on ImageNet (e.g., VGG16).

2. From the Landmarks dataset [17], mine a full or clean dataset (the Full Dataset has category labels; the Clean Dataset additionally has bounding boxes).

3. Fine-tune on the Full Dataset with an ordinary classification loss; fine-tune on the Clean Dataset with a triplet loss.



4. Use the trained model to extract features on the public datasets; the similarity measure is the dot product (equivalent to Euclidean distance on L2-normalized features).

The paper also uses query expansion to boost performance.
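For reference, a sketch of the simplest average query expansion on L2-normalized features (the exact variant the paper uses may differ):

```python
import numpy as np

def average_query_expansion(q, db, top_k=10):
    """q: (D,) L2-normalized query; db: (N, D) L2-normalized database features.
    Rank by dot product, average the query with its top-k neighbours,
    re-normalize, then score again."""
    top = np.argsort(-(db @ q))[:top_k]       # first-pass nearest neighbours
    q_exp = q + db[top].sum(axis=0)           # fuse query with neighbours
    q_exp /= np.linalg.norm(q_exp)
    return db @ q_exp                         # second-pass similarity scores
```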

=====

Let's focus on points 1 and 3 above.

(As for how the dataset is obtained, I honestly did not fully understand it; just know that a clean dataset needs to be provided.)

 

1. Pre-trained model and the framework

Here we can use AlexNet, VGGNet, ResNet, etc., depending on the effect you want (performance vs. speed).

For VGGNet (e.g., VGG16), the fully connected layers are removed and replaced with RPN + RoI pooling + shift + fc + L2, etc.

Why use an RPN? Here it replaces the rigid-grid approach (only at test time; during fine-tuning, the proposals are still the rigid grid; see the papers cited in the paper for details).

You can also read the extended version of the paper, where the RPN completely replaces the rigid grid, forming an end-to-end framework.

As for shift + fc, their role is to replace the PCA whitening of the usual pipeline.

The L2 normalization here, the subsequent summation (the features of all regions are summed element-wise to obtain the final global, compact image representation), and the second L2 normalization all follow the usual pipeline.

(For details, see Particular Object Retrieval with Integral Max-Pooling of CNN Activations, ICLR 2016: https://arxiv.org/abs/1511.05879.)

 

Because all of the operations above are differentiable, they can be embedded into a single model and trained with forward and backward passes, rather than remaining a multi-stage pipeline.

 

2. A brief introduction to the MAC feature (the pooling can be sum, max, or something else):



R-MAC: ordinary MAC operates on the feature map of the whole image, while R-MAC follows the RoI pooling idea: project the bounding box onto the feature map, and then pool only over the region projected onto the feature map.
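A sketch of both poolings on a (K, H, W) feature map; the stride used for projecting the box is an assumption (16 for VGG16's conv5):

```python
import numpy as np

def mac(fmap):
    """MAC: max-pool each of the K maps over the whole image."""
    return fmap.reshape(fmap.shape[0], -1).max(axis=1)        # (K,)

def r_mac_region(fmap, box, stride=16):
    """R-MAC-style region pooling: project an image-space box (x0, y0, x1, y1)
    onto the feature map via the network stride, then max-pool only there."""
    x0, y0, x1, y1 = [max(int(round(c / stride)), 0) for c in box]
    region = fmap[:, y0:y1 + 1, x0:x1 + 1]                    # projected region
    return region.reshape(region.shape[0], -1).max(axis=1)    # (K,)
```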

 

3. RPN

There is not much to say here; please refer to Faster R-CNN.

Concretely, the RPN sub-network is placed on top of the framework above, using the Clean Dataset as data.

 

4. Training

First, regions are generated with the rigid grid and used to train the Siamese triplet loss or simple classification, using the Clean Dataset and the Full Dataset respectively.

Then the RPN is trained with the Clean Dataset.

Finally, at test time, the proposals produced by the RPN replace the rigid grid.

(The paper mentions that initializing the triplet-loss model with the classification model trained on the Full Dataset gives even better results.)

For the specific network parameters, please refer to the paper.



 

5. Data generation

[Figures: how the training data is generated]

Notes on Faster R-CNN Features for Instance Search (CVPR 2016)


Paper source code and video: http://imatge-upc.github.io/retrieval-2016-deepvision/

My own slides: http://download.csdn.net/detail/dengbingfeng/9524748

This walkthrough assumes a basic understanding of Faster R-CNN.

 

 

Basic pipeline:

An off-the-shelf Faster R-CNN object detector is used to extract both the whole-image convolutional features and the region convolutional features in a single forward pass, sharing computation.


检索物体在检索图像中用提供的坐标框表示其位置,使用faster-rcnn提取整个数据集图像的conv5_3层特征,并于待检索图像的conv5_3层特征比较余弦相似度,这样便完成对整个数据集图像的第一次rank,即和待检索图片越相似越排名越靠前.

On top of the first ranking, for the top-N images, the object detection boxes produced by Faster R-CNN are used: the pool5 features of all detection boxes are compared by cosine similarity with the pool5 features of the query object, again ranking more similar ones higher; this completes the rerank, i.e., the second ranking.

Finally, the top-10 results are displayed.



Details:

1. Image-wise pooling of activations (IPA)

The activations of the last convolutional layer are used to build a description of the whole image.

2. Region-wise pooling of activations (RPA)

For the proposals produced by the RPN, the sum-pooled convolutional features are first L2-normalized, whitened, and then L2-normalized once more, while max-pooled features undergo only a single L2 normalization.
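A small sketch of that post-processing chain, assuming a fitted sklearn-style `whiten` transform:

```python
import numpy as np

def normalize_sum_pooled(v, whiten):
    """L2-normalize the sum-pooled proposal feature, whiten it, then
    L2-normalize once more (max-pooled features would get a single L2)."""
    v = v / np.linalg.norm(v)              # first L2 normalization
    v = whiten.transform(v[None, :])[0]    # whitening
    return v / np.linalg.norm(v)           # second L2 normalization
```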

3. Fine-tuning Faster R-CNN

Two strategies: tune only the fully connected layers, or fine-tune all layers except the first two convolutional layers.

4. Class-Agnostic Spatial Reranking (CA-SR)

Spatial reranking without knowledge of the object class.

5. Class-Specific Spatial Reranking (CS-SR)

Class-specific reranking: using a network fine-tuned on the same query instances, the RPN proposal scores can be used directly as the similarity scores to the query object, and these scores are used to sort the image list.

6. Datasets

On the Oxford and Paris datasets, the network outputs 12 possible classes (11 landmarks + background).

On INS 13 there are 30 different query instances, so 31 possible classes are output.

Tuning only the fully connected layers works poorly when the query objects are difficult.

 

Overall network architecture

[Figure: the overall network architecture]
Viewed as a whole, the network is the Faster R-CNN architecture: the upper part is the RPN net of Faster R-CNN, whose output is the RPN proposals; the lower part of the network is RoI pooling plus three fully connected layers, whose output is the class probabilities.

Image-wise pooling of activations (IPA): this step extracts the representation of the whole image. Concretely, take the last convolutional layer, conv5_3 (for VGG16 Net, after the ReLU), and pool over it; the pooling method follows another paper, 《Particular Object Retrieval with Integral Max-Pooling of CNN Activations》. For example, if conv5_3 produces a feature map of dimension K*W*H, where K is the number of convolution kernels and W*H is the map produced by each kernel, then applying max-pooling or sum-pooling to each W*H feature map yields a single value. Pooling the whole K*W*H volume therefore gives a K*1 feature vector.

Region-wise pooling of activations (RPA): this step produces the representation of each region. Given the IPA above, RPA is easy to understand: take the RoI pooling of the region proposals, and apply max-pooling on top of the RoI pooling layer.

    

Fine-tuning Faster R-CNN

Fine-tuning uses two strategies:

Strategy 1: fine-tune the three fully connected layers after RoI pooling.



Strategy 2: fine-tune the network after conv2.



The images used for fine-tuning are the query images plus their horizontally flipped versions (personally I feel this is very few images).

 

 

3. Image Retrieval

It is divided into three steps:

1. Filtering: extract the IPA of the query image and of the database images, then rank the database images by cosine distance. (This whole stage uses only the image-level IPA and is region-independent.) See the sketch after this list.

2. Spatial reranking:

Two methods are used for spatial reranking:

Class-Agnostic Spatial Reranking (CA-SR): assuming the class is unknown, compute the cosine distance between the RPA of each query bounding box and every proposal of each of the top-N images from the filtering step; the highest value is taken as the cosine similarity between the query and that image.

Class-Specific Spatial Reranking (CS-SR): use the network fine-tuned on the same instances as the query, then take the class-probability score after FC-8 as the score between the query and each proposal.



 

3. Query expansion: the simplest query expansion method.
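Putting steps 1 and 2 together, here is a sketch of the two-stage retrieval (CA-SR-style rescoring for step 2; all names are mine, and step 3's query expansion would be applied afterwards):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_and_rerank(q_ipa, db_ipa, q_rpa, db_prop_rpa, N=100):
    """Stage 1: rank all images by cosine similarity of image-level IPA.
    Stage 2: rescore the top-N by the best cosine similarity between the
    query RPA and any proposal RPA of that image, then rerank them."""
    order = sorted(range(len(db_ipa)), key=lambda i: -cosine(q_ipa, db_ipa[i]))
    top, rest = order[:N], order[N:]
    top = sorted(top, key=lambda i: -max(cosine(q_rpa, p) for p in db_prop_rpa[i]))
    return top + rest                       # reranked top-N, tail unchanged
```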