Image retrieval notes: a summary of the current state of image retrieval, compiled for my own reference and for others. The material is collected from the web; in case of infringement, please contact me for removal.

 

End-to-End Feature Learning Methods

NetVLAD: CNN architecture for weakly supervised place recognition (CVPR 2016)

This paper is the work of Relja Arandjelović et al. from INRIA. It focuses on a specific application of instance retrieval: place recognition. Given a query image, a large-scale geotagged image dataset is searched, and the locations of the most similar images are used to estimate the location of the query. The authors first build a large-scale geotagged dataset from Google Street View Time Machine, and then propose a convolutional neural network architecture, NetVLAD, which embeds the VLAD method into the CNN and enables end-to-end learning. The method is illustrated below:

 


The hard-assignment operation of the original VLAD (assigning each local feature to its nearest cluster center) is non-differentiable, so it cannot be embedded directly into a CNN and take part in error back-propagation. The solution of this paper is to use a softmax to convert the hard-assignment into a soft-assignment: a 1x1 convolution followed by a softmax produces the probability/weight with which each local feature belongs to each cluster center, and the feature is then assigned to centers according to those weights. NetVLAD therefore contains three sets of learnable parameters {w_k, b_k, c_k}: the 1x1 convolution weights and biases used to predict the soft-assignment, and the cluster centers. The accumulated-residual operation is carried out in the "VLAD core" layer shown in the figure above. The authors use the figure below to illustrate the advantage of NetVLAD over the original VLAD (greater flexibility: learning better cluster centers):
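To make the mechanics concrete, here is a minimal NumPy sketch of a NetVLAD forward pass. The shapes and variable names (X, C, W, b) are my own; the real layer is of course trained end-to-end inside the CNN.

```python
import numpy as np

def netvlad(X, C, W, b):
    """Sketch of the NetVLAD aggregation layer.

    X: (N, D) local descriptors (one per spatial location of the conv map)
    C: (K, D) learnable cluster centers
    W: (K, D), b: (K,) parameters of the 1x1 conv predicting the soft-assignment
    """
    # soft-assignment: softmax over the K cluster scores for each descriptor
    scores = X @ W.T + b                      # (N, K)
    scores -= scores.max(axis=1, keepdims=True)
    a = np.exp(scores)
    a /= a.sum(axis=1, keepdims=True)         # (N, K) assignment weights

    # accumulate weighted residuals to each center (the "VLAD core")
    resid = X[:, None, :] - C[None, :, :]     # (N, K, D)
    V = (a[..., None] * resid).sum(axis=0)    # (K, D)

    # intra-normalization per cluster, then global L2, as in the paper
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
    v = V.ravel()
    return v / (np.linalg.norm(v) + 1e-12)
```

Because every step above is differentiable, gradients can flow back through the assignment and the residuals to all three parameter sets.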


The other improvement of this paper is the weakly supervised triplet ranking loss. To cope with possible noise in the training data, the positive and negative samples of the standard triplet ranking loss are replaced with a set of potential positives (containing at least one true positive, though which one is unknown) and a set of definite negatives. During training, the loss constrains the feature distance between the query image and the best-matching image in the potential-positive set to be smaller than the distance between the query and every image in the negative set.
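A rough sketch of this loss, assuming squared Euclidean distances and a hypothetical margin value:

```python
import numpy as np

def weak_triplet_loss(q, positives, negatives, margin=0.1):
    """Weakly supervised triplet ranking loss (sketch).

    q: (D,) query feature; positives: (P, D) potential positives
    (at least one true match, which one is unknown); negatives: (M, D).
    The best-matching potential positive stands in for the true positive.
    """
    d_pos = np.min(np.sum((positives - q) ** 2, axis=1))  # closest potential positive
    d_neg = np.sum((negatives - q) ** 2, axis=1)          # distances to all negatives
    # hinge: the query must be closer to the best positive than to every negative
    return float(np.sum(np.maximum(0.0, margin + d_pos - d_neg)))
```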

Deep Relative Distance Learning: Tell the Difference Between Similar Vehicles
(CVPR 2016)

The next paper, by Hongye Liu et al. from Peking University, focuses on vehicle identification/retrieval. As shown in the figure below, this problem can also be regarded as an instance-retrieval task.

 


Like many supervised deep instance-retrieval methods, this paper aims to map raw images into a Euclidean feature space in which images of the same vehicle are close together and images of different vehicles are far apart. The common way to achieve this is to train a CNN by optimizing a triplet ranking loss. However, the authors found that the original triplet ranking loss has problems, as shown in the figure below:



For the same samples, the triplet on the left is adjusted by the loss function while the triplet on the right is ignored. The only difference between the two is the choice of anchor, and this makes training unstable. To overcome the problem, the authors propose the coupled clusters loss (CCL) to replace the triplet ranking loss. This loss replaces the triplet with a positive-sample set and a negative-sample set, pulling the positive samples together while pushing the negatives further away from the positives, thereby avoiding the negative impact of a randomly chosen anchor. The effect of this loss is shown in the figure below:
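One plausible formulation of CCL, sketched in NumPy; the positive-set mean stands in for the anchor, and the margin and distance conventions here follow my reading of the paper rather than its exact notation:

```python
import numpy as np

def coupled_clusters_loss(pos, neg, margin=1.0):
    """Coupled clusters loss (CCL) sketch.

    pos: (P, D) positive set (same vehicle); neg: (M, D) negative set.
    The center of the positive set replaces the randomly chosen anchor.
    """
    c = pos.mean(axis=0)                                # center of positive set
    d_pos = np.sum((pos - c) ** 2, axis=1)              # positives -> center
    d_neg_min = np.min(np.sum((neg - c) ** 2, axis=1))  # nearest negative -> center
    # every positive must be closer to the center than the nearest negative, by a margin
    return float(np.sum(np.maximum(0.0, d_pos + margin - d_neg_min)))
```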


Finally, exploiting the particularities of the vehicle problem and combining them with the coupled clusters loss above, the paper designs a hybrid network architecture and builds a vehicle database that provides the required training samples.

DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich
Annotations (CVPR 2016)

The final paper, also published at CVPR 2016, covers clothing recognition and retrieval, another task related to instance search; it is the work of Ziwei Liu et al. from the Chinese University of Hong Kong. First, the paper introduces a clothing database named DeepFashion, which contains over 800K clothing images annotated with 50 fine-grained categories and 1,000 attributes; it also provides clothing landmarks and cross-pose/cross-domain pair correspondences. Some concrete examples are shown in the figure below:

 


Then, to demonstrate the value of the database, the authors propose a novel deep network, FashionNet, which learns more discriminative features by jointly predicting clothing landmarks and attributes. The overall framework of the network is as follows:



The forward pass of FashionNet has three stages. In the first stage, a clothing image is fed into the blue branch of the network to predict the visibility and locations of the clothing landmarks. In the second stage, based on the landmark locations predicted in the previous step, a landmark pooling layer extracts local features of the clothes. In the third stage, the global features of "fc6 global" and the local features of "fc6 local" are fused into "fc7_fusion", which serves as the final image feature. FashionNet introduces four loss functions and is optimized with an iterative training strategy. The losses are: a regression loss for landmark localization, a softmax loss for landmark visibility and clothing category, a cross-entropy loss for attribute prediction, and a triplet loss for learning clothing similarity. The authors compare FashionNet with other methods on clothing classification, attribute prediction, and clothing retrieval, and it achieves significantly better results on all of them.


Summary: when enough labeled data is available, deep learning can learn image features and the metric function simultaneously. The idea behind this is that, given a metric function, the learned features should be maximally discriminative in that metric space. The main research directions for end-to-end feature learning methods are therefore better feature representations and better loss-function designs.

Feature Encoding Methods Based on CNN Features


The first part of this article introduced data-driven end-to-end feature learning methods and the corresponding image retrieval datasets. Next, we focus on another question: how to extract effective image features when such retrieval datasets are not available. To overcome the shortage of domain data, a feasible strategy is to start from a pre-trained CNN model (a CNN trained on another task's dataset, such as the ImageNet classification dataset), extract the feature map of some layer, and encode it into an image feature suited to instance retrieval. Based on relevant papers from recent years, this part introduces the main methods (note: all CNN models below are pre-trained on the ImageNet classification dataset).

Multi-Scale Orderless Pooling of Deep Convolutional Activation Features (ECCV
2014)

This paper was published at ECCV 2014; it is the work of Yunchao Gong of the University of North Carolina at Chapel Hill and Liwei Wang of the University of Illinois at Urbana-Champaign, among others. Global CNN features lack geometric invariance, which limits classification and matching of variable scenes. The authors attribute this to the global CNN feature containing too much spatial information, and therefore propose multi-scale orderless pooling (MOP-CNN), which combines CNN features with the orderless VLAD encoding.


The main steps of MOP-CNN are: first, use the CNN as a "local feature" extractor and extract the "local features" of the image at several scales; then use VLAD to encode these "local features" into an image feature at each scale; finally, concatenate the features of all scales into the final image feature. The feature-extraction framework is as follows:
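The steps above can be sketched as follows; `cnn_feature` and `vlad_encode` are stand-ins for a real CNN forward pass and a trained VLAD codebook, and the patch sizes are hypothetical:

```python
import numpy as np

def extract_patches(image, size, stride=None):
    """Sliding-window patches of the given side length (half-overlap by default)."""
    stride = stride or size // 2
    H, W = image.shape[:2]
    return [image[y:y + size, x:x + size]
            for y in range(0, H - size + 1, stride)
            for x in range(0, W - size + 1, stride)]

def mop_cnn(image, cnn_feature, vlad_encode, scales=(256, 128, 64)):
    """MOP-CNN sketch: per scale, CNN activations of all patches act as
    "local features"; they are VLAD-encoded per scale, then concatenated."""
    per_scale = []
    for s in scales:
        patches = extract_patches(image, s)               # assumes image >= s
        locals_ = np.stack([cnn_feature(p) for p in patches])
        per_scale.append(vlad_encode(locals_))            # one vector per scale
    return np.concatenate(per_scale)
```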

 

The authors evaluate on two tasks, classification and instance retrieval; as shown in the figure below, MOP-CNN achieves better classification and retrieval results than the plain global CNN feature.


Exploiting Local Features from Deep Networks for Image Retrieval (CVPR 2015
workshop)

This paper was published at the CVPR 2015 workshop; it is the work of Joe Yue-Hei Ng et al. from the University of Maryland, College Park. Many recent studies have shown that, compared with the output of the fully connected layers, the feature maps of the convolutional layers are better suited to instance retrieval. This paper describes how to turn a convolutional feature map into "local features" and encode them into an image feature with VLAD. In addition, it reports a series of experiments on how the feature maps of different convolutional layers affect retrieval accuracy.
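The conversion described here can be sketched in NumPy; this is plain hard-assignment VLAD over the columns of a conv feature map, with hypothetical shapes:

```python
import numpy as np

def conv_map_to_local_features(fmap):
    """Treat each spatial column of a (K, H, W) conv feature map as one
    K-dimensional local descriptor, giving H*W descriptors."""
    K, H, W = fmap.shape
    return fmap.reshape(K, H * W).T                  # (H*W, K)

def vlad_encode(X, C):
    """Classic hard-assignment VLAD over local descriptors X with centers C."""
    assign = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
    V = np.zeros_like(C)
    for k in range(C.shape[0]):
        if np.any(assign == k):
            V[k] = (X[assign == k] - C[k]).sum(axis=0)   # residuals to center k
    v = V.ravel()
    return v / (np.linalg.norm(v) + 1e-12)           # L2-normalized global feature
```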



Aggregating Deep Convolutional Features for Image Retrieval (ICCV 2015)

The next paper was published at ICCV 2015 and is the work of Artem Babenko of the Moscow Institute of Physics and Technology and Victor Lempitsky of the Skolkovo Institute of Science and Technology. As the two papers above show, many deep instance-retrieval methods use orderless encodings, but encodings such as VLAD and Fisher Vector are usually computationally expensive. To overcome this, this paper designs a simpler and more efficient encoding: sum pooling. Sum pooling is defined as follows:

ψ(I) = Σ_{y=1..H} Σ_{x=1..W} f(x, y)
Here f(x, y) is the local convolutional feature at spatial position (x, y) (the local features are extracted in the same way as in the previous paper). After sum pooling, PCA and L2 normalization are applied to the global feature to obtain the final descriptor. The authors compare against Fisher Vector, triangulation embedding, and max pooling, showing that sum pooling is not only computationally simple but also works better.
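A minimal sketch of this sum-pooled descriptor (called SPoC in the paper); the optional PCA matrix is assumed to have been learned elsewhere:

```python
import numpy as np

def spoc_descriptor(fmap, pca=None):
    """Sum-pooling sketch: fmap is the (K, H, W) conv feature map;
    f(x, y) is the K-dim local feature at each spatial position."""
    v = fmap.sum(axis=(1, 2))                # sum over all spatial positions
    if pca is not None:                      # optional learned PCA matrix (K', K)
        v = pca @ v
    return v / (np.linalg.norm(v) + 1e-12)  # final L2 normalization
```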

Source: Deep Learning Lecture, https://zhuanlan.zhihu.com/p/22265265


Where to Focus: Query Adaptive Matching for Instance Retrieval Using
Convolutional Feature Maps (arXiv 1606.6811)

 

Building on 《Particular Object Retrieval with Integral Max-Pooling of CNN Activations》, this paper proposes a new reranking method.

Before describing the method, let us first look at convolutional feature maps.

 

The image above visualizes different convolutional layers. As we can see, early convolutional layers capture basic visual patterns, while late convolutional layers respond more to object outlines.

The reranking pipeline of this paper is as follows.

I. Method overview

1. Generating base regions. There are two approaches:

1.1 Feature Map Pooling (FMP)

For a given convolutional layer with D kernels, D feature maps (FMs) are produced. For each FM, the non-zero responses are taken as one Base Region (BR), so the number of BRs equals the number of FMs. The responses inside each BR are then sum-pooled, so every FM yields one value f_d. For a given image, however, many FMs overlap heavily, so their pooled features f_d are nearly identical; the f_d values are therefore clustered, with the number of cluster centers set to K (i.e., the D BRs are clustered into K BRs).





Illustration of sum-pooling (the response values are added up)
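A sketch of the FMP step up to the per-map values f_d; the clustering of the f_d values into K groups is omitted, and "non-zero responses" is taken to mean positive post-ReLU activations:

```python
import numpy as np

def fmp_base_regions(fmap):
    """FMP sketch: each of the D feature maps yields one base region (its
    non-zero, post-ReLU responses); sum-pooling each gives one value f_d."""
    D = fmap.shape[0]
    return np.array([fmap[d][fmap[d] > 0].sum() for d in range(D)])  # (D,)
```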

1.2 Overlapped Spatial Pyramid Pooling (OSPP)

 

The OSPP method extracts regions in the same way as the R-MAC paper: for each scale, l × (l + m − 1) regions of width 2·min(W, H)/(l + 1) are extracted, and then m regions (the BRs) are sampled uniformly.
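A rough sketch of this region sampling; the overlap ratio and the step computation here are my own simplification of the R-MAC scheme:

```python
import numpy as np

def rmac_regions(W, H, scales=(1, 2)):
    """R-MAC style region sampling (as used by OSPP), simplified.

    At scale l, square regions of side 2*min(W, H)/(l + 1) are placed on a
    uniform grid with roughly 40% overlap between neighbours.
    Returns (x, y, w, h) boxes in feature-map coordinates.
    """
    regions = []
    for l in scales:
        side = 2 * min(W, H) / (l + 1)
        # number of grid steps per axis so the grid covers the whole map
        nx = max(1, int(np.ceil((W - side) / (side * 0.6))) + 1) if W > side else 1
        ny = max(1, int(np.ceil((H - side) / (side * 0.6))) + 1) if H > side else 1
        for y in np.linspace(0, H - side, ny):
            for x in np.linspace(0, W - side, nx):
                regions.append((x, y, side, side))
    return regions
```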

2. Reranking process

The paper proposes Query Adaptive Matching (QAM), a reranking method in which the BRs are merged into one Merged Region, with the selection cast as an optimization problem. Using this procedure, for each image the merged region most similar to the query is selected.

 

Through the optimization above (in the end an ordinary quadratic programming problem), we obtain a Merged Region for each image. The similarity between the query and the Merged Region is then computed as the reranking score, which gives the final ranking.

Here is my understanding of the base-region generation process combined with QAM:

For the FMP method: every feature map yields one Base Region, so with FMP the final number of base regions equals the number of kernels in that layer. For the final base-region representation, the paper does use sum-pooling, so every Base Region ends up as a single value; and during the optimization the Merged Region's representation would also become a single value, which makes an inner product with the query vector impossible. This is what has been puzzling me about the paper; if any reader sees through the problem, please advise (it could also be an error in the paper).

For the OSPP method: since Base Regions are selected at different scales on different FMs, different Base Regions have different vector representations, and QAM can easily be applied to select among them.


Deep Image Retrieval: Learning global representations for image search. In
ECCV, 2016.

Paper: https://arxiv.org/abs/1604.01325

Extended version: End-to-End Learning of Deep Visual Representations for Image Retrieval, arXiv, Oct. 2016. https://arxiv.org/pdf/1610.07940v1.pdf

=====

Let the figure do the talking:



The figure shows the overall framework of the paper:

1. Start from a model pre-trained on ImageNet (e.g., VGG16).

2. From the Landmarks dataset [17], mine a full and a clean dataset (the Full Dataset with category labels, and the Clean Dataset with bounding boxes).

3. Fine-tune on the Full Dataset with an ordinary classification loss, and fine-tune on the Clean Dataset with a triplet loss.

4. Use the trained model to extract features on the benchmark datasets; the similarity measure is the Euclidean distance (dot product).

The paper also uses query expansion to boost performance.

=====

Let's focus on steps 1 and 3 above.

(As for how the dataset is obtained, I did not fully understand it; just know that a clean dataset has to be provided.)

 

1. Pre-trained model and framework

Any of AlexNet, VGGNet, ResNet, etc. can be used here, depending on the performance/speed trade-off you want.

For VGGNet (e.g., VGG16), the fully connected layers are removed and replaced with RPN + RoI pooling + shift + fc + L2, etc.

Why an RPN? It replaces the rigid-grid sampling (only at test time; during fine-tuning the proposals are still the rigid grid; see the references cited in the paper for details).

The extended version of the paper replaces the rigid grid with the RPN entirely, forming an end-to-end framework.

The shift + fc stands in for the PCA whitening of the usual pipeline.

The L2 here, the subsequent summation (the features of all regions are summed to obtain the final global compact image representation), and the final L2 follow the usual pipeline.

(For details, see Particular Object Retrieval with Integral Max-Pooling of CNN Activations. In ICLR, 2016. https://arxiv.org/abs/1511.05879)

 

Because all of the operations above are differentiable, they can be embedded into one model and trained with forward and backward passes, rather than being a hand-crafted pipeline.

 

2. A brief introduction to the MAC feature (the pooling can be sum, max, or something else):

R-MAC: plain MAC is computed over the feature map of the whole image, whereas R-MAC follows the RoI-pooling approach: each bounding box is projected onto the feature map, and pooling is performed only within the projected region.
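A minimal sketch of MAC and R-MAC pooling over a (K, H, W) feature map; the boxes are assumed to be already projected into feature-map coordinates:

```python
import numpy as np

def mac(fmap):
    """MAC: global max-pooling over the whole (K, H, W) feature map."""
    return fmap.max(axis=(1, 2))

def rmac(fmap, boxes):
    """R-MAC sketch: max-pool only inside each box projected onto the
    feature map, L2-normalize per region, then sum and L2-normalize again."""
    acc = np.zeros(fmap.shape[0])
    for (x, y, w, h) in boxes:
        r = fmap[:, int(y):int(y + h), int(x):int(x + w)].max(axis=(1, 2))
        acc += r / (np.linalg.norm(r) + 1e-12)
    return acc / (np.linalg.norm(acc) + 1e-12)
```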

 

3. RPN

Nothing special to say here; see Faster R-CNN. Concretely, the RPN sub-network is attached to the framework above and trained on the Clean Dataset.

 

4. Training

First, regions are generated with the rigid grid and used to train either the siamese triplet loss or plain classification, on the Clean Dataset and Full Dataset respectively.

Then the RPN is trained on the Clean Dataset.

Finally, at test time, the proposals generated by the RPN replace the rigid grid.

(The paper notes that initializing the triplet-loss model from the classification model trained on the Full Dataset works even better.)

For the concrete network parameters, see the paper.



 

5. Data generation



[CVPR 2016] Faster R-CNN Features for Instance Search: notes


Code and video: http://imatge-upc.github.io/retrieval-2016-deepvision/

My slides: http://download.csdn.net/detail/dengbingfeng/9524748

This write-up assumes basic familiarity with Faster R-CNN.

 

 

Basic pipeline:

Using an off-the-shelf Faster R-CNN detector, a single forward pass extracts both the whole-image convolutional features and the per-region convolutional features, sharing computation.

The query object's position in the query image is given by a bounding box. Faster R-CNN extracts the conv5_3 features of every image in the dataset, and their cosine similarity to the query image's conv5_3 features gives the first ranking of the whole dataset: the more similar to the query, the higher the rank.

On top of the first ranking, for the top-N images, the pool5 features of all detection boxes produced by Faster R-CNN are compared (cosine similarity) against the pool5 features of the query object; again, more similar means ranked higher. This completes the reranking, i.e., the second ranking.

Finally, the top-10 results are displayed.



Details:

1. Image-wise pooling of activations (IPA)

The activations of the last convolutional layer are pooled to build a descriptor for the whole image.

2. Region-wise pooling of activations (RPA)

For the sum-pooled convolutional features of the RPN proposals, first L2-normalize, then whiten, then L2-normalize once more; max-pooled features are L2-normalized only once.

3. Fine-tuning Faster R-CNN

Two variants: fine-tune only the fully connected layers, or fine-tune all layers except the first two convolutional layers.

4. Class-Agnostic Spatial Reranking (CA-SR)

Spatial reranking without class knowledge.

5. Class-Specific Spatial Reranking (CS-SR)

Class-specific reranking: using a network fine-tuned on the same query instances, the RPN proposal scores can be used directly as similarity scores to the query object, and these scores are used to rerank the image list.

6. Datasets

On Oxford and Paris, the network outputs 12 class probabilities (11 buildings + background). On INS13, with 30 different query instances, it outputs 31 class probabilities. Fine-tuning only the fully connected layers performs poorly when the query object is difficult.

 

Overall network architecture

 



Overall, the network is the Faster R-CNN architecture: the upper part is the RPN subnet, whose output is the RPN proposals; the lower part is RoI pooling followed by three fully connected layers, outputting class probabilities.

Image-wise pooling of activations (IPA): this step extracts the image representation. The feature map is taken from the last convolutional layer, conv5_3 (for VGG16, after the ReLU), and then pooled; the pooling follows 《Particular Object Retrieval with Integral Max-Pooling of CNN Activations》. For example: if conv5_3 outputs a feature map of size K*W*H, where K is the number of kernels and W*H is the map produced by each kernel, then max-pooling or sum-pooling each W*H map yields one value, so pooling the whole K*W*H tensor gives a K*1 vector.

Region-wise pooling of activations (RPA): this step obtains the region representation. Given the IPA above, RPA is easy to understand: take the RoI pooling of the region proposals, and max-pool on top of the RoI-pooled features.

    

Fine-tuning Faster R-CNN

Two fine-tuning strategies are used:

Strategy 1: fine-tune the three fully connected layers after RoI pooling.

Strategy 2: fine-tune the network after conv2.

Fine-tuning uses the query images plus their horizontally flipped versions (which personally feels like very few images).

 

 

3. Image Retrieval

The retrieval has three steps:

1. Filtering: extract the IPA of the query image and of every database image, then rank the database images by cosine distance. (This whole stage uses only the image-level IPA; regions are not involved.)

2. Spatial reranking, with two methods:

Class-Agnostic Spatial Reranking (CA-SR): assuming the class is unknown, compute the cosine distance between the RPA of the query bounding box and the RPA of every proposal in the top-N images from the filtering step; the maximum is taken as the query-to-image distance.

Class-Specific Spatial Reranking (CS-SR): fine-tune the whole network on the same instances as the query, then use the class-probability score after FC-8 as the query-to-proposal score.



 

3. Query expansion: the simplest form of query expansion is used.
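The filtering plus CA-SR reranking described above can be sketched as follows; the inputs are hypothetical precomputed descriptors (image-level IPA vectors and per-proposal RPA vectors):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def retrieve(query_ipa, db_ipas, query_rpa, db_proposal_rpas, top_n=100):
    """Sketch of the filtering + CA-SR reranking pipeline.

    db_ipas: one image-level descriptor per database image.
    db_proposal_rpas: for each image, an array of per-proposal descriptors.
    Returns database indices, best match first.
    """
    # stage 1: filter the whole database by image-level (IPA) similarity
    first = np.argsort([-cosine(query_ipa, f) for f in db_ipas])[:top_n]
    # stage 2: rerank top-N by the best-matching proposal (RPA) per image
    scores = [max(cosine(query_rpa, p) for p in db_proposal_rpas[i]) for i in first]
    return first[np.argsort(scores)[::-1]]
```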