A summary of the current state of image retrieval, for my own and others' reference. The material was collected from around the web; in case of infringement, please contact me to have it removed.


End-to-end feature learning methods

NetVLAD: CNN architecture for weakly supervised place recognition (CVPR 2016)

This paper is by Relja Arandjelović et al. from INRIA. It focuses on a specific application of instance retrieval: place recognition. In place recognition, given a query image, we query a large-scale geotagged image dataset and use the locations of the most similar images to estimate where the query was taken. The authors first build a large-scale geotagged dataset from Google Street View Time Machine, and then propose a convolutional neural network architecture, NetVLAD, which embeds the VLAD method into a CNN and enables end-to-end learning. The method is illustrated in the figure below:


The hard-assignment operation in the original VLAD (assigning each local feature to its nearest center) is non-differentiable, so it cannot be embedded directly into a CNN and trained by back-propagation. The solution in this paper is to replace the hard assignment with a softmax-based soft assignment: a 1x1 convolution followed by a softmax produces the probability/weight with which each local feature belongs to each center, and the feature contributes to the centers according to those weights. NetVLAD therefore contains three sets of learnable parameters: the 1x1 convolution weights and biases that predict the soft assignment, and the cluster centers themselves. The accumulation of residuals is carried out in the "VLAD core" layer shown in the figure above. With the figure below, the authors explain the advantage of NetVLAD over the original VLAD: greater flexibility, i.e., the cluster centers are learned for the task.
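To make the soft assignment concrete, here is a minimal NumPy sketch of the NetVLAD aggregation. The matrix `W` and bias `b` play the role of the 1x1 convolution; the shapes and the intra-normalization step are assumptions based on the paper's description, not the authors' code:

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def netvlad(features, W, b, centers):
    """features: (L, D) local conv features; W: (K, D) and b: (K,) play the role
    of the 1x1 convolution predicting soft assignments; centers: (K, D) are the
    learnable cluster centers."""
    a = softmax(features @ W.T + b, axis=1)                      # (L, K) soft-assignment weights
    # accumulate soft-weighted residuals to each center:
    # sum_l a[l,k] * (x_l - c_k) = A^T X - (sum_l a[l,k]) * c_k
    vlad = a.T @ features - a.sum(axis=0)[:, None] * centers     # (K, D)
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12  # intra-normalization
    v = vlad.ravel()
    return v / (np.linalg.norm(v) + 1e-12)                       # final L2-normalized descriptor
```

In a real implementation the same computation is expressed with differentiable framework ops so that `W`, `b`, and `centers` are trained by back-propagation.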




The other contribution of this paper is the weakly supervised triplet ranking loss. To cope with potentially noisy training data, the positive and negative samples of the standard triplet ranking loss are replaced with a set of potential positives (containing at least one true positive, though it is unknown which) and a set of definite negatives. During training, the loss constrains the feature distance between the query image and the most likely positive (the closest image in the potential-positive set) to be smaller than the distance between the query image and every image in the negative set.
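A sketch of this weakly supervised loss, assuming descriptors are already extracted; the hinge over squared distances follows the description above, and the function name is my own:

```python
import numpy as np

def weak_triplet_loss(q, positives, negatives, margin=0.1):
    """q: (D,) query descriptor; positives: (P, D) potential positives (at least
    one true match, unknown which); negatives: (N, D) definite negatives.
    Only the best-matching potential positive enters the hinge."""
    d_pos = ((positives - q) ** 2).sum(axis=1)   # squared distances to potential positives
    d_neg = ((negatives - q) ** 2).sum(axis=1)   # squared distances to definite negatives
    best_pos = d_pos.min()                       # the most likely true positive
    return np.maximum(0.0, best_pos + margin - d_neg).sum()
```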

Deep Relative Distance Learning: Tell the Difference Between Similar Vehicles (CVPR 2016)

The next paper, by Hongye Liu et al. from Peking University, addresses vehicle re-identification/retrieval. As shown in the figure below, this problem can also be viewed as an instance retrieval task.


Like many supervised deep instance retrieval methods, this paper aims to map raw images into a Euclidean feature space in which images of the same vehicle lie close together and images of different vehicles lie far apart. The usual way to achieve this is to train the CNN by optimizing a triplet ranking loss. However, the authors found that the original triplet ranking loss has a problem, illustrated in the figure below:





For the same samples, the triplet on the left is adjusted by the loss function while the triplet on the right is ignored; the only difference between the two is the choice of anchor, which makes training unstable. To overcome this, the authors propose the coupled clusters loss (CCL) to replace the triplet ranking loss. This loss replaces triplets with a positive-sample set and a negative-sample set, pulls the positive samples toward each other, and pushes the negative samples farther away from the positives, thereby avoiding the negative impact of randomly chosen anchors. The effect of this loss function is shown in the figure below:
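A simplified NumPy sketch of the coupled clusters idea (this is my reading of the loss, not the paper's exact formulation; the margin value and the use of the single nearest negative are assumptions):

```python
import numpy as np

def coupled_clusters_loss(positives, negatives, margin=1.0):
    """Pull every positive toward the positive-set center, and require the
    nearest negative to stay at least `margin` farther (in squared distance)
    from that center than each positive."""
    center = positives.mean(axis=0)                            # center of the positive set
    d_pos = ((positives - center) ** 2).sum(axis=1)            # positives -> center
    d_neg_min = ((negatives - center) ** 2).sum(axis=1).min()  # closest negative -> center
    return np.maximum(0.0, d_pos + margin - d_neg_min).sum()
```

Note there is no anchor sample at all: the "anchor" is the mean of the positive set, which is what removes the instability described above.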




Finally, addressing the particularities of the vehicle problem and building on the coupled clusters loss above, the paper designs a hybrid network architecture and constructs a vehicle database to provide the required training samples.

DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations (CVPR 2016)

The final paper, also published at CVPR 2016, addresses clothing recognition and retrieval, another task closely related to instance search; it is by Ziwei Liu et al. from the Chinese University of Hong Kong. The paper first introduces a clothing database called DeepFashion. It contains over 800K clothing images annotated with 50 fine-grained categories and 1000 attributes, and also provides clothing landmarks and cross-pose/cross-domain pair correspondences. Some examples are shown in the figure below:


To demonstrate the value of the database, the authors then propose a novel deep network, FashionNet, which learns more discriminative features by jointly predicting clothing landmarks and attributes. The overall framework of the network is as follows:




The forward pass of FashionNet has three stages. In the first stage, a clothing image is fed into the blue branch of the network to predict the visibility and location of clothing landmarks. In the second stage, based on the landmark locations predicted in the previous step, a landmark pooling layer extracts local features of the clothes. In the third stage, the global features from "fc6 global" and the local features from "fc6 local" are fused in "fc7_fusion" to form the final image feature. FashionNet introduces four loss functions and is optimized with an iterative training strategy: a regression loss for landmark localization, softmax losses for landmark visibility and clothing category, a cross-entropy loss for attribute prediction, and a triplet loss for learning similarity between clothes. The authors compare FashionNet against other methods on clothing classification, attribute prediction, and clothing retrieval, achieving significantly better results on all three.
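The four losses can be combined as a weighted sum; below is an illustrative NumPy sketch (the weights, margin, and function signature are hypothetical, and real training would of course compute gradients through a deep learning framework):

```python
import numpy as np

def fashionnet_total_loss(lm_pred, lm_gt, vis_logits, vis_gt,
                          attr_logits, attr_gt, d_ap, d_an,
                          weights=(1.0, 1.0, 1.0, 1.0), margin=0.2):
    """Weighted sum of four losses: landmark regression (L2), landmark-visibility
    softmax cross-entropy, multi-label attribute cross-entropy, and a triplet
    loss on precomputed anchor-positive/anchor-negative distances."""
    reg = ((lm_pred - lm_gt) ** 2).mean()
    # softmax cross-entropy for visibility
    z = vis_logits - vis_logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce_vis = -logp[np.arange(len(vis_gt)), vis_gt].mean()
    # sigmoid cross-entropy for attributes (multi-label)
    p = 1.0 / (1.0 + np.exp(-attr_logits))
    ce_attr = -(attr_gt * np.log(p + 1e-12)
                + (1 - attr_gt) * np.log(1 - p + 1e-12)).mean()
    triplet = np.maximum(0.0, d_ap + margin - d_an).mean()
    w = weights
    return w[0] * reg + w[1] * ce_vis + w[2] * ce_attr + w[3] * triplet
```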

Summary: given enough labeled data, deep learning can learn image features and the metric function at the same time. The idea behind this is that, for a given metric function, the features are learned so as to be maximally discriminative in that metric space. The main research directions for end-to-end feature learning methods are therefore how to construct better feature representations and better loss functions.

Feature encoding methods based on CNN features

The deep instance retrieval algorithms introduced above focus on data-driven end-to-end feature learning and the corresponding image retrieval datasets. We now turn to another problem: how to extract effective image features when such retrieval datasets are unavailable. To overcome the shortage of in-domain data, a feasible strategy is to start from a pre-trained CNN model (a CNN trained on another task's dataset, such as ImageNet classification), extract the feature map of one of its layers, and encode it into an image feature suited to the instance retrieval task. Drawing on relevant papers from recent years, this part introduces the main methods (in particular, all the CNN models below are pre-trained on the ImageNet classification dataset).

Multi-Scale Orderless Pooling of Deep Convolutional Activation Features (ECCV 2014)

This paper was published at ECCV 2014 and is by Yunchao Gong from the University of North Carolina at Chapel Hill and Liwei Wang et al. from the University of Illinois at Urbana-Champaign. Because global CNN features lack geometric invariance, they are limited for classification and matching of variable scenes. The authors attribute this to global CNN features containing too much spatial information, and therefore propose multi-scale orderless pooling (MOP-CNN), which combines CNN features with the orderless VLAD encoding method.

The main steps of MOP-CNN are: first use the CNN as a "local feature" extractor to extract "local features" from image patches at several scales; then use VLAD to encode these "local features" into an image feature at each scale; finally, concatenate the image features of all scales to form the final image feature. The feature extraction framework is shown below:


The authors evaluate on two tasks, classification and instance retrieval. As shown in the figure below, MOP-CNN achieves better classification and retrieval results than generic global CNN features.
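The MOP-CNN pipeline described above can be sketched as follows in NumPy. Here `cnn_feature` is a hypothetical stand-in for the pre-trained CNN's patch descriptor, and the hard-assignment VLAD ignores details such as normalization variants:

```python
import numpy as np

def vlad_encode(local_feats, centers):
    """Hard-assignment VLAD: assign each local feature to its nearest center
    and accumulate residuals."""
    d2 = ((local_feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (L, K)
    assign = d2.argmin(axis=1)
    vlad = np.zeros_like(centers)
    for k in range(len(centers)):
        sel = local_feats[assign == k]
        if len(sel):
            vlad[k] = (sel - centers[k]).sum(axis=0)
    v = vlad.ravel()
    return v / (np.linalg.norm(v) + 1e-12)

def mop_cnn(image, cnn_feature, centers, scales=(256, 128, 64), stride=32):
    """Sketch of MOP-CNN: at each scale, treat CNN features of sliding patches
    as 'local features', VLAD-encode them, then concatenate across scales."""
    per_scale = []
    for s in scales:
        H, W = image.shape[:2]
        patches = [image[y:y + s, x:x + s]
                   for y in range(0, H - s + 1, stride)
                   for x in range(0, W - s + 1, stride)]
        feats = np.stack([cnn_feature(p) for p in patches])
        per_scale.append(vlad_encode(feats, centers))
    return np.concatenate(per_scale)
```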




Exploiting Local Features from Deep Networks for Image Retrieval (CVPR 2015 Workshops)

This paper was published at a CVPR 2015 workshop and is by Joe Yue-Hei Ng et al. from the University of Maryland, College Park. Many recent studies have shown that, compared with the output of the fully connected layers, the feature maps of the convolutional layers are better suited to instance retrieval. This paper shows how to turn convolutional feature maps into "local features" and encode them into an image feature with VLAD. In addition, it carries out a series of experiments to observe how feature maps from different convolutional layers affect retrieval accuracy.

Aggregating Deep Convolutional Features for Image Retrieval (ICCV 2015)

The next paper was published at ICCV 2015 and is by Artem Babenko from the Moscow Institute of Physics and Technology and Victor Lempitsky from the Skolkovo Institute of Science and Technology. As the two papers above show, many deep instance retrieval methods use orderless encodings, but encodings such as VLAD and Fisher Vector are computationally expensive. To overcome this, this paper designs a simpler and more efficient encoding, sum pooling, defined as summing the local convolutional features f_(x,y) over all spatial positions (x, y):

Here f_(x,y) is the local feature of the convolutional layer at spatial position (x, y) (local features are extracted in the same way as in the previous paper). After sum pooling, PCA and L2 normalization are further applied to the global feature to obtain the final feature. The authors compare against Fisher Vector, triangulation embedding, and max pooling, showing that sum pooling is not only computationally simple but also more effective.
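A minimal sketch of this sum-pooling (SPoC-style) descriptor; the PCA projection matrix is assumed to have been learned elsewhere:

```python
import numpy as np

def spoc_descriptor(feature_map, pca_components=None):
    """Sum-pooling over a conv feature map of shape (H, W, D): sum local
    features over all spatial positions, L2-normalize, optionally project
    with a pre-learned PCA matrix of shape (d, D), and L2-normalize again."""
    f = feature_map.reshape(-1, feature_map.shape[-1]).sum(axis=0)   # (D,)
    f = f / (np.linalg.norm(f) + 1e-12)
    if pca_components is not None:
        f = pca_components @ f
        f = f / (np.linalg.norm(f) + 1e-12)
    return f
```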

(The above is from a deep learning lecture.)



Where to Focus: Query Adaptive Matching for Instance Retrieval Using
Convolutional Feature Maps (arXiv 1606.6811)


Building on the paper "Particular Object Retrieval with Integral Max-Pooling of CNN Activations", this paper proposes a new reranking method.

Before describing the method, let us first look at convolutional feature maps.


The image above visualizes different convolutional layers. We can see that early convolutional layers capture the dominant visual patterns, while late convolutional layers represent more of the object's outline.

The reranking procedure proposed in this paper:

I. Method overview

1. Generating base regions; there are two ways:

1.1 Feature Map Pooling (FMP)

For a given convolutional layer with D kernels, the network produces D feature maps (FMs). For each FM, the non-zero responses are selected as one base region (BR), so the number of BRs equals the number of FMs. The response values within each BR are then sum-pooled, so each FM yields a single value f_d. For a given image, however, many FMs overlap heavily, so their pooled features f_d are nearly identical; the f_d values are therefore clustered, with the number of cluster centers set to K (in other words, the D BRs are merged into K BRs).
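A rough NumPy sketch of the FMP step as I understand it (the tiny 1-D k-means here is my own stand-in for whatever clustering the paper actually uses):

```python
import numpy as np

def fmp_pooled_values(feature_maps):
    """Feature Map Pooling: for each of the D feature maps (H, W, D), the
    non-zero responses form one base region; sum-pool them to one value f_d."""
    D = feature_maps.shape[-1]
    return np.array([feature_maps[..., d][feature_maps[..., d] > 0].sum()
                     for d in range(D)])

def cluster_values(fd, K, iters=20, seed=0):
    """Tiny 1-D k-means merging the D pooled values into K clusters, since
    many feature maps overlap and yield nearly identical f_d."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(fd, size=K, replace=False)
    for _ in range(iters):
        assign = np.abs(fd[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(K):
            if (assign == k).any():
                centers[k] = fd[assign == k].mean()
    return centers, assign
```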

(Figure: sum-pooling, i.e., adding up the response values)

1.2 Overlapped Spatial Pyramid Pooling (OSPP)


The OSPP method extracts regions in the same way as the R-MAC paper: for each scale l, we extract l × (l + m − 1) regions whose width is 2·min(W, H)/(l + 1), sampled uniformly over the image (these become the BRs).
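A hedged sketch of this kind of multi-scale region sampling; note the overlap rule below is a simplification of the R-MAC paper's exact formula:

```python
import numpy as np

def rmac_regions(W, H, scales=(1, 2, 3)):
    """R-MAC-style region sampling: at scale l the square regions have width
    2 * min(W, H) / (l + 1) and are placed on a uniform grid with roughly
    50% overlap between consecutive regions (a simplification)."""
    regions = []
    for l in scales:
        w = int(2 * min(W, H) / (l + 1))
        # choose enough grid steps so that regions cover the image with overlap
        nx = max(1, int(np.ceil(2 * W / w)) - 1)
        ny = max(1, int(np.ceil(2 * H / w)) - 1)
        xs = np.linspace(0, W - w, nx).astype(int)
        ys = np.linspace(0, H - w, ny).astype(int)
        regions += [(x, y, w, w) for y in ys for x in xs]
    return regions
```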

2. Reranking process

The paper proposes Query Adaptive Matching (QAM) as the reranking method: the BRs are merged into a single merged region, and the choice of which BRs to merge is cast as an optimization problem. Through this process, for each database image, the merged region most similar to the query is selected.


Through the optimization above (in the end an ordinary quadratic programming problem), we obtain a merged region for each image. The similarity between the query and the merged region is then computed and used as the reranking score, which determines the final ranking.

Here is my understanding of the base-region generation process combined with QAM:

For the FMP method: each feature map yields one base region, so with FMP the final number of base regions equals the number of convolution kernels in that layer. For the final base-region representation, the paper does use sum-pooling, so each base region reduces to a single value; during optimization, the merged-region representation then also becomes a single value, which cannot form an inner product with the query vector. This is what puzzles me about this paper; if anyone understands it, please let me know. (It could also be an error in the paper.)

For the OSPP method: because the base regions are selected at different scales on different feature maps, different base regions have different vector representations, so QAM can easily be applied to select among them.

Deep Image Retrieval: Learning Global Representations for Image Search (ECCV 2016)

Paper address: <>

Extended version: End-to-end Learning of Deep Visual Representations for Image Retrieval, arXiv, October 2016. <>


Walking through the figure:

The overall framework of the paper can be seen from the figure:

1. Start from a model pre-trained on ImageNet (e.g., VGG16).

2. From the Landmarks dataset [17], mine a "full" and a "clean" dataset (the Full Dataset with category labels, and the Clean Dataset with bounding boxes).

3. Fine-tune on the Full Dataset, where the loss is an ordinary classification loss; fine-tune on the Clean Dataset, where the loss is a triplet loss.

4. Use the trained model for feature extraction on the benchmark datasets; the similarity measure is Euclidean distance (equivalently, dot product on the L2-normalized features).
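On the equivalence used in step 4: for L2-normalized features, squared Euclidean distance is 2 − 2·(dot product), so ranking by either gives the same order. A quick NumPy check:

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((5, 8))
db /= np.linalg.norm(db, axis=1, keepdims=True)   # L2-normalized database features
q = rng.standard_normal(8)
q /= np.linalg.norm(q)                            # L2-normalized query

euclid_rank = np.argsort(((db - q) ** 2).sum(axis=1))   # ascending distance
dot_rank = np.argsort(-(db @ q))                        # descending similarity
# for unit vectors, ||q - x||^2 = 2 - 2 * (q . x), so the orderings coincide
assert np.array_equal(euclid_rank, dot_rank)
```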

The paper also uses query expansion to boost performance.
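Query expansion can be as simple as averaging the query with its top results and querying again; a minimal sketch of average query expansion (the function name is my own):

```python
import numpy as np

def average_query_expansion(q, db, top_k=3):
    """Retrieve with q, average the query descriptor with the descriptors of
    its top_k results, re-normalize, and use the result as the new query."""
    sims = db @ q                        # dot-product similarities
    top = np.argsort(-sims)[:top_k]      # indices of the top_k results
    q_exp = q + db[top].sum(axis=0)
    return q_exp / (np.linalg.norm(q_exp) + 1e-12)
```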


Let us focus on points 1 and 3 above.

(I did not fully understand how the dataset is obtained; just know that a clean dataset needs to be provided.)


1. Pre-trained model and framework

Here AlexNet, VGGNet, ResNet, etc. can all be used, depending on the trade-off you want between performance and speed.

For VGGNet (e.g., VGG16), the fully connected layers are removed and replaced with RPN + RoI pooling + shift + fc + L2 normalization, etc.

Why use an RPN? Here it replaces the rigid-grid approach (only at test time; during fine-tuning the proposals are still the rigid grid).
See also the extended version of the paper, which replaces the rigid grid with the RPN entirely, forming an end-to-end framework.

The shift + fc layers take the place of the PCA whitening step in the usual pipeline.

The L2 normalization here and the subsequent summation (summing the corresponding features of all regions to obtain the final global, compact image representation) follow R-MAC (Particular Object Retrieval with Integral Max-Pooling of CNN Activations, ICLR 2016 <>).




2. A brief introduction to the MAC feature (the pooling can be sum, max, or something else):

R-MAC: ordinary MAC is computed on the feature map of the whole image, whereas R-MAC follows the RoI-pooling idea: project each bounding box onto the feature map, and pool only within the projected region of the feature map.
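A NumPy sketch contrasting MAC and a simplified R-MAC pooling on a feature map; the per-region L2 normalization and final aggregation follow the common R-MAC recipe, with the PCA-whitening step omitted:

```python
import numpy as np

def mac(feature_map):
    """MAC: channel-wise max over the whole (H, W, D) feature map."""
    v = feature_map.max(axis=(0, 1))
    return v / (np.linalg.norm(v) + 1e-12)

def rmac(feature_map, regions):
    """R-MAC sketch: max-pool each region (x, y, w, h) of the feature map
    (e.g., projected bounding boxes), L2-normalize per region, then sum the
    region vectors and re-normalize."""
    acc = np.zeros(feature_map.shape[-1])
    for x, y, w, h in regions:
        v = feature_map[y:y + h, x:x + w].max(axis=(0, 1))
        acc += v / (np.linalg.norm(v) + 1e-12)
    return acc / (np.linalg.norm(acc) + 1e-12)
```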



There is nothing much to say about the RPN itself; please refer to Faster R-CNN.

Specifically, the RPN sub-network is attached to the framework above, using the Clean Dataset as training data.


4. Training

First, generate regions with the rigid grid and use them to train the siamese triplet loss or the simple classification, on the Clean Dataset and the Full Dataset respectively.

Then train the RPN with the Clean Dataset.

Finally, at test time, the proposals produced by the RPN replace the rigid grid.

(The paper notes, however, that initializing the triplet-loss model from the classification model trained on the Full Dataset gives better results.)



5. Data generation




Notes on "Faster R-CNN Features for Instance Search" (CVPR 2016)













After these operations, the top-10 results are displayed.


1. Image-wise pooling of activations (IPA)


2. Region-wise pooling of activations (RPA)




4. Class-Agnostic Spatial Reranking (CA-SR)


5. Class-Specific Spatial Reranking (CS-SR)

Class-specific reranking: with the network fine-tuned on the same query instances, the scores of the RPN proposals can be used directly as similarity scores with respect to the query object.




In INS 13 there are 30 different query instances, so the output has 31 possible classes.





Viewed as a whole, the network is the Faster R-CNN architecture: the upper part is the Faster R-CNN RPN net, whose output is the region proposals; the lower part is RoI pooling followed by three fully connected layers, whose output is the class probabilities.

Image-wise pooling of activations (IPA): take the feature maps of the last convolutional layer (after the ReLU), then pool them; the specific pooling method follows another paper, "Particular Object Retrieval with Integral Max-Pooling of CNN Activations". For example, if the final conv5_3 layer produces one feature map per convolution kernel, then for each W*H feature map, max-pooling or sum-pooling yields one value, giving one descriptor dimension per kernel.


Region-wise pooling of activations (RPA): take the RoI-pooled features of the region proposals, i.e., apply max-pooling on top of the RoI pooling layer for each proposal.


Fine-tuning Faster R-CNN

Two fine-tuning strategies are used:

Strategy 1: fine-tune the three fully connected layers after RoI pooling.

Strategy 2: fine-tune the network after conv2.

The images used for fine-tuning are the query images and their horizontally flipped versions (personally, this seems like very few images).



3. Image Retrieval





Class-Agnostic Spatial Reranking (CA-SR): assuming the class is unknown, compute for each query bounding box the similarity between its feature and the pooled feature of each proposal.

Class-Specific Spatial Reranking (CS-SR): using the whole network after fine-tuning on the same instances as the query, the class score (the class probability after FC-8) is used directly as the score between the query and each proposal.