Introduction to faiss, with examples

<>Introduction

faiss is a library for efficient similarity search and clustering of dense vectors, developed by Facebook AI Research
<https://research.fb.com/category/facebook-ai-research-fair/>. Its main features are listed below.

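* 1. Several search (index) methods to choose from
* 2. Fast search
* 3. Indexes can be kept in memory or on disk
* 4. Implemented in C++, with Python wrappers
* 5. GPU implementations for most algorithms

Some quick links for further reading: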

GitHub <https://github.com/facebookresearch/faiss>
Official documentation (wiki) <https://github.com/facebookresearch/faiss/wiki>
C++ class reference
<https://rawgit.com/facebookresearch/faiss/master/docs/html/annotated.html>
Troubleshooting
<https://github.com/facebookresearch/faiss/wiki/Troubleshooting>
Official installation guide <https://github.com/facebookresearch/faiss/blob/master/INSTALL.md>

<>Installation

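The documentation describes several ways to install faiss, including building from source and conda. Building from source on our company servers requires extra permissions, so we usually install the Python module with conda.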
    # update conda
    conda update conda
    # install mkl first
    conda install mkl
    # faiss ships CPU and GPU builds; pick the one that matches your server
    conda install faiss-cpu -c pytorch   # CPU
    conda install faiss-gpu -c pytorch   # GPU
    # check that the installation works
    python -c "import faiss"
<>Quick Start

We start with the official demo to get a feel for how faiss is used.

First, build the database (training) vectors and the query vectors.
    import numpy as np

    d = 64                           # dimension
    nb = 100000                      # database size
    nq = 10000                       # nb of queries
    np.random.seed(1234)             # make reproducible
    xb = np.random.random((nb, d)).astype('float32')
    xb[:, 0] += np.arange(nb) / 1000.
    xq = np.random.random((nq, d)).astype('float32')
    xq[:, 0] += np.arange(nq) / 1000.
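This gives us database vectors xb of shape [100000, 64] and query vectors xq of shape [10000, 64].
Next we create an index. A faiss index preprocesses the vectors so that queries run efficiently.
faiss offers many index types; here we pick the simplest one, IndexFlatL2, which does a brute-force search on L2 distance.
Every index needs the vector dimension d when it is created. Most indexes also need a training step; IndexFlatL2 does not.
Once the index is created (and trained, if required), we can call add and search. add stores vectors in the index (typically the same samples used for training), and search finds the vectors most similar to the queries.

Some indexes can store an integer ID for each vector; a search then returns the IDs of the similar vectors together with their similarity (or distance). If no IDs are given, vectors are numbered in insertion order starting from 0.
IndexFlatL2 does not support custom IDs.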
    import faiss                     # make faiss available

    index = faiss.IndexFlatL2(d)     # build the index
    print(index.is_trained)
    index.add(xb)                    # add vectors to the index
    print(index.ntotal)
With the vectors in the index, we can pass in query vectors to find their nearest neighbors.
    k = 4                            # we want to see 4 nearest neighbors
    D, I = index.search(xq, k)       # actual search
    print(I[:5])                     # neighbors of the 5 first queries
    print(D[-5:])                    # distances for the 5 last queries
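In the code above we ask for the 4 nearest vectors to each query. search returns two numpy arrays, D and I, both of shape [nq, k]: D holds the distances to the neighbors, and I holds the neighbors' IDs.

The output below is what a sanity-check search of the first five database vectors, index.search(xb[:5], k), produces: each vector's nearest neighbor is itself, at distance 0.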
    [[  0 393 363  78]
     [  1 555 277 364]
     [  2 304 101  13]
     [  3 173  18 182]
     [  4 288 370 531]]
    [[ 0.          7.17517328  7.2076292   7.25116253]
     [ 0.          6.32356453  6.6845808   6.79994535]
     [ 0.          5.79640865  6.39173603  7.28151226]
     [ 0.          7.27790546  7.52798653  7.66284657]
     [ 0.          6.76380348  7.29512024  7.36881447]]
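
To make the meaning of D and I concrete, here is a minimal NumPy-only sketch of an exhaustive L2 search. It is not faiss code: the function name brute_force_l2_search and the small data sizes are assumptions chosen for illustration. Like IndexFlatL2, it reports squared L2 distances.

```python
import numpy as np

def brute_force_l2_search(xb, xq, k):
    """Return (D, I): squared L2 distances and row indices of the k nearest rows of xb."""
    # Pairwise squared distances, shape (nq, nb). Simple but memory-hungry:
    # the intermediate array has nq * nb * d entries.
    sq_dist = ((xq[:, None, :] - xb[None, :, :]) ** 2).sum(axis=2)
    I = np.argsort(sq_dist, axis=1)[:, :k]        # k smallest distances per query
    D = np.take_along_axis(sq_dist, I, axis=1)    # the matching distances
    return D, I

rng = np.random.default_rng(1234)
xb = rng.random((1000, 64))        # illustrative database, much smaller than the demo's
xq = xb[:5].copy()                 # query with the first five database vectors

D, I = brute_force_l2_search(xb, xq, k=4)
print(I)   # first column is 0..4: every vector is its own nearest neighbor
print(D)   # first column is 0: distance of a vector to itself
```

Each row of I lists neighbor positions in order of increasing distance, and the same row of D gives the corresponding distances, which is the layout index.search returns.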
<>Speeding up search

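With a very large number of vectors, brute-force search with IndexFlatL2 becomes slow. IndexIVFFlat is one index that speeds it up. IVF stands for inverted file: k-means is used to build cluster centroids, a query first finds its nearest centroids, and it is then compared against all vectors in those clusters to find the most similar ones.

Creating an IndexIVFFlat requires another index to serve as the quantizer, which computes the distance or similarity between vectors and centroids.

Unlike IndexFlatL2, this index must be trained before add is called.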

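The parameters used in the example are described briefly below.

faiss.METRIC_L2: the similarity metric. The two metrics covered here are faiss.METRIC_L2 (Euclidean distance) and
faiss.METRIC_INNER_PRODUCT (inner product).

nlist: the number of cluster centroids

k: the number of most similar vectors to return

index.nprobe: the number of clusters visited per query; the default is 1.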
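
The two metrics are closely related when vectors have unit length: in that case the squared L2 distance equals 2 minus twice the inner product, so the smallest L2 distance and the largest inner product pick the same neighbor. The NumPy sketch below checks this on random data; the sizes and variable names are illustrative assumptions, and no faiss call is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
xb = rng.standard_normal((200, 16))
xq = rng.standard_normal((3, 16))

# Normalize every vector to unit L2 norm
xb /= np.linalg.norm(xb, axis=1, keepdims=True)
xq /= np.linalg.norm(xq, axis=1, keepdims=True)

inner = xq @ xb.T                                                # inner products, shape (3, 200)
sq_l2 = ((xq[:, None, :] - xb[None, :, :]) ** 2).sum(axis=2)     # squared L2, shape (3, 200)

# For unit vectors: ||q - b||^2 = 2 - 2 <q, b>
identity_holds = np.allclose(sq_l2, 2.0 - 2.0 * inner)
best_by_l2 = sq_l2.argmin(axis=1)
best_by_ip = inner.argmax(axis=1)
print(identity_holds, best_by_l2, best_by_ip)
```

Without normalization the two rankings can differ, which is why the metric has to be chosen to match how the vectors were produced.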

The example code:
    nlist = 100                      # number of cluster centroids
    k = 4
    quantizer = faiss.IndexFlatL2(d) # the other index
    # the metric is passed explicitly so that the index searches by L2 distance
    index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)

    assert not index.is_trained
    index.train(xb)
    assert index.is_trained

    index.add(xb)                    # add may be a bit slower as well
    D, I = index.search(xq, k)       # actual search
    print(I[-5:])                    # neighbors of the 5 last queries
    index.nprobe = 10                # default nprobe is 1, try a few more
    D, I = index.search(xq, k)
    print(I[-5:])                    # neighbors of the 5 last queries
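
To see what nlist and nprobe control, here is a toy NumPy sketch of the inverted-file idea: cluster the database with k-means, store each vector in the list of its nearest centroid, and at query time scan only the nprobe closest lists. It is a simplified illustration, not how faiss is implemented; the helper names (sq_l2, train_kmeans, ivf_search) and the data sizes are assumptions.

```python
import numpy as np

def sq_l2(a, b):
    """Pairwise squared L2 distances between the rows of a and the rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

def train_kmeans(x, nlist, n_iter=10, seed=0):
    """Plain Lloyd's k-means; returns nlist centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=nlist, replace=False)].copy()
    for _ in range(n_iter):
        assign = sq_l2(x, centroids).argmin(axis=1)
        for c in range(nlist):
            members = x[assign == c]
            if len(members):                  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return centroids

def ivf_search(xq, xb, centroids, lists, k, nprobe):
    """For each query, scan only the nprobe inverted lists with the closest centroids."""
    D = np.full((len(xq), k), np.inf)
    I = np.full((len(xq), k), -1, dtype=np.int64)
    probes = np.argsort(sq_l2(xq, centroids), axis=1)[:, :nprobe]
    for qi, q in enumerate(xq):
        cand = np.concatenate([lists[c] for c in probes[qi]])   # candidate vector ids
        if len(cand) == 0:
            continue
        dist = ((xb[cand] - q) ** 2).sum(axis=1)
        order = np.argsort(dist)[:k]
        D[qi, :len(order)] = dist[order]
        I[qi, :len(order)] = cand[order]
    return D, I

rng = np.random.default_rng(1234)
xb = rng.random((2000, 16))
xq = rng.random((20, 16))
nlist, k = 20, 4

centroids = train_kmeans(xb, nlist)                       # corresponds to train
assign = sq_l2(xb, centroids).argmin(axis=1)              # corresponds to add
lists = [np.flatnonzero(assign == c) for c in range(nlist)]

# Scanning every list is the same as an exhaustive search
_, I_exact = ivf_search(xq, xb, centroids, lists, k, nprobe=nlist)
for nprobe in (1, 5, nlist):
    _, I = ivf_search(xq, xb, centroids, lists, k, nprobe)
    recall = (I[:, 0] == I_exact[:, 0]).mean()
    print(f"nprobe={nprobe:2d}  fraction of exact nearest neighbors found={recall:.2f}")
```

A small nprobe scans fewer vectors and may miss the true nearest neighbor when it lies in a neighboring cluster; raising nprobe trades speed for accuracy, and nprobe equal to nlist degenerates into brute-force search.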
<>Reducing memory usage

Versions after 2018-02-22 can also store the inverted indexes on disk; see the demo
<https://github.com/facebookresearch/faiss/blob/master/demos/demo_ondisk_ivf.py>
for usage.

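The IndexFlatL2 and IndexIVFFlat indexes above both keep every full vector in memory. To handle larger datasets, faiss provides a compression scheme based on the
Product Quantizer
<https://hal.inria.fr/file/index/docid/514462/filename/paper_hal.pdf>
that encodes each vector into a fixed number of bytes. The stored vectors are then compressed, so the distances returned by a search are approximate. See the linked paper for the details of product quantization.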

Here is the demo. It is similar to the IndexIVFFlat one, but uses IndexIVFPQ.
    nlist = 100
    m = 8                            # number of sub-quantizers; at 8 bits each, 8 bytes per vector
    k = 4
    quantizer = faiss.IndexFlatL2(d) # this remains the same
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
                                     # 8 specifies that each sub-vector is encoded as 8 bits
    index.train(xb)
    index.add(xb)
    D, I = index.search(xb[:5], k)   # sanity check
    print(I)
    print(D)
    index.nprobe = 10                # make comparable with experiment above
    D, I = index.search(xq, k)       # search
    print(I[-5:])
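Earlier we set d = 64 with float32 vectors, so each vector takes 64 * 32 / 8 = 256 bytes. Here it is compressed to 8 bytes, so the compression ratio is (64 * 32 / 8) / 8 = 32.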
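
The same arithmetic, written out as a short Python sketch for the whole database of the demo. It counts only the vector payload; whatever else the index stores (IDs, centroids, codebooks) is left out, so treat the totals as rough estimates.

```python
d = 64                 # vector dimension
bytes_per_float32 = 4
m = 8                  # number of sub-quantizers
nbits = 8              # bits per sub-quantizer code
nb = 100_000           # database size from the Quick Start

raw_bytes = d * bytes_per_float32      # 256 bytes per uncompressed vector
code_bytes = m * nbits // 8            # 8 bytes per compressed vector
ratio = raw_bytes / code_bytes         # 32.0

print(raw_bytes, code_bytes, ratio)
print(f"raw vectors: {nb * raw_bytes / 2**20:.1f} MiB")
print(f"PQ codes:    {nb * code_bytes / 2**20:.2f} MiB")
```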

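The results are shown below. The first vector's distance to itself is 1.40704751 rather than 0 because, as noted above, the returned distances are approximate. The nearest neighbor of each vector is still found correctly (it is the vector itself), although the remaining neighbors differ from the exact IndexFlatL2 results.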
    [[   0  608  220  228]
     [   1 1063  277  617]
     [   2   46  114  304]
     [   3  791  527  316]
     [   4  159  288  393]]
    [[ 1.40704751  6.19361687  6.34912491  6.35771513]
     [ 1.49901485  5.66632462  5.94188499  6.29570007]
     [ 1.63260388  6.04126883  6.18447495  6.26815748]
     [ 1.5356375   6.33165455  6.64519501  6.86594009]
     [ 1.46203303  6.5022912   6.62621975  6.63154221]]
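
The nonzero self-distance follows directly from how product quantization stores a vector. The NumPy sketch below is a toy version of the idea, not the faiss implementation: each vector is split into m sub-vectors, each sub-vector is replaced by the index of its nearest centroid in a per-sub-space codebook, and decoding returns the centroids, not the original values. The helper names, the 16 centroids per sub-space (faiss uses 256 for 8-bit codes), and the data sizes are all assumptions made to keep the example small.

```python
import numpy as np

def kmeans(x, n_centroids, n_iter=10, seed=0):
    """Plain Lloyd's k-means on the rows of x."""
    rng = np.random.default_rng(seed)
    c = x[rng.choice(len(x), size=n_centroids, replace=False)].copy()
    for _ in range(n_iter):
        assign = ((x[:, None, :] - c[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(n_centroids):
            members = x[assign == j]
            if len(members):
                c[j] = members.mean(axis=0)
    return c

def pq_train(x, m, n_centroids):
    """One codebook per sub-space; result has shape (m, n_centroids, d // m)."""
    return np.stack([kmeans(sub, n_centroids) for sub in np.split(x, m, axis=1)])

def pq_encode(x, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    m = len(codebooks)
    codes = np.empty((len(x), m), dtype=np.uint8)
    for j, sub in enumerate(np.split(x, m, axis=1)):
        dist = ((sub[:, None, :] - codebooks[j][None, :, :]) ** 2).sum(axis=2)
        codes[:, j] = dist.argmin(axis=1)
    return codes

def pq_decode(codes, codebooks):
    """Rebuild approximate vectors by concatenating the selected centroids."""
    return np.concatenate(
        [codebooks[j][codes[:, j]] for j in range(len(codebooks))], axis=1
    )

rng = np.random.default_rng(1234)
d, m, n_centroids = 16, 4, 16
x = rng.random((500, d))

codebooks = pq_train(x, m, n_centroids)
codes = pq_encode(x, codebooks)          # m small integers per vector
x_hat = pq_decode(codes, codebooks)      # lossy reconstruction

self_dist = ((x - x_hat) ** 2).sum(axis=1)
print(codes.shape, codes.dtype)
print(self_dist[:5])                     # distance of each vector to its own code
```

Because a search compares the query against reconstructions like x_hat, a vector that is in the index is generally at a small positive distance from its own stored code.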
<>Simplifying index construction

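As IndexIVFFlat and IndexIVFPQ show, some indexes are built on top of another index. faiss also offers methods such as PCA and LSH, and these are sometimes combined. Building such combinations by hand is tedious, so faiss lets you construct an index from a string description.
For example, the expression below builds the same kind of IndexIVFPQ as above.
    index = faiss.index_factory(d, "IVF100,PQ8")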
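One thing the documentation does not mention: according to the C++ code
<https://github.com/facebookresearch/faiss/blob/b24e05dc7e09a3062ea24e06d3585100d4ce19f9/AutoTune.cpp>
, index_factory also takes a third parameter, the metric described above. It accepts the same two values.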
Index *index_factory (int d, const char *description_in, MetricType metric)
More combinations can be found in the demo
<https://github.com/facebookresearch/faiss/blob/master/demos/demo_auto_tune.py>

The abbreviations for each index type are listed under Basic indexes
<https://github.com/facebookresearch/faiss/wiki/Faiss-indexes>
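
To show how a factory string such as "IVF100,PQ8" maps onto the constructor arguments used earlier, here is a small pure-Python sketch. describe_factory_string is a hypothetical helper written for this article, not part of faiss, and it only understands the IVF, PQ and Flat tokens that appear here; the real factory grammar is much richer.

```python
import re

def describe_factory_string(description):
    """Hypothetical helper: explain the comma-separated parts of a factory string.

    Only the IVF<nlist>, PQ<m> and Flat tokens used in this article are handled.
    """
    parts = []
    for token in description.split(","):
        token = token.strip()
        if match := re.fullmatch(r"IVF(\d+)", token):
            parts.append(("IVF", {"nlist": int(match.group(1))}))
        elif match := re.fullmatch(r"PQ(\d+)", token):
            parts.append(("PQ", {"m": int(match.group(1))}))
        elif token == "Flat":
            parts.append(("Flat", {}))
        else:
            raise ValueError(f"token not covered by this sketch: {token!r}")
    return parts

print(describe_factory_string("IVF100,PQ8"))    # [('IVF', {'nlist': 100}), ('PQ', {'m': 8})]
print(describe_factory_string("IVF100,Flat"))   # [('IVF', {'nlist': 100}), ('Flat', {})]
```

Read this way, "IVF100,PQ8" names an inverted file with nlist = 100 whose vectors are stored as product-quantizer codes with m = 8, matching the IndexIVFPQ demo.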

<>Using GPUs

Note that some indexes do not support GPUs; to see which ones do, check Basic indexes
<https://github.com/facebookresearch/faiss/wiki/Faiss-indexes>

faiss.get_num_gpus() reports how many GPUs are available.
    ngpus = faiss.get_num_gpus()
    print("number of GPUs:", ngpus)
Complete examples of using GPUs.

1. Using a single GPU
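    # allocate the GPU resources object that index_cpu_to_gpu expects
    res = faiss.StandardGpuResources()
    # build a flat (CPU) index
    index_flat = faiss.IndexFlatL2(d)
    # make it into a gpu index (device 0)
    gpu_index_flat = faiss.index_cpu_to_gpu(res, 0, index_flat)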
2. Using all GPUs
    cpu_index = faiss.IndexFlatL2(d)
    gpu_index = faiss.index_cpu_to_all_gpus(cpu_index)   # build the index
    gpu_index.add(xb)                # add vectors to the index
    print(gpu_index.ntotal)

    k = 4                            # we want to see 4 nearest neighbors
    D, I = gpu_index.search(xq, k)   # actual search
    print(I[:5])                     # neighbors of the 5 first queries
    print(I[-5:])                    # neighbors of the 5 last queries
