【机器学习】建立基于GitHub库的推荐系统引擎 - 好文

如果不熟悉协同过滤算法的可以查看我的一篇文章：【推荐系统】协同过滤浅入（基于用户/项目/内容/混合方式）
<https://blog.csdn.net/ChenVast/article/details/82590711>

代码存放在我的GitHub：https://github.com/935048000/GitHubRecommendationSystem
<https://github.com/935048000/GitHubRecommendationSystem>

开始

该推荐引擎是用于GitHub的库推荐

这里使用GitHub的API，基于协同过滤的推荐系统。

这个推荐系统的任务是获得我所有标星的资料库，然后得到这些库的全部创作者。

再获取这些作者的标星资料库。然后比较已加星的资料库，找到和我最相似的用户。发现最相似的GitHub用户后，把他标星的所有资料库生成一组推荐。

准备
# 导入需要的库 import pandas as pd import numpy as np import requests import json

获取令牌，令牌的获取方法：

访问该URL，进入到Personalaccess tokens（个人访问令牌）页面。

https://github.com/settings/tokens

点击生成新的令牌，勾上repo和user即可，点击生成令牌。

然后得到一串随机串，填入上面的mypw变量里即可。
# 需要自己的GitHub账户和令牌 myun = '935048000' mypw = 'Tokens'

创建获取资料库函数

该实验的前提是你的标星的资料库有一定的数量，才能有意义。
# 创建一个能拉取我已标星资料库名称的函数 my_starred_repos = [] def get_starred_by_me(): resp_list
= [] last_resp = '' first_url_to_get = 'https://api.github.com/user/starred'
first_url_resp = requests.get(first_url_to_get, auth=(myun,mypw)) last_resp =
first_url_resp resp_list.append(json.loads(first_url_resp.text)) while
last_resp.links.get('next'): next_url_to_get = last_resp.links['next']['url']
next_url_resp = requests.get(next_url_to_get, auth=(myun,mypw)) last_resp =
next_url_resp resp_list.append(json.loads(next_url_resp.text)) for i in
resp_list: for j in i: msr = j['html_url'] my_starred_repos.append(msr)

# 调用函数，获取我已标星的资料库名字,并输出名字列表 get_starred_by_me() print(my_starred_repos)

共有45个
# 获取每个标星库的用户名，并输出 my_starred_users = [] for ln in my_starred_repos:
right_split = ln.split('.com/')[1] starred_usr = right_split.split('/')[0]
my_starred_users.append(starred_usr) print(my_starred_users)

# 去除重复的用户 print(len(set(my_starred_users))) # 剩下40个

创建标星资料库检索函数
# 构建一个可以检索他们所有标星的资料库的函数 starred_repos = {k:[] for k in set(my_starred_users)}
def get_starred_by_user(user_name): starred_resp_list = [] last_resp = ''
first_url_to_get = 'https://api.github.com/users/'+ user_name +'/starred'
first_url_resp = requests.get(first_url_to_get, auth=(myun,mypw)) last_resp =
first_url_resp starred_resp_list.append(json.loads(first_url_resp.text)) while
last_resp.links.get('next'): next_url_to_get = last_resp.links['next']['url']
next_url_resp = requests.get(next_url_to_get, auth=(myun,mypw)) last_resp =
next_url_resp starred_resp_list.append(json.loads(next_url_resp.text)) for i in
starred_resp_list: for j in i: sr = j['html_url']
starred_repos.get(user_name).append(sr)

# 调用函数 for usr in list(set(my_starred_users)): print(usr) try:
get_starred_by_user(usr) except: print('failed for user', usr)

# 为所有被标星的资料库，构建一个特征集，并去除重复的资料库 repo_vocab = [item for sl in
list(starred_repos.values()) for item in sl] repo_set = list(set(repo_vocab))

# 这里就有700多个库了 # 对每位用户和每个资料库建立一个二进制向量，标星=1，不标星=0. all_usr_vector = [] for k,v
in starred_repos.items(): usr_vector = [] for url in repo_set: if url in v:
usr_vector.extend([1]) else: usr_vector.extend([0])
all_usr_vector.append(usr_vector)

# 把这些数据放入一个pandas的DataFrame里 df = pd.DataFrame(all_usr_vector,
columns=repo_set, index=starred_repos.keys())

# 查看DataFrame print(df)

# 需要将自己和他们比较，需要把自己添加进数据框里。 my_repo_comp = [] for i in df.columns: if i in
my_starred_repos: my_repo_comp.append(1) else: my_repo_comp.append(0) mrc =
pd.Series(my_repo_comp).to_frame(myun).T

# 输出 print(mrc)

# 添加列名并连接到数据框中 mrc.columns = df.columns fdf = pd.concat([df, mrc]) print(fdf)

# 计算自己和其他用户的相似性，使用scipy库的pearsonr函数 from sklearn.metrics import
jaccard_similarity_score from scipy.stats import pearsonr

计算相似度
# 将数据框中最后一个向量和其他向量进行比较，并生成中心化余弦相似度。 sim_score = {} for i in range(len(fdf)):
ss = pearsonr(fdf.iloc[-1,:], fdf.iloc[i,:]) sim_score.update({i: ss[0]}) sf =
pd.Series(sim_score).to_frame('similarity') print(sf)

相似度排序
# 有一些值为NaN，因为他们没有给任何项目标星，导致计算除以0. # 对值进行排序，返回最相似的索引列表
sf.sort_values('similarity', ascending=False)

# 第一个是自己，相似度100%，查询和自己最相似的用户,索引等于31的用户。 print(fdf.index[31])

# 查看他的标星的库 fdf.iloc[31,:][fdf.iloc[31,:]==1]

验证

去GitH上验证一下，准确无误

# 创建一个数据框，内容为和我最相似的三位用户已标星的资料库 all_recs =
fdf.iloc[[31,17,36,40],:][fdf.iloc[[31,17,36,40],:]==1].fillna(0).T
all_recs[(all_recs==1).all(axis=1)] str_recs_tmp =
all_recs[all_recs[myun]==0].copy() str_recs = str_recs_tmp.iloc[:,:-1].copy()
str_recs

# 看看是否存在两位共同标星的资料库 str_recs[(str_recs==1).all(axis=1)]
str_recs[str_recs.sum(axis=1)>1]

实验进行到此结束。

更进一步的话，安装每个被推荐的项目的标星数量进行排序

为了改进结果，可以添加基于内容的过滤。

为自己的库建立特征，这些特征可以表明这是个人兴趣。

生成一组单词特征，可以用于生成基于协同过滤的推荐。

如python、machine等等单词标签。确保与我们不太相似的用户可以提供基于自身兴趣的推荐。

热门工具换一换