A Simple Question-Answering System with LSTM: Keras on bAbI

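3.3 A Simple Question-Answering System with LSTM

3.3.1 Introduction to Question-Answering Systems

3.3.2 A Simple Question-Answering System in Keras

The logic of the model is shown in the diagram below: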

Dataset: Facebook's bAbI data
Training set:
    1 Mary moved to the bathroom.
    2 Sandra journeyed to the bedroom.
    3 Mary got the football there.
    4 John went to the kitchen.
    5 Mary went back to the kitchen.
    6 Mary went back to the garden.
    7 Where is the football?    garden    3 6
    8 Sandra went back to the office.
    9 John moved to the office.
    10 Sandra journeyed to the hallway.
    11 Daniel went back to the kitchen.
    12 Mary dropped the football.
    13 John got the milk there.
    14 Where is the football?    garden    12 6
    15 Mary took the football there.
    16 Sandra picked up the apple there.
    17 Mary travelled to the hallway.
    18 John journeyed to the kitchen.
    19 Where is the football?    hallway    15 17
The training set has the form dialog + question + answer. In each question line, tab characters separate the question, the answer, and the indices of the sentences that contain the answer.
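To make the line format concrete, here is a minimal sketch that splits one question line from the sample above into its parts. The variable names are illustrative only and are not taken from the tutorial's code.

```python
# One question line from the sample above; the three fields are tab-separated.
line = "7 Where is the football? \tgarden\t3 6"

# Split off the leading line number, then split the remainder on tabs.
nid, rest = line.split(" ", 1)
question, answer, supporting = rest.split("\t")

nid = int(nid)
question = question.strip()
# Indices of the sentences that contain the answer.
supporting_ids = [int(i) for i in supporting.split()]

print(nid, question, answer, supporting_ids)
```

The supporting indices refer to line numbers within the same dialog, so they are 1-based.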
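Next, we use two recurrent neural networks to build a simple question-answering system.
(1) Fetching the data
The data is hosted on Amazon AWS (S3). If the download fails when the code runs, download the dataset manually and place it in the Keras datasets directory. The exact commands are shown in the code.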
    # Fetch the data
    import tarfile
    from keras.utils import get_file

    try:
        path = get_file(
            'babi-tasks-v1-2.tar.gz',
            origin='https://s3.amazonaws.com/text-datasets/babi_tasks_1-20_v1-2.tar.gz')
    except Exception:
        print('Error downloading dataset, please download it manually:\n'
              '$ wget http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz\n'
              '$ mv tasks_1-20_v1-2.tar.gz ~/.keras/datasets/babi-tasks-v1-2.tar.gz')
        raise
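(2) Data preprocessing
Vectorize the text data by mapping each word to an integer index. (The original calls this step "word2vector"; it is a vocabulary lookup, not the word2vec algorithm.)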

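* Tokenize the text. Because this dataset is in English, words can be split on whitespace and punctuation. For a Chinese dataset, a segmenter such as jieba (or another tokenizer) is needed.

    import re

    # Split a sentence into individual word and punctuation tokens
    def tokenize(data):
        # '\W' matches any character other than letters, digits and underscore.
        # The group must not be optional: r"(\W+)?" can match the empty string,
        # which breaks re.split on Python 3.7+.
        return [x.strip() for x in re.split(r"(\W+)", data) if x.strip()]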
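As a quick check of what this kind of tokenizer produces, the sketch below applies the same regular-expression split to two sentences from the sample data. The function name `tokenize_sketch` is hypothetical; it mirrors the idea of the tutorial's `tokenize` rather than reproducing the author's exact code.

```python
import re

def tokenize_sketch(sentence):
    """Split on runs of non-word characters and keep punctuation as tokens."""
    # The capturing group makes re.split return the separators too;
    # whitespace-only pieces are dropped by the strip() filter.
    return [tok.strip() for tok in re.split(r"(\W+)", sentence) if tok.strip()]

statement_tokens = tokenize_sketch("Mary moved to the bathroom.")
question_tokens = tokenize_sketch("Where is the football?")

print(statement_tokens)
print(question_tokens)
```

Keeping the punctuation marks as separate tokens means "." and "?" also receive their own entries in the vocabulary.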
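* Parse the dialog text

    # parse_dialog parses all dialogs and returns tokenized (dialog, question, answer) tuples.
    # If only_supporting is true, only the sentences that contain the answer are returned.
    def parse_dialog(lines, only_supporting=False):
        data = []
        dialog = []
        for line in lines:
            # Lines read from the tar archive are bytes in Python 3
            if isinstance(line, bytes):
                line = line.decode('utf-8')
            line = line.strip()
            nid, line = line.split(' ', 1)
            nid = int(nid)
            # Line number 1 marks the start of a new dialog, so start over
            if nid == 1:
                dialog = []
            # A line with tabs is a question: split question, answer and answer indices
            if '\t' in line:
                ques, ans, data_idx = line.split('\t')
                ques = tokenize(ques)
                if only_supporting:
                    # Split the index string first, then convert each index to int
                    substory = [dialog[i - 1] for i in map(int, data_idx.split())]
                else:
                    substory = [x for x in dialog if x]
                data.append((substory, ques, ans))
                # Placeholder so later supporting indices still line up with line numbers
                dialog.append('')
            else:
                # A line without tabs is a statement: tokenize it and add it to dialog
                dialog.append(tokenize(line))
        return data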
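* Collect every dialog: flatten each tokenized dialog into a single token list, and store (dialog, question, answer) as a tuple.

    from functools import reduce

    # max_length caps the dialog length. A percentile can be used to find the length
    # that covers 90% of the data and use it as max_length; otherwise the sequences
    # are too long and training may run out of memory.
    def get_dialog(f, only_supporting=False, max_length=None):
        # Extract the complete dialogs
        data = parse_dialog(f.readlines(), only_supporting=only_supporting)
        flatten = lambda data: reduce(lambda x, y: x + y, data)
        data = [(flatten(dialog), ques, ans) for (dialog, ques, ans) in data
                if not max_length or len(flatten(dialog)) < max_length]
        return data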
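The comment above suggests choosing `max_length` from a percentile of the dialog lengths. The following is a small sketch of that idea using NumPy; the list `dialog_lengths` holds made-up values purely for illustration, and in practice the lengths would come from the parsed data.

```python
import numpy as np

# Hypothetical dialog lengths in tokens (illustrative values, not real bAbI statistics).
dialog_lengths = [12, 18, 25, 30, 31, 40, 44, 52, 60, 149]

# Length that covers roughly 90% of the dialogs.
max_length = int(np.percentile(dialog_lengths, 90))

# Same filter as in get_dialog: keep dialogs strictly shorter than max_length.
kept = [n for n in dialog_lengths if n < max_length]

print(max_length, len(kept))
```

With a cap like this, a single unusually long dialog no longer determines the padded length of every sample.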
* Normalize the sequence lengths. Find the maximum dialog length in words and pad every dialog to that length. The questions are handled the same way.

    import numpy as np
    from keras.preprocessing.sequence import pad_sequences

    def vectorize_dialog(data, wd_idx, dialog_maxlen, ques_maxlen):
        # Vectorize: replace each word with its index in the vocabulary
        dialog_vec = []
        ques_vec = []
        ans_vec = []
        for dialog, ques, ans in data:
            dialog_idx = [wd_idx[w] for w in dialog]
            ques_idx = [wd_idx[w] for w in ques]
            # One-hot answer vector; index 0 is reserved for padding
            ans_zero = np.zeros(len(wd_idx) + 1)
            ans_zero[wd_idx[ans]] = 1
            dialog_vec.append(dialog_idx)
            ques_vec.append(ques_idx)
            ans_vec.append(ans_zero)
        # Pad dialogs and questions to their respective maximum lengths
        return pad_sequences(dialog_vec, maxlen=dialog_maxlen), \
            pad_sequences(ques_vec, maxlen=ques_maxlen), \
            np.array(ans_vec)
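To show what the padding and the one-hot answer look like, here is a pure-Python sketch. `pad_left` and `one_hot` are hypothetical helpers written for this illustration; `pad_left` imitates the default behaviour of Keras's `pad_sequences` (zeros added at the front, overly long sequences cut at the front) but is not the library function itself.

```python
def pad_left(seq, maxlen, value=0):
    """Left-pad, or left-truncate, a list of ints to exactly maxlen items."""
    if len(seq) > maxlen:
        seq = seq[len(seq) - maxlen:]   # keep only the last maxlen items
    return [value] * (maxlen - len(seq)) + list(seq)

def one_hot(index, size):
    """Vector of zeros with a single 1 at the given index."""
    vec = [0] * size
    vec[index] = 1
    return vec

padded = pad_left([5, 2, 9], maxlen=6)
truncated = pad_left([1, 2, 3, 4, 5, 6, 7, 8], maxlen=6)
answer_vec = one_hot(3, size=5)

print(padded)
print(truncated)
print(answer_vec)
```

Because 0 is used as the padding value, word indices start at 1 and the one-hot vector needs one more slot than there are words.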
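* Prepare the data and preprocess it with the functions above.

    import numpy as np

    # NOTE: `tar` and `data_path` are not defined in the original listing. The two
    # lines below are an assumption (bAbI v1.2 archive, task 2, which matches the
    # two-supporting-fact sample shown earlier).
    tar = tarfile.open(path)
    data_path = 'tasks_1-20_v1-2/en/qa2_two-supporting-facts_{}.txt'

    # Prepare the data
    train_tar = tar.extractfile(data_path.format('train'))
    test_tar = tar.extractfile(data_path.format('test'))
    train = get_dialog(train_tar)
    test = get_dialog(test_tar)

    # Build the vocabulary: the set of all words that appear in the text
    lexicon = set()
    for dialog, ques, ans in train + test:
        lexicon |= set(dialog + ques + [ans])
    lexicon = sorted(lexicon)
    lexicon_size = len(lexicon) + 1  # +1 because index 0 is reserved for padding

    # Map words to indices, and find the maximum dialog and question lengths for padding
    wd_idx = dict((wd, idx + 1) for idx, wd in enumerate(lexicon))
    dialog_maxlen = max(map(len, (x for x, _, _ in train + test)))
    ques_maxlen = max(map(len, (x for _, x, _ in train + test)))

    # Percentile of the dialog lengths; can be passed to get_dialog as max_length
    dia_80 = np.percentile([len(x) for x, _, _ in train + test], 80)

    # Vectorize the training and test sets
    dialog_train, ques_train, ans_train = vectorize_dialog(
        train, wd_idx, dialog_maxlen, ques_maxlen)
    dialog_test, ques_test, ans_test = vectorize_dialog(
        test, wd_idx, dialog_maxlen, ques_maxlen)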
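The vocabulary step can be tried on a tiny hand-made example. The sketch below follows the same recipe (collect every token, sort, number from 1); the two samples and all variable names are made up for illustration.

```python
# Two made-up (dialog, question, answer) samples, already tokenized.
samples = [
    (["Mary", "moved", "to", "the", "bathroom", "."],
     ["Where", "is", "Mary", "?"], "bathroom"),
    (["John", "went", "to", "the", "kitchen", "."],
     ["Where", "is", "John", "?"], "kitchen"),
]

# Collect every token that appears anywhere in the data.
vocab = set()
for dialog, question, answer in samples:
    vocab |= set(dialog + question + [answer])
vocab = sorted(vocab)

# Indices start at 1 so that 0 stays free for padding.
word_to_index = {word: i + 1 for i, word in enumerate(vocab)}
vocab_size = len(vocab) + 1

encoded_question = [word_to_index[w] for w in samples[0][1]]
print(vocab_size, encoded_question)
```

Sorting the vocabulary makes the word-to-index mapping reproducible from run to run.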
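* Build the neural network model

Because the dialogs and the questions are intrinsically related, each is first passed through its own embedding and dropout layers (the question branch additionally goes through an LSTM and a RepeatVector, as described below). The two branches are then merged and fed into an LSTM network, and the output of that LSTM goes into a softmax layer, which completes the model.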

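* Building the dialog branch.

Note: dialog_maxlen = 149 and embedding_out = 50.
Input: shape (dialog_maxlen,), i.e. (None, 149) including the batch dimension.
Output: the output of the embedding layer, with shape (dialog_maxlen, embedding_out).

Because the data has been length-normalized, every dialog has length dialog_maxlen, so the output for each dialog has shape (149, 50). Each of the 149 rows is the 50-dimensional embedding of one token in the dialog.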
    from keras.layers import Input, Embedding, Dropout

    embedding_out = 50  # value given in the note above

    # Dialog branch: embedding + dropout
    dialog = Input(shape=(dialog_maxlen,), dtype='int32')
    encodeed_dialog = Embedding(lexicon_size, embedding_out)(dialog)
    encodeed_dialog = Dropout(0.3)(encodeed_dialog)
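The shapes described for the dialog branch can be checked without Keras, since an embedding layer is essentially a table lookup. The sketch below uses a random NumPy matrix as a stand-in for the trained embedding weights; `vocab_size = 20` is an arbitrary value chosen for the illustration, while 149 and 50 are the values quoted in the text.

```python
import numpy as np

vocab_size = 20        # arbitrary, for illustration only
embedding_out = 50     # from the text
dialog_maxlen = 149    # from the text

rng = np.random.default_rng(0)

# Stand-in for the embedding weights: one row of size embedding_out per word index.
table = rng.normal(size=(vocab_size, embedding_out))

# One padded dialog: dialog_maxlen word indices.
dialog_indices = rng.integers(0, vocab_size, size=dialog_maxlen)

# The lookup replaces every index with its row from the table.
embedded = table[dialog_indices]

print(embedded.shape)
```

Each row of `embedded` corresponds to one token position in the dialog, which is why the result has dialog_maxlen rows and embedding_out columns.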
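* Building the question branch

The questions are fed into an LSTM. By default (return_sequences=False), a Keras LSTM returns only the hidden state of the last time step, so the LSTM output is a single vector.

The LSTM output is then passed through RepeatVector, which repeats it dialog_maxlen times, so the shape of encodeed_ques becomes (dialog_maxlen, lstm_out). This matches the shape of encodeed_dialog only when lstm_out equals embedding_out, which is required for the Add merge layer.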
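    from keras.layers import Input, Embedding, Dropout, LSTM, RepeatVector

    # lstm_out is not given in the original; it is assumed equal to embedding_out
    # so that the shapes match in the Add layer
    lstm_out = embedding_out

    # Question branch: embedding + dropout + lstm
    question = Input(shape=(ques_maxlen,), dtype='int32')
    encodeed_ques = Embedding(lexicon_size, embedding_out)(question)
    encodeed_ques = Dropout(0.3)(encodeed_ques)
    encodeed_ques = LSTM(units=lstm_out)(encodeed_ques)
    encodeed_ques = RepeatVector(dialog_maxlen)(encodeed_ques)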
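The effect of repeating the question vector and adding it to the dialog embeddings can be sketched in NumPy. This is only a shape illustration with placeholder values, not the Keras layers themselves; `hidden = 50` assumes the question LSTM size equals the embedding size, which element-wise addition requires.

```python
import numpy as np

dialog_maxlen = 149
hidden = 50  # assumed equal to the embedding size

# Placeholder dialog embeddings: one row per token position.
dialog_emb = np.ones((dialog_maxlen, hidden))

# Placeholder for the single vector produced by the question LSTM.
question_vec = np.arange(hidden, dtype=float)

# Repeat the question vector once per dialog position.
repeated = np.tile(question_vec, (dialog_maxlen, 1))

# Element-wise addition; the shape stays (dialog_maxlen, hidden).
merged = dialog_emb + repeated

print(repeated.shape, merged.shape)
```

After the addition, every token position in the dialog carries the same question information, so the following LSTM sees the question at each step.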
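* The dialogs and the questions are intrinsically related, so the two branches are combined in a merge layer before the recurrent network is trained.

    from keras.layers import Add, LSTM, Dropout, Dense
    from keras.models import Model

    # Merge the dialog and question branches, then lstm + dropout + dense
    merged = Add()([encodeed_dialog, encodeed_ques])
    merged = LSTM(units=lstm_out)(merged)
    merged = Dropout(0.3)(merged)
    preds = Dense(units=lexicon_size, activation='softmax')(merged)
    model = Model([dialog, question], preds)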
* Compile

    print('compiling........')
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
* Train

    # batch_size and epochs are not given in the original; these values are assumptions
    batch_size = 32
    epochs = 20

    # Train
    print('training.......')
    model.fit([dialog_train, ques_train], ans_train,
              batch_size=batch_size,
              epochs=epochs,
              verbose=1,
              validation_split=0.1)
    loss, accu = model.evaluate([dialog_test, ques_test], ans_test,
                                verbose=1, batch_size=batch_size)
    print('%s: %.4f \n %s: %.4f' % ('loss', loss, 'accu', accu))
* Predict

    import numpy as np

    pre = model.predict([dialog_test, ques_test], batch_size=batch_size, verbose=1)

    # Print the test results
    def get_key(dic, value):
        # Reverse lookup: words whose index equals value (empty list for padding index 0)
        return [k for k, v in dic.items() if v == value]

    for i in range(len(dialog_test)):
        # Turn the index sequences back into text, skipping padding
        lis_dia = [get_key(wd_idx, x) for x in dialog_test[i]]
        dialog = ' '.join(w for ws in lis_dia for w in ws)
        lis_ques = [get_key(wd_idx, x) for x in ques_test[i]]
        ques = ' '.join(w for ws in lis_ques for w in ws)
        ans_idx = np.argmax(ans_test[i])
        pre_idx = np.argmax(pre[i])
        print('%s\n %s ' % ('dialog', dialog))
        print('%s\n %s ' % ('question', ques))
        print('%s\n %s ' % ('right_answer', get_key(wd_idx, ans_idx)))
        print('%s\n %s\n' % ('pred', get_key(wd_idx, pre_idx)))
Sample prediction output:

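In summary: the dialog data is embedded to produce a matrix, and the question data goes through an LSTM and RepeatVector to produce a second matrix of the same shape. The two matrices are added element-wise (the Add layer) to give a merged matrix of that same shape, which we call mat_merge. mat_merge is fed into an LSTM, which outputs its last hidden state, called h_out (a single vector). h_out then goes into a fully connected layer with a softmax activation, which assigns a probability to every word in the vocabulary. The word with the highest probability is the predicted answer.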
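The final step, turning the output of the dense layer into a predicted word, can be illustrated with a small NumPy sketch. The logits and the three-word `index_to_word` mapping are invented for the example; index 0 is kept as the padding slot, as in the vocabulary built earlier.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)   # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical vocabulary; index 0 is reserved for padding.
index_to_word = {1: "garden", 2: "hallway", 3: "kitchen"}

# Made-up scores from the dense layer, one per vocabulary index.
logits = np.array([0.0, 2.0, 0.5, 1.0])

probs = softmax(logits)
predicted_word = index_to_word[int(np.argmax(probs))]

print(probs.round(3), predicted_word)
```

Since softmax preserves the ordering of the scores, taking the argmax of the probabilities picks the same word as taking the argmax of the raw scores.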
