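I have been learning PyTorch recently, so I tried implementing textCNN with it (P.S. there are other people's textCNN implementations on GitHub). One advantage PyTorch has over TensorFlow is that it is easy to learn, which makes it a good fit for beginners.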

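The first thing to pay attention to is the data preprocessing for this example. The data I use is the Chinese text classification dataset THUCNews.
THUCNews was produced by filtering historical data from the Sina News RSS feeds from 2005 to 2011. It contains 740,000 news documents (2.19 GB), all in UTF-8 plain text. Starting from the original Sina News classification scheme, the documents were regrouped into 14 candidate categories: finance, lottery, real estate, stocks, home, education, technology, society, fashion, politics, sports, horoscope, games, and entertainment. Evaluated on this dataset with the THUCTC toolkit, accuracy reaches 88.6%.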

The data can be downloaded from THUCTC: an efficient Chinese text classification toolkit
<https://link.zhihu.com/?target=http%3A//thuctc.thunlp.org/>.

For preprocessing, we first need to extract the Chinese text and drop all non-Chinese characters.

The actual function is in the GitHub repository; that code is not pasted here.
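
Since the extraction function is not shown in this post, here is a minimal sketch of what such a helper could look like. It is not the repository's code: it assumes that "Chinese" means the CJK Unified Ideographs block (U+4E00 to U+9FFF) and simply deletes everything else, including punctuation, digits, and Latin letters.

```python
import re

# Assumption: "Chinese" = CJK Unified Ideographs, U+4E00..U+9FFF.
# Anything outside that range (letters, digits, punctuation, spaces) is dropped.
_NON_CHINESE = re.compile(r'[^\u4e00-\u9fff]+')


def extract_chinese(line):
    """Return `line` with every non-Chinese character removed."""
    return _NON_CHINESE.sub('', line)


print(extract_chinese("PyTorch 实现 textCNN, 2018年!"))  # 实现年
```

Removing punctuation this way also removes sentence boundaries, which is usually acceptable for a bag-of-n-grams style model such as textCNN.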

Preprocessing has to convert the raw text data into training data.

Step 1: Data preprocessing

import os
import jieba

def datahelper(dir):
    # Returns the tokenized texts and the label of each text.
    # extract_chinese is the helper from the GitHub repository.
    labels_index = {}   # category name -> label id
    index_lables = {}   # label id -> category name
    num_recs = 0
    fs = os.listdir(dir)
    MAX_SEQUENCE_LENGTH = 200
    MAX_NB_WORDS = 50000
    EMBEDDING_DIM = 20
    VALIDATION_SPLIT = 0.2
    i = 0
    for f in fs:
        labels_index[f] = i
        index_lables[i] = f
        i = i + 1
    print(labels_index)
    texts = []
    labels = []  # list of label ids
    for la in labels_index.keys():
        print(la + " " + index_lables[labels_index[la]])
        la_dir = dir + "/" + la
        fs = os.listdir(la_dir)
        for f in fs:
            with open(la_dir + "/" + f, encoding='utf-8') as file:
                lines = file.readlines()
            text = []
            for line in lines:
                if len(line) > 5:
                    line = extract_chinese(line)
                    words = jieba.lcut(line, cut_all=False, HMM=True)
                    # collect the words of every line in the file
                    text.extend(words)
            texts.append(text)
            labels.append(labels_index[la])
            num_recs = num_recs + 1
    return texts, labels, labels_index, index_lables

The returned texts are lists of words, and each word in those lists has to be replaced by a numeric index. First, build the vocabulary.

# vocabulary
word_vocb = []
word_vocb.append('')  # empty token, used later for padding
for text in texts:
    for word in text:
        word_vocb.append(word)
word_vocb = set(word_vocb)
vocb_size = len(word_vocb)

Once the vocabulary is built, create the mapping from vocabulary entries to indices.

# mapping between vocabulary entries and indices
word_to_idx = {word: i for i, word in enumerate(word_vocb)}
idx_to_word = {word_to_idx[word]: word for word in word_to_idx}

With that, the training data can be built.
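
One detail worth knowing: a Python set of strings has no guaranteed iteration order, and because string hashing is randomized per process, the indices produced by enumerating a set can differ from run to run. That matters if a trained model is saved and the vocabulary is rebuilt later. The sketch below is an alternative, not the code used in this post: the hypothetical helper `build_vocab` sorts the words and reserves index 0 for the empty padding token so the mapping is reproducible.

```python
def build_vocab(texts, pad_token=''):
    """Build a reproducible word <-> index mapping.

    `build_vocab` and `pad_token` are illustrative names, not part of the
    original code. Index 0 is reserved for the padding token.
    """
    words = sorted({word for text in texts for word in text} - {pad_token})
    word_to_idx = {pad_token: 0}
    for i, word in enumerate(words, start=1):
        word_to_idx[word] = i
    idx_to_word = {i: word for word, i in word_to_idx.items()}
    return word_to_idx, idx_to_word


texts = [["股票", "上涨"], ["球队", "获胜", "上涨"]]
word_to_idx, idx_to_word = build_vocab(texts)
print(len(word_to_idx))   # 5 (four distinct words plus the padding token)
print(word_to_idx[''])    # 0
```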

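# Generate the training data: each word has to be converted into its index
import numpy as np

# assumed initialisation: one row of max_len ids per text (max_len is set beforehand)
texts_with_id = np.zeros((len(texts), max_len), dtype=np.int64)
for i in range(0, len(texts)):
    if len(texts[i]) < max_len:
        # short text: copy the word ids, then pad with the empty token
        for j in range(0, len(texts[i])):
            texts_with_id[i][j] = word_to_idx[texts[i][j]]
        for j in range(len(texts[i]), max_len):
            texts_with_id[i][j] = word_to_idx['']
    else:
        # long text: keep only the first max_len words
        for j in range(0, max_len):
            texts_with_id[i][j] = word_to_idx[texts[i][j]]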

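(P.S. note that every training text must be limited to max_len words: longer texts are truncated, and shorter ones are padded with the empty token ''.)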
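
The pad-or-truncate rule can also be written as a small function and checked on toy data. This is only a sketch: `encode` and the tiny `vocab` dictionary are made up for illustration, and it assumes every word is already in the vocabulary, as is the case when the vocabulary is built from the same texts.

```python
def encode(tokens, word_to_idx, max_len, pad_token=''):
    """Map tokens to ids, truncating or padding to exactly max_len ids."""
    pad_id = word_to_idx[pad_token]
    ids = [word_to_idx[token] for token in tokens[:max_len]]  # truncate
    return ids + [pad_id] * (max_len - len(ids))              # pad


# Hypothetical toy vocabulary; '' is the padding token.
vocab = {'': 0, '股票': 1, '上涨': 2, '下跌': 3}

print(encode(['股票', '上涨'], vocab, max_len=4))                 # [1, 2, 0, 0]
print(encode(['股票', '上涨', '股票', '下跌', '上涨'], vocab, 4))  # [1, 2, 1, 3]
```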

Step 2: Building the textCNN model

# textCNN model
import torch
import torch.nn as nn

class textCNN(nn.Module):
    def __init__(self, args):
        super(textCNN, self).__init__()
        vocb_size = args['vocb_size']
        dim = args['dim']
        n_class = args['n_class']
        max_len = args['max_len']
        embedding_matrix = args['embedding_matrix']
        # kept on the module so that forward() does not depend on globals
        self.max_len = max_len
        self.dim = dim
        # load the pretrained word vectors into the embedding layer
        self.embeding = nn.Embedding(vocb_size, dim, _weight=embedding_matrix)
        # each block keeps height/width in the convolution and halves them in the pooling
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=16, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(in_channels=16, out_channels=32, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        self.conv3 = nn.Sequential(
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        self.conv4 = nn.Sequential(
            nn.Conv2d(in_channels=64, out_channels=128, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        # 512 must equal 128 * (max_len // 16) * (dim // 16)
        self.out = nn.Linear(512, n_class)

    def forward(self, x):
        x = self.embeding(x)
        # (batch, max_len, dim) -> (batch, 1, max_len, dim): one input channel
        x = x.view(x.size(0), 1, self.max_len, self.dim)
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        # flatten (batch, out_channel, w, h) into (batch, out_channel*w*h)
        x = x.view(x.size(0), -1)
        output = self.out(x)
        return output
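
The input size of the final `nn.Linear(512, n_class)` has to match the flattened output of the fourth convolution block, so it depends on `max_len` and `dim`. The arithmetic below is a sketch of how to check that without running the model. It assumes the standard output-size formulas for a convolution and for a max-pooling layer whose stride equals its kernel size (the PyTorch default); the helper names are made up for illustration.

```python
def conv_pool_output(size, kernel_size=5, stride=1, padding=2, pool=2):
    """Size of one spatial dimension after Conv2d followed by MaxPool2d."""
    conv = (size + 2 * padding - kernel_size) // stride + 1  # unchanged for k=5, p=2
    return conv // pool                                      # pooling floors the result


def flattened_size(max_len, dim, n_blocks=4, out_channels=128):
    """Number of features fed to the final linear layer."""
    h, w = max_len, dim
    for _ in range(n_blocks):
        h, w = conv_pool_output(h), conv_pool_output(w)
    return out_channels * h * w


print(flattened_size(64, 16))   # 512
print(flattened_size(200, 20))  # 1536
```

With `max_len=64` and `dim=16` the result matches the hard-coded 512. Other settings, such as 200 words of 20 dimensions, would need the linear layer's input size changed accordingly.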


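The embedding layer used here has vocb_size*dim parameters, that is, the vocabulary size times the word-vector dimension. Note that it is initialized with pretrained word vectors rather than random ones.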

Loading the pretrained word vectors:


# word vector for each word
embeddings_index = getw2v()  # word vectors prepared in advance
embedding_matrix = np.zeros((nb_words, word_dim))
for word, i in word_to_idx.items():
    if i >= nb_words:
        continue
    if word in embeddings_index:
        embedding_vector = embeddings_index[word]
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
args['embedding_matrix'] = torch.Tensor(embedding_matrix)


Step 3: Training


The learning rate is set to LR = 0.001, the optimizer is Adam, and the loss function is nn.CrossEntropyLoss().
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split

# cnn (the textCNN instance) and EPOCH are assumed to be defined earlier
LR = 0.001
optimizer = torch.optim.Adam(cnn.parameters(), lr=LR)
# loss function
loss_function = nn.CrossEntropyLoss()
# training batch size (despite the name, this is the number of samples per step)
epoch_size = 1000
texts_len = len(texts_with_id)
print(texts_len)
# split into training data and test data
x_train, x_test, y_train, y_test = train_test_split(
    texts_with_id, labels, test_size=0.2, random_state=42)
test_x = torch.tensor(x_test, dtype=torch.long)
test_y = torch.tensor(y_test, dtype=torch.long)
train_x = x_train
train_y = y_train
test_epoch_size = 300  # evaluation batch size
for epoch in range(EPOCH):
    for i in range(len(train_x) // epoch_size):
        b_x = torch.tensor(train_x[i * epoch_size:(i + 1) * epoch_size], dtype=torch.long)
        b_y = torch.tensor(train_y[i * epoch_size:(i + 1) * epoch_size], dtype=torch.long)
        output = cnn(b_x)
        loss = loss_function(output, b_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(str(i))
        print(loss)
        # accuracy on the current training batch
        pred_y = torch.max(output, 1)[1]
        acc = (b_y == pred_y)
        acc = acc.numpy().sum()
        accuracy = acc / b_y.size(0)
        # accuracy on the test set, evaluated batch by batch without gradients
        acc_all = 0
        with torch.no_grad():
            for j in range(len(test_x) // test_epoch_size):
                b_x = test_x[j * test_epoch_size:(j + 1) * test_epoch_size]
                b_y = test_y[j * test_epoch_size:(j + 1) * test_epoch_size]
                test_output = cnn(b_x)
                pred_y = torch.max(test_output, 1)[1]
                acc = (pred_y == b_y)
                acc = acc.numpy().sum()
                print("acc " + str(acc / b_y.size(0)))
                acc_all = acc_all + acc
        accuracy = acc_all / test_y.size(0)
        print("epoch " + str(epoch) + " step " + str(i) + " " + "acc " + str(accuracy))

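Note that slicing with `len(data) // batch_size` full batches skips the last few samples whenever the data size is not a multiple of the batch size, while the accuracy is still divided by the full test-set size. The sketch below shows one way to cover every sample and to compute accuracy over exactly the samples that were evaluated. It uses plain Python lists so it runs without PyTorch, and `batch_slices` and `accuracy` are illustrative names rather than functions from the repository.

```python
def batch_slices(n_samples, batch_size):
    """Yield (start, end) index pairs covering all samples, last batch included."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)


def accuracy(predicted, expected):
    """Fraction of positions where the predicted label equals the expected one."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


print(list(batch_slices(10, 4)))             # [(0, 4), (4, 8), (8, 10)]
print(accuracy([1, 0, 2, 2], [1, 0, 1, 2]))  # 0.75
```

In the training loop, each `(start, end)` pair would replace the hand-computed `i * epoch_size` slices.
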
The full code is at
https://github.com/13061051/PytorchLeran