1. Introduction to the 20 Newsgroups Dataset


The 20 Newsgroups dataset is one of the standard international benchmark datasets for research in text classification, text mining, and information retrieval. It contains roughly 20,000 newsgroup documents, partitioned evenly across 20 newsgroups on different topics. Some of the newsgroups are closely related (e.g. comp.sys.ibm.pc.hardware / comp.sys.mac.hardware), while others are completely unrelated (e.g. misc.forsale / soc.religion.christian). The 20 topics are:


comp.graphics

comp.os.ms-windows.misc

comp.sys.ibm.pc.hardware

comp.sys.mac.hardware

comp.windows.x

rec.autos

rec.motorcycles

rec.sport.baseball

rec.sport.hockey

sci.crypt

sci.electronics

sci.med

sci.space


misc.forsale

talk.politics.misc

talk.politics.guns

talk.politics.mideast

talk.religion.misc

alt.atheism

soc.religion.christian


The 20 Newsgroups dataset comes in three versions. The first, 19997, is the original, unmodified collection. The second, bydate, is sorted by date and split into a training set (60%) and a test set (40%); duplicate posts and newsgroup-identifying headers (Newsgroups, Path, Followup-To, Date) have been removed. The third, 18828, has duplicates removed and keeps only the From and Subject headers.

* 20news-19997.tar.gz <http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz> – the original, unmodified 20 Newsgroups dataset
* 20news-bydate.tar.gz <http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz> – sorted by date into train/test splits; duplicates and newsgroup-identifying headers removed (18,846 documents)
* 20news-18828.tar.gz <http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz> – duplicates removed; only the From and Subject headers kept (18,828 documents)

In scikit-learn there are two ways to load this dataset. The first, sklearn.datasets.fetch_20newsgroups, returns the raw texts, which can then be fed to a text feature extractor such as sklearn.feature_extraction.text.CountVectorizer with custom parameters. The second, sklearn.datasets.fetch_20newsgroups_vectorized, returns features that have already been extracted, so no feature extractor is needed.
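
As a quick illustration, here is a minimal sketch of both loaders (the two category names passed to categories are arbitrary examples):

from sklearn.datasets import fetch_20newsgroups, fetch_20newsgroups_vectorized
from sklearn.feature_extraction.text import CountVectorizer

# Raw texts: feature extraction is left to the caller.
raw = fetch_20newsgroups(subset='train',
                         categories=['sci.space', 'rec.autos'],
                         remove=('headers', 'footers', 'quotes'))
X_raw = CountVectorizer(max_features=10000).fit_transform(raw.data)

# Pre-vectorized: a ready-made sparse feature matrix, no extractor needed.
vec = fetch_20newsgroups_vectorized(subset='train')
print(X_raw.shape, vec.data.shape)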



2. Loading Pretrained Word Vectors (GloVe, 100-dimensional)

import os
import sys

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM

BASE_DIR = './data'
GLOVE_DIR = BASE_DIR + '/glove.6B/'
TEXT_DATA_DIR = BASE_DIR + '/20_newsgroup/'
MAX_SEQUENCE_LENGTH = 1000
MAX_NB_WORDS = 20000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2
batch_size = 32

print('Indexing word vectors.')
embeddings_index = {}
f = open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Found %s word vectors.' % len(embeddings_index))
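
If the file loaded correctly, semantically related words should have a clearly positive cosine similarity; a small sanity check (assuming both words are in the GloVe vocabulary):

a = embeddings_index['king']
b = embeddings_index['queen']
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # should be well above 0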
3. Loading the Dataset and Labels (each document is assigned a numeric class ID based on the folder it lives in)

print('Processing text dataset')
texts = []
labels_index = {}
labels = []
for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id  # assign each newsgroup folder an integer ID
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                if sys.version_info < (3,):
                    f = open(fpath)
                else:
                    f = open(fpath, encoding='latin-1')
                texts.append(f.read())
                f.close()
                labels.append(label_id)
print('Found %s texts.' % len(texts))
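
A quick check that the folder scan picked up all 20 newsgroups (a small sketch using the labels_index built above):

print('Found %d classes.' % len(labels_index))  # expected: 20
print(sorted(labels_index, key=labels_index.get))  # folder names in label-ID order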
4. Vectorizing the Text Data

tokenizer = Tokenizer(nb_words=MAX_NB_WORDS)  # Keras 1 argument name; num_words in Keras 2
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
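
To verify what texts_to_sequences produced, a sequence can be mapped back to words (a minimal sketch; index 0 is reserved for padding, so word indices start at 1):

index_to_word = {i: w for w, i in word_index.items()}
print(' '.join(index_to_word[i] for i in sequences[0][:20]))  # first 20 tokens of document 0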
5. Building the Training and Validation Sets (the sequences are padded to a fixed length and shuffled before splitting)

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])
x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = labels[-nb_validation_samples:]
print('Preparing embedding matrix.')
print(nb_validation_samples)
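
The slice above takes the last 20% of the shuffled array as validation data, which does not guarantee that every class is equally represented. If scikit-learn is available, a stratified split is one alternative (a sketch, not part of the original pipeline):

from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(
    data, labels, test_size=VALIDATION_SPLIT,
    stratify=labels.argmax(axis=1), random_state=42)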
6. Building and Evaluating the LSTM Model

Because this is a 20-class problem, the network ends in a softmax layer over len(labels_index) classes and is trained with categorical cross-entropy.

nb_words = min(MAX_NB_WORDS, len(word_index))
embedding_matrix = np.zeros((nb_words + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    if i > MAX_NB_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
print(embedding_matrix.shape)

embedding_layer = Embedding(nb_words + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False,  # the weights are pretrained GloVe vectors, so we freeze them
                            dropout=0.2)

batch_size = 32
print('Build model...')
model = Sequential()
model.add(embedding_layer)
# Keras 1-style arguments (dropout_W/dropout_U; use dropout/recurrent_dropout in Keras 2)
model.add(LSTM(100, dropout_W=0.2, dropout_U=0.2))  # output dimension: 100
model.add(Dense(len(labels_index), activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(x_train, y_train, batch_size=batch_size, nb_epoch=5,
          validation_data=(x_val, y_val))
score, acc = model.evaluate(x_val, y_val, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)
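
Once trained, the model can be used for inference; a minimal sketch that maps the softmax output for the first few validation samples back to newsgroup names via labels_index:

index_to_label = {v: k for k, v in labels_index.items()}
probs = model.predict(x_val[:5])
for row in probs:
    print(index_to_label[int(np.argmax(row))])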
7. Training Results (partial output)

3936/3999 [============================>.] - ETA: 0s
3968/3999 [============================>.] - ETA: 0s
3999/3999 [==============================] - 52s 13ms/step
Test score: 0.18472591743048325
Test accuracy: 0.9499999885411226