NLTK (3): Making Predictions with a Model

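Copyright on Jianshu belongs to the author. For reproduction in any form, please contact the author for authorization and cite the source.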

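This article uses two simple examples to show how to make predictions with NLTK. The first example predicts a person's gender from a given name. The second determines what proportion of a set of reviews is positive and what proportion is negative.
Prediction involves four main steps:
(1) Prepare the data
(2) Extract features
(3) Train the model
(4) Use the model to make predictions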

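I. Predicting gender from a name
1. Preparing the data
For the data, we use the names that ship with the NLTK resources. To look at the underlying files, go to the nltk_data --> corpora --> names folder. The code that loads the data is as follows: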
    from nltk.corpus import names

    # Load data and training
    names = ([(name, 'male') for name in names.words('male.txt')] +
             [(name, 'female') for name in names.words('female.txt')])
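To give an intuitive picture, the loaded data looks like this (the u prefix appears because the output was produced under Python 2; Python 3 omits it):
    [(u'Aaron', 'male'), (u'Abbey', 'male'), (u'Abbie', 'male')]
    [(u'Zorana', 'female'), (u'Zorina', 'female'), (u'Zorine', 'female')]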
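The loading code concatenates all male names and then all female names, so the list is ordered by label. If part of the data is to be held out for checking the model, it helps to shuffle first. The following is a minimal sketch using only the standard library; `labeled_names`, `shuffle_and_split`, and the 50% split are illustrative assumptions, not part of the original example.

```python
import random

# Hypothetical miniature stand-in for the (name, gender) list loaded from the corpus.
labeled_names = [
    ('Aaron', 'male'), ('Abbey', 'male'), ('Abbie', 'male'),
    ('Zorana', 'female'), ('Zorina', 'female'), ('Zorine', 'female'),
]


def shuffle_and_split(pairs, test_fraction=0.5, seed=0):
    """Shuffle a copy of pairs reproducibly and return (train, test)."""
    shuffled = list(pairs)                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)       # fixed seed so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train_pairs, test_pairs = shuffle_and_split(labeled_names)
print(len(train_pairs), len(test_pairs))
```

With the real corpus, the same function could be applied to the full list before feature extraction, training only on the first part and keeping the second part for evaluation.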
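2. Extracting features
Here we take the last letter of a name as its feature. This choice may be inaccurate or even unreasonable, but that does not matter, because our goal is to understand how features are extracted in NLTK.
In this example the following code performs the feature extraction, that is, it extracts the last letter of each name:
    train_set = [(gender_features(n), g) for (n, g) in names]
The function gender_features(n) is defined as follows:
    def gender_features(word):
        return {'last_letter': word[-1]}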
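A feature extractor is just a function that returns a dictionary, so adding more features means adding more keys. As a sketch of that idea, the hypothetical variant below (the function name, the extra features, and the lowercasing are assumptions for illustration) returns several features at once:

```python
def gender_features_extended(word):
    """Hypothetical variant of gender_features that returns several features."""
    lowered = word.lower()              # assumption: treat 'K' and 'k' as the same letter
    return {
        'last_letter': lowered[-1],
        'last_two': lowered[-2:],
        'first_letter': lowered[0],
        'length': len(lowered),
    }


print(gender_features_extended('Frank'))
```

Whether the extra features actually help would have to be checked on held-out names; more features are not automatically better.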
3. Training the model and predicting
The code is as follows:
    import nltk

    # Train
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    # Predict
    print(classifier.classify(gender_features('Frank')))
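For training, we directly use the naive Bayes classifier provided by NLTK. NLTK offers many models; here we pick only the naive Bayes classifier for the demonstration. The complete code is shown below: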
    import nltk
    from nltk.corpus import names


    def gender_features(word):
        # Use the last letter of the name as its only feature.
        return {'last_letter': word[-1]}


    # Load data and training
    names = ([(name, 'male') for name in names.words('male.txt')] +
             [(name, 'female') for name in names.words('female.txt')])

    train_set = [(gender_features(n), g) for (n, g) in names]
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    # Predict
    print(classifier.classify(gender_features('Frank')))
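To get a feel for what a naive Bayes classifier does with a single feature such as last_letter, here is a from-scratch sketch in plain Python. It is not NLTK's implementation: the toy data, the function names, and the add-one smoothing are assumptions for illustration, and NLTK uses its own probability estimators, so its numbers will differ.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data in the same (features, label) shape as train_set.
toy_train = [
    ({'last_letter': 'k'}, 'male'),
    ({'last_letter': 'n'}, 'male'),
    ({'last_letter': 'k'}, 'male'),
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'e'}, 'female'),
    ({'last_letter': 'a'}, 'female'),
]


def train_single_feature_nb(examples, feature='last_letter'):
    """Count labels and, per label, how often each feature value occurs."""
    label_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)
    for feats, label in examples:
        value_counts[label][feats[feature]] += 1
    vocab = {feats[feature] for feats, _ in examples}
    return label_counts, value_counts, vocab


def classify_single_feature_nb(model, value, alpha=1.0):
    """Pick the label maximizing log P(label) + log P(value | label)."""
    label_counts, value_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        prior = count / total
        # Add-alpha smoothing keeps unseen values from getting zero probability.
        likelihood = (value_counts[label][value] + alpha) / (count + alpha * len(vocab))
        score = math.log(prior) + math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_single_feature_nb(toy_train)
print(classify_single_feature_nb(model, 'k'))
print(classify_single_feature_nb(model, 'a'))
```

With several features, the same idea applies: the per-feature log-likelihoods are simply added, which is the "naive" independence assumption.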
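II. Determining the proportion of positive and negative reviews
1. Preparing the data
The training data in this example is made up by hand, but that does not get in the way of learning. The training data is as follows: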
    positive_vocab = ['awesome', 'outstanding', 'fantastic', 'terrific',
                      'good', 'nice', 'great', ':)']
    negative_vocab = ['bad', 'terrible', 'useless', 'hate', ':(']
    neutral_vocab = ['movie', 'the', 'sound', 'was', 'is', 'actors',
                     'did', 'know', 'words', 'not']
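We divide the training data into three classes: positive words, negative words, and neutral words.
2. Extracting features
Feature extraction here is very simple: we just attach the corresponding label to each word. The feature extraction code is as follows: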
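    def word_feats(words):
        # Map each word in the iterable to a True-valued feature.
        return dict([(word, True) for word in words])

    # Wrap each word in a list so the feature is the whole word,
    # not its individual characters.
    positive_features = [(word_feats([pos]), 'pos') for pos in positive_vocab]
    negative_features = [(word_feats([neg]), 'neg') for neg in negative_vocab]
    neutral_features = [(word_feats([neu]), 'neu') for neu in neutral_vocab]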
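Because word_feats iterates over its argument, what gets passed in matters: a Python string is itself iterable, so a bare string is split into characters. The short check below illustrates the difference, reusing the function as defined in the article:

```python
def word_feats(words):
    # Map each item of the iterable to a True-valued feature.
    return dict([(word, True) for word in words])


print(word_feats('good'))    # a bare string is iterated character by character
print(word_feats(['good']))  # a one-element list keeps the whole word as the feature
```

For word-level features, each vocabulary entry should therefore be passed as a list (or other container) of words rather than as a plain string.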
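3. Training the model and predicting
Combining the three feature lists above gives the training set:
    train_set = negative_features + positive_features + neutral_features
We again use the naive Bayes classifier to train the model:
    classifier = NaiveBayesClassifier.train(train_set)
The complete code is shown below: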
    from nltk.classify import NaiveBayesClassifier


    def word_feats(words):
        # Map each word in the iterable to a True-valued feature.
        return dict([(word, True) for word in words])


    positive_vocab = ['awesome', 'outstanding', 'fantastic', 'terrific',
                      'good', 'nice', 'great', ':)']
    negative_vocab = ['bad', 'terrible', 'useless', 'hate', ':(']
    neutral_vocab = ['movie', 'the', 'sound', 'was', 'is', 'actors',
                     'did', 'know', 'words', 'not']

    # Wrap each word in a list so the feature is the whole word,
    # not its individual characters.
    positive_features = [(word_feats([pos]), 'pos') for pos in positive_vocab]
    negative_features = [(word_feats([neg]), 'neg') for neg in negative_vocab]
    neutral_features = [(word_feats([neu]), 'neu') for neu in neutral_vocab]

    train_set = negative_features + positive_features + neutral_features
    classifier = NaiveBayesClassifier.train(train_set)

    # Predict
    neg = 0
    pos = 0
    sentence = "Awesome movie, I liked it"
    sentence = sentence.lower()
    words = sentence.split(' ')
    for word in words:
        classResult = classifier.classify(word_feats([word]))
        if classResult == 'neg':
            neg = neg + 1
        if classResult == 'pos':
            pos = pos + 1

    print('Positive: ' + str(float(pos) / len(words)))
    print('Negative: ' + str(float(neg) / len(words)))
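Splitting on spaces leaves punctuation attached to words (for example "movie," with its comma), which then fail to match the training vocabulary. The sketch below shows one possible way to tokenize more carefully and to compute the positive and negative ratios. The names `tokenize`, `label_ratios`, and `toy_classify` are hypothetical, and `toy_classify` is only a dictionary-based stand-in for the trained classifier so that the example runs on its own.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase the text and keep the emoticons :) and :( plus runs of letters."""
    return re.findall(r":\)|:\(|[a-z']+", sentence.lower())


def label_ratios(tokens, classify):
    """Return the fraction of tokens that classify() labels 'pos' and 'neg'."""
    if not tokens:
        return {'pos': 0.0, 'neg': 0.0}
    counts = Counter(classify(tok) for tok in tokens)
    return {label: counts[label] / len(tokens) for label in ('pos', 'neg')}


# Hypothetical stand-in for the trained model: known words map to a label,
# everything else is treated as neutral.
toy_labels = {'awesome': 'pos', 'good': 'pos', 'bad': 'neg', 'hate': 'neg'}


def toy_classify(token):
    return toy_labels.get(token, 'neu')


tokens = tokenize("Awesome movie, I liked it")
print(tokens)
print(label_ratios(tokens, toy_classify))
```

To use this with the classifier trained above, the stand-in could be replaced by a function such as `lambda tok: classifier.classify(word_feats([tok]))`.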
Related documents

Category: NLTK <https://pythonspot.com/category/nltk/>

