The complete collection of machine learning algorithms is available at fenghaootong-github:
<https://github.com/fenghaotong/MachineLearning/tree/master/LogisticRegression>

House Price Prediction

Dataset description

The training data has 81 columns, including Id and the target SalePrice.

SalePrice - the property’s sale price in dollars. This is the target variable
that you’re trying to predict.
MSSubClass: The building class
MSZoning: The general zoning classification
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
Alley: Type of alley access
LotShape: General shape of property
LandContour: Flatness of the property
Utilities: Type of utilities available
LotConfig: Lot configuration
LandSlope: Slope of property
….

Import the required modules

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import math as mat
    from scipy import stats
    from scipy.stats import norm
    from sklearn import preprocessing
    import statsmodels.api as sm
    from patsy import dmatrices
    import warnings
    warnings.filterwarnings('ignore')
    %matplotlib inline
    import sklearn.linear_model as LinReg
    import sklearn.metrics as metrics

Load the data

    # loading the data
    data_train = pd.read_csv('../DATA/SalePrice_train.csv')
    data_test = pd.read_csv('../DATA/SalePrice_test.csv')

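The dataset has 81 columns; for ease of illustration, only the following 7 features are considered:

OverallQual
GrLivArea
GarageCars
TotalBsmtSF
1stFlrSF
FullBath
YearBuilt

These features were chosen because they are comparatively strongly correlated with the sale price. Note that the vars list in the preprocessing code contains six of them; 1stFlrSF is not included.

For details on how the features were selected, see the data cleaning notebook: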
<https://github.com/fenghaotong/MachineLearning/blob/master/LogisticRegression/DataExploration.ipynb>
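The selection criterion used here (keep features whose correlation with SalePrice exceeds 0.5) can be illustrated with a minimal sketch. It runs on synthetic data, not the real housing set: the column Noise, the coefficients, and the sample size are assumptions made up for the illustration, and the linked notebook may proceed differently.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (hypothetical values, not the real dataset).
rng = np.random.default_rng(0)
n = 500
quality = rng.integers(1, 11, size=n).astype(float)
area = rng.normal(1500, 400, size=n)
noise_feature = rng.normal(0, 1, size=n)
price = 20000 * quality + 150 * area + rng.normal(0, 20000, size=n)

df = pd.DataFrame({
    "OverallQual": quality,
    "GrLivArea": area,
    "Noise": noise_feature,
    "SalePrice": price,
})

# Pearson correlation of every column with the target, excluding the target itself.
corr_with_target = df.corr()["SalePrice"].drop("SalePrice")

# Keep the features whose absolute correlation exceeds the threshold.
threshold = 0.5
selected = corr_with_target[corr_with_target.abs() > threshold].index.tolist()

print(corr_with_target.sort_values(ascending=False))
print(selected)
```

The unrelated Noise column falls below the threshold and is dropped, while the two columns that drive the synthetic price are kept.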

Data preprocessing

    data_train.shape

Output:

    (1460, 81)

    vars = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
    Y = data_train[['SalePrice']]   # dim (1460, 1)
    ID_train = data_train[['Id']]   # dim (1460, 1)
    ID_test = data_test[['Id']]     # dim (1459, 1)
    # extract only the relevant features with cross-correlation > 0.5 with respect to SalePrice
    X_matrix = data_train[vars]
    X_matrix.shape                  # dim (1460, 6)
    # copy so that the later fillna assignments modify an independent frame
    X_test = data_test[vars].copy()
    X_test.shape                    # dim (1459, 6)

Output:

    (1459, 6)

Check for missing data

    # check for missing data in the training features
    total = X_matrix.isnull().sum().sort_values(ascending=False)
    # note: count() excludes missing entries, so this ratio is missing / non-missing
    percent = (X_matrix.isnull().sum() / X_matrix.count()).sort_values(ascending=False)
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    missing_data.head(20)
    # no missing data in this training set

| | Total | Percent |
|---|---|---|
| YearBuilt | 0 | 0.0 |
| FullBath | 0 | 0.0 |
| TotalBsmtSF | 0 | 0.0 |
| GarageCars | 0 | 0.0 |
| GrLivArea | 0 | 0.0 |
| OverallQual | 0 | 0.0 |

    total = X_test.isnull().sum().sort_values(ascending=False)
    percent = (X_test.isnull().sum() / X_test.count()).sort_values(ascending=False)
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    missing_data.head(20)
    # missing data in this test set

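| | Total | Percent |
|---|---|---|
| TotalBsmtSF | 1 | 0.000686 |
| GarageCars | 1 | 0.000686 |
| YearBuilt | 0 | 0.000000 |
| FullBath | 0 | 0.000000 |
| GrLivArea | 0 | 0.000000 |
| OverallQual | 0 | 0.000000 |

    # help(mat.ceil)  # ceil rounds up to the nearest integer
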
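The Percent column above divides the number of missing values by count(), which counts only the non-missing entries. The sketch below, on a small made-up frame (the values are hypothetical), contrasts that ratio with the fraction of all rows that are missing, which isnull().mean() gives directly.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame with one missing value in each of two columns.
X = pd.DataFrame({
    "TotalBsmtSF": [800.0, np.nan, 1200.0, 950.0],
    "GarageCars": [2.0, 1.0, np.nan, 2.0],
    "YearBuilt": [1995, 2003, 1978, 2010],
})

total = X.isnull().sum()
# isnull().mean() divides by the number of rows (missing and non-missing alike).
fraction_of_rows = X.isnull().mean()
# Dividing by count() uses only the non-missing rows as the denominator.
ratio_to_present = X.isnull().sum() / X.count()

summary = pd.concat(
    [total, fraction_of_rows, ratio_to_present],
    axis=1,
    keys=["Total", "FractionOfRows", "RatioToPresent"],
).sort_values("Total", ascending=False)
print(summary)
```

With one missing value in four rows, the first ratio is 1/4 and the second is 1/3. On a dataset with only one missing value in 1459 rows the two differ only in the last digit.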
Replace the missing values with the mean

    # replace the missing values with the column mean
    X_test['TotalBsmtSF'] = X_test['TotalBsmtSF'].fillna(X_test['TotalBsmtSF'].mean())
    # GarageCars is a count, so its mean is rounded up to a whole number
    X_test['GarageCars'] = X_test['GarageCars'].fillna(mat.ceil(X_test['GarageCars'].mean()))
    total = X_test.isnull().sum().sort_values(ascending=False)
    percent = (X_test.isnull().sum() / X_test.count()).sort_values(ascending=False)
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    missing_data.head(20)

| | Total | Percent |
|---|---|---|
| YearBuilt | 0 | 0.0 |
| FullBath | 0 | 0.0 |
| TotalBsmtSF | 0 | 0.0 |
| GarageCars | 0 | 0.0 |
| GrLivArea | 0 | 0.0 |
| OverallQual | 0 | 0.0 |

    X_test.shape

Output:

    (1459, 6)

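* Next comes feature scaling with the preprocessing module.
The module also provides the utility class StandardScaler, which implements the Transformer API: it computes the mean and standard deviation on the training set so that the same transformation can later be reapplied to the test set. The code here uses MaxAbsScaler instead, which divides each feature by its maximum absolute value.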

    max_abs_scaler = preprocessing.MaxAbsScaler()
    X_train_maxabs = max_abs_scaler.fit_transform(X_matrix)
    print(X_train_maxabs)

Output:

    [[ 0.7         0.30308401  0.5         0.1400982   0.66666667  0.99651741]
     [ 0.6         0.22367955  0.5         0.20654664  0.66666667  0.98308458]
     [ 0.7         0.31655441  0.5         0.15057283  0.66666667  0.99552239]
     ...,
     [ 0.7         0.41474654  0.25        0.18854337  0.66666667  0.96567164]
     [ 0.5         0.191067    0.25        0.17643208  0.33333333  0.97014925]
     [ 0.5         0.22261609  0.25        0.20556465  0.33333333  0.97761194]]

    # note: fit_transform re-fits the scaler on the test set, so the test features
    # are scaled by the test maxima rather than the training maxima;
    # max_abs_scaler.transform(X_test) would reuse the training-set scaling
    X_test_maxabs = max_abs_scaler.fit_transform(X_test)
    print(X_test_maxabs)

Output:

    [[ 0.5         0.17585868  0.2         0.17311089  0.25        0.97562189]
     [ 0.6         0.26084396  0.2         0.26084396  0.25        0.97412935]
     [ 0.5         0.31972522  0.4         0.18213935  0.5         0.99353234]
     ...,
     [ 0.5         0.24023553  0.4         0.24023553  0.25        0.97512438]
     [ 0.5         0.19038273  0.          0.17899902  0.25        0.99104478]
     [ 0.7         0.39254171  0.6         0.19548577  0.5         0.99154229]]

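The text above describes computing a transformation on the training set and reapplying it to the test set. A minimal sketch of that pattern with MaxAbsScaler follows; the toy matrices and variable names are hypothetical. The scaler is fitted once on the training data, and the test data is passed through transform only.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical toy matrices; the two columns play the role of two features.
X_train_toy = np.array([[2.0, 100.0], [4.0, 300.0], [1.0, 200.0]])
X_test_toy = np.array([[5.0, 150.0], [2.0, 600.0]])

scaler = MaxAbsScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)  # learns max |x| per column from the training data
X_test_scaled = scaler.transform(X_test_toy)        # reuses the training maxima, no re-fitting

print(scaler.max_abs_)   # per-column maxima learned from the training data
print(X_test_scaled)     # test values can exceed 1 when they exceed the training maximum
```

Scaled this way, a given raw value maps to the same scaled value in both sets, which is what a model trained on the scaled training data expects.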
Model training

    lr = LinReg.LinearRegression().fit(X_train_maxabs, Y)

模型预测
Y_pred_train = lr.predict(X_train_maxabs) print("Los Reg performance
evaluation on Y_pred_train") print("R-squared =", metrics.r2_score(Y,
Y_pred_train)) Los Reg performance evaluation on Y_pred_train R-squared =
0.768647335422 Y_pred_test = lr.predict(X_test_maxabs) print("Lin Reg
performance evaluation on X_test") #print("R-squared =", metrics.r2_score(Y,
Y_pred_test)) print("Coefficients =", lr.coef_) Lin Reg performance evaluation
on X_test Coefficients = [[ 205199.68775757 305095.8264889 58585.26325362
178302.68126933 -16511.92112734 676458.9666186 ]]
Logistic Regression

Import modules

    # import modules
    import pandas as pd
    import numpy as np

Data preprocessing

    # create the list of column names
    column_names = ['Sample code number', 'Clump Thickness', 'Uniformity of Cell Size',
                    'Uniformity of Cell Shape', 'Marginal Adhesion', 'Single Epithelial Cell Size',
                    'Bare Nuclei', 'Bland Chromatin', 'Normal Nucleoli', 'Mitoses', 'Class']
    # read the dataset with pandas.read_csv (a local copy here)
    data = pd.read_csv('DATA/data.csv', names=column_names)
    # replace '?' with the standard missing-value marker
    data = data.replace(to_replace='?', value=np.nan)
    # drop rows with missing values (a row is dropped if any column is missing)
    data = data.dropna(how='any')
    # check the number of rows and columns
    data.shape

Output:

    (683, 11)

    data.head(10)

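| | Sample code number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
| 5 | 1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 4 |
| 6 | 1018099 | 1 | 1 | 1 | 1 | 2 | 10 | 3 | 1 | 1 | 2 |
| 7 | 1018561 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 8 | 1033078 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 5 | 2 |
| 9 | 1033078 | 4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | 2 |
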
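The cleaning step above (replace '?' with NaN, then drop incomplete rows) can be tried on a small inline sample. This is a sketch under assumptions: the three rows and the reduced column list are made up, and the final to_numeric conversion is an addition that the original code does not perform. It is included because a column that contained '?' is read as text.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row excerpt in the same style; '?' marks a missing value.
raw = io.StringIO("1000025,5,1,2\n1057013,8,?,4\n1015425,3,2,2\n")
toy = pd.read_csv(raw, names=["Sample code number", "Clump Thickness", "Bare Nuclei", "Class"])

toy = toy.replace(to_replace="?", value=np.nan)  # '?' becomes a proper missing value
toy = toy.dropna(how="any")                      # drop every row that has a missing entry

# The column that contained '?' was read as strings, so convert it back to numbers.
toy["Bare Nuclei"] = pd.to_numeric(toy["Bare Nuclei"])

print(toy.shape)
print(toy.dtypes)
```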
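Because the original data does not come with a separate test set for evaluating the model, the labeled data is split here: 25% is used as the test set and the rest as the training set.

    # split the dataset with train_test_split; the sklearn.cross_validation module was
    # removed in scikit-learn 0.20, and the function now lives in sklearn.model_selection
    from sklearn.model_selection import train_test_split
    # randomly sample 25% of the data for testing; the remaining 75% forms the training set
    X_train, X_test, y_train, y_test = train_test_split(
        data[column_names[1:10]], data[column_names[10]], test_size=0.25, random_state=33)
    # check the size and class distribution of the training set
    y_train.value_counts()

Output:

    2    344
    4    168
    Name: Class, dtype: int64

    # check the size and class distribution of the test set
    y_test.value_counts()

Output:

    2    100
    4     71
    Name: Class, dtype: int64
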
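The split above is purely random, so the class proportions in the two parts can drift apart (about 33% class 4 in training versus about 42% in testing here). As an optional variant that the original does not use, train_test_split accepts a stratify argument that keeps the proportions equal. The sketch below uses hypothetical labels in the dataset's coding (2 = benign, 4 = malignant).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels using the dataset's coding: 2 = benign, 4 = malignant.
y_toy = np.array([2] * 60 + [4] * 40)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 60/40 class ratio in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=33, stratify=y_toy
)

print(np.unique(y_tr, return_counts=True))
print(np.unique(y_te, return_counts=True))
```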
Build the model and make predictions

    # import StandardScaler from sklearn.preprocessing
    from sklearn.preprocessing import StandardScaler
    # import LogisticRegression from sklearn.linear_model
    from sklearn.linear_model import LogisticRegression
    # import SGDClassifier (stochastic gradient descent) from sklearn.linear_model;
    # it is imported here but not used in this section
    from sklearn.linear_model import SGDClassifier

    # standardize the features: fit on the training set, reuse the same scaling on the test set
    ss = StandardScaler()
    X_train = ss.fit_transform(X_train)
    X_test = ss.transform(X_test)

    lr = LogisticRegression()
    # train the logistic regression parameters with fit
    lr.fit(X_train, y_train)
    # predict the classes of the test samples
    lr_y_predict = lr.predict(X_test)
    lr_y_predict

Output:

    array([2, 2, 4, 4, 2, 2, 2, 4, 2, 2, 2, 2, 4, 2, 4, 4, 4, 4, 4, 2,
           2, 4, 4, 2, 4, 4, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 2, 4, 2, 2,
           4, 2, 2, 4, 4, 2, 2, 2, 4, 2, 2, 2, 2, 2, 4, 4, 2, 2, 2, 4, 2, 2, 2, 2, 4, 2,
           2, 2, 2, 2, 2, 4, 4, 2, 2, 2, 4, 2, 2, 2, 4, 2, 4, 2, 4, 4, 2, 2, 2, 2, 4, 4,
           2, 2, 2, 4, 2, 2, 4, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 4, 2, 2, 4, 4, 2, 4,
           2, 2, 2, 4, 2, 2, 4, 4, 2, 4, 4, 2, 2, 2, 2, 4, 2, 4, 2, 4, 2, 2, 2, 2, 2, 4,
           4, 2, 4, 4, 2, 4, 2, 2, 2, 2, 4, 4, 4, 2, 4, 2, 2, 4, 2, 4, 4], dtype=int64)

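The scale-then-fit sequence above can also be written as a single pipeline, so that the scaler is always fitted on the training data only. This is an alternative formulation, not the original code, and it runs on synthetic data from make_classification (the sample size and feature counts are arbitrary assumptions). It also shows predict_proba, which returns class probabilities instead of hard labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem with 9 features (hypothetical data).
X_syn, y_syn = make_classification(
    n_samples=400, n_features=9, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=33)

# The pipeline fits the scaler on X_tr and reuses that scaling when predicting on X_te.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_tr, y_tr)

pred = clf.predict(X_te)         # hard class labels
proba = clf.predict_proba(X_te)  # one probability per class, rows sum to 1

print(pred[:10])
print(proba[:3])
```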
Performance analysis of the linear classifier on the benign/malignant tumor prediction task

    # import classification_report from sklearn.metrics
    from sklearn.metrics import classification_report
    # use the model's built-in score function to get the accuracy on the test set
    print('Accuracy of LR Classifier:', lr.score(X_test, y_test))
    # use classification_report to get the other three metrics (precision, recall, F1 score)
    print(classification_report(y_test, lr_y_predict, target_names=['Benign', 'Malignant']))

Output:

    Accuracy of LR Classifier: 0.988304093567
                 precision    recall  f1-score   support

         Benign       0.99      0.99      0.99       100
      Malignant       0.99      0.99      0.99        71

    avg / total       0.99      0.99      0.99       171
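The three metrics in the report follow directly from the confusion matrix. The sketch below computes them by hand for the malignant class (label 4) on ten made-up labels; the arrays are hypothetical and unrelated to the results above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical true and predicted labels: 2 = benign, 4 = malignant.
y_true = np.array([2, 2, 2, 2, 4, 4, 4, 2, 4, 2])
y_pred = np.array([2, 2, 4, 2, 4, 4, 2, 2, 4, 2])

# With labels=[2, 4], rows are true classes and columns are predicted classes,
# so ravel() yields TN, FP, FN, TP when 4 (malignant) is the positive class.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[2, 4]).ravel()

precision = tp / (tp + fp)   # of the samples predicted malignant, how many are malignant
recall = tp / (tp + fn)      # of the malignant samples, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
print(precision_score(y_true, y_pred, pos_label=4),
      recall_score(y_true, y_pred, pos_label=4),
      f1_score(y_true, y_pred, pos_label=4))
```

In a medical setting, recall on the malignant class is often the number to watch, since a false negative is a missed tumor.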