Exploratory Data Analysis

type(iris.data)    # numpy.ndarray
type(iris.target)  # numpy.ndarray
print(iris.feature_names)
print(iris.target_names)
X = iris.data
y = iris.target

Out:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
['setosa' 'versicolor' 'virginica']

4 features: sepal length, sepal width, petal length, petal width
3 classes: setosa, versicolor, virginica

The first 10 rows of X:

[[5.1, 3.5, 1.4, 0.2],
 [4.9, 3. , 1.4, 0.2],
 [4.7, 3.2, 1.3, 0.2],
 [4.6, 3.1, 1.5, 0.2],
 [5. , 3.6, 1.4, 0.2],
 [5.4, 3.9, 1.7, 0.4],
 [4.6, 3.4, 1.4, 0.3],
 [5. , 3.4, 1.5, 0.2],
 [4.4, 2.9, 1.4, 0.2],
 [4.9, 3.1, 1.5, 0.1]]

print(dict(zip(*np.unique(y, return_counts=True))))

Out:
{0: 50, 1: 50, 2: 50}

print(stats.describe(X))

Out:
DescribeResult(nobs=150L,
  minmax=(array([4.3, 2. , 1. , 0.1]), array([7.9, 4.4, 6.9, 2.5])),
  mean=array([5.84333333, 3.054     , 3.75866667, 1.19866667]),
  variance=array([0.68569351, 0.18800403, 3.11317942, 0.58241432]),
  skewness=array([ 0.31175306,  0.33070281, -0.27171195, -0.10394367]),
  kurtosis=array([-0.57356795,  0.2414433 , -1.3953593 , -1.33524564]))

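
As a rough cross-check, most fields of the DescribeResult can be reproduced with plain NumPy. The sketch below runs on a small made-up array (`demo`, not the iris data) and assumes that `stats.describe` reports the sample variance, i.e. with `ddof=1`, which is its default.

```python
import numpy as np

# Small made-up data set: 4 samples, 2 features (not the iris data).
demo = np.array([[1.0, 2.0],
                 [2.0, 4.0],
                 [3.0, 6.0],
                 [6.0, 8.0]])

nobs = demo.shape[0]                  # number of samples
col_min = demo.min(axis=0)            # per-feature minimum
col_max = demo.max(axis=0)            # per-feature maximum
col_mean = demo.mean(axis=0)          # per-feature mean
col_var = demo.var(axis=0, ddof=1)    # sample variance (divides by n - 1)

print(nobs, col_min, col_max, col_mean, col_var)
```

Because the variance is the sample variance, the `variance` values reported for X should match the diagonal of the covariance matrix computed with `np.cov` in the PCA part.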
PCA Dimensionality Reduction

cov = np.cov(X.T)
print(np.round(cov, decimals=2))  # print the result rounded to 2 decimal places

Out:
[[ 0.69, -0.04,  1.27,  0.52],
 [-0.04,  0.19, -0.32, -0.12],
 [ 1.27, -0.32,  3.11,  1.3 ],
 [ 0.52, -0.12,  1.3 ,  0.58]]

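(PS: when the data has been centered, that is, the mean has been subtracted from every sample, the covariance matrix can also be obtained as $\frac{1}{n-1}X^TX$: the transpose of the data matrix multiplied by the matrix itself, divided by n − 1, where n is the number of samples.)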
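
The centered-data formula can be checked numerically against `np.cov`. This is a minimal sketch on made-up random data (`demo` is a stand-in, not the iris data), written for Python 3 with a recent NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)         # fixed seed, made-up data
demo = rng.normal(size=(20, 4))        # 20 samples, 4 features

centered = demo - demo.mean(axis=0)    # subtract each feature's mean
n = demo.shape[0]
cov_manual = centered.T.dot(centered) / (n - 1)

cov_numpy = np.cov(demo.T)             # np.cov expects variables in rows
print(np.allclose(cov_manual, cov_numpy))
```

The two matrices should agree up to floating-point error. Note the divisor is n − 1 rather than n, which matches the default behaviour of `np.cov`.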

eig_val, eig_vec = np.linalg.eig(cov)

Out:
array([4.22484077, 0.24224357, 0.07852391, 0.02368303])
array([[ 0.36158968, -0.65653988, -0.58099728,  0.31725455],
       [-0.08226889, -0.72971237,  0.59641809, -0.32409435],
       [ 0.85657211,  0.1757674 ,  0.07252408, -0.47971899],
       [ 0.35884393,  0.07470647,  0.54906091,  0.75112056]])
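
To make the output of `np.linalg.eig` easier to read: the eigenvectors are the columns of the second array, and column i belongs to eigenvalue i. The sketch below checks this on a made-up 2x2 symmetric matrix (`demo_cov`, a stand-in for a covariance matrix).

```python
import numpy as np

# Made-up symmetric 2x2 matrix standing in for a covariance matrix.
demo_cov = np.array([[2.0, 1.0],
                     [1.0, 2.0]])

vals, vecs = np.linalg.eig(demo_cov)

# Column vecs[:, i] is the eigenvector for vals[i]: demo_cov . v = lambda * v
for i in range(len(vals)):
    v = vecs[:, i]
    assert np.allclose(demo_cov.dot(v), vals[i] * v)

# Each eigenvector is normalized to unit length.
print(np.linalg.norm(vecs, axis=0))
```

`np.linalg.eig` does not promise any particular ordering of the eigenvalues, which is why the next step sorts them explicitly. For symmetric matrices such as a covariance matrix, `np.linalg.eigh` is an alternative that returns the eigenvalues in ascending order.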

indices = np.argsort(np.abs(eig_val))[::-1][:2]

Out:
array([0, 1], dtype=int64)
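
The indexing expression packs three steps into one line: sort by magnitude, reverse to descending order, keep the first two. A small sketch with made-up eigenvalues (`demo_vals` is hypothetical) shows each step separately.

```python
import numpy as np

demo_vals = np.array([0.3, 4.0, -2.5, 1.2])   # made-up eigenvalues

order = np.argsort(np.abs(demo_vals))  # indices, smallest magnitude first
top2 = order[::-1][:2]                 # reverse, then keep the two largest

print(order)
print(top2)
```

Here the two largest magnitudes are 4.0 and 2.5, so `top2` holds their positions, 1 and 2.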

transform_matrix = eig_vec[:, indices]

Out:
[[ 0.36158968, -0.65653988],
 [-0.08226889, -0.72971237],
 [ 0.85657211,  0.1757674 ],
 [ 0.35884393,  0.07470647]]

new_X = np.dot(X, transform_matrix)

Out:
[[ 2.82713597, -5.64133105],
 [ 2.79595248, -5.14516688],
 [ 2.62152356, -5.17737812],
 [ 2.7649059 , -5.00359942],
 [ 2.78275012, -5.64864829],
 [ 3.23144574, -6.06250644],
 [ 2.69045242, -5.23261922],
 [ 2.8848611 , -5.48512908],
 [ 2.62338453, -4.7439257 ],
 [ 2.83749841, -5.20803203]
 ...
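
One way to judge how much information the two retained components keep is the share of the total variance that their eigenvalues account for. This is an added illustration rather than a step of the walkthrough; it reuses the eigenvalues printed by `np.linalg.eig` above.

```python
import numpy as np

# Eigenvalues copied from the np.linalg.eig output above.
eig_val = np.array([4.22484077, 0.24224357, 0.07852391, 0.02368303])

ratio = eig_val / eig_val.sum()   # share of total variance per component
kept = ratio[:2].sum()            # share kept by the first two components

print(np.round(ratio, 4))
print(round(kept, 4))
```

With these eigenvalues the first two components account for roughly 98% of the total variance, which is why dropping the other two loses little.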

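* Compute the covariance matrix.
* Perform an eigendecomposition (diagonalization) of the covariance matrix.
* Take the eigenvectors whose eigenvalues have the largest absolute values as the transformation matrix, and use it to reduce the dimensionality of the original data.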
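
The three steps above can be collected into one function. This is a minimal sketch, not a reference implementation: `pca_project` and `demo` are hypothetical names, and the data is random rather than the iris set.

```python
import numpy as np


def pca_project(data, k):
    """Project data (n_samples x n_features) onto k eigenvectors.

    Hypothetical helper that mirrors the three steps listed above.
    """
    cov = np.cov(data.T)                              # step 1: covariance matrix
    eig_val, eig_vec = np.linalg.eig(cov)             # step 2: eigendecomposition
    indices = np.argsort(np.abs(eig_val))[::-1][:k]   # step 3: top-k eigenvalues
    transform_matrix = eig_vec[:, indices]            # shape: n_features x k
    return np.dot(data, transform_matrix)


rng = np.random.default_rng(0)
demo = rng.normal(size=(30, 4))   # made-up data, not the iris set
reduced = pca_project(demo, 2)
print(reduced.shape)
```

Like the walkthrough, this projects the raw data without centering it first. Subtracting `data.mean(axis=0)` before the projection would only shift the projected coordinates by a constant offset, since the covariance matrix itself does not depend on the mean.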