For PCA (principal component analysis), there is a very accessible article: "Principal Component Analysis (PCA): the most detailed and comprehensive explanation"

<http://mp.weixin.qq.com/s?__biz=MjM5MTgyNTk1NA==&mid=2649907627&idx=2&sn=e65f700bee531da5b16c6700ccf6a693&source=41#wechat_redirect>

I won't repeat that material here. This post mainly shows how the PCA and LDA algorithms apply to the MNIST dataset.
The main reference is a Kaggle Kernel, which you can also read directly. The link is: Interactive Intro to Dimensionality Reduction

<https://www.kaggle.com/arthurtok/interactive-intro-to-dimensionality-reduction>

Python code:

Import some basic libraries
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import seaborn as sns
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import matplotlib

# Import PCA
from sklearn.decomposition import PCA

PCA

In short, PCA is a linear transformation algorithm that tries to project the original features of the data onto a smaller set of features (a subspace) while retaining most of the information. To do this, the algorithm looks for the most suitable directions (the principal components) in the new subspace, namely the directions that maximize the variance. Why maximize the variance? See the first link in this article.
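
As a minimal sketch of this idea (not code from the Kernel), the following uses a small synthetic 2-D dataset, an assumption made purely for illustration. It checks that the variance of the data projected onto the leading eigenvector of the covariance matrix equals the largest eigenvalue, and that no other unit direction among those sampled gives a larger variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, correlated 2-D data (illustrative only, not MNIST)
X_demo = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X_centered = X_demo - X_demo.mean(axis=0)

# Eigendecomposition of the covariance matrix; eigh returns eigenvalues in ascending order
cov_demo = np.cov(X_centered.T)
demo_vals, demo_vecs = np.linalg.eigh(cov_demo)
first_pc = demo_vecs[:, -1]  # direction with the largest eigenvalue

# Variance of the data projected onto the first principal component
var_along_pc = np.var(X_centered @ first_pc, ddof=1)

# Variance along 180 other unit directions, for comparison
angles = np.linspace(0.0, np.pi, 180, endpoint=False)
directions = np.column_stack([np.cos(angles), np.sin(angles)])
var_along_others = np.var(X_centered @ directions.T, axis=0, ddof=1)

print(var_along_pc, demo_vals[-1], var_along_others.max())
```

The first two printed numbers should agree, and the third should not exceed them.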

MNIST data set

MNIST is a dataset of handwritten digits used in computer vision, and it is basically the entry-level dataset of machine learning. You can download it from Kaggle: MNIST Dataset
<https://www.kaggle.com/arthurtok/interactive-intro-to-dimensionality-reduction/data>
# train is the MNIST training set loaded as a pandas DataFrame; train.shape is (42000, 785)
# Separate the training features from the labels
target = train['label']
train = train.drop("label", axis=1)  # Drop the label column

Computing the eigenvectors

First, standardize each feature to obtain X_std, then compute the covariance matrix cov_mat and its eigenvalues eig_vals and eigenvectors eig_vecs. Pair each eigenvalue with its eigenvector to form eig_pairs, and sort the pairs by eigenvalue in descending order.
Then compute the sum of the eigenvalues tot, the percentage of that sum contributed by each eigenvalue var_exp, and the cumulative percentage cum_var_exp.
# Standardize the dataset
from sklearn.preprocessing import StandardScaler
X = train.values
X_std = StandardScaler().fit_transform(X)

# Compute the eigenvalues and eigenvectors of the covariance matrix
mean_vec = np.mean(X_std, axis=0)
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

# Create a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) pairs from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Compute the explained variance from the eigenvalues
tot = sum(eig_vals)  # Sum of the eigenvalues
var_exp = [(i / tot) * 100 for i in sorted(eig_vals, reverse=True)]  # Percentage for each eigenvalue
cum_var_exp = np.cumsum(var_exp)  # Cumulative percentage
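
The steps above can also be sketched end to end on a small random matrix. This version swaps np.linalg.eig for np.linalg.eigh, which is intended for symmetric matrices such as a covariance matrix and always returns real eigenvalues. The synthetic data and the variable names are assumptions for illustration, not part of the original Kernel.

```python
import numpy as np

rng = np.random.default_rng(42)
X_small = rng.normal(size=(200, 6))  # stand-in for the MNIST feature matrix

# Standardize each column to zero mean and unit variance
X_small_std = (X_small - X_small.mean(axis=0)) / X_small.std(axis=0)

# eigh handles symmetric matrices and returns real eigenvalues in ascending order
cov_small = np.cov(X_small_std.T)
vals_demo, vecs_demo = np.linalg.eigh(cov_small)

# Reorder so the largest eigenvalue (and its eigenvector) comes first
order = np.argsort(vals_demo)[::-1]
vals_demo, vecs_demo = vals_demo[order], vecs_demo[:, order]

# Individual and cumulative explained variance, in percent
var_exp_demo = vals_demo / vals_demo.sum() * 100
cum_var_exp_demo = np.cumsum(var_exp_demo)

print(np.round(var_exp_demo, 2))
print(np.round(cum_var_exp_demo, 2))
```

Because eigh already returns real values, no np.abs call is needed before sorting.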
Plotting the individual and cumulative explained variance

For the plotting code, refer to the original Kernel; it is not reproduced here.

In the figure, the inset in the upper right corner shows the same plot as the main one, just zoomed in. The yellow and green lines are the cumulative explained variance, which reaches 100% as expected. The black and red lines are the explained variance of the individual eigenvalues. The x-axis runs up to 784, one position for each of the 784 feature dimensions.

As we can see, out of our 784 features (columns), about 90% of the explained variance can be described by just over 200 features. So if you want to apply PCA, extracting the first 200 components is a logical choice, since they already account for about 90% of the information.
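
A small helper can make this choice explicit instead of reading it off the plot. The function name components_for_variance and the toy eigenvalues are hypothetical, introduced only for illustration.

```python
import numpy as np

def components_for_variance(eigenvalues, threshold=0.90):
    """Return the smallest number of leading components whose
    cumulative share of the total variance reaches `threshold`."""
    vals = np.sort(np.abs(np.asarray(eigenvalues, dtype=float)))[::-1]
    cumulative = np.cumsum(vals) / vals.sum()
    cumulative[-1] = 1.0  # guard against floating-point rounding in the last entry
    # argmax returns the index of the first True entry
    return int(np.argmax(cumulative >= threshold)) + 1

# Toy eigenvalues: shares are 50%, 30%, 10%, 10%
toy_eigenvalues = [5.0, 3.0, 1.0, 1.0]
print(components_for_variance(toy_eigenvalues, threshold=0.75))  # 2 components reach 80%
```

Passing the eigenvalues computed earlier with threshold=0.90 should give the count that the plot suggests.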

Visualizing the eigenvectors

As mentioned above, PCA tries to capture the directions of maximum variance (the eigenvectors), so it can be useful to visualize these directions and their associated eigenvalues. To keep it quick, only the first 30 eigenvectors are extracted here (using sklearn's .components_ attribute).
# Call sklearn's PCA
n_components = 30
pca = PCA(n_components=n_components).fit(train.values)

# Extract the PCA principal components. The variable is named "eigenvalues",
# but pca.components_ actually holds the eigenvectors.
eigenvalues = pca.components_

# Plot the first 28 components in a 4 x 7 grid
n_row = 4
n_col = 7
plt.figure(figsize=(13, 12))
for i in list(range(n_row * n_col)):
    plt.subplot(n_row, n_col, i + 1)
    plt.imshow(eigenvalues[i].reshape(28, 28), cmap='jet')
    title_text = 'Eigenvalue ' + str(i + 1)
    plt.title(title_text, size=6.5)
    plt.xticks(())
    plt.yticks(())
plt.show()

The subplots show the top directions, or principal component axes, that PCA generates for our MNIST dataset (28 of the 30 extracted components are plotted). Interestingly, comparing the first component ("Eigenvalue 1") with the 28th ("Eigenvalue 28"), it is clear that increasingly complex directions or components are generated in the search for maximum variance, in order to maximize the variance in the new feature subspace.

Implementing PCA with sklearn

Now, using the sklearn toolkit, principal component analysis is implemented as follows:
# Delete the X object created earlier
del X
# Take only the first 6000 samples to speed up the computation
X = train[:6000].values
del train
# Standardize the data
X_std = StandardScaler().fit_transform(X)
# Call the PCA method with 5 components
pca = PCA(n_components=5)
pca.fit(X_std)
X_5d = pca.transform(X_std)
# Take the first 6000 labels as "Target"
Target = target[:6000]

X_std is the standardized data, with shape (6000, 784); X_5d is the data after dimensionality reduction, with shape (6000, 5).
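
To see the same change of shape without downloading MNIST, here is a hedged sketch on scikit-learn's bundled 8x8 digits dataset (load_digits), used as a small stand-in. The dataset choice, the 600-sample subset, and the variable names are assumptions, not part of the original Kernel.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 8x8 digit images flattened to 64 features (stand-in for MNIST)
digits = load_digits()
X_digits = digits.data[:600]  # a subset, mirroring the 6000-sample subset above

X_digits_std = StandardScaler().fit_transform(X_digits)

pca_digits = PCA(n_components=5)
X_digits_5d = pca_digits.fit_transform(X_digits_std)

print(X_digits_std.shape, X_digits_5d.shape)
# Share of the total variance kept by each of the 5 components
print(pca_digits.explained_variance_ratio_)
```

The attribute explained_variance_ratio_ plays the same role as var_exp in the manual computation, expressed as fractions rather than percentages.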

Scatter plot of the first 2 principal components

With dimensionality reduction methods, scatter plots are the most common visualization because they make any clusters easy to see, and that is exactly what plotting the first 2 principal components does. The scatter plot uses the first and second of the 5 principal components as the horizontal and vertical coordinates; the code is in the original Kernel.

As the scatter plot shows, some distinct clusters can be made out from the grouped patches of color. These clusters potentially correspond to different digits. However, PCA is unsupervised: the plot has colors only because the labels were used when drawing it. Without labels, you would not see these colors. So, with no colors, how can we separate our data points in the new feature space?

First, we set up sklearn's KMeans clustering and use its fit_predict method to compute the cluster centers and predict a cluster index for each PCA-projected sample (to see whether any discernible clusters can be observed). KMeans cluster scatter plot:
from sklearn.cluster import KMeans  # KMeans clustering
# Set up KMeans clustering with 9 clusters (9 chosen sneakily ;) as hopefully we get back our 9 class labels)
kmeans = KMeans(n_clusters=9)
# Compute cluster centers and predict cluster indices
X_clustered = kmeans.fit_predict(X_5d)
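
One hedged way to check how well such clusters line up with the true digits, without relying on a plot, is the adjusted Rand index. The sketch below uses scikit-learn's bundled digits dataset as a stand-in for MNIST and 10 clusters (one per digit class). Both choices, along with the fixed random_state, are assumptions for illustration rather than the Kernel's settings.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_std_demo = StandardScaler().fit_transform(digits.data)
X_5d_demo = PCA(n_components=5).fit_transform(X_std_demo)

# n_init is set explicitly because its default differs across scikit-learn versions
kmeans_demo = KMeans(n_clusters=10, n_init=10, random_state=0)
clusters_demo = kmeans_demo.fit_predict(X_5d_demo)

# 1.0 means the clusters reproduce the labels exactly; values near 0 mean chance-level agreement
ari = adjusted_rand_score(digits.target, clusters_demo)
print(f"Adjusted Rand index: {ari:.3f}")
```

Cluster indices are arbitrary, so cluster 3 need not correspond to the digit 3; the adjusted Rand index is insensitive to this relabeling.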

Visually, the clusters generated by KMeans appear to be more clearly separated from one another than the PCA projection colored by class label. This is not surprising: PCA is an unsupervised method, so it does not optimize for class separation. The classification task is instead handled by the next method we discuss.

LDA

LDA, like PCA, is a linear transformation method commonly used for dimensionality reduction. Unlike the unsupervised PCA, however, LDA is a supervised learning method. Because LDA's goal is to make use of the information in the class labels, it computes component axes (linear discriminants) that maximize the distance between different classes.
In short, the difference is this: PCA selects the directions along which the projected sample points have maximum variance, while LDA selects the directions that give the best classification performance.
sklearn comes with a built-in LDA implementation, so we call the LDA model as follows:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components=5)
# Pass Target as the second argument: the class labels
X_LDA_2D = lda.fit_transform(X_std, Target.values)

The syntax for LDA is very similar to that of PCA. A single call to fit_transform fits the LDA model to the data and then applies the LDA dimensionality reduction. However, because LDA is a supervised learning algorithm, the method takes a second argument that the user must supply: the class labels, in this case the digit labels.
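
A minimal, self-contained sketch of that call pattern follows, using scikit-learn's bundled digits dataset as an assumed stand-in for MNIST. The variable names are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_lda_input = StandardScaler().fit_transform(digits.data)
y_lda = digits.target  # class labels, required because LDA is supervised

# n_components can be at most (number of classes - 1), i.e. 9 for ten digits
lda_demo = LinearDiscriminantAnalysis(n_components=5)
X_lda_5d = lda_demo.fit_transform(X_lda_input, y_lda)

print(X_lda_5d.shape)
```

Unlike PCA, where n_components is bounded by the number of features, LDA's bound also depends on the number of classes.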

LDA scatter plot

From the scatter plot, we can see that with LDA the data points cluster much more clearly than with PCA colored by class labels. This is the inherent advantage of having class labels to supervise the learning. In short, choose the right tool for the right job.
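
One hedged way to put a number on this visual impression is the silhouette score computed with the true labels as the grouping, where higher values mean tighter, better-separated classes. The sketch uses scikit-learn's bundled digits dataset as an assumed stand-in for MNIST and two components for each method. The scores it prints are specific to that setup.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_cmp = StandardScaler().fit_transform(digits.data)
y_cmp = digits.target

# Project onto two components with each method
X_pca_2d = PCA(n_components=2).fit_transform(X_cmp)
X_lda_2d = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_cmp, y_cmp)

# Silhouette of the true classes in each projection (range -1 to 1)
score_pca = silhouette_score(X_pca_2d, y_cmp)
score_lda = silhouette_score(X_lda_2d, y_cmp)
print(f"PCA: {score_pca:.3f}  LDA: {score_lda:.3f}")
```

Because LDA sees the labels during fitting, a fairer comparison would fit both projections on one split of the data and score them on held-out samples.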