Kaggle: PCA and LDA Dimensionality Reduction on the MNIST Dataset
For a very accessible explanation of PCA (Principal Components Analysis), see the article "The most detailed and comprehensive interpretation of principal component analysis (PCA)". That material is not repeated here; this post focuses on applying the PCA and LDA algorithms to the MNIST dataset. It mainly follows a Kernel on Kaggle, which you can also read directly: Interactive Intro to
Import some basic libraries

```python
import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import seaborn as sns
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import matplotlib
from sklearn.decomposition import PCA  # Import PCA
```
In short, PCA is a linear transformation algorithm that attempts to project the original features of our data onto a smaller set of features (a subspace) while retaining most of the information. To do this, the algorithm tries to find the most suitable directions in the new subspace (the principal components), the ones that maximize variance. Why maximize variance? See the first link in this article.
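To make the variance-maximization idea concrete, here is a minimal sketch on synthetic 2-D data (all names and the synthetic data are illustrative, not from the original Kernel): the eigenvector of the covariance matrix with the largest eigenvalue is the single direction that keeps the most variance after projection.

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated synthetic data: most variance lies along one diagonal direction
data = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])

centered = data - data.mean(axis=0)
cov = np.cov(centered.T)
eig_vals, eig_vecs = np.linalg.eigh(cov)  # eigh: covariance is symmetric

# The eigenvector with the largest eigenvalue is the max-variance direction
top = eig_vecs[:, np.argmax(eig_vals)]
projected = centered @ top  # 1-D projection that keeps the most variance

print(projected.shape)  # (500,)
```

By construction, the variance of `projected` is at least as large as the variance along either original coordinate axis.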
The MNIST dataset
MNIST is a dataset of handwritten digits for computer vision and is essentially the entry-level dataset of machine learning. It can be downloaded from Kaggle: MNIST Dataset
```python
# Load the dataset and inspect its dimensions
train = pd.read_csv('../input/train.csv')
print(train.shape)  # (42000, 785)

# Separate the training-set features and labels
target = train['label']
train = train.drop("label", axis=1)  # drop the label column
```
First, standardize each feature of the data to obtain X_std, then compute the covariance matrix cov_mat and its eigenvalues eig_vals and eigenvectors eig_vecs. Each eigenvalue and its corresponding eigenvector are bundled into a pair in eig_pairs, and the pairs are sorted by eigenvalue from largest to smallest. Then the sum of the eigenvalues tot is computed, along with each eigenvalue's share of that sum var_exp and the cumulative share cum_var_exp.
```python
# Standardize the dataset
from sklearn.preprocessing import StandardScaler
X = train.values
X_std = StandardScaler().fit_transform(X)

# Compute the eigenvalues and eigenvectors of the covariance matrix
mean_vec = np.mean(X_std, axis=0)
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

# Create a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) pairs from high to low by eigenvalue
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Explained variance from the eigenvalues
tot = sum(eig_vals)                                                   # sum of eigenvalues
var_exp = [(i / tot) * 100 for i in sorted(eig_vals, reverse=True)]   # individual shares
cum_var_exp = np.cumsum(var_exp)                                      # cumulative share
```
Plot the individual and cumulative explained-variance ratios of the eigenvalues. The exact code can be found in the original Kernel and is not reposted here.
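A rough, hedged sketch of the omitted plot is below. It uses synthetic eigenvalues in place of the real ones computed above, and a bar-plus-step layout rather than the original Kernel's interactive Plotly figure; the file name is illustrative.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Synthetic stand-in for the 784 sorted eigenvalues
eig_vals = np.sort(np.random.default_rng(0).exponential(size=784))[::-1]
tot = eig_vals.sum()
var_exp = [(v / tot) * 100 for v in eig_vals]  # individual shares (%)
cum_var_exp = np.cumsum(var_exp)               # cumulative share (%)

plt.figure(figsize=(10, 5))
plt.bar(range(len(var_exp)), var_exp, label='individual explained variance')
plt.step(range(len(cum_var_exp)), cum_var_exp, where='mid',
         label='cumulative explained variance')
plt.xlabel('Principal component index')
plt.ylabel('Explained variance (%)')
plt.legend()
plt.savefig('scree_plot.png')
```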
The small picture in the upper-right corner of the figure above is the same as the large one, just zoomed. The yellow and green lines are the cumulative shares of the eigenvalues; as expected, the cumulative share reaches 100%. The black and red lines are the individual shares. The x-axis runs up to 784, one position for each of the 784 feature dimensions.
As we can see, of our 784 features or columns, roughly 90% of the explained variance is captured by about 200 features. Therefore, if you want to implement PCA here, extracting the first 200 features would be a logical choice, since they account for about 90% of the information.
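Sklearn can pick the component count for a target variance fraction directly: passing a float between 0 and 1 as `n_components` keeps the smallest number of components that explain at least that share of variance. A small sketch on synthetic data (the variable names and data stand in for the real MNIST matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic correlated data standing in for the MNIST feature matrix
X_demo = rng.normal(size=(300, 50)) @ rng.normal(size=(50, 50))

pca = PCA(n_components=0.90)  # keep enough components for 90% of the variance
X_reduced = pca.fit_transform(X_demo)

print(X_reduced.shape[1], "components retain",
      pca.explained_variance_ratio_.sum().round(3), "of the variance")
```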
Visualizing the eigenvectors
As mentioned above, the PCA method tries to capture the optimal directions of maximum variance (the eigenvectors), so it can be useful to visualize these directions together with their associated eigenvalues. For a quick implementation, only the first 30 components are extracted here (using Sklearn's .components_ attribute).
```python
# Call Sklearn's PCA method
n_components = 30
pca = PCA(n_components=n_components).fit(train.values)

# Extract the PCA principal components. The original Kernel names this
# variable "eigenvalues", but these are really the eigenvectors.
eigenvalues = pca.components_.reshape(n_components, 28, 28)

# Plot the first n_row * n_col = 28 eigenvectors
n_row = 4
n_col = 7
plt.figure(figsize=(13, 12))
for i in range(n_row * n_col):
    plt.subplot(n_row, n_col, i + 1)
    plt.imshow(eigenvalues[i].reshape(28, 28), cmap='jet')
    plt.title('Eigenvalue ' + str(i + 1), size=6.5)
    plt.xticks(())
    plt.yticks(())
plt.show()
```
The figure above depicts the first 30 optimal directions, or principal component axes, that the PCA method generates for our MNIST dataset. Interestingly, when comparing the first component, "Eigenvalue 1", with the 28th, "Eigenvalue 28", it is clear that increasingly complex directions or components are generated in the search for maximum variance in the new feature subspace.
Implementing the PCA algorithm with Sklearn

Now use the Sklearn toolkit to run principal component analysis as follows:
```python
# Delete our earlier created X object
del X
# Use the first 6000 samples to speed up the computation
X = train[:6000].values
del train

# Standardize the data
X_std = StandardScaler().fit_transform(X)

# Call the PCA method with 5 components
pca = PCA(n_components=5)
pca.fit(X_std)
X_5d = pca.transform(X_std)

# "Target" also takes the first 6000 labels
Target = target[:6000]
```
X_std is the standardized data with shape (6000, 784); X_5d is the dimension-reduced data with shape (6000, 5).
Scatter plot of the first 2 principal components
For these dimensionality reduction methods, scatter plots are the most common visualization, because they make any clusters (if present) easy to see at a glance, and that is exactly what we do with the first 2 principal components. Below, of the 5 components, the first and second principal components are plotted on the horizontal and vertical axes; the code can be found in the original Kernel.
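The omitted scatter plot can be sketched roughly as follows. Synthetic 5-D projections and labels stand in for the real `X_5d` and `Target`, and Matplotlib is used instead of the original Kernel's Plotly; the file name is illustrative.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X_5d = rng.normal(size=(600, 5))        # stand-in for the PCA projections
Target = rng.integers(0, 10, size=600)  # stand-in for the digit labels

plt.figure(figsize=(8, 6))
sc = plt.scatter(X_5d[:, 0], X_5d[:, 1], c=Target, cmap='tab10', s=8)
plt.colorbar(sc, label='digit label')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.savefig('pca_scatter.png')
```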
As you can see from the scatter plot, some obvious clusters can be distinguished among the colored points; these clusters potentially represent different digits. However, since the PCA algorithm is unsupervised, the colors only appear because the labels were used when plotting; with an unlabeled dataset you would not see these colors. So, without the colors, how could we separate our data points in the new feature space?
First, we use Sklearn's KMeans clustering method, calling fit_predict to compute the cluster centers and predict the cluster index for the projections onto the first and second PCA components (to see whether any perceptible clusters can be observed). The KMeans cluster scatter:
```python
from sklearn.cluster import KMeans  # KMeans clustering

# Set up a KMeans clustering with 9 clusters (9 chosen sneakily ;) as
# hopefully we get back our 9 class labels)
kmeans = KMeans(n_clusters=9)

# Compute cluster centers and predict cluster indices
X_clustered = kmeans.fit_predict(X_5d)
```
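The cluster scatter described next can be sketched like this, again with synthetic 5-D data standing in for `X_5d` (the seed, figure size, and file name are illustrative choices, not from the original Kernel):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_5d = rng.normal(size=(600, 5))  # stand-in for the PCA projections

kmeans = KMeans(n_clusters=9, n_init=10, random_state=0)
X_clustered = kmeans.fit_predict(X_5d)  # one cluster index per sample

plt.figure(figsize=(8, 6))
plt.scatter(X_5d[:, 0], X_5d[:, 1], c=X_clustered, cmap='tab10', s=8)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.title('KMeans clusters (k=9)')
plt.savefig('kmeans_scatter.png')
```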
Visually, the clusters generated by the KMeans algorithm seem to provide a clearer separation between clusters than the class labels added to the PCA projection did. This is not surprising, because PCA is an unsupervised method and therefore does not optimize for class separation. The task of separating classes is handled by the next method we will discuss.
Like PCA, LDA is also a linear transformation method commonly used for dimensionality reduction tasks. Unlike the unsupervised algorithms, however, LDA is a supervised learning method. Because LDA's goal is to make use of the class label information, it computes the component axes (linear discriminants) that maximize the separation between different classes.

In short, the difference between them is: in contrast with PCA, which selects the directions of maximum variance of the projected sample points, LDA chooses the directions that are best for classification performance:
The Sklearn toolkit comes with a built-in LDA function, so we call the LDA model as follows:
```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

lda = LDA(n_components=5)
# LDA is supervised: pass the Target class labels as the second argument
X_LDA_2D = lda.fit_transform(X_std, Target.values)
```
The syntax for LDA is very similar to that of PCA: a single call to the fit_transform method fits the LDA model to the data and then applies the dimension-reducing transformation. However, because LDA is a supervised learning algorithm, the user must provide a second argument to the method: the class labels, in this case the target labels for the digits.
LDA scatter visualization
From the scatter plot above, we can see that when using LDA, the data points group together much more clearly than with PCA, even with class labels added to the PCA projection. This is the inherent advantage of having class labels to supervise the learning. In short, choose the right tool for the right job.
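Since the LDA scatter code is not reposted in this article either, here is a hedged sketch on synthetic labeled blobs (the class-separated blobs, seed, and file name stand in for the real MNIST digits and the original Kernel's figure):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

rng = np.random.default_rng(0)
n_classes, per_class = 5, 100
# Well-separated class centers in 20-D, plus unit Gaussian noise
centers = rng.normal(scale=5.0, size=(n_classes, 20))
X = np.vstack([c + rng.normal(size=(per_class, 20)) for c in centers])
y = np.repeat(np.arange(n_classes), per_class)

lda = LDA(n_components=2)
X_lda = lda.fit_transform(X, y)  # supervised: labels guide the projection

plt.figure(figsize=(8, 6))
plt.scatter(X_lda[:, 0], X_lda[:, 1], c=y, cmap='tab10', s=8)
plt.xlabel('First linear discriminant')
plt.ylabel('Second linear discriminant')
plt.savefig('lda_scatter.png')
```

With clearly separated classes, the projected points form tight per-class groups, illustrating the advantage LDA gains from the labels.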