https://mp.weixin.qq.com/s/mtU-58ZPW9ruOj7H16zpVQ


One. Data fetching:

1. CSV file: csv_data = pd.read_csv('/path/test.csv') (requires import pandas as pd)

2. TXT file: f = open('/path/test.txt', 'r')

3. Excel file:

import xlrd

data = xlrd.open_workbook(r'\path\demo.xls', formatting_info=True)  # formatting_info is only supported for .xls files

table = data.sheet_by_name("Sheet2")

or

df = pd.read_excel(r"\path\window regulator.xlsx", sheet_name="Sheet2")

Two. Data processing and cleaning:

1. Data cleaning:

A. Adjust values and formats; remove noise, untrusted values, and fields with too many missing values.

1) Remove spaces and line breaks:

" xyz ".strip() # returns "xyz"

" xyz ".lstrip() # returns "xyz "

" xyz ".rstrip() # returns " xyz"

" x y z ".replace(' ', '') # returns "xyz"

2) Split and rejoin to remove all whitespace: ''.join(your_str.split())

3) Replace with regular expressions:

import re

a = 'hello word'
strinfo = re.compile('word')
b = strinfo.sub('python', a)
print(b)  # prints "hello python"

4) Delete one or several columns of a pandas DataFrame:

Method 1: directly use del DF['column_name']

Method 2: use the drop method; there are three equivalent forms:

i. DF = DF.drop('column_name', axis=1)

ii. DF.drop('column_name', axis=1, inplace=True)

iii. DF.drop(DF.columns[[0, 1, 3]], axis=1, inplace=True)

5) Delete rows of a DataFrame:

DataFrame.drop(labels=None, axis=0, index=None, columns=None, inplace=False)

The default is axis=0, which deletes rows by index; to delete columns, axis=1 must be specified. With inplace=False (the default), the delete operation does not change the original data but returns a new DataFrame; with inplace=True, the deletion is performed directly on the original data and cannot be undone.

6) Data type conversion:
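The source leaves this item empty; as a minimal sketch, type conversion in pandas is usually done with astype or pd.to_numeric (the frame and column names below are made up for illustration):

```python
import pandas as pd

# Hypothetical frame whose columns arrived as strings
df = pd.DataFrame({'qty': ['1', '2', '3'],
                   'price': ['1.5', '2.0', 'bad']})

df['qty'] = df['qty'].astype(int)                          # strict: raises on bad input
df['price'] = pd.to_numeric(df['price'], errors='coerce')  # lenient: 'bad' becomes NaN
```

astype raises on unparsable values, while to_numeric with errors='coerce' turns them into NaN so they can be handled as missing data in the cleaning step.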

B. Normalization, discretization, and data transformations (log, 0-1, exp, Box-Cox):

1) 0-1 standardization: this is the simplest method. Traverse every value in the feature vector, record Max and Min, and normalize the data using Max-Min as the base (i.e. Min=0, Max=1):

def MaxMinNormalization(x, Max, Min):
    x = (x - Min) / (Max - Min)
    return x

2) Z-score standardization: this method standardizes the data using the mean and standard deviation of the original data. The processed data follows the standard normal distribution, i.e. mean 0 and standard deviation 1. The key point is conformance to the standard normal distribution, which changes the distribution of the feature to some extent.

def Z_ScoreNormalization(x, mu, sigma):
    x = (x - mu) / sigma
    return x

3) Sigmoid function: the sigmoid is an S-shaped curve and a good threshold function. It is centrally symmetric about (0, 0.5) and has a large slope near that point; as the data tends to positive or negative infinity, the mapped values tend to 1 and 0. The sigmoid also performs well for threshold segmentation, since shifting the formula changes the segmentation threshold. As a normalization method, we only consider the case where (0, 0.5) is the segmentation point:

import numpy as np

def sigmoid(X, useStatus):
    if useStatus:
        return 1.0 / (1 + np.exp(-float(X)))
    else:
        return float(X)

4) Transform the data range: besides the methods above, another common approach is to scale an attribute to between a specified minimum and maximum (usually 0-1), which can be done with the preprocessing.MinMaxScaler class.

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

X_scaled = X_std * (max - min) + min

where max and min are the bounds of the target range (feature_range).
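A minimal sketch of the MinMaxScaler usage described above (the sample matrix is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)  # each column mapped so its min -> 0, max -> 1

# Same result from the formula given above (feature_range (0, 1))
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```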

5) Normalization: normalization scales each individual sample to unit norm (each sample's norm becomes 1). This is useful if you plan to use a quadratic form (dot product) or other kernel methods to compute the similarity between pairs of samples. The main idea is to compute the p-norm of each sample and then divide every element of the sample by that norm, so that afterwards the p-norm (l1-norm, l2-norm) of each processed sample equals 1.

The p-norm is computed as: ||X||p = (|x1|^p + |x2|^p + ... + |xn|^p)^(1/p)

This method is mainly used in text classification and clustering. For example, the dot product of two l2-normalized TF-IDF vectors is the cosine similarity of the two vectors. Use the preprocessing.normalize() function to transform the specified data directly, or the preprocessing.Normalizer() class to fit and transform training and test sets.
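A small sketch of the cosine-similarity point above, applying preprocessing.normalize to two made-up vectors:

```python
import numpy as np
from sklearn.preprocessing import normalize

v = np.array([[3.0, 4.0],
              [1.0, 0.0]])

v_l2 = normalize(v, norm='l2')      # each row now has unit l2-norm
cos_sim = float(v_l2[0] @ v_l2[1])  # dot product of unit vectors = cosine similarity
```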

6) Box-Cox transformation (stats.boxcox): the Box-Cox transformation is a generalized power transformation proposed by Box and Cox in 1964. It is commonly used in statistical modeling when a continuous response variable does not follow a normal distribution, and it can reduce the correlation between unobservable error and predictor variables to some extent. Its main feature is the introduction of a parameter that is estimated from the data itself and determines the form of the transformation. The Box-Cox transformation can clearly improve the normality, symmetry, and variance equality of the data, and is effective for many real-world datasets.
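A sketch of scipy's stats.boxcox on made-up right-skewed data; note the transform only accepts strictly positive values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=1000)  # strictly positive, right-skewed

# With no lambda given, boxcox estimates it from the data and returns it
transformed, lam = stats.boxcox(data)
```

After the transform the skewness should be much closer to 0 than that of the raw exponential sample.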

2. Data sampling:

A. Sampling without replacement

use random.sample

import random

idxTest = random.sample(range(nPoints), nSample)

# the resulting idxTest is a list

B. Sampling with replacement

Use np.random.choice (picks one element at a time):

nBagSamples = 50
idxBag = []
for i in range(nBagSamples):
    idxBag.append(np.random.choice(range(len(xTrain))))

# choice does not directly return a list

For more information, see pandas.DataFrame.sample
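Both sampling modes can also be sketched at the DataFrame level with pandas.DataFrame.sample (the frame is made up; the replace flag switches between sampling without and with replacement):

```python
import pandas as pd

df = pd.DataFrame({'x': range(10)})

no_repl = df.sample(n=5, replace=False, random_state=0)    # without replacement
with_repl = df.sample(n=15, replace=True, random_state=0)  # with replacement (n may exceed len(df))
```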

Three. Feature engineering:

1. Feature processing: numerical, categorical, time, text, statistical, and combination features; feature derivation. For example:

1) Binarization of quantitative features

2) Dummy-variable encoding of qualitative features

3) Basic transformations of a single variable, such as squaring, taking the square root, log transform, etc.

4) Deriving variables by adding a time dimension, e.g. 3-month transaction data, 6-month transaction data, etc.

5) Multivariable operations, e.g. adding two variables, multiplying them, or computing a ratio between them to obtain a new variable.

6) Dimensionality reduction of multiple variables with PCA or LDA.
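Items 1) and 2) above can be sketched with sklearn's Binarizer and pandas get_dummies (the threshold and column values are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Binarizer

# 1) Quantitative feature binarization: 1 if value > threshold, else 0
ages = np.array([[6.0], [18.0], [65.0]])
binary = Binarizer(threshold=17.0).fit_transform(ages)

# 2) Qualitative feature -> dummy variables
colors = pd.Series(['red', 'green', 'red'])
dummies = pd.get_dummies(colors, prefix='color')
```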

Summary of the corresponding Python sklearn classes:

2. Feature selection: once data preprocessing is complete, we need to select meaningful features to feed into the machine learning algorithm and model for training. First, based on business understanding, select the features that have a significant effect on the dependent variable, and evaluate the difficulty of acquiring each feature, its coverage, and its accuracy. Then perform feature selection at the data level. Generally speaking, two aspects are considered:

* Whether the feature diverges: if a feature does not diverge, e.g. its variance is close to 0, the samples show no difference on this feature, and it is useless for distinguishing samples.

* Correlation between the feature and the target: obviously, features highly correlated with the target should be preferred. Except for the variance-threshold method, the other methods introduced here all work from the perspective of correlation.

Feature selection has two main purposes: first, to reduce the number of features (dimensionality reduction) so that the model generalizes better and overfits less; second, to improve understanding of the features and their values.

By form, feature selection methods can be divided into three categories:

A. Filter: score each feature by divergence or correlation, set a threshold or the number of features to keep, and select features. See sklearn.feature_selection.SelectKBest.

1) Variance selection method

To use variance selection, first compute the variance of each feature, then select the features whose variance exceeds a threshold. Use the VarianceThreshold class of the feature_selection library to select features.
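A minimal VarianceThreshold sketch on a made-up matrix; the constant first column is dropped:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 2.0],
              [0.0, 1.0],
              [0.0, 3.0]])  # first column has zero variance

selector = VarianceThreshold(threshold=0.0)  # keep features with variance > 0
X_sel = selector.fit_transform(X)
```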

2) Correlation coefficient method

To use the correlation coefficient method, first compute each feature's correlation coefficient with the target value and that coefficient's P value. Use the SelectKBest class of the feature_selection library combined with the correlation coefficient to select features.

3) Chi-square test

The classical chi-square test checks the correlation between a qualitative independent variable and a qualitative dependent variable. Suppose the independent variable has N possible values and the dependent variable has M possible values. Considering the difference between the observed and expected sample frequencies for "independent variable equals i and dependent variable equals j", construct the statistic:

χ² = Σ (A − E)² / E

where A is the observed frequency and E is the expected frequency.

Simply put, this statistic measures the correlation between the independent and dependent variables. Use the SelectKBest class of the feature_selection library combined with the chi-square test to select features.
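A sketch of the chi-square filter on the iris dataset (chi2 requires non-negative features):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)  # keep the 2 highest-scoring features
```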

4) Mutual information method

Classical mutual information is also used to evaluate the correlation between qualitative independent and dependent variables. The mutual information formula is:

I(X; Y) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( p(x, y) / (p(x) p(y)) )

To handle quantitative data, the maximal information coefficient (MIC) method was proposed. Use the SelectKBest class of the feature_selection library combined with the maximal information coefficient method to select features.

B. Wrapper: according to an objective function (usually a prediction score), select several features at a time, or exclude several features. Recursive feature elimination: the RFE method performs multiple rounds of training with a base model; after each round, the features with the smallest weight coefficients are eliminated, and the next round trains on the resulting feature set. Use the RFE class of the feature_selection library to select features.
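A sketch of recursive feature elimination on the iris dataset, with logistic regression as the base model:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
X_rfe = rfe.fit_transform(X, y)  # each round drops the lowest-weight features
```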

C. Embedded: first train certain machine learning algorithms and models to obtain a weight coefficient for each feature, then select features by coefficient from large to small. Similar to the filter method, except that the merit of each feature is determined through training.

1) Feature selection based on a penalty term

Use a base model with a penalty term, which screens out features and reduces dimensionality at the same time. Use the SelectFromModel class of the feature_selection library combined with a logistic regression model with an L1 penalty.
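A sketch of SelectFromModel with an L1-penalized logistic regression on iris (the C value is an arbitrary illustration; how many features survive depends on it):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
l1_lr = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)  # liblinear supports l1
selector = SelectFromModel(l1_lr)
X_l1 = selector.fit_transform(X, y)  # keeps features with non-zero coefficients
```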

2) Feature selection based on tree models

Tree models such as GBDT can also serve as the base model for feature selection. Use the SelectFromModel class of the feature_selection library combined with a GBDT model.
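A sketch of the GBDT variant on iris; by default SelectFromModel keeps the features whose importance is at least the mean importance:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)
selector = SelectFromModel(GradientBoostingClassifier(random_state=0))
X_gbdt = selector.fit_transform(X, y)  # keeps features with importance >= mean importance
```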

3. Feature validity analysis: analyze feature weights and monitor feature effectiveness to prevent feature quality from degrading and hurting model performance.

Four. Model selection