One. Loading data:


1.    CSV file (after import pandas as pd): csv_data = pd.read_csv('/path/test.csv')

2.    Text file: f = open('/path/test.txt', 'r')

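3.    Excel file:

import xlrd

# xlrd 2.0 and later reads only legacy .xls files; opening .xlsx needs xlrd < 2.0,
# and formatting_info=True is supported only for .xls, so it is omitted here
data = xlrd.open_workbook(r'\path\demo.xlsx')

table = data.sheet_by_name("Sheet2")

# Alternatively, read the sheet directly into a DataFrame with pandas
df = pd.read_excel(r"\path\window regulator.xlsx", sheet_name="Sheet2")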


Two. Data processing and cleaning:


1.     Data cleaning:

A. Adjust values and formats; remove noise, untrustworthy values, and fields with many missing values

1) Remove spaces and line breaks:

" xyz ".strip() # returns "xyz"

" xyz ".lstrip() # returns "xyz "

" xyz ".rstrip() # returns " xyz"

" x y z ".replace(' ', '') # returns "xyz"

2) Split on whitespace and rejoin to remove all whitespace: ''.join(your_str.split())

3) Replace text with regular expressions:

import re

a = 'hello word'  # input string, inferred from the output shown below
strinfo = re.compile('word')
b = strinfo.sub('python', a)
print(b)  # the output is: hello python

4) Delete one or several columns from a pandas DataFrame:

Method 1: use del directly: del DF['column-name']

Method 2: use the drop method. There are three common forms:

i. DF = DF.drop('column_name', axis=1)

ii. DF.drop('column_name', axis=1, inplace=True)

iii. DF.drop(DF.columns[[0, 1, 3]], axis=1, inplace=True)  # drop by column position

5) Delete rows from a DataFrame

DataFrame.drop(labels=None, axis=0, index=None, columns=None, inplace=False)

By default axis=0, which deletes rows by index label; to delete columns, specify axis=1. With inplace=False (the default), the deletion does not change the original data and instead returns a new DataFrame. With inplace=True, the deletion is applied directly to the original data and cannot be undone.
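The behavior of axis and inplace described above can be checked on a toy table. This is a minimal sketch; the DataFrame contents and the variable names are made up for illustration.

```python
import pandas as pd

# Toy data, invented for this illustration
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["r0", "r1", "r2"],
)

# axis=0 (the default) drops rows by index label and returns a new DataFrame
rows_dropped = df.drop("r1", axis=0)

# Columns can be dropped by label or by position via df.columns
cols_dropped = df.drop(columns=["b"])
pos_dropped = df.drop(df.columns[[0, 2]], axis=1)

# inplace=True modifies the object itself, so work on a copy to keep df intact
df_copy = df.copy()
df_copy.drop("r0", axis=0, inplace=True)

print(rows_dropped.index.tolist(), cols_dropped.columns.tolist())
```

Because inplace=True cannot be undone, returning a new DataFrame (the default) is usually the safer choice while exploring data.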

6) Data type conversion :
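As one possible illustration of type conversion in pandas, the sketch below uses astype, pd.to_numeric, and pd.to_datetime. The column names and values are hypothetical and not taken from any data set in this article.

```python
import pandas as pd

# Hypothetical raw data where every column arrives with an unsuitable type
df = pd.DataFrame({
    "price": ["1.5", "2", "n/a"],          # numbers stored as text
    "qty": [1.0, 2.0, 3.0],                # whole numbers stored as floats
    "day": ["2020-01-01", "2020-01-02", "2020-01-03"],
})

# astype converts a column when every value is convertible
df["qty"] = df["qty"].astype(int)

# to_numeric with errors="coerce" turns unparseable entries into NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# to_datetime parses date strings into datetime64 values
df["day"] = pd.to_datetime(df["day"])

print(df.dtypes)
```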


B. Normalization, discretization, and data transformation (log, 0-1, exp, Box-Cox):


1) 0-1 standardization (min-max normalization): This is the simplest and most intuitive method. Traverse every value in the feature vector, record Max and Min, and use Max - Min as the base (so that Min maps to 0 and Max maps to 1) to normalize the data.

def MaxMinNormalization(x, Max, Min):
    # Map x linearly so that Min becomes 0 and Max becomes 1
    x = (x - Min) / (Max - Min)
    return x


2) Z-score standardization: This method standardizes the data using the mean and standard deviation of the original data. The processed data have mean 0 and standard deviation 1. Note that they follow the standard normal distribution only if the original data were normally distributed: the transformation shifts and rescales a feature but does not change the shape of its distribution.

def Z_ScoreNormalization(x, mu, sigma):
    # Subtract the mean and divide by the standard deviation
    x = (x - mu) / sigma
    return x


3) Sigmoid function: The sigmoid function has an S-shaped curve and is a good threshold function. It is centrally symmetric about the point (0, 0.5) and has its largest slope near (0, 0.5); as the input tends to positive or negative infinity, the mapped value approaches 1 or 0, respectively. It is a "normalization method" I like very much. The quotation marks are there because I think the sigmoid function also performs well for threshold segmentation: by changing the formula, the segmentation threshold can be changed. As a normalization method here, we only consider the case where the point (0, 0.5) is the segmentation threshold:

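import numpy as np

def sigmoid(X, useStatus):
    # useStatus switches the sigmoid mapping on; otherwise X is returned unchanged
    if useStatus:
        return 1.0 / (1 + np.exp(-float(X)))
    else:
        return float(X)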


4) Scaling to a range: In addition to the methods described above, another common method is to scale a feature to lie between a specified minimum and maximum (usually 0 and 1). This can be done with the preprocessing.MinMaxScaler class.
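A minimal sketch of range scaling with scikit-learn's MinMaxScaler follows. The arrays are made-up numbers; the point is that the scaler learns the minimum and maximum from the training data and reuses them on new data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up training and test data with two features
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])
X_test = np.array([[2.5, 25.0]])

# Learn per-feature min and max on the training set only
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same mapping to the test set; values outside the
# training range would fall outside [0, 1]
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled)
print(X_test_scaled)
```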




5) Normalization to unit norm: This process scales each sample to unit norm (the norm of each sample is 1). It is very useful if you later use a quadratic form (dot product) or another kernel method to compute the similarity between two samples. The main idea is to compute the p-norm of each sample and then divide every element of the sample by that norm, so that the p-norm (l1-norm, l2-norm) of each processed sample equals 1.


             Formula for the p-norm: ||X||_p = (|x1|^p + |x2|^p + ... + |xn|^p)^(1/p)


This method is mainly used in text classification and clustering. For example, taking the dot product of two l2-normalized TF-IDF vectors gives the cosine similarity of the two vectors. The preprocessing.normalize() function converts the specified data, and the preprocessing.Normalizer() class implements fitting and transformation for the training and test sets.
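The sketch below illustrates unit-norm scaling and the cosine-similarity remark above, using preprocessing.normalize and preprocessing.Normalizer. The two short vectors stand in for TF-IDF vectors and are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, normalize

# Two made-up sample vectors (rows); in practice these could be TF-IDF vectors
X = np.array([[3.0, 4.0], [1.0, 0.0]])

# Scale each row so that its l2-norm (or l1-norm) equals 1
X_l2 = normalize(X, norm="l2")
X_l1 = normalize(X, norm="l1")

# The dot product of two l2-normalized rows is their cosine similarity
cosine = float(X_l2[0] @ X_l2[1])

# Normalizer offers the same operation as a fit/transform object,
# which is convenient inside a pipeline
normalizer = Normalizer(norm="l2").fit(X)
X_new = normalizer.transform(np.array([[0.0, 2.0]]))

print(X_l2, cosine)
```

Normalizer is stateless (fit learns nothing from the data), so applying it to training and test sets separately gives the same result as applying it to both together.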


6) Box-Cox transformation (stats.boxcox): The Box-Cox transformation is a generalized power transformation proposed by Box and Cox in 1964. It is commonly used in statistical modeling when a continuous response variable does not follow a normal distribution. After a Box-Cox transformation, the correlation between the unobservable error and the predictor variables can be reduced to some extent. The main feature of the transformation is that it introduces a parameter, which is estimated from the data itself and then determines the form of the transformation. The Box-Cox transformation can noticeably improve the normality, symmetry, and homogeneity of variance of the data, and it is effective for many practical data sets.
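For reference, the transformation is y = (x^lambda - 1) / lambda when lambda is not 0, and y = log(x) when lambda = 0; it requires strictly positive data. The sketch below applies scipy.stats.boxcox to synthetic right-skewed data (a log-normal sample generated only for illustration) and compares the skewness before and after.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive, right-skewed data (illustration only)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# With no lambda given, boxcox estimates it by maximum likelihood
# and returns the transformed data together with the estimate
x_bc, lam = stats.boxcox(x)

# Skewness near 0 indicates a more symmetric distribution
skew_before = stats.skew(x)
skew_after = stats.skew(x_bc)

print(lam, skew_before, skew_after)
```

For log-normal input the estimated lambda should come out near 0, which corresponds to the log transform.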


2.     Data sampling:


A. Sampling without replacement
use random.sample

import random

idxTest = random.sample(range(nPoints),nSample)      

# The resulting idxTest is a list


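B. Sampling with replacement
use random.choice (draws one element at a time)


idxBag = []  # illustrative name; nPoints is assumed to be the population size
for i in range(nBagSamples):
    idxBag.append(random.choice(range(nPoints)))
# choice returns a single element rather than a list, so each draw is appended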


For sampling rows of a DataFrame, see pandas.DataFrame.sample (pandas.DataFrame.resample is for time-series frequency conversion).
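The two sampling modes above can be sketched with the standard library alone. The sizes are arbitrary, and random.choices (plural) is used here as a shortcut that draws several elements with replacement in one call, instead of looping over random.choice.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

nPoints, nSample = 10, 4  # arbitrary sizes for illustration

# Without replacement: every index appears at most once
without_replacement = random.sample(range(nPoints), nSample)

# With replacement: the same index may be drawn more than once
with_replacement = random.choices(range(nPoints), k=nSample)

print(without_replacement, with_replacement)
```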


Three. Feature engineering:

1.     Feature processing: numerical, categorical, time, text, statistical, and combined features, as well as derived features, such as:

1) Binarization of quantitative features

2) Dummy-variable encoding of qualitative features

3) Basic transformations of a single variable, for example squaring it, taking its square root, or applying a log transform.

4) Deriving variables by adding a time dimension, such as 3-month transaction data and 6-month transaction data.

5) Operations on multiple variables, for example adding or multiplying two variables, or computing a ratio between variables, to obtain a new variable.

6) Dimensionality reduction of multiple variables with PCA or LDA.
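A few of the feature-processing steps listed above can be sketched with pandas and NumPy. The table, column names, and the threshold of 60 are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical customer table
df = pd.DataFrame({
    "score": [35, 60, 82],
    "city": ["A", "B", "A"],
    "income": [1000.0, 2500.0, 4000.0],
    "spend": [200.0, 500.0, 1000.0],
})

# 1) Binarize a quantitative feature with an assumed threshold of 60
df["passed"] = (df["score"] >= 60).astype(int)

# 2) Encode a qualitative feature as dummy variables
dummies = pd.get_dummies(df["city"], prefix="city")

# 3) Transform a single variable, here with a log
df["log_income"] = np.log(df["income"])

# 5) Combine two variables into a ratio feature
df["spend_ratio"] = df["spend"] / df["income"]

print(df.join(dummies))
```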


Summary of Python sklearn classes:



2.     Feature selection: Once data preprocessing is complete, we need to select meaningful features to feed into machine learning algorithms and models for training. First, based on business understanding, select the features that have a significant effect on the dependent variable, and evaluate how difficult each feature is to obtain, its coverage, and its accuracy. Then select features at the data level. Generally, feature selection considers two aspects:

Whether the feature diverges: if a feature does not diverge, for example its variance is close to 0, the samples are essentially the same on this feature, so it is of no use for distinguishing between samples.

Correlation between the feature and the target: this is more obvious; features highly correlated with the target should be preferred. Apart from the low-variance removal method, the methods introduced in this article all work from the perspective of correlation.

Feature selection has two main purposes: the first is to reduce the number of features (dimensionality reduction), which strengthens the model's generalization ability and reduces overfitting; the second is to improve understanding of the features and their values.

By form, feature selection methods fall into three categories:


A. Filter methods: Each feature is scored according to its divergence or relevance, and features are selected by setting a threshold or the number of features to keep. See sklearn.feature_selection.SelectKBest.


1) Variance selection method

With variance selection, first compute the variance of each feature, then select the features whose variance is greater than a threshold. Use the VarianceThreshold class of the feature_selection library to select features.
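A minimal sketch of variance selection with VarianceThreshold follows. The small matrix and the threshold of 0.5 are invented; the first column is constant and should be removed.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Made-up data: column 0 is constant, columns 1 and 2 vary
X = np.array([
    [0.0, 2.0, 1.0],
    [0.0, 1.0, 4.0],
    [0.0, 3.0, 7.0],
])

# Keep only the features whose variance exceeds the threshold
selector = VarianceThreshold(threshold=0.5)
X_sel = selector.fit_transform(X)

print(selector.variances_)      # per-feature variance
print(selector.get_support())   # boolean mask of kept features
```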

2) Correlation coefficient method

With the correlation coefficient method, first compute the correlation coefficient between each feature and the target value, together with the P-value of that coefficient. Use the SelectKBest class of the feature_selection library combined with the correlation coefficient to select features.
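One way to sketch this with scikit-learn is to pass f_regression to SelectKBest. Note the substitution: f_regression scores each feature with an F-statistic derived from its Pearson correlation with the target, and also reports P-values, so it ranks features the same way as the absolute correlation coefficient. The data are synthetic, built so that only features 0 and 2 influence the target.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data: y depends on features 0 and 2 only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Keep the k features with the highest correlation-based score
selector = SelectKBest(score_func=f_regression, k=2)
X_sel = selector.fit_transform(X, y)

selected = sorted(selector.get_support(indices=True).tolist())
print(selected, selector.pvalues_)
```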

3) Chi-square test

The classical chi-square test checks the correlation between a qualitative independent variable and a qualitative dependent variable. Suppose the independent variable takes N distinct values and the dependent variable takes M distinct values. Consider the difference between the observed frequency A_ij and the expected frequency E_ij of samples whose independent variable equals i and whose dependent variable equals j, and construct the statistic:

             χ^2 = Σ_i Σ_j (A_ij - E_ij)^2 / E_ij

Put simply, this statistic measures the correlation between the independent variable and the dependent variable. Use the SelectKBest class of the feature_selection library combined with the chi-square test to select features.
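A minimal sketch of chi-square selection follows, using the iris data set bundled with scikit-learn as an assumed example. Note that sklearn's chi2 scorer requires non-negative feature values.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Iris: 150 samples, 4 non-negative features, 3 classes
X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi-square statistic
selector = SelectKBest(score_func=chi2, k=2)
X_sel = selector.fit_transform(X, y)

print(selector.scores_)
print(selector.get_support(indices=True))
```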

4) Mutual information method

Classical mutual information is also used to evaluate the correlation between a qualitative independent variable and a qualitative dependent variable. The formula for mutual information is:

             I(X; Y) = Σ_x Σ_y p(x, y) log( p(x, y) / (p(x) p(y)) )

To handle quantitative data, the maximal information coefficient method was proposed. Use the SelectKBest class of the feature_selection library combined with the maximal information coefficient method to select features.
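scikit-learn does not ship a maximal information coefficient scorer, so the sketch below swaps in mutual_info_classif, a nearest-neighbor estimator of mutual information that also handles continuous features. It is a stand-in for illustration, not the MIC method itself; the iris data set is an assumed example.

```python
from functools import partial

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# The estimator is randomized, so fix random_state for repeatable scores
score_func = partial(mutual_info_classif, random_state=0)

# Keep the 2 features with the highest estimated mutual information
selector = SelectKBest(score_func=score_func, k=2)
X_sel = selector.fit_transform(X, y)

print(selector.scores_)
print(selector.get_support(indices=True))
```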


B. Wrapper methods: According to an objective function (usually a prediction score), select several features at a time, or exclude several features. Recursive feature elimination: this method uses a base model for multiple rounds of training; after each round, the features with the smallest weight coefficients are eliminated, and the next round of training uses the new feature set. Use the RFE class of the feature_selection library to select features.
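Recursive feature elimination as described above can be sketched with the RFE class. The base model (logistic regression), the iris data set, and the target of two features are all choices made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Each round fits the base model and removes `step` features with the
# smallest coefficients, until n_features_to_select remain
rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    step=1,
)
X_sel = rfe.fit_transform(X, y)

print(rfe.support_)   # mask of the features that were kept
print(rfe.ranking_)   # 1 for kept features, larger for earlier eliminations
```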


C. Embedded methods: First train certain machine learning algorithms and models to obtain a weight coefficient for each feature, then select features by coefficient, from largest to smallest. This is similar to the filter approach, except that the quality of a feature is determined through training.


1) Feature selection based on a penalty term

Using a base model with a penalty term not only screens out features but also reduces dimensionality. Use the SelectFromModel class of the feature_selection library combined with a logistic regression model with an L1 penalty term.
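A minimal sketch of penalty-based selection follows. The data set (scikit-learn's bundled breast cancer data), the standardization step, and the value C=0.1 are assumptions chosen for illustration; a smaller C means a stronger L1 penalty and therefore fewer surviving features.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Binary classification data with 30 features
X, y = load_breast_cancer(return_X_y=True)

# Standardize so that the penalty treats all features on the same scale
X_std = StandardScaler().fit_transform(X)

# The L1 penalty drives some coefficients to exactly zero;
# the liblinear solver supports the L1 penalty
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")

# SelectFromModel keeps the features with non-negligible coefficients
selector = SelectFromModel(l1_model)
X_sel = selector.fit_transform(X_std, y)

print(X.shape[1], "->", X_sel.shape[1])
```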

2) Feature selection based on a tree model

Among tree models, GBDT can also serve as the base model for feature selection. Use the SelectFromModel class of the feature_selection library combined with a GBDT model.
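Tree-based selection can be sketched in the same way, here with GradientBoostingClassifier as the GBDT model. The iris data set and the hyperparameters are assumptions; with the default threshold, SelectFromModel keeps the features whose importance is at least the mean importance.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

# Fit a GBDT model and rank features by its feature_importances_
gbdt = GradientBoostingClassifier(n_estimators=50, random_state=0)
selector = SelectFromModel(gbdt)
X_sel = selector.fit_transform(X, y)

print(selector.estimator_.feature_importances_)
print(selector.get_support())
```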


3.     Feature validity analysis: analyze feature weights and monitor feature effectiveness, so that degradation in feature quality does not affect model performance.


Four. Model selection