One. Data fetching:
1. CSV file: csv_data = pd.read_csv('/path/test.csv')
2. TXT file: f = open('/path/test.txt', 'r')
3. Excel file:
f = xlrd.open_workbook(r'\path\demo.xlsx', formatting_info=True)
df = pd.read_excel(r'\path\window regulator.xlsx', sheet_name="Sheet2")
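Since the paths above are placeholders, a self-contained sketch can read the same kind of CSV from an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# The file paths in the notes above are placeholders; here the same
# pd.read_csv call is demonstrated on an in-memory buffer.
csv_text = "id,value\n1,10\n2,20\n3,30\n"
csv_data = pd.read_csv(io.StringIO(csv_text))

print(csv_data.shape)           # (3, 2)
print(list(csv_data.columns))   # ['id', 'value']
```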
Two. Data processing and cleaning:
1. Data cleaning:
A. Adjust values and formats; remove noise, untrusted values, and fields with too many missing values.
1) Remove spaces and line breaks:
" xyz ".strip()   # returns "xyz"
" xyz ".lstrip()  # returns "xyz "
" xyz ".rstrip()  # returns " xyz"
" x y z ".replace(' ', '')  # returns "xyz"
2) Use split to break the string apart, then join to reassemble it (removes all whitespace, including newlines): ''.join(your_str.split())
3) Complete the replacement with regular expressions:
import re
a = 'hello word'
strinfo = re.compile('word')
b = strinfo.sub('python', a)
print(b)  # the output is "hello python"
4) Delete one or several columns of a pandas DataFrame:
Method 1: directly del DF['column_name']
Method 2: use the drop method; there are three equivalent expressions:
i. DF = DF.drop('column_name', axis=1)
iii. DF.drop(DF.columns[[0, 1, 3]], axis=1, inplace=True)
5) Delete a row of a DataFrame:
drop defaults to axis=0, i.e. deleting by index, so to delete columns axis=1 must be specified. With inplace=False (the default), the delete operation does not change the original data but returns a new DataFrame; with inplace=True, the deletion is performed directly on the original data and cannot be undone.
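A minimal sketch of the drop behavior described above (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

dropped_row = df.drop(0)            # axis=0 by default: drops index 0, returns a copy
dropped_col = df.drop('b', axis=1)  # axis=1: drops column 'b', returns a copy

print(len(dropped_row))             # 2
print(list(dropped_col.columns))    # ['a', 'c']
print(df.shape)                     # (3, 3) -- original unchanged (inplace=False)
```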
6) Data type conversion:
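The source leaves this item without an example; a minimal sketch using pandas astype (the column names are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({'price': ['1.5', '2.0', '3.25']})
df['price'] = df['price'].astype(float)      # string -> float
df['price_int'] = df['price'].astype(int)    # float -> int (truncates toward zero)

print(df['price'].dtype)        # float64
print(list(df['price_int']))    # [1, 2, 3]
```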
B. Normalization, discretization, and data transformation (log, 0-1, exp, Box-Cox):
1) 0-1 normalization: the simplest method to think of. Traverse every value in the feature vector, record Max and Min, and normalize the data using Max - Min as the base (so that Min maps to 0 and Max maps to 1):
x = (x - Min) / (Max - Min)
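A minimal sketch of the 0-1 normalization formula above:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])
x_scaled = (x - x.min()) / (x.max() - x.min())  # Min -> 0, Max -> 1

print(x_scaled)  # [0.   0.25 0.5  1.  ]
```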
2) Z-score standardization: this method standardizes the data using the mean and standard deviation of the original data. The processed data have mean 0 and standard deviation 1 (the method is usually described as making the data conform to the standard normal distribution); I think it changes the distribution of the feature to some extent.
x = (x - mu) / sigma
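A minimal sketch of Z-score standardization:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
z = (x - x.mean()) / x.std()  # np.std uses the population standard deviation

print(round(z.mean(), 10))  # 0.0
print(round(z.std(), 10))   # 1.0
```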
3) Sigmoid function: the Sigmoid function is a function with an S-shaped curve and a good threshold function. It is centrally symmetric about (0, 0.5) and has a large slope near (0, 0.5); as the data tend toward positive and negative infinity, the mapped values tend toward 1 and 0. I like this "normalization method" very much, because the Sigmoid function also performs well in threshold segmentation: by shifting the formula, the segmentation threshold can be changed. As a normalization method, we only consider the case where (0, 0.5) serves as the segmentation threshold:
def sigmoid(X):
    return 1.0 / (1 + np.exp(-float(X)))
4) Rescale the data range: besides the methods described above, another common approach is to scale attributes to lie between a specified maximum and minimum value (usually 0 and 1); this can be done with the preprocessing.MinMaxScaler class.
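A sketch using preprocessing.MinMaxScaler (the input matrix is illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 40.0]])

scaler = MinMaxScaler()            # default feature_range=(0, 1)
X_scaled = scaler.fit_transform(X)  # each column scaled independently

print(X_scaled.min(axis=0))  # [0. 0.]
print(X_scaled.max(axis=0))  # [1. 1.]
```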
5) Normalization (unit-norm scaling): the process scales each sample to the unit norm (the norm of each sample becomes 1). This method is very useful if you plan to use a quadratic form (dot product) or other kernel methods to compute the similarity between two samples. The main idea is to compute the p-norm of each sample and then divide every element in the sample by that norm, so that the p-norm (l1-norm, l2-norm) of each processed sample equals 1.
The formula of the p-norm: ||X||_p = (|x1|^p + |x2|^p + ... + |xn|^p)^(1/p)
This method is mainly used in text classification and clustering. For example, taking the dot product of the l2-normalized TF-IDF vectors of two documents yields the cosine similarity of the two vectors. The preprocessing.normalize() function transforms the given data, and the preprocessing.Normalizer() class implements fitting and transforming on training and test sets.
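A sketch of unit-norm scaling with preprocessing.normalize, showing the dot-product/cosine-similarity connection (the vectors are illustrative):

```python
import numpy as np
from sklearn.preprocessing import normalize

a = np.array([[1.0, 2.0, 3.0]])
b = np.array([[2.0, 4.0, 6.0]])  # parallel to a, so cosine similarity is 1

a_n = normalize(a, norm='l2')    # each row scaled to unit l2-norm
b_n = normalize(b, norm='l2')
cos_sim = float(a_n @ b_n.T)     # dot product of unit vectors = cosine similarity

print(round(cos_sim, 6))               # 1.0
print(round(np.linalg.norm(a_n), 6))   # 1.0
```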
6) Box-Cox transformation (stats.boxcox): the Box-Cox transformation is a generalized power transformation method proposed by Box and Cox in 1964. It is a data transformation commonly used in statistical modeling, applied when the continuous response variable does not satisfy the normal distribution. After a Box-Cox transformation, the correlation between the unobservable error and the predictor variables can be reduced to a certain extent. The main feature of the Box-Cox transformation is the introduction of a parameter, estimated from the data itself, which then determines the form of the transformation. The Box-Cox transformation can obviously improve the normality, symmetry, and variance equality of the data, and it is effective for many real-world datasets.
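A sketch of stats.boxcox on skewed positive data (Box-Cox requires strictly positive input; the lognormal sample here is illustrative):

```python
import numpy as np
from scipy import stats

# Right-skewed positive data; boxcox estimates lambda from the data itself
# and returns the transformed values together with that lambda.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=200)

x_transformed, lam = stats.boxcox(x)

print(x_transformed.shape)  # (200,)
```

For lognormal data the estimated lambda is close to 0, which corresponds to a plain log transform, and the skewness of the result is close to 0.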
2. Data sampling:
A. Sampling without replacement
import random
idxTest = random.sample(range(nPoints), nSample)
# idxTest is returned as a list
B. Sampling with replacement
use random.choice (picks one element at a time):
idxBag = []
for i in range(nBagSamples):
    idxBag.append(random.choice(range(nPoints)))
    # choice returns a single element, not a list
For more information, see pandas.DataFrame.sample (note that DataFrame.resample is for time-series frequency conversion, not row sampling).
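A minimal sketch contrasting the two sampling schemes:

```python
import random

random.seed(42)
nPoints, nSample = 10, 4

# Without replacement: all drawn indices are distinct.
idx_without = random.sample(range(nPoints), nSample)

# With replacement: the same index may appear more than once.
idx_with = [random.choice(range(nPoints)) for _ in range(nSample)]

print(len(idx_without), len(set(idx_without)))  # 4 4
print(len(idx_with))                            # 4
```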
Three. Feature engineering:
1. Feature processing: numerical, categorical, time, text, statistical, and combinatorial features; feature derivation, such as:
1) Binarization of quantitative features
2) Dummy-variable encoding of qualitative features
3) Basic transformations of a single variable, e.g. squaring it, taking its square root, log transformation, etc.
4) Deriving variables by adding a time dimension, such as 3-month transaction data, 6-month transaction data, etc.
5) Multivariable operations, e.g. adding two variables, multiplying them, or computing a ratio between variables to obtain a new variable.
6) Dimensionality reduction of multiple variables via PCA or LDA.
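A sketch of PCA dimensionality reduction (the data matrix is illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# 6 samples with 4 pairwise-correlated features reduced to 2 components.
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 2))
X = np.hstack([base, base * 2.0 + 0.1])  # columns 2-3 are linear in columns 0-1

X_reduced = PCA(n_components=2).fit_transform(X)

print(X_reduced.shape)  # (6, 2)
```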
Python sklearn class summary:
2. Feature selection: once data preprocessing is complete, we need to select meaningful features to feed into the machine learning algorithm and model for training. First, based on business understanding, select the features that have a significant effect on the dependent variable, and evaluate each feature's difficulty of acquisition, coverage, and accuracy. Then perform feature selection at the data level. Generally speaking, feature selection considers two aspects:
Whether the feature diverges: if a feature does not diverge, e.g. its variance is close to 0, the samples show essentially no difference in this feature, so it is useless for distinguishing samples.
Correlation between the feature and the target: this is obvious; features with high correlation with the target should be preferred. Except for the low-variance removal method, the other methods introduced here all consider features from the perspective of correlation.
Feature selection has two main purposes: one is to reduce the number of features (dimensionality reduction) so that the model generalizes better and overfitting is reduced; the other is to enhance understanding of the features and feature values.
According to their form, feature selection methods can be divided into three categories:
A. Filter: score each feature according to divergence or correlation, set a threshold or the number of features to keep, and select features. sklearn.feature_selection.SelectKBest
1) Variance selection method
With the variance selection method, first compute the variance of each feature, then select the features whose variance exceeds a threshold. Use the VarianceThreshold class of the feature_selection library to select features.
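A sketch of VarianceThreshold (the middle column is constant, so it is removed):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[1.0, 5.0, 0.0],
              [2.0, 5.0, 1.0],
              [3.0, 5.0, 0.0]])

selector = VarianceThreshold(threshold=0.0)  # removes zero-variance features
X_selected = selector.fit_transform(X)

print(X_selected.shape)                  # (3, 2)
print(selector.get_support().tolist())   # [True, False, True]
```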
2) Correlation coefficient method
With the correlation coefficient method, first compute each feature's correlation coefficient with the target value and the P-value of that correlation coefficient. Use the SelectKBest class of the feature_selection library, combined with the correlation coefficients, to select features.
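sklearn does not ship a Pearson score function directly, so this sketch wraps scipy.stats.pearsonr as an assumed helper (absolute correlations are used for ranking):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import SelectKBest

# Assumed helper, not part of sklearn: returns (scores, pvalues) as
# SelectKBest expects, ranking by absolute Pearson correlation.
def pearson_score(X, y):
    results = [pearsonr(X[:, i], y) for i in range(X.shape[1])]
    scores = np.array([abs(r) for r, p in results])
    pvalues = np.array([p for r, p in results])
    return scores, pvalues

rng = np.random.default_rng(0)
y = rng.normal(size=50)
X = np.column_stack([y + 0.1 * rng.normal(size=50),  # strongly correlated with y
                     rng.normal(size=50),            # noise
                     rng.normal(size=50)])           # noise

selector = SelectKBest(score_func=pearson_score, k=1)
X_new = selector.fit_transform(X, y)

print(selector.get_support().tolist())  # [True, False, False]
```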
3) Chi-square test
The classic chi-square test examines the correlation between a qualitative independent variable and a qualitative dependent variable. Suppose the independent variable has N kinds of values and the dependent variable has M kinds of values. Consider the difference between the observed value A and the expected value E of the sample frequency for which the independent variable equals i and the dependent variable equals j, and construct the statistic:
chi^2 = sum( (A - E)^2 / E )
The meaning of this statistic is simply the correlation between the independent variable and the dependent variable. Use the SelectKBest class of the feature_selection library, combined with the chi-square test, to select features.
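A sketch of SelectKBest combined with the chi-square test (the iris dataset is used purely as an illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Keep the 2 features most associated with the class label under chi-square.
iris = load_iris()
X_new = SelectKBest(chi2, k=2).fit_transform(iris.data, iris.target)

print(iris.data.shape)  # (150, 4)
print(X_new.shape)      # (150, 2)
```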
4) Mutual information method
Classic mutual information also evaluates the correlation between a qualitative independent variable and a qualitative dependent variable. The formula of mutual information is:
I(X; Y) = sum over x, y of p(x, y) * log( p(x, y) / (p(x) * p(y)) )
To handle quantitative data, the maximal information coefficient method was proposed. Use the SelectKBest class of the feature_selection library, combined with the maximal information coefficient method, to select features.
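The maximal information coefficient itself needs a separate package (e.g. minepy); as a stand-in, this sketch scores features with sklearn's built-in mutual_info_classif estimator:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# mutual_info_classif estimates mutual information between each feature
# and the (discrete) target; this is a substitute for MIC, not MIC itself.
iris = load_iris()
selector = SelectKBest(mutual_info_classif, k=2)
X_new = selector.fit_transform(iris.data, iris.target)

print(X_new.shape)  # (150, 2)
```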
B. Wrapper: according to an objective function (usually a prediction score), select several features at a time, or exclude several features. Recursive feature elimination: the recursive feature elimination method trains a base model over multiple rounds; after each round, the features with the smallest weight coefficients are eliminated, and the next round of training is based on the new feature set. Use the RFE class of the feature_selection library to select features.
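A sketch of recursive feature elimination (logistic regression is an assumed base model; one feature is eliminated per round down to 2):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

iris = load_iris()
estimator = LogisticRegression(max_iter=1000)           # assumed base model
rfe = RFE(estimator, n_features_to_select=2, step=1)    # drop 1 feature per round
X_new = rfe.fit_transform(iris.data, iris.target)

print(X_new.shape)        # (150, 2)
print(sum(rfe.support_))  # 2
```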
C. Embedded: first train with some machine learning algorithm or model to obtain the weight coefficient of each feature, then select features by coefficient from large to small. Similar to the filter method, except that the quality of each feature is determined through training.
1) Feature selection based on a penalty term
Using a base model with a penalty term, features are screened out and dimensionality is reduced at the same time. Use the SelectFromModel class of the feature_selection library, combined with a logistic regression model with an L1 penalty term.
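A sketch of SelectFromModel with an L1-penalized logistic regression as the base model (the solver and C value are assumptions for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

iris = load_iris()
# L1 penalty drives some coefficients to exactly zero; liblinear supports l1.
lr = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
selector = SelectFromModel(lr)
X_new = selector.fit_transform(iris.data, iris.target)

print(X_new.shape[0])  # 150 (samples kept, some columns dropped)
```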
2) Feature selection based on a tree model
Among tree models, GBDT can also serve as the base model for feature selection. Use the SelectFromModel class of the feature_selection library combined with a GBDT model.
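A sketch of SelectFromModel with a GBDT base model (the number of estimators is kept small just to make the example quick; the parameters are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
# GBDT feature_importances_ drive the selection (default threshold: mean importance).
gbdt = GradientBoostingClassifier(n_estimators=20, random_state=0)
selector = SelectFromModel(gbdt)
X_new = selector.fit_transform(iris.data, iris.target)

print(X_new.shape[0])  # 150 (samples kept, low-importance columns dropped)
```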
3. Feature validity analysis: analyze feature weights and monitor the effectiveness of features, to prevent feature quality degradation from affecting model performance.
Four. Model selection