Table of contents

* Part 1: Specific implementation steps <https://blog.csdn.net/qq_38251616/article/details/82775192#_4>
* Step 1: Data preprocessing <https://blog.csdn.net/qq_38251616/article/details/82775192#1_5>
* Import the libraries <https://blog.csdn.net/qq_38251616/article/details/82775192#_6>
* Import the dataset <https://blog.csdn.net/qq_38251616/article/details/82775192#_11>
* Encode categorical data as numbers <https://blog.csdn.net/qq_38251616/article/details/82775192#_17>
* Avoid the dummy variable trap <https://blog.csdn.net/qq_38251616/article/details/82775192#_26>
* Split into training and test sets <https://blog.csdn.net/qq_38251616/article/details/82775192#_30>
* Step 2: Train the multiple linear regression model on the training set <https://blog.csdn.net/qq_38251616/article/details/82775192#2__35>
* Step 3: Predict the results on the test set <https://blog.csdn.net/qq_38251616/article/details/82775192#3_41>
* Part 2: Detailed explanation of knowledge points <https://blog.csdn.net/qq_38251616/article/details/82775192#_47>
* 1. On multiple linear regression <https://blog.csdn.net/qq_38251616/article/details/82775192#1__48>
* 2. About the OneHotEncoder() code <https://blog.csdn.net/qq_38251616/article/details/82775192#2_OneHotEncoder_59>
* 3. About toarray() <https://blog.csdn.net/qq_38251616/article/details/82775192#3_toarray_99>
* 4. The dummy variable trap <https://blog.csdn.net/qq_38251616/article/details/82775192#4__109>
----- Code link
<https://github.com/kzbkzb/100-Days-Of-ML-Code-master-kzb/blob/master/Code/3-Day-of-ML-Code-by-Kzb.ipynb>
-----
----- Data link
<https://github.com/kzbkzb/100-Days-Of-ML-Code-master-kzb/blob/master/Data/50_Startups.csv>
-----


<> Part 1: Specific implementation steps

<> Step 1: Data preprocessing

<> Import the libraries
import pandas as pd
import numpy as np
<> Import the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 4].values
<> Encode categorical data as numbers
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])

# One-hot encode the 4th feature (column index 3)
onehotencoder = OneHotEncoder(categorical_features=[3])
X = onehotencoder.fit_transform(X).toarray()
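Note that the `categorical_features` argument used above was removed in scikit-learn 0.22, so on current versions this step is usually done with `ColumnTransformer`. A minimal sketch of the equivalent preprocessing, using a small hand-made stand-in for the first rows of `50_Startups.csv` (the exact values here are illustrative, not the real data):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for 50_Startups.csv: three numeric columns
# followed by a categorical "State" column at index 3.
X = np.array([
    [165349.2, 136897.8, 471784.1, 'New York'],
    [162597.7, 151377.6, 443898.5, 'California'],
    [153441.5, 101145.6, 407934.5, 'Florida'],
], dtype=object)

# One-hot encode column 3; pass the numeric columns through unchanged.
ct = ColumnTransformer(
    [('state', OneHotEncoder(), [3])],
    remainder='passthrough',
)
X_encoded = ct.fit_transform(X)
print(X_encoded.shape)  # 3 one-hot columns + 3 numeric columns -> (3, 6)
```

The one-hot columns are placed first in the output, which is why dropping the leading column (see the dummy variable trap below) still works the same way.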
<> Avoid the dummy variable trap
X = X[:, 1:]
<> Split the dataset into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
<> Step 2: Train the multiple linear regression model on the training set
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, Y_train)
<> Step 3: Predict the results on the test set
y_pred = regressor.predict(X_test)
<> Part 2: Detailed explanation of knowledge points

<>1. On multiple linear regression

Simple linear regression: only one factor influences Y.
Multiple linear regression: more than one factor influences Y.

Like simple linear regression, multiple linear regression is naturally a regression problem.

Univariate linear regression equation: Y = aX + b.
Multiple linear regression equation: Y = a1X1 + a2X2 + a3X3 + ... + anXn + b.

It is the n-variable counterpart of the one-variable linear equation from high school: Y is still the same Y, only the number of independent variables has grown.
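As a tiny illustration of the n-variable case (with made-up data, not the 50_Startups set): if Y is generated exactly from a known equation, fitting a multiple linear regression recovers its coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data generated from Y = 5 + 2*X1 + 3*X2 with no noise,
# so the fitted model should recover these coefficients exactly.
rng = np.random.default_rng(0)
X = rng.random((50, 2))
Y = 5 + 2 * X[:, 0] + 3 * X[:, 1]

model = LinearRegression().fit(X, Y)
print(round(model.intercept_, 6))   # ~5.0
print(np.round(model.coef_, 6))     # ~[2.0, 3.0]
```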

<>2. About the OneHotEncoder() code

In practical machine learning tasks, features are not always continuous values; some are categorical, such as gender, which takes the values "male" and "female". Such features usually need to be converted to numbers first. Take the example below:

There are three categorical features:

* Gender: ["male", "female"]
* Region: ["Europe", "US", "Asia"]
* Browser: ["Firefox", "Chrome", "Safari", "Internet Explorer"]

Digitize them with LabelEncoder:

* Gender: [0, 1]
* Region: [0, 1, 2]
* Browser: [0, 1, 2, 3]
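The digitization step can be reproduced directly. One caveat worth knowing: LabelEncoder assigns integers in sorted (alphabetical) order, so the actual codes may not follow the order the categories are listed in above.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
browsers = ["Firefox", "Chrome", "Safari", "Internet Explorer"]
codes = le.fit_transform(browsers)

# Codes follow the sorted category order:
# Chrome=0, Firefox=1, Internet Explorer=2, Safari=3
print(codes.tolist())  # [1, 0, 3, 2]
```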
Then encode with OneHotEncoder:
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
# Without toarray(), the output is in sparse storage format
# (index plus value); passing sparse=False achieves the same effect.
ans = enc.transform([[0, 1, 3]]).toarray()
print(ans)  # output: [[1. 0. 0. 1. 0. 0. 0. 0. 1.]]
Explanation: for the input array, each row is still treated as a sample and each column as a feature.

* Start with the first feature, i.e. the first column [0, 1, 0, 1]. It takes two values, 0 or 1, so one-hot uses two bits for it: [1, 0] means 0 and [0, 1] means 1. The first two bits of the example output, [1, 0, ...], therefore say this feature is 0.
* The second feature, the second column [0, 1, 2, 0], takes three values, so one-hot uses three bits: [1, 0, 0] means 0, [0, 1, 0] means 1, [0, 0, 1] means 2. Bits three through five of the example output, [..., 0, 1, 0, ...], say this feature is 1.
* The third feature, the third column [3, 0, 1, 2], takes four values, so one-hot uses four bits: [1, 0, 0, 0] means 0, [0, 1, 0, 0] means 1, [0, 0, 1, 0] means 2, [0, 0, 0, 1] means 3. The last four bits of the example output, [..., 0, 0, 0, 1], say this feature is 3.
In short, "male", "US", "Safari" after LabelEncoder and OneHotEncoder becomes: [[1. 0. 0. 1. 0. 0. 0. 0. 1.]]

More on one-hot encoding: "Analysis of OneHotEncoder in scikit-learn"
<https://www.cnblogs.com/zhoukui/p/9159909.html>

<>3. About toarray()

In this code, toarray() converts the sparse matrix returned by OneHotEncoder into a dense array.

Python natively has no array concept; this differs from object-oriented languages such as Java. Python's built-in list can be used like an array, but the two are fundamentally different.

The essential difference between a list and an array: the elements of a list may be stored at non-contiguous memory addresses, whereas an array's elements are stored contiguously.
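A small sketch of what toarray() actually does here, using a SciPy sparse matrix directly (the same type OneHotEncoder returns):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A mostly-zero matrix stored sparsely: only the nonzero entries
# (index plus value) are kept in memory.
sparse = csr_matrix([[1, 0, 0], [0, 0, 2]])
print(sparse)  # prints "(row, col)  value" pairs for the nonzero entries

# toarray() expands it to an ordinary dense NumPy array.
dense = sparse.toarray()
print(dense.tolist())  # [[1, 0, 0], [0, 0, 2]]
print(type(dense))     # <class 'numpy.ndarray'>
```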

A more detailed explanation: "The difference between lists and arrays in Python"
<https://docs.lvrui.io/2016/07/24/Python%E4%B8%AD%E5%88%97%E8%A1%A8%E4%B8%8E%E6%95%B0%E7%BB%84%E7%9A%84%E5%8C%BA%E5%88%AB/>

<>4. The dummy variable trap


The dummy variable trap refers to the case of two or more highly correlated variables (here, exactly two). In short, one of the variables can be predicted from the others. Take an intuitive example with a redundant category (variable): if we drop the male dummy column, the information is still carried by the female column (when female is 0 the sample is male; when it is 1 the sample is female), and vice versa.

That is enough to understand the dummy variable trap here; we will discuss it in more depth later.
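The trap can be made concrete with pandas on a hypothetical gender column: keeping both dummy columns makes them perfectly collinear (they always sum to 1), which is exactly why one column is dropped, just as `X = X[:, 1:]` does above.

```python
import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# Full one-hot: the two columns always sum to 1 -> perfectly collinear.
full = pd.get_dummies(df["gender"], dtype=int)
print((full["male"] + full["female"]).tolist())  # [1, 1, 1, 1]

# drop_first=True removes one redundant column, avoiding the trap.
safe = pd.get_dummies(df["gender"], drop_first=True, dtype=int)
print(list(safe.columns))  # ['male'] (female is implied by male == 0)
```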