Table of contents

* I. Implementation steps <https://blog.csdn.net/qq_38251616/article/details/82775192#_4>
* Step 1: Data preprocessing <https://blog.csdn.net/qq_38251616/article/details/82775192#1_5>
* Import libraries <https://blog.csdn.net/qq_38251616/article/details/82775192#_6>
* Import the dataset <https://blog.csdn.net/qq_38251616/article/details/82775192#_11>
* Encode categorical data <https://blog.csdn.net/qq_38251616/article/details/82775192#_17>
* Avoid the dummy variable trap <https://blog.csdn.net/qq_38251616/article/details/82775192#_26>
* Split the dataset into training and test sets <https://blog.csdn.net/qq_38251616/article/details/82775192#_30>
* Step 2: Train the multiple linear regression model on the training set <https://blog.csdn.net/qq_38251616/article/details/82775192#2__35>
* Step 3: Predict results on the test set <https://blog.csdn.net/qq_38251616/article/details/82775192#3_41>
* II. Detailed explanation of key points <https://blog.csdn.net/qq_38251616/article/details/82775192#_47>
* 1. Multiple linear regression <https://blog.csdn.net/qq_38251616/article/details/82775192#1__48>
* 2. OneHotEncoder <https://blog.csdn.net/qq_38251616/article/details/82775192#2_OneHotEncoder_59>
* 3. toarray() <https://blog.csdn.net/qq_38251616/article/details/82775192#3_toarray_99>
* 4. The dummy variable trap <https://blog.csdn.net/qq_38251616/article/details/82775192#4__109>
Code: <https://github.com/kzbkzb/100-Days-Of-ML-Code-master-kzb/blob/master/Code/3-Day-of-ML-Code-by-Kzb.ipynb>
Data: <https://github.com/kzbkzb/100-Days-Of-ML-Code-master-kzb/blob/master/Data/50_Startups.csv>

<> I. Implementation steps

<> Step 1: Data preprocessing

<> Import libraries
import pandas as pd
import numpy as np
<> Import the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values  # every column except the last is a feature
Y = dataset.iloc[:, 4].values    # column index 4 is the target
<> Encode categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
# One-hot encode the 4th feature (column index 3).
# Note: the categorical_features parameter was removed in scikit-learn 0.22;
# on newer versions, apply OneHotEncoder to that column through ColumnTransformer.
onehotencoder = OneHotEncoder(categorical_features=[3])
X = onehotencoder.fit_transform(X).toarray()
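The categorical_features argument used above is not available in recent scikit-learn releases (it was removed in 0.22). The following is a minimal sketch of an equivalent step using ColumnTransformer. It is not the author's code: the small in-line array is a hypothetical stand-in for 50_Startups.csv, and its values are invented.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the feature matrix X: three numeric columns
# followed by one categorical column at index 3 (values are invented).
X = np.array([
    [165.0, 136.0, 471.0, "New York"],
    [162.0, 151.0, 443.0, "California"],
    [153.0, 101.0, 407.0, "Florida"],
    [144.0, 118.0, 383.0, "New York"],
], dtype=object)

# One-hot encode column 3 and pass the remaining columns through unchanged.
# sparse_threshold=0 asks ColumnTransformer to always return a dense array.
ct = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), [3])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = np.asarray(ct.fit_transform(X), dtype=float)

print(X_encoded.shape)  # 3 one-hot columns + 3 numeric columns
```

In this sketch the one-hot columns come first in the output, so dropping column 0 afterwards still removes one dummy column. No separate LabelEncoder step is needed, because OneHotEncoder accepts string categories directly.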
<> Avoid the dummy variable trap
X = X[:, 1:]  # drop the first one-hot column
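The reason for dropping one column: the one-hot columns of a feature always sum to 1, which duplicates the constant column a linear model uses for its intercept, so the columns become linearly dependent. The NumPy sketch below illustrates this with made-up category codes (the codes and sample count are assumptions chosen only for illustration).

```python
import numpy as np

# Hypothetical category codes for six samples (three categories).
codes = np.array([0, 1, 2, 0, 1, 2])
one_hot = np.eye(3)[codes]            # full one-hot encoding, shape (6, 3)
intercept = np.ones((6, 1))           # constant column used for the intercept

full = np.hstack([intercept, one_hot])            # all dummy columns kept
reduced = np.hstack([intercept, one_hot[:, 1:]])  # first dummy column dropped

# With all dummies kept, the intercept column equals the sum of the dummy
# columns, so the rank is smaller than the number of columns.
rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full)        # 4 3
print(reduced.shape[1], rank_reduced)  # 3 3
```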
<> Split the dataset into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
<> Step 2: Train the multiple linear regression model on the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
<> Step 3: Predict results on the test set
y_pred = regressor.predict(X_test)
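To judge the predictions, y_pred can be compared with Y_test. The sketch below computes the coefficient of determination R² by hand. The two arrays are invented placeholders, not results obtained from the 50_Startups data.

```python
import numpy as np

# Hypothetical true values and predictions (placeholders, not real results).
y_true = np.array([10.0, 12.0, 15.0, 19.0])
y_hat = np.array([11.0, 12.0, 14.0, 18.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print(r2)
```

For a fitted LinearRegression, regressor.score(X_test, Y_test) reports this same R² quantity.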
<> II. Detailed explanation of key points

<> 1. Multiple linear regression

Simple linear regression: only one factor influences Y.
Multiple linear regression: several factors influence Y.

Like simple linear regression, multiple linear regression is a regression problem.

Simple linear regression equation: Y = aX + b.
Multiple linear regression equation: Y = a1X1 + a2X2 + a3X3 + … + anXn + b.

This is like going from the linear equation in one variable learned in high school to a linear equation in n variables. Y is still the same Y; there are simply more independent variables.
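As a minimal sketch of what fitting such an equation means, the NumPy example below recovers the coefficients of a two-variable linear equation by least squares. The data is synthetic and the coefficients 2, 3 and 5 are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                 # two features X1, X2
Y = 2.0 * X[:, 0] + 3.0 * X[:, 1] + 5.0      # noise-free target

# Append a column of ones so the last fitted weight acts as the intercept.
A = np.hstack([X, np.ones((50, 1))])
weights, *_ = np.linalg.lstsq(A, Y, rcond=None)

print(np.round(weights, 6))  # coefficients for X1, X2, then the intercept
```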

<> 2. OneHotEncoder

In practical machine learning tasks, features are not always continuous values; they may be categorical. Gender, for example, can be "male" or "female". Such features usually need to be converted to numbers. Take the following example:

There are three categorical features:

* Gender: ["male", "female"]
* Region: ["Europe", "US", "Asia"]
* Browser: ["Firefox", "Chrome", "Safari", "Internet Explorer"]

Mapped to integers, they become:

* Gender: [0, 1]
* Region: [0, 1, 2]
* Browser: [0, 1, 2, 3]
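A minimal sketch of this integer mapping in plain Python follows. It assigns each category its position in the list, which is an assumption made to match the codes shown here; scikit-learn's LabelEncoder assigns codes in sorted order instead.

```python
features = {
    "gender": ["male", "female"],
    "region": ["Europe", "US", "Asia"],
    "browser": ["Firefox", "Chrome", "Safari", "Internet Explorer"],
}

# Map every category to its position in the list.
codes = {
    name: {value: index for index, value in enumerate(values)}
    for name, values in features.items()
}

sample = ("male", "US", "Internet Explorer")
encoded = [codes[name][value] for name, value in zip(features, sample)]
print(encoded)  # [0, 1, 3]
```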
Then encode them with OneHotEncoder:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
# Without toarray(), the output is a sparse matrix, i.e. stored as indices plus values.
# Passing sparse=False to OneHotEncoder gives the same dense result
# (the parameter is named sparse_output in scikit-learn 1.2 and later).
ans = enc.transform([[0, 1, 3]]).toarray()
print(ans)  # output: [[ 1. 0. 0. 1. 0. 0. 0. 0. 1.]]
Explanation: in the input array, each row is a sample and each column is a feature.

* The first feature is the first column, [0,1,0,1]. It takes two values, 0 or 1, so one-hot encoding uses two bits for it: [1,0] represents 0 and [0,1] represents 1. The first two bits of the output above, [1,0,…], mean this feature is 0.
* The second feature is the second column, [0,1,2,0]. It takes three values, so one-hot encoding uses three bits for it: [1,0,0] represents 0, [0,1,0] represents 1 and [0,0,1] represents 2. The third to fifth bits of the output above, […,0,1,0,…], mean this feature is 1.
* The third feature is the third column, [3,0,1,2]. It takes four values, so one-hot encoding uses four bits for it: [1,0,0,0] represents 0, [0,1,0,0] represents 1, [0,0,1,0] represents 2 and [0,0,0,1] represents 3. The last four bits of the output above, […,0,0,0,1], mean this feature is 3.
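Put simply, with the integer codes listed above, [[1. 0. 0. 1. 0. 0. 0. 0. 1.]] is the one-hot encoding of the sample [0, 1, 3], that is "male", "US", "Internet Explorer". (LabelEncoder itself assigns codes in sorted order, so the codes it produces would differ from this position-based illustration.)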
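The bit-by-bit reasoning above can be sketched without scikit-learn. The NumPy code below is an illustrative re-implementation, not how OneHotEncoder is written internally; it builds one block of bits per feature from the values seen in the fitted data.

```python
import numpy as np

train = np.array([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
sample = np.array([0, 1, 3])

# Distinct sorted values seen in each column.
categories = [np.unique(train[:, j]) for j in range(train.shape[1])]

# One block of bits per feature, with a single 1 at the matching value.
blocks = [(cats == value).astype(float) for cats, value in zip(categories, sample)]
encoded = np.concatenate(blocks)

print(encoded)  # [1. 0. 0. 1. 0. 0. 0. 0. 1.]
```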

For more on one-hot encoding, see: An analysis of OneHotEncoder in scikit-learn
<https://www.cnblogs.com/zhoukui/p/9159909.html>

<> 3. toarray()

toarray(): converts the sparse matrix returned by OneHotEncoder into a regular (dense) NumPy array.

Unlike languages such as Java, Python has no built-in general-purpose array type. Python's built-in list works much like an array, but the two differ in an essential way.

The essential difference between a list and an array: the elements of a list need not be stored at contiguous memory addresses, whereas the elements of an array are stored contiguously.
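A small sketch of this difference, assuming "array" means a NumPy array: a list holds references to objects of any type, while a NumPy array stores fixed-size values of one type in a single contiguous block.

```python
import numpy as np

values = [1, "two", 3.0]                    # a list can mix element types
arr = np.array([1, 2, 3], dtype=np.int64)   # an array has one element type

print(arr.dtype, arr.itemsize)     # int64 8
print(arr.strides)                 # (8,): each element sits 8 bytes after the previous one
print(arr.flags["C_CONTIGUOUS"])   # True
```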

For a more detailed explanation, see: The difference between list and array in Python