One ,Multiple Features — Multidimensional features  
This section will introduce a more effective linear regression form . This form is applicable to multiple variables or features .
So far , We discussed univariate / Regression model of characteristics , as follows . Housing area x Forecast house price y. The following formula is what we call “ hypothesis ”, among x Is the only characteristic quantity .
Now we add more features to the house price model , For example, number of rooms, floors, etc , Construct a model with multiple variables , The features in the model are (x1,x2,...xn).

After adding more features , We introduce a new set of comments :

n Number of representative features

x(i) For and on behalf of i Training examples , It's the number one in the feature matrix i That's ok , It's a vector (vector).

Let's say , Above ,

In the representative characteristic matrix i Line No j Features , That is to say i Training example No j Features .

For example, it represents the second example in the above figure 3 Characteristic quantity .

Support multivariate hypothesis h Expressed as :

There are n+1 Parameters and n Variables , this n Variables represent the n Characteristic quantity . In order to simplify the formula , introduce x0=1, Then the formula is converted to :

for example : Company k$

It can be used to predict the price of a house after a period of time : The starting price of the house is 80K ,0.1x1 It means the price per unit area rises after a period of time 0.1K$,
The price of the house will vary with the number of houses ( use x2 express ) Increased by 0.01*x2, Will increase with the number of floors 3*x3, -2*X4 Indicates that the price of the house will depreciate with the increase of use time .

The parameter in the model is a n+1 dimension ( First dimension X0 It's a constant 1, of course , It can also be said that this is an additional characteristic quantity defined by us ) Vector of , Any training example is also n+1 Vector of dimension , Characteristic matrix
X The dimension of is  m*(n+1). At this point, our eigenvectors and parameter vectors can be expressed in the following forms :

The hypothesis can be rewritten as :

So the formula can be simplified as :, Superscript T Transposition of representative matrix .

This is the hypothetical form in the case of multiple eigenvalues , Another name is multiple linear regression .“ multivariate ” It refers to multiple characteristic quantities or variables used for prediction , It's just a little more pleasant .
Two ,Gradient Descent for Multiple Variables — Multivariable gradient descent
In the previous section, we discussed multivariable ( Or multiple features ) Hypothesis form of linear regression , This section describes how to set parameters for this assumption , Especially, how to use gradient descent to deal with multiple linear regression .
Quickly summarize the variable marks , as follows :  Hypothesis Is the hypothesis form of multiple linear regression , By convention x0 =
1. Parameters for this model include θ0~θn, We don't see them as n Independent variables , But as a
n+1 Vector of dimension . So we can take the parameters of this model as a vector of the model itself . The cost function is specified by the sum of the squares of the error terms , But not J As a being n+1 Functions with arguments , Instead, treat it as a parameter of θ Functions of vectors .

Here's the gradient drop , We need to update every θj. among α Is the rate of learning (learning rate), Derivative part is cost function to parameter θj Find partial derivative :
Now let's see what it looks like to use gradient descent method .
below , On the left is N=1 Gradient descent method of time , There are two independent update rules , Corresponding parameters θ0 and θ1. Circle part is equivalent to cost function J Yes θ0 Find partial derivative . On the right
N>=1 Gradient descent method of time , Wreath part is equivalent to cost function J Yes θj Find partial derivative .
It should be explained why the above two algorithms are the same , Why gradient descent algorithm . Let's look at the following example , We have 3 Eigenvalues θ0~θ2, Update with three update rules θ0~θ2.
observation θ0 Update rules for , It can be found that , It is associated with N=1 Temporal θ0 Update rules are the same .( The reason for the starting price is , In our symbolic Convention , Yes x(i)0 = 1 Agreement of )
observation θ1 Update rules for , It can be found that , It is associated with N=1 Temporal θ1 The update rules are actually the same . We just used new symbols x(i)1 To represent the first feature .

Three ,Gradient Descent in Practice I-Feature Scaling

— Gradient descent method practice 1 Feature scaling

This section and the next section will explain some practical skills in gradient operation , Make the operation effect of gradient descent method better . In order to speed up the convergence of gradient descent method , In this section, we will explain a method called feature scaling (feature
scaling) Method of .
Now there's a machine learning problem , Multiple features . What you need to do is make sure that the values of these characteristics have similar ranges , So when using gradient descent method in this problem , It will converge faster .
For example: , There are two other characteristics of the following questions x1,x2. among x1 It's the size of the house , And x1∈(0,2000);x2 It's the number of rooms , And x2∈(1,5).

Draw the cost function J(θ) Outline of . Cost function J(θ) It's about θ0,θ1,θ2 Function of , because θ0 It's a constant , Only the position of the contour map in the coordinate system will be affected , Does not affect its shape , So not for now θ0. Just draw J(θ) about θ1,θ2 Figure of .

because x1 The value range of is far greater than x2 Value range of , So the contour of the cost function is flat and oblique ,2000:5 The scale of will make the ellipse in the contour more slender . If we run the gradient descent algorithm on the cost function . It can take a long time to converge to the global minimum , As shown on the right .
If you exaggerate the picture , As shown on the left , It could be worse , It might even oscillate back and forth , To find a way to the global minimum .

Primary school , Here's a question , That is, why the convergence path in the figure above is not the figure below ( Editor's own thinking ) Medium orange , Go straight to the bottom , Why take a turn first , Now in retrospect, too “ wet behind the ears ” La :

After thinking later , Maybe that's why : Think about a problem , Using gradient descent method , At some point , When deciding which direction to go next , How to choose the direction ? The point is on a 3D image , So it should be 360° Tangent lines in all directions , The direction we choose is the direction with the largest tangent slope . On image , As shown below , Suppose a person is standing at the black spot in the picture , The area is too large , His field of vision can only be in the black circle in the picture , He's in the circle , At current location , The steepest direction you can see is the direction marked by the big red arrow .
therefore , When the proportion of cost function parameters is too large , An effective method is feature scaling .
say concretely , You can put features x1 Defined as house area /2000, features x2 Defined as number of rooms /5, Such eigenvalues x1,x2 It's all in [0,1] Inside . At this time, the contour of the cost function is a regular circle , The gradient descent method will soon find a shortcut to the minimum value . therefore , Scale by minimum , The range of eigenvalues can be eliminated .

on the whole , The purpose of feature scaling is to constrain the value of feature to [-1,+1] In scope .
features x0 Always value is 1, So it always satisfies x0∈[-1,+1]. As for other characteristics , It may need to be treated in some way ( such as , Divide each by a different number ), Keep it in the same range .
be careful :-1,+1 These two numbers , It's not that important . Such as features x1∈(0,3), And characteristics x2∈(-1,2), It doesn't matter , Because it's very close [-1,1] The scope of . If characteristic x3∈[-100,+100], that x3 And [-1,+1] It's very far away , Already O(10^2) There's an order of magnitude gap .X3 It's probably a feature that doesn't scale very well . of course , If characteristic x4∈(-0.00001,+0.00001), It's not appropriate .
But worry about using too much , Is the range of eigenvalues too large or too small , Because as long as they're close enough , So the gradient descent method can work normally .

In feature scaling , In addition to dividing the feature by the maximum value , Mean normalization is also possible (mean normalization).
Mean normalization means , For features xi, It can be used xi-ui To replace it , So that the average value of all features is 0.

about x0=1, This processing is not required , Because its value is equal to 1, Mean cannot be equal to 0. For other features , For example, features representing the area of a house x1∈(0,2000), If the characteristic value of the house area , The average is 1000, Then you can x1 Proceed as follows . Another example , The number of bedrooms in the room is [0,5], An average house has two bedrooms , Then it can be normalized as follows x2.

like this , Then we can work out new features x1,x2, So that their range can be [-0.5,0.5] between .
More general , Mean normalization can be expressed as the following formula :
among x1 Is characteristic ,u1 Is the characteristic of all samples in the training set x1 Average of .S1 yes x1 Scope of , Namely . Standard deviation can also be used as S1.

Now you can see , If you press S1= Maximum - minimum value , So the denominator above 5 It will become 4 了 , But it doesn't matter , As long as this number can make the range of features closer, it is OK . So , Feature scaling does not need to be too precise , Just to make the gradient drop run faster .
Next section , Another approach will be introduced , Make the gradient drop run faster .
Four ,Gradient Descent in Practice II-Learing Rate

— Gradient descent method practice 2 Learning rate
This section will introduce some other skills about gradient descent algorithm , Around learning rate α Expand discussion .
Here are the update rules of gradient descent algorithm . first , We will show you how to debug , And some skills to make gradient descent algorithm work correctly . second , Learning how to choose learning rate α.

Here's how to ensure that the gradient descent algorithm works correctly .
The task of gradient descent algorithm is to θ Find a value , To make the cost function J(θ) Take the minimum value .

In order to judge whether the gradient descent algorithm converges or not , You can draw it J(θ) Curve changing with the number of iterations , Abscissa is the iteration step of gradient descent algorithm ( be careful , Not a parameter θ), Ordinate is cost function J(θ) Value of , Each point in the graph corresponds to a θ value . It can be seen that , Steps in iteration 300~400 between , The cost function is hardly decreasing , therefore , It can be said that the cost function J(θ) The number of steps in the iteration is equal to 400 Time convergence .
This graph can help us to see if the cost function converges ,

Some automatic convergence tests can also be performed , In other words, an algorithm is used to tell you whether the gradient descent algorithm converges .
A typical example is , If the cost function J(θ) Value of , Reduce to a very small number ε, Then we can think that the function has converged , For example, you can choose ε=e^(-3).
But actually we need to choose a suitable threshold ε It's very difficult , So in order to judge whether the gradient descent algorithm converges or not , The most common is the simple drawing method above .

in addition ,“J(θ)— Iteration steps ” Curves can also be used when the algorithm is not running properly , Warn in advance . for example , if “J(θ)— Iteration steps ” The curve looks like this , Namely J(θ) It increases with the number of iteration steps , So obviously , At this time, the gradient descent algorithm does not work correctly . This usually means that a smaller learning rate should be used α.

If J(θ) On the rise , So the most common reason is α Too large , It is easy to stagger the minimum value in iteration , Centered on minimum , Iterative , Growing . Obviously , The solution is to reduce the learning rate α. Of course, it could be a code error , So we need to check it carefully .
It may also be as shown in the lower left corner , Namely J(θ) Decreasing , enlarge , Wavy change , The reason for this is also likely to be α Too large , The solution is naturally smaller α了.
But that's not to say α The smaller the better , Because if α Too small , The convergence of gradient descent algorithm may be very slow .

To summarize , Learning rate α Too small , The convergence of gradient descent algorithm may be very slow ; Learning rate α too big , It may cause gradient descent algorithm not to decline in a few iterations , On the contrary, it increases , Can't even converge . of course α too big , It may also lead to slow convergence of the algorithm . And to find out what happened , You can draw it J(θ) Curve changing with the number of iterations .
In the actual work , You can try 多选几个α值试试,分别画出其“J(θ)—迭代步数”曲线,选择使算法收敛最快的α作为最终值.通常,可以选择α的值,间隔3倍.例如...

五 Features and Polynomial Regression 


不要忘了我们这节的目的,是为了讲解选择特征的方法.那么思考一下,只能像上面这样选择特征吗? 当然不是,还有其他的选择方法.例如我们令Area =
frontage * depth,那么就可以只用Area这一个变量作为模型的特征. 即 hθ(x) = θ0 + θ1*x .x即面积Area.
与选择特征密切相关的一个概念是多项式回归(Polynomial Regression).假如有如下住房价格数据集,为了拟合数据,可
假设我们使用三次函数作为选用的模型. 按照以前的假设形式(hθ(x) = θ0 +θ1x1 +θ2x2 +θ3x3 —— 假设公式
而如果我们想拟合下面的三次模型(θ0 +θ1x + θ2x^2 + θ3x^3)呢? 我们现在讨论的是预测房子的价格,房价 =θ0 +θ1*(房子面积) +
θ2*(房子面积)^2 + θ3*(房子面积)^3,即 房价 =θ0 +θ1*(size) + θ2*(size)^2 + θ3*(size)^3 ——

x1 = (size)

x2 = (size)^2

x3 = (size)^3
这种思想总结起来就是:将特征像上面这样设置,再应用线性回归(hθ(x) = θ0 +θ1x1 + θ2x2 + θ3x3)的方法,就可以 拟合三次函数模型.
如果像这样(x1 = ,x2 = ,x3 = )选择数据,那么特征值的归一化就更为重要了.因为size ∈ [0,10^3] ,size^2∈[1,10^6],

向下面这样,一个二次模型可能不能对数据进行很好的拟合,但除了转而使用三次模型外,我们还可以采用另外的方法.例如 将公式改为下面的形式可能就可以了.

我们还讨论了如何选择特征,例如我们不使用房屋的临街宽度和纵向深度,而是使用它们的乘积,从而得到房屋的 土地面积这个特征.
六 正规方程——Normal Equation 
到目前为止,在线性回归问题中,为了减小代价函数,我们一直使用的是梯度下降算法. (线性回归,梯度下降,正规方程,这三者之间有什么关系呢?可以参看网址:
机器学习_线性回归,梯度下降算法与正规方程 <>
故,我们只需确定使得cost最小的参数即可.求使cost最小的参数,可以使用梯度下降算法或正规方程法. )


举例说明该方法,假设有一个非常简单(简单在θ是实数)的代价函数J(θ) = αθ^2 + bθ +
那么如何最小化这个二次函数呢? 从微积分的角度来说,就是求导,令导数为0,求解此等式就可以得到使J(θ)最小的参数θ.

那应该怎么做呢?举例说明. 如下,一个训练集中有m=4个训练样本,为了实现正规方程法,首先
维向量,m是训练样本数量,n是特征变量数,n+1是因为加了一个额外的她特征变量x0. 最后,计算下面的公式

接下来构建矩阵X,这也被称为设计矩阵(design matrix).构建矩阵的方法是:  



最后,什么时候应该用梯度下降法,什么时候应该用正规方程法呢?下面是他们的优缺点. 假设有m个训练样本,n个特征变量.
* 需要选择学习速率α  这意味着需要运行多次,尝试不同的α,找到运行效果最好的那个.这是一种额外的工作和麻烦.

* 需要进行多次迭代计算  某些细节问题可能还会导致迭代的很慢.

* 不需要选择学习速率α
* 不需要进行多次迭代,一次计算即可.  所以,也不需要画出J(θ)的曲线,来检查收敛性.也不需要采取其他的额外步骤.

梯度下降法的优点: 即便有很多特征,也能运行地很好

七 Normal Equation Noninvertibility — 正规方程的不可逆性
对于计算θ的公式θ = (X^T.*X)^(-1).*X^T.*y

一种可能的原因是,在学习问题中有多余的功能,例如在预测住房价格时,如果x1是以英尺计算的住房面积,x2是以平方米计算的住房面积,我们知道1m =
3.28英尺,所以x1,x2之间始终能满足某种转换,即X1 = (3.28)^2 * X2.这样就导致(X^T.*X)不可逆.
 另一种原因是,使用了过多的特征(eg.m <= n),例如有m=10个样本,选用了n=100个特征,加上x0,就是101个特征了,