One, Multiple Features
This section introduces a more general form of linear regression, one that works with multiple variables, or features.
So far we have discussed the univariate (single-feature) regression model, in which the house area x is used to predict the house price y. The formula hθ(x) = θ0 + θ1x is what we call the "hypothesis", where x is the only feature.
Now we add more features to the house price model, for example the number of rooms and the number of floors, and build a model with multiple variables. The features in the model are (x1, x2, ..., xn).

With more features, we introduce some new notation:

n is the number of features.

x(i) is the i-th training example, that is, the i-th row of the feature matrix. It is a vector.

For example, in the figure above, x(2) is the vector of features of the second training example.

x(i)j is the j-th feature in the i-th row of the feature matrix, that is, the j-th feature of the i-th training example.

For example, x(2)3 is the third feature of the second example in the figure above.

The hypothesis h for multiple variables is written as:

hθ(x) = θ0 + θ1x1 + θ2x2 + ... + θnxn

There are n+1 parameters and n variables, and the n variables are the n features. To simplify the formula, introduce x0 = 1, and the formula becomes:

hθ(x) = θ0x0 + θ1x1 + θ2x2 + ... + θnxn

For example (unit: K$):

hθ(x) = 80 + 0.1x1 + 0.01x2 + 3x3 - 2x4

This can be used to predict the price of a house: the base price is 80K; 0.1x1 means that the price rises by 0.1K$ for each additional unit of area (x1); the price rises by 0.01*x2 with the number of rooms (x2) and by 3*x3 with the number of floors (x3); and -2*x4 means that the price falls as the age of the house (x4) increases.

The parameter θ of the model is an (n+1)-dimensional vector. Every training example is also an (n+1)-dimensional vector, whose first component x0 is the constant 1 (it can also be seen as an extra feature that we define ourselves). The feature matrix X has dimension m*(n+1). The feature vector and the parameter vector can then be written as:

x = [x0, x1, ..., xn]^T,  θ = [θ0, θ1, ..., θn]^T

The hypothesis can be rewritten as:

hθ(x) = θ0x0 + θ1x1 + ... + θnxn

So the formula simplifies to hθ(x) = θ^T x, where the superscript T denotes the transpose.

This is the form of the hypothesis when there are multiple features. Another name for it is multivariate linear regression. "Multivariate" just means that multiple features or variables are used for prediction; it is simply a fancier name.
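
As a small illustration of the hypothesis hθ(x) = θ^T x, the sketch below evaluates it in plain Python using the example parameters above. The function name `hypothesis` and the feature values of the sample house are made up for illustration only.

```python
def hypothesis(theta, features):
    """Compute h_theta(x) = theta^T x, prepending x0 = 1 to the features."""
    x = [1.0] + list(features)  # x0 = 1 by convention
    if len(theta) != len(x):
        raise ValueError("theta must have exactly one more entry than features")
    return sum(t * xi for t, xi in zip(theta, x))


# Parameters of the example model, in K$: 80 + 0.1*x1 + 0.01*x2 + 3*x3 - 2*x4
theta = [80.0, 0.1, 0.01, 3.0, -2.0]

# Hypothetical house (made-up values): area 100, 3 rooms, 2 floors, 10 years old
house = [100.0, 3.0, 2.0, 10.0]

price = hypothesis(theta, house)
print(round(price, 2))  # 80 + 10 + 0.03 + 6 - 20 = 76.03
```

Prepending x0 = 1 inside the function keeps the caller from having to remember the convention, while θ0 is still treated like any other parameter.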
Two, Gradient Descent for Multiple Variables
In the previous section we discussed the form of the hypothesis for linear regression with multiple variables (or multiple features). This section describes how to fit the parameters of that hypothesis, in particular how to use gradient descent for multivariate linear regression.
To summarize the notation quickly: the hypothesis of multivariate linear regression is hθ(x) = θ^T x = θ0x0 + θ1x1 + ... + θnxn, with x0 = 1 by convention. The parameters of this model are θ0 to θn. We do not think of them as n+1 separate variables but as a single (n+1)-dimensional vector θ, so the parameters of the model are themselves a vector. The cost function is given by the sum of the squared error terms, J(θ) = (1/(2m)) * Σ(i=1..m) (hθ(x(i)) - y(i))^2. Likewise, we do not treat J as a function of n+1 arguments but as a function of the parameter vector θ.

Here is gradient descent. We repeatedly update every θj (simultaneously for j = 0, ..., n):

θj := θj - α * ∂J(θ)/∂θj

where α is the learning rate and the derivative term is the partial derivative of the cost function with respect to the parameter θj.
Now let's see what gradient descent looks like when we apply it.
Below, on the left is gradient descent for n = 1, with two separate update rules for the parameters θ0 and θ1. The circled part is the partial derivative of the cost function J with respect to θ0. On the right is gradient descent for n >= 1:

θj := θj - α * (1/m) * Σ(i=1..m) (hθ(x(i)) - y(i)) * x(i)j

The circled part is the partial derivative of the cost function J with respect to θj.
It is worth explaining why these two algorithms are really the same gradient descent algorithm. Consider an example with 3 parameters θ0 to θ2, updated with three update rules.
Look at the update rule for θ0: it is actually the same as the update rule for θ0 when n = 1. (They are equivalent because, by our notational convention, x(i)0 = 1.)
Look at the update rule for θ1: it is actually the same as the update rule for θ1 when n = 1. We have just used the new symbol x(i)1 to denote the first feature.
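
The update rule can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names, the toy data set (generated from y = 1 + 2*x1 + 3*x2), the learning rate, and the iteration count are all assumptions chosen for the example.

```python
def predict(theta, x):
    """h_theta(x) = theta^T x, where x already contains x0 = 1."""
    return sum(t * xi for t, xi in zip(theta, x))


def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum of squared errors."""
    m = len(y)
    return sum((predict(theta, x) - yi) ** 2 for x, yi in zip(X, y)) / (2 * m)


def gradient_step(theta, X, y, alpha):
    """One simultaneous update of every theta_j."""
    m = len(y)
    errors = [predict(theta, x) - yi for x, yi in zip(X, y)]
    # All partial derivatives are computed from the old theta before any update.
    grads = [sum(e * x[j] for e, x in zip(errors, X)) / m
             for j in range(len(theta))]
    return [t - alpha * g for t, g in zip(theta, grads)]


def gradient_descent(X, y, alpha=0.1, iterations=2000):
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        theta = gradient_step(theta, X, y, alpha)
    return theta


# Toy data (made up): each row is [x0 = 1, x1, x2], and y = 1 + 2*x1 + 3*x2
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0]

theta = gradient_descent(X, y)
print([round(t, 4) for t in theta])
```

Because x(i)0 = 1, the same line of code handles θ0 and every other θj, which is exactly why the n = 1 rules and the general rule coincide.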

Three, Gradient Descent in Practice I: Feature Scaling

This section and the next explain some practical techniques that make gradient descent work better. This section covers a method called feature scaling, which speeds up the convergence of gradient descent.
Suppose you have a machine learning problem with multiple features. If you make sure that the values of these features lie in similar ranges, gradient descent will converge faster on this problem.
For example, consider a problem with two features x1 and x2, where x1 is the size of the house with x1∈(0,2000), and x2 is the number of rooms with x2∈(1,5).

Draw the contour plot of the cost function J(θ). J(θ) is a function of θ0, θ1, and θ2, but θ0 is the coefficient of the constant x0 = 1; it only affects the position of the contours in the coordinate system and not their shape, so we ignore θ0 for now and plot J(θ) only against θ1 and θ2.

Because the range of x1 is far larger than the range of x2, the contours of the cost function are flattened and skewed; the 2000:5 ratio makes the ellipses in the contour plot very elongated. If we run gradient descent on this cost function, it may take a long time to converge to the global minimum, as in the right-hand figure.
If the picture is exaggerated, as on the left, it can be even worse: gradient descent may oscillate back and forth before it finds its way to the global minimum.

The editor had a question here: why is the convergence path in the figure above not the orange one in the figure below (the editor's own sketch), which goes straight to the bottom? Why does it turn first? In retrospect, that question was rather "wet behind the ears":

After some more thought, this is probably the reason. With gradient descent, at any given point we have to decide which direction to move in next. How is that direction chosen? The point lies on a 3D surface, so there are tangent lines in all 360° of directions, and the direction we choose is the one in which the tangent is steepest. Pictorially, as in the figure below, suppose a person is standing at the black dot. The surface is too large to see all at once; his field of view is limited to the black circle around him. From his current position, the steepest direction he can see is the one marked by the big red arrow. On elongated contours, that locally steepest direction generally does not point at the minimum, which is why the path turns.
Therefore, when the ranges of the features differ too much, an effective remedy is feature scaling.
Concretely, define feature x1 as house area / 2000 and feature x2 as number of rooms / 5, so that both x1 and x2 lie in [0,1]. The contours of the cost function then become much closer to circles, and gradient descent finds a more direct path to the minimum. In other words, dividing each feature by its maximum value removes the difference between the feature ranges.

In general, the goal of feature scaling is to bring every feature into roughly the range [-1,+1].
The feature x0 always equals 1, so it always satisfies x0∈[-1,+1]. Other features may need some treatment (for example, dividing each one by a different number) to bring them into a similar range.
Note that the two numbers -1 and +1 are not that important. Features such as x1∈(0,3) and x2∈(-1,2) are fine, because these ranges are close to [-1,1]. If a feature has x3∈[-100,+100], then x3 is far from [-1,+1], a gap of order O(10^2), so x3 is probably a poorly scaled feature. Likewise, a feature with x4∈(-0.00001,+0.00001) is not appropriate either.
But there is no need to worry too much about whether the range of a feature is slightly too large or too small; as long as the ranges are close enough, gradient descent works fine.

In feature scaling, besides dividing a feature by its maximum value, we can also use mean normalization.
Mean normalization means replacing a feature xi with xi - ui, so that every feature has a mean of approximately 0.

This is not applied to x0 = 1: its value is always 1, so its mean cannot be 0. For the other features, take the house area x1∈(0,2000): if the average house area is 1000, x1 can be normalized as x1 = (size - 1000) / 2000. Similarly, if the number of bedrooms lies in [1,5] and an average house has two bedrooms, x2 can be normalized as x2 = (number of bedrooms - 2) / 5.

This gives new features x1 and x2 whose values lie roughly in [-0.5,0.5].
More generally, mean normalization can be written as:

x1 := (x1 - u1) / S1

where x1 is the feature, u1 is the average value of x1 over all examples in the training set, and S1 is the range of x1, namely its maximum value minus its minimum value. The standard deviation can also be used as S1.

Note that if S1 = maximum value - minimum value is applied strictly, the denominator 5 used above for a feature ranging from 1 to 5 becomes 5 - 1 = 4. That does not matter: any value that brings the features into similar ranges is fine. Feature scaling does not need to be exact; its only purpose is to make gradient descent run faster.
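
Mean normalization can be sketched in a few lines of plain Python. The helper name `mean_normalize`, its `use_std` switch, and the sample values are assumptions made for this illustration.

```python
def mean_normalize(column, use_std=False):
    """Return ((v - mean) / S for each v), together with mean and S.

    S is max - min by default, or the standard deviation if use_std is True.
    """
    m = len(column)
    mean = sum(column) / m
    if use_std:
        scale = (sum((v - mean) ** 2 for v in column) / m) ** 0.5
    else:
        scale = max(column) - min(column)
    if scale == 0:
        raise ValueError("a constant feature (such as x0) cannot be scaled")
    return [(v - mean) / scale for v in column], mean, scale


# Made-up training values for two features
sizes = [500.0, 1000.0, 1500.0, 2000.0]
bedrooms = [1.0, 2.0, 3.0, 5.0]

scaled_sizes, size_mean, size_range = mean_normalize(sizes)
scaled_bedrooms, bed_mean, bed_range = mean_normalize(bedrooms)

print(scaled_sizes)
print(scaled_bedrooms)
```

Returning the mean and the scale alongside the scaled column matters in practice: the same two numbers have to be applied to any new house before a prediction is made with the fitted θ.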
The next section introduces another technique for making gradient descent run faster.
Four, Gradient Descent in Practice II: Learning Rate
This section covers some further techniques for gradient descent, centered on the learning rate α.
Recall the update rule of gradient descent: θj := θj - α * ∂J(θ)/∂θj. First, we show how to debug gradient descent and some tricks for making sure it works correctly. Second, we look at how to choose the learning rate α.

First, how to make sure that gradient descent is working correctly.
The job of gradient descent is to find a value of θ that minimizes the cost function J(θ).

To judge whether gradient descent has converged, plot J(θ) against the number of iterations. The horizontal axis is the number of gradient descent iterations (note: not the parameter θ) and the vertical axis is the value of the cost function J(θ); each point on the curve corresponds to one value of θ. In the plot, the cost function hardly decreases between iterations 300 and 400, so we can say that J(θ) has converged by about iteration 400.
This plot helps us see whether the cost function has converged.

An automatic convergence test can also be used, that is, an algorithm that tells you whether gradient descent has converged.
A typical test is: if J(θ) decreases by less than some small number ε in one iteration, the function is considered to have converged. For example, you can choose ε = 10^(-3).
In practice, however, choosing a suitable threshold ε is quite difficult, so the most common way to judge convergence is still the simple plot described above.
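
An automatic convergence test might look like the sketch below. It assumes the test means "J(θ) decreased by less than ε in one iteration"; the function name, the default ε = 1e-3, and the recorded cost values are made up for illustration.

```python
def first_converged_iteration(cost_history, epsilon=1e-3):
    """Return the first iteration k at which J dropped by less than epsilon.

    cost_history[k] is J(theta) after iteration k. Returns None if the test
    never fires. An increase in J is not treated as convergence.
    """
    for k in range(1, len(cost_history)):
        decrease = cost_history[k - 1] - cost_history[k]
        if 0 <= decrease < epsilon:
            return k
    return None


# Hypothetical recorded values of J(theta), one per iteration
history = [10.0, 4.0, 2.0, 1.5, 1.4995]
converged_at = first_converged_iteration(history)
print(converged_at)

# A rising curve never passes the test
diverging = [1.0, 2.0, 3.0]
print(first_converged_iteration(diverging))
```

The explicit `0 <= decrease` check is a design choice: without it, a cost that is increasing (a negative decrease) would also be "less than ε" and would be reported as converged.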

In addition, the curve of J(θ) against the number of iterations can give early warning when the algorithm is not running properly. For example, if J(θ) increases with the number of iterations, gradient descent is clearly not working correctly, and this usually means a smaller learning rate α should be used.

If J(θ) is rising, the most common cause is that α is too large: each iteration overshoots the minimum, and the iterates swing from one side of the minimum to the other while moving further away. The obvious fix is to reduce the learning rate α. It could also be a bug in the code, so check that carefully as well.
J(θ) may also behave as in the lower-left figure, repeatedly decreasing and then increasing in a wave-like pattern. The likely cause is again that α is too large, and the fix is again a smaller α.
That does not mean the smaller α is, the better. If α is too small, gradient descent may converge very slowly.

To sum up: if the learning rate α is too small, gradient descent may converge very slowly; if α is too large, J(θ) may fail to decrease on some iterations, may increase instead, and gradient descent may not converge at all. An α that is too large can also lead to slow convergence. To find out which of these is happening, plot J(θ) against the number of iterations.
In practice, try several values of α, plot the curve of J(θ) against the number of iterations for each one, and choose the α that makes the algorithm converge fastest. A common approach is to try values of α spaced about 3 times apart, for example ...
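
The procedure of trying several learning rates can be sketched as follows. Everything specific here is an assumption for illustration: the toy data (y = 1 + 2*x1), the candidate values of α (picked roughly 3 times apart), and the choice of 50 iterations. Instead of plotting, the sketch simply compares the last recorded cost for each α.

```python
def run_gradient_descent(X, y, alpha, iterations):
    """Run gradient descent from theta = 0 and return J(theta) per iteration."""
    m, n = len(y), len(X[0])
    theta = [0.0] * n
    history = []
    for _ in range(iterations):
        errors = [sum(t * xi for t, xi in zip(theta, x)) - yi
                  for x, yi in zip(X, y)]
        history.append(sum(e * e for e in errors) / (2 * m))
        grads = [sum(e * x[j] for e, x in zip(errors, X)) / m for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, grads)]
    return history


# Toy data (made up): rows are [x0 = 1, x1], and y = 1 + 2*x1
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

# Hypothetical candidates, each about 3 times the previous one
alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
histories = {a: run_gradient_descent(X, y, a, 50) for a in alphas}
final_cost = {a: h[-1] for a, h in histories.items()}

best_alpha = min(final_cost, key=final_cost.get)
for a in alphas:
    print(a, final_cost[a])
print("best:", best_alpha)
```

On this toy problem the largest candidate makes J(θ) grow on every iteration, while the smallest ones barely move it in 50 iterations, which matches the two failure modes described above.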

Five, Features and Polynomial Regression


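Remember that the purpose of this section is to explain how to choose features. So, can features only be chosen as above? Of course not; there are other options. For example, if we let Area = frontage * depth, we can use Area alone as the feature of the model, that is, hθ(x) = θ0 + θ1*x, where x is the area.
A concept closely related to feature choice is polynomial regression. Suppose we have the following housing price data set, and to fit the data we decide to use a cubic function as the model. The form of the hypothesis used so far is hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 (the hypothesis formula).
What if we want to fit the cubic model θ0 + θ1x + θ2x^2 + θ3x^3? We are predicting the price of a house, so price = θ0 + θ1*(size) + θ2*(size)^2 + θ3*(size)^3. To make the two formulas correspond, set: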

x1 = (size)

x2 = (size)^2

x3 = (size)^3
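To summarize the idea: set up the features as above, then apply linear regression (hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3), and you can fit a cubic model.
If the features are chosen like this (x1 = size, x2 = size^2, x3 = size^3), feature normalization becomes even more important, because size ∈ [0,10^3] while size^2 ∈ [1,10^6], so the ranges of the features differ by orders of magnitude.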
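
A small sketch of this feature construction: it builds x1 = size, x2 = size^2, x3 = size^3 and then rescales each column with mean normalization so that the three features end up in comparable ranges. The function names and the sample sizes are made up for illustration.

```python
def cubic_features(size):
    """Map one house size to the features x1, x2, x3 of the cubic model."""
    return [size, size ** 2, size ** 3]


def scale_columns(rows):
    """Mean-normalize every column: (value - mean) / (max - min)."""
    scaled_columns = []
    for column in zip(*rows):
        mean = sum(column) / len(column)
        span = max(column) - min(column)
        scaled_columns.append([(v - mean) / span for v in column])
    # Turn the list of columns back into a list of rows.
    return [list(row) for row in zip(*scaled_columns)]


sizes = [10.0, 100.0, 500.0, 1000.0]  # made-up house sizes
rows = [cubic_features(s) for s in sizes]
raw_spans = [max(c) - min(c) for c in zip(*rows)]
scaled_rows = scale_columns(rows)

print(raw_spans)       # the raw ranges differ by several orders of magnitude
print(scaled_rows[0])  # after scaling, every feature lies within [-1, 1]
```

After scaling, linear regression on the three columns is still fitting a cubic in size; only the numerical conditioning of the problem changes.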

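As shown below, a quadratic model may not fit the data well. Besides switching to a cubic model, there are other options; for example, changing the formula to the form below may work.

We also discussed how to choose features: for example, instead of using the frontage and the depth of the lot, we use their product, which gives the land area as a feature.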
Six, Normal Equation
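So far, in linear regression problems, we have always used gradient descent to minimize the cost function. (How are linear regression, gradient descent, and the normal equation related? See the article
"Machine Learning: Linear Regression, Gradient Descent, and the Normal Equation".
In short, we only need to find the parameters that minimize the cost, and to find them we can use either gradient descent or the normal equation.)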


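To illustrate the method, suppose we have a very simple cost function (simple because θ is a real number), J(θ) = aθ^2 + bθ + c.
How do we minimize this quadratic function? From calculus: take the derivative, set it equal to 0, and solve the resulting equation for the θ that minimizes J(θ). Here the derivative is 2aθ + b, so for a > 0 the minimum is at θ = -b/(2a).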

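So how is this done in practice? Take an example: a training set has m = 4 training examples. To apply the normal equation, first add the extra feature x0 = 1 to every example, then build the matrix X and the vector y. X is an m*(n+1) matrix and y is an m-dimensional vector, where m is the number of training examples, n is the number of features, and the n+1 comes from the extra feature x0. Finally, compute the following formula:

θ = (X^T * X)^(-1) * X^T * y

The matrix X is also called the design matrix. It is built by taking each training example x(i), including x0 = 1, and placing its transpose in the i-th row of X.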
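
The normal equation can be sketched in plain Python as below. Note one substitution: rather than forming the inverse explicitly, the sketch solves the linear system (X^T X) θ = X^T y by Gaussian elimination, which gives the same θ whenever X^T X is invertible. The helper names and the m = 4 toy training set (generated from y = 1 + 2*x1 + 3*x2) are assumptions for illustration.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (not invertible)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def normal_equation(X, y):
    """Return theta satisfying (X^T X) theta = X^T y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(a * b for a, b in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


# Design matrix for a made-up training set with m = 4; rows are [x0 = 1, x1, x2]
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0]

theta = normal_equation(X, y)
print([round(t, 6) for t in theta])
```

There is no learning rate and no iteration here; θ comes out of a single computation.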



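Finally, when should gradient descent be used and when the normal equation? Here are their advantages and disadvantages, assuming m training examples and n features.

Disadvantages of gradient descent:
* You need to choose the learning rate α. This means running the algorithm several times with different values of α to find the one that works best, which is extra work and hassle.
* It needs many iterations. Depending on the details, the iterations can also be slow.

Advantages of the normal equation:
* There is no need to choose a learning rate α.
* There is no need to iterate; one computation is enough. So there is no need to plot J(θ) to check convergence, and no other extra steps are needed.

Advantage of gradient descent: it works well even when there are many features.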

Seven, Normal Equation Noninvertibility
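For the formula that computes θ, θ = (X^T * X)^(-1) * X^T * y, the matrix X^T * X may turn out not to be invertible.

One possible cause is redundant features in the learning problem. For example, when predicting housing prices, suppose x1 is the area in square feet and x2 is the area in square meters. Since 1 m = 3.28 feet, x1 and x2 always satisfy a fixed conversion, x1 = (3.28)^2 * x2, and this makes X^T * X non-invertible.
Another cause is using too many features (e.g. m <= n). For example, with m = 10 examples and n = 100 features, there are 101 features once x0 is included.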
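
The redundant-feature case can be checked with a small sketch. It uses exact rational arithmetic (`fractions.Fraction`) so that the determinant of X^T X comes out as exactly 0 rather than as a tiny floating-point number. The helper names, the areas, and the room counts are made up for illustration.

```python
from fractions import Fraction


def gram_matrix(X):
    """Return X^T X for a matrix given as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# Square feet per square meter, using 1 m = 3.28 feet as in the text
k = Fraction(328, 100) ** 2

areas_m2 = [Fraction(50), Fraction(80), Fraction(120)]  # made-up areas

# Redundant features: x1 = area in square feet, x2 = the same area in square meters
X_redundant = [[Fraction(1), k * a, a] for a in areas_m2]
det_redundant = det3(gram_matrix(X_redundant))

# Non-redundant features: x1 = area in square meters, x2 = number of rooms (made up)
rooms = [2, 3, 3]
X_ok = [[Fraction(1), a, Fraction(r)] for a, r in zip(areas_m2, rooms)]
det_ok = det3(gram_matrix(X_ok))

print(det_redundant)  # 0: X^T X has no inverse
print(det_ok)         # nonzero: X^T X can be inverted
```

A zero determinant means the formula θ = (X^T * X)^(-1) * X^T * y cannot be evaluated as written; dropping one of the two linearly dependent features removes the problem.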