One, Multiple Features: multidimensional features
This section introduces a more powerful form of linear regression, one that works with multiple variables (features).
So far we have discussed the single-variable (single-feature) regression model, in which the house area x is used to predict the house price y. The formula hθ(x) = θ0 + θ1*x is what we call the "hypothesis", where x is the only feature.
Now we add more features to the house price model, such as the number of rooms and the number of floors, to build a model with multiple variables. The features in the model are (x1, x2, ..., xn).

n denotes the number of features.

x(i) denotes the i-th training example, i.e. the i-th row of the feature matrix. It is a vector.

For example, see the figure above.

x(i)j denotes the entry in row i, column j of the feature matrix, i.e. the j-th feature of the i-th training example.

For example, x(2)3 denotes the third feature of the second example in the figure above.
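The indexing convention above can be made concrete with a small sketch. The matrix values and the helper names `example` and `feature` are hypothetical, chosen only for illustration; the text counts from 1, while Python lists count from 0, so the helpers do the shift.

```python
# Hypothetical feature matrix: each row is one training example,
# columns are (area, rooms, floors, age). Values are made up.
X = [
    [2104.0, 5.0, 1.0, 45.0],
    [1416.0, 3.0, 2.0, 40.0],
    [1534.0, 3.0, 2.0, 30.0],
]

m = len(X)      # number of training examples
n = len(X[0])   # number of features


def example(X, i):
    """x(i): the i-th training example (1-based, as in the text)."""
    return X[i - 1]


def feature(X, i, j):
    """x(i)j: the j-th feature of the i-th example (both 1-based)."""
    return X[i - 1][j - 1]


print(example(X, 2))     # the second training example, a vector
print(feature(X, 2, 3))  # third feature of the second example
```

Here `feature(X, 2, 3)` picks out the number of floors of the second house.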

The hypothesis h for multiple variables is expressed as: hθ(x) = θ0 + θ1*x1 + θ2*x2 + ... + θn*xn

There are n+1 parameters and n variables, and the n variables represent the n features. To simplify the formula, introduce x0 = 1, so that it becomes: hθ(x) = θ0*x0 + θ1*x1 + θ2*x2 + ... + θn*xn

For example (unit: k$): hθ(x) = 80 + 0.1*x1 + 0.01*x2 + 3*x3 - 2*x4

This could be used to predict the price of a house after some period of time: the base price of the house is 80 k$; 0.1*x1 means the price rises by 0.1 k$ per unit of area;
the price increases by 0.01*x2 with the number of rooms (x2) and by 3*x3 with the number of floors (x3); and -2*x4 means the price depreciates as the age of the house (x4) increases.

The parameters of the model form an (n+1)-dimensional vector. Every training example is likewise an (n+1)-dimensional vector (its first component x0 is the constant 1; it can also be seen as an extra feature that we define ourselves), and the feature matrix
X has dimension m*(n+1). The feature vector and the parameter vector can then be written as: x = [x0, x1, ..., xn]^T, θ = [θ0, θ1, ..., θn]^T

The hypothesis can be rewritten as: hθ(x) = θ0*x0 + θ1*x1 + ... + θn*xn

So the formula simplifies to hθ(x) = θ^T x, where the superscript T denotes the transpose.

This is the form of the hypothesis when there are multiple features, also called multiple linear regression or multivariate linear regression. "Multivariate" simply means that multiple features or variables are used for prediction; it is just a fancier name.
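As a minimal sketch of the hypothesis hθ(x) = θ^T x, the snippet below evaluates the house-price example above (base price 80, then 0.1, 0.01, 3 and -2 for area, rooms, floors and age). The function name `hypothesis` and the sample house are assumptions made for illustration; plain Python lists are used to keep it dependency-free.

```python
def hypothesis(theta, x):
    """Return h_theta(x) = theta^T x for one example.

    `x` holds the n raw features; x0 = 1 is prepended here.
    """
    x_with_bias = [1.0] + list(x)
    if len(theta) != len(x_with_bias):
        raise ValueError("theta must have n+1 entries for n features")
    return sum(t * xj for t, xj in zip(theta, x_with_bias))


# Parameters from the house-price example, in k$:
# base price, then area, rooms, floors, age.
theta = [80.0, 0.1, 0.01, 3.0, -2.0]

# Hypothetical house: area 100, 3 rooms, 2 floors, 10 years old.
x = [100.0, 3.0, 2.0, 10.0]

price = hypothesis(theta, x)
print(price)
```

For this made-up house the terms are 80 + 10 + 0.03 + 6 - 20, so the predicted price comes out at about 76.03 k$.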
Two, Gradient Descent for Multiple Variables

The previous section covered the form of the hypothesis for linear regression with multiple variables (features). This section describes how to fit the parameters of that hypothesis, in particular how to use gradient descent for multiple linear regression.
A quick summary of the notation: hθ(x) = θ^T x = θ0*x0 + θ1*x1 + ... + θn*xn is the hypothesis of multiple linear regression, with x0 = 1 by convention. The parameters of the model are θ0~θn. Rather than treating them as n+1 separate variables, we treat them as a single (n+1)-dimensional vector θ, so the parameters of the model are themselves one vector. The cost function J is given by the sum of the squares of the error terms. Likewise, rather than viewing J as a function of n+1 arguments, we view it as a function of the parameter vector θ.

Here is gradient descent. We need to update every θj (simultaneously for j = 0, ..., n): θj := θj - α * ∂J(θ)/∂θj, where α is the learning rate and the derivative term is the partial derivative of the cost function with respect to the parameter θj.
Now let's see what gradient descent looks like when applied.
In the figure below, the left side shows gradient descent for n = 1, with two separate update rules for the parameters θ0 and θ1. The circled part is the partial derivative of the cost function J with respect to θ0. The right side shows
gradient descent for n >= 1, where the circled part is the partial derivative of the cost function J with respect to θj.
It is worth explaining why the two algorithms above are the same, that is, why both are the same gradient descent algorithm. Consider the following example: with two features we have three parameters θ0~θ2, updated by three update rules.
Looking at the update rule for θ0, we find that it is actually the same as the update rule for θ0 when n = 1. (The reason they are equivalent is that our notation includes the convention x(i)0 = 1.)
Looking at the update rule for θ1, we find that it is actually the same as the update rule for θ1 when n = 1. We have simply used the new symbol x(i)1 to denote the first feature.
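The update rule can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the cost is taken to be J(θ) = 1/(2m) * Σ (hθ(x(i)) - y(i))^2 (the 1/(2m) factor is the usual convention and is assumed here), which gives the partial derivative (1/m) * Σ (hθ(x(i)) - y(i)) * x(i)j. The function names and the toy data are hypothetical.

```python
def predict(theta, row):
    """h_theta(x) = sum_j theta_j * x_j; `row` already starts with x0 = 1."""
    return sum(t * xj for t, xj in zip(theta, row))


def gradient_step(theta, X, y, alpha):
    """One simultaneous update of every theta_j."""
    m = len(y)
    errors = [predict(theta, row) - yi for row, yi in zip(X, y)]
    new_theta = []
    for j in range(len(theta)):
        # Partial derivative of J with respect to theta_j.
        grad_j = sum(e * row[j] for e, row in zip(errors, X)) / m
        new_theta.append(theta[j] - alpha * grad_j)
    return new_theta


# Hypothetical toy data generated from y = 1 + 2*x1 + 3*x2 (no noise).
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0]

theta = [0.0, 0.0, 0.0]
for _ in range(5000):
    theta = gradient_step(theta, X, y, alpha=0.1)

print([round(t, 3) for t in theta])
```

Note that the same loop over j handles θ0 as well: because x(i)0 = 1, its gradient reduces to the mean error, which is exactly the n = 1 rule for θ0. On this toy data the parameters approach 1, 2 and 3.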

Three, Gradient Descent in Practice I: Feature Scaling

This section and the next cover some practical techniques for running gradient descent so that it works better. To speed up the convergence of gradient descent, this section explains a method called feature scaling.
Suppose you have a machine learning problem with multiple features. If you make sure the features take values in similar ranges, gradient descent will converge faster on that problem.
For example, suppose the problem has two features, x1 and x2, where x1 is the size of the house with x1∈(0,2000), and x2 is the number of rooms with x2∈(1,5).

Draw the contour plot of the cost function J(θ). J(θ) is a function of θ0, θ1 and θ2. Because θ0 is only a constant term that affects where the contours sit in the coordinate system and not their shape, we ignore θ0 for now and draw J(θ) only as a function of θ1 and θ2.

Because the range of x1 is far greater than that of x2, the contours of the cost function are flattened and skewed; the 2000:5 ratio makes the ellipses in the contour plot very elongated. If we run gradient descent on this cost function, it may take a long time to converge to the global minimum, as in the figure on the right.
If the picture is exaggerated further, as on the left, things could be even worse: gradient descent might oscillate back and forth before finding its way to the global minimum.

As a beginner, I had a question here: why is the convergence path in the figure above not the orange line in the figure below (the editor's own drawing), which goes straight to the bottom? Why take a turn first? In retrospect, I was too "wet behind the ears":

After thinking about it later, this is probably the reason. When using gradient descent, at any given point we must decide which direction to move next. How is that direction chosen? The point lies on a 3D surface, so there are tangent lines in every direction through 360°, and the direction we choose is the one in which the tangent slope is steepest. Pictorially, as in the figure below, suppose a person stands at the black dot. The area is so large that his field of view is limited to the black circle around him. From his current position, the steepest direction he can see is the one marked by the big red arrow, which need not point straight at the minimum.
Therefore, when the scales of the features differ too much, an effective remedy is feature scaling.
Specifically, define feature x1 as house area/2000 and feature x2 as number of rooms/5, so that x1 and x2 both lie in [0,1]. The contours of the cost function then become close to regular circles, and gradient descent quickly finds a direct path to the minimum. So, by scaling the features, the mismatch between their ranges can be removed.
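To see the effect of this scaling, here is a rough experiment that runs the same number of gradient descent iterations on raw and on scaled features. Everything in it is an assumption made for illustration: the four houses, the prices, the 1/(2m) cost convention, and the two learning rates (the raw features need a tiny α to avoid divergence, so the comparison is between the best each setup can reasonably use).

```python
def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum of squared errors (1/(2m) is an assumed convention)."""
    m = len(y)
    return sum(
        (sum(t * xj for t, xj in zip(theta, row)) - yi) ** 2
        for row, yi in zip(X, y)
    ) / (2 * m)


def gradient_descent(X, y, alpha, iterations):
    """Batch gradient descent starting from theta = 0; returns the final theta."""
    m = len(y)
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        errors = [sum(t * xj for t, xj in zip(theta, row)) - yi
                  for row, yi in zip(X, y)]
        theta = [
            t - alpha * sum(e * row[j] for e, row in zip(errors, X)) / m
            for j, t in enumerate(theta)
        ]
    return theta


# Hypothetical data: area in (0, 2000), rooms in (1, 5), price in k$.
areas = [500.0, 1000.0, 1500.0, 2000.0]
rooms = [2.0, 1.0, 5.0, 3.0]
prices = [50 + 0.1 * a + 10 * r for a, r in zip(areas, rooms)]

X_raw = [[1.0, a, r] for a, r in zip(areas, rooms)]
X_scaled = [[1.0, a / 2000, r / 5] for a, r in zip(areas, rooms)]

theta_raw = gradient_descent(X_raw, prices, alpha=1e-7, iterations=1000)
theta_scaled = gradient_descent(X_scaled, prices, alpha=0.5, iterations=1000)

cost_raw = cost(theta_raw, X_raw, prices)
cost_scaled = cost(theta_scaled, X_scaled, prices)
print(cost_raw, cost_scaled)
```

After the same 1000 iterations the scaled run has essentially reached the minimum, while the raw run is still far from it: the large area feature forces a small step size, and with that step size the other parameters barely move.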

Overall, the goal of feature scaling is to bring the value of every feature roughly into the range [-1,+1].
The feature x0 always equals 1, so x0∈[-1,+1] always holds. Other features may need some processing (for example, dividing each by a different number) to bring them into a similar range.
Note: the exact numbers -1 and +1 are not important. For instance, a feature x1∈(0,3) and a feature x2∈(-1,2) are fine, because they are close to the range [-1,1]. If a feature x3∈[-100,+100], then x3 is far from [-1,+1], a gap of O(10^2), i.e. two orders of magnitude, so x3 is probably a poorly scaled feature. Likewise, a feature x4∈(-0.00001,+0.00001) is not appropriate either.
But there is no need to worry too much about whether the range of a feature is a little too large or too small: as long as the ranges are close enough, gradient descent will work properly.

In feature scaling, besides dividing a feature by its maximum value, mean normalization can also be used.
Mean normalization means replacing the feature xi with xi - μi, so that every feature has a mean of approximately 0.

This is not applied to x0 = 1: its value is always 1, so its mean cannot be 0. For the other features, for example the house area x1∈(0,2000), if the average house area is 1000, then x1 can be transformed as x1 = (size - 1000)/2000. As another example, if the number of bedrooms lies in [1,5] and the average house has two bedrooms, then x2 can be normalized as x2 = (bedrooms - 2)/5.

In this way we obtain new features x1 and x2 whose values lie roughly in [-0.5,0.5].
More generally, mean normalization can be written as: x1 := (x1 - μ1)/S1
where x1 is the feature, μ1 is the average value of the feature x1 over all samples in the training set, and S1 is the range of x1, namely maximum value - minimum value. The standard deviation can also be used as S1.
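A small sketch of mean normalization, x := (x - μ)/S, follows. The function name `mean_normalize`, its optional `scale` argument and the sample values are assumptions for illustration; by default S is taken to be the range (maximum - minimum) of the training values.

```python
def mean_normalize(values, scale=None):
    """Return (x - mean) / scale for each x in `values`.

    If `scale` is None, the range (max - min) of `values` is used as S.
    """
    mu = sum(values) / len(values)
    s = scale if scale is not None else max(values) - min(values)
    if s == 0:
        raise ValueError("feature is constant; it cannot be scaled")
    return [(v - mu) / s for v in values]


# Hypothetical training set: house areas and bedroom counts.
areas = [500.0, 1000.0, 1500.0, 2000.0]
bedrooms = [1.0, 2.0, 3.0, 5.0]

areas_scaled = mean_normalize(areas)
bedrooms_scaled = mean_normalize(bedrooms)

print(areas_scaled)
print(bedrooms_scaled)
```

Each scaled feature now has mean 0 and spans a range of width 1. Passing a different `scale`, for example the standard deviation of the feature, changes only the spread, not the centering.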

Note that if S1 = maximum value - minimum value is applied strictly, the denominator 5 above would become 4. That does not matter: any value that brings the features into similar ranges is fine. Feature scaling does not need to be exact; its only purpose is to make gradient descent run faster.
The next section introduces another technique for making gradient descent run faster.
Four, Gradient Descent in Practice II: Learning Rate
This section covers some further techniques for gradient descent, centered on the learning rate α.
Recall the update rule of gradient descent: θj := θj - α * ∂J(θ)/∂θj. First, we show how to debug it, along with some tips for making sure gradient descent is working correctly. Second, we look at how to choose the learning rate α.

Here is how to make sure gradient descent is working correctly.
The job of gradient descent is to find a value of θ that minimizes the cost function J(θ).

To judge whether gradient descent has converged, plot J(θ) against the number of iterations. The horizontal axis is the number of gradient descent iterations (note: not the parameter θ), the vertical axis is the value of the cost function J(θ), and each point on the curve corresponds to one value of θ. In the plot, the cost function barely decreases between iterations 300 and 400, so we can say that J(θ) has converged by around iteration 400.
This plot helps us see whether the cost function has converged.

Automatic convergence tests can also be used, that is, an algorithm that tells you whether gradient descent has converged.
A typical example: if J(θ) decreases by less than some small number ε in one iteration, we consider the function to have converged. For example, one might choose ε = 10^(-3).
In practice, though, choosing a suitable threshold ε is quite difficult, so the most common way to judge convergence is the simple plotting method above.
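An automatic convergence test of this kind might look like the sketch below. It records J(θ) after every iteration (the same values one would plot) and stops once the decrease in a single iteration falls below ε. The function names, the toy data, the 1/(2m) cost convention and the default ε = 1e-3 are assumptions for illustration.

```python
def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum of squared errors (assumed convention)."""
    m = len(y)
    return sum(
        (sum(t * xj for t, xj in zip(theta, row)) - yi) ** 2
        for row, yi in zip(X, y)
    ) / (2 * m)


def gradient_step(theta, X, y, alpha):
    """One simultaneous gradient descent update of all parameters."""
    m = len(y)
    errors = [sum(t * xj for t, xj in zip(theta, row)) - yi
              for row, yi in zip(X, y)]
    return [
        t - alpha * sum(e * row[j] for e, row in zip(errors, X)) / m
        for j, t in enumerate(theta)
    ]


def run_until_converged(X, y, alpha, epsilon=1e-3, max_iters=10000):
    """Iterate until J decreases by less than epsilon in one step.

    Returns (theta, history, converged); history[k] is J after k iterations.
    """
    theta = [0.0] * len(X[0])
    history = [cost(theta, X, y)]
    for _ in range(max_iters):
        theta = gradient_step(theta, X, y, alpha)
        history.append(cost(theta, X, y))
        decrease = history[-2] - history[-1]
        # A negative decrease means J went up, which is not convergence.
        if 0 <= decrease < epsilon:
            return theta, history, True
    return theta, history, False


# Hypothetical data lying exactly on y = 1 + 2*x.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta, history, converged = run_until_converged(X, y, alpha=0.1)
print(converged, len(history) - 1, history[-1])
```

The `history` list is what would go on the vertical axis of the "J(θ) vs. iterations" plot, so the plot and the automatic test can share the same data.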

In addition, the "J(θ) vs. iterations" curve can give early warning when the algorithm is not running properly. For example, if the curve looks like the one below, with J(θ) increasing as the number of iterations grows, then gradient descent is clearly not working correctly. This usually means a smaller learning rate α should be used.

If J(θ) is rising, the most common cause is that α is too large: each iteration overshoots the minimum, bouncing from one side of it to the other and ending up farther away each time. The obvious fix is to reduce the learning rate α. It could also be a bug in the code, so check it carefully.
The curve may also look like the one in the lower-left corner, with J(θ) repeatedly decreasing and then increasing in a wave-like pattern. The likely cause is again that α is too large, and the fix is again a smaller α.
That is not to say that a smaller α is always better: if α is too small, gradient descent may converge very slowly.

To sum up: if the learning rate α is too small, gradient descent may converge very slowly; if α is too large, J(θ) may fail to decrease on every iteration, may increase instead, and may not converge at all. An overly large α can also lead to slow convergence. To find out which is happening, plot J(θ) against the number of iterations.
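These three behaviours are easy to reproduce on a toy problem. The sketch below runs gradient descent on the hypothetical one-parameter cost J(θ) = θ^2 (derivative 2θ), chosen purely for illustration, with a learning rate that is too small, one that is reasonable, and one that is too large.

```python
def descend(alpha, theta0=1.0, steps=20):
    """Gradient descent on J(theta) = theta^2; returns J after each step."""
    theta = theta0
    costs = [theta ** 2]
    for _ in range(steps):
        theta = theta - alpha * 2 * theta  # dJ/dtheta = 2 * theta
        costs.append(theta ** 2)
    return costs


too_small = descend(alpha=0.001)
reasonable = descend(alpha=0.3)
too_large = descend(alpha=1.1)

print(too_small[-1])   # barely below the starting cost
print(reasonable[-1])  # essentially at the minimum
print(too_large[-1])   # larger than the starting cost
```

With α = 1.1 each update multiplies θ by 1 - 2.2 = -1.2, so θ jumps across the minimum and lands farther away every time, which is exactly the rising curve described above.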
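In practice, try several values of α, plot the "J(θ) vs. iterations" curve for each, and choose the α that makes the algorithm converge fastest. Typically the candidate values of α are spaced about a factor of 3 apart, for example
0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1. Usually the two endpoints are found first, for example no larger than 1 and no smaller than 0.001. Then pick a value from this set that is as large as possible, for example the largest value 1, or 0.3, which is slightly smaller than the largest.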
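The search over candidate values can be automated along these lines. This is only a sketch: it compares J(θ) after a fixed number of iterations instead of inspecting plots by eye, and the data, the iteration budget and the 1/(2m) cost convention are assumptions.

```python
def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum of squared errors (assumed convention)."""
    m = len(y)
    return sum(
        (sum(t * xj for t, xj in zip(theta, row)) - yi) ** 2
        for row, yi in zip(X, y)
    ) / (2 * m)


def final_cost(X, y, alpha, iterations=100):
    """Run gradient descent from theta = 0 and return J after `iterations` steps."""
    m = len(y)
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        errors = [sum(t * xj for t, xj in zip(theta, row)) - yi
                  for row, yi in zip(X, y)]
        theta = [
            t - alpha * sum(e * row[j] for e, row in zip(errors, X)) / m
            for j, t in enumerate(theta)
        ]
    return cost(theta, X, y)


# Hypothetical data lying exactly on y = 1 + 2*x.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

# Candidate learning rates spaced roughly 3x apart, as in the text.
candidates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
results = {alpha: final_cost(X, y, alpha) for alpha in candidates}
best_alpha = min(results, key=results.get)

for alpha in candidates:
    print(alpha, results[alpha])
print("best:", best_alpha)
```

On this toy data the largest candidate, 1, makes J(θ) blow up, and the next one down, 0.3, gives the lowest cost, which matches the advice of taking the largest value that still converges.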

Five, Features and Polynomial Regression

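Take house price prediction as an example. Suppose there are two features: the frontage (the width of the house along the street) and the depth. Here the frontage is actually the ...

... frontage * depth, then the single variable Area can be used as the only feature of the model, i.e. hθ(x) = θ0 + θ1*x, where x is the area (Area).

...), we know how to fit this model to the data.

... θ2*(house area)^2 + θ3*(house area)^3, i.e. price = θ0 + θ1*(size) + θ2*(size)^2 + θ3*(size)^3, where: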

x1 = (size)

x2 = (size)^2

x3 = (size)^3

... size^3∈[1,10^9]; the ranges of these three features differ greatly.
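Since the three polynomial features span such different ranges, feature scaling matters a great deal here. The sketch below builds x1, x2, x3 from a size value and then divides each column by its maximum; the helper names, the sample sizes and the choice of max-scaling (one of the scaling options discussed earlier) are assumptions for illustration.

```python
def polynomial_features(size):
    """Map one size value to the features x1 = size, x2 = size^2, x3 = size^3."""
    return [size, size ** 2, size ** 3]


def scale_by_max(rows):
    """Divide each column by its largest absolute value.

    Assumes no column is all zeros.
    """
    n = len(rows[0])
    maxima = [max(abs(row[j]) for row in rows) for j in range(n)]
    return [[row[j] / maxima[j] for j in range(n)] for row in rows]


# Hypothetical sizes between 1 and 1000.
sizes = [1.0, 10.0, 100.0, 1000.0]
features = [polynomial_features(s) for s in sizes]
scaled = scale_by_max(features)

for raw, new in zip(features, scaled):
    print(raw, new)
```

Before scaling, the third column reaches 10^9 while the first only reaches 10^3; after scaling, every column tops out at 1.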

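... c, θ∈R. Suppose θ is just a scalar, a single number rather than a vector, and suppose our cost function is a quadratic function of this real parameter θ.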

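* Requires choosing the learning rate α. This means running the algorithm several times with different values of α to find the one that works best, which is extra work and hassle.

* Requires many iterations. Certain details can also make the iterations very slow.

* No need to choose the learning rate α
* No need to iterate many times; a single computation is enough. So there is no need to plot the J(θ) curve to check convergence, nor to take any other extra steps.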
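The second list appears to describe the normal equation, the closed-form least-squares solution θ = (X^T X)^(-1) X^T y, which is also what the later remarks about inverting (X^T X) refer to. That reading is an assumption, since the surrounding text is incomplete. As a rough sketch, the snippet below solves (X^T X) θ = X^T y by Gaussian elimination in plain Python; the function name and the toy data are hypothetical.

```python
def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y by Gaussian elimination with partial pivoting."""
    n = len(X[0])
    # A = X^T X (n x n) and b = X^T y (length n).
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]

    # Forward elimination.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular (non-invertible)")
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]

    # Back substitution.
    theta = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][c] * theta[c] for c in range(i + 1, n))
        theta[i] = (b[i] - s) / A[i][i]
    return theta


# Hypothetical toy data generated from y = 1 + 2*x1 + 3*x2 (no noise).
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
y = [1.0, 3.0, 4.0, 6.0]

theta = normal_equation(X, y)
print(theta)
```

There is no learning rate and no loop over iterations: one linear solve recovers the parameters 1, 2 and 3 of the toy data. In practice a linear algebra library would normally be used instead of hand-written elimination.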

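..., what if the matrix (X^T*X) is non-invertible? (In linear algebra, a non-invertible matrix is also called a singular or degenerate matrix.)

... 3.28 feet, so x1 and x2 always satisfy a fixed conversion, namely x1 = (3.28)^2 * x2. This makes (X^T*X) non-invertible.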
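The redundant-feature case can be checked directly. In the sketch below, x2 is a hypothetical area in square meters and x1 is the same area in square feet, so x1 = (3.28)^2 * x2 for every example. Exact fractions are used so that the determinant of X^T*X is computed without rounding error; the helper names and the sample areas are assumptions.

```python
from fractions import Fraction


def gram_matrix(X):
    """Return X^T X for a matrix given as a list of rows."""
    n = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]


def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


FEET_PER_METER = Fraction("3.28")

# Hypothetical house areas in square meters.
areas_m2 = [Fraction(50), Fraction(80), Fraction(120), Fraction(200)]

# Columns: x0 = 1, x1 = area in square feet, x2 = area in square meters.
X = [[Fraction(1), FEET_PER_METER ** 2 * a, a] for a in areas_m2]

print(det3(gram_matrix(X)))  # 0
```

Because one column is a constant multiple of another, the determinant is exactly zero and X^T*X has no inverse. Dropping either of the two area features removes the problem.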
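Another cause is using too many features (e.g. m <= n). For example, with m = 10 samples and n = 100 features, which is 101 features once x0 is included, we would be trying to find values for 101 parameters from just 10 examples, and that is too little data to determine so many parameters. Later we will introduce how to use a linear algebra technique called "regularization" to ...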
