One, Multiple Features (multidimensional features)
This section introduces a more powerful form of linear regression, one that works with multiple variables, or features.
So far we have discussed the regression model with a single variable (a single feature): the house area x is used to predict the house price y. The formula below is what we call the "hypothesis", where x is the only feature:
hθ(x) = θ0 + θ1*x
Now we add more features to the house price model, such as the number of rooms and the number of floors, to build a model with multiple variables. The features of the model are (x1, x2, ..., xn).

After adding more features, we introduce some new notation:

n denotes the number of features.

x(i) denotes the i-th training example, that is, the i-th row of the feature matrix. It is a vector.

For example, x(2) is the vector of feature values of the second training example.

x(i)j denotes the j-th feature in the i-th row of the feature matrix, that is, the j-th feature of the i-th training example.

For example, x(2)3 denotes the 3rd feature of the second training example.

The hypothesis h for multiple variables is written as:

hθ(x) = θ0 + θ1*x1 + θ2*x2 + ... + θn*xn

It has n+1 parameters and n variables; the n variables are the n features. To simplify the formula we introduce x0 = 1, and the formula becomes:

hθ(x) = θ0*x0 + θ1*x1 + θ2*x2 + ... + θn*xn

For example (unit: k$):

hθ(x) = 80 + 0.1*x1 + 0.01*x2 + 3*x3 - 2*x4

This can be used to predict the price of a house: the base price of the house is 80 k$, and 0.1*x1 means that each additional unit of area adds 0.1 k$.
The price also increases by 0.01*x2 with the number of rooms (x2) and by 3*x3 with the number of floors, while -2*x4 means that the price falls as the house gets older.

The parameters of the model form an (n+1)-dimensional vector, and every training example is also an (n+1)-dimensional vector. Its first component x0 is the constant 1, which can also be seen as an extra feature that we define ourselves. The feature matrix
X has dimension m*(n+1), where m is the number of training examples. The feature vector and the parameter vector can now be written as:

x = [x0, x1, ..., xn]^T, θ = [θ0, θ1, ..., θn]^T

The hypothesis can be rewritten as:

hθ(x) = θ0*x0 + θ1*x1 + ... + θn*xn

So the formula simplifies to hθ(x) = θ^T x, where the superscript T denotes the transpose.

This is the form of the hypothesis when there are multiple features. Another name for it is multivariate linear regression. "Multivariate" simply means that several features or variables are used for prediction; it is just a nicer-sounding name.
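The vectorized hypothesis hθ(x) = θ^T x can be sketched in a few lines of plain Python. The parameter values are the ones from the house price example above; the house itself (area 1000, 3 rooms, 2 floors, 10 years old) is made up purely for illustration, and the function name is an arbitrary choice.

```python
def hypothesis(theta, x):
    """Return h_theta(x) = theta^T x for one example x that already contains x0 = 1."""
    if len(theta) != len(x):
        raise ValueError("theta and x must both have n + 1 entries")
    return sum(t * xi for t, xi in zip(theta, x))


# Parameters of the example model, in k$: theta0 .. theta4
theta = [80.0, 0.1, 0.01, 3.0, -2.0]

# Hypothetical house: x0 = 1, area = 1000, rooms = 3, floors = 2, age = 10
x = [1.0, 1000.0, 3.0, 2.0, 10.0]

price = hypothesis(theta, x)
print(round(price, 2))  # 80 + 100 + 0.03 + 6 - 20, i.e. about 166.03 k$
```

Prepending x0 = 1 to every example is what lets the intercept θ0 be handled by the same dot product as the other parameters.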
Two, Gradient Descent for Multiple Variables

The previous section covered the form of the hypothesis for linear regression with multiple variables (or features). This section describes how to fit the parameters of that hypothesis, in particular how to use gradient descent for multivariate linear regression.
A quick summary of the notation: the hypothesis of multivariate linear regression is hθ(x) = θ^T x = θ0*x0 + θ1*x1 + ... + θn*xn, with x0 = 1 by convention. The parameters of the model are θ0 to θn. Rather than treating them as n+1 separate values, we treat them as a single (n+1)-dimensional vector θ. The cost function is given by the sum of the squared errors, and instead of viewing J as a function of n+1 arguments we view it as a function of the vector θ:

J(θ) = (1/(2m)) * Σ(i=1..m) (hθ(x(i)) - y(i))^2

Gradient descent repeatedly updates every θj, where α is the learning rate and the derivative term is the partial derivative of the cost function with respect to θj:
θj := θj - α * ∂J(θ)/∂θj (updated simultaneously for j = 0, ..., n)
Now let's see what gradient descent looks like in practice.
First, gradient descent for n = 1 has two separate update rules, one for θ0 and one for θ1:
θ0 := θ0 - α * (1/m) * Σ(i=1..m) (hθ(x(i)) - y(i))
θ1 := θ1 - α * (1/m) * Σ(i=1..m) (hθ(x(i)) - y(i)) * x(i)
The term after α in the first rule is the partial derivative of the cost function J with respect to θ0. For n >= 1 there is a single rule, in which the term after α is the partial derivative of J with respect to θj:
θj := θj - α * (1/m) * Σ(i=1..m) (hθ(x(i)) - y(i)) * x(i)j
To see why the two versions are the same gradient descent algorithm, consider an example with three parameters θ0 to θ2, updated with three update rules.
Looking at the update rule for θ0, we find that it is identical to the update rule for θ0 when n = 1. (They are equivalent because our notation fixes x(i)0 = 1.)
Looking at the update rule for θ1, we find that it is in fact the same as the update rule for θ1 when n = 1. We have simply used the new symbol x(i)1 to denote the first feature.
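The update rule for all θj can be sketched as follows. This is a minimal illustration in plain Python, assuming the squared-error cost with the usual 1/(2m) factor; the toy data (generated from y = 1 + 2*x1 + 3*x2) and all function names are made up for the example.

```python
def predict(theta, x):
    """h_theta(x) = theta^T x, where x already contains x0 = 1."""
    return sum(t * xi for t, xi in zip(theta, x))


def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum of squared errors."""
    m = len(y)
    return sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)


def gradient_step(theta, X, y, alpha):
    """One simultaneous update of every theta_j."""
    m = len(y)
    errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
    # All partial derivatives are computed from the old theta before any update.
    grads = [sum(e * xi[j] for e, xi in zip(errors, X)) / m
             for j in range(len(theta))]
    return [t - alpha * g for t, g in zip(theta, grads)]


def gradient_descent(X, y, alpha=0.1, iterations=2000):
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        theta = gradient_step(theta, X, y, alpha)
    return theta


# Toy training set: each row is [x0, x1, x2] with x0 = 1
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0],
     [1.0, 1.0, 1.0], [1.0, 0.5, 0.2]]
y = [1 + 2 * row[1] + 3 * row[2] for row in X]

theta = gradient_descent(X, y)
print([round(t, 4) for t in theta])
```

Because x0 = 1 in every row, the rule for j = 0 reduces to the θ0 rule of the single-feature case without any special handling.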

Three, Gradient Descent in Practice I - Feature Scaling

This section and the next explain some practical techniques that make gradient descent work better. To speed up the convergence of gradient descent, this section introduces a method called feature scaling.
Suppose you have a machine learning problem with multiple features. If you make sure that the values of these features lie in similar ranges, gradient descent will converge faster on the problem.
For example, take a problem with two features x1 and x2, where x1 is the size of the house with x1∈(0,2000) and x2 is the number of rooms with x2∈(1,5).

Plot the contours of the cost function J(θ). J(θ) is a function of θ0, θ1 and θ2, but because θ0 multiplies the constant x0 = 1, it only affects the position of the contour plot in the coordinate system and not its shape, so we leave θ0 aside for now and plot J(θ) against θ1 and θ2 only.

Because the range of x1 is far larger than the range of x2, the contours of the cost function are flat, skewed ellipses, and the 2000:5 ratio makes the ellipses even more elongated. If we run gradient descent on this cost function, it can take a long time to converge to the global minimum.
If the picture is exaggerated further, it gets worse: gradient descent may even oscillate back and forth before it finds its way to the global minimum.

At this point I (the editor) had a question: why does the convergence path in the figure above not follow the orange line in the figure below (my own sketch) and go straight to the minimum, instead of turning first? In hindsight the question was rather naive:

After thinking it over, the reason is probably this. At any given point, gradient descent has to decide which direction to move in next. The point lies on a 3D surface, so there are tangent lines in all 360° of directions, and the direction chosen is the one with the steepest tangent slope. Picture a person standing at the black dot in the figure: the region is too large, so his field of view is limited to the black circle around him, and from where he stands the steepest direction he can see is the one marked by the big red arrow. In other words, each step follows the locally steepest direction, which is perpendicular to the contour line and, on elongated ellipses, generally does not point at the minimum.
Therefore, when the ranges of the features differ too much, an effective remedy is feature scaling.
Concretely, define feature x1 as house area/2000 and feature x2 as number of rooms/5, so that both x1 and x2 lie in [0,1]. The contours of the cost function then become much closer to circles, and gradient descent finds a far more direct path to the minimum. So scaling by the maximum removes the difference between the feature ranges.
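Dividing each feature by its maximum can be sketched as below. The house areas and room counts are invented sample values, and the helper name is hypothetical.

```python
def scale_by_max(values):
    """Divide every value by the largest absolute value of the feature."""
    largest = max(abs(v) for v in values)
    if largest == 0:
        return list(values)  # an all-zero feature has nothing to scale
    return [v / largest for v in values]


areas = [2000.0, 1600.0, 850.0, 1200.0]  # made-up house areas
rooms = [5.0, 3.0, 2.0, 4.0]             # made-up room counts

x1 = scale_by_max(areas)
x2 = scale_by_max(rooms)
print(x1)
print(x2)
```

After scaling, both features lie in [0, 1], so neither one dominates the shape of the cost function.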

Overall, the goal of feature scaling is to bring the value of every feature roughly into the range [-1,+1].
Feature x0 always has the value 1, so it always satisfies x0∈[-1,+1]. Other features may need some processing (for example, dividing each one by a different number) to bring them into a similar range.
Note that the exact numbers -1 and +1 are not important. Features such as x1∈(0,3) and x2∈(-1,2) are fine, because their ranges are close to [-1,1]. If a feature has x3∈[-100,+100], then x3 is far from [-1,+1], a gap of order 10^2, so x3 is probably a poorly scaled feature. Likewise, a feature with x4∈(-0.00001,+0.00001) is not appropriate either.
There is no need to worry too much about whether the feature ranges are slightly too large or too small: as long as they are close enough, gradient descent works normally.

Besides dividing each feature by its maximum value, feature scaling can also use mean normalization.
Mean normalization replaces a feature xi with xi - ui, where ui is the average of xi, so that every feature has an average close to 0.

This is not done for x0 = 1, because its value is always 1 and its mean can never be 0. For the other features, take the house area x1∈(0,2000): if the average house area is 1000, x1 can be processed as x1 = (size - 1000)/2000. As another example, if the number of bedrooms lies in [0,5] and an average house has two bedrooms, x2 can be normalized as x2 = (number of bedrooms - 2)/5.

In this way we obtain new features x1 and x2 whose ranges are roughly [-0.5,0.5].
More generally, mean normalization can be written as:
x1 := (x1 - u1)/S1
where x1 is the feature, u1 is the average of x1 over all samples in the training set, and S1 is the range of x1, namely maximum - minimum. The standard deviation can also be used as S1.

Note that with S1 = maximum - minimum, if the number of bedrooms actually ranges from 1 to 5, the denominator 5 above becomes 4. That does not matter: any number that brings the feature ranges closer together is fine. Feature scaling does not need to be very precise; its only purpose is to make gradient descent run faster.
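Mean normalization, x := (x - u)/S, can be sketched as follows. The helper name and the `use_std` flag are illustrative choices, the bedroom counts are sample values, and the standard deviation used here is the population one, which is just one possible convention.

```python
def mean_normalize(values, use_std=False):
    """Return (normalized values, mean u, spread S) with x := (x - u) / S.

    S is max - min by default, or the population standard deviation
    when use_std is True.
    """
    m = len(values)
    mean = sum(values) / m
    if use_std:
        spread = (sum((v - mean) ** 2 for v in values) / m) ** 0.5
    else:
        spread = max(values) - min(values)
    if spread == 0:
        raise ValueError("a constant feature cannot be mean-normalized")
    return [(v - mean) / spread for v in values], mean, spread


bedrooms = [1.0, 2.0, 3.0, 4.0, 5.0]
scaled, u, s = mean_normalize(bedrooms)
print(scaled, u, s)  # [-0.5, -0.25, 0.0, 0.25, 0.5] 3.0 4.0

scaled_std, _, s_std = mean_normalize(bedrooms, use_std=True)
```

Returning u and S alongside the scaled values makes it possible to apply exactly the same transformation later to a new example before predicting.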
The next section introduces another technique for making gradient descent run faster.
Four, Gradient Descent in Practice II - Learning Rate

This section introduces some further techniques for gradient descent, centred on the learning rate α.
Below is the update rule of gradient descent:
θj := θj - α * ∂J(θ)/∂θj
First, we show how to debug gradient descent and some techniques for making sure it works correctly. Second, we look at how to choose the learning rate α.

Here is how to make sure that gradient descent is working correctly.
The job of gradient descent is to find a value of θ that minimizes the cost function J(θ).

To judge whether gradient descent has converged, plot J(θ) against the number of iterations. The horizontal axis is the number of gradient descent iterations (note: not the parameter θ), and the vertical axis is the value of the cost function J(θ); each point on the curve corresponds to one value of θ. In the plot, the cost function hardly decreases between iterations 300 and 400, so we can say that J(θ) has converged by about iteration 400.
This plot helps us see whether the cost function has converged.

Automatic convergence tests are also possible, that is, letting an algorithm tell you whether gradient descent has converged.
A typical test declares convergence when J(θ) decreases by less than some small number ε in one iteration, for example ε = 10^(-3).
In practice, however, choosing a suitable threshold ε is difficult, so the simple plot above is the most common way to judge whether gradient descent has converged.
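The automatic test can be sketched like this: record J(θ) after every iteration and stop once the decrease in one iteration falls below ε. The toy data, default values and function names are assumptions for illustration; the recorded history is exactly what one would plot against the iteration number.

```python
def cost(theta, X, y):
    m = len(y)
    return sum((sum(t * v for t, v in zip(theta, row)) - yi) ** 2
               for row, yi in zip(X, y)) / (2 * m)


def step(theta, X, y, alpha):
    m = len(y)
    errors = [sum(t * v for t, v in zip(theta, row)) - yi for row, yi in zip(X, y)]
    grads = [sum(e * row[j] for e, row in zip(errors, X)) / m
             for j in range(len(theta))]
    return [t - alpha * g for t, g in zip(theta, grads)]


def run_until_converged(X, y, alpha=0.1, epsilon=1e-3, max_iterations=10000):
    """Stop when J(theta) decreases by less than epsilon in one iteration.

    Note: an increase of J also counts as "less than epsilon" and stops the
    loop, so the caller should still inspect the returned history.
    """
    theta = [0.0] * len(X[0])
    history = [cost(theta, X, y)]
    for _ in range(max_iterations):
        theta = step(theta, X, y, alpha)
        history.append(cost(theta, X, y))
        if history[-2] - history[-1] < epsilon:
            break
    return theta, history


# Toy data generated from y = 1 + 2*x
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

theta, history = run_until_converged(X, y)
print(len(history) - 1, "iterations, final J =", history[-1])
```

A smaller ε makes the loop run longer and end closer to the minimum, which is exactly why a good threshold is hard to pick in advance.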

In addition, the curve of J(θ) against the number of iterations can warn you early when the algorithm is not running properly. For example, if the curve shows J(θ) increasing with the number of iterations, gradient descent is clearly not working correctly. This usually means that a smaller learning rate α should be used.

If J(θ) is rising, the most common reason is that α is too large: each iteration overshoots the minimum, jumping from one side of it to the other and ending up further away every time. The obvious fix is to reduce the learning rate α. It could of course also be a bug in the code, so check it carefully.
Another possible pattern is that J(θ) repeatedly decreases and then increases, in a wave-like curve. The cause is again likely to be an α that is too large, and the fix is again a smaller α.
That does not mean that smaller is always better: if α is too small, gradient descent may converge very slowly.

To summarize: if the learning rate α is too small, gradient descent may converge very slowly; if α is too large, J(θ) may fail to decrease on every iteration, may even increase, and may never converge. An α that is too large can also lead to slow convergence. To find out what is actually happening, plot J(θ) against the number of iterations.
In practice, try several values of α, plot the curve of J(θ) against the number of iterations for each, and choose the α that makes the algorithm converge fastest. Typically the candidate values of α are spaced about a factor of 3 apart, for example ...
0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1. Usually you first find the two end points, for example no larger than 1 and no smaller than 0.001. Then you typically pick the largest workable value in this set, for example the largest value 1, or 0.3, which is slightly smaller.
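Trying the candidate values of α can be sketched as follows. The toy data set, the fixed budget of 50 iterations and the selection rule (keep only values for which J(θ) never increases, then take the lowest final cost) are assumptions made for this example, not a prescription.

```python
def cost(theta, X, y):
    m = len(y)
    return sum((sum(t * v for t, v in zip(theta, row)) - yi) ** 2
               for row, yi in zip(X, y)) / (2 * m)


def cost_history(X, y, alpha, iterations=50):
    """Run gradient descent; return J(theta) before the first step and after each step."""
    m = len(y)
    theta = [0.0] * len(X[0])
    history = [cost(theta, X, y)]
    for _ in range(iterations):
        errors = [sum(t * v for t, v in zip(theta, row)) - yi
                  for row, yi in zip(X, y)]
        grads = [sum(e * row[j] for e, row in zip(errors, X)) / m
                 for j in range(len(theta))]
        theta = [t - alpha * g for t, g in zip(theta, grads)]
        history.append(cost(theta, X, y))
    return history


def decreases_every_step(history):
    return all(b <= a for a, b in zip(history, history[1:]))


# Toy data generated from y = 1 + 2*x
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

candidates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
results = {alpha: cost_history(X, y, alpha) for alpha in candidates}

usable = [a for a in candidates if decreases_every_step(results[a])]
best = min(usable, key=lambda a: results[a][-1])
print("usable:", usable, "best:", best)
```

On this particular data set α = 1 makes J(θ) grow, so it is rejected, and the largest remaining value ends with the lowest cost. Each entry of `results` is one curve of J(θ) against the iteration number.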

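Features and Polynomial Regression

Take house price prediction as an example. Suppose there are two features: the width of the house's street frontage (frontage) and its depth (depth). The frontage is in fact the house ...

... frontage * depth, so Area alone can be used as the feature of the model. That is, hθ(x) = θ0 + θ1*x, where x is the area (Area).

...), and we know how to fit this model to the data.

... θ2*(house area)^2 + θ3*(house area)^3, that is, price = θ0 + θ1*(size) + θ2*(size)^2 + θ3*(size)^3 ...

x1 = (size)

x2 = (size)^2

x3 = (size)^3

... size^3∈[1,10^9]; the ranges of these three features differ greatly.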
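The substitution x1 = size, x2 = size^2, x3 = size^3 and the resulting difference in ranges can be checked with a short sketch. The sample sizes between 1 and 1000 are assumed so that size^3 spans [1, 10^9] as stated above, and scaling each feature by its maximum is just one of the scaling options.

```python
def polynomial_features(size):
    """Map one house size to the three features x1, x2, x3 of the cubic model."""
    return [size, size ** 2, size ** 3]


sizes = [1, 10, 100, 1000]  # assumed sample sizes in [1, 1000]
rows = [polynomial_features(s) for s in sizes]

# (min, max) of each feature: the ranges differ by several orders of magnitude
ranges = [(min(col), max(col)) for col in zip(*rows)]
print(ranges)  # [(1, 1000), (1, 1000000), (1, 1000000000)]

# Dividing every feature by its own maximum brings all three into (0, 1]
maxima = [hi for _, hi in ranges]
scaled = [[v / hi for v, hi in zip(row, maxima)] for row in rows]
```

With the new features in place, the cubic model is ordinary linear regression in x1, x2 and x3, which is why feature scaling matters so much here.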

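... c, θ∈R. Suppose θ is just a scalar (a single number, not a vector), and suppose our cost function is a quadratic function of this real parameter θ.

* A learning rate α must be chosen. This means running the algorithm several times with different values of α to find the one that works best, which is extra work and hassle.

* Many iterations are needed. Certain details may also make the iterations very slow.

* No learning rate α has to be chosen.
* No iterations are needed; a single computation is enough. So there is also no need to plot the curve of J(θ) to check convergence, or to take any other extra steps.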
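The surviving text does not name the method that needs only a single computation. The sketch below assumes it is the normal equation θ = (X^T X)^(-1) X^T y, which matches the matrix X^T X discussed in these notes. It solves the linear system (X^T X) θ = X^T y with a small hand-written Gaussian elimination instead of forming the inverse; all names and the toy data are illustrative.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (or nearly so)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def normal_equation(X, y):
    """Return theta solving (X^T X) theta = X^T y in one computation."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(a * b for a, b in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


# Toy data generated from y = 1 + 2*x1 + 3*x2; each row is [x0, x1, x2]
X = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0],
     [1.0, 1.0, 1.0], [1.0, 0.5, 0.2]]
y = [1 + 2 * row[1] + 3 * row[2] for row in X]

theta = normal_equation(X, y)
print([round(t, 6) for t in theta])
```

There is no α and no loop over iterations, which is the contrast drawn by the two lists above.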

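..., what if the matrix (X^T X) is not invertible? (In linear algebra, a non-invertible matrix is also called a singular matrix or a degenerate matrix.)

... 3.28 feet, so x1 and x2 are always related by a fixed conversion, namely x1 = (3.28)^2 * x2. This makes (X^T X) non-invertible.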
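The effect of such a redundant feature can be checked directly. The sketch assumes x2 is an area in square metres and x1 the same area in square feet, so x1 = (3.28)^2 * x2; the areas and room counts are invented. Exact fractions are used so that the determinant comes out as exactly zero rather than as a small rounding error.

```python
from fractions import Fraction


def gram(X):
    """Return X^T X for a matrix given as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


ratio = Fraction(328, 100) ** 2          # (3.28)^2
areas_m2 = [Fraction(50), Fraction(80), Fraction(120), Fraction(200)]

# Rows are [x0, x1, x2] with x1 = (3.28)^2 * x2: the columns are linearly dependent
X_redundant = [[Fraction(1), ratio * a, a] for a in areas_m2]
det_redundant = det3(gram(X_redundant))

# Replacing x1 with an unrelated feature (made-up room counts) removes the dependence
rooms = [Fraction(2), Fraction(3), Fraction(3), Fraction(5)]
X_independent = [[Fraction(1), r, a] for r, a in zip(rooms, areas_m2)]
det_independent = det3(gram(X_independent))

print(det_redundant, det_independent != 0)
```

A zero determinant means X^T X has no inverse. Dropping one of the two redundant features is the simplest way out.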
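Another cause is using too many features (e.g. m <= n). For example, with m = 10 samples and n = 100 features, adding x0 gives 101 features, ...
..., trying to find values for 101 parameters from only 10 examples, which takes a long time. Later we will introduce a linear algebra method called "regularization" that can be used to ...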