One, Multiple Features
This section introduces a more powerful form of linear regression, one that works with multiple variables, or multiple features.
So far we have discussed the regression model with a single variable/feature: predicting the house price y from the house area x. The formula below, hθ(x) = θ0 + θ1x, is what we call the "hypothesis", in which x is the only feature.
Now let's add more features to the house-price model, such as the number of rooms and the number of floors, and build a model with multiple variables. The features in the model are (x1, x2, ..., xn).


Having added more features, we introduce some new notation:

n denotes the number of features.

x^(i) denotes the i-th training example, i.e., the i-th row of the feature matrix; it is a vector.

For example, x^(2) above denotes the feature vector of the second training example.

x^(i)_j denotes feature j in row i of the feature matrix, i.e., the j-th feature of the i-th training example.

For example, x^(2)_3 denotes the 3rd feature of the second example in the figure above.




The hypothesis h for multiple variables is written as:

hθ(x) = θ0 + θ1x1 + θ2x2 + ... + θnxn

There are n+1 parameters and n variables, where the n variables stand for the n features. To simplify the formula, we introduce x0 = 1, so that it becomes:

hθ(x) = θ0x0 + θ1x1 + θ2x2 + ... + θnxn

For example (units: k$):

hθ(x) = 80 + 0.1x1 + 0.01x2 + 3x3 - 2x4

This could be used to predict the price of a house after some period of time: the base price of the house is 80 k$; the term 0.1x1 says the price rises by 0.1 k$ per unit of area over that period; the price increases by 0.01·x2 with the number of rooms x2 and by 3·x3 with the number of floors x3; and the term -2·x4 says the price depreciates as the age of the house x4 increases.
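As a quick check on the arithmetic, here is a minimal sketch in Python that evaluates the example hypothesis above (the feature values are invented purely for illustration):

```python
# Evaluate h(x) = 80 + 0.1*x1 + 0.01*x2 + 3*x3 - 2*x4 for one (made-up) house.
theta = [80, 0.1, 0.01, 3, -2]     # theta_0 .. theta_4, prices in k$
x = [1, 1500, 3, 2, 10]            # x0 = 1, area, rooms, floors, age in years

price = sum(t * xi for t, xi in zip(theta, x))
print(f"predicted price: {price:.2f} k$")   # 80 + 150 + 0.03 + 6 - 20 = 216.03
```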

The parameter vector θ in this model is (n+1)-dimensional (the first component multiplies x0, which is the constant 1; you could also say x0 is an extra feature we define ourselves). Each training example is likewise an (n+1)-dimensional vector, and the feature matrix X has dimension m*(n+1). The feature vector and parameter vector can then be written as:

x = [x0, x1, ..., xn]^T,  θ = [θ0, θ1, ..., θn]^T




The hypothesis can then be rewritten as:

hθ(x) = θ0x0 + θ1x1 + θ2x2 + ... + θnxn

which simplifies to

hθ(x) = θ^T x

where the superscript T denotes the transpose of the vector θ.


This is the form of the hypothesis when there are multiple features; it is also called multiple (multivariate) linear regression. "Multivariate" simply refers to the multiple features or variables used for prediction; it is just a slightly fancier name.
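As an illustration of the vectorized form hθ(x) = θ^T x, here is a small Python sketch (not part of the original notes; the numbers are invented) that computes the hypothesis for a whole training set at once:

```python
import numpy as np

def hypothesis(X, theta):
    """Vectorized hypothesis: returns X @ theta, one prediction per row of X."""
    return X @ theta

# X is m x (n+1): a leading column of ones (x0 = 1), then the n features.
X = np.array([[1.0, 2104, 5],
              [1.0, 1416, 3],
              [1.0, 1534, 3]])
theta = np.array([80.0, 0.1, 0.01])
print(hypothesis(X, theta))   # h_theta(x^(i)) for i = 1..m
```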
Two, Gradient Descent for Multiple Variables
In the previous section, we discussed multivariable ( Or multiple features ) Hypothesis form of linear regression , This section describes how to set parameters for this assumption , Especially, how to use gradient descent to deal with multiple linear regression .
To quickly recap the notation: the hypothesis is the multiple-linear-regression form hθ(x) = θ^T x, with the usual convention x0 = 1. The parameters of the model are θ0 through θn; rather than treating them as n+1 separate variables, we treat them as a single (n+1)-dimensional vector, so the parameters of the model are themselves a vector θ. The cost function is given by the sum of the squared error terms,

J(θ) = 1/(2m) * Σ_{i=1}^{m} (hθ(x^(i)) - y^(i))^2

but instead of thinking of J as a function of n+1 separate arguments, we think of it as a function of the parameter vector θ.



Here is gradient descent: we repeatedly update every θj as

θj := θj - α * ∂J(θ)/∂θj

where α is the learning rate and the derivative term is the partial derivative of the cost function with respect to θj,

∂J(θ)/∂θj = 1/m * Σ_{i=1}^{m} (hθ(x^(i)) - y^(i)) * x^(i)_j
Now let's see what gradient descent looks like in practice.
Below, on the left is gradient descent for n = 1: there are two separate update rules for the parameters θ0 and θ1, and the circled part is the partial derivative of the cost function J with respect to θ0. On the right is gradient descent for n >= 1, where the circled part is the partial derivative of J with respect to θj.
It is worth explaining why these two algorithms are the same thing, i.e., why both are the gradient descent algorithm. Consider the following example: there are three parameters θ0 through θ2, updated by three update rules.
Looking at the update rule for θ0, you can see that it is the same as the update rule for θ0 in the n = 1 case. (The reason is that our notation has the convention x^(i)_0 = 1.)
Looking at the update rule for θ1, you can see that it is in fact the same as the update rule for θ1 in the n = 1 case; we have simply used the new notation x^(i)_1 for the first feature.
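The update rule above translates directly into code. Here is a minimal sketch in Python/NumPy (an illustration, not the course's code), assuming the design matrix X already contains the x0 = 1 column:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for multivariate linear regression.

    X: m x (n+1) design matrix with x0 = 1 in the first column.
    y: length-m vector of targets.
    """
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        error = X @ theta - y              # h_theta(x^(i)) - y^(i) for every i
        gradient = (X.T @ error) / m       # partial derivative of J w.r.t. each theta_j
        theta = theta - alpha * gradient   # simultaneous update of theta_0 .. theta_n
    return theta
```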


Three, Gradient Descent in Practice I: Feature Scaling

This section and the next present some practical tricks that make gradient descent work better. To speed up the convergence of gradient descent, this section explains a technique called feature scaling.
Suppose you have a machine learning problem with multiple features. If you make sure those features take values in similar ranges, then gradient descent will converge faster on this problem.
For example, suppose a problem has two features x1 and x2, where x1 is the size of the house, with x1 ∈ (0, 2000), and x2 is the number of rooms, with x2 ∈ (1, 5).


Draw the contours of the cost function J(θ). J(θ) is a function of θ0, θ1, and θ2, but θ0 only affects where the contour plot sits in the coordinate system, not its shape, so we set θ0 aside for now and only plot J(θ) with respect to θ1 and θ2.

Because the range of x1 is far larger than that of x2, the contours of the cost function are flat and skewed; the 2000:5 ratio makes the ellipses in the contour plot very elongated. If we run gradient descent on this cost function, it can take a long time to converge to the global minimum, as shown on the right.
If you exaggerate the picture, as shown on the left, it can be even worse: the algorithm may even oscillate back and forth before finding its way to the global minimum.


A small aside, a question the editor once puzzled over: why isn't the convergence path in the figure above like the orange path in the figure below, heading straight for the lowest point? Why does it take a detour first? Looking back now, the question seems rather naive:

After some thought, the reason is probably this: with gradient descent, at any given point, how is the direction of the next step chosen? The point sits on a 3D surface, so there are tangent directions all around it (360°), and the direction chosen is the one of steepest slope. Intuitively, as shown below, imagine a person standing at the black dot in the figure; the region is too large, so their field of view is limited to the black circle. Within that circle, at the current position, the steepest direction they can see is the one marked by the big red arrow.
Therefore, when the scales of the cost function's parameters differ too much, an effective remedy is feature scaling.
Concretely, you can define feature x1 as house area / 2000 and feature x2 as number of rooms / 5, so that both x1 and x2 lie in [0, 1]. The contours of the cost function then become much closer to circles, and gradient descent quickly finds a direct path to the minimum. So, by scaling the features, the disparity in their ranges can be removed.



In general, the goal of feature scaling is to constrain the feature values to roughly the range [-1, +1].
The feature x0 always equals 1, so it already satisfies x0 ∈ [-1, +1]. The other features may need some processing (for example, dividing each by a different number) to keep them in a comparable range.
Note that the numbers -1 and +1 are not particularly important. If feature x1 ∈ (0, 3) and feature x2 ∈ (-1, 2), that is fine, because these are close to [-1, 1]. But if feature x3 ∈ [-100, +100], then x3 is far from [-1, +1], off by about two orders of magnitude, and it is probably a poorly scaled feature. Likewise, a feature like x4 ∈ (-0.00001, +0.00001) is not appropriate either.
But don't worry too much about whether a feature's range is a bit too large or too small; as long as the ranges are close enough, gradient descent will work fine.

In feature scaling, besides dividing each feature by its maximum value, you can also perform mean normalization.
Mean normalization means replacing a feature xi with xi - μi, so that the features all have mean 0.


about x0=1, This processing is not required , Because its value is equal to 1, Mean cannot be equal to 0. For other features , For example, features representing the area of a house x1∈(0,2000), If the characteristic value of the house area , The average is 1000, Then you can x1 Proceed as follows . Another example , The number of bedrooms in the room is [0,5], An average house has two bedrooms , Then it can be normalized as follows x2.

In this way we obtain new features x1 and x2 whose values lie roughly in the range [-0.5, 0.5].
More generally, mean normalization can be written as:

x1 := (x1 - μ1) / s1

where x1 is the feature, μ1 is the average value of feature x1 over all the samples in the training set, and s1 is the range of x1, namely the maximum minus the minimum; the standard deviation can also be used as s1.


Now you can see that if you take s1 = maximum - minimum, the denominator 5 above becomes 4, but that doesn't matter: any number that brings the ranges of the features closer together is fine. So feature scaling does not need to be very precise; it only has to make gradient descent run faster.
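A possible implementation of mean normalization, shown here as a Python sketch (the data values are invented; the x0 = 1 column would be added after scaling):

```python
import numpy as np

def mean_normalize(X):
    """Replace each feature x_j with (x_j - mu_j) / s_j, where s_j = max - min.

    The standard deviation could be used for s_j instead; either choice is fine,
    since feature scaling only needs to bring the ranges close together.
    """
    mu = X.mean(axis=0)
    s = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s

# House area in (0, 2000) and number of rooms in (1, 5); values invented.
X = np.array([[1800.0, 5], [1200.0, 3], [850.0, 2]])
print(mean_normalize(X))   # every column now spans a range of width 1 around 0
```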
The next section introduces another technique for making gradient descent run faster.
Four, Gradient Descent in Practice II: Learning Rate
This section introduces some other practical tips for the gradient descent algorithm, centered on the learning rate α.
Below is the update rule of the gradient descent algorithm. First, we look at how to debug, i.e., some tricks to make sure gradient descent is working correctly. Second, we look at how to choose the learning rate α.

Here is how to make sure that gradient descent is working correctly.
The job of gradient descent is to find a value of θ that minimizes the cost function J(θ).

To judge whether gradient descent has converged, you can plot the curve of J(θ) against the number of iterations: the horizontal axis is the iteration number of the gradient descent algorithm (note: not the parameter θ), the vertical axis is the value of the cost function J(θ), and each point on the curve corresponds to one value of θ. In the plot, between iterations 300 and 400 the cost function barely decreases, so we can say that J(θ) has converged by around iteration 400.
This plot helps us see whether the cost function has converged.

You can also run an automatic convergence test, i.e., use an algorithm to tell you whether gradient descent has converged.
A typical test: if J(θ) decreases by less than some very small number ε in one iteration, declare convergence; for example, you might choose ε = 10^-3.
In practice, however, it is hard to pick an appropriate threshold ε, so the most common way to judge convergence is the simple plot described above.

In addition, the "J(θ) vs. number of iterations" curve can also warn you in advance when the algorithm is not working properly. For example, if the curve looks like this, i.e., J(θ) increases with the number of iterations, then clearly gradient descent is not working correctly, which usually means a smaller learning rate α should be used.

If J(θ) is rising, the most common reason is that α is too large: the iterations easily overshoot the minimum, bouncing back and forth around it and moving further away each time. The obvious fix is to decrease the learning rate α. It could also be a bug in the code, so check carefully.
The curve may also look like the one in the lower-left corner, with J(θ) decreasing, then increasing, in a wave-like pattern. The cause is again likely an α that is too large, and the fix is again a smaller α.
But that is not to say that the smaller α is the better: if α is too small, gradient descent may converge very slowly.

To summarize: if the learning rate α is too small, gradient descent may converge very slowly; if α is too large, J(θ) may fail to decrease at every iteration, may even increase, and the algorithm may not converge at all (an overly large α can also cause slow convergence). To find out which is happening, plot J(θ) against the number of iterations.
In practice, you can try several values of α, plot the "J(θ) vs. number of iterations" curve for each, and pick the α that makes the algorithm converge fastest as the final value. Typically the candidate values are spaced about 3x apart, for example
0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1. Usually you first find the two endpoints, e.g., you cannot go larger than 1 or smaller than 0.001, and then pick the largest value in the set that still works, e.g., the largest value 1, or the value just below it, 0.3.
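One way to carry out this search, sketched in Python with an invented toy data set (not code from the course): run gradient descent for each candidate α and compare the "J(θ) vs. number of iterations" curves.

```python
import numpy as np
import matplotlib.pyplot as plt

def cost(X, y, theta):
    """J(theta) = 1/(2m) * sum of squared errors."""
    return np.sum((X @ theta - y) ** 2) / (2 * len(y))

def cost_history(X, y, alpha, iters=100):
    """Run gradient descent and record J(theta) after every iteration."""
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / len(y)
        history.append(cost(X, y, theta))
    return history

# Tiny invented data set: x0 = 1 plus one feature.
X = np.c_[np.ones(4), [1.0, 2.0, 3.0, 4.0]]
y = np.array([2.5, 3.9, 6.2, 8.1])

for alpha in (0.001, 0.003, 0.01, 0.03, 0.1, 0.3):
    plt.plot(cost_history(X, y, alpha), label=f"alpha = {alpha}")
plt.xlabel("iteration"); plt.ylabel("J(theta)")
plt.yscale("log")        # the largest alpha may visibly diverge (J increases)
plt.legend(); plt.show()
```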

Five, Features and Polynomial Regression
 
Earlier we learned about linear regression with multiple variables. In this section we look at how to choose features and how that choice gives rise to different learning algorithms; with well-chosen features these algorithms can be very effective. We will also cover polynomial regression, which lets us use the machinery of linear regression to fit very complicated functions, even nonlinear ones.
Take predicting house prices as an example, and suppose there are two features: the frontage of the lot (its width along the street) and its depth. The frontage is simply the width of the plot of land the house sits on, and the depth is its length. We can build the linear regression model below, with frontage as the first feature and depth as the second:

hθ(x) = θ0 + θ1*(frontage) + θ2*(depth)

Don't forget the purpose of this section: explaining how to choose features. So think about it: is the choice above the only way to pick features? Of course not; there are other options. For example, if we let Area = frontage * depth, we can use Area as the model's single feature, i.e., hθ(x) = θ0 + θ1*x, where x is the area Area.
A concept closely related to feature choice is polynomial regression (Polynomial Regression). Suppose we have the housing-price data set below; to fit the data, several models are possible. For example, we could pick a quadratic model, but a quadratic eventually comes back down, and we don't believe house prices come back down once houses get large enough. So we might switch to a different model, such as a cubic model.
So how do we actually fit such a model to the data? Using the machinery of multivariate linear regression, we can do it with a simple modification of our model.
Suppose we pick the cubic function as our model. In the earlier hypothesis form (hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3, the "hypothesis formula"), we already know how to fit the model to data.
What if we want to fit the cubic model below (θ0 + θ1x + θ2x^2 + θ3x^3)? We are predicting the price of a house, so price = θ0 + θ1*(size) + θ2*(size)^2 + θ3*(size)^3 (the "model formula"). This expression is therefore equal to the cubic model expression (the one in the green box).

To make the hypothesis formula correspond to the cubic model formula, set
x1 = (size)

x2 = (size)^2

x3 = (size)^3
The idea, in summary, is: define the features as above and then apply the method of linear regression (hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3), and you can fit the cubic model.
If you choose the features this way (x1 = size, x2 = size^2, x3 = size^3), then feature scaling becomes all the more important, because size ∈ [0, 10^3], size^2 ∈ [1, 10^6], and size^3 ∈ [1, 10^9]: the ranges of the three features differ enormously.
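Here is a short Python sketch of this feature construction (the house sizes are invented). After building x1 = size, x2 = size^2, x3 = size^3 and scaling them, the usual linear regression machinery applies unchanged:

```python
import numpy as np

size = np.array([1000.0, 1500.0, 2000.0, 2500.0])   # invented house sizes

# Polynomial features: x1 = size, x2 = size^2, x3 = size^3.
X = np.column_stack([size, size ** 2, size ** 3])

# Feature scaling is essential here: the raw columns span ~10^3, 10^6 and 10^9.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Prepend x0 = 1; X can now go into gradient descent exactly as before.
X = np.c_[np.ones(len(size)), X]
print(X)
```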

Finally, one more example of how to actually choose the features to use.
As shown below, a quadratic model may not fit the data well, but besides switching to a cubic model there is yet another option: changing the formula to the form below, for example a model with a square-root term such as θ0 + θ1*(size) + θ2*sqrt(size), may be enough.


Summary: in this section we discussed polynomial regression, that is, how to fit a polynomial, such as a quadratic or cubic function, to the data.
We also discussed how to choose features: for example, instead of using the frontage and depth of the lot, we can use their product and obtain the land area as a single feature.
In practice, with many possible features it can be hard to choose among them. Later we will introduce algorithms that automatically choose which features to use, so an algorithm can look at the data and decide for itself whether to fit a quadratic function, a cubic function, or something else.
Six, Normal Equation
We have already covered the gradient descent algorithm, but for some linear regression problems, solving for the optimal value of the parameters θ with the normal equation method works better.
So far, to minimize the cost function in linear regression problems we have always used gradient descent. (What is the relationship among linear regression, gradient descent, and the normal equation? See 机器学习_线性回归,梯度下降算法与正规方程 <http://blog.csdn.net/matrix5267/article/details/53768983>.
A linear regression problem: given a training set, fit a straight line so that the cost is as small as possible. So we only need to find the parameters that minimize the cost, and to do that we can use either gradient descent or the normal equation method.)



Gradient descent determines the parameters θ by iterating many times until the cost function finally converges to its minimum.




In contrast, the normal equation method gives a way to solve for θ analytically: instead of many iterations, it computes the value of θ directly in one go, so the optimum is obtained in a single step.
To illustrate the method, suppose we have a very simple cost function J(θ) = aθ^2 + bθ + c with θ ∈ R ("simple" in that θ is a real number). That is, θ is just a scalar, a single number rather than a vector, and the cost function is a quadratic function of this real parameter θ.
How do we minimize a quadratic function? From the point of view of calculus: take the derivative, set it to zero, and solve that equation to get the θ that minimizes J(θ).
In the example above θ was just a single real number. What if θ is an (n+1)-dimensional parameter vector and the cost function is a function of that vector? How do we minimize such a cost function?
Calculus tells us to take the partial derivative of the cost function J with respect to each parameter θj, set them all to zero, and solve for θ0 through θn. But carrying out that partial-derivative calculation by hand is far too laborious to be practical.

So what should we do instead? An example: suppose a training set contains m = 4 training examples. To apply the normal equation method, first add an extra column to the training set for the extra feature x0, whose value is always 1. Then build a matrix X containing all the feature values of the training examples, and build a vector y of the values to be predicted. Here X is an m*(n+1) matrix and y is an m-dimensional vector, where m is the number of training examples and n is the number of features (n+1 because of the extra feature x0). Finally, compute

θ = (X^T X)^(-1) X^T y

and this gives the θ that minimizes the cost function.

The above was a specific example; here is the more general form.
In general, suppose there are m training examples, each with n features, so each training example has a corresponding (n+1)-dimensional feature vector.
Next, build the matrix X, also called the design matrix, as follows:
take the first training example (which is a vector), take its transpose, and let (x^(1))^T be the first row of the matrix; then append the transposes of the other examples' feature vectors, row by row. The resulting matrix X is m*(n+1).


For example, suppose that apart from x0, which is always 1, there is only one other feature. As before, the superscript on x indicates which training example and the subscript indicates which feature. The design matrix built this way is an m*2 matrix. The vector y contains all the labels from the training set (the "right answers"; in this example, the house prices).
Once the matrix X and the vector y have been built, θ is obtained by computing θ = (X^T X)^(-1) X^T y.



Earlier we discussed feature normalization and keeping the feature values in similar ranges; with the normal equation method none of that is needed, and it does not matter if the feature ranges differ by orders of magnitude. With gradient descent, on the other hand, feature normalization is very important.
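A minimal sketch of the normal equation in Python/NumPy (the housing numbers are placeholders in the spirit of the m = 4 example above, not necessarily the values from the original table):

```python
import numpy as np

# Design matrix X: m x (n+1), first column is the extra feature x0 = 1.
# Two features (size, bedrooms); the numbers are placeholders.
X = np.array([[1.0, 2104, 5],
              [1.0, 1416, 3],
              [1.0, 1534, 3],
              [1.0,  852, 2]])
y = np.array([460.0, 232.0, 315.0, 178.0])   # labels: house prices

# theta = (X^T X)^(-1) X^T y; pinv is a safer drop-in for inv here.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)        # no learning rate, no iterations, no feature scaling needed
print(X @ theta)    # predictions for the training examples
```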

Finally, when should you use gradient descent and when should you use the normal equation? Their advantages and disadvantages are listed below, assuming m training examples and n features.
Disadvantages of gradient descent:
* You need to choose the learning rate α, which means running the algorithm several times with different values of α to find the one that works best; that is extra work and hassle.

* It needs many iterations, and depending on the details the iterations may also be slow.


The corresponding advantages of the normal equation method:
* No need to choose a learning rate α.
* No iterations; a single computation suffices, so there is no need to plot the J(θ) curve to check for convergence, or to take any other extra steps.

Advantage of gradient descent: it works well even when the number of features is very large.
Disadvantage of the normal equation method:
To solve for θ you must compute (X^T X)^(-1), where X^T X is an n×n matrix ((n+1)×(n+1), to be precise), and inverting a matrix costs roughly the cube of its dimension. So if there are very many features, this term is very expensive to compute and the normal equation method becomes slow.
A reasonable rule of thumb is to use the normal equation when the number of features is under about 10,000, and gradient descent when it is larger.
Summary: when the number of features is modest (under about 10,000), the normal equation is a very convenient way to compute θ.
Later, when we get to more complex algorithms such as classification algorithms, the normal equation method will not apply and we will have to use gradient descent, so gradient descent remains a very important algorithm.

Seven, Normal Equation Noninvertibility
Here is an issue concerning the normal equation and linear regression. For the formula that computes θ,

θ = (X^T X)^(-1) X^T y

what if the matrix X^T X is not invertible? (In linear algebra, a non-invertible matrix is also called a singular or degenerate matrix.)

When you do the computation in a program such as Octave, you can still get a correct solution even when X^T X is non-invertible. Octave has two functions that compute a matrix inverse: the pseudo-inverse function pinv() and the inverse function inv(); using pinv() still gives a usable θ in the singular case.
Why would X^T X ever be non-invertible?
One possible reason is redundant features in the learning problem: for example, when predicting housing prices, if x1 is the living area measured in square feet and x2 is the same area measured in square meters, then since 1 meter = 3.28 feet, x1 and x2 always satisfy the fixed relationship x1 = (3.28)^2 * x2, and this makes X^T X non-invertible.
Another reason is using too many features (e.g., m <= n): for example, with m = 10 examples and n = 100 features (101 parameters once x0 is added), trying to determine 101 parameter values from only 10 examples is asking too much of the data. Later we will see how a technique called regularization makes it possible to obtain these 101 parameter values from a small data set, and how deleting some features, or other techniques, can handle the m <= n case.
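To illustrate the redundant-feature case, here is a Python sketch with invented numbers, where np.linalg.pinv plays the role of Octave's pinv():

```python
import numpy as np

# x1 is the same area as x2, just measured in square feet: x1 = 3.28^2 * x2.
area_m2 = np.array([100.0, 150.0, 200.0])    # x2: area in square meters
area_ft2 = (3.28 ** 2) * area_m2             # x1: the same area in square feet
X = np.c_[np.ones(3), area_ft2, area_m2]     # x0, x1, x2 (x1 and x2 are redundant)
y = np.array([200.0, 300.0, 400.0])          # invented prices

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))            # 2, not 3: X^T X is singular
# np.linalg.inv(XtX) would fail or give unreliable numbers; the pseudo-inverse
# still returns a usable least-squares solution for theta.
theta = np.linalg.pinv(XtX) @ X.T @ y
print(theta)
```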