Derivative and gradient

Derivative: the derivative of a univariate function at a point describes the rate of change of the function near that point:

$$f'(a) = \lim_{h \to 0} \frac{f(a+h) - f(a)}{h}$$

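As a quick numerical illustration (a minimal sketch of my own, with $f(x) = x^2$ chosen as an arbitrary example), the limit can be approximated by taking a small step $h$:

```python
# Approximate f'(a) with a forward difference and compare to the exact derivative.
# f(x) = x**2 is an arbitrary example; its exact derivative is 2x.
def f(x):
    return x ** 2

def forward_difference(f, a, h=1e-6):
    """Finite-difference approximation of f'(a): (f(a+h) - f(a)) / h."""
    return (f(a + h) - f(a)) / h

a = 3.0
print(forward_difference(f, a))  # ~6.000001, close to the exact value 2*a = 6
```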

Gradient: the derivative of a multivariate function is the gradient.

* First derivative, the gradient:

$$\nabla f(X) = \frac{\partial f(X)}{\partial X} = \begin{bmatrix} \frac{\partial f(X)}{\partial x_1} \\ \frac{\partial f(X)}{\partial x_2} \\ \vdots \\ \frac{\partial f(X)}{\partial x_n} \end{bmatrix}$$


* Second derivative, the Hessian matrix:
$$H(X) = \nabla^2 f(X) = \begin{bmatrix} \frac{\partial^2 f(X)}{\partial x_1^2} & \frac{\partial^2 f(X)}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(X)}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f(X)}{\partial x_2 \partial x_1} & \frac{\partial^2 f(X)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(X)}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(X)}{\partial x_n \partial x_1} & \frac{\partial^2 f(X)}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(X)}{\partial x_n^2} \end{bmatrix}$$
The first and second derivatives are often written as $f'(x)$ and $f''(x)$; a small numerical sketch of the gradient and Hessian follows below.
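To make these definitions concrete, here is a sketch of my own (the quadratic $f(X) = x_1^2 + 3x_1x_2 + 2x_2^2$ is an arbitrary example) that approximates the gradient vector and the Hessian matrix with central differences:

```python
import numpy as np

def f(x):
    # Arbitrary example: f(x) = x1^2 + 3*x1*x2 + 2*x2^2
    return x[0] ** 2 + 3 * x[0] * x[1] + 2 * x[1] ** 2

def gradient(f, x, h=1e-5):
    """Central-difference approximation of the gradient vector."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def hessian(f, x, h=1e-4):
    """Central-difference approximation of the Hessian matrix."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

x = np.array([1.0, 2.0])
print(gradient(f, x))  # ~[2*1 + 3*2, 3*1 + 4*2] = [8, 11]
print(hessian(f, x))   # ~[[2, 3], [3, 4]]
```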

Taylor expansion: the Taylor expansion of a univariate function is

$$f(x_k + \delta) \approx f(x_k) + f'(x_k)\,\delta + \frac{1}{2} f''(x_k)\,\delta^2 + \cdots + \frac{1}{n!} f^{(n)}(x_k)\,\delta^n$$
The Taylor expansion of a multivariate function (keeping only the first three terms) is
$$f(x_k + \delta) \approx f(x_k) + \nabla^T f(x_k)\,\delta + \frac{1}{2}\,\delta^T \nabla^2 f(x_k)\,\delta$$
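As a sanity check of the expansion (again my own sketch, using an arbitrary smooth example function), the second-order approximation can be compared with the true function value for a small step $\delta$:

```python
import numpy as np

def f(x):
    # Arbitrary smooth example: f(x) = sin(x1) + x1 * x2^2
    return np.sin(x[0]) + x[0] * x[1] ** 2

def grad_f(x):
    return np.array([np.cos(x[0]) + x[1] ** 2, 2 * x[0] * x[1]])

def hess_f(x):
    return np.array([[-np.sin(x[0]), 2 * x[1]],
                     [2 * x[1],      2 * x[0]]])

xk = np.array([0.5, 1.0])
delta = np.array([0.01, -0.02])

# Second-order Taylor approximation: f(xk) + grad^T delta + 0.5 * delta^T H delta
taylor2 = f(xk) + grad_f(xk) @ delta + 0.5 * delta @ hess_f(xk) @ delta
print(f(xk + delta), taylor2)  # the two values agree closely for small delta
```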


If $\nabla f(x_k) = 0$, then $x_k$ is called a "stationary point". For a univariate function, such a point is a local extremum (a local maximum or a local minimum) when the second derivative there is nonzero; if $f$ is a convex function, it is in fact a global minimum. Convex functions are briefly introduced in the next section.

For a multivariate function, if $\nabla^2 f(x_k) \succ 0$ (positive definite, i.e., all eigenvalues are positive), then the third term of the expansion above is positive and $x_k$ is a strict local minimum (conversely, if $\nabla^2 f(x_k) \prec 0$ is negative definite, $x_k$ is a strict local maximum). The more complicated case is when the Hessian has both positive and negative eigenvalues; then the sign of the third term is indeterminate and $x_k$ is a saddle point, i.e., a local minimum along some dimensions and a local maximum along others. Saddle points are one of the core difficulties in neural network training; I will write about them in later posts. Back to basics.
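A minimal sketch of this classification, assuming the Hessian at the stationary point has already been computed (the two example Hessians below are my own, taken from $f(x, y) = x^2 - y^2$ and $f(x, y) = x^2 + y^2$):

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a stationary point at the origin; its Hessian there
# is constant: [[2, 0], [0, -2]].
H_saddle = np.array([[2.0, 0.0],
                     [0.0, -2.0]])
H_minimum = np.array([[2.0, 0.0],   # Hessian of f(x, y) = x^2 + y^2
                      [0.0, 2.0]])

def classify(H):
    eig = np.linalg.eigvalsh(H)      # eigenvalues of a symmetric matrix
    if np.all(eig > 0):
        return "strict local minimum"
    if np.all(eig < 0):
        return "strict local maximum"
    return "saddle point (mixed-sign eigenvalues)"

print(classify(H_saddle))   # saddle point (mixed-sign eigenvalues)
print(classify(H_minimum))  # strict local minimum
```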

Taylor expansion is at the core of many mathematical problems, so let's digress briefly here.
Question: why do we choose the gradient direction in optimization, and why is the gradient direction the direction of fastest change?

From the first two terms of the Taylor expansion, $f(x_k + \delta) \approx f(x_k) + \nabla^T f(x_k)\,\delta$, we see that when $\delta$ is a vector with fixed norm but arbitrary direction, $f(x_k + \delta) - f(x_k) \approx \nabla^T f(x_k)\,\delta$. Here $\nabla^T f(x_k)\,\delta = \|\nabla f(x_k)\| \cdot \|\delta\| \cos\theta$, whose absolute value is largest when $\cos\theta = \pm 1$, i.e., when $\delta$ points along the gradient direction or the negative gradient direction. For minimization this gives the gradient descent method: taking $\delta$ in the negative gradient direction makes $f(x)$ decrease fastest.
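This claim is easy to check numerically; in the sketch below (my own, reusing the arbitrary quadratic from earlier), the direction of steepest decrease among unit-norm steps is compared with the negative gradient direction:

```python
import numpy as np

def f(x):
    return x[0] ** 2 + 3 * x[0] * x[1] + 2 * x[1] ** 2   # same arbitrary example as above

def grad_f(x):
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0] + 4 * x[1]])

xk = np.array([1.0, 2.0])
g = grad_f(xk)
step = 1e-3  # fixed step length ||delta||

# Compare the change f(xk + delta) - f(xk) over many unit directions.
angles = np.linspace(0, 2 * np.pi, 360, endpoint=False)
changes = [f(xk + step * np.array([np.cos(t), np.sin(t)])) - f(xk) for t in angles]

best = angles[int(np.argmin(changes))]
best_dir = np.array([np.cos(best), np.sin(best)])
# The direction with the largest decrease approximately matches -gradient/||gradient||.
print(best_dir, -g / np.linalg.norm(g))
```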

Summary of matrix derivatives

(1) Derivatives with respect to a scalar

* Derivative of a scalar $y$ with respect to a scalar $x$:
$$\frac{\partial y}{\partial x}$$
* Derivative of a vector with respect to a scalar $x$ (see the sketch after this list):
The derivative of the vector $y = (y_1, y_2, \ldots, y_n)^T$ with respect to the scalar $x$ is obtained by differentiating each element of $y$ with respect to $x$, and can be expressed as
$$\frac{\partial y}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x} \\ \frac{\partial y_2}{\partial x} \\ \vdots \\ \frac{\partial y_n}{\partial x} \end{bmatrix}$$
* Derivative of a matrix with respect to a scalar $x$:
The derivative of a matrix with respect to a scalar is analogous to that of a vector: each element of the matrix is differentiated with respect to the scalar $x$,
$$\frac{\partial Y}{\partial x} = \begin{bmatrix} \frac{\partial y_{11}}{\partial x} & \frac{\partial y_{12}}{\partial x} & \cdots & \frac{\partial y_{1n}}{\partial x} \\ \frac{\partial y_{21}}{\partial x} & \frac{\partial y_{22}}{\partial x} & \cdots & \frac{\partial y_{2n}}{\partial x} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_{n1}}{\partial x} & \frac{\partial y_{n2}}{\partial x} & \cdots & \frac{\partial y_{nn}}{\partial x} \end{bmatrix}$$
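A small sketch of my own illustrating the element-wise rule, with $y(x) = [x^2, \sin x, e^x]$ as an arbitrary example (the matrix case works the same way, element by element):

```python
import numpy as np

# Differentiate each element of a vector-valued function y(x) with respect to the
# scalar x by central differences.
def y(x):
    return np.array([x ** 2, np.sin(x), np.exp(x)])

def d_dx(func, x, h=1e-6):
    return (func(x + h) - func(x - h)) / (2 * h)

x = 1.0
print(d_dx(y, x))                               # elementwise derivative dy/dx
print(np.array([2 * x, np.cos(x), np.exp(x)]))  # exact: [2x, cos(x), e^x]
```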
(2) Derivatives with respect to a vector

* Derivative of a scalar with respect to a vector $x$:
The derivative of a scalar $y$ with respect to the vector $x = (x_1, x_2, \ldots, x_n)^T$ can be expressed as
$$\frac{\partial y}{\partial x} = \begin{bmatrix} \frac{\partial y}{\partial x_1} & \frac{\partial y}{\partial x_2} & \cdots & \frac{\partial y}{\partial x_n} \end{bmatrix}$$
* Derivative of a vector with respect to a vector $x$ (a numerical sketch follows after this list):
The derivative of a vector function (i.e., a vector whose entries are functions) $y = (y_1, y_2, \ldots, y_n)^T$ with respect to $x = (x_1, x_2, \ldots, x_n)^T$ is
$$\frac{\partial y}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_n}{\partial x_1} & \frac{\partial y_n}{\partial x_2} & \cdots & \frac{\partial y_n}{\partial x_n} \end{bmatrix}$$
The matrix $\frac{\partial y}{\partial x}$ obtained in this way is called the Jacobian matrix.

* Derivative of a matrix with respect to a vector:
The derivative of the matrix $Y = (y_{ij})_{n \times n}$ with respect to $x = (x_1, x_2, \ldots, x_n)^T$ is the most complicated of these cases and is expressed as
$$\frac{\partial Y}{\partial x} = \begin{bmatrix} \frac{\partial y_{11}}{\partial x_1} & \frac{\partial y_{12}}{\partial x_2} & \cdots & \frac{\partial y_{1n}}{\partial x_n} \\ \frac{\partial y_{21}}{\partial x_1} & \frac{\partial y_{22}}{\partial x_2} & \cdots & \frac{\partial y_{2n}}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_{n1}}{\partial x_1} & \frac{\partial y_{n2}}{\partial x_2} & \cdots & \frac{\partial y_{nn}}{\partial x_n} \end{bmatrix}$$
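Here is a small numerical sketch of my own for the Jacobian, using the arbitrary example $y(x) = [x_1 x_2, \sin x_1, x_2^2]$; the $(i, j)$ entry of the result is $\partial y_i / \partial x_j$, as in the formula above:

```python
import numpy as np

def y(x):
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def jacobian(func, x, h=1e-6):
    """Central-difference Jacobian: column j holds the derivatives w.r.t. x_j."""
    m = len(func(x))
    n = len(x)
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (func(x + e) - func(x - e)) / (2 * h)
    return J

x = np.array([1.0, 2.0])
print(jacobian(y, x))
# Analytic Jacobian at (1, 2): [[2, 1], [cos(1), 0], [0, 4]]
print(np.array([[x[1], x[0]], [np.cos(x[0]), 0.0], [0.0, 2 * x[1]]]))
```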
(3) Derivatives with respect to a matrix


In general, only the derivative of a scalar with respect to a matrix is considered. The derivative of a scalar $y$ with respect to a matrix $X$ is a gradient matrix, which can be expressed as

$$\frac{\partial y}{\partial X} = \begin{bmatrix} \frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{21}} & \cdots & \frac{\partial y}{\partial x_{n1}} \\ \frac{\partial y}{\partial x_{12}} & \frac{\partial y}{\partial x_{22}} & \cdots & \frac{\partial y}{\partial x_{n2}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y}{\partial x_{1n}} & \frac{\partial y}{\partial x_{2n}} & \cdots & \frac{\partial y}{\partial x_{nn}} \end{bmatrix}$$
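As a concrete instance (my own sketch), take the scalar $y = \operatorname{tr}(A^T X) = \sum_{ij} a_{ij} x_{ij}$, whose partial derivative with respect to each entry $x_{ij}$ is simply $a_{ij}$; arranging these partials according to the convention above gives the gradient matrix. A finite-difference check:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
X = rng.standard_normal((3, 3))

def y(X):
    # Scalar function of a matrix: y = trace(A^T X) = sum_ij a_ij * x_ij
    return np.trace(A.T @ X)

def partials_wrt_matrix(func, X, h=1e-6):
    """Matrix of partial derivatives d y / d x_ij, computed entry by entry."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (func(X + E) - func(X - E)) / (2 * h)
    return G

print(np.allclose(partials_wrt_matrix(y, X), A))  # True: each partial dy/dx_ij equals a_ij
```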


(Figure: common matrix derivative forms in machine learning, for reference.)
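One identity that commonly appears in such tables is $\nabla_x (x^T A x) = (A + A^T)x$; a quick numerical verification of it (my own sketch) looks like this:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
x = rng.standard_normal(4)

def f(x):
    return x @ A @ x          # scalar x^T A x

def num_grad(func, x, h=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (func(x + e) - func(x - e)) / (2 * h)
    return g

print(np.allclose(num_grad(f, x), (A + A.T) @ x))  # True
```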



The next post will cover the basic concepts of the Hessian matrix and convex functions. To be continued.
