Derivative and gradient

Derivative: The derivative of a univariate function at a point describes the rate of change of the function near that point.

$$f'(a) = \lim_{h \to 0} \frac{f(a+h) - f(a)}{h}$$


Gradient: The gradient is the generalization of the derivative to multivariate functions.

* First derivative: the gradient:

$$\nabla f(X) = \frac{\partial f(X)}{\partial X} = \begin{bmatrix} \frac{\partial f(X)}{\partial x_1} \\ \frac{\partial f(X)}{\partial x_2} \\ \vdots \\ \frac{\partial f(X)}{\partial x_n} \end{bmatrix}$$


* Second derivative: the Hessian matrix:
$$H(X) = \nabla^2 f(X) = \begin{bmatrix} \frac{\partial^2 f(X)}{\partial x_1^2} & \frac{\partial^2 f(X)}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(X)}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f(X)}{\partial x_2 \partial x_1} & \frac{\partial^2 f(X)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(X)}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(X)}{\partial x_n \partial x_1} & \frac{\partial^2 f(X)}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(X)}{\partial x_n^2} \end{bmatrix}$$
The first and second derivatives are often written as $f'(x)$ and $f''(x)$.
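As a concrete companion to these definitions, here is a minimal finite-difference sketch (my own illustration, not from any particular library) that approximates the gradient and Hessian of a small two-variable function; the test function and step sizes are arbitrary choices:

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-5):
    """Central-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

def numerical_hessian(f, x, eps=1e-4):
    """Finite-difference approximation of the Hessian of f at x."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        ei = np.zeros(n); ei[i] = eps
        for j in range(n):
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps**2)
    return H

# Example: f(x) = x1^2 + 3*x1*x2, so grad = [2*x1 + 3*x2, 3*x1], Hessian = [[2, 3], [3, 0]]
f = lambda x: x[0]**2 + 3 * x[0] * x[1]
x0 = np.array([1.0, 2.0])
print(numerical_gradient(f, x0))   # ≈ [8., 3.]
print(numerical_hessian(f, x0))    # ≈ [[2., 3.], [3., 0.]]
```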

Taylor expansion: The Taylor expansion of a univariate function:

$$f(x_k + \delta) \approx f(x_k) + f'(x_k)\delta + \frac{1}{2}f''(x_k)\delta^2 + \cdots + \frac{1}{n!}f^{(n)}(x_k)\delta^n$$
The Taylor expansion of a multivariate function (keeping only the first three terms):
$$f(x_k + \delta) \approx f(x_k) + \nabla^T f(x_k)\,\delta + \frac{1}{2}\delta^T \nabla^2 f(x_k)\,\delta$$
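To see how good the second-order approximation is, here is a small numerical check (an illustrative sketch with a hand-picked function, not part of the original post):

```python
import numpy as np

# Minimal check of the second-order Taylor approximation
# f(x + d) ≈ f(x) + grad(x)·d + 0.5 * d^T H(x) d
# for the (arbitrarily chosen) function f(x) = exp(x1) + x1*x2^2.
f = lambda x: np.exp(x[0]) + x[0] * x[1]**2
grad = lambda x: np.array([np.exp(x[0]) + x[1]**2, 2 * x[0] * x[1]])
hess = lambda x: np.array([[np.exp(x[0]), 2 * x[1]],
                           [2 * x[1],     2 * x[0]]])

xk = np.array([0.5, 1.0])
delta = np.array([0.01, -0.02])

exact  = f(xk + delta)
approx = f(xk) + grad(xk) @ delta + 0.5 * delta @ hess(xk) @ delta
print(exact, approx)   # the two values agree to several decimal places
```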


If $\nabla f(x_k) = 0$, then $x_k$ is called a stationary point. For a univariate function, such a point is a local extremum (a local maximum or minimum) whenever $f''(x_k) \neq 0$; if $f$ is a convex function, the stationary point is a global minimum. Convex functions are briefly introduced in the next section.

For a multivariate function, if $\nabla^2 f(x_k) \succ 0$ (positive definite, i.e. all eigenvalues are positive), then the third term of the expansion above is positive and $x_k$ is a strict local minimum (conversely, if $\nabla^2 f(x_k) \prec 0$ is negative definite, $x_k$ is a strict local maximum). The more complicated case arises when the Hessian has both positive and negative eigenvalues; then the sign of the third term is indeterminate and $x_k$ is a saddle point: a local minimum along some directions and a local maximum along others. Saddle points are one of the core difficulties in neural network training; I will cover them in a later post. For now, back to basics.
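The classification above reduces to inspecting the signs of the Hessian's eigenvalues at a stationary point; the following sketch (the helper name classify_stationary_point is my own) automates that check:

```python
import numpy as np

def classify_stationary_point(H, tol=1e-8):
    """Classify a stationary point from the eigenvalues of its Hessian."""
    eigvals = np.linalg.eigvalsh(H)          # H is symmetric, use eigvalsh
    if np.all(eigvals > tol):
        return "strict local minimum"        # positive definite
    if np.all(eigvals < -tol):
        return "strict local maximum"        # negative definite
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"                # mixed signs
    return "indeterminate (semi-definite)"   # zero eigenvalues: need higher-order info

# f(x, y) = x^2 - y^2 has a stationary point at the origin with Hessian diag(2, -2)
print(classify_stationary_point(np.diag([2.0, -2.0])))   # saddle point
print(classify_stationary_point(np.diag([2.0,  4.0])))   # strict local minimum
```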

Taylor expansion lies at the core of many mathematical problems, so let's dwell on it a little:
Question: Why do we move along the gradient direction in optimization? Why is the gradient direction the direction of fastest change?

From the first two terms of the Taylor expansion, $f(x_k + \delta) \approx f(x_k) + \nabla^T f(x_k)\,\delta$, we see that when $\delta$ is a vector of fixed norm but arbitrary direction, $f(x_k + \delta) - f(x_k) \approx \nabla^T f(x_k)\,\delta$. Since $\nabla^T f(x_k)\,\delta = \|\nabla f(x_k)\| \cdot \|\delta\| \cos(\theta)$, the change is largest in magnitude when $\cos(\theta) = \pm 1$, i.e. when $\delta$ points along the gradient direction or the negative gradient direction. For minimization this gives the gradient descent method: $\delta$ takes the negative gradient direction, which makes $f(x)$ decrease fastest.
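This negative-gradient argument is exactly the update rule of gradient descent. Below is a minimal sketch on an arbitrarily chosen quadratic objective (illustrative, not a production implementation):

```python
import numpy as np

# Minimal gradient descent sketch: step along the negative gradient direction.
# The quadratic objective and step size below are illustrative choices.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])                 # positive definite, so the minimum is x = 0
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([2.0, -1.5])                  # arbitrary starting point
lr = 0.1                                   # fixed step size (learning rate)
for _ in range(100):
    x = x - lr * grad(x)                   # delta = -lr * gradient

print(x, f(x))                             # x ≈ [0, 0], f(x) ≈ 0
```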

Summary of matrix derivatives

(1) Derivatives with respect to a scalar

* Derivative of a scalar $y$ with respect to a scalar $x$:
$\frac{\partial y}{\partial x}$
* Derivative of a vector with respect to a scalar $x$:
The derivative of the vector $\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$ with respect to the scalar $x$ is obtained by differentiating each element of $\mathbf{y}$ with respect to $x$, and can be written as
$$\frac{\partial \mathbf{y}}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x} \\ \frac{\partial y_2}{\partial x} \\ \vdots \\ \frac{\partial y_n}{\partial x} \end{bmatrix}$$
* Derivative of a matrix with respect to a scalar $x$:
Differentiating a matrix with respect to a scalar is analogous to differentiating a vector with respect to a scalar: each element of the matrix is differentiated with respect to $x$ (a numerical check follows below):
$$\frac{\partial Y}{\partial x} = \begin{bmatrix} \frac{\partial y_{11}}{\partial x} & \frac{\partial y_{12}}{\partial x} & \cdots & \frac{\partial y_{1n}}{\partial x} \\ \frac{\partial y_{21}}{\partial x} & \frac{\partial y_{22}}{\partial x} & \cdots & \frac{\partial y_{2n}}{\partial x} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_{n1}}{\partial x} & \frac{\partial y_{n2}}{\partial x} & \cdots & \frac{\partial y_{nn}}{\partial x} \end{bmatrix}$$
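Since each element is differentiated independently, the derivative of a matrix with respect to a scalar is easy to verify numerically. The sketch below is my own illustration with an arbitrarily chosen matrix-valued function Y(x):

```python
import numpy as np

# Element-wise finite-difference derivative of a matrix-valued function Y(x)
# with respect to a scalar x (illustrative function chosen here).
def Y(x):
    return np.array([[x**2,      np.sin(x)],
                     [np.exp(x), 3.0 * x ]])

def dY_dx(x, eps=1e-6):
    return (Y(x + eps) - Y(x - eps)) / (2 * eps)   # differentiate every element

x0 = 1.0
print(dY_dx(x0))
# analytic answer: [[2*x, cos(x)], [exp(x), 3]] evaluated at x = 1
```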
(2) Derivatives with respect to a vector

* Derivative of a scalar with respect to a vector $\mathbf{x}$:
The derivative of a scalar $y$ with respect to the vector $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$ can be expressed as
$$\frac{\partial y}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y}{\partial x_1} & \frac{\partial y}{\partial x_2} & \cdots & \frac{\partial y}{\partial x_n} \end{bmatrix}$$
* Derivative of a vector with respect to a vector $\mathbf{x}$:
The derivative of the vector-valued function $\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$ (i.e. a vector whose elements are functions) with respect to $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$ is
$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_n}{\partial x_1} & \frac{\partial y_n}{\partial x_2} & \cdots & \frac{\partial y_n}{\partial x_n} \end{bmatrix}$$
The resulting matrix $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$ is called the Jacobian matrix (see the numerical sketch after this list).

* Derivative of a matrix with respect to a vector:
The derivative of the matrix $Y = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1n} \\ y_{21} & y_{22} & \cdots & y_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{nn} \end{bmatrix}$ with respect to the vector $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$ is the most complicated case in this summary and is written as
$$\frac{\partial Y}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_{11}}{\partial x_1} & \frac{\partial y_{12}}{\partial x_2} & \cdots & \frac{\partial y_{1n}}{\partial x_n} \\ \frac{\partial y_{21}}{\partial x_1} & \frac{\partial y_{22}}{\partial x_2} & \cdots & \frac{\partial y_{2n}}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_{n1}}{\partial x_1} & \frac{\partial y_{n2}}{\partial x_2} & \cdots & \frac{\partial y_{nn}}{\partial x_n} \end{bmatrix}$$
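As a quick check of the Jacobian formula above, here is a small finite-difference sketch (my own illustration; the helper name numerical_jacobian and the example mapping are arbitrary choices):

```python
import numpy as np

def numerical_jacobian(g, x, eps=1e-6):
    """Finite-difference Jacobian of a vector-valued function g at x."""
    y0 = g(x)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (g(x + e) - g(x - e)) / (2 * eps)   # column j holds ∂y/∂x_j
    return J

# Example mapping: y1 = x1*x2, y2 = x1 + x2^2; Jacobian = [[x2, x1], [1, 2*x2]]
g = lambda x: np.array([x[0] * x[1], x[0] + x[1]**2])
print(numerical_jacobian(g, np.array([2.0, 3.0])))   # ≈ [[3., 2.], [1., 6.]]
```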
(3) Derivatives with respect to a matrix


In general, only the derivative of a scalar with respect to a matrix is considered. The derivative of a scalar $y$ with respect to a matrix $X$ is a gradient matrix, which can be expressed by the following formula:

$$\frac{\partial y}{\partial X} = \begin{bmatrix} \frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{21}} & \cdots & \frac{\partial y}{\partial x_{n1}} \\ \frac{\partial y}{\partial x_{12}} & \frac{\partial y}{\partial x_{22}} & \cdots & \frac{\partial y}{\partial x_{n2}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y}{\partial x_{1n}} & \frac{\partial y}{\partial x_{2n}} & \cdots & \frac{\partial y}{\partial x_{nn}} \end{bmatrix}$$
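As a sanity check of the scalar-by-matrix case, the sketch below is my own illustration; it stores $\partial y / \partial x_{ij}$ at position $(i, j)$ (i.e. the transpose of the arrangement in the formula above) and uses $y = \operatorname{tr}(AX)$, whose element-wise derivative is $A^T$:

```python
import numpy as np

def scalar_grad_wrt_matrix(f, X, eps=1e-6):
    """Entry (i, j) holds ∂f/∂x_ij, i.e. the same shape as X."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

# Example scalar function: y = tr(A X); its element-wise derivative is A^T.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
f = lambda X: np.trace(A @ X)
X0 = np.random.rand(2, 2)
print(scalar_grad_wrt_matrix(f, X0))   # ≈ A.T = [[1., 3.], [2., 4.]]
```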


The figure below shows common matrix derivative forms used in machine learning, for reference.



The next post covers the basic concepts of the Hessian matrix and convex functions. To be continued.
