Derivative and gradient

Derivative: the derivative of a univariate function at a point describes the rate of change of the function near that point.


Gradient: the counterpart of the derivative for a multivariate function is the gradient.

* First derivative: the gradient
∇f(x) = [∂f/∂x1, ∂f/∂x2, …, ∂f/∂xn]ᵀ
* Second derivative: the Hessian matrix
∇²f(x), the n×n matrix whose (i, j) entry is ∂²f/∂xi∂xj
For a univariate function, the first and second derivatives are usually written f′(x) and f″(x).
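These definitions are easy to check numerically. Below is a minimal finite-difference sketch; the test function, point, and step sizes are my own choices, not from the post:

```python
import numpy as np

def f(x):
    # example function (my choice): f(x) = x1^2 + 3*x1*x2
    return x[0]**2 + 3 * x[0] * x[1]

def num_gradient(f, x, h=1e-5):
    """Central-difference approximation of the gradient ∇f(x)."""
    g = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def num_hessian(f, x, h=1e-4):
    """Finite-difference approximation of the Hessian ∇²f(x)."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        ei = np.zeros(n); ei[i] = h
        for j in range(n):
            ej = np.zeros(n); ej[j] = h
            # mixed central difference for ∂²f/∂xi∂xj
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h**2)
    return H

x = np.array([1.0, 2.0])
print(num_gradient(f, x))   # analytic gradient: [2*x1 + 3*x2, 3*x1] = [8, 3]
print(num_hessian(f, x))    # analytic Hessian: [[2, 3], [3, 0]]
```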

Taylor expansion: for a univariate function,
f(x) = f(xk) + f′(xk)(x − xk) + (1/2)f″(xk)(x − xk)² + …

Taylor expansion of a multivariate function (first three terms only):
f(x) ≈ f(xk) + ∇f(xk)ᵀ(x − xk) + (1/2)(x − xk)ᵀ∇²f(xk)(x − xk)
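The quality of the second-order Taylor approximation can be checked numerically. Here is a sketch with a function, expansion point, and step of my own choosing; the error should shrink with the cube of the step length:

```python
import numpy as np

# second-order Taylor expansion of f(x) = exp(x1) + x1*x2 around xk,
# using the analytic gradient and Hessian (example function is my own)
def f(x):
    return np.exp(x[0]) + x[0] * x[1]

def grad(x):
    return np.array([np.exp(x[0]) + x[1], x[0]])

def hess(x):
    return np.array([[np.exp(x[0]), 1.0],
                     [1.0,          0.0]])

xk = np.array([0.5, 1.0])
d = np.array([0.01, -0.02])          # small step, d = x - xk

# f(xk) + ∇f(xk)ᵀ d + (1/2) dᵀ ∇²f(xk) d
taylor2 = f(xk) + grad(xk) @ d + 0.5 * d @ hess(xk) @ d
print(abs(f(xk + d) - taylor2))      # remainder is O(||d||³), i.e. tiny
```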

If ∇f(xk) = 0, then xk is called a "stationary point". For a univariate function, the sign of f″(xk) tells us whether the point is a local minimum or a local maximum; if f is a convex function, a stationary point is a global minimum. Convex functions are briefly introduced in the next section.

For a multivariate function, if ∇²f(xk) ≻ 0 (positive definite, i.e. all eigenvalues positive), then the third term of the expansion above is positive for any small step, so xk is a strict local minimum. Conversely, if ∇²f(xk) ≺ 0 (negative definite), xk is a strict local maximum. The more complicated case is when the Hessian has both positive and negative eigenvalues: then xk is a saddle point, i.e. a local minimum along some directions and a local maximum along others. Saddle points are one of the core difficulties in neural network training; I will write about them in a later post. For now, back to basics.
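The eigenvalue test above is straightforward to code. A minimal sketch (the example Hessian comes from f(x, y) = x² − y², which has a saddle at the origin):

```python
import numpy as np

def classify(H, tol=1e-10):
    """Classify a stationary point from the eigenvalues of its Hessian."""
    w = np.linalg.eigvalsh(H)             # H is symmetric, so eigvalsh applies
    if np.all(w > tol):
        return "strict local minimum"     # positive definite
    if np.all(w < -tol):
        return "strict local maximum"     # negative definite
    if np.any(w > tol) and np.any(w < -tol):
        return "saddle point"             # eigenvalues of mixed sign
    return "indeterminate"                # some (near-)zero eigenvalues

# f(x, y) = x^2 - y^2: Hessian [[2, 0], [0, -2]], stationary point at origin
H = np.array([[2.0, 0.0], [0.0, -2.0]])
print(classify(H))   # saddle point
```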

Taylor expansion is at the heart of many optimization problems, so let's dwell on it a little longer.

Question: why do optimization methods move along the gradient direction? Why is the gradient direction the direction of fastest change?

From the first two terms of the Taylor expansion, f(xk + δ) ≈ f(xk) + ∇f(xk)ᵀδ. If δ is a vector with a fixed modulus but a free direction, then f(xk + δ) − f(xk) ≈ ∇f(xk)ᵀδ = ||∇f(xk)|| · ||δ|| · cos(θ), which is maximized when cos(θ) = 1, i.e. when δ points along the gradient direction (and minimized along the negative gradient direction). For minimization this gives the gradient descent method: take δ in the negative gradient direction, so that f(x) decreases fastest.
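The argument above is exactly the gradient descent update rule. A minimal sketch on a simple convex quadratic (the function and step size are my own choices, not from the post):

```python
import numpy as np

# gradient descent: at each step take δ in the negative gradient direction,
# the direction along which f decreases fastest
def f(x):
    return x[0]**2 + 10 * x[1]**2        # simple convex example (my choice)

def grad(x):
    return np.array([2 * x[0], 20 * x[1]])

x = np.array([3.0, -2.0])
lr = 0.04                                # step size, chosen small enough to converge
for _ in range(200):
    x = x - lr * grad(x)                 # δ = -lr * ∇f(x)

print(x, f(x))                           # x converges toward the minimizer (0, 0)
```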

Matrix derivation summary

(1) Derivatives with respect to a scalar

* Scalar with respect to scalar x: the ordinary derivative dy/dx.
* Vector with respect to scalar x:
The derivative of the vector y = [y1, y2, …, yn]ᵀ with respect to the scalar x is obtained by differentiating each element of y with respect to x:
∂y/∂x = [∂y1/∂x, ∂y2/∂x, …, ∂yn/∂x]ᵀ
* Matrix with respect to scalar x:
The derivative of a matrix with respect to a scalar is analogous to the vector case: each element of the matrix is differentiated with respect to x.
(2) Derivatives with respect to a vector

* Scalar with respect to vector x:
The derivative of a scalar y with respect to the vector x = [x1, x2, …, xn]ᵀ can be written as
∂y/∂x = [∂y/∂x1, ∂y/∂x2, …, ∂y/∂xn]
* Vector with respect to vector x:
For a vector function y = [y1, y2, …, yn]ᵀ (that is, a vector whose entries are functions of x) and x = [x1, x2, …, xm]ᵀ, the derivative ∂y/∂x is the n×m matrix whose (i, j) entry is ∂yi/∂xj. This matrix is called the Jacobian matrix.
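The Jacobian can be approximated column by column with finite differences. A sketch, using a vector function of my own choosing (R² → R³):

```python
import numpy as np

def y(x):
    # example vector function R^2 -> R^3 (my own choice)
    return np.array([x[0]**2, x[0] * x[1], np.sin(x[1])])

def jacobian(fun, x, h=1e-6):
    """Numerical Jacobian: row i, column j holds ∂y_i/∂x_j."""
    y0 = fun(x)
    J = np.zeros((len(y0), len(x)))
    for j in range(len(x)):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (fun(x + e) - fun(x - e)) / (2 * h)   # central difference
    return J

x = np.array([1.0, 2.0])
print(jacobian(y, x))
# analytic Jacobian: [[2*x1, 0], [x2, x1], [0, cos(x2)]]
```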

* Matrix with respect to vector x:
The derivative of a matrix Y = [yij] (an n×n matrix of functions y11, …, ynn) with respect to the vector x = [x1, x2, …, xn]ᵀ is the most complicated case in matrix calculus: the result is a third-order array, usually written by collecting the matrices ∂Y/∂xk for each component xk.

(3) Derivatives with respect to a matrix

In general only the derivative of a scalar with respect to a matrix is considered. The derivative of a scalar y with respect to a matrix X is a gradient matrix of the same shape as X, whose (i, j) entry is ∂y/∂xij:

∂y/∂X = [∂y/∂xij]
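A standard identity of this kind is: for y = aᵀXb, the gradient matrix is ∂y/∂X = abᵀ. Below is a sketch that checks it element by element with finite differences (the vectors, matrix, and seed are arbitrary choices of mine):

```python
import numpy as np

# for y = aᵀ X b, a standard identity gives the gradient matrix ∂y/∂X = a bᵀ;
# verify it entry-wise with central differences
rng = np.random.default_rng(0)
a = rng.standard_normal(3)
b = rng.standard_normal(4)
X = rng.standard_normal((3, 4))

def y(X):
    return a @ X @ b

num = np.zeros_like(X)
h = 1e-6
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = h
        num[i, j] = (y(X + E) - y(X - E)) / (2 * h)   # ∂y/∂x_ij

print(np.allclose(num, np.outer(a, b), atol=1e-6))
```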


The original post includes a figure listing common matrix-derivative identities used in machine learning, for reference.

The next post covers the basic concepts of the Hessian matrix and convex functions. To be continued.
