One, Preface

This is a back propagation (Back Propagation) campaign driven by the error (Error). Its goal is to obtain the globally optimal parameter matrices, so that the multi-layer neural network can then be applied to classification or regression tasks.

The input signal is passed forward until the output produces an error; the error information is passed backward to update the weight matrices. These two sentences describe the flow of information well: the weights are optimized through this two-way flow. It reminds me of Beijing at night, where the traffic never stops and the streets are full of cars coming and going (*॑꒳ ॑*)⋆*.

Why was the back propagation algorithm proposed at all? Can't I just apply gradient descent (Gradient Descent) directly? You have probably asked yourself this question. The answer is no: gradient descent is powerful, but it cannot do everything. Gradient descent handles cases where the derivative can be written out explicitly, or where the error can be computed directly, such as logistic regression (Logistic Regression), which can be viewed as a network without a hidden layer. In a neural network with several hidden layers, the output layer can compute its error directly and update its parameters, but a hidden layer has no directly observable error, so gradient descent cannot be applied to it directly. The error first has to be propagated back to the hidden layer, and only then can gradient descent be applied. Carrying the error from the last layer back to the earlier layers relies on the chain rule (Chain Rule), so the back propagation algorithm can be described as gradient descent combined with the chain rule.
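To make the "no hidden layer" case concrete, here is a minimal sketch of logistic regression trained by plain gradient descent. The toy data, the learning rate, and the function names are all made up for illustration; the point is only that the output error is available directly, so no back propagation is needed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy data (hypothetical): the label is 1 when x1 + x2 > 1.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (0.2, 0.3)]
y = [0, 0, 0, 1, 1, 0]

def mean_loss(w, b):
    """Average cross-entropy loss over the toy data."""
    total = 0.0
    for (x1, x2), t in zip(X, y):
        p = sigmoid(w[0] * x1 + w[1] * x2 + b)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(X)

def train(steps=200, lr=0.5):
    """Gradient descent on a single layer of weights (no hidden layer)."""
    w, b = [0.0, 0.0], 0.0
    n = len(X)
    for _ in range(steps):
        gw, gb = [0.0, 0.0], 0.0
        for (x1, x2), t in zip(X, y):
            # The output error is directly observable: prediction minus label.
            err = sigmoid(w[0] * x1 + w[1] * x2 + b) - t
            gw[0] += err * x1
            gw[1] += err * x2
            gb += err
        w = [w[0] - lr * gw[0] / n, w[1] - lr * gw[1] / n]
        b -= lr * gb / n
    return w, b

loss_before = mean_loss([0.0, 0.0], 0.0)
w, b = train()
loss_after = mean_loss(w, b)
```

Every weight here touches the output node, so its gradient is just the output error times the input. That shortcut is exactly what disappears once a hidden layer is inserted.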

Two, An example

To build an intuitive understanding of back propagation, let's use a number guessing game as an example.

2.1 Two-player guessing

This process resembles a neural network without a hidden layer, such as logistic regression. Little Yellow Hat represents the output-layer node: it receives the input signal on its left and produces the output on its right. Little Blue Cat represents the error and guides the parameters toward better values. Because Little Blue Cat can feed the error back to Little Yellow Hat directly, and only one parameter matrix is connected to Little Yellow Hat, that matrix can be optimized directly from the error (the solid line). After several rounds of iteration, the error is reduced to a minimum.

2.2 Three-player guessing

This process resembles a three-layer neural network with one hidden layer. The little girl represents the hidden-layer node and Little Yellow Hat still represents the output-layer node. The little girl receives the input signal on her left, and the result is output through the hidden-layer node. Little Blue Cat again represents the error and guides the parameters toward better values. Because Little Blue Cat can feed the error back to Little Yellow Hat directly, the parameter matrix on Little Yellow Hat's left can be optimized directly from the error (the solid line). The parameter matrix on the little girl's left cannot be optimized directly, because it gets no direct feedback from Little Blue Cat (the dashed brown line). Thanks to the back propagation algorithm, however, Little Blue Cat's feedback can be passed on to the little girl as an indirect error, so the weight matrix on her left can be updated with that indirect error. After several rounds of iteration, the error is reduced to a minimum.

Three, Complete process

The examples above explain back propagation from an intuitive point of view. Next we walk through the two processes, forward propagation and back propagation, in detail. Before that, let's fix the notation.

3.1 Mathematical notation

3.2 Forward propagation

How is the input-layer signal passed to the hidden layer? Take hidden-layer node c as an example. Standing on node c and looking back (toward the input layer), you can see two arrows pointing at node c, so the information from nodes a and b is passed to c. Each arrow carries a weight, so the input signal of node c is:

Similarly, the input signal of node d is:

Because computers are good at loop-style tasks, we can express this as a matrix multiplication:

The output of the hidden-layer nodes after the nonlinear transformation is therefore:

Similarly, the input signal of the output layer is the weight matrix multiplied by the output of the previous layer:

Likewise, the final output of the output-layer nodes after the nonlinear mapping is:

With the help of the weight matrices, the input signal produces the output of each layer and finally reaches the output layer. The weight matrix therefore acts as a carrier during forward propagation, linking each layer to the next.
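The forward pass described above can be sketched in a few lines of plain Python. The layout (inputs a, b; hidden nodes c, d; outputs e, f) follows the text, but the sigmoid activation, the bias terms, the variable names, and all numeric values are assumptions chosen for illustration, since the original formulas and notation figure are not available.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def forward(x, W1, b1, W2, b2):
    """Forward pass: input (a, b) -> hidden (c, d) -> output (e, f)."""
    z_hidden = [z + b for z, b in zip(matvec(W1, x), b1)]      # input signal of c, d
    a_hidden = [sigmoid(z) for z in z_hidden]                   # output of c, d
    z_out = [z + b for z, b in zip(matvec(W2, a_hidden), b2)]   # input signal of e, f
    a_out = [sigmoid(z) for z in z_out]                         # output of e, f
    return z_hidden, a_hidden, z_out, a_out

# Hypothetical example values, not taken from the article.
x = [0.05, 0.10]                      # signals at input nodes a, b
W1 = [[0.15, 0.20], [0.25, 0.30]]     # row i holds the weights into hidden node i
b1 = [0.35, 0.35]
W2 = [[0.40, 0.45], [0.50, 0.55]]     # row k holds the weights into output node k
b2 = [0.60, 0.60]

z_hidden, a_hidden, z_out, a_out = forward(x, W1, b1, W2, b2)
```

Storing each layer as "one row per receiving node" makes the per-node sums and the matrix product the same computation, which is the observation the text makes about replacing loops with matrix multiplication.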

3.3 Back propagation

Since gradient descent needs a definite error at every layer in order to update the parameters, the next question is how to propagate the output-layer error back to the hidden layer.

The errors of the output-layer and hidden-layer nodes are shown in the figure. The output-layer error is known, so let's analyze the error of the first hidden-layer node, c. We again stand on node c, but this time we look forward (toward the output layer). The two thick blue arrows pointing at node c start from nodes e and f, so the error of node c must come from output-layer nodes e and f.

It is easy to see that output-layer node e has arrows pointing to both hidden-layer nodes c and d, so the error of node e cannot be claimed by node c alone. It has to be shared according to contribution, that is, distributed by weight. The error of node f follows the same principle, so the error of hidden-layer node c is:

Similarly, the error of hidden-layer node d is:

To reduce the workload, we would like to write this as a matrix multiplication:

You'll find this matrix cumbersome; it would be nicer if it could be simplified to look like forward propagation. In fact it can: as long as the proportions between the weights are kept, the denominators can be ignored, and the matrix form becomes:

Look closely and you'll find that this weight matrix is actually the transpose of the forward-propagation weight matrix w, so it can be abbreviated as:

It is easy to see that the output-layer error is passed to the hidden layer with the help of the transposed weight matrix, so the indirect error can be used to update the weight matrix connected to the hidden layer. The weight matrix thus also acts as a carrier during back propagation, except that this time it carries the output error rather than the input signal (we don't produce errors, we are just the porters of errors (っ̯ -｡)).
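A small sketch may help compare the two ways of sending the error backward: the exact weight-proportional split and the simplified form that just multiplies by the transposed weight matrix. The weights and errors below are hypothetical numbers, the helper names are made up, and the split assumes positive weights so the denominators are non-zero.

```python
def transpose(W):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*W)]

def matvec(W, v):
    """Multiply matrix W by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def split_error(W, err):
    """Share each output error among the hidden nodes in proportion to the weights."""
    n_hidden = len(W[0])
    shared = [0.0] * n_hidden
    for row, e in zip(W, err):
        total = sum(row)  # sum of the weights feeding this output node
        for j in range(n_hidden):
            shared[j] += row[j] / total * e
    return shared

# Hypothetical hidden-to-output weights: row 0 feeds e, row 1 feeds f;
# column 0 comes from c, column 1 comes from d.
W2 = [[2.0, 1.0],
      [3.0, 4.0]]
err_out = [0.8, 0.5]  # hypothetical errors at e and f

err_hidden_split = split_error(W2, err_out)           # with denominators
err_hidden_simple = matvec(transpose(W2), err_out)    # denominators dropped
```

The split version conserves the total error, while the transposed version only rescales how much reaches each hidden node. The larger weights still carry the larger share, and the transposed form has the same shape as a forward-propagation step.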

Four, Chain rule

Part three introduced the forward propagation of the input information and the back propagation of the output error. Next, we update the parameters according to the error obtained.

First, update the parameter w11 that connects the hidden layer to the output layer. Before updating, let's derive from back to front until we reach w11:

So the partial derivative of the error with respect to w11 is:

Expanding each factor gives (all values are known):

Similarly, the partial derivative of the error with respect to w12 is:

Likewise, the expanded formula for w12 is:

Similarly, the partial derivative of the error with respect to the bias is:

Substituting the expressions above gives:

Next, update the parameter w11 that connects the input layer to the hidden layer. Before the update we again derive from back to front, until we reach w11 in the first layer (this time the chain is a little longer):

So the partial derivative of the error with respect to the input-layer w11 is:

Expanding each factor gives (a little long (ฅ́˘ฅ̀)):

Similarly, the other three parameters of the input layer can be computed in the same way, so we won't repeat the details here.

Once the partial derivative of every parameter is known, just plug it into the gradient descent formula (not the focus here):
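For reference, the usual form of the gradient descent update is sketched below. The symbols are assumptions, since the article's own formula did not survive: $w$ stands for any weight or bias, $E$ for the error, and $\eta$ for the learning rate.

```latex
w \leftarrow w - \eta \, \frac{\partial E}{\partial w}
```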

With that, the task of updating the parameters of each layer using the chain rule is complete.
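The chain-rule derivations in this part can be checked numerically. The sketch below assumes a 2-2-2 sigmoid network with squared error E = 1/2 * sum((out - target)^2); the network size follows the text, but the loss, the activation, the variable names, and all numbers are illustrative assumptions. It computes the gradient for one weight near the output and one weight near the input by the chain rule, then compares each with a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(W1, b1, W2, b2, x, t):
    """Return (E, hidden outputs, final outputs) for a 2-2-2 sigmoid network."""
    h = [sigmoid(W1[i][0] * x[0] + W1[i][1] * x[1] + b1[i]) for i in range(2)]
    o = [sigmoid(W2[k][0] * h[0] + W2[k][1] * h[1] + b2[k]) for k in range(2)]
    return 0.5 * sum((o[k] - t[k]) ** 2 for k in range(2)), h, o

# Hypothetical values for illustration.
x, t = [0.05, 0.10], [0.01, 0.99]
W1, b1 = [[0.15, 0.20], [0.25, 0.30]], [0.35, 0.35]
W2, b2 = [[0.40, 0.45], [0.50, 0.55]], [0.60, 0.60]

E, h, o = loss(W1, b1, W2, b2, x, t)

# Weight near the output (W2[0][0]): dE/dw = dE/do_e * do_e/dz_e * dz_e/dw.
grad_W2_11 = (o[0] - t[0]) * o[0] * (1 - o[0]) * h[0]
# Bias of output node e: same chain, but dz_e/db = 1.
grad_b2_1 = (o[0] - t[0]) * o[0] * (1 - o[0])

# Weight near the input (W1[0][0]): the chain passes through both e and f.
dE_dh_c = sum((o[k] - t[k]) * o[k] * (1 - o[k]) * W2[k][0] for k in range(2))
grad_W1_11 = dE_dh_c * h[0] * (1 - h[0]) * x[0]

def numeric(layer, i, j, eps=1e-6):
    """Central finite difference of E with respect to one weight."""
    M = W1 if layer == 1 else W2
    old = M[i][j]
    M[i][j] = old + eps
    e_plus = loss(W1, b1, W2, b2, x, t)[0]
    M[i][j] = old - eps
    e_minus = loss(W1, b1, W2, b2, x, t)[0]
    M[i][j] = old  # restore the original weight
    return (e_plus - e_minus) / (2 * eps)

# One gradient descent step on the output-side weight (learning rate is arbitrary).
lr = 0.5
new_W2_11 = W2[0][0] - lr * grad_W2_11
```

Note that `grad_W1_11` recomputes the factor `(o - t) * o * (1 - o)` that already appeared in `grad_W2_11`. That repeated factor is the redundancy the next part removes.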

Five, Introducing delta

Updating the weights with the chain rule is simple, but the formulas are tediously long. If the updates proceed from front to back, from the input layer toward the output layer, the node errors have to be recomputed for every update, which causes unnecessary repeated computation. We can instead reuse what has already been computed by going the other way and updating from back to front: first update the weights of the later edges, then use the intermediate values produced there to update the earlier parameters. This intermediate variable is the delta variable introduced below. It simplifies the formulas and reduces the amount of computation, somewhat in the spirit of dynamic programming.

Let the facts speak. Look closely at the chain-rule derivations in part four for the w11 feeding the output layer, the w11 feeding the hidden layer, and the bias. You will find that the three formulas share common parts, and that the partial derivative used for the output-side parameter reappears in the derivation for the input-side parameter. That is why we introduce delta (in fact, the formula in the red box is the definition of delta).

Let's see how the classic book "Neural Networks and Deep Learning" describes delta: it is the error on the j-th neuron in layer l, defined as the partial derivative of the error with respect to that neuron's weighted input. The mathematical formula is as follows:
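In the notation of Nielsen's book, which is assumed here because the original formula image is missing, $C$ is the cost (the error) and $z^l_j$ is the weighted input to neuron $j$ in layer $l$. The definition described above then reads:

```latex
\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}
```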

So the error of the output layer can be expressed as (the red-box formula above):

The error of the hidden layer can be expressed as (the blue-box formula above):

The weight update is expressed as (the green-box formula above):

And the bias update is expressed as (the red box above):

The 4 formulas above are the legendary four fundamental equations of back propagation from the book "Neural Networks and Deep Learning" (the detailed derivations and proofs can be found in the book):
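For reference, the four equations as they are stated in Nielsen's book are reproduced below in his notation, not necessarily the notation of this article's lost figures: $L$ is the last layer, $a^l$ and $z^l$ are the activations and weighted inputs of layer $l$, $\sigma$ is the activation function, and $\odot$ is the elementwise product.

```latex
\begin{aligned}
\delta^L &= \nabla_a C \odot \sigma'(z^L) && \text{(BP1)} \\
\delta^l &= \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l) && \text{(BP2)} \\
\frac{\partial C}{\partial b^l_j} &= \delta^l_j && \text{(BP3)} \\
\frac{\partial C}{\partial w^l_{jk}} &= a^{l-1}_k \, \delta^l_j && \text{(BP4)}
\end{aligned}
```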

Look closely and you will find that BP1 and BP2 are most powerful in combination: together they give the error of any layer. First use BP1 to compute the output-layer error, then use BP2 to pass it back layer by layer, and nothing can stop you. That is where the name error back propagation comes from. The gradients for the biases b and the weights w then follow from BP3 and BP4.
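Putting BP1 to BP4 together, a minimal sketch of delta-based back propagation for a 2-2-2 network might look as follows. It uses Nielsen's numbering of the equations and assumes sigmoid activations and squared error; the function names and all numeric values are hypothetical. Each delta is computed once and reused, which is the saving described above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Return hidden activations h and output activations o."""
    h = [sigmoid(sum(w * v for w, v in zip(row, x)) + b) for row, b in zip(W1, b1)]
    o = [sigmoid(sum(w * v for w, v in zip(row, h)) + b) for row, b in zip(W2, b2)]
    return h, o

def error(x, t, W1, b1, W2, b2):
    """Squared error E = 1/2 * sum((o - t)^2)."""
    _, o = forward(x, W1, b1, W2, b2)
    return 0.5 * sum((o_k - t_k) ** 2 for o_k, t_k in zip(o, t))

def backprop(x, t, W1, b1, W2, b2):
    """Gradients of E for all parameters, computed from back to front."""
    h, o = forward(x, W1, b1, W2, b2)
    # BP1: output delta = dE/do * sigma'(z), with sigma'(z) = o * (1 - o).
    delta_out = [(o_k - t_k) * o_k * (1 - o_k) for o_k, t_k in zip(o, t)]
    # BP2: hidden delta = (W2^T delta_out) * sigma'(z_hidden), reusing delta_out.
    delta_hid = [
        sum(W2[k][j] * delta_out[k] for k in range(len(delta_out))) * h[j] * (1 - h[j])
        for j in range(len(h))
    ]
    # BP3: the bias gradients are the deltas themselves.
    grad_b2, grad_b1 = list(delta_out), list(delta_hid)
    # BP4: weight gradient = (activation entering the edge) * (delta of the receiving node).
    grad_W2 = [[d * h_j for h_j in h] for d in delta_out]
    grad_W1 = [[d * x_j for x_j in x] for d in delta_hid]
    return grad_W1, grad_b1, grad_W2, grad_b2

def step(params, grads, lr):
    """One gradient descent step on (W1, b1, W2, b2)."""
    W1, b1, W2, b2 = params
    gW1, gb1, gW2, gb2 = grads
    W1 = [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W1, gW1)]
    b1 = [b - lr * g for b, g in zip(b1, gb1)]
    W2 = [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W2, gW2)]
    b2 = [b - lr * g for b, g in zip(b2, gb2)]
    return W1, b1, W2, b2

# Hypothetical training example and starting parameters.
x, t = [0.05, 0.10], [0.01, 0.99]
params = ([[0.15, 0.20], [0.25, 0.30]], [0.35, 0.35],
          [[0.40, 0.45], [0.50, 0.55]], [0.60, 0.60])

err_before = error(x, t, *params)
for _ in range(100):
    params = step(params, backprop(x, t, *params), lr=0.5)
err_after = error(x, t, *params)
```

The learning rate of 0.5 and the 100 iterations are arbitrary choices for this toy example; a real training loop would iterate over many examples rather than a single one.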

With that, we have covered the essentials of back propagation. When I first read material on back propagation, everything felt disconnected: one textbook said this, another blog said that, and I could not really grasp what it all meant. Now my thinking is relatively clear. We first introduced the background of back propagation through the overall process, then used the chain rule to compute the derivatives for the weights and the biases, and finally arrived at the same conclusions as the classic work. I think that makes this write-up fairly detailed and useful for beginners, and I hope it helps you.

For more articles, see my profile page: Zhang Xiaolei


Nielsen M A. Neural Networks and Deep Learning[M]. 2015.

Rashid T. Make Your Own Neural Network[M]. CreateSpace Independent Publishing Platform, 2016.

Author: Zhang Xiaolei. Link: https://www.jianshu.com/p/964345dddb70 Source: Jianshu