1. Preface

Back propagation (Back Propagation) is a process driven by the error (Error). Its purpose is to obtain the optimal global parameter matrices, so that a multi-layer neural network can then be applied to classification or regression tasks.

Propagate the input signal forward until the output produces an error; propagate the error information backward to update the weight matrices. These two sentences describe the flow of information well: the weights are optimized in this two-way flow of information. It reminds me of the night view of Beijing, where the traffic never stops and vehicles come and go (* ॑꒳ ॑* )⋆*.

As for why the back propagation algorithm was proposed at all, can't I just apply gradient descent (Gradient Descent) directly? You must have had this question. The answer is no. Powerful as gradient descent is, it cannot handle everything. Gradient descent works where the derivative of the error can be computed directly, that is, where the error is available at the parameters being updated, as in logistic regression (Logistic Regression), which can be viewed as a network without a hidden layer. For a multi-hidden-layer neural network, the output layer can compute its error directly and update its parameters, but the hidden layers have no explicit error of their own, so gradient descent cannot be applied to them directly. The error must first be propagated back to the hidden layers, and only then can gradient descent be applied. Passing the error from the last layer back to the earlier layers requires the help of the chain rule (Chain Rule), so the back propagation algorithm can be seen as gradient descent applied via the chain rule.

2. An example

To build a more intuitive understanding of back propagation, let's use a number-guessing game as an example.

2.1 A two-person guessing game

This process is analogous to a neural network without a hidden layer, such as logistic regression. The little yellow hat represents the output layer node: it receives the input signal on the left and produces the output on the right. The little blue cat represents the error and guides the parameters to adjust in a better direction. Because the little blue cat can feed the error back to the little yellow hat directly, and only one parameter matrix is directly connected to the little yellow hat, that parameter matrix can be optimized by the error directly (solid line). After enough iterations, the error reaches a minimum.

2.2 A three-person guessing game

This process is analogous to a three-layer neural network with one hidden layer. The little girl represents the hidden layer node, and the little yellow hat still represents the output layer node. The little girl receives the input signal on her left, and the result is output through the output layer node. The little blue cat again represents the error and guides the parameters to adjust in a better direction. Because the little blue cat can feed the error back to the little yellow hat directly, the parameter matrix directly connected to the little yellow hat can be optimized by the error directly (solid line); the parameter matrix directly connected to the little girl, on the other hand, cannot be optimized directly, because it gets no direct feedback from the little blue cat (dashed brown line). With the back propagation algorithm, however, the little blue cat's feedback can be passed on to the little girl as an indirect error, so the weight matrix directly connected to the little girl can be updated with this indirect error. After enough iterations, the error reaches a minimum.

3. The complete process

The example above explains back propagation from an intuitive point of view. Next we introduce the two processes, forward propagation and back propagation, in detail. Before that, let's unify the notation.

3.1 Notation
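The original notation figure is not reproduced here. As a stand-in, the rest of this post can be read with the following assumed notation (a reconstruction from the text, not the original figure):

- a, b: input layer nodes, with input values x_a, x_b
- c, d: hidden layer nodes; e, f: output layer nodes
- w_{ij}: the weight on the edge from node j to node i (for example, w_{ca} connects input node a to hidden node c)
- z_i: the weighted input of node i, and y_i = \sigma(z_i) its output, where \sigma is the nonlinear activation (for example, the sigmoid)
- b_i: the bias of node i
- E: the error (loss) at the output, and E_i: the error attributed to node i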

3.2 Forward propagation

How is the signal from the input layer passed to the hidden layer? Take hidden layer node c as an example. Standing at node c and looking backward (toward the input layer), you can see two arrows pointing at node c, so the information of nodes a and b is passed to c, and each arrow carries a certain weight. For node c, the input signal is:
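(The original equation image is not shown here; under the notation assumed above, with the bias b_c included, it would read:)

z_c = w_{ca}\, x_a + w_{cb}\, x_b + b_c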

Similarly, the input signal of node d is:
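(Reconstructed the same way:)

z_d = w_{da}\, x_a + w_{db}\, x_b + b_d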

Because computers are good at repetitive tasks with loops, we can express this as a matrix multiplication:
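(A plausible matrix form, assuming W^{(1)} collects the input-to-hidden weights:)

\begin{pmatrix} z_c \\ z_d \end{pmatrix} = \begin{pmatrix} w_{ca} & w_{cb} \\ w_{da} & w_{db} \end{pmatrix} \begin{pmatrix} x_a \\ x_b \end{pmatrix} + \begin{pmatrix} b_c \\ b_d \end{pmatrix}, \qquad \text{i.e. } z^{(1)} = W^{(1)} x + b^{(1)}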

Therefore, the output of the hidden layer nodes after the nonlinear transformation is:
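(Assuming the activation \sigma, for example the sigmoid:)

y_c = \sigma(z_c), \qquad y_d = \sigma(z_d)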

Similarly, the input signal of the output layer is the weight matrix multiplied by the output of the previous layer:
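(Reconstructed in the same way, with W^{(2)} collecting the hidden-to-output weights:)

\begin{pmatrix} z_e \\ z_f \end{pmatrix} = \begin{pmatrix} w_{ec} & w_{ed} \\ w_{fc} & w_{fd} \end{pmatrix} \begin{pmatrix} y_c \\ y_d \end{pmatrix} + \begin{pmatrix} b_e \\ b_f \end{pmatrix}, \qquad \text{i.e. } z^{(2)} = W^{(2)} y^{(1)} + b^{(2)}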

Likewise, the final output of the output layer nodes after the nonlinear mapping is:
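(Under the same assumptions:)

y_e = \sigma(z_e), \qquad y_f = \sigma(z_f)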

With the help of the weight matrices, the input signal obtains the output of each layer and finally reaches the output layer. In other words, the weight matrices play the role of a transporter in forward signal propagation, serving as the link between one layer and the next.
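To make the forward pass concrete, here is a minimal NumPy sketch of the 2-2-2 network assumed above (sigmoid activations; the variable names and initial values are illustrative, not taken from the original post):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 2-2-2 network: inputs (a, b) -> hidden (c, d) -> outputs (e, f).
# W1[i, j] is the weight from input node j to hidden node i; W2 likewise.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)   # input  -> hidden
W2, b2 = rng.normal(size=(2, 2)), np.zeros(2)   # hidden -> output

def forward(x):
    z_hidden = W1 @ x + b1          # weighted inputs of nodes c, d
    y_hidden = sigmoid(z_hidden)    # outputs of nodes c, d
    z_out = W2 @ y_hidden + b2      # weighted inputs of nodes e, f
    y_out = sigmoid(z_out)          # final outputs of nodes e, f
    return z_hidden, y_hidden, z_out, y_out

x = np.array([0.05, 0.10])          # example input signal (x_a, x_b)
print(forward(x)[-1])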

3.3 Back propagation

Since gradient descent needs a definite error at each layer in order to update its parameters, the next question is how to propagate the error of the output layer back to the hidden layer.

The errors of the output layer and hidden layer nodes are shown in the figure. The output layer error is known, so next let's analyse the error of the first hidden layer node c. Stand at node c again, but this time look forward (toward the output layer): the two thick blue arrows pointing at node c start from node e and node f, so the error of node c must come from the output layer nodes e and f.

It is not hard to see that output layer node e has arrows pointing to both hidden layer nodes c and d, so the error of node e cannot be claimed by node c alone; it has to obey the principle of distribution according to work (distribution by weight). The error of node f obeys the same principle, so the error of hidden layer node c is:
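(The original formula image is missing; distributing each output error in proportion to the connecting weights, it would read:)

E_c = \frac{w_{ec}}{w_{ec}+w_{ed}}\, E_e + \frac{w_{fc}}{w_{fc}+w_{fd}}\, E_f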

Similarly, the error of hidden layer node d is:
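(Reconstructed the same way:)

E_d = \frac{w_{ed}}{w_{ec}+w_{ed}}\, E_e + \frac{w_{fd}}{w_{fc}+w_{fd}}\, E_f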

To reduce the amount of work, we would like to write this in the form of a matrix multiplication:
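(The corresponding matrix form would be:)

\begin{pmatrix} E_c \\ E_d \end{pmatrix} = \begin{pmatrix} \frac{w_{ec}}{w_{ec}+w_{ed}} & \frac{w_{fc}}{w_{fc}+w_{fd}} \\ \frac{w_{ed}}{w_{ec}+w_{ed}} & \frac{w_{fd}}{w_{fc}+w_{fd}} \end{pmatrix} \begin{pmatrix} E_e \\ E_f \end{pmatrix}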

You will find this matrix cumbersome; it would be nicer if it could be simplified to look like the forward propagation formula. In fact we can do that: as long as the proportions are preserved, the denominators can be ignored, so the matrix form becomes:
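(Ignoring the denominators:)

\begin{pmatrix} E_c \\ E_d \end{pmatrix} = \begin{pmatrix} w_{ec} & w_{fc} \\ w_{ed} & w_{fd} \end{pmatrix} \begin{pmatrix} E_e \\ E_f \end{pmatrix}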

Look closely and you will find that this weight matrix is in fact the transpose of the forward propagation weight matrix w, so the formula can be abbreviated as:
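(In the assumed notation, with W^{(2)} the forward weight matrix of the output layer:)

\begin{pmatrix} E_c \\ E_d \end{pmatrix} = \left( W^{(2)} \right)^{T} \begin{pmatrix} E_e \\ E_f \end{pmatrix}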

It is not hard to see that, with the help of the transposed weight matrix, the output layer error is passed to the hidden layer, and this indirect error can then be used to update the weight matrix connected to the hidden layer. In other words, the weight matrices also play the role of a transporter during back propagation, except that this time the cargo is the output error rather than the input signal (we don't produce errors, we are merely porters of errors (っ̯ -｡)).

4. The chain rule

The third part introduced the forward propagation of the input information and the backward propagation of the output error. Next, the parameters are updated according to the errors obtained.

First update the parameter w11 of the output layer. Before updating, let's trace the dependency from back to front until we reach w11:

So the derivative of the error with respect to w11 is:
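(The original chain-rule image is missing. If the output layer's w11 is taken to be the weight w_{ec} from hidden node c to output node e, and the loss is assumed to be the squared error E = \frac{1}{2}\left[(y_e - t_e)^2 + (y_f - t_f)^2\right] with targets t_e, t_f, the chain would read:)

\frac{\partial E}{\partial w_{ec}} = \frac{\partial E}{\partial y_e} \cdot \frac{\partial y_e}{\partial z_e} \cdot \frac{\partial z_e}{\partial w_{ec}} = (y_e - t_e)\,\sigma'(z_e)\,y_c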

Substituting the known values, the derivative is evaluated as follows (all quantities on the right-hand side are known from the forward pass):

Similarly, the partial derivative of the error with respect to w12 is:
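(Under the same assumptions, with w12 taken to be w_{ed}:)

\frac{\partial E}{\partial w_{ed}} = (y_e - t_e)\,\sigma'(z_e)\,y_d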

Likewise, the evaluation formula for w12 is obtained by substituting the known values:

Similarly, the derivative of the error with respect to the bias is as follows:

Substituting the formula above gives:
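(For the bias of output node e, the same chain applies except that \partial z_e / \partial b_e = 1, so under the assumed squared-error loss:)

\frac{\partial E}{\partial b_e} = (y_e - t_e)\,\sigma'(z_e)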

Next, update the parameter w11 of the input layer. Before updating, we again trace the dependency from back to front, until we reach the first layer's w11 (but this time the chain is a little longer):

So the derivative of the error with respect to the input layer's w11 is as follows (it's a little long (ฅ́˘ฅ̀)):
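(With the input layer's w11 taken to be w_{ca}, the weight from input a to hidden node c, and noting that y_c influences both outputs, the chain would read:)

\frac{\partial E}{\partial w_{ca}} = \left[ (y_e - t_e)\,\sigma'(z_e)\,w_{ec} + (y_f - t_f)\,\sigma'(z_f)\,w_{fc} \right] \sigma'(z_c)\,x_a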

Similarly, the other three parameters of the input layer can be computed in the same way, so they are not spelled out here.

Once the partial derivative of each parameter is known, just plug it into the gradient descent formula (not the focus here):
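(With learning rate \eta, the usual update would be:)

w \leftarrow w - \eta \,\frac{\partial E}{\partial w}, \qquad b \leftarrow b - \eta \,\frac{\partial E}{\partial b}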

With this, the task of updating the parameters of every layer via the chain rule is complete.

5. Introducing delta

Using the chain rule to update the weights is easy to follow, but the expressions get very long. If the update is viewed as running from the front of the network to the back (from the input layer to the output layer), the error of each node has to be recomputed every time, which causes unnecessary repeated computation. In fact we can reuse the quantities that have already been computed by looking at it the other way around and updating from back to front: update the weights of a layer first, and then use the intermediate values produced there to update the earlier parameters. The intermediate variable introduced below is the delta variable; it both simplifies the formulas and reduces the amount of computation, a bit like dynamic programming.

Talk is cheap, so let's look again at the three chain-rule derivations in part four: the error with respect to the output layer's w11, with respect to the input layer's w11, and with respect to the bias. You will find that the three formulas share a common part, and that the partial expression from the output layer parameters is reused when deriving the hidden layer parameters. That is exactly why delta is introduced (in fact, the formula in the red box is the definition of delta).

Let's look at how the classic book 《Neural Networks and Deep Learning》 describes delta: the error of the j-th neuron in layer l is defined as the partial derivative of the error with respect to that neuron's weighted input. The mathematical formula is as follows:
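(In the book's notation, with C the cost and z_j^l the weighted input of neuron j in layer l:)

\delta_j^{\,l} = \frac{\partial C}{\partial z_j^{\,l}}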

So the error of the output layer can be expressed as (the red box formula above):
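(This is equation BP1 in the book, written here in its notation:)

\delta^{L} = \nabla_a C \odot \sigma'(z^{L})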

The error of the hidden layer can be expressed as (the blue box formula above):
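(Equation BP2 in the book: the next layer's delta is carried back through the transposed weight matrix:)

\delta^{l} = \left( (w^{l+1})^{T} \delta^{l+1} \right) \odot \sigma'(z^{l})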

Meanwhile, the weight update can be expressed as (the green box formula above):
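(In the book's notation, the gradient with respect to a weight:)

\frac{\partial C}{\partial w_{jk}^{\,l}} = a_k^{\,l-1}\, \delta_j^{\,l}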

And the update of the bias can be expressed as (the red box above):
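(And the gradient with respect to a bias:)

\frac{\partial C}{\partial b_j^{\,l}} = \delta_j^{\,l}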

The four formulas above are the four famous back propagation equations in 《Neural Networks and Deep Learning》 (the detailed derivations can be found in the book):

Look closely and you will find that BP1 and BP2 combined can play the greatest role: the error of any layer can be computed. Just use BP1 to compute the output layer error first, and then use BP2 to pass it back layer by layer. That is exactly why it is called the error back propagation algorithm. The gradients for the weights w and the biases b can then be computed with the BP3 and BP4 formulas.
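To tie the four equations together, here is a minimal NumPy sketch of one back propagation step for the 2-2-2 sigmoid network assumed earlier, using a squared-error loss (the loss choice, variable names, initial values, and learning rate are assumptions for illustration, not code from the original post):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_step(x, t, W1, b1, W2, b2, eta=0.5):
    # Forward pass (section 3.2).
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)            # hidden layer outputs y_c, y_d
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)            # network outputs y_e, y_f
    # BP1: output layer delta = dC/da * sigma'(z), for C = 0.5 * ||a2 - t||^2.
    delta2 = (a2 - t) * sigmoid_prime(z2)
    # BP2: hidden layer delta carried back through the transposed weights.
    delta1 = (W2.T @ delta2) * sigmoid_prime(z1)
    # BP3 / BP4: gradients for biases and weights.
    grad_b2, grad_W2 = delta2, np.outer(delta2, a1)
    grad_b1, grad_W1 = delta1, np.outer(delta1, x)
    # Plain gradient descent update.
    return (W1 - eta * grad_W1, b1 - eta * grad_b1,
            W2 - eta * grad_W2, b2 - eta * grad_b2)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(2, 2)), np.zeros(2)
x, t = np.array([0.05, 0.10]), np.array([0.01, 0.99])   # example input and target
W1, b1, W2, b2 = backprop_step(x, t, W1, b1, W2, b2)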

This concludes our introduction to back propagation. When I first read materials on back propagation, they always felt disconnected: one textbook says it this way, another blog says it that way, and I could never quite grasp the essence. By now my understanding is relatively clear. In this post we first introduced the background of back propagation through the overall process, then computed the derivatives of the weights and biases with the chain rule, and finally arrived at the same conclusions as the classic book. I think this is fairly detailed and should be useful for beginners; I hope it helps you.

For more articles, see my Zhihu column: Zhang Xiaolei.