One. The origin of backpropagation

Before we start with deep learning, we need a brief explanation of ANNs (artificial neural networks) and the BP (backpropagation) algorithm.

As for the structure of an ANN, I won't say much; there are plenty of learning materials online. The main thing is to get clear on a few terms:

input layer / input neurons, output layer / output neurons, hidden layer / hidden neurons, weights, biases, activation function

Next we need to know how an ANN is trained. Suppose the network has already been set up. In every application problem (no matter how the network structure or the training method varies) our goal is the same: the weights and biases of the network should end up at their best values, values that let us obtain the ideal output from the input. So the problem becomes y = f(x, w, b), where x is the input, w the weights, and b the biases; each of these can be multiple quantities, like x1, x2, x3, … The f() is our network, and it can certainly be represented as a function. We don't need to know what kind of function f(x) is. Since school we have assumed that a function must be writable in closed form, like f(x) = sin(x), but let's drop that misconception: we only need to know that a given set of w and b determines a function f(x), and that this function lets us compute a reasonable y.

The final goal is to try different values of w and b so that in the end y = f(x) gets as close as possible to the value t that we want.

But the problem is still awkward as stated, so let's simplify it: make (y - t)^2 as small as possible. The original problem thus becomes: make C(w, b) = (f(x, w, b) - t)^2 take as small a value as possible. This is no longer a hard problem: no matter how complex the function is, if C is reduced to a value that can be lowered no further, then we have reached the minimum (assuming we ignore local minima).

How do we make it descend? Mathematics tells us that for a multivariable function f(a, b, c, d, …) we can find a vector, called the gradient of the function. Note that the gradient is a direction vector: it points in the direction of the maximum rate of change of the function at that point (this theorem will not be proved here; it can be found in any calculus textbook). So the change ΔC of C(w, b) can be expressed as

ΔC ≈ (∂C/∂w)Δw + (∂C/∂b)Δb = ∇C · (Δw, Δb)

where

∇C = (∂C/∂w, ∂C/∂b)

and (Δw, Δb) is a small change at that point. We can pick these small changes however we like, as long as we make sure ΔC < 0. But to descend faster, why not choose the change along the negative gradient direction?

In fact, that is exactly the idea of gradient descent: we keep C decreasing, and at each update move w and b a small step against the gradient, Δw = -η ∂C/∂w and Δb = -η ∂C/∂b, where η is a small positive learning rate.
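The update rule above can be sketched in a few lines of Python. The model f(x, w, b) = w*x + b here is a made-up one-weight, one-bias example for illustration, not the network from the text:

```python
# Minimal gradient-descent sketch for C(w, b) = (f(x, w, b) - t)^2,
# using a hypothetical linear model f(x, w, b) = w*x + b.
def f(x, w, b):
    return w * x + b

def C(w, b, x, t):
    return (f(x, w, b) - t) ** 2

x, t = 2.0, 10.0          # one training example: we want f(2) ≈ 10
w, b = 0.0, 0.0           # initial weight and bias
eta = 0.05                # learning rate

for _ in range(200):
    err = f(x, w, b) - t  # analytic gradients: dC/dw = 2*err*x, dC/db = 2*err
    w -= eta * 2 * err * x
    b -= eta * 2 * err

print(round(C(w, b, x, t), 6))  # cost is now essentially 0
```

Each step moves (w, b) against the gradient, so C shrinks monotonically toward the minimum.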

OK, at this point it seems all the problems are solved. Let's rearrange our thoughts; we have transformed the problem through several steps:

the problem of updating the network's weights and biases ==> making the result of f(x, w, b) approximate t ==> the minimization problem for C(w, b) = (f(x, w, b) - t)^2 ==> descending C(w, b) along the gradient ==> reaching the minimum: the network is optimized

But don't forget one thing!! The derivation rests on a premise: that we know the gradient at the current point in advance. And that is not the case!!

This problem stumped NN researchers for many years. In 1969, M. Minsky and S. Papert published the book "Perceptrons", which gave an in-depth analysis of single-layer neural networks and proved mathematically that the capability of this kind of network is limited: it cannot even solve a logic problem as simple as XOR. At the same time, they noted that many patterns cannot be trained with a single-layer network, while for multilayer networks no effective low-complexity algorithm was known; in the end they even concluded that neural networks could not handle nonlinear problems. However, in 1974, Paul Werbos for the first time gave a learning algorithm for training general networks: back propagation. This algorithm can efficiently calculate the gradient at each iteration, making the derivation above actually realizable!!

Unfortunately, at the time almost nobody in the artificial neural network community knew of Paul's learning algorithm. It was not until the mid-1980s that the BP algorithm was independently rediscovered by David Rumelhart, Geoffrey Hinton and Ronald Williams, by David Parker, and by Yann LeCun. It then received wide attention and triggered the second wave of enthusiasm in the field of artificial neural networks.

Two. Introduction of the principle

As mentioned above, so-called backpropagation is simply a method for computing gradients. Don't rush into its principle: many articles jump straight into the formulas, which makes them hard to understand. Below is an answer from an expert on Zhihu.

Source: Zhihu, https://www.zhihu.com/question/27239198?rf=24827633


Suppose the network computes e = (a + b) * (b + 1) through intermediate nodes c = a + b and d = b + 1, with inputs a = 2, b = 1. In this case it is easy to write down the partial derivative between any pair of adjacent nodes.

Using the chain rule:

∂e/∂a = (∂e/∂c)(∂c/∂a)

as well as

∂e/∂b = (∂e/∂c)(∂c/∂b) + (∂e/∂d)(∂d/∂b)

The value of ∂e/∂a equals the product of the partial derivatives along the path from a to e, and the value of ∂e/∂b equals the product of the partial derivatives along path 1 (b-c-e) plus the product of the partial derivatives along path 2 (b-d-e). In other words, for an upper node p and a lower node q, to obtain ∂p/∂q we need to find all paths from node q to node p, take the product of all the partial derivatives along each path, and then sum these products.

In this example the partial derivatives are easy to obtain, because we already know the functional relation of the network, e = (a + b) * (b + 1): a network with no weights involved and a known input-output relation. In reality, we only know the relation between the error and the output, the C = (y - t)^2 mentioned above, and thousands of weights and biases intervene in the derivation. So let's change our thinking: can we instead compute the partial derivative of the output node with respect to every node below it?

Reuse the relations above. The partial derivative of e with respect to node c is 2; stack that result on c. The partial derivative of e with respect to node d is 3; stack that result on d. So far the second layer is finished, and each node totals the amount stacked on it and continues sending it to the next layer: node c sends 2*1 toward a and stacks it there, node c sends 2*1 toward b and stacks it there, and node d sends 3*1 toward b and stacks it there. So far the third layer is finished: the amount stacked on node a is 2, and the amount stacked on node b is 2*1 + 3*1 = 5,

i.e. the partial derivative of the top node e with respect to b is 5. A brief summary: starting from the top node e, process layer by layer. For each child in the layer below e, multiply 1 by the partial derivative along the path from e to that node, and "stack" the result on that child node. Once the layer below e has been propagated to in this way, every node in the second layer has some values "stacked" on it; then, focusing on each node, sum all the values "stacked" on it, and you obtain the partial derivative of the vertex e with respect to that node. Next, take the nodes of the second layer in turn as starting vertices, initialize each with the partial derivative of e with respect to it, and repeat the above propagation process layer by layer. In this way we can find the partial derivatives of the vertex e with respect to the nodes of every layer.
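The path-and-stack bookkeeping above can be written out directly for this small graph, using the node names a, b, c, d, e from the example:

```python
# Reverse-mode sketch of the e = (a + b) * (b + 1) example,
# with intermediate nodes c = a + b and d = b + 1, at a = 2, b = 1.
a, b = 2.0, 1.0
c = a + b            # c = 3
d = b + 1            # d = 2
e = c * d            # e = 6

# Adjacent-node (local) partial derivatives:
de_dc = d            # ∂e/∂c = d = 2
de_dd = c            # ∂e/∂d = c = 3
dc_da = 1.0          # ∂c/∂a = 1
dc_db = 1.0          # ∂c/∂b = 1
dd_db = 1.0          # ∂d/∂b = 1

# "Stack" rule: sum over all paths, multiplying the local
# derivatives along each path.
de_da = de_dc * dc_da                   # one path a-c-e: 2*1 = 2
de_db = de_dc * dc_db + de_dd * dd_db   # paths b-c-e and b-d-e: 2*1 + 3*1 = 5

print(de_da, de_db)  # 2.0 5.0
```

The two sums are exactly the "total stacked amounts" on nodes a and b described in the text.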

Three. A good example

Now let's take weights into account. Below is a good example that helps us understand backpropagation. Source: Charlotte77's blog:

http://www.cnblogs.com/charlotte77/p/5629865.html


Suppose you have a network like this:

The first layer is the input layer, containing two neurons i1, i2 and a bias term b1; the second layer is the hidden layer, containing two neurons h1, h2 and a bias term b2; the third layer is the output, o1 and o2. The wi marked on each line is the weight of the connection between layers, and the activation function defaults to the sigmoid function.

Now assign them initial values, as shown below:

input data: i1 = 0.05, i2 = 0.10;

target output: o1 = 0.01, o2 = 0.99;

initial weights: w1 = 0.15, w2 = 0.20, w3 = 0.25, w4 = 0.30;

w5 = 0.40, w6 = 0.45, w7 = 0.50, w8 = 0.55;

biases (from the source blog): b1 = 0.35, b2 = 0.60

Goal: given the input data i1, i2 (0.05 and 0.10), make the outputs as close as possible to the targets o1, o2 (0.01 and 0.99).

Step 1: forward propagation

1. Input layer ----> hidden layer:

Compute the weighted input sum of neuron h1:

net(h1) = w1*i1 + w2*i2 + b1*1 = 0.15*0.05 + 0.2*0.1 + 0.35*1 = 0.3775

The output o1 of neuron h1 (the activation function used here is the sigmoid):

out(h1) = 1 / (1 + e^(-net(h1))) = 1 / (1 + e^(-0.3775)) = 0.593269992

Similarly, we can calculate the output o2 of neuron h2:

out(h2) = 0.596884378

2. Hidden layer ----> output layer:

Compute the values of output-layer neurons o1 and o2:

net(o1) = w5*out(h1) + w6*out(h2) + b2*1 = 0.4*0.593269992 + 0.45*0.596884378 + 0.6*1 = 1.105905967

out(o1) = 1 / (1 + e^(-net(o1))) = 0.75136507

out(o2) = 0.772928465

So the forward propagation process is over. We get an output of [0.75136507, 0.772928465], which is still far from the actual values [0.01, 0.99]. Now we back-propagate the error, update the weights, and recalculate the output.
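As a check, the whole forward pass can be reproduced in a few lines. The bias values b1 = 0.35 and b2 = 0.60 are not listed in the text above; they are taken from the source blog, so treat them as assumptions:

```python
# Forward pass of the 2-2-2 example network.
# b1 = 0.35 and b2 = 0.60 are assumed from the source blog.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

i1, i2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60

# input -> hidden
out_h1 = sigmoid(w1 * i1 + w2 * i2 + b1)
out_h2 = sigmoid(w3 * i1 + w4 * i2 + b1)

# hidden -> output
out_o1 = sigmoid(w5 * out_h1 + w6 * out_h2 + b2)
out_o2 = sigmoid(w7 * out_h1 + w8 * out_h2 + b2)

print(out_o1, out_o2)  # ≈ 0.751365, 0.772928
```

The printed values match the forward-pass outputs quoted in the text.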

Step 2: back propagation

1. Calculate the total error

Total error: (squared error)

E(total) = Σ 1/2 * (target - output)^2

But there are two outputs, so we calculate the errors of o1 and o2 separately; the total error is the sum of the two:

E(o1) = 1/2 * (target(o1) - out(o1))^2 = 1/2 * (0.01 - 0.75136507)^2 = 0.274811083

E(o2) = 1/2 * (0.99 - 0.772928465)^2 = 0.023560026

E(total) = E(o1) + E(o2) = 0.274811083 + 0.023560026 = 0.298371109
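The total-error computation, using the sigmoid outputs of the forward pass:

```python
# Total squared error over the two outputs of the example network.
out_o1, out_o2 = 0.75136507, 0.772928465
target_o1, target_o2 = 0.01, 0.99

E_o1 = 0.5 * (target_o1 - out_o1) ** 2
E_o2 = 0.5 * (target_o2 - out_o2) ** 2
E_total = E_o1 + E_o2
print(round(E_total, 6))  # ≈ 0.298371
```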

2. Hidden layer ----> output layer weight update:

Take the weight parameter w5 as an example. If we want to know how much influence w5 has on the overall error, we can take the partial derivative of the overall error with respect to w5 (chain rule):

∂E(total)/∂w5 = ∂E(total)/∂out(o1) * ∂out(o1)/∂net(o1) * ∂net(o1)/∂w5

(A figure in the source blog shows more intuitively how the error propagates backward along this chain.)

Now let's calculate the value of each factor separately:

Calculate ∂E(total)/∂out(o1):

∂E(total)/∂out(o1) = -(target(o1) - out(o1)) = -(0.01 - 0.75136507) = 0.74136507

Calculate ∂out(o1)/∂net(o1):

∂out(o1)/∂net(o1) = out(o1) * (1 - out(o1)) = 0.75136507 * (1 - 0.75136507) = 0.186815602

(This step is just differentiating the sigmoid function. It's simple; you can derive it yourself.)

Calculate ∂net(o1)/∂w5:

since net(o1) = w5*out(h1) + w6*out(h2) + b2*1, we get ∂net(o1)/∂w5 = out(h1) = 0.593269992

Multiply the last three together:

∂E(total)/∂w5 = 0.74136507 * 0.186815602 * 0.593269992 = 0.082167041

In this way we have computed the partial derivative of the overall error E(total) with respect to w5.

Looking back at the formula above, we find:

∂E(total)/∂w5 = (out(o1) - target(o1)) * out(o1) * (1 - out(o1)) * out(h1)

For convenience of expression, we use δ(o1) to represent the error of the output layer:

δ(o1) = ∂E(total)/∂out(o1) * ∂out(o1)/∂net(o1) = (out(o1) - target(o1)) * out(o1) * (1 - out(o1))

Therefore, the partial derivative of the overall error E(total) with respect to w5 can be written as:

∂E(total)/∂w5 = δ(o1) * out(h1)

If the output-layer error is written with a negative sign, it can also be written as:

∂E(total)/∂w5 = -δ(o1) * out(h1)

Finally, let's update the value of w5:

w5+ = w5 - η * ∂E(total)/∂w5 = 0.4 - 0.5 * 0.082167041 = 0.35891648

(where η is the learning rate; here we take 0.5)
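The three-factor product and the update of w5 can be checked numerically with the intermediate values from the forward pass:

```python
# Chain-rule update of w5, using the forward-pass values from above.
out_o1 = 0.75136507
out_h1 = 0.593269992
target_o1 = 0.01
w5, eta = 0.40, 0.5

dE_dout = -(target_o1 - out_o1)      # ∂E(total)/∂out(o1)
dout_dnet = out_o1 * (1 - out_o1)    # sigmoid derivative ∂out(o1)/∂net(o1)
dnet_dw5 = out_h1                    # ∂net(o1)/∂w5
grad_w5 = dE_dout * dout_dnet * dnet_dw5

w5_new = w5 - eta * grad_w5
print(grad_w5, w5_new)  # grad ≈ 0.082167, new w5 ≈ 0.3589165
```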

Similarly, we can update w6, w7, w8:

w6+ = 0.408666186, w7+ = 0.511301270, w8+ = 0.561370121

3. Input layer ----> hidden layer weight update:

The method is similar to the one above, but one place needs to change. When computing the partial derivative of the total error with respect to w5 above, the chain was out(o1) ----> net(o1) ----> w5; but when updating the weights into the hidden layer, the chain is out(h1) ----> net(h1) ----> w1, and out(h1) receives error from two places, E(o1) and E(o2), so both of them must be accounted for here.

Calculate ∂E(total)/∂out(h1):

∂E(total)/∂out(h1) = ∂E(o1)/∂out(h1) + ∂E(o2)/∂out(h1)

First calculate ∂E(o1)/∂out(h1):

∂E(o1)/∂out(h1) = ∂E(o1)/∂net(o1) * ∂net(o1)/∂out(h1) = 0.138498562 * w5 = 0.138498562 * 0.40 = 0.055399425

Similarly, calculate ∂E(o2)/∂out(h1):

∂E(o2)/∂out(h1) = ∂E(o2)/∂net(o2) * ∂net(o2)/∂out(h1) = -0.038098236 * w7 = -0.019049118

The sum of the two:

∂E(total)/∂out(h1) = 0.055399425 - 0.019049118 = 0.036350307

Then calculate ∂out(h1)/∂net(h1):

∂out(h1)/∂net(h1) = out(h1) * (1 - out(h1)) = 0.593269992 * (1 - 0.593269992) = 0.241300709

Then calculate ∂net(h1)/∂w1:

∂net(h1)/∂w1 = i1 = 0.05

Finally, multiply the three together:

∂E(total)/∂w1 = 0.036350307 * 0.241300709 * 0.05 = 0.000438568

To simplify the formula, we use δ(h1) to represent the error of hidden-layer unit h1:

δ(h1) = (δ(o1)*w5 + δ(o2)*w7) * out(h1) * (1 - out(h1))

so that ∂E(total)/∂w1 = δ(h1) * i1.

Finally, update the weight w1:

w1+ = w1 - η * ∂E(total)/∂w1 = 0.15 - 0.5 * 0.000438568 = 0.149780716

Similarly, we can update the weights w2, w3, w4:

w2+ = 0.19956143, w3+ = 0.24975114, w4+ = 0.29950229
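The two-path hidden-layer update for w1 can likewise be checked numerically; the output-layer errors delta_o1 and delta_o2 here are the values computed earlier in the example:

```python
# Hidden-layer update for w1: out(h1) receives error from both
# E(o1) and E(o2), so the two contributions are summed before
# applying the local sigmoid derivative and the input.
out_h1, i1 = 0.593269992, 0.05
delta_o1, delta_o2 = 0.138498562, -0.038098236   # output-layer errors
w5, w7 = 0.40, 0.50                              # pre-update values
w1, eta = 0.15, 0.5

dE_douth1 = delta_o1 * w5 + delta_o2 * w7        # sum of both error paths
delta_h1 = dE_douth1 * out_h1 * (1 - out_h1)     # hidden-layer error
grad_w1 = delta_h1 * i1

w1_new = w1 - eta * grad_w1
print(grad_w1, w1_new)  # ≈ 0.0004386 and 0.1497807
```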

This completes one round of error backpropagation. We then recompute with the updated weights and iterate again and again. In this example, after the first iteration the total error E(total) drops from 0.298371109 to 0.291027924. After 10000 iterations, the total error is 0.000035085 and the output is [0.015912196, 0.984065734] (the target is [0.01, 0.99]), which shows the method works well.
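Putting the pieces together, here is a sketch of the full training loop for this example. As before, the biases b1 = 0.35 and b2 = 0.60 are assumed from the source blog and, following the example, are left un-updated:

```python
# Full training loop for the 2-2-2 example: forward pass, backward
# pass, weight update, repeated. Biases are assumed (b1 = 0.35,
# b2 = 0.60) and, as in the example, are not updated.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

i1, i2 = 0.05, 0.10
t1, t2 = 0.01, 0.99
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
eta = 0.5

for _ in range(10000):
    # forward pass
    out_h1 = sigmoid(w1 * i1 + w2 * i2 + b1)
    out_h2 = sigmoid(w3 * i1 + w4 * i2 + b1)
    out_o1 = sigmoid(w5 * out_h1 + w6 * out_h2 + b2)
    out_o2 = sigmoid(w7 * out_h1 + w8 * out_h2 + b2)

    # output-layer errors delta(o1), delta(o2)
    d_o1 = -(t1 - out_o1) * out_o1 * (1 - out_o1)
    d_o2 = -(t2 - out_o2) * out_o2 * (1 - out_o2)

    # hidden-layer errors (using the pre-update weights)
    d_h1 = (d_o1 * w5 + d_o2 * w7) * out_h1 * (1 - out_h1)
    d_h2 = (d_o1 * w6 + d_o2 * w8) * out_h2 * (1 - out_h2)

    # weight updates
    w5 -= eta * d_o1 * out_h1; w6 -= eta * d_o1 * out_h2
    w7 -= eta * d_o2 * out_h1; w8 -= eta * d_o2 * out_h2
    w1 -= eta * d_h1 * i1;     w2 -= eta * d_h1 * i2
    w3 -= eta * d_h2 * i1;     w4 -= eta * d_h2 * i2

E_total = 0.5 * (t1 - out_o1) ** 2 + 0.5 * (t2 - out_o2) ** 2
print(E_total, out_o1, out_o2)  # error ends up very small
```

After 10000 iterations the outputs sit close to the targets, in line with the figures quoted above.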

The backpropagation algorithm (Backpropagation) is currently the most common and most effective algorithm for training artificial neural networks (Artificial Neural Network, ANN). Its main idea is:

(1) Feed the training-set data into the ANN's input layer; it passes through the hidden layers and finally reaches the output layer, which outputs the result. This is the ANN's forward propagation process;

(2) Because the ANN's output differs from the actual result, compute the error between the estimated value and the actual value, and propagate this error backward from the output layer through the hidden layers, until it reaches the input layer;

(3) During back propagation, adjust the values of the various parameters according to the error; iterate this process until convergence.

The idea of the backpropagation algorithm is easy to understand, but the concrete formulas must be derived step by step, so this article focuses on the derivation of the formulas.

1. Variable definition

Above is a three-layer artificial neural network; layer 1 to layer 3 are the input layer, the hidden layer, and the output layer. As shown in the figure, first define some variables:

w^l_jk denotes the weight of the connection from the k-th neuron in layer l-1 to the j-th neuron in layer l;

b^l_j denotes the bias of the j-th neuron in layer l;

z^l_j denotes the weighted input of the j-th neuron in layer l, namely:

z^l_j = Σ_k w^l_jk * a^(l-1)_k + b^l_j

a^l_j denotes the output (activation) of the j-th neuron in layer l, namely:

a^l_j = σ(z^l_j)

where σ is the activation function.

2. Cost function

The cost function is used to calculate the error between the ANN's output value and the actual value. A common cost function is the quadratic cost function (Quadratic cost function):

C = 1/(2n) Σ_x || y(x) - a^L(x) ||^2

where x is an input sample, y(x) is the actual classification (the label), a^L(x) is the predicted output, and L is the index of the last layer of the network.

3. Formulas and their derivation

This section describes the 4 formulas used by the backpropagation algorithm and derives them. If you don't want to follow the derivation process, skip directly to the algorithm steps in section 4. First, define the error of the j-th neuron in layer l (that is, the quantity propagated back from the difference between the actual and predicted values) as:

δ^l_j = ∂C/∂z^l_j

This article takes a single input sample as the example; in this case the cost function is expressed as:

C = 1/2 || y - a^L ||^2 = 1/2 Σ_j (y_j - a^L_j)^2

Formula 1 (the error produced by the last layer of the neural network):

δ^L = ∇_a C ⊙ σ'(z^L)        (BP1)

where ⊙ denotes the Hadamard product, the element-wise multiplication of matrices or vectors. The derivation of formula 1 is as follows:

δ^L_j = ∂C/∂z^L_j = (∂C/∂a^L_j) * (∂a^L_j/∂z^L_j) = (∂C/∂a^L_j) * σ'(z^L_j)

Formula 2 (computing, from back to front, the error produced by each layer of the network):

δ^l = ((w^(l+1))^T δ^(l+1)) ⊙ σ'(z^l)        (BP2)

Derivation:

δ^l_j = ∂C/∂z^l_j = Σ_k (∂C/∂z^(l+1)_k) * (∂z^(l+1)_k/∂z^l_j) = Σ_k δ^(l+1)_k * w^(l+1)_kj * σ'(z^l_j)

Formula 3 (the gradient of the weights):

∂C/∂w^l_jk = a^(l-1)_k * δ^l_j        (BP3)

Derivation:

∂C/∂w^l_jk = (∂C/∂z^l_j) * (∂z^l_j/∂w^l_jk) = δ^l_j * a^(l-1)_k

Formula 4 (the gradient of the biases):

∂C/∂b^l_j = δ^l_j        (BP4)

Derivation:

∂C/∂b^l_j = (∂C/∂z^l_j) * (∂z^l_j/∂b^l_j) = δ^l_j * 1 = δ^l_j

4. Pseudocode of the backpropagation algorithm

* Input the training set

* For each sample x in the training set, set the activation of the input layer: a^1 = x

* Forward propagation: z^l = w^l a^(l-1) + b^l,  a^l = σ(z^l)

* Compute the error produced by the output layer: δ^L = ∇_a C ⊙ σ'(z^L)

* Back-propagate the error: δ^l = ((w^(l+1))^T δ^(l+1)) ⊙ σ'(z^l)

* Using gradient descent (gradient descent), train the parameters:

  w^l → w^l - (η/m) Σ_x δ^(x,l) (a^(x,l-1))^T

  b^l → b^l - (η/m) Σ_x δ^(x,l)
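The four formulas and the pseudocode above can be sketched as a small vectorized implementation. The layer sizes, the random initialization, the sample data, and the learning rate here are arbitrary illustration values:

```python
# Minimal vectorized sketch of the four BP formulas for a
# fully-connected network with sigmoid activations and quadratic cost.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

rng = np.random.default_rng(0)
sizes = [2, 3, 1]   # input, hidden, output widths (made up)
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=(m, 1)) for m in sizes[1:]]

def backprop(x, y, eta=0.5):
    """One forward/backward pass and gradient-descent step; returns the cost."""
    global weights, biases
    # forward pass: store every z^l and a^l
    a, zs, acts = x, [], [x]
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        acts.append(a)
    # BP1: output-layer error
    delta = (acts[-1] - y) * sigmoid_prime(zs[-1])
    grads_w = [None] * len(weights)
    grads_b = [None] * len(biases)
    grads_w[-1] = delta @ acts[-2].T    # BP3
    grads_b[-1] = delta                 # BP4
    # BP2: propagate the error backward through the layers
    for l in range(2, len(sizes)):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grads_w[-l] = delta @ acts[-l - 1].T
        grads_b[-l] = delta
    # gradient-descent step
    weights = [w - eta * gw for w, gw in zip(weights, grads_w)]
    biases = [b - eta * gb for b, gb in zip(biases, grads_b)]
    return 0.5 * float(np.sum((acts[-1] - y) ** 2))

x = np.array([[0.05], [0.10]])   # one made-up training sample
y = np.array([[0.9]])
errors = [backprop(x, y) for _ in range(2000)]
print(errors[0], errors[-1])     # the error decreases over training
```

The loop is just the pseudocode above specialized to one sample (m = 1): forward pass, BP1, BP2, then the BP3/BP4 gradients feed a gradient-descent step.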