In this era of artificial intelligence, an aspiring programmer, student, or hobbyist who doesn't understand the hot topic of deep learning can seem out of touch with the times.

However, deep learning's mathematical prerequisites, including calculus, linear algebra, probability theory, and mathematical statistics, make many ambitious young people hesitate. So here comes the question: to understand deep learning, do you really need all that knowledge or not?

There is plenty of material about deep learning on the Internet, but most of it is not suitable for beginners. Mr. Yang has summed up a few reasons:

Deep learning really does require some mathematical foundation. If an introduction dives into the depths and the fine details right away, some readers will be intimidated by the difficulty and give up too early.

Books and articles, whether written by Chinese or American authors, are generally quite difficult.

The mathematics needed for deep learning is not as hard as people think; you only need to understand derivatives and a little about functions. If you have never studied advanced mathematics, that's fine: this article is written so that even liberal arts students can follow it, using nothing beyond junior high school mathematics.

Don't be afraid of difficulty. I admire Li Shufu's spirit. In a TV interview, Li Shufu said: who says the Chinese can't build cars? What's so hard about building a car? It's just four wheels and two rows of sofas. Of course, his conclusion is biased, but the spirit is admirable.

Explaining the derivative in deep learning through "Wang Xiaoer sells pigs"

What is a derivative?

It's just a rate of change. For example: Wang Xiaoer sold 100 pigs this year, 90 last year, 80 the year before that... What is the rate of change, or growth rate? Ten more pigs per year. Simple.

Notice that there is a time variable here: the year. The growth rate of Wang Xiaoer's pig sales is 10 pigs per year; in other words, the derivative is 10.

As a function: y = f(x) = 10x + 30. Here we assume Wang Xiaoer sold 30 pigs in the first year and 10 more each year after that; x represents time (in years) and y represents the number of pigs sold.

Of course, this is a case of constant growth. In real life the amount of change is often not fixed; that is, the growth rate is not constant.

For example, the function might look like this: y = f(x) = 5x² + 30. Here x and y still represent time and the number of pigs, but the growth rate now changes over time. How to calculate that changing growth rate we will discuss later; or you can simply memorize a few differentiation formulas.
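To make this concrete, here is a minimal sketch (the function names are my own) that estimates the growth rate of both pig-sales functions numerically, by comparing f at two nearby points:

```python
# Estimate the growth rate (the derivative) of each pig-sales
# function with a finite difference: (f(x + h) - f(x)) / h.

def f_linear(x):
    return 10 * x + 30        # constant growth: 10 pigs per year

def f_quadratic(x):
    return 5 * x ** 2 + 30    # growth rate itself changes over time

def rate_of_change(f, x, h=1e-6):
    return (f(x + h) - f(x)) / h

print(round(rate_of_change(f_linear, 3)))     # 10, in any year
print(round(rate_of_change(f_quadratic, 1)))  # 10 in year 1
print(round(rate_of_change(f_quadratic, 3)))  # 30 in year 3
```

The linear function's rate is 10 everywhere, while the quadratic's rate grows with x: exactly the "growth rate is not constant" case described above.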

Deep learning also relies on another important mathematical concept: the partial derivative.

How should we understand the "partial" in partial derivative? Is it a partial headache? Or "I won't let you differentiate, but you stubbornly insist"? (In Chinese, the same character means both "partial" and "stubbornly".)

Neither. Let's stick with Wang Xiaoer. As we just said, the variable x is time (in years), but the number of pigs sold does not depend on time alone. As the business grew, Wang Xiaoer not only expanded the pig farm but also hired many employees to raise pigs.

So the equation changes again: y = f(x₁, x₂, x₃) = 5x₁² + 8x₂ + 35x₃ + 30

Here x₂ represents the area of the farm, x₃ the number of employees, and x₁ is still time.

Explaining deep learning's "partial derivative" with flirting as an example

What is a partial derivative?

A partial derivative is simply the rate of change with respect to one variable when there are several. In the formula above, taking the partial derivative with respect to x₃ asks: how much do employees contribute to the growth in pigs sold?

In other words: for each additional employee, how many more pigs are sold? Here the answer is 35: every additional employee means 35 more pigs sold.

When computing a partial derivative, the other variables are treated as constants. This is very important: a constant has a rate of change of 0, so its derivative is 0. Taking the derivative of 35x₃ with respect to x₃ therefore gives 35. The partial derivative with respect to x₂ works the same way (it is 8).

For partial derivatives we use a special symbol: ∂y/∂x₃ means the partial derivative of y with respect to x₃.
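The "hold the other variables constant" rule can be checked numerically too. A small sketch (function and variable names are illustrative): nudge one variable, keep the rest fixed, and see how y responds.

```python
# Partial derivatives of y = 5*x1**2 + 8*x2 + 35*x3 + 30, estimated
# by nudging one variable while holding the others fixed.

def pigs(x1, x2, x3):
    return 5 * x1 ** 2 + 8 * x2 + 35 * x3 + 30

def partial(f, args, i, h=1e-6):
    bumped = list(args)
    bumped[i] += h            # nudge only variable i
    return (f(*bumped) - f(*args)) / h

point = (2.0, 4.0, 6.0)       # some year, farm area, employee count
print(round(partial(pigs, point, 0)))  # dy/dx1 = 10*x1 = 20 here
print(round(partial(pigs, point, 1)))  # dy/dx2 = 8
print(round(partial(pigs, point, 2)))  # dy/dx3 = 35
```

Note that the x₂ and x₃ partials are the constants 8 and 35 no matter where you evaluate them, while the x₁ partial depends on the current x₁, just as the text says.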

I've rambled for a while; what does all this have to do with deep learning? Quite a lot: deep learning uses neural networks, which are used to solve problems that are not linearly separable.

Here I mainly want to explain the relationship between the mathematics and deep learning. First, some pictures:

Figure 1: Deep learning is a neural network with many hidden layers.

Figure 2: How to take partial derivatives with a single output.

Figure 3: How to take partial derivatives with multiple outputs.

The last two figures come from a Japanese book on deep learning. The Japanese labels "entrance layer", "output layer", and "middle layer" correspond to input layer, output layer, and hidden layer.

Don't be scared by these pictures; it's really simple. Take flirting as an example. We can roughly divide a romantic relationship into three stages:

First attraction. This corresponds to deep learning's input layer. Many factors make someone attractive to you, such as height, figure, face, education, personality, and so on. These are the input layer's parameters, and everyone weights them differently.

The infatuation period. Let's map this to the hidden layer. In this period, there is all kinds of mutual adjustment over the details of daily life.

The stable period. This corresponds to the output layer: whether you are a good match depends on how well the mutual adjustment went. As everyone knows, this adjustment matters. How does it happen? Through a continuous process of learning, training, and revision!

For example, my girlfriend likes strawberry cake, but you bought blueberry. Her feedback is negative, so next time don't buy blueberry; buy strawberry.

Having read this far, some guys may be itching to try this on their girlfriends. That makes me a little uneasy, so let me add: flirting, like deep learning, must avoid both underfitting and overfitting.

Underfitting, in deep learning, means insufficient training or insufficient data; in flirting terms, you are simply inexperienced.
To fit properly, sending flowers is just the baseline; other aspects need work too, such as improving your sense of humor. One thing worth stressing here: underfitting is bad, but overfitting is even worse.

Overfitting is the opposite of underfitting. On the one hand, if you overfit, she may think you have the potential of an Edison Chen; worse, everyone is different. It's just like deep learning: the model works well on the training set but fails on the test set!

In flirting terms, she will suspect that your past (your training set) has shaped you too much, and that is taboo! If you give her that impression, you'll be in trouble later. Remember that!

Deep learning is likewise a continuous adjustment process. You start by defining a standard set of parameters (these are empirical values, just as flowers are mandatory on Valentine's Day and birthdays), and then revise them again and again: the weights between the nodes in Figure 1.

Why all this adjustment? Think of it this way: suppose the deep learning system is a child. How do we teach the child to recognize things?

You first show the child pictures and tell him the right answers. You need a lot of pictures, and you must teach and train him constantly. This training process is, in essence, the process of solving for the weights of the neural network.
Later, at test time, you just show him a picture and he knows what is in it.

So the training set is like showing the child pictures together with the right answers. In deep learning, the training set is used to solve for the network's weights and arrive at the final model, while the test set is used to verify the model's accuracy.

For a trained model, as shown in the figures below, the weights (w1, w2, ...) are all known.

Figure 4

Figure 5

As shown above, computing from left to right is easy. But what about the reverse? At training time we have the pictures and the right answers, and we need to work backwards to find w1, w2, ... What then?

How do we find the partial derivatives?

After all this buildup, it is finally time to take partial derivatives. The situation is this:

Assume a neural network has already been defined: how many layers, how many nodes per layer, plus default weights and activation functions. Once the input (an image) is fixed, the only way to change the output is to adjust the parameters. But how do we adjust them?

Each parameter has a default value. We add a small increment ∆ to a parameter and see what happens. If increasing the parameter widens the gap between the output and the right answer, we should decrease the parameter by ∆ instead, because our goal is to make the gap smaller; and vice versa.

So to optimize the parameters, we need to know the error's rate of change with respect to each parameter. Isn't that exactly the partial derivative of the error with respect to that parameter?
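That perturb-and-check loop can be sketched with a toy one-weight "network" (everything here, names and numbers included, is a made-up illustration, not a real framework):

```python
# Nudge a weight by delta, watch how the error responds, and move
# the weight in the direction that shrinks the error.

def predict(w, x):
    return w * x                      # a one-weight "network"

def error(w, x, target):
    return (predict(w, x) - target) ** 2

w, x, target = 0.5, 2.0, 6.0          # the gap closes at w = 3
delta, lr = 1e-4, 0.05

for _ in range(100):
    # finite-difference estimate of d(error)/d(weight)
    grad = (error(w + delta, x, target) - error(w, x, target)) / delta
    w -= lr * grad                    # gap grew as w grew => lower w
print(round(w, 2))                    # ends up near 3.0
```

After a hundred small corrections the weight settles near the value that makes the gap zero, which is the whole point of the adjustment process described above.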

Two points deserve attention here.
The first is the activation function, whose main purpose is to make the whole network nonlinear. As mentioned earlier, in many cases a linear function cannot classify the inputs properly (and recognition is mostly classification).

So we want the network to learn a nonlinear function, and this is where the activation function comes in: because it is nonlinear, the whole network gains nonlinear behavior.

In addition, the activation function keeps each node's output within a controllable range, which makes computation easier.

If that explanation still feels abstract, we can again use flirting as a metaphor: girls don't like plain boiled water, because it's too linear. Life needs some romance, and the activation function is like the little romance, the little surprises, in daily life.

At every stage of a relationship, you need to activate things now and then: create a little romance, a small surprise.
For example, an ordinary girl admires cute little cups and porcelain; then on her birthday you give her a special one, and she is moved to tears.

As mentioned earlier, a man should be humorous: make her laugh, and at the right moments, make her cry. A few rounds of laughing and crying, and she can't leave you, because your nonlinearity is too strong.

Of course, going beyond the limit is as bad as falling short. More small surprises are not always better, but with none at all you're back to boiled water. It's like the layers of a network: each layer can have an activation function, and you don't have to add one to every layer, but having none at all won't work.

The key is how to compute the partial derivatives. Figures 2 and 3 give the derivations for the single-output and multiple-output cases. It's very simple: take partial derivatives from right to left, layer by layer. Differentiating between adjacent layers is easy: because the connection is linear, the partial derivative is just the weight itself, exactly like taking the partial derivative with respect to x₃ earlier. Then multiply the partial derivatives together.
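Here is a tiny illustration of that multiply-right-to-left step, using a two-link linear chain (all values are made up): because each link is linear, each local partial derivative is just a parameter, and the chain rule multiplies them.

```python
# y = w2 * (w1 * x): two linear "layers" in a row.
x, w1, w2 = 1.5, 0.8, -1.2
hidden = w1 * x
y = w2 * hidden

# Right to left: dy/dhidden = w2 and dhidden/dw1 = x, so
# dy/dw1 = dy/dhidden * dhidden/dw1 = w2 * x.
dy_dw1 = w2 * x

# Check against a direct nudge of w1:
h = 1e-6
numeric = (w2 * ((w1 + h) * x) - y) / h
print(abs(dy_dw1 - numeric) < 1e-4)   # the two estimates agree
```

The product of the local derivatives matches the effect of nudging w1 directly, which is all that backpropagation through linear layers amounts to.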

Again, note the activation function here. The activation function is nothing mysterious: it maps each node's output into the interval from 0 to 1, which makes computation easy; it is just one more one-to-one mapping layered on the result.

Because of the activation function, it must also be accounted for when taking partial derivatives. The sigmoid is commonly used; ReLU and others work too. Differentiating the sigmoid is very simple:

Its derivative: f'(x) = f(x) * [1 - f(x)]

If you have time, you can verify this with a bit of calculus; if not, just memorize it. As for ReLU, that's even easier: f(x) = 0 when x < 0, and f(x) = x otherwise.

Of course, you can also define your own ReLU variant, for example letting y = 0.01x when x is less than 0; that works too.
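The activation functions above, together with their derivatives, fit in a few lines. This is a sketch; the 0.01 slope of the home-made variant follows the text.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_deriv(x):                 # f'(x) = f(x) * (1 - f(x))
    fx = sigmoid(x)
    return fx * (1 - fx)

def relu(x):                          # 0 when x < 0, x otherwise
    return x if x > 0 else 0.0

def relu_deriv(x):
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, slope=0.01):        # the "define your own" variant
    return x if x > 0 else slope * x

print(sigmoid(0), sigmoid_deriv(0))   # 0.5 0.25
print(relu(-3), relu(3))              # 0.0 3
```

Note how the sigmoid's derivative is computed from its own output, which is why the formula f(x) * [1 - f(x)] is so convenient during backpropagation.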

What is the learning coefficient?

The second point is the learning coefficient. Why is it called that?

We just discussed the increment ∆. How large should each adjustment be? Should it simply equal the partial derivative (the rate of change)?

Experience tells us the adjustment should be the derivative multiplied by a small fraction: this fraction is the learning coefficient. Moreover, as training deepens, this coefficient can be changed.
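Why a fraction? A toy example (the error function and all numbers are made up) shows what happens with and without one:

```python
# Toy error (w - 3)**2 with derivative 2*(w - 3).

def grad(w):
    return 2 * (w - 3)

def train(lr, steps=50, w=0.0):
    for _ in range(steps):
        w -= lr * grad(w)   # step = learning coefficient * derivative
    return w

print(round(train(0.1), 3))   # small fraction: settles at 3.0
print(train(1.0))             # full derivative: bounces between 0 and 6
```

With a small learning coefficient the weight converges to the optimum; stepping by the full derivative overshoots the optimum every time and never settles.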

Of course, there is other important basic knowledge too, such as SGD (stochastic gradient descent), mini-batches, and epochs (ways of sampling and iterating over the training set).

What has been described above is mainly how to adjust the parameters, which is still the elementary stage. As noted, before any adjustment there is already a default network model with default parameters. How are the initial model and parameters defined? That requires deeper study.

For ordinary engineering work, however, it's enough to tune parameters on a default network, which amounts to applying existing algorithms. Scholars and scientists, on the other hand, invent the algorithms, which is very difficult. Salute to them!

Source: Jacky Yang on Zhihu
