Plain-Language AI: Is Deep Learning Really So Hard? Junior High Math and 10 Minutes Are Enough
In this era of artificial intelligence, if you are an aspiring programmer, a student, or a hobbyist and you don't understand the hot topic of deep learning, you may feel out of touch with the times.

However, the mathematics that deep learning seems to demand, including calculus, linear algebra, probability theory, and mathematical statistics, makes many ambitious young people hesitate. So here is the question: to understand deep learning, do you really need all of that knowledge?
There is plenty of material about deep learning on the Internet, but most of it is not suitable for beginners. Mr. Yang sums up a few reasons:

Deep learning really does require some mathematical foundation. If a text dives straight into the underlying laws without easing in, some readers will be intimidated and give up too early.

Books and articles, whether by Chinese or American authors, are generally difficult.
In fact, the mathematical foundation needed for deep learning is not as hard as we think: you only need the concept of a derivative and some related functions. Never studied advanced mathematics? Even better; this article is written so that liberal arts students can follow it. Junior high school mathematics is enough.
Don't be afraid of difficulty. I admire Li Shufu's spirit. In a TV interview, Li Shufu said: who says the Chinese can't build cars? What's so hard about building a car? It's just four wheels and two rows of sofas. His conclusion was biased, of course, but the spirit is admirable.
Understanding derivatives in deep learning through "Wang Xiaoer sells pigs"

What is a derivative?
It is just a rate of change. For example: Wang Xiaoer sold 100 pigs this year, 90 last year, 80 the year before... What is the rate of change, i.e. the growth rate? 10 more pigs per year. Simple.

Notice that there is a time variable here: the year. The growth rate of Wang Xiaoer's pig sales is 10 pigs per year; in other words, the derivative is 10.

As a function: y = f(x) = 10x + 30. Here we assume Wang Xiaoer sold 30 pigs in the first year and 10 more each year after that; x represents time (in years) and y the number of pigs.
Of course, this is the case of a fixed growth rate. In real life the amount of change is often not fixed, that is, the growth rate is not constant.

For example, the function might look like this: y = f(x) = 5x² + 30. Here x and y still represent time and the number of pigs, but the growth rate now changes over time. How do we compute that changing growth rate? We will come back to it later; for now you can simply memorize a few differentiation formulas.
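As a preview of that "later", the changing growth rate can be checked numerically. Below is a minimal Python sketch (my illustration, not from the original article) that estimates the derivative of y = 5x² + 30 by nudging x a tiny amount; the exact answer from the power rule is f'(x) = 10x.

```python
# Minimal sketch: estimate the growth rate of y = 5x^2 + 30 numerically.
# The exact derivative from the power rule is f'(x) = 10x.

def f(x):
    return 5 * x ** 2 + 30

def numerical_derivative(func, x, h=1e-6):
    # Central difference: average slope over a tiny interval around x.
    return (func(x + h) - func(x - h)) / (2 * h)

# In year 2 the growth rate is about 10 * 2 = 20 pigs per year.
print(numerical_derivative(f, 2))
```

Unlike the fixed rate of 10 pigs per year, the answer here depends on which year you ask about.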
Deep learning relies on another important mathematical concept: the partial derivative.
How should we understand the "partial" in partial derivative? Partial as in a partial headache, or partial to someone? (In Chinese, the term invites puns that don't survive translation.)

Neither. Let's take Wang Xiaoer as an example again. We just said that the variable x is time (in years), but the number of pigs sold depends on more than time: as the business grew, Wang Xiaoer not only expanded the pig farm but also hired many employees.

So the equation changes again: y = f(x₁, x₂, x₃) = 5x₁² + 8x₂ + 35x₃ + 30

Here x₂ represents the farm's area, x₃ the number of employees, and x₁ is still time.
Explaining deep learning's "partial derivative" with a dating analogy

What is a partial derivative?
A partial derivative is simply the rate of change with respect to one variable when there are multiple variables. In the formula above, taking the partial derivative with respect to x₃ asks: how much do employees contribute to the growth in pig sales?

In other words: for each additional employee, how many more pigs are sold? The answer is 35. One more employee, 35 more pigs.

When computing a partial derivative, the other variables are treated as constants. This is important: the rate of change of a constant is 0, so its derivative is 0. We therefore only need to differentiate 35x₃, which gives 35. The partial derivative with respect to x₂ is found the same way.

For partial derivatives we use a special symbol: ∂y/∂x₃ means the partial derivative of y with respect to x₃.
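To make the "hold the others constant" idea concrete, here is a small Python sketch (my illustration, not from the article) that numerically checks the partial derivatives of Wang Xiaoer's formula by nudging one variable at a time.

```python
# Minimal sketch: check the partial derivatives of
# y = 5*x1^2 + 8*x2 + 35*x3 + 30 by nudging one variable at a time.

def pigs(x1, x2, x3):
    return 5 * x1 ** 2 + 8 * x2 + 35 * x3 + 30

def partial(func, args, i, h=1e-6):
    # Perturb only argument i; all the other variables stay constant.
    up, down = list(args), list(args)
    up[i] += h
    down[i] -= h
    return (func(*up) - func(*down)) / (2 * h)

point = (2.0, 4.0, 6.0)  # some arbitrary year / area / employee count
print(partial(pigs, point, 2))  # one more employee, 35 more pigs
```

Note that ∂y/∂x₃ comes out as 35 no matter which point you evaluate at, while ∂y/∂x₁ depends on x₁, just as the text describes.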
We have rambled for a while, but what does all this have to do with deep learning? A lot. Deep learning uses neural networks to solve problems that are not linearly separable.

Here I mainly want to show the connection between the mathematics and deep learning. First, some pictures:
Figure 1: Deep learning is a neural network with many hidden layers

Figure 2: How to take partial derivatives with a single output

Figure 3: How to take partial derivatives with multiple outputs

The last two figures come from a Japanese book on deep learning. The so-called entrance layer, exit layer, and middle layer correspond to the usual terms: input layer, output layer, and hidden layer.
Don't be scared by these pictures; it is all quite simple. Take dating as an example. We can roughly divide a romantic relationship into three stages:

First attraction, the equivalent of deep learning's input layer. Many factors attract you to someone: height, figure, face, education, personality, and so on. These are the input layer's parameters, and the weights differ from person to person.

Infatuation, which we map to the hidden layers. This period is all kinds of mutual adjustment over everyday matters.

Stability, corresponding to the output layer. Whether you are a good match depends on how that adjustment went. As everyone knows, the adjustment matters. And how do you adjust? It is a process of continuous learning, training, and revision!
For example, suppose my girlfriend likes strawberry cake but I bought blueberry. Her feedback was negative: don't buy blueberry next time, buy strawberry.
Having read this far, some guys may be itching to apply the lesson to their girlfriends. That makes me a little uneasy, so let me add one thing: flirting is like deep learning in that you must prevent underfitting, and also prevent overfitting.

Underfitting, in deep learning, means insufficient training or insufficient data. In dating terms, you are simply inexperienced.

To fit better, sending flowers is of course the baseline, and other skills should improve too, for example your sense of humor. One thing must be said here: underfitting is bad, but overfitting is even worse.

Overfitting is the opposite of underfitting. For one thing, if you overfit, she may suspect you have the makings of an Edison Chen. More importantly, everyone is different. It is just like deep learning: the model works well on the training set but fails on the test set!

In dating terms, she will think your past (the training set) has left too deep a mark on you, and that is taboo! Give her that impression and you are in for trouble later. Remember this!
Deep learning is likewise a process of continuous adjustment. You start by defining some standard parameters (empirical values, the way flowers are mandatory on Valentine's Day and birthdays), and then revise them again and again: the weights between the nodes in Figure 1.

Why all this adjustment? Think of it this way: suppose the deep learning system is a child. How do we teach him to recognize things?

We show him pictures and tell him the right answers. It takes many pictures, teaching and training him over and over. This training process is essentially the process of solving for the weights of a neural network.

Later, at test time, you simply give him a picture and he tells you what is in it.

So the training set is the pictures shown to the child together with the right answers. For deep learning, the training set is used to solve for the network's weights and produce the final model, while the test set is used to verify the model's accuracy.
For a trained model, as shown in the figure below, the weights (w1, w2, ...) are all known.

As above, computing from left to right is easy. But the reverse is the hard part: the training set gives you pictures and their correct answers, and you must work backwards to find w1, w2, ... What do you do?
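The easy "left to right" direction can be sketched in a few lines of Python. All the weights and inputs below are made-up illustrative numbers, not values from the figures; sigmoid is used as the activation.

```python
import math

# Minimal sketch of the easy left-to-right pass: two inputs, one hidden
# layer of two nodes, one output. All weights here are invented.

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def forward(x, w_hidden, w_out):
    # Each hidden node: weighted sum of the inputs, then the activation.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    # Output node: weighted sum of the hidden activations.
    return sum(w * h for w, h in zip(w_out, hidden))

x = [1.0, 0.5]                        # the fixed input (e.g. pixel values)
w_hidden = [[0.2, -0.4], [0.7, 0.1]]  # known weights into the hidden layer
w_out = [0.5, -0.3]                   # known weights into the output
print(forward(x, w_hidden, w_out))
```

With the weights known, the output is a single pass of multiply, add, and activate; the whole difficulty of training lies in the opposite direction.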
How do we find the partial derivatives?

After all this build-up, it is finally time to take partial derivatives. The situation is this:

Assume a neural network has already been defined: how many layers, how many nodes per layer, plus default weights and activation functions. Once the input (an image) is fixed, the only way to change the output value is to adjust the parameters. How do we adjust them?

Each parameter has a default value. We add a small increment ∆ to a parameter and see what happens. If increasing the parameter also widens the gap between the output and the correct answer, then we should decrease the parameter by ∆ instead, because our goal is to make the gap smaller; and vice versa.

So, to optimize the parameters, we need the rate of change of the error with respect to each parameter. Isn't that exactly the partial derivative of the error with respect to the parameter?
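This "nudge a parameter, watch the error" idea is already enough to train a toy model. The sketch below (a hypothetical one-parameter model, not the article's network) estimates the error's rate of change with a small ∆ and repeatedly moves the parameter the other way.

```python
# Minimal sketch: adjust one parameter by nudging it and watching the
# error. Model and target are hypothetical, not the article's network.

def model(w, x):
    return w * x            # a one-parameter "network"

def error(w):
    x, target = 2.0, 10.0   # one training example: we want model(w, 2) == 10
    return (model(w, x) - target) ** 2

w, delta, lr = 0.0, 1e-6, 0.01
for _ in range(200):
    # If increasing w by delta increases the error, this estimate is
    # positive and we decrease w; and vice versa.
    grad = (error(w + delta) - error(w - delta)) / (2 * delta)
    w -= lr * grad
print(w)  # approaches 5.0, since 5.0 * 2.0 == 10.0
```

Real networks replace the nudge-and-measure estimate with exact partial derivatives computed by the chain rule, which is what the next sections build up to.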
Two points are worth noting here:

The first is the activation function. Its main purpose is to make the whole network nonlinear. As mentioned earlier, in many cases linear functions cannot classify inputs properly (and recognition is mostly classification).

So we want the network to learn a nonlinear function. That is where the activation function comes in: because it is nonlinear, the whole network acquires nonlinear behavior.

In addition, the activation function keeps each node's output within a controllable range, which makes computation easier.

If that explanation is not plain enough, the dating metaphor works here too: girls don't like plain boiled water, because it is linear; life needs a bit of romance. The activation function is like the occasional romance, the little surprise, in a relationship.

At every stage of the relationship, you need to activate from time to time: create a bit of romance, a small surprise.
For example, a girl may casually admire a cute little cup or some porcelain; give her a special one on her birthday and she is moved to tears.

We said earlier that a man should be humorous. That is to make her laugh; and at the right moment, make her cry. A few rounds of laughing and crying and she can't leave you, because your nonlinearity is too strong.

Of course, going beyond the limit is as bad as falling short. More small surprises are not always better, but none at all is plain boiled water. It is the same with the network: every layer can have an activation function. You don't have to add one to every layer, but having none at all won't work.
The key is how to take the partial derivatives. Figures 2 and 3 show the derivations, and they are quite simple: take partial derivatives from right to left, layer by layer. Differentiating between adjacent layers is easy, because the connection is linear, so the partial derivative is simply the weight itself, exactly like differentiating 35x₃ with respect to x₃ above. Then multiply the layers' partial derivatives together.
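Multiplying the layer-by-layer partial derivatives is just the chain rule. Here is a minimal sketch with two "layers" (the functions f and g are invented for illustration): the derivative of the composition is the product of the local derivatives, which a numerical check confirms.

```python
# Minimal sketch of the chain rule behind right-to-left differentiation:
# for y = f(g(x)), dy/dx = f'(g(x)) * g'(x). The two functions below
# stand in for two adjacent "layers" and are invented for illustration.

def g(x):
    return 3 * x + 1        # inner layer, local derivative g'(x) = 3

def f(u):
    return u ** 2           # outer layer, local derivative f'(u) = 2u

def dy_dx(x):
    u = g(x)
    return (2 * u) * 3      # multiply the local partial derivatives

# Numerical check at x = 2:
h = 1e-6
numeric = (f(g(2 + h)) - f(g(2 - h))) / (2 * h)
print(dy_dx(2), numeric)    # both are 42 (up to rounding)
```

Backpropagation in Figures 2 and 3 is this same product, extended across however many layers the network has.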
Two more points. One is again the activation function. There is nothing mysterious about it: it maps each node's output into the interval from 0 to 1, which is convenient for computation; it is just one more one-to-one mapping applied to the result.

Because of the activation function, you must include it when taking the partial derivatives. The sigmoid is commonly used; ReLU and others work too. Differentiating the sigmoid is very simple:

f'(x) = f(x) · [1 − f(x)]

If you have time, you can look this up in a calculus textbook; if not, just memorize it. As for ReLU, it is even easier: f(x) is 0 when x < 0, and f(x) = x otherwise.

Of course, you can also define your own ReLU variant, for example y = 0.01x when x < 0 (the "leaky" version); that works too.
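The activations above and their derivatives fit in a few lines. This sketch implements sigmoid with its f(x)·(1 − f(x)) derivative, the basic ReLU, and the self-defined leaky variant with slope 0.01 for negative inputs.

```python
import math

# Minimal sketch of the activations mentioned above. Note that the
# sigmoid's derivative reuses its own output: f'(x) = f(x) * (1 - f(x)).

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_deriv(x):
    fx = sigmoid(x)
    return fx * (1 - fx)

def relu(x):
    return x if x > 0 else 0.0      # 0 when x < 0, x otherwise

def leaky_relu(x, slope=0.01):
    # The "define your own" variant: a small slope for negative inputs.
    return x if x >= 0 else slope * x

print(sigmoid_deriv(0))  # 0.25, the largest the sigmoid's slope ever gets
```

The f(x)·(1 − f(x)) form is why the sigmoid is convenient during backpropagation: the derivative is computed from an output the forward pass already produced.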
What is the learning coefficient?

The other point is the learning coefficient. Why is it called the learning coefficient?

We just discussed the increment ∆. How big should each adjustment be? Should it simply equal the partial derivative (the rate of change)?

Experience says to multiply it by a fraction first. That fraction is the learning coefficient (learning rate), and it can be changed as training deepens.
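The update rule is therefore: new parameter = old parameter − learning coefficient × partial derivative. The sketch below (toy error function and a made-up decay schedule, for illustration only) also shows the coefficient shrinking as training deepens.

```python
# Minimal sketch: the parameter update is
#   w  <-  w - learning_coefficient * partial_derivative,
# and the coefficient may shrink as training deepens. The error
# function and the decay schedule below are made up for illustration.

def error_grad(w):
    return 2 * (w - 3.0)    # derivative of the error (w - 3)^2

w = 0.0
initial_lr = 0.2
for step in range(50):
    lr = initial_lr / (1 + 0.1 * step)  # decay the coefficient over time
    w -= lr * error_grad(w)
print(w)  # moves toward 3.0, where the error is smallest
```

A larger coefficient moves faster but risks overshooting; shrinking it over time lets training settle down near the minimum.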
Of course, there are other important basics too, such as SGD (stochastic gradient descent), mini-batches, and epochs (ways of selecting from the training set).
What we have covered above is mainly how to adjust the parameters; it is the beginner's stage. As noted, before any adjustment there is already a default network model with default parameters. How the initial model and parameters are defined is something that takes further study.

For ordinary engineering work, though, tuning parameters on a default network is enough; that is using the algorithms. Scholars and scientists are the ones who invent the algorithms, which is very hard. Salute them!
Source: Jacky Yang on Zhihu

The more we share, the more we have.