In this era of artificial intelligence, an aspiring programmer, student, or hobbyist who doesn't understand the hot topic of deep learning can seem out of touch with the times.




However, the mathematics that deep learning requires, including calculus, linear algebra, probability theory, and mathematical statistics, makes most ambitious young people hesitate. So here comes the question: to understand deep learning, do you really need all this knowledge?











There is a lot of material about deep learning on the Internet, but most of it is not suitable for beginners. Mr. Yang summed up several reasons:

* Deep learning really does need a certain mathematical foundation. If it is not explained in simple, accessible terms, some readers will be intimidated and give up too early.

* Books and articles written by Chinese or American authors are generally difficult.





The mathematical foundation needed for deep learning is not as difficult as you might expect; you only need to understand the concept of a derivative and a few related functions. If you haven't studied advanced mathematics, all the better: this article is written so that even liberal arts students can follow it, and junior high school mathematics is enough.





Don't be afraid of difficulty. I admire the spirit of Li Shufu, who said in a TV interview: "Who says the Chinese can't build cars? What's so hard about building a car? It's just four wheels and two rows of sofas." Of course, his conclusion is one-sided, but the spirit is admirable.




"Wang Xiaoer sells pigs": derivatives in deep learning

What is a derivative?

It is simply a rate of change. For example, Wang Xiaoer sold 100 pigs this year, 90 last year, and 80 the year before. What is the rate of change, or growth? 10 more pigs per year. Simple.




Notice that there is a time variable here: the year. The growth rate of Wang Xiaoer's pig sales is 10 pigs per year; in other words, the derivative is 10.




Consider the function y = f(x) = 10x + 30. Here we assume that Wang Xiaoer sold 30 pigs in the first year (x = 0) and 10 more each year after that; x represents time (in years) and y the number of pigs.




Of course, this is the case of a fixed growth rate. In real life, the amount of change is often not fixed; that is, the growth rate is not constant.




For example, the function might look like this: y = f(x) = 5x² + 30. Here x and y still represent time and the number of pigs, but the growth rate is no longer constant. How to calculate it will be discussed later; alternatively, you can simply memorize a few differentiation formulas.
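One way to see the changing growth rate without memorizing any formula is to measure it numerically. The sketch below is only an illustration: the function names and the step size h are assumptions, not part of the original article. It estimates the slope of both sales functions at a few years.

```python
def numerical_derivative(f, x, h=1e-6):
    """Estimate the rate of change of f at x with a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)


def steady_sales(x):
    # y = 10x + 30: the growth rate is the same every year.
    return 10 * x + 30


def accelerating_sales(x):
    # y = 5x^2 + 30: the growth rate itself grows with time.
    return 5 * x ** 2 + 30


for year in (1, 2, 3):
    steady = numerical_derivative(steady_sales, year)
    accelerating = numerical_derivative(accelerating_sales, year)
    print(year, round(steady, 3), round(accelerating, 3))
```

The first function always grows by about 10 pigs per year, while the second grows by about 10x pigs per year, which matches the textbook rule that the derivative of 5x² is 10x.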







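Deep learning also relies on another important mathematical concept: the partial derivative.

How should we understand the "partial" in "partial derivative"? Is it the "partial" of a one-sided headache? Or the stubbornness of "I told you not to differentiate, but you insist on doing it anyway"?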




Neither. Let's take Wang Xiaoer as an example again. As we just said, the variable x is time (in years). But pig sales depend on more than time: as the business grew, Wang Xiaoer not only expanded the pig farm but also hired many employees to raise pigs.




So the equation changes again: y = f(x₁, x₂, x₃) = 5x₁² + 8x₂ + 35x₃ + 30




Here x₂ represents the area of the farm and x₃ the number of employees; x₁, of course, is still time.




Using dating as an example to explain the "partial derivative" in deep learning

What is a partial derivative?

A partial derivative is simply the rate of change with respect to one variable when there are several variables. In the formula above, taking the partial derivative with respect to x₃ tells us how much the employees contribute to the growth in pigs.




In other words, how many more pigs are sold for each additional employee? Here the answer is 35: every additional employee means 35 more pigs sold.




When calculating a partial derivative, the other variables can be treated as constants. This is very important: the rate of change of a constant is 0, so its derivative is 0. All that remains is to differentiate 35x₃, which gives 35. The partial derivative with respect to x₂ is found in the same way.
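The "treat the other variables as constants" rule can be checked numerically: nudge one variable, hold the rest fixed, and see how much y changes. This is a minimal sketch; the helper names and the sample point (3 years, area 20, 5 employees) are made up for illustration.

```python
def pigs_sold(x1, x2, x3):
    # y = 5*x1^2 + 8*x2 + 35*x3 + 30 (x1: time, x2: area, x3: employees)
    return 5 * x1 ** 2 + 8 * x2 + 35 * x3 + 30


def partial_derivative(f, point, index, h=1e-6):
    """Rate of change of f along one variable, with the others held fixed."""
    up = list(point)
    down = list(point)
    up[index] += h
    down[index] -= h
    return (f(*up) - f(*down)) / (2 * h)


point = (3.0, 20.0, 5.0)  # hypothetical values for x1, x2, x3
for index, name in enumerate(("x1", "x2", "x3")):
    print(name, round(partial_derivative(pigs_sold, point, index), 3))
```

At this point the partial derivative with respect to x₃ comes out as about 35 and with respect to x₂ as about 8, no matter what the other variables are; only the x₁ result (about 10·x₁) depends on the point chosen.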




Partial derivatives are written with the symbol ∂: for example, ∂y/∂x₃ denotes the partial derivative of y with respect to x₃.







After all this rambling, what does any of it have to do with deep learning? Quite a lot. Deep learning uses neural networks to solve problems that are not linearly separable.




Here I mainly discuss the relationship between mathematics and deep learning. Let me show you some pictures first:



Chart 1: Deep learning is a neural network with many hidden layers



Chart 2: How to find the partial derivatives with a single output



Chart 3: How to find the partial derivatives with multiple outputs




The last two pictures come from a Japanese book on deep learning. The terms "entrance layer", "output layer", and "middle layer" used there correspond to what we usually call the input layer, output layer, and hidden layer.




Don't be scared by these pictures; it is actually very simple. Take dating as an example. We can roughly divide a romance into three stages:

* The early stage. This corresponds to the input layer in deep learning. Many factors attract you to someone: height, figure, looks, education, personality, and so on. These are the input-layer parameters, and their weights may differ from person to person.

* The passionate stage. Let's map this one to the hidden layer. This is the period of mutual adjustment in all its forms, amid the trivia of everyday life.

* The stable stage. This corresponds to the output layer. Whether the two of you are a good match depends on how the adjustment went. As everyone knows, adjustment is very important. How is it done? Through a process of continual learning, training, and correction!

For example, your girlfriend likes strawberry cake but you bought blueberry. Her feedback was negative, so next time you don't buy blueberry; you switch to strawberry.




After reading this, some guys may start "tuning parameters" on their girlfriends. That makes me a little uneasy, so let me add a note. Dating is like deep learning: you have to guard against underfitting, and against overfitting as well.




Underfitting, in deep learning, means that there has not been enough training or enough data; it is like being inexperienced at dating. To fit properly, sending flowers is of course the bare minimum, and other aspects need work too, such as improving your sense of humor. One thing should be mentioned here: underfitting is not good, but overfitting is even worse.




Overfitting is the opposite of underfitting. For one thing, if you overfit, she will think you have the makings of an Edison Chen. For another, every girl is different. It is just like deep learning: the model works well on the training set but poorly on the test set!




In dating terms, she will feel that you have been heavily shaped by your previous relationships (the training set), and that is a big taboo! If you leave her with that impression, you will have plenty of trouble later. Remember that!




Deep learning is also a process of continual adjustment. You start by defining a set of default parameters (these are empirical values, like knowing that flowers are a must on Valentine's Day and birthdays), and then correct them again and again until you arrive at the weights between the nodes shown in Chart 1.




Why is this kind of adjustment necessary? Think of it this way: suppose deep learning is a child. How would we teach him to recognize what is in a picture?




You first have to show him pictures and tell him the right answers. It takes a lot of pictures, and constant teaching and training. This training process is in fact similar to solving for the weights of a neural network. Later, at test time, all you have to do is give him a picture, and he will know what is in it.




So the training set is like the pictures with correct answers that we show the child. In deep learning, the training set is used to solve for the weights of the neural network and so obtain the final model, while the test set is used to verify the accuracy of that model.
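The division of labor between the two sets can be illustrated with the pig-sales line instead of pictures. In this sketch the sales records are invented, and a closed-form least-squares line fit stands in for "solving the weights"; a real neural network would be trained iteratively.

```python
def fit_line(points):
    """Least-squares fit of y = w*x + b to (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    w = covariance / variance
    return w, mean_y - w * mean_x


def mean_abs_error(w, b, points):
    """Average gap between the model's predictions and the right answers."""
    return sum(abs(w * x + b - y) for x, y in points) / len(points)


# Hypothetical records (year, pigs sold), roughly following y = 10x + 30.
records = [(0, 31), (1, 39), (2, 51), (3, 60), (4, 69), (5, 81), (6, 90), (7, 99)]
training_set = records[:6]  # used to solve for the weights
test_set = records[6:]      # held back to check the finished model

w, b = fit_line(training_set)
print("weights:", round(w, 2), round(b, 2))
print("test error:", round(mean_abs_error(w, b, test_set), 2))
```

The model never sees the last two years while the weights are being solved, so a small error on them is evidence that it learned the trend rather than memorizing the answers.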




For a trained model, as shown in the charts below, the weights (w1, w2, ...) are all known.



Chart 4



Chart 5




As shown above, computing from left to right is easy. But what about the reverse? The training set has the pictures and the right answers, and we need to work backward to w1, w2, and so on. How is that done?




How to find the partial derivative?




After all this buildup, it is finally time to take partial derivatives. The current situation is:






Suppose a neural network has already been defined: how many layers it has, how many nodes there are in each layer, and its default weights and activation functions. With the input (an image) fixed, the output can only be changed by adjusting the parameters. How should they be adjusted?




Each parameter has a default value. We add a small amount ∆ to a parameter and see what happens. If increasing the parameter makes the gap between the output and the right answer wider, then the parameter should be decreased by ∆ instead, because our goal is to make the gap smaller; and vice versa.




So to optimize the parameters, we need to know how fast the error changes with respect to each parameter. Isn't that exactly the partial derivative of the error with respect to the parameter?
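The nudge-and-check idea can be tried on the smallest possible "network": one weight, one input, one target. Everything here (the squared-error gap, the numbers, the function names) is an assumption chosen for illustration.

```python
def error(w, x=2.0, target=10.0):
    """Squared gap between the output w*x and the right answer."""
    return (w * x - target) ** 2


def nudge(w, delta=0.01):
    """Try w + delta; keep it if the gap shrinks, otherwise step the other way."""
    if error(w + delta) < error(w):
        return w + delta
    return w - delta


def error_slope(w, delta=1e-6):
    """Change in error per unit change in w: an estimate of the partial derivative."""
    return (error(w + delta) - error(w)) / delta


w = 1.0
print("slope at start:", round(error_slope(w), 3))  # negative: increasing w helps
for _ in range(1000):
    w = nudge(w)
print("w after nudging:", round(w, 2))
```

The weight ends up hovering around 5, where w * 2 matches the target of 10. The sign of the slope says which direction to move, which is why the partial derivative is the quantity worth computing.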




There are two points to note here.
The first is the activation function, whose main purpose is to make the whole network nonlinear. As mentioned earlier, in many cases a linear function cannot classify the inputs properly (and recognition, in many cases, is mainly classification).




So we want the network to learn a nonlinear function, and this is where the activation function comes in: because it is nonlinear, the whole network takes on nonlinear characteristics.




In addition, the activation function keeps the output of each node within a controllable range, which makes computation easier.
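Why the nonlinearity matters can be seen with two one-node "layers". Without an activation between them, the stack collapses into a single straight line; with a sigmoid in between, it does not. The weights below are arbitrary values picked for this sketch.

```python
import math


def linear(w, b):
    """Return a one-input linear layer x -> w*x + b."""
    return lambda x: w * x + b


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


layer1 = linear(2.0, 1.0)
layer2 = linear(-3.0, 0.5)


def stacked_linear(x):
    # Two linear layers with nothing in between.
    return layer2(layer1(x))


# The same mapping written as ONE linear layer: w = -3*2, b = -3*1 + 0.5.
collapsed = linear(-6.0, -2.5)


def stacked_nonlinear(x):
    # The same two layers with a sigmoid activation between them.
    return layer2(sigmoid(layer1(x)))


for x in (0.0, 1.0, 2.0):
    print(x, stacked_linear(x), collapsed(x), round(stacked_nonlinear(x), 4))
```

However many purely linear layers are stacked, the result is still one linear function, so extra layers add nothing until an activation is inserted.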





This explanation may not be very accessible, so let's go back to the dating metaphor. Girls don't like plain boiled water, because it is linear. Life needs a bit of romance, and the activation function is like the little romances and small surprises in life.




At every stage of a relationship, you need to "activate" things from time to time with a little romance or a small surprise.
For example, most girls are fond of cute little cups, porcelain, and the like, so give her a specially chosen one on her birthday and move her to tears.




As mentioned earlier, a man should have a sense of humor so that he can make her laugh, and he should also move her to tears at the right moment. A few rounds of laughing and crying, and she won't be able to leave you, because your nonlinearity is too strong.




Of course, too much is as bad as too little: more small surprises are not always better, but with none at all, life is just boiled water. It is the same with layers: every layer can have an activation function, and not every layer has to have one, but having none at all will not work.




The key is how to find the partial derivatives. Chart 2 and Chart 3 show the derivation for each case. It is very simple: just take partial derivatives step by step from right to left. Between adjacent layers the derivation is easy, because the relationship is linear, so the partial derivative is just the coefficient itself, exactly as in the partial derivative with respect to x₃ earlier. Then multiply the partial derivatives along the path together (the chain rule).
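The right-to-left multiplication can be written out for a chain of just two weights. This sketch leaves out the activation function to keep the linear steps visible, and uses a squared error; the network shape, names, and numbers are all assumptions made for illustration.

```python
def forward(w1, w2, x):
    h = w1 * x  # hidden node
    y = w2 * h  # output node
    return h, y


def error(w1, w2, x, target):
    _, y = forward(w1, w2, x)
    return (y - target) ** 2


def gradients(w1, w2, x, target):
    """Partial derivatives of the error, multiplied from right to left."""
    h, y = forward(w1, w2, x)
    d_error_d_y = 2 * (y - target)  # rightmost step: error vs. output
    d_y_d_w2 = h                    # y = w2 * h is linear in w2
    d_y_d_h = w2                    # ...and linear in h
    d_h_d_w1 = x                    # h = w1 * x is linear in w1
    grad_w1 = d_error_d_y * d_y_d_h * d_h_d_w1
    grad_w2 = d_error_d_y * d_y_d_w2
    return grad_w1, grad_w2


def numeric_gradient_w1(w1, w2, x, target, delta=1e-6):
    """Nudge-based estimate, used only to double-check the chain rule."""
    return (error(w1 + delta, w2, x, target) - error(w1 - delta, w2, x, target)) / (2 * delta)


grad_w1, grad_w2 = gradients(0.5, -1.5, 2.0, 1.0)
print(grad_w1, grad_w2, round(numeric_gradient_w1(0.5, -1.5, 2.0, 1.0), 4))
```

Each factor is just a neighboring value in the network, which is why the product of simple partial derivatives agrees with the slower nudge-and-measure estimate.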




To repeat, the first of the two points is the activation function. There is really nothing mysterious about it: an activation function such as sigmoid squeezes the output of each node into the interval from 0 to 1, which makes computation easy. It is simply one more layer of mapping applied to the result, and the mapping is one-to-one.




Because the activation function is there, it has to be included when taking the partial derivatives. The activation function is usually sigmoid, though ReLU and others can also be used. Differentiating the activation function is very simple:






Sigmoid: f(x) = 1 / (1 + e^(-x))

Derivative: f'(x) = f(x) * [1 - f(x)]
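Assuming f here is the sigmoid function f(x) = 1 / (1 + e^(-x)), the formula can be checked against a numerical estimate. The function names in this sketch are illustrative only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """f'(x) = f(x) * (1 - f(x))"""
    s = sigmoid(x)
    return s * (1.0 - s)


def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)


for x in (-2.0, 0.0, 2.0):
    formula = sigmoid_derivative(x)
    estimate = numerical_derivative(sigmoid, x)
    print(x, round(formula, 6), round(estimate, 6))
```

The practical benefit is that the derivative reuses the value f(x) already computed in the forward pass, so no extra exponential is needed.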




On this point, take a look at a calculus textbook if you have time; if not, just memorize the result. As for ReLU, it is even simpler: y equals 0 when x < 0, and y equals x otherwise.




Of course, you can also define your own ReLU-style function, for example one where y equals 0.01x instead of 0 when x is less than 0 (a variant commonly known as Leaky ReLU). That works too.
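ReLU and one possible custom variant fit in a few lines. The variant shown is the common "leaky" form, which keeps a small slope of 0.01 for negative inputs; treating that as the custom function is an assumption of this sketch, as are the function names.

```python
def relu(x):
    """y = 0 when x < 0, y = x otherwise."""
    return x if x > 0 else 0.0


def relu_derivative(x):
    # Slope is 1 on the positive side and 0 on the negative side.
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, slope=0.01):
    """Like ReLU, but negative inputs keep a small slope instead of being zeroed."""
    return x if x >= 0 else slope * x


for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), relu_derivative(x), leaky_relu(x))
```

Because both derivatives are constants on each side of zero, the backward pass for these activations needs no exponentials at all.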




What is the learning coefficient?




The other point is the learning coefficient (more commonly called the learning rate). Why is it called the learning coefficient?




We just talked about the increment ∆. How much should be added each time? Should it simply equal the partial derivative (the rate of change)?




Experience tells us that the partial derivative needs to be multiplied by a percentage. This percentage is the learning coefficient, and it can be changed as training progresses.
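The effect of that percentage shows up even with a single weight. In this sketch the error is the squared gap (w * 2 - 10)², and the specific coefficients 0.1 and 0.3 and the decay factor 0.9 are arbitrary choices made for illustration.

```python
def gradient(w, x=2.0, target=10.0):
    """Partial derivative of the squared error (w*x - target)^2 with respect to w."""
    return 2 * (w * x - target) * x


def train(w, learning_rate, steps, decay=1.0):
    """Repeatedly step against the gradient, scaled by the learning coefficient."""
    for _ in range(steps):
        w -= learning_rate * gradient(w)
        learning_rate *= decay  # decay < 1 shrinks the coefficient over time
    return w


print("small coefficient:", train(1.0, 0.1, 20))
print("too large:", train(1.0, 0.3, 20))
print("large but decaying:", train(1.0, 0.3, 20, decay=0.9))
```

With 0.1 the weight settles at 5; with 0.3 every step overshoots by more than the last and the weight runs away; shrinking the same 0.3 a little each step brings it back under control. That is one reason the coefficient is often reduced as training goes on.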





Of course, there is other important basic knowledge as well, such as SGD (stochastic gradient descent), mini-batches, and epochs (ways of selecting samples from the training set).
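These three terms fit together in one short loop: an epoch is one pass over the training set, a mini-batch is the small slice used for each update, and the random shuffling is what makes the descent "stochastic". The sketch below recovers the line y = 10x + 30 from ten noise-free points; the data, the hyperparameters, and the function names are assumptions chosen for this example.

```python
import random


def make_data():
    # Ten (year, pigs sold) pairs that follow y = 10x + 30 exactly.
    return [(float(x), 10.0 * x + 30.0) for x in range(10)]


def train_sgd(data, learning_rate=0.005, batch_size=2, epochs=2000, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):                 # one epoch = one pass over the data
        samples = data[:]
        rng.shuffle(samples)                # random order: the "stochastic" part
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]   # one mini-batch
            # Average partial derivatives of the squared error over the batch.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


w, b = train_sgd(make_data())
print(round(w, 2), round(b, 2))
```

Each update looks at only two samples, so it is cheap, and many such cheap, slightly noisy steps still move the weights toward 10 and 30.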




What has been described above is mainly how to adjust the parameters, which is only the elementary stage. As mentioned, a default network model and default parameters exist before any adjustment begins. How should the initial model and parameters be defined? That requires further study.




For general engineering work, however, tuning the parameters of a default network is enough, which amounts to using algorithms. Scholars and scientists, on the other hand, invent algorithms, which is very difficult. Hats off to them!




Source: Zhihu, Jacky Yang






