Face recognition only became truly practical with the arrival of deep learning. Earlier machine learning techniques struggled to extract suitable feature values from images. Should the features be contours? Colors? Eyes? There are so many faces, and differences in age, lighting, shooting angle, complexion, expression, makeup, and accessories make pictures of the same person look very different at the pixel level. Relying on expert experience and trial and error, it is hard to obtain highly accurate feature values, and so there is no way to classify those feature values further. The biggest advantage of deep learning is that the training algorithm adjusts the weight parameters itself, constructing a highly accurate function f(x) that, given a picture, produces its feature values, which can then be classified. This article tries to discuss face recognition in plain language. It first gives an overview of face recognition technology, then discusses why deep learning is effective and why gradient descent can train suitable weight parameters, and finally describes face recognition based on CNNs (convolutional neural networks).
1. Overview of face recognition technology
Face recognition technology consists of two parts: face detection and face recognition.
Face detection is necessary not only to find out whether a photo contains a face but, more importantly, to crop away the parts of the picture that are not face. Otherwise, passing the pixels of the whole picture to the recognition function f(x) will certainly not work. Face detection does not have to use deep learning, because the requirements here are relatively low: we only need to know whether there is a face and roughly where it is in the picture. Usually we consider the face detection functions of open-source libraries such as OpenCV and dlib (these use traditional feature-based methods built on expert experience, which need less computation and run faster). Deep-learning-based techniques such as MTCNN can also be used (when the neural network is deep and wide, the computation is heavy and slow).
In face detection we focus on three metrics: detection rate, miss rate, and false detection rate, where:
Detection rate: the proportion of images in which a face exists and is detected, among all images that contain a face;
Miss rate: the proportion of images in which a face exists but is not detected, among all images that contain a face;
False detection rate: the proportion of images that contain no face but in which a face is detected, among all images that contain no face.
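The three rates defined above can be computed from per-image ground truth and detector output. The following is a minimal sketch; the function name `detection_metrics` and the sample labels are made up for illustration and do not come from any library.

```python
def detection_metrics(has_face, detected):
    """Return (detection rate, miss rate, false detection rate).

    has_face[i]: True if image i really contains a face (ground truth)
    detected[i]: True if the detector reported a face in image i
    """
    # Detector outputs for images that do / do not contain a face.
    on_faces = [d for h, d in zip(has_face, detected) if h]
    on_non_faces = [d for h, d in zip(has_face, detected) if not h]

    detection_rate = sum(1 for d in on_faces if d) / len(on_faces)
    miss_rate = sum(1 for d in on_faces if not d) / len(on_faces)
    false_detection_rate = sum(1 for d in on_non_faces if d) / len(on_non_faces)
    return detection_rate, miss_rate, false_detection_rate


# Made-up results for 8 images: 5 contain a face, 3 do not.
has_face = [True, True, True, True, True, False, False, False]
detected = [True, True, True, True, False, True, False, False]
rates = detection_metrics(has_face, detected)
print(rates)
```

Detection rate and miss rate share the same denominator, so they always sum to 1.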
Of course, speed is also important. Face detection is not covered further in this article.
2. Principles of deep learning technology
Given the pixel value matrix obtained from a clear face image, what function f(x) should we design to convert it into feature values? The answer depends on the classification problem. That is, set feature values aside for a moment and first ask: how do we correctly sort a collection of photos by person? For this we need machine learning. Machine learning assumes that an algorithm can generalize well from a limited set of training samples. So we first find a limited training set in which the mappings x->y are already labeled, and design an initial function f(x;w). If the data x is low-dimensional and simple, for example only two-dimensional, classification is very easy, as shown in the figure below:
The two-dimensional data x in the figure above has only two classes y, squares and circles, which are easy to separate. All we need to learn is the simplest classification function, f(x,y)=ax+by+c, which describes the dividing line. For example, f(x,y) greater than 0 means circle, and less than 0 means square.
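As a small illustration of the rule just described, the sketch below evaluates f(x,y)=ax+by+c for a point and reads the class from the sign. The coefficients and test points are hypothetical values chosen for the example.

```python
def classify(x, y, a, b, c):
    """Return 'circle' if f(x, y) = a*x + b*y + c is positive, else 'square'."""
    return "circle" if a * x + b * y + c > 0 else "square"


# Hypothetical dividing line x + y - 4 = 0, i.e. a=1, b=1, c=-4.
a, b, c = 1.0, 1.0, -4.0
print(classify(3.0, 3.0, a, b, c))  # f = 3 + 3 - 4 = 2, positive side
print(classify(1.0, 1.0, a, b, c))  # f = 1 + 1 - 4 = -2, negative side
```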
Starting from random numbers as the initial values of a, b, and c, we keep optimizing the parameters a, b, c using the training data, gradually training unsuitable classification functions such as L1 and L3 into L2. Such an L2 is then likely to give good results on unseen test data as well. However, if there are multiple classes, several dividing lines are needed to separate them, as shown in the figure below:
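The article does not say which update rule moves a, b, c from their random start toward a good line. As one possible illustration, the sketch below uses the classic perceptron rule (an assumption, not necessarily the author's method): whenever a training point falls on the wrong side, the line is nudged toward classifying it correctly. The data points, the +1/-1 label encoding, and the name `train_line` are made up for the example.

```python
import random


def train_line(points, labels, epochs=1000, lr=0.1, seed=0):
    """Learn a, b, c of f(x, y) = a*x + b*y + c with the perceptron rule.

    labels use +1 for circle and -1 for square (an encoding chosen here).
    """
    rng = random.Random(seed)
    # Random initial values, as in the text.
    a, b, c = (rng.uniform(-1.0, 1.0) for _ in range(3))
    for _ in range(epochs):
        mistakes = 0
        for (x, y), t in zip(points, labels):
            if t * (a * x + b * y + c) <= 0:  # point is on the wrong side
                a += lr * t * x
                b += lr * t * y
                c += lr * t
                mistakes += 1
        if mistakes == 0:  # every training point is classified correctly
            break
    return a, b, c


# Made-up, linearly separable training set.
points = [(3, 3), (4, 3), (3, 4), (4, 4), (0, 0), (1, 0), (0, 1), (1, 1)]
labels = [1, 1, 1, 1, -1, -1, -1, -1]
a, b, c = train_line(points, labels)
print(a, b, c)
```

The perceptron rule only converges when the classes can be separated by a straight line, which is exactly the simple case discussed here.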
This is equivalent to combining the results of several classification functions with the AND (&&) and OR (||) operations. Here a classification function such as f1>0 && f2<0 && f3>0 might be used. But if the data is more complicated, for example when its features are not distinct and do not cluster, this way of finding features no longer works. In the figure below, different colors represent different classes, and the training data is not linearly separable at all:
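The combination f1>0 && f2<0 && f3>0 mentioned above can be sketched as follows. The three lines are hypothetical and were picked only so that together they bound a simple region.

```python
def f1(x, y):
    return x - 1.0  # positive to the right of the line x = 1


def f2(x, y):
    return y - 3.0  # negative below the line y = 3


def f3(x, y):
    return y - 1.0  # positive above the line y = 1


def in_class(x, y):
    """A point belongs to the class only if all three line tests agree."""
    return f1(x, y) > 0 and f2(x, y) < 0 and f3(x, y) > 0


print(in_class(2.0, 2.0))  # inside the region bounded by the three lines
print(in_class(0.0, 2.0))  # fails the f1 test
```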
As the example above shows, although the input picture is a cat, the score for the dog class, 437.9, is the highest. But how much higher is it than the scores for cat and boat? That is hard to gauge. If we convert the scores into percentage probabilities between 0 and 100, it becomes easy to gauge. Here we can use the sigmoid function, as shown in the figure below:
As the formula and figure above show, sigmoid converts any real number into a value between 0 and 1 that can serve as a probability. But sigmoid probabilities are not normalized. We need the probabilities of the input photo over all classes to sum to 1, so we also need to apply softmax as follows:
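To see why sigmoid alone is not enough, the sketch below assumes the standard definition sigmoid(z) = 1 / (1 + e^(-z)) and applies it to three made-up class scores. Each output lies between 0 and 1, but the outputs do not sum to 1.

```python
import math


def sigmoid(z):
    """Standard logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Made-up scores for three classes.
scores = [2.0, 1.0, -1.0]
probs = [sigmoid(s) for s in scores]
print(probs)
print(sum(probs))  # not equal to 1, so these are not a distribution over classes
```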
In this way, given x, we obtain the probability of x under each class. Suppose the scores for three classes are 3, 1, and -3. By the formula above, the probabilities are approximately [0.88, 0.12, 0]. The calculation is shown in the figure below:
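The numbers in this example can be reproduced with a few lines of code. The sketch assumes the usual softmax definition, p_i = e^(s_i) / sum_j e^(s_j); subtracting the largest score first is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    top = max(scores)  # shift for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([3.0, 1.0, -3.0])
print([round(p, 2) for p in probs])  # [0.88, 0.12, 0.0]
```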
But the true probability for x is in fact the first class, that is [1,0,0], while the probability (or likelihood) we have now is [0.88,0.12,0]. How large is the gap between them? This gap is the loss value, loss. How do we compute it? With softmax we use the cross-entropy loss function (which is convenient to differentiate) and minimize it, as follows:
Here i is the correct class; in the example above the loss value is -ln 0.88. Now that we have the loss function f(x), how do we adjust x to minimize the loss? This involves derivatives.
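For a single sample whose true label is one class, the cross-entropy loss reduces to minus the natural log of the probability assigned to the correct class. The following sketch reproduces the -ln 0.88 value from the example; the function name is chosen here for illustration.

```python
import math


def cross_entropy(probs, correct_index):
    """Loss for one sample: -ln of the probability given to the correct class."""
    return -math.log(probs[correct_index])


# Probabilities from the example, with the first class as the correct one.
loss = cross_entropy([0.88, 0.12, 0.0], 0)
print(round(loss, 4))  # 0.1278
```

The closer the probability of the correct class is to 1, the closer the loss is to 0.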
Intuitively, when the slope is positive, moving x a little to the left (smaller) decreases f(x); when the slope is negative, moving x a little to the right (larger) decreases f(x), as shown in the figure below:
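The rule above, move against the sign of the slope, is the core of gradient descent. Here is a minimal sketch on the made-up function f(x) = (x - 3)^2, whose slope is 2(x - 3). The step factor 0.1 and the iteration count are arbitrary choices for the example.

```python
def slope(x):
    """Derivative of f(x) = (x - 3)**2."""
    return 2.0 * (x - 3.0)


def descend(x, step=0.1, iterations=100):
    for _ in range(iterations):
        # Positive slope: x decreases. Negative slope: x increases.
        x -= step * slope(x)
    return x


from_right = descend(10.0)  # slope starts positive, so x moves left
from_left = descend(-5.0)   # slope starts negative, so x moves right
print(from_right, from_left)  # both approach the minimum at x = 3
```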
In this way, at a point where the slope is 0 the function f attains a minimum. But when we move x a little to the left or right, how far should it move? If it moves too far, it may overshoot; if it moves very little, it may take a long time to find the minimum point. There is another problem: if f(x) has several local minimum points as well as a global minimum, and x moves in very small steps, following the derivative may lead only to a local minimum that is not small enough, as shown in the figure below:
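Both problems can be observed numerically. The sketch below first tries three step sizes on f(x) = x^2, then runs gradient descent from two starting points on f(x) = x^4 - 3x^2 + x, a made-up function with one local and one global minimum. All functions, step sizes, and starting points are assumptions chosen for the demonstration.

```python
def descend(slope, x, step, iterations):
    for _ in range(iterations):
        x -= step * slope(x)
    return x


# Part 1: step size, on f(x) = x**2 with slope 2x, starting at x = 1.
def parabola_slope(x):
    return 2.0 * x


good = descend(parabola_slope, 1.0, step=0.1, iterations=50)     # approaches 0
tiny = descend(parabola_slope, 1.0, step=0.0001, iterations=50)  # barely moves
huge = descend(parabola_slope, 1.0, step=1.1, iterations=50)     # overshoots, grows
print(good, tiny, huge)


# Part 2: local versus global minimum, on f(x) = x**4 - 3x**2 + x.
def f(x):
    return x ** 4 - 3.0 * x ** 2 + x


def f_slope(x):
    return 4.0 * x ** 3 - 6.0 * x + 1.0


from_right = descend(f_slope, 2.0, step=0.01, iterations=2000)
from_left = descend(f_slope, -2.0, step=0.01, iterations=2000)
print(from_right, f(from_right))  # stops in the shallower valley
print(from_left, f(from_left))    # reaches the deeper valley
```

With small steps, which valley is found depends only on where the descent starts.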
Above, we looked at gradient descent with one-dimensional data, but our photos are multi-dimensional. How do we take the derivative then, and how do we descend along the gradient? Here we need the concept of the partial derivative. It is very similar to the ordinary derivative: because x is a multi-dimensional vector, we take the derivative with respect to Xi while holding the other components of x fixed, and this is the partial derivative with respect to Xi. Gradient descent is then applied as in the figure below: θ is two-dimensional, so we take the partial derivatives with respect to θ0 and θ1 separately and move θ0 and θ1 by the corresponding step in both directions at once to find the lowest point:
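As an illustration of descending in two dimensions, the sketch below fits a straight line y = θ0 + θ1·x to made-up data by minimizing a mean squared error cost. The cost function, the data, and the step size are assumptions for the example; the figure in the original may use a different cost surface.

```python
# Made-up data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def partials(theta0, theta1):
    """Partial derivatives of J = (1 / 2m) * sum((theta0 + theta1*x - y)**2)."""
    m = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    d_theta0 = sum(errors) / m                             # theta1 held fixed
    d_theta1 = sum(e * x for e, x in zip(errors, xs)) / m  # theta0 held fixed
    return d_theta0, d_theta1


theta0, theta1 = 0.0, 0.0
step = 0.05
for _ in range(5000):
    d0, d1 = partials(theta0, theta1)
    # Move both parameters at once, each along its own partial derivative.
    theta0 -= step * d0
    theta1 -= step * d1

print(theta0, theta1)  # close to 1 and 2
```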
4. Face recognition based on CNN (convolutional neural networks)
Let's start with the fully connected network. Google's TensorFlow Playground gives an intuitive feel for the power of fully connected neural networks. The playground is at http://playground.tensorflow.org/, where you can train a neural network in the browser and visualize the process and the results, as shown in the figure below:
A CNN instead fully connects only one rectangular window of the picture at a time (this window is called a convolution kernel). Sliding the window across the whole picture with the same weight parameters w yields the input to the next layer, as shown in the figure below:
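The sliding-window idea can be sketched in plain Python. This version moves the kernel one pixel at a time with no padding and, as is usual in CNNs, multiplies the window by the kernel element by element without flipping the kernel. The image and kernel values are made up.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) with shared weights."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    output = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Weighted sum of the k x k window whose top-left corner is (i, j).
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(k)
                for dj in range(k)
            )
            row.append(total)
        output.append(row)
    return output


image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [1, 0],
    [0, 1],
]
result = convolve2d(image, kernel)
print(result)  # [[2, 4, 6], [0, 2, 4], [6, 0, 2]]
```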
In a CNN, weight parameters within the same layer can be shared, because different regions of the same picture have a certain similarity. This solves the problem of excessive computation in the original fully connected network, as shown in the figure below:
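A rough parameter count shows how much sharing saves. The sizes below (a 100x100 single-channel image, an equally sized fully connected hidden layer, one 3x3 kernel) are hypothetical and only meant to convey the order of magnitude.

```python
# Hypothetical sizes for the comparison.
image_pixels = 100 * 100
hidden_units = 100 * 100

# Fully connected: every hidden unit has its own weight for every pixel.
fully_connected_weights = image_pixels * hidden_units

# Convolution: one 3x3 kernel whose weights are reused at every position.
kernel_size = 3
shared_weights = kernel_size * kernel_size

print(fully_connected_weights)  # 100000000
print(shared_weights)           # 9
```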
Combining the forward computation of the function described earlier with matrices, the animation below shows the forward pass visually:
Here the size of the convolution kernel, the stride with which it moves, and the output depth determine the size of the next layer of the network. Also, when the kernel size and stride do not fit the matrix of the previous layer, padding is used to fill in zeros (the gray 0s in the figure above). This is called the convolution operation, and such a layer of neurons is called a convolution layer. W0 and W1 above indicate that the depth is 2.
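The relationship between kernel size, stride, padding, and output size is commonly written as output = (input - kernel + 2 * padding) / stride + 1. The sketch below applies that formula; the concrete numbers (a 5x5 input, 3x3 kernel, padding 1, stride 2, two kernels) are assumed for illustration and are not stated in the article.

```python
def conv_output_size(input_size, kernel_size, stride, padding):
    """Width (or height) of a convolution layer's output."""
    span = input_size - kernel_size + 2 * padding
    if span % stride != 0:
        raise ValueError("kernel and stride do not fit; adjust the padding")
    return span // stride + 1


# Assumed example values: 5x5 input, 3x3 kernel, padding 1, stride 2.
side = conv_output_size(input_size=5, kernel_size=3, stride=2, padding=1)
depth = 2  # one output channel per kernel, e.g. W0 and W1
print((side, side, depth))  # (3, 3, 2)
```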
A CNN usually adds an activation layer after each convolution layer. The activation layer is a function that converts the output of the convolution layer into another value in a non-linear way, preserving the ordering of values while limiting their range, so that the whole network can be trained. In face recognition, the ReLU function is usually used as the activation layer. ReLU is max(0,x), as follows:
As you can see, ReLU is very cheap to compute!
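In code, ReLU is a single comparison per value. A minimal sketch with made-up inputs:

```python
def relu(x):
    """ReLU activation: negative inputs become 0, others pass through unchanged."""
    return max(0.0, x)


activated = [relu(v) for v in [-2.0, -0.5, 0.0, 1.5, 3.0]]
print(activated)  # [0.0, 0.0, 0.0, 1.5, 3.0]
```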
A CNN also has pooling layers. When the output of a layer is too large, a pooling layer can reduce the data dimension, cutting the amount of data while retaining features. For example, the following reduces a 4*4 matrix to a 2*2 matrix:
In the figure above, pooling keeps the largest number in each colored block, which reduces the amount of data to compute.
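Max pooling over non-overlapping 2x2 blocks can be sketched as follows. The matrix values are made up and are not necessarily those in the figure.

```python
def max_pool(matrix, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    pooled = []
    for i in range(0, len(matrix), size):
        row = []
        for j in range(0, len(matrix[0]), size):
            block = [
                matrix[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(max(block))
        pooled.append(row)
    return pooled


matrix = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
pooled = max_pool(matrix)
print(pooled)  # [[6, 8], [3, 4]]
```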
Usually the last layer of the network is a fully connected layer, so a typical CNN structure looks like this:
CONV is a convolution layer, and each CONV is followed by a RELU layer. This is only a schematic; real networks are much more complex. The currently open-source implementation of Google's FaceNet uses the resnet v1 network (inception_resnet_v1 in the code linked below) for face recognition. For the resnet network, see the paper https://arxiv.org/abs/1602.07261. The complete network is more complex and is not listed here. You can also read the Python code implemented in TensorFlow at https://github.com/davidsandberg/facenet/blob/master/src/models/inception_resnet_v1.py. Note that slim.conv2d includes the ReLU activation layer.
The above describes a general-purpose CNN. Face recognition, however, is not direct classification: there is a registration (enrollment) phase, which requires extracting the feature values of a picture. Taking the data just before softmax classification directly as feature values does not work well. For example, the figure below converts the output of the fully connected layer directly into 2D vectors and visualizes the classes by color on a two-dimensional plane:
Clearly the result is not good: samples in the middle are too close together. After processing with the center loss method, the feature values of different classes are pulled further apart, as shown in the figure below:
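The article does not give the center loss formula. A commonly cited form, assumed here, is half the sum of squared distances between each feature vector and the center of its own class. Minimizing it packs each class tightly around its center, which leaves clearer gaps between classes. The 2D features and centers below are made up, and in practice this term is added to the softmax loss rather than used alone.

```python
def center_loss(features, labels, centers):
    """Half the sum of squared distances from each feature to its class center."""
    total = 0.0
    for feature, label in zip(features, labels):
        center = centers[label]
        total += sum((f - c) ** 2 for f, c in zip(feature, center))
    return 0.5 * total


# Made-up 2D features for one class whose center is (2, 0).
centers = {0: (2.0, 0.0)}
spread_out = center_loss([(1.0, 0.0), (3.0, 0.0)], [0, 0], centers)
compact = center_loss([(1.9, 0.0), (2.1, 0.0)], [0, 0], centers)
print(spread_out, compact)  # the compact cluster has the smaller loss
```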
Feature values extracted this way work much better.
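Once good feature values are available, the registration phase mentioned above can be pictured as storing one feature vector per person and later matching a new face to the nearest stored vector. The sketch below is only an illustration under assumptions: the article does not specify the distance measure or threshold, so Euclidean distance, the threshold 0.3, the names, and the 2D vectors are all hypothetical.

```python
import math


def distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def identify(feature, registered, threshold):
    """Return the closest registered name, or None if nobody is close enough."""
    best_name, best_distance = None, float("inf")
    for name, reference in registered.items():
        d = distance(feature, reference)
        if d < best_distance:
            best_name, best_distance = name, d
    return best_name if best_distance <= threshold else None


# Hypothetical feature vectors stored at registration time.
registered = {"alice": [0.1, 0.9], "bob": [0.8, 0.2]}

print(identify([0.15, 0.85], registered, threshold=0.3))  # close to alice
print(identify([0.5, 0.5], registered, threshold=0.3))    # too far from both
```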