Face recognition only became truly practical after the rise of deep learning. With earlier machine learning techniques, it was hard to extract good feature values from images: should we use the outline? The color? The eyes? There are so many faces, and with differences in age, lighting, shooting angle, complexion, expression, makeup, and accessories, pictures of the same person can differ enormously at the pixel level. It is difficult to hand-craft highly accurate features through expert experience and trial and error, and so there is no way to classify on top of them. The biggest advantage of deep learning is that the training algorithm adjusts the weight parameters itself, constructing a highly accurate function f(x) that, given a picture, produces a feature value which can then be classified. This article tries to discuss face recognition technology in plain language: first an overview of face recognition, then why deep learning is effective and why gradient descent can train appropriate weight parameters, and finally face recognition based on CNN (convolutional neural networks).

1. Overview of face recognition technology

Face recognition technology consists of two parts: face detection and face recognition.


Why is face detection necessary? It is not just about detecting whether a photo contains a face; more importantly, it crops away everything in the picture that is not the face. Otherwise, passing every pixel of the whole picture to the recognition function f(x) would certainly not work. Face detection does not necessarily require deep learning, because the requirements here are relatively modest: we only need to know whether there is a face and roughly where it is in the picture. Typically we use the face-detection functions of open-source libraries such as OpenCV or dlib (traditional feature methods based on expert experience, which need less computation and run faster), but deep-learning-based techniques such as MTCNN can also be used (when the neural network is deep and wide, computation is heavy and slow).

In face detection we focus on three metrics: the detection rate, the missed-detection rate, and the false-detection rate, where:

* Detection rate: the proportion of images in which an existing face is detected, out of all images that contain a face;
* Missed-detection rate: the proportion of images that contain a face but in which no face is detected, out of all images that contain a face;
* False-detection rate: the proportion of images in which a face is detected even though none is present, out of all images that contain no face.

Of course, speed also matters. Face detection itself is not covered further in this article.







2. Principles of deep learning


Given the pixel matrix of a clear face image, what function f(x) should we design to convert it into a feature value? The answer depends on the classification problem. That is, set feature values aside for a moment: how do we correctly classify a collection of photos by person in the first place? For that we need machine learning. Machine learning assumes that an algorithm can generalize well from a limited set of training samples. So first we find a limited training set, design an initial function f(x; w), and use the labelled pairs x -> y in the training set. If the data x is low-dimensional, say only two-dimensional, classification is very simple, as shown in the figure below:




The two-dimensional data x in the figure above fall into only two categories y, squares and circles, which are easy to separate. We need to learn the simplest classification function, f(x, y) = ax + by + c, which describes the classification line: for example, f(x, y) greater than 0 means circle, and less than 0 means square.
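This linear classifier can be sketched in a few lines of Python; the weights a, b, c below are arbitrary illustrative values, not the result of any actual training:

```python
# Minimal sketch of the 2-D linear classifier f(x, y) = a*x + b*y + c.
# The weights a, b, c are illustrative placeholders, not trained values.

def f(x, y, a=1.0, b=-1.0, c=0.5):
    """Linear scoring function: positive -> circle, negative -> square."""
    return a * x + b * y + c

def classify(x, y):
    return "circle" if f(x, y) > 0 else "square"

print(classify(2.0, 1.0))  # f = 2 - 1 + 0.5 = 1.5 > 0, so "circle"
print(classify(0.0, 2.0))  # f = 0 - 2 + 0.5 = -1.5 < 0, so "square"
```

Training is then the process of adjusting a, b, and c so the line separates the two classes.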


Starting from random initial values for a, b, and c, we continually optimize the parameters with the training data, gradually turning unsuitable classification lines such as L1 and L3 into L2; L2 then generalizes better to test data. However, if there are multiple categories, you need multiple classification lines to separate them, as shown in the figure below:



This is equivalent to combining multiple classification functions with AND (&&) and OR (||) operations; we might end up with a classifier like f1>0 && f2<0 && f3>0. But if the data is more complex, for example when the features are not obvious and training does not converge, this way of finding features no longer works. In the figure below, different colors represent different classifications, and the training data is in a completely non-linearly separable state:








As the example above shows, although the input picture is a cat, the dog class gets the highest score, 437.9. But how much higher is that than the cat and boat scores? It is hard to say. If we translate the scores into percentage probabilities between 0 and 100, comparison becomes easy. Here we can use the sigmoid function, as shown in the figure below:



As the formula and figure above show, sigmoid can map any real number to a value between 0 and 1, usable as a probability. But sigmoid probabilities are not normalized across classes: we need the probabilities of the input photo over all categories to sum to 1, so we also apply softmax, as follows:




In this way, given x we obtain x's probability under each category. Suppose the scores of the three categories are 3, 1, and -3; according to the formula above, the probabilities are [0.88, 0.12, 0]. The calculation process is shown in the figure below:
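The worked example above can be reproduced directly in code. Subtracting the maximum score before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

# Softmax turns raw class scores into probabilities that sum to 1.
# Reproduces the example above: scores [3, 1, -3].

def softmax(scores):
    exps = np.exp(scores - np.max(scores))  # shift for numerical stability
    return exps / exps.sum()

probs = softmax(np.array([3.0, 1.0, -3.0]))
print(np.round(probs, 2))  # probabilities ~ [0.88, 0.12, 0.00]
```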




But in fact the correct probability distribution for x is the first class, i.e. [1, 0, 0], while the probabilities (or likelihoods) we currently have are [0.88, 0.12, 0]. How large is the gap between them? That gap is the loss value. How do we compute it? After softmax, we minimize the cross-entropy loss function (which is convenient to differentiate), as shown below:




Here i is the correct class; in the example above, the loss value is -ln 0.88. So now that we have the loss function f(x), how do we adjust x to minimize the loss? This involves derivatives.
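The cross-entropy computation for this example is one line: when the true label is class 0 ([1, 0, 0]), the loss reduces to the negative log of the probability assigned to that class:

```python
import numpy as np

# Cross-entropy loss for a single sample: -ln(probability of true class).
# With predicted probabilities [0.88, 0.12, 0.00] and true class 0,
# the loss is -ln(0.88), matching the example in the text.

def cross_entropy(probs, true_class):
    return -np.log(probs[true_class])

probs = np.array([0.88, 0.12, 0.0])
loss = cross_entropy(probs, 0)
print(round(float(loss), 4))  # -> 0.1278
```

A perfect prediction ([1, 0, 0]) would give a loss of -ln 1 = 0; the worse the prediction, the larger the loss.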






Intuitively, when the slope is positive, moving x a little to the left decreases the value of f(x); when the slope is negative, moving x a little to the right decreases the value of f(x), as shown in the figure below:




Thus, at the point where the slope is 0, we obtain the minimum of the function f. But when we move x a little to the left or right, how much should it move? If the step is too large, we may overshoot the minimum; if it is very small, it may take a very long time to find the lowest point. There is another problem: if f(x) has several local minimum points in addition to the global minimum, a very small step may lead the derivative to a local minimum that is not low enough. As shown in the figure below:
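The procedure described above can be sketched in one dimension. Here the function, learning rate, and iteration count are arbitrary illustrative choices:

```python
# 1-D gradient descent sketch: minimize f(x) = (x - 3)^2,
# whose derivative is f'(x) = 2*(x - 3) and whose minimum is at x = 3.

def grad(x):
    return 2 * (x - 3)

x = 0.0    # starting point
lr = 0.1   # step size: too large overshoots, too small converges slowly
for _ in range(100):
    x -= lr * grad(x)  # move against the slope

print(round(x, 4))  # -> 3.0
```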










Above we used one-dimensional data to look at gradient descent, but our photos are multidimensional. How do we take derivatives then, and how do we descend the gradient? Here we need the concept of partial derivatives, which are very similar to ordinary derivatives: because x is a multidimensional vector, when taking the derivative with respect to xi, we hold the other components of x fixed; that is the partial derivative with respect to xi. Gradient descent then works as in the figure below, where θ is two-dimensional: we take the derivatives with respect to θ0 and θ1, move the corresponding step in both directions, and find the lowest point, as shown in the figure below:
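The two-dimensional case extends the 1-D procedure by collecting both partial derivatives into a gradient vector. The function and numbers below are illustrative only:

```python
import numpy as np

# 2-D gradient descent: theta = (theta0, theta1), f(theta) = theta0^2 + 2*theta1^2.
# The gradient stacks the two partial derivatives, and each step moves
# theta against the gradient, as described in the text.

def gradient(theta):
    t0, t1 = theta
    return np.array([2 * t0, 4 * t1])  # (df/dtheta0, df/dtheta1)

theta = np.array([4.0, -3.0])  # arbitrary starting point
lr = 0.1
for _ in range(200):
    theta -= lr * gradient(theta)

print(np.round(theta, 4))  # converges toward the minimum at the origin
```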






4. Face recognition based on CNN (convolutional neural networks)

Let's start with the fully connected network. Google's TensorFlow Playground lets you experience the power of fully connected neural networks intuitively. The site is http://playground.tensorflow.org/, where you can train a neural network in the browser and visualize the process and results. As shown in the figure below:





CNN observes that only one small rectangular window of the picture needs to be fully connected (this window is called a convolution kernel); sliding this window with the same weight parameters w across the whole picture produces the input to the next layer, as shown in the figure below:



Within the same CNN layer, the weight parameters can be shared, because different areas of the same picture have a certain similarity. This solves the problem of the original fully connected network requiring too much computation, as shown in the figure below:



Combining the earlier forward computation with matrices, let's look at the forward pass intuitively with an animated figure:




Here the size of the convolution kernel, its movement step length (the stride), and the output depth determine the size of the next layer. Also, when the kernel size and stride mean the previous layer's matrix is not large enough, we need padding to supplement zeros (the gray 0s in the figure above). This is called a convolution operation, and such a layer of neurons is called a convolutional layer. In the figure above, W0 and W1 indicate a depth of 2.


A CNN usually adds an activation layer after each convolutional layer. The activation layer is a function that transforms the convolutional layer's output into another value in a non-linear way, preserving the size relationships between values while limiting their range, so that the whole network can be trained. In face recognition, the ReLU function is usually used as the activation layer; ReLU is max(0, x), as shown below:



So ReLU's computational cost is very small!
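ReLU is a single element-wise comparison, which is exactly why it is so cheap:

```python
import numpy as np

# ReLU applies max(0, x) element-wise: negative inputs become 0,
# positive inputs pass through unchanged.

def relu(x):
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # negatives clamped to 0
```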


CNNs also have pooling layers. When the data output of a certain layer is too large, a pooling layer can reduce the dimensions of the data, cutting the amount of data while retaining the features. For example, a 4*4 matrix can be reduced to a 2*2 matrix:



In the figure above, pooling keeps the largest number in each color block, reducing the amount of data to compute.
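Max pooling with a 2x2 window and stride 2, as in the figure, can be sketched like this (the input matrix is an arbitrary example):

```python
import numpy as np

# 2x2 max pooling with stride 2: reduce a 4x4 matrix to 2x2 by keeping
# the largest value in each non-overlapping 2x2 block.

def max_pool_2x2(x):
    h, w = x.shape
    # reshape into (blocks_h, 2, blocks_w, 2), then take the max per block
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

m = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [7, 2, 9, 8],
              [1, 0, 3, 4]])
print(max_pool_2x2(m))  # keeps the max of each 2x2 block: [[6, 5], [7, 9]]
```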

Usually the last layer of the network is a fully connected layer, so a general CNN structure looks like this:



CONV is the convolutional layer, and each CONV is followed by a RELU layer. This is only a schematic; real networks are much more complex. The currently open-source Google FaceNet uses the resnet v1 network for face recognition; for resnet, please refer to the paper https://arxiv.org/abs/1602.07261. Its complete network is too complex to list here; you can also view the Python code of the TensorFlow implementation at https://github.com/davidsandberg/facenet/blob/master/src/models/inception_resnet_v1.py. Note that slim.conv2d includes the ReLU activation layer.


The above only covers a general-purpose CNN network. Face recognition is not direct classification: there is a registration phase in which we need to extract the feature value of a picture. Directly taking the data just before the softmax classification as the feature value does not work well. For example, the figure below converts the output of the fully connected layer directly into a two-dimensional vector and visualizes the classes by color on a two-dimensional plane:



As you can see, the effect is not good: the samples in the middle are too close together. After processing with the center loss method, the distance between feature values can be enlarged, as shown in the figure below:




In this way, the extracted feature values work much better.
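The core idea of center loss can be sketched as follows: each class keeps a center vector, and the loss penalizes the distance between a sample's feature vector and its class center, pulling same-class features together. This is only an illustration of the loss term itself (the feature vectors, labels, and centers below are made-up values; in real training the centers are learned alongside the network):

```python
import numpy as np

# Center-loss sketch: 0.5 * mean squared distance from each feature
# vector to the center of its own class. Smaller loss = tighter clusters.

def center_loss(features, labels, centers):
    diffs = features - centers[labels]            # each row minus its class center
    return 0.5 * np.mean(np.sum(diffs ** 2, axis=1))

features = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2]])  # illustrative 2-D features
labels = np.array([0, 0, 1])                                 # class of each sample
centers = np.array([[1.0, 0.0], [-1.0, 0.0]])                # one center per class
print(round(float(center_loss(features, labels, centers)), 4))  # -> 0.01
```

In practice this term is added to the softmax cross-entropy loss with a weighting factor, so the network learns features that are both separable and compact.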


