# CRNN principle

CRNN is the model proposed in *An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition*; it solves the problem of recognizing character sequences in images.

Paper: https://arxiv.org/abs/1507.05717

GitHub: https://github.com/bgshih/crnn

## 1. Structure diagram

The network consists of three parts: a convolutional layer (CNN), a recurrent layer (RNN), and a transcription layer (CTC).

## 2. Characteristics of the network

* It learns directly from sequence labels: there is no need to annotate each character individually. For example, if an image shows "abc123", its label is simply "abc123".

* A CNN extracts the image features.

* An RNN is trained on the input feature sequence and outputs a sequence label.

* There is no limit on the width of the training images, but their height must be normalized.

* It has relatively few parameters.

* Although it combines a CNN and an RNN, there is ultimately a single loss function (CTC loss), so end-to-end training is achieved.

## 3. The key point of CRNN: CTC (Connectionist Temporal Classification)

CTC comes from the paper *Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks* (<ftp://ftp.idsia.ch/pub/juergen/icml2006.pdf>).

First, CRNN feeds the image into the CNN for convolutional feature extraction, producing a feature map.

Then each column of the feature map (a simplification here: think of one column as one LSTM time step; see the analysis of the CRNN network structure) is used as one time-step input feature and fed into the RNN (a bidirectional LSTM).

The flow is as follows:

* CNN feature map

Suppose the feature map produced by the CNN has size $m \times T$. The time steps below all run from $t = 1$ to $t = T$. The sequence fed into the RNN is defined as $x = (x^1, x^2, \dots, x^T)$, where each column is $x^t \in \mathbb{R}^m$.

* RNN (bidirectional LSTM)

The LSTM is followed by a softmax at each time step, so the output is a posterior probability matrix of size $n \times T$, where $T$ is the number of time steps and $n$ is the number of labels in the dictionary. It is defined as $y = (y^1, y^2, \dots, y^T)$, where each column is $y^t \in \mathbb{R}^n$.

Because the entries are probabilities, the character-label probabilities in each column sum to 1, i.e. $\sum_{k=1}^{n} y_k^t = 1$. Taking the maximum of each column gives the output character class for that column, i.e. performing $\arg\max_k y_k^t$ on each column.

Regard the LSTM network as a mapping $\mathcal{N}_w$, where $w$ denotes the LSTM parameters. The $m \times T$ feature map produced by the CNN is fed into the LSTM and the output is $n \times T$; that is, the network performs the transformation $y = \mathcal{N}_w(x)$.

Suppose the original label alphabet is $L = \{a, b, c, \dots, x, y, z\}$, the 26 letters. Because a default blank character must be added (its effect is similar to the default background class in object detection), the extended alphabet is $L' = L \cup \{\text{blank}\}$.

Define the transformation $B: L'^{T} \to L^{\le T}$, a simple compression: merge consecutive repeated characters, then remove all blanks.

Example: suppose $T = 12$, i.e. there are 12 columns; a path $\pi$ is then a sequence of length $T$, for instance $\pi = \text{"--stta-t---e"}$, which compresses to $B(\pi) = \text{"state"}$. As shown in the figure above, there are 5 such paths, and the rule is clear: if two identical characters are separated by a blank, they are not merged; if there is no blank between them, identical adjacent characters are merged into one.
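The compression rule above can be sketched in a few lines of Python (an illustrative helper, not the paper's code; `-` stands for the blank character):

```python
def B(path, blank="-"):
    """CTC compression: merge adjacent repeated characters, then drop blanks.

    Identical characters separated by a blank survive as separate characters;
    identical adjacent characters with no blank between them collapse into one.
    """
    merged = []
    prev = None
    for ch in path:
        if ch != prev:      # keep only the first character of each run
            merged.append(ch)
        prev = ch
    return "".join(ch for ch in merged if ch != blank)
```

For example, `B("--stta-t---e")` and `B("sstta--t--ee")` both give `"state"`: the blank between the two `t` runs prevents them from merging.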

After the LSTM, a softmax (note: a softmax, not a softmax loss, because CTC loss is used here) yields the character probability score matrix. An argmax() over each column gives an output path $\pi$, and after the $B$ transformation we obtain the predicted label; clearly the $B$ transformation is not a one-to-one mapping.

So now we can take each column of the feature map as an input $x^t$ to the RNN, obtain the probability matrix $y$, apply argmax() to get an output path, and then run that output through the $B$ transformation rules to obtain the final character prediction. But the length of this output differs from the number of input columns (they are obviously different; in other words, a misalignment occurs), and training requires backpropagation to update the parameters. So how do we compute a loss from the predicted output?
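This greedy pipeline (per-column argmax followed by the $B$ compression) can be sketched as follows; placing the blank at index 0 of the alphabet is an assumption made for illustration:

```python
def greedy_decode(cols, alphabet="-ab"):
    """cols: T probability columns, each a length-n list; alphabet[0] is blank.

    Take the argmax of each column to form a path, then B-compress it:
    merge adjacent repeats and drop blanks.
    """
    path = [max(range(len(c)), key=c.__getitem__) for c in cols]
    out, prev = [], None
    for k in path:
        if k != prev and k != 0:   # index 0 is the blank
            out.append(alphabet[k])
        prev = k
    return "".join(out)
```

For instance, four columns whose argmaxes are `a, a, blank, b` decode to `"ab"`: four input columns produce two output characters, illustrating the misalignment.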

If a plain softmax loss were used on the output probability matrix, each output column would have to correspond to one character of the label. Training would then require annotating the position of every character in each sample image and aligning it, through the CNN receptive field, with a column of the feature map (as shown in Figure 3).

In practice, producing such aligned annotations is very difficult and the workload is enormous, so the CTC paper proposes a loss computation method that needs no alignment to train the network.

First, consider the number of paths. Suppose $T = 30$ and the final output is "state", five characters: a huge number of distinct length-30 paths are all compressed to 【state】, and their number grows combinatorially with $T$.
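The counting formula from the original text is not reproduced here, but the growth is easy to demonstrate by brute force for tiny $T$ (an illustrative sketch; `-` is the blank):

```python
from itertools import product

def count_paths(label, alphabet, T):
    """Count length-T paths over the alphabet whose B-compression equals label.

    Pure enumeration of |alphabet|**T paths, so only feasible for small T.
    """
    def B(path):
        merged, prev = [], None
        for ch in path:
            if ch != prev:          # merge adjacent repeats
                merged.append(ch)
            prev = ch
        return "".join(c for c in merged if c != "-")   # drop blanks

    return sum(1 for path in product(alphabet, repeat=T) if B(path) == label)
```

For the label `"ab"` over $\{-, a, b\}$ the count is already 1 at $T=2$, 5 at $T=3$, and 15 at $T=4$.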

How CTC works:

For the LSTM, given input $x$, the probability of outputting label $l$ is:

$$p(l \mid x) = \sum_{\pi \in B^{-1}(l)} p(\pi \mid x)$$

In words: given input $x$, the probability of output $l$ is the sum of the probabilities of all paths $\pi$ whose $B$-transform is $l$, i.e. $B(\pi) = l$.

And what is the probability of each path?

$$p(\pi \mid x) = \prod_{t=1}^{T} y_{\pi_t}^{t}$$

In words: the probability of a path is the product of the probabilities of its output character at each time step.
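Both formulas can be checked directly by brute force on a tiny example (illustrative sketch; blank at index 0 of the alphabet):

```python
from itertools import product

def path_prob(path, y):
    """p(pi | x): product over time of the chosen character's column probability."""
    p = 1.0
    for t, k in enumerate(path):
        p *= y[t][k]
    return p

def label_prob(label, y, alphabet="-ab"):
    """p(l | x): sum of p(pi | x) over every path whose B-compression is label."""
    T, n = len(y), len(y[0])
    total = 0.0
    for path in product(range(n), repeat=T):
        out, prev = [], None
        for k in path:
            if k != prev and k != 0:       # compress: merge repeats, drop blanks
                out.append(alphabet[k])
            prev = k
        if "".join(out) == label:
            total += path_prob(path, y)
    return total
```

With $T = 2$ over $\{-, a\}$ and columns `y = [[0.6, 0.4], [0.5, 0.5]]`, the label `"a"` gets probability $0.3 + 0.2 + 0.2 = 0.7$, summing the three paths $(-,a)$, $(a,-)$, $(a,a)$.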

Our objective function is therefore equivalent to maximizing $p(l \mid x)$, i.e. minimizing $-\ln p(l \mid x)$.

But the number of paths is huge and the sum cannot be computed directly, so the CTC method borrows the forward-backward algorithm from HMMs to compute it.

The CTC training process essentially adjusts the LSTM parameters $w$ by gradient descent so that $p(l \mid x)$ becomes as large as possible for the input samples.

Define the set of all paths whose $B$-transform is $l$ and whose character at time $t$ is $k$ (i.e. $\pi_t = k$) as $\{\pi \mid \pi \in B^{-1}(l),\ \pi_t = k\}$.

Derivation:

$$\frac{\partial p(l \mid x)}{\partial y_k^t} = \frac{\partial}{\partial y_k^t} \sum_{\pi \in B^{-1}(l)} \prod_{t'=1}^{T} y_{\pi_{t'}}^{t'}$$

Note that the factors with $t' \neq t$ do not depend on $y_k^t$, therefore:

$$\frac{\partial p(l \mid x)}{\partial y_k^t} = \sum_{\pi \in B^{-1}(l),\ \pi_t = k} \ \prod_{t' \neq t} y_{\pi_{t'}}^{t'} = \sum_{\pi \in B^{-1}(l),\ \pi_t = k} \frac{p(\pi \mid x)}{y_k^t}$$

To read the formula: $y_k^t$ is the probability that the output character at time $t$ is $k$. Because the total number of paths is huge, the derivative only involves the paths whose output character at time $t$ is $k$. The schematic diagram is as follows:

Suppose $t = 6$ and $k = a$. According to the figure above, several paths pass through this point (only a few are drawn); suppose four paths pass through it, $\pi_1, \pi_2, \pi_3, \pi_4$. All such paths, whose $B$-transform is $l$ and whose character at time $6$ is $a$, form the set $\{\pi \mid \pi \in B^{-1}(l),\ \pi_6 = a\}$.

Observe the paths. Write the blue segment of a path for the part before time $t = 6$ and the red segment for the part after it. Each path through the point can then be expressed as blue prefix + $a$ + red suffix, and new valid paths can be obtained by exchanging the lower halves of two paths after they pass through the intersection point.

Compute $p(\pi_1 \mid x) + p(\pi_2 \mid x) + p(\pi_3 \mid x) + p(\pi_4 \mid x)$. To expose the pattern, separate the calculation: factor each path probability into its prefix part, the common factor $y_a^6$, and its suffix part. The sum then factorizes as:

$$\Big(\sum_{\text{prefixes}} \ \prod_{t'=1}^{5} y_{\pi_{t'}}^{t'}\Big) \cdot y_a^6 \cdot \Big(\sum_{\text{suffixes}} \ \prod_{t'=7}^{T} y_{\pi_{t'}}^{t'}\Big)$$

Generalizing: the sum of the probabilities of all paths that pass through $(t, k)$, i.e. $\pi \in B^{-1}(l)$ with $\pi_t = k$, can be written as (sum over all valid prefixes ending at $(t, k)$) $\times$ $y_k^t$ $\times$ (sum over all valid suffixes starting at $(t, k)$).

Therefore, define the forward variable. Let $l'$ be the label $l$ with a blank inserted between every pair of characters and at both ends (so $|l'| = 2|l| + 1$); for a path $\pi$ of length $T$, write $\pi_{1:t}$ for its first $t$ characters and $\pi_{t:T}$ for the remaining ones. Then:

$$\alpha_t(s) = \sum_{\substack{\pi:\ \pi_t = l'_s \\ B(\pi_{1:t}) = l_{1:\lfloor s/2 \rfloor}}} \ \prod_{t'=1}^{t} y_{\pi_{t'}}^{t'}$$

That is, $\alpha_t(s)$ sums the probabilities of all first-half sub-paths whose first $t$ characters end at position $s$ of $l'$ and compress to the corresponding prefix of the label.

Observing Figure 7, $\alpha_t(s)$ satisfies the following initialization:

$$\alpha_1(1) = y_{\text{blank}}^1, \qquad \alpha_1(2) = y_{l_1}^1, \qquad \alpha_1(s) = 0 \ \ \text{for } s > 2$$

Figure 6

More generally:

$$\alpha_t(s) = \begin{cases} \big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\big)\, y_{l'_s}^t & \text{if } l'_s = \text{blank or } l'_s = l'_{s-2} \\ \big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\big)\, y_{l'_s}^t & \text{otherwise} \end{cases}$$

How to understand this: for the character point $(t, s)$ marked by the blue rectangle, its forward variable is accumulated from at most three positions at the previous time step (the yellow arrows): $s$, $s-1$, and, when the label structure permits skipping the intervening blank, $s-2$.
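The initialization and recurrence above translate directly into code (a sketch using plain lists; blank = index 0 of the alphabet, and `alpha[t][s]` is $\alpha_{t+1}(s+1)$ in 0-based indexing):

```python
def ctc_forward(label, y, alphabet="-ab"):
    """Forward variables alpha[t][s] over the blank-padded label l'.

    y: T probability columns (length n, blank at index 0).
    Returns (alpha, lp) where lp is l' as a list of character indices.
    """
    blank = 0
    lp = [blank]
    for ch in label:                       # l' = -, l1, -, l2, -, ..., -
        lp += [alphabet.index(ch), blank]
    T, S = len(y), len(lp)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = y[0][blank]              # a path may start with a blank ...
    if S > 1:
        alpha[0][1] = y[0][lp[1]]          # ... or with the first character
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]
            if s >= 1:
                a += alpha[t - 1][s - 1]
            if s >= 2 and lp[s] != blank and lp[s] != lp[s - 2]:
                a += alpha[t - 1][s - 2]   # skip the blank between characters
            alpha[t][s] = a * y[t][lp[s]]
    return alpha, lp
```

$p(l \mid x)$ is then `alpha[T-1][S-1] + alpha[T-1][S-2]`, since a valid path ends on the final blank or the final character.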

Similarly, define the backward variable:

$$\beta_t(s) = \sum_{\substack{\pi:\ \pi_t = l'_s \\ B(\pi_{t:T}) = l_{\lceil s/2 \rceil : |l|}}} \ \prod_{t'=t}^{T} y_{\pi_{t'}}^{t'}$$

According to the definition, $\beta_t(s)$ has the analogous initialization and recurrence:

$$\beta_T(|l'|) = y_{\text{blank}}^T, \qquad \beta_T(|l'|-1) = y_{l_{|l|}}^T, \qquad \beta_T(s) = 0 \ \ \text{for } s < |l'|-1$$

$$\beta_t(s) = \begin{cases} \big(\beta_{t+1}(s) + \beta_{t+1}(s+1)\big)\, y_{l'_s}^t & \text{if } l'_s = \text{blank or } l'_s = l'_{s+2} \\ \big(\beta_{t+1}(s) + \beta_{t+1}(s+1) + \beta_{t+1}(s+2)\big)\, y_{l'_s}^t & \text{otherwise} \end{cases}$$
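The backward recurrence mirrors the forward one, run from $t = T$ down to $1$ (same conventions as before: blank = index 0, and $\beta_t(s)$ includes the factor $y_{l'_s}^t$):

```python
def ctc_backward(label, y, alphabet="-ab"):
    """Backward variables beta[t][s] over the blank-padded label l'."""
    blank = 0
    lp = [blank]
    for ch in label:
        lp += [alphabet.index(ch), blank]
    T, S = len(y), len(lp)
    beta = [[0.0] * S for _ in range(T)]
    beta[T - 1][S - 1] = y[T - 1][lp[S - 1]]      # a path ends on the final blank ...
    if S > 1:
        beta[T - 1][S - 2] = y[T - 1][lp[S - 2]]  # ... or the final character
    for t in range(T - 2, -1, -1):
        for s in range(S - 1, -1, -1):
            b = beta[t + 1][s]
            if s + 1 < S:
                b += beta[t + 1][s + 1]
            if s + 2 < S and lp[s] != blank and lp[s] != lp[s + 2]:
                b += beta[t + 1][s + 2]           # skip the blank between characters
            beta[t][s] = b * y[t][lp[s]]
    return beta, lp
```

Symmetrically, $p(l \mid x)$ is `beta[0][0] + beta[0][1]`: a valid path starts on the initial blank or the first character.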

Then multiplying forward and backward:

$$\alpha_t(s)\,\beta_t(s) = y_{l'_s}^t \sum_{\substack{\pi \in B^{-1}(l) \\ \pi_t = l'_s}} \ \prod_{t'=1}^{T} y_{\pi_{t'}}^{t'}$$

or:

$$\frac{\alpha_t(s)\,\beta_t(s)}{y_{l'_s}^t} = \sum_{\substack{\pi \in B^{-1}(l) \\ \pi_t = l'_s}} p(\pi \mid x)$$

Note that $y_{l'_s}^t$ appears in both $\alpha_t(s)$ and $\beta_t(s)$, which is why it is divided out once; the terms can be matched against the correspondences shown in Figure 6.

Comparing with $p(l \mid x) = \sum_{\pi \in B^{-1}(l)} p(\pi \mid x)$ and summing over all positions $s$ of $l'$, we get, for any fixed $t$:

$$p(l \mid x) = \sum_{s=1}^{|l'|} \frac{\alpha_t(s)\,\beta_t(s)}{y_{l'_s}^t}$$
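This identity can be verified numerically on a tiny example: the same $p(l \mid x)$ comes out at every time step $t$ (a self-contained sketch repeating the forward/backward recursions; blank = index 0, and all probabilities assumed nonzero so the division is safe):

```python
def pad(label, alphabet):
    """Blank-padded label l' as a list of character indices (blank = 0)."""
    lp = [0]
    for ch in label:
        lp += [alphabet.index(ch), 0]
    return lp

def forward(lp, y):
    T, S = len(y), len(lp)
    a = [[0.0] * S for _ in range(T)]
    a[0][0] = y[0][lp[0]]
    if S > 1:
        a[0][1] = y[0][lp[1]]
    for t in range(1, T):
        for s in range(S):
            v = a[t - 1][s] + (a[t - 1][s - 1] if s >= 1 else 0.0)
            if s >= 2 and lp[s] != 0 and lp[s] != lp[s - 2]:
                v += a[t - 1][s - 2]
            a[t][s] = v * y[t][lp[s]]
    return a

def backward(lp, y):
    T, S = len(y), len(lp)
    b = [[0.0] * S for _ in range(T)]
    b[T - 1][S - 1] = y[T - 1][lp[S - 1]]
    if S > 1:
        b[T - 1][S - 2] = y[T - 1][lp[S - 2]]
    for t in range(T - 2, -1, -1):
        for s in range(S):
            v = b[t + 1][s] + (b[t + 1][s + 1] if s + 1 < S else 0.0)
            if s + 2 < S and lp[s] != 0 and lp[s] != lp[s + 2]:
                v += b[t + 1][s + 2]
            b[t][s] = v * y[t][lp[s]]
    return b

def label_prob_fb(label, y, alphabet, t):
    """p(l|x) = sum_s alpha_t(s) * beta_t(s) / y[t][l'_s], at any fixed t."""
    lp = pad(label, alphabet)
    a, b = forward(lp, y), backward(lp, y)
    return sum(a[t][s] * b[t][s] / y[t][lp[s]] for s in range(len(lp)))
```

Evaluating at different `t` gives identical results, which is exactly what makes the forward-backward computation of the loss consistent.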

## Training CTC

For the LSTM there is a training set $S = \{(x, z)\}$, where $x$ is the feature map computed from an image by the CNN and $z$ is the image's OCR character label ($z$ does not contain the blank character).

What we must do now is adjust the LSTM parameters $w$ by gradient descent so that $p(z \mid x)$ becomes as large as possible for the input samples. The core problem is therefore how to compute the gradient.

Look at a single value $y_k^t$ in the LSTM output matrix alone (note that $z$ and $l$ mean the same thing here; $y_k^t$ is the probability of character $k$ at time $t$):

$$\frac{\partial p(l \mid x)}{\partial y_k^t} = \frac{1}{(y_k^t)^2} \sum_{s \in lab(l, k)} \alpha_t(s)\,\beta_t(s)$$

where $lab(l, k) = \{s : l'_s = k\}$ is the set of positions in $l'$ at which character $k$ occurs. Once we have this gradient the rest is easy: $\alpha_t(s)$ and $\beta_t(s)$ can be computed quickly by recursion, and gradient descent follows.
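The gradient formula can also be checked on a tiny example, where the derivative of the brute-force path sum is easy to compute by hand (self-contained sketch; conventions as before, blank = index 0):

```python
def ctc_grad(label, y, alphabet, t, k):
    """d p(l|x) / d y[t][k] via the forward-backward variables:
    (1 / y[t][k]**2) * sum of alpha_t(s)*beta_t(s) over s with l'_s == k.
    """
    lp = [0]
    for ch in label:                       # blank-padded label l'
        lp += [alphabet.index(ch), 0]
    T, S = len(y), len(lp)
    alpha = [[0.0] * S for _ in range(T)]
    beta = [[0.0] * S for _ in range(T)]
    alpha[0][0] = y[0][lp[0]]
    if S > 1:
        alpha[0][1] = y[0][lp[1]]
    for u in range(1, T):                  # forward recursion
        for s in range(S):
            v = alpha[u - 1][s] + (alpha[u - 1][s - 1] if s >= 1 else 0.0)
            if s >= 2 and lp[s] != 0 and lp[s] != lp[s - 2]:
                v += alpha[u - 1][s - 2]
            alpha[u][s] = v * y[u][lp[s]]
    beta[T - 1][S - 1] = y[T - 1][lp[S - 1]]
    if S > 1:
        beta[T - 1][S - 2] = y[T - 1][lp[S - 2]]
    for u in range(T - 2, -1, -1):         # backward recursion
        for s in range(S):
            v = beta[u + 1][s] + (beta[u + 1][s + 1] if s + 1 < S else 0.0)
            if s + 2 < S and lp[s] != 0 and lp[s] != lp[s + 2]:
                v += beta[u + 1][s + 2]
            beta[u][s] = v * y[u][lp[s]]
    total = sum(alpha[t][s] * beta[t][s] for s in range(S) if lp[s] == k)
    return total / y[t][k] ** 2
```

With `y = [[0.6, 0.4], [0.5, 0.5]]`, label `"a"`, alphabet `"-a"`: the only path through $(t{=}0, k{=}\text{blank})$ mapping to "a" is $(-, a)$, so the hand derivative is $y_a^2 = 0.5$, and the formula returns the same value.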

## CTC summary

CTC is a way of computing the loss: using CTC in place of a softmax loss means training samples do not need to be aligned.

Characteristics of CTC:

* It introduces the blank character, solving the problem that some positions contain no character.

* The gradient is computed quickly by recursion.