CRNN is the model proposed in 《An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition》, and it solves the problem of character recognition in images.

Paper address: <>

GitHub address: <>

1: Structure diagram

It mainly consists of three parts: the convolutional layers (CNN), the recurrent layers (RNN), and the transcription layer (CTC).

2: Characteristics of the whole network model

The network can learn directly from sequence labels; there is no need to label every character individually. A whole image gets a single sequence label: for example, if the image contains "abc123", the label is simply "abc123".

* A CNN extracts the image features.
* An RNN is trained on the input feature sequence; the output is a sequence label.
* There is no limit on the length (width) of the training images, but the image height must be normalized.
* Few parameters.
* Although it combines a CNN and an RNN, in the end there is a single loss function (CTC loss), so end-to-end training is realized.
3: The key point of CRNN: CTC (Connectionist Temporal Classification), from the paper 《Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks》

First, CRNN feeds the image into a CNN, which extracts features by convolution and produces a feature map.

Then, each column of the feature map is taken as one time step of the input sequence and fed into the RNN (a bidirectional LSTM). (We simplify here by treating one column as one LSTM time step; see the analysis of the CRNN network structure.)
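As a plain-Python sketch of this step (shapes and values are made up for illustration), an m × T feature map is sliced column by column into T time-step vectors:

```python
# A sketch of this step in plain Python (shapes are made up): an m x T
# feature map is sliced column by column into T time-step vectors.
def feature_map_to_sequence(feature_map):
    """feature_map: m rows of length T -> list of T column vectors."""
    m, T = len(feature_map), len(feature_map[0])
    return [[feature_map[row][t] for row in range(m)] for t in range(T)]

fm = [[1, 2, 3],
      [4, 5, 6]]                     # m = 2 features, T = 3 time steps
print(feature_map_to_sequence(fm))   # [[1, 4], [2, 5], [3, 6]]
```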

Let's take a look at the flow chart:


* CNN feature map
Suppose the feature map obtained from the CNN has size $m \times T$. The time indices below all start from $1$, i.e. $t = 1, 2, \dots, T$.

The time sequence fed into the RNN is defined as $x = (x^1, x^2, \dots, x^T)$,

where each column is $x^t \in \mathbb{R}^m$.


* RNN (bidirectional LSTM)

After a softmax over each LSTM time step, the output is a posterior probability matrix of size $n \times T$, where $T$ is the total number of time steps (columns) and $n$ is the number of labels in the dictionary. It is defined as $y = (y^1, y^2, \dots, y^T)$,

where each column is $y^t \in \mathbb{R}^n$.


Because the outputs are probabilities, the probabilities of all character labels in each column sum to $1$, i.e. $\sum_{k=1}^{n} y_k^t = 1$. Taking the maximum of each column gives the output character class for that column, i.e. an $\arg\max_k y_k^t$ operation on each column.

Treat the LSTM network as a mapping $\mathcal{N}_w$, where $w$ denotes the LSTM parameters. The $m \times T$ feature map produced by the CNN is fed into the LSTM and the output is $n \times T$, so the transformation performed is $y = \mathcal{N}_w(x)$.

Suppose the original character label set is $L = \{a, b, c, \dots, x, y, z\}$, the 26 letters. Because a default blank character must be added (its effect is similar to the default background class in object detection), the new label set is $L' = L \cup \{\text{blank}\}$.

Define the $B$ transformation: $B$ maps a path $\pi \in L'^T$ to a label sequence by first merging consecutive identical characters and then removing blanks; it represents a simple compression. For example, $B(\text{--stta-t--e-}) = \text{state}$.
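The $B$ transformation can be sketched in a few lines of plain Python (here `'-'` stands for the blank):

```python
# The B transformation: first merge consecutive identical characters,
# then delete blanks ('-').
def B(path, blank='-'):
    merged = []
    for c in path:
        if not merged or c != merged[-1]:
            merged.append(c)                       # merge repeats
    return ''.join(c for c in merged if c != blank)  # drop blanks

print(B('--stta-t--e-'))  # 'state'
print(B('aa-b'))          # 'ab'
print(B('a-a'))           # 'aa'  (a blank between two a's keeps them apart)
```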

Explanation: suppose $T = 12$, i.e. there are 12 columns; then a path such as $\pi = \text{--stta-t--e-}$ has length $T$ and compresses to "state".


The figure above shows 5 such paths, and a rule can be seen: if there is a blank between two identical characters, they are not merged; if there is no blank, consecutive identical characters are merged into one.

After the LSTM's softmax (note: softmax, not softmax loss, because CTC loss is used here) we obtain the character probability matrix. Applying $\arg\max$ gives the output, and after the $B$ transformation we get the final result; clearly, the $B$ transformation is not a one-to-one mapping.

So the pipeline is: take each column $x^t$ of the feature map as RNN input, obtain the probability matrix, apply the $\arg\max$ operation to get an output path, and then apply the $B$ transformation rules to get the final character prediction. But this output differs in length from the number of input columns (obviously different; in other words, a misalignment occurs). Training needs backpropagation to update the parameters, so the question is: how do we compute a loss from the predicted output?

If the loss were computed directly from the softmax probability matrix, each output column would have to correspond to one character of the label. Training would then require annotating the position of each character in every sample image and aligning it, through the CNN receptive field, to a column of the feature map, as shown in Figure 3.


In practice, annotating such aligned samples is very difficult and the workload is enormous, so the CTC paper proposes a loss computation method that does not require alignment, used to train the network.

Let's first count the paths: suppose $T = 30$ and the final output is "state" (five characters); we must consider every path that can be compressed into 【state】. The number of such paths grows combinatorially with $T$.

How CTC does it:

For the LSTM, given input $x$, the probability of outputting $l$ is:

$$p(l \mid x) = \sum_{\pi \in B^{-1}(l)} p(\pi \mid x)$$

Explanation of the formula: given input $x$, the probability of outputting $l$ is the sum of the probabilities of all paths $\pi$ whose $B$-transform is $l$.

So what is the probability of each path?

$$p(\pi \mid x) = \prod_{t=1}^{T} y_{\pi_t}^{t}$$

Explanation of the formula: the probability of each path is the product of its outputs at each time step.

Our objective function is therefore equivalent to maximizing $p(l \mid x)$, i.e. minimizing $-\ln p(l \mid x)$.

But the number of paths is huge and the sum cannot be computed directly, so the CTC method borrows the forward-backward algorithm from HMMs to compute it.
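For intuition, here is a plain-Python brute force of $p(l \mid x) = \sum_{\pi \in B^{-1}(l)} p(\pi \mid x)$ on a tiny made-up $3 \times 3$ probability matrix. It enumerates all $n^T$ paths, which is exactly what becomes infeasible for realistic $T$:

```python
from itertools import product

# Brute-force p(l|x): sum p(pi|x) over all paths pi with B(pi) = l.
# Feasible only for tiny T; the alphabet and y values are made up.
def B(path, blank='-'):
    merged = []
    for c in path:
        if not merged or c != merged[-1]:
            merged.append(c)
    return ''.join(c for c in merged if c != blank)

def p_label(y, labels, l):
    T = len(y)
    total = 0.0
    for path in product(range(len(labels)), repeat=T):  # all n^T paths
        if B(''.join(labels[k] for k in path)) == l:
            p = 1.0
            for t, k in enumerate(path):
                p *= y[t][k]          # p(pi|x) = product of per-step outputs
            total += p
    return total

labels = ['-', 'a', 'b']
y = [[0.2, 0.5, 0.3],
     [0.4, 0.3, 0.3],
     [0.1, 0.2, 0.7]]                 # T = 3 columns, each sums to 1
print(round(p_label(y, labels, 'ab'), 6))  # 0.407
```

Only 5 of the 27 length-3 paths ("aab", "abb", "ab-", "-ab", "a-b") compress to "ab"; their probabilities sum to 0.407.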


The CTC training process essentially adjusts the LSTM parameters $w$ through the gradient $\frac{\partial p(l \mid x)}{\partial w}$, so that $p(l \mid x)$ is maximized for the input sample $x$ with label $l$.

Define the set of all paths whose $B$-transform is $l$ and whose symbol at time $t$ is $k$ (written $\pi_t = k$) as $\{\pi \mid \pi \in B^{-1}(l),\ \pi_t = k\}$.

Derivation:

$$\frac{\partial p(l \mid x)}{\partial y_k^t} = \frac{\partial \sum_{\pi \in B^{-1}(l)} p(\pi \mid x)}{\partial y_k^t}$$

Note that the terms not involving $y_k^t$ are irrelevant to the derivative, so:

$$\frac{\partial p(l \mid x)}{\partial y_k^t} = \sum_{\pi \in B^{-1}(l),\ \pi_t = k} \frac{p(\pi \mid x)}{y_k^t}$$

First, understand the formula above: $y_k^t$ is the probability that the output character at time $t$ is $k$. In the derivative, although the total number of paths is huge, we only need the paths whose output character at time $t$ is $k$. The schematic diagram is as follows:

Suppose at time $t = 6$ the character is $k = a$. According to the figure above, two paths through this point are drawn (a few more are not drawn); suppose four paths pass through the point, denoted $\pi_1, \pi_2, \pi_3, \pi_4$. The set of all paths whose $B$-transform is $l$ and which pass through $\pi_6 = a$ is written $\{\pi \mid \pi \in B^{-1}(l),\ \pi_6 = a\}$.

Observe $\pi_1$ and $\pi_2$. Write the blue path as $\pi_1 = b_{1:5} + a + b_{7:T}$ and the red path as $\pi_2 = r_{1:5} + a + r_{7:T}$.

Then $\pi_3$ and $\pi_4$ can be obtained by exchanging the halves of the two paths after the crossing point:

$$\pi_3 = b_{1:5} + a + r_{7:T}, \qquad \pi_4 = r_{1:5} + a + b_{7:T}$$


Calculate:

$$p(\pi_1 \mid x) + p(\pi_2 \mid x) + p(\pi_3 \mid x) + p(\pi_4 \mid x)$$

To observe the pattern, compute the terms separately:

$$p(\pi_1 \mid x) = y_{b_1}^1 \cdots y_{b_5}^5 \cdot y_a^6 \cdot y_{b_7}^7 \cdots y_{b_T}^T$$

$$p(\pi_2 \mid x) = y_{r_1}^1 \cdots y_{r_5}^5 \cdot y_a^6 \cdot y_{r_7}^7 \cdots y_{r_T}^T$$

and similarly for $\pi_3$ and $\pi_4$ with the halves exchanged. Let:

$$\text{forward} = y_{b_1}^1 \cdots y_{b_5}^5 + y_{r_1}^1 \cdots y_{r_5}^5, \qquad \text{backward} = y_{b_7}^7 \cdots y_{b_T}^T + y_{r_7}^7 \cdots y_{r_T}^T$$

Then the sum can be expressed as:

$$p(\pi_1 \mid x) + p(\pi_2 \mid x) + p(\pi_3 \mid x) + p(\pi_4 \mid x) = \text{forward} \cdot y_a^6 \cdot \text{backward}$$


Generalizing: the sum over all paths that $B$-transform to $l$ and pass through $\pi_6 = a$ (i.e. $\{\pi \mid \pi \in B^{-1}(l),\ \pi_6 = a\}$) can be written as:

$$\sum_{\pi \in B^{-1}(l),\ \pi_6 = a} p(\pi \mid x) = \text{forward} \cdot y_a^6 \cdot \text{backward}$$


Generalizing further: the sum over all paths that $B$-transform to $l$ and pass through $\pi_t = k$ (i.e. $\{\pi \mid \pi \in B^{-1}(l),\ \pi_t = k\}$) can likewise be written as:

$$\sum_{\pi \in B^{-1}(l),\ \pi_t = k} p(\pi \mid x) = \text{forward}_t(k) \cdot y_k^t \cdot \text{backward}_t(k)$$


Therefore, define the forward variable. For a path $\pi$ of length $T$, write $\pi_{1:t}$ for its first $t$ characters and $\pi_{t:T}$ for its characters from $t$ to $T$. On the blank-augmented label $l'$ (blank inserted before, between, and after the characters of $l$, so $|l'| = 2|l| + 1$), $\alpha_t(s)$ is the total probability of all prefix subpaths $\pi_{1:t}$ that end at symbol $l'_s$ at time $t$ and whose $B$-transform equals the first $\lfloor s/2 \rfloor$ characters of $l$:

$$\alpha_t(s) = \sum_{\substack{\pi_{1:t}:\ \pi_t = l'_s \\ B(\pi_{1:t}) = l_{1:\lfloor s/2 \rfloor}}} \ \prod_{t'=1}^{t} y_{\pi_{t'}}^{t'}$$

Observing Figure 7, you will find that $\alpha_t(s)$ has the following recurrence. Initialize:

$$\alpha_1(1) = y_b^1, \qquad \alpha_1(2) = y_{l'_2}^1, \qquad \alpha_1(s) = 0 \ \text{ for } s > 2$$

Figure 6

So, more generally:

$$\alpha_t(s) = \begin{cases} \big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1)\big)\, y_{l'_s}^t & \text{if } l'_s = b \ \text{or} \ l'_s = l'_{s-2} \\[4pt] \big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\big)\, y_{l'_s}^t & \text{otherwise} \end{cases}$$


How to understand the recurrence above: the character point at time $t$ marked by the blue rectangle obtains its forward variable from up to three positions at the previous time step (the yellow arrows). So we have:

$$\alpha_t(s) = \big(\alpha_{t-1}(s) + \alpha_{t-1}(s-1) + \alpha_{t-1}(s-2)\big)\, y_{l'_s}^t$$
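The forward recursion can be sketched in plain Python on the blank-augmented label $l'$ (the alphabet and probability matrix below are made up; `'-'` is the blank):

```python
# A sketch of the CTC forward recursion for alpha on the blank-augmented
# label l' = (-, l1, -, l2, ..., -). Pure Python, tiny made-up example.
def ctc_forward(y, labels, l, blank='-'):
    lp = [blank]
    for c in l:
        lp += [c, blank]                     # blank-augmented label l'
    S, T = len(lp), len(y)
    idx = {c: i for i, c in enumerate(labels)}
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = y[0][idx[lp[0]]]           # start with the blank ...
    alpha[0][1] = y[0][idx[lp[1]]]           # ... or the first character
    for t in range(1, T):
        for s in range(S):
            a = alpha[t-1][s]
            if s >= 1:
                a += alpha[t-1][s-1]
            # skip allowed unless l'_s is blank or repeats l'_{s-2}
            if s >= 2 and lp[s] != blank and lp[s] != lp[s-2]:
                a += alpha[t-1][s-2]
            alpha[t][s] = a * y[t][idx[lp[s]]]
    # p(l|x) = mass ending in the last character or the trailing blank
    return alpha, alpha[T-1][S-1] + alpha[T-1][S-2]

labels = ['-', 'a', 'b']
y = [[0.2, 0.5, 0.3],
     [0.4, 0.3, 0.3],
     [0.1, 0.2, 0.7]]
alpha, p = ctc_forward(y, labels, 'ab')
print(round(p, 6))  # 0.407, the same value a brute-force sum over paths gives
```

Note the recursion visits only $T \cdot |l'|$ cells instead of $n^T$ paths.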


Similarly, define the backward variable $\beta_t(s)$: the total probability of all suffix subpaths $\pi_{t:T}$ that start at symbol $l'_s$ at time $t$ and whose $B$-transform equals the remaining characters of $l$:

$$\beta_t(s) = \sum_{\substack{\pi_{t:T}:\ \pi_t = l'_s \\ B(\pi_{t:T}) = l_{\lceil s/2 \rceil : |l|}}} \ \prod_{t'=t}^{T} y_{\pi_{t'}}^{t'}$$

with initialization $\beta_T(|l'|) = y_b^T$, $\beta_T(|l'| - 1) = y_{l'_{|l'|-1}}^T$, and $\beta_T(s) = 0$ for $s < |l'| - 1$.

Similarly, $\beta_t(s)$ has the following recurrence:

$$\beta_t(s) = \begin{cases} \big(\beta_{t+1}(s) + \beta_{t+1}(s+1)\big)\, y_{l'_s}^t & \text{if } l'_s = b \ \text{or} \ l'_s = l'_{s+2} \\[4pt] \big(\beta_{t+1}(s) + \beta_{t+1}(s+1) + \beta_{t+1}(s+2)\big)\, y_{l'_s}^t & \text{otherwise} \end{cases}$$


Then multiply forward and backward. Since both $\alpha_t(s)$ and $\beta_t(s)$ contain the factor $y_{l'_s}^t$:

$$\alpha_t(s)\,\beta_t(s) = y_{l'_s}^t \sum_{\pi \in B^{-1}(l),\ \pi_t = l'_s} p(\pi \mid x)$$

or:

$$\frac{\alpha_t(s)\,\beta_t(s)}{y_{l'_s}^t} = \sum_{\pi \in B^{-1}(l),\ \pi_t = l'_s} p(\pi \mid x)$$

Note that this corresponds to the forward/backward decomposition shown in Figure 6.

Comparing with $p(l \mid x) = \sum_{\pi \in B^{-1}(l)} p(\pi \mid x)$ and summing over all positions $s$ of $l'$, you can get, for any fixed $t$:

$$p(l \mid x) = \sum_{s=1}^{|l'|} \frac{\alpha_t(s)\,\beta_t(s)}{y_{l'_s}^t}$$
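A plain-Python sketch that runs both recursions and checks that $\sum_s \alpha_t(s)\beta_t(s)/y_{l'_s}^t$ yields the same $p(l \mid x)$ at every $t$ (tiny made-up alphabet and probabilities):

```python
# Checks the identity p(l|x) = sum_s alpha_t(s) * beta_t(s) / y_{l'_s}^t
# at every t, on a tiny made-up example (pure Python sketch).
def fb(y, labels, l, blank='-'):
    lp = [blank]
    for c in l:
        lp += [c, blank]                      # blank-augmented label l'
    S, T = len(lp), len(y)
    idx = {c: i for i, c in enumerate(labels)}
    # forward
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0], alpha[0][1] = y[0][idx[lp[0]]], y[0][idx[lp[1]]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t-1][s] + (alpha[t-1][s-1] if s >= 1 else 0.0)
            if s >= 2 and lp[s] != blank and lp[s] != lp[s-2]:
                a += alpha[t-1][s-2]          # skip transition
            alpha[t][s] = a * y[t][idx[lp[s]]]
    # backward
    beta = [[0.0] * S for _ in range(T)]
    beta[T-1][S-1] = y[T-1][idx[lp[S-1]]]
    beta[T-1][S-2] = y[T-1][idx[lp[S-2]]]
    for t in range(T-2, -1, -1):
        for s in range(S):
            b = beta[t+1][s] + (beta[t+1][s+1] if s+1 < S else 0.0)
            if s+2 < S and lp[s+2] != blank and lp[s+2] != lp[s]:
                b += beta[t+1][s+2]           # skip transition
            beta[t][s] = b * y[t][idx[lp[s]]]
    # the identity, evaluated at every time step t
    return [sum(alpha[t][s] * beta[t][s] / y[t][idx[lp[s]]]
                for s in range(S) if y[t][idx[lp[s]]] > 0)
            for t in range(T)]

labels = ['-', 'a', 'b']
y = [[0.2, 0.5, 0.3],
     [0.4, 0.3, 0.3],
     [0.1, 0.2, 0.7]]
print([round(p, 6) for p in fb(y, labels, 'ab')])  # same value at every t
```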

Training CTC

For the LSTM there is a training set $S = \{(x_1, l_1), (x_2, l_2), \dots\}$, where each $x$ is the feature map computed from an image by the CNN and each $l$ is the image's OCR character label (the label does not contain the blank character).

What we must do now is: adjust the LSTM parameters $w$ by gradient descent so that $p(l \mid x)$ is maximized for the input samples. So computing the gradient is the core.

Look at a single value $y_k^t$ in the LSTM output matrix (note that $y_k^t$ and $y_{l'_s}^t$ mean the same thing: the probability of symbol $k = l'_s$ at time $t$):

$$\frac{\partial p(l \mid x)}{\partial y_k^t} = \frac{1}{(y_k^t)^2} \sum_{s:\ l'_s = k} \alpha_t(s)\,\beta_t(s)$$

Once the gradient $\frac{\partial p(l \mid x)}{\partial y_k^t}$ is obtained, the rest is easy: $\alpha_t(s)$ and $\beta_t(s)$ can be computed quickly by recursion, and gradient descent follows.

CTC summary

CTC is a loss computation method. Using CTC in place of softmax loss means training samples do not need to be aligned.

Characteristics of CTC:

* It introduces the blank character, which handles positions where no character occurs.
* Through the recursion, gradients can be computed quickly.