<> One. Background of Stream Computing

<>1. Summary

Technology is always evolving, and working in this industry means constantly learning and adapting. That is also what attracts me most to the field of artificial intelligence: the basic theory stays the same, but as the business and computing power develop, the implementations on top keep iterating. Today I'd like to share my understanding of stream computing.

Let's start with the evolution of computing engines; I drew the picture above. The first generation of distributed computing engines was Hadoop, a creation that defined an era. People used Hadoop's MapReduce framework to implement many algorithms, and those algorithms played a great role. Hadoop's defining characteristic is that computation depends on hard-disk storage: many intermediate results of the computation have to be written to disk and then read back, which causes poor performance.

Spark improved on this by keeping data in memory for computation, greatly improving efficiency. But whether Spark or Hadoop, what they solve is batch computation, also known as offline computation. Offline computation collects the data first and then computes over it all at once. For algorithms this can mean faster convergence, because a large amount of data participates in each computation. But there is a problem: poor real-time performance. This leads to the topic of the next generation of computing engines: stream computing.

<>2. Stream Computing

To understand stream computing, we first need to clarify the concepts. Let's start with the computation models of stream computing (stream compute) and batch computing (batch compute):

Stream computing: when a piece of data has been processed, it is serialized into a cache and then immediately transmitted over the network to the next node, which continues processing it.


Batch processing: when a piece of data has been processed, it is serialized into a cache but not immediately sent over the network to the next node. When the cache is full, it is persisted to the local hard disk; only after all the data has been processed is the processed data transmitted over the network to the next node.
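The two models above can be sketched in a few lines of Python. This is purely illustrative; the function names and the toy "processing" step (doubling each record) are my own, not any engine's API:

```python
from typing import Iterable, Iterator, List

def stream_process(records: Iterable[int]) -> Iterator[int]:
    """Streaming model: each record is processed and immediately
    handed to the next node (here, simply yielded downstream)."""
    for r in records:
        yield r * 2  # process and forward one record at a time

def batch_process(records: Iterable[int], batch_size: int = 3) -> Iterator[List[int]]:
    """Batch model: results accumulate in a buffer; only when the
    buffer fills (or the input ends) is it shipped downstream."""
    buffer: List[int] = []
    for r in records:
        buffer.append(r * 2)
        if len(buffer) >= batch_size:
            yield buffer  # flush the full buffer to the next node
            buffer = []
    if buffer:
        yield buffer      # flush whatever remains at the end

print(list(stream_process([1, 2, 3, 4])))  # records flow one at a time
print(list(batch_process([1, 2, 3, 4])))   # records arrive in buffered chunks
```

The real difference in an engine is of course network and disk behavior, but the buffering logic is the essence of the contrast.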

Are you starting to get a feel for stream computing? Compared with batch compute, stream compute is certainly more flexible for the business, because it can work with more real-time data (the time window of the data really matters; I'll share my views on that when I get the chance).

Let me give an example of the advantage streaming brings to the business. Suppose an e-commerce platform has a recommendation system whose models are generated every week by batch training on offline data. Then one day the company launches a targeted marketing campaign aimed at a special group of people, and a large number of such users pour in. The old model may not work for this group. If we had the ability to train the model in real time, we could respond to this scenario much faster. This is the concept of online learning, and its underlying layer depends on a stream computing engine.
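A minimal sketch of the online-learning idea, assuming a toy linear model updated by one SGD step per incoming event. All names here are hypothetical and chosen for illustration only:

```python
def sgd_update(weights, features, label, lr=0.1):
    """One online SGD step for a linear model with squared loss:
    the model changes the moment an event arrives, instead of
    waiting for the next weekly batch retrain."""
    pred = sum(w * x for w, x in zip(weights, features))
    err = pred - label
    return [w - lr * err * x for w, x in zip(weights, features)]

weights = [0.0, 0.0]
# Each incoming event (feature vector, label) updates the model at once.
events = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 0.0], 1.0)]
for features, label in events:
    weights = sgd_update(weights, features, label)
```

The same loop run continuously over a live event stream is the core of what an online-learning system does; the streaming engine's job is to deliver those events reliably and with low latency.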

<>3. The True Next-Generation Stream Computing Engine

The stream computing engine will be the next generation of computing engines. This does not mean that stream computing replaces batch computing; rather, the next generation of stream computing engines will support both batch compute and stream compute, achieving unified stream-batch processing. Flink may be one answer.

Of course, the challenges of stream computing are much greater than those of batch computing. Take the failover mechanism: in batch computing all intermediate results are stored and can be traced back, but how does stream computing recover when a machine goes down? Or take the exactly-once mechanism: how do we guarantee that each piece of data in distributed stream computing is processed exactly once, rather than being processed by multiple machines?
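One common way to get effectively-once results is to make the sink idempotent, sketched below as a toy. This illustrates the idempotent-sink idea only, not how any particular engine implements exactly-once (real engines typically combine this with checkpointed offsets or transactions):

```python
class IdempotentSink:
    """Toy sink: records carry unique IDs, and a replayed record
    with an already-seen ID is skipped, so each record affects the
    result exactly once even if upstream replays it after a failure."""

    def __init__(self):
        self.seen = set()
        self.total = 0

    def apply(self, record_id, value):
        if record_id in self.seen:
            return False        # duplicate from a replay: ignore it
        self.seen.add(record_id)
        self.total += value
        return True

sink = IdempotentSink()
# After a failure, suppose upstream replays records 2 and 3:
for rid, v in [(1, 10), (2, 20), (3, 30), (2, 20), (3, 30)]:
    sink.apply(rid, v)
print(sink.total)  # each record counted exactly once despite the replay
```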

But I believe these problems will be solved well, and in the future algorithms will migrate toward streaming.

<> Two. A Brief Discussion of the FTRL Algorithm (Streaming Logistic Regression)

<>1. Overview of Streaming Algorithms

Let me share some views on streaming algorithms. A streaming algorithm updates the model in real time, so from the perspective of ease of implementation, not all batch algorithms are suitable for streaming; only those whose loss functions are easy to compute are well suited to it.

Two common ways of computing the loss function are as follows:

(Note: I won't explain the specific meaning of each variable here. If you can't follow, it's time to buy a book and brush up on the basics.)
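For reference, two standard candidates are the squared loss and the logistic log loss, stated here in their textbook forms (these are my assumption of what is meant and may differ in notation from the originals; $w$ is the weight vector, $x_i, y_i$ the $i$-th sample and label, $\sigma$ the sigmoid function):

```latex
% Squared loss (regression style):
L(w) = \frac{1}{2} \sum_{i=1}^{n} \bigl( y_i - w^{\top} x_i \bigr)^2

% Logistic (log) loss, for classification:
L(w) = -\sum_{i=1}^{n} \Bigl[ y_i \log \sigma(w^{\top} x_i)
       + (1 - y_i) \log \bigl( 1 - \sigma(w^{\top} x_i) \bigr) \Bigr]
```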

The biggest difference between streaming and batch algorithms is the amount of data involved in each computation. Their computation models are as follows:

Batch algorithm: each step computes the loss function and gradient using the full dataset, then updates the model.

Streaming algorithm: each step computes the loss function and gradient using a single data point, then updates the model.
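The two update models can be contrasted in a short sketch, assuming a one-dimensional linear model $y = w \cdot x$ with squared loss. This is illustrative code, not any library's API:

```python
# Toy dataset of (x, y) pairs generated by the true model w = 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

def batch_step(w, data, lr=0.05):
    """Batch model: average the gradient over the FULL dataset,
    then apply ONE update."""
    grad = sum((w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def stream_step(w, x, y, lr=0.05):
    """Streaming model: compute the gradient from a SINGLE data
    point and update the model immediately."""
    return w - lr * (w * x - y) * x

w_batch = 0.0
for _ in range(100):            # one update per full pass
    w_batch = batch_step(w_batch, data)

w_stream = 0.0
for _ in range(100):
    for x, y in data:           # one update per record
        w_stream = stream_step(w_stream, x, y)

# On this noiseless toy data both converge toward the true w = 2;
# the streaming version simply reaches a usable model record by record.
```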

From this point of view, because the amount of data participating in each training step becomes smaller, the algorithm faces more constraints and challenges in terms of training-data sparsity and dimensionality. The FTRL algorithm, proposed by Google, is currently one of the best algorithms for online learning; it can be understood as the logistic regression of stream computing. FTRL is widely used in real-time scenarios such as advertising and product recommendation. Below is an introduction to FTRL's specific computation process.

<>2. Derivation of FTRL

Let's first look at the iterative formula of FTRL (some algorithmic background may be needed here; it helps to first understand the iterative method of logistic regression, whose derivation is discussed in my book, so I won't repeat it). The FTRL-Proximal update is:

$$w_{t+1} = \arg\min_{w}\Bigl( g_{1:t} \cdot w + \frac{1}{2}\sum_{s=1}^{t}\sigma_s\,\lVert w - w_s\rVert_2^2 + \lambda_1\lVert w\rVert_1 \Bigr)$$

To explain this formula concretely:

First, $w$ represents the model weights, and $t$ represents the iteration round.

$\arg\min_w f(w)$ denotes the value of $w$ at which $f(w)$ reaches its minimum. In this formula, each iteration chooses the $w$ that minimizes the accumulated loss; once the algorithm converges, $w$ settles to a stable value.

$w_s$ denotes the model parameters from the earlier rounds $s \le t$; the term $\frac{1}{2}\sum_s \sigma_s \lVert w - w_s \rVert_2^2$ keeps the new weights close to the previous ones.

$g_{1:t} \cdot w = \bigl(\sum_{s=1}^{t} g_s\bigr) \cdot w$ is the loss term, built from the accumulated gradients $g_s$ of the per-round loss functions.

$\lambda_1 \lVert w \rVert_1$ is the regularization term, which prevents overfitting (and, being L1, also drives many weights to exactly zero, producing a sparse model).

To sum up, FTRL's algorithmic logic does not change much compared with the batch algorithm; it is just that a great deal of testing has shown that FTRL performs well when training streaming models on sparse data and high-dimensional models.
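As a concrete illustration, here is a minimal per-coordinate FTRL-Proximal sketch for logistic regression. The hyperparameter names alpha, beta, l1, l2 follow the usual convention from Google's paper, but this is a teaching sketch under simplified assumptions, not production code:

```python
import math

class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression."""

    def __init__(self, dim, alpha=0.5, beta=1.0, l1=0.1, l2=0.1):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim  # accumulated (adjusted) gradients
        self.n = [0.0] * dim  # accumulated squared gradients

    def weight(self, i):
        """Lazily derive w_i from z_i; L1 snaps small z_i to exactly 0,
        which is what makes FTRL models sparse."""
        if abs(self.z[i]) <= self.l1:
            return 0.0
        sign = -1.0 if self.z[i] < 0 else 1.0
        return -(self.z[i] - sign * self.l1) / (
            (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2)

    def predict(self, x):
        wx = sum(self.weight(i) * xi for i, xi in enumerate(x))
        return 1.0 / (1.0 + math.exp(-wx))  # sigmoid

    def update(self, x, y):
        """One streaming step: predict, then update z and n per coordinate."""
        p = self.predict(x)
        for i, xi in enumerate(x):
            g = (p - y) * xi  # gradient of the logistic loss
            sigma = (math.sqrt(self.n[i] + g * g)
                     - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * self.weight(i)
            self.n[i] += g * g

# Tiny demo: stream two linearly separable points past the model.
model = FTRLProximal(dim=2)
for _ in range(50):
    model.update([1.0, 0.0], 1.0)  # positive example
    model.update([0.0, 1.0], 0.0)  # negative example
```

Each call to `update` is one record of the stream, so the same object can sit inside a streaming job and learn continuously.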

There are several good articles introducing FTRL, which I also recommend to you.

References:


【2】http://vividfree.github.io/ machine learning /2015/12/05/understanding-FTRL-algorithm