<> I. Background of stream computing
Technology is always changing, and working in this industry means constantly learning and adapting. That is also what attracts me most about artificial intelligence: the basic theory stays the same, but as business needs and computing power grow, the implementations built on top keep iterating. Today I'd like to share my understanding of stream computing.
Let's start with the evolution of computing engines, which I sketched in the picture above. The first generation of distributed computing engines is Hadoop, an epoch-making creation. People used Hadoop's MapReduce framework to implement many algorithms, and those algorithms have been very useful. Hadoop's most notable characteristic is that its computation depends on hard disk storage: many intermediate results have to be written to disk and then read back from disk, which leads to poor performance.
Spark's advantage is that it keeps the data in memory for computation, which greatly improves efficiency. But both Spark and Hadoop solve the batch computing problem, also known as offline computing. Offline computing has to collect the data first and then process it all together. For algorithms, this may mean faster convergence, because a large amount of data takes part in each computation. The drawback is poor real-time performance, which brings us to the next generation of computing engines: stream computing.
<>2. Stream computing
To understand stream computing, we first need to get the concepts straight. Let's start with the computation models of stream computing (stream compute) and batch computing (batch compute):
Stream computing: once a record has been processed, it is serialized into a buffer and immediately sent over the network to the next node, which continues processing it.
Batch computing: once a record has been processed, it is serialized into a buffer but not sent over the network right away. When the buffer is full, it is persisted to the local hard disk. Only after all the data has been processed is the processed data sent over the network to the next node.
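The difference between the two models can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: serialization, the network and the disk spill are left out, and all function names (make_source, streaming_stage, batch_stage, sink, run) are made up for this example. The log records the order in which things happen.

```python
def make_source(n, log):
    """Produce n records one by one, logging each production."""
    for i in range(n):
        log.append(("produce", i))
        yield i

def streaming_stage(records, log):
    """Streaming model: each record is forwarded as soon as it is processed."""
    for r in records:
        log.append(("stage1", r))
        yield r * 2  # handed to the next node immediately

def batch_stage(records, log):
    """Batch model: results are buffered and forwarded only when all input is done."""
    buffered = []  # stands in for the buffer plus the local disk
    for r in records:
        log.append(("stage1", r))
        buffered.append(r * 2)
    return buffered  # handed to the next node in one go

def sink(records, log):
    """The next node: consumes whatever the previous stage hands over."""
    out = []
    for r in records:
        log.append(("stage2", r))
        out.append(r)
    return out

def run(mode, n=2):
    """Run a two-stage pipeline in 'stream' or 'batch' mode; return result and event log."""
    log = []
    src = make_source(n, log)
    stage = streaming_stage(src, log) if mode == "stream" else batch_stage(src, log)
    return sink(stage, log), log
```

Both modes produce the same result. In stream mode the events of the two stages interleave record by record, while in batch mode the second stage only starts after the first stage has seen every record.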
Does that give you a feel for stream computing? Compared with batch compute, stream compute is more flexible for the business, because it can work with more real-time data (the time window of the data really matters; I will share my views on that when I get the chance).
Here is an example of the business advantage of stream. Suppose an e-commerce platform has a recommendation system whose models are retrained in batch every week on offline data. Then one day the company launches a marketing campaign targeted at a particular group, and a large number of these users arrive. The old model may not work for this group. If the system can train the model in real time, it can respond to this scenario much faster. This is the concept of online learning, and underneath it depends on a stream computing engine.
<>3. A true next-generation stream computing engine
Stream computing engines will be the next generation of computing engines. This does not mean that stream computing replaces batch computing. Rather, the next generation of engines will support both batch compute and stream compute and unify the two. Flink may be one answer.
Of course, stream computing is much more challenging than batch computing. Take the failover mechanism: in batch computing all intermediate results are stored and can be traced back, but how should a stream computing job recover from downtime? Or take the exactly-once mechanism: how do we guarantee that each record in a distributed stream computation is processed only once, rather than being processed repeatedly by several machines?
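To make the exactly-once question concrete, here is a minimal sketch of one simple idea: make processing idempotent by remembering which records have already been applied, so that a replay after a failure does no harm. It assumes every record carries a unique id. DedupCounter and its fields are hypothetical names, and production engines rely on more elaborate mechanisms such as checkpointing.

```python
class DedupCounter:
    """Sums record values, ignoring any record id it has already applied."""

    def __init__(self):
        self.seen_ids = set()  # ids of records whose effect is already in total
        self.total = 0

    def process(self, record_id, value):
        """Apply a record once; return False if it was a duplicate delivery."""
        if record_id in self.seen_ids:
            return False
        self.seen_ids.add(record_id)
        self.total += value
        return True

# Records 1 and 2 are delivered, then replayed after a simulated failure.
counter = DedupCounter()
for record_id, value in [(1, 10), (2, 5), (1, 10), (2, 5), (3, 7)]:
    counter.process(record_id, value)
```

The replayed records are delivered twice but counted once. In a real system the set of seen ids would itself have to survive a failure, which is part of what makes the problem hard.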
But I believe these problems will be solved well, and in the future algorithms will migrate toward streaming.
<> II. On the FTRL algorithm (streaming logistic regression)
<>1. Overview of streaming algorithms
Let me share some views on streaming algorithms. A streaming algorithm updates the model in real time. From the standpoint of ease of implementation, not every batch algorithm is suitable for streaming; the algorithms whose loss function is easy to compute are the better fit.
Two common ways of computing the loss function are as follows:
(Note: I will not explain the meaning of each variable here. If this is hard to follow, it is time to buy a book and fill in the basics.)
The biggest difference between streaming and batch algorithms is the amount of data used in each computation. Their computation models are as follows:
Batch algorithm: each update computes the loss function and gradient over the full data, then updates the model.
Streaming algorithm: each update computes the loss function and gradient from a single record, then updates the model.
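The two update styles can be put side by side for logistic regression. The sketch below is an illustration only: it uses plain gradient descent with the logistic loss, and the function names, the learning rate and the toy data are all assumptions made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Predicted probability of label 1 for feature vector x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def sample_gradient(w, x, y):
    """Gradient of the logistic loss on one record (y is 0 or 1)."""
    p = predict(w, x)
    return [(p - y) * xi for xi in x]

def batch_step(w, data, lr):
    """Batch update: average the gradient over the full data, then update once."""
    grad = [0.0] * len(w)
    for x, y in data:
        g = sample_gradient(w, x, y)
        grad = [a + b for a, b in zip(grad, g)]
    return [wi - lr * gi / len(data) for wi, gi in zip(w, grad)]

def stream_step(w, x, y, lr):
    """Streaming update: use the gradient of a single record, then update."""
    g = sample_gradient(w, x, y)
    return [wi - lr * gi for wi, gi in zip(w, g)]

# Toy data: the first feature is a constant bias input, the label follows the sign of the second.
data = [([1.0, 2.0], 1), ([1.0, -2.0], 0), ([1.0, 1.0], 1), ([1.0, -1.0], 0)]

w_batch = [0.0, 0.0]
for _ in range(50):  # 50 passes, one update per pass
    w_batch = batch_step(w_batch, data, lr=0.5)

w_stream = [0.0, 0.0]
for x, y in data:  # one pass, one update per record
    w_stream = stream_step(w_stream, x, y, lr=0.5)
```

The batch version needs the whole dataset before it can take a single step, while the streaming version has a usable model after every record. Each of its steps is based on far less information, which is where the extra constraints come from.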
Seen this way, because less data takes part in each training step, the algorithm faces more constraints and challenges in terms of training data sparsity and data dimensionality. The FTRL (Follow The Regularized Leader) algorithm was proposed by Google and is currently one of the very good algorithms for online learning. It can be understood as logistic regression for stream computing. FTRL is now widely used in real-time scenarios such as advertising and product recommendation. The following introduces how FTRL is computed.
<>2. FTRL derivation
First take a look at FTRL's iterative formula (this needs some algorithmic background; you may want to understand the iterative method for logistic regression first). The derivation of logistic regression is covered in my book, so I won't repeat it here:
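For reference, a commonly cited form of the FTRL-Proximal update from Google's ad click prediction paper is given below. Which variant is meant here is an assumption. In this form, $g_{1:t}$ is the sum of the loss gradients over rounds 1 to t, $\sigma_s$ encodes the learning-rate schedule with $\sum_{s=1}^{t} \sigma_s = 1/\eta_t$, and $\lambda_1$ is the L1 regularization strength.

```latex
w_{t+1} = \arg\min_{w} \left( g_{1:t} \cdot w
  + \frac{1}{2} \sum_{s=1}^{t} \sigma_s \lVert w - w_s \rVert_2^2
  + \lambda_1 \lVert w \rVert_1 \right)
```

The first term is a linearized loss, the second keeps w close to the parameters of earlier rounds, and the third is the regularizer.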
Here is a concrete explanation of this formula:
First, w denotes the model weights and t denotes the iteration round.
arg min f(x) denotes the set of x at which f(x) attains its minimum value. In this formula it picks the w that minimizes the objective at each iteration; that is, once the algorithm has converged, w settles at a constant value.
The formula also refers to the original model parameters of round t.
It contains a loss function term.
It contains a regularization term, which prevents overfitting.
To sum up, FTRL's algorithmic logic does not differ much from the batch algorithm. However, extensive testing has shown that FTRL works well on sparse data and high-dimensional models in streaming model training.
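As a rough illustration of why FTRL copes well with sparse data, here is a minimal sketch of the per-coordinate FTRL-Proximal update for logistic regression, following the closed-form solution as it is usually presented for Google's algorithm. The class name, the hyperparameter defaults and the dict-based sparse sample format are assumptions made for this example, not the author's implementation.

```python
import math
from collections import defaultdict

class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression on sparse samples."""

    def __init__(self, alpha=0.5, beta=1.0, l1=0.1, l2=0.1):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)  # accumulated adjusted gradients per feature
        self.n = defaultdict(float)  # accumulated squared gradients per feature

    def weight(self, i):
        """Closed-form weight for feature i; exactly zero when |z_i| <= l1."""
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0
        sign = 1.0 if z > 0 else -1.0
        denom = (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2
        return -(z - sign * self.l1) / denom

    def predict(self, x):
        """x maps feature name -> value; returns the probability of label 1."""
        s = sum(self.weight(i) * v for i, v in x.items())
        s = max(min(s, 35.0), -35.0)  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-s))

    def update(self, x, y):
        """Consume one record (y is 0 or 1) and update only the features it contains."""
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v  # gradient of the logistic loss for feature i
            w = self.weight(i)
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w
            self.n[i] += g * g
        return p

# Feed records one at a time, as a stream would deliver them.
model = FTRLProximal()
for _ in range(50):
    model.update({"a": 1.0}, 1)
    model.update({"b": 1.0}, 0)
```

Each record touches only the features it actually contains, and the L1 threshold keeps the weight of a feature at exactly zero until enough gradient has accumulated for it. Both properties suit sparse, high-dimensional data.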
Several good articles introduce FTRL, and I recommend them to you as well.
【2】http://vividfree.github.io/机器学习/2015/12/05/understanding-FTRL-algorithm