Understanding the big data technology ecosystem: Hadoop, Hive, Spark and the rest
Big data is a very broad concept in itself. The Hadoop ecosystem (or extended ecosystem) was essentially born to process data at a scale beyond what a single machine can handle.
You can compare it to a kitchen, which needs all kinds of tools. Pots, pans, bowls and ladles each have their own use, and they overlap. You can eat and drink soup straight out of the soup pot as if it were a bowl, and you can peel with either a knife or a peeler. But each tool has its own character, and although odd combinations work, they are not necessarily the best choice.
For big data, the first thing you need is to be able to store it.
Traditional file systems live on a single machine and cannot span different machines. HDFS (Hadoop Distributed FileSystem) is designed, in essence, to let a huge amount of data span hundreds of machines, while what you see is one file system rather than many. For example, when you ask for the data at /hdfs/tmp/file1, you refer to a single file path, but the actual data is stored on many different machines. As a user you don't need to know that, just as on a single computer you don't care which tracks or sectors a file is scattered across. HDFS manages the data for you.
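To make the "one path, many machines" idea concrete, here is a minimal toy sketch in Python. It is not HDFS code and uses no HDFS API: the ToyCluster class, the tiny block size, the round-robin placement and the node names are all assumptions made up for illustration, and replication is left out entirely.

```python
# Toy model only: one logical path, blocks spread over several "machines".
BLOCK_SIZE = 8  # bytes, chosen tiny for the demo; real HDFS blocks are far larger


class ToyCluster:
    def __init__(self, machine_names):
        # machine name -> {block_id: bytes}; each dict stands in for one machine's disk
        self.machines = {name: {} for name in machine_names}
        # path -> ordered list of (machine, block_id); stands in for the metadata service
        self.namenode = {}

    def write(self, path, data):
        names = list(self.machines)
        locations = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block_no = offset // BLOCK_SIZE
            machine = names[block_no % len(names)]  # round-robin placement (assumption)
            block_id = f"{path}#{block_no}"
            self.machines[machine][block_id] = data[offset:offset + BLOCK_SIZE]
            locations.append((machine, block_id))
        self.namenode[path] = locations

    def read(self, path):
        # The caller only names the path; the block locations are looked up for them.
        return b"".join(self.machines[m][b] for m, b in self.namenode[path])


cluster = ToyCluster(["node1", "node2", "node3"])
cluster.write("/hdfs/tmp/file1", b"one path, many machines underneath")
print(cluster.read("/hdfs/tmp/file1"))
print(sorted({m for m, _ in cluster.namenode["/hdfs/tmp/file1"]}))
```

The caller only ever mentions /hdfs/tmp/file1; which machine holds which block stays hidden behind the metadata lookup, which is the point the paragraph above makes.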
Once the data is stored, you start to think about how to process it.
Although HDFS manages data across different machines as a whole for you, the data is still too big. A single machine reading terabytes or petabytes of data (this is big data, after all; think of something even larger than every high-definition film Tokyo Hot has ever released put together) might grind away for days or even weeks. For many companies single-machine processing is intolerable. Weibo, for example, has to update its 24-hour trending list, so the job must finish within 24 hours. So if I process the data with many machines, I have to work out how to distribute the work, how to restart the corresponding task if a machine dies, and how the machines communicate and exchange data to complete complex computations. This is what MapReduce / Tez / Spark do. MapReduce is the first-generation compute engine; Tez and Spark are the second generation. MapReduce adopts a radically simplified computation model with only two phases, Map and Reduce (connected by a Shuffle in between), and with this model a large share of big data problems can already be handled.
So what is Map, and what is Reduce?
Suppose you want to count word frequencies in a huge text file stored on something like HDFS: you want to know how often each word appears in the text. You launch a MapReduce program. In the Map phase, hundreds of machines read different parts of the file at the same time, each counts word frequencies for the part it read, and each produces pairs such as (hello, 12100), (world, 15214) and so on (I am lumping Map and Combine together here to simplify). Each of these hundreds of machines produces such a set, and then hundreds of machines start the Reduce phase. Reducer machine A receives from the Mapper machines all the statistics for words starting with A, machine B receives the statistics for words starting with B, and so on. (In reality the split is of course not by first letter; a hash function is applied to the word to avoid data skew, because words starting with X are surely much rarer than others, and you don't want the workload to differ wildly from machine to machine.) The Reducers then aggregate once more: (hello, 12100) + (hello, 12311) + (hello, 345881) = (hello, 370292). With every Reducer doing the same, you get the word frequencies for the whole file.
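The word-count flow above can be sketched on a single machine in plain Python. This is only a simulation of the three phases, not Hadoop code: the function names, the three sample chunks and the choice of zlib.crc32 as the partitioning hash are assumptions made for the example (crc32 is used because Python's built-in hash() for strings changes between runs).

```python
from collections import Counter, defaultdict
import zlib


def map_phase(chunk):
    """Mapper with the combiner folded in: count the words in one chunk of the file."""
    return Counter(chunk.split())


def partition(word, num_reducers):
    """Choose a reducer by hashing the word, not by its first letter, to spread the load."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers


def shuffle(mapper_outputs, num_reducers):
    """Route every (word, count) pair to the reducer that owns that word."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for counts in mapper_outputs:
        for word, n in counts.items():
            buckets[partition(word, num_reducers)][word].append(n)
    return buckets


def reduce_phase(bucket):
    """Reducer: sum the partial counts received for each word."""
    return {word: sum(ns) for word, ns in bucket.items()}


def word_count(chunks, num_reducers=3):
    mapper_outputs = [map_phase(c) for c in chunks]  # would run in parallel on a cluster
    result = {}
    for bucket in shuffle(mapper_outputs, num_reducers):
        result.update(reduce_phase(bucket))  # each word lives in exactly one bucket
    return result


chunks = ["hello world hello", "world of big data", "hello big world"]
print(word_count(chunks))
```

Because a given word always hashes to the same reducer, each reducer can finish its words without talking to the others, and the final result is just the union of the reducers' outputs.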
This looks like a very simple model, but many algorithms can be described with it.
The simple Map + Reduce model is crude and brute-force: it works, but it is heavy. The second-generation Tez and Spark, apart from new features such as in-memory caching, essentially make the Map/Reduce model more general. They blur the line between Map and Reduce, make data exchange more flexible and cut down on disk reads and writes, so that complex algorithms are easier to describe and throughput is higher.
With MapReduce, Tez and Spark in hand, programmers found that writing MapReduce programs was a real hassle, and they wanted to simplify the process. It's like having assembly language: you can do almost anything with it, but it still feels tedious. You want a higher-level, more abstract language layer for describing algorithms and data processing flows. So Pig and Hive came along.
Pig describes MapReduce in something close to a scripting language, while Hive uses SQL. They translate scripts and SQL into MapReduce programs and hand them to the compute engine, so you are freed from tedious MapReduce programs and can write in a simpler, more intuitive language.
Once Hive existed, people found that SQL has huge advantages over Java. One is that it is so easy to write. The word-frequency job just described takes only a line or two in SQL, against some tens or hundreds of lines of MapReduce. More importantly, users without a computing background finally felt the love: I can write SQL too! Data analysts were finally freed from begging engineers for help, and engineers were freed from writing odd one-off processing programs. Everyone was happy. Hive gradually grew into the core component of the big data warehouse. At many companies even the pipeline job sets are described entirely in SQL, because it is easy to write, easy to change, readable at a glance and easy to maintain.
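To give a feel for how short the SQL version of word counting is, here is a small sketch that runs the query through Python's built-in sqlite3 module. SQLite is only a single-machine stand-in so the example can run anywhere; it is not Hive, and the table name words and the sample data are made up for illustration. A comparable GROUP BY query is the kind of thing Hive would translate into MapReduce jobs.

```python
import sqlite3

# In-memory database as a stand-in; a real warehouse table would live on the cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [(w,) for w in "hello world hello big data hello".split()],
)

# The whole word-frequency job is this one statement.
query = """
    SELECT word, COUNT(*) AS freq
    FROM words
    GROUP BY word
    ORDER BY freq DESC, word
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

The grouping, shuffling and summing that had to be spelled out by hand in a MapReduce program are all implied by GROUP BY and COUNT here, which is why analysts can work without an engineer in the loop.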
Once data analysts started using Hive to analyze data, they found that Hive running on MapReduce is painfully slow! For pipeline jobs that may not matter: a recommendation list updated every 24 hours just has to finish within 24 hours. But for data analysis, people always want things to run faster. Say I want to see how many people lingered on the inflatable doll page in the past hour and how long each of them stayed. For a huge website with massive data, this may take tens of minutes or even hours. And this analysis may be only the first step of your long march: you also want to see how many people viewed other pages, for instance how many looked at Rachmaninoff CDs, so that you can report to the boss whether our users are more the sleazy and repressed type or more the artsy young men and women. You can't stand the waiting, so all you can do is tell the handsome engineer: faster, faster, faster!
So Impala, Presto and Drill were born (along with countless less famous interactive SQL engines that I won't list one by one). The core idea of the three systems is that the MapReduce engine is too slow because it is too general, too robust and too conservative. SQL needs something more lightweight, more aggressive in grabbing resources, optimized more specifically for SQL, and without so many fault-tolerance guarantees (if the whole job takes only a few minutes, then on a system error you can simply restart it). These systems let users run SQL tasks much faster, at the cost of generality, stability and similar properties. If MapReduce is a machete that is not afraid to chop anything, these three are boning knives: nimble and sharp, but unable to handle anything too big or too hard.
To be honest, these systems have never become as popular as people expected, because by then two other oddballs had been created: Hive on Tez / Spark, and SparkSQL. Their design philosophy is: MapReduce is slow, but if I run SQL on a new-generation general-purpose compute engine such as Tez or Spark, I can run faster, and users don't have to maintain two systems. It's like having a small kitchen, a lazy cook and modest demands on how refined the food is: you can buy a rice cooker that steams, boils and braises, and save yourself a lot of kitchenware.
What has been described above is basically the architecture of a data warehouse.
HDFS sits at the bottom, MapReduce / Tez / Spark run on top of it, and Hive and Pig run on top of those. Alternatively, Impala, Drill and Presto run directly on HDFS. This covers medium- and low-speed data processing needs.
What if I need to process things even faster?
If I were a company like Weibo and wanted to show not a 24-hour trending list but a constantly changing one with an update delay of under a minute, none of the approaches above would be up to the job. So another computation model was developed: streaming computation.
Storm is the most popular stream computing platform. The idea of stream computing is: if I want more real-time updates, why not process the data the moment it flows in? Take word-frequency counting again: my data stream is one word after another, and I count them as they flow past me. Stream computing is impressive, with essentially no latency, but its weakness is inflexibility. What you want to count must be known in advance, because once the data has flowed past it is gone, and you cannot go back and compute what you did not count. So it is a great thing, but it cannot replace the data warehouse and batch processing systems above.
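The streaming version of word counting can be sketched in a few lines of Python. This is not Storm code and does not use any Storm API: the StreamingWordCount class and the word_stream generator are hypothetical names, and a short hard-coded list stands in for an unbounded stream.

```python
from collections import Counter


class StreamingWordCount:
    """Keeps running counts that are updated as each word arrives."""

    def __init__(self):
        self.counts = Counter()

    def on_word(self, word):
        # Only the aggregate is updated; the word itself is not stored anywhere.
        self.counts[word] += 1

    def top(self, n):
        return self.counts.most_common(n)


def word_stream():
    """Stand-in for an endless stream of incoming words (assumption for the demo)."""
    for word in ["hello", "world", "hello", "data", "hello", "world"]:
        yield word


counter = StreamingWordCount()
for word in word_stream():
    counter.on_word(word)  # the trending list is current after every single word
print(counter.top(2))
```

The sketch also shows the limitation described above: since only the counts survive, a question nobody planned for (say, which word most often follows "hello") cannot be answered afterwards, because the stream itself is gone.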
Another somewhat independent module is the KV Store, for example Cassandra, HBase, MongoDB and many, many others (more than you can imagine). A KV Store means: I have a pile of keys, and I can very quickly fetch the data bound to a given key. For example, with an ID card number I can fetch your identity data. MapReduce can do this too, but it will most likely have to scan the whole dataset. A KV Store is dedicated to this operation, and all of its storage and retrieval is optimized for it. Finding one ID number in several petabytes of data may take only a few seconds. This greatly speeds up certain specialized operations at big data companies. For example, my site has a page that looks up an order's contents by order number, and the orders of the whole site cannot fit in a single-machine database, so I would consider storing them in a KV Store. The philosophy of a KV Store is that it basically cannot handle complex computation: most cannot JOIN, may not be able to aggregate, and offer no strong consistency guarantee (different data is distributed across different machines, you may get a different result each time you read, and operations with strong consistency requirements such as bank transfers are out of reach). But boy, is it fast. Extremely fast.
Each KV Store makes different design trade-offs: some are faster, some hold more, and some support more complex operations. There is bound to be one that suits you.
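The difference between a keyed lookup and a full scan can be illustrated with a toy sharded store in Python. This is not how Cassandra, HBase or MongoDB are implemented; the ToyKVStore class, the shard count, the crc32-based sharding and the order-N keys are all assumptions invented for the example.

```python
import zlib


class ToyKVStore:
    """Toy key-value store: each dict stands in for one machine (shard)."""

    def __init__(self, num_shards=4):
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # The key alone determines which shard holds the data.
        return self.shards[zlib.crc32(key.encode("utf-8")) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        # One hash plus one dict lookup; no other shard is touched.
        return self._shard_for(key).get(key)


def scan_for_order(records, order_id):
    """Batch-style alternative: walk through the records until the key matches."""
    for key, value in records:
        if key == order_id:
            return value
    return None


orders = [(f"order-{i}", {"item": f"item-{i}"}) for i in range(1000)]
store = ToyKVStore()
for key, value in orders:
    store.put(key, value)

print(store.get("order-742"))
print(scan_for_order(orders, "order-742"))
```

Both calls return the same order, but the keyed lookup goes straight to one shard while the scan may have to read everything. Note that the store offers only put and get: there is no JOIN and no aggregation, which mirrors the trade-off described above.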
Beyond these there are some more specialized systems and components: Mahout is a distributed machine learning library, Protobuf is an encoding and library for data exchange, ZooKeeper is a highly consistent distributed coordination system, and so on.
With so many assorted tools all running on the same cluster, everyone has to respect each other and work in an orderly way. So another important component is the scheduling system. The most popular one right now is Yarn. You can think of it as central management, like your mother supervising the kitchen: hey, your sister has finished chopping the vegetables, so you can take the knife and kill the chicken. As long as everyone obeys your mother's assignments, everyone can cook happily.
You can think of the big data ecosystem as a kitchen tool ecosystem. To cook different cuisines, Chinese, Japanese or French, you need different tools. And as the guests' demands grow more complex, new kitchenware keeps being invented, and no single universal tool can handle every situation, so it will only get more and more complicated.