Understanding the big data technology ecosystem: Hadoop, Hive, Spark and the rest
Big data is a very broad concept. The Hadoop ecosystem (or the wider ecosystem around it) was born, fundamentally, to process data that is too large for a single machine.
You can compare it to a kitchen, which needs all kinds of tools. Pots, pans, bowls and ladles each have their own use, and their uses overlap. You can eat your rice and drink your soup straight from the soup pot, and you can peel with either a knife or a peeler. But each tool has its own characteristics, and although odd combinations will work, they are not necessarily the best choice.
With big data, the first thing you need is to be able to store it.
Traditional file systems live on a single machine and cannot span several. HDFS (Hadoop Distributed FileSystem) is designed to spread large amounts of data across hundreds of machines, while what you see is one file system rather than many. For example, when you ask for the data at /hdfs/tmp/file1, you refer to a single file path, but the actual data is stored on many different machines. As a user you don't need to know that, just as on a single computer you don't care which tracks and sectors a file is scattered across. HDFS manages the data for you.
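The "one path, many machines" idea can be sketched with a toy model. This is not the HDFS API: ToyNameNode, BLOCK_SIZE and the node names are hypothetical, placement is plain round-robin, and block replication is left out entirely.

```python
# Toy model of one logical path mapping to blocks on several machines.
# NOT the HDFS API: ToyNameNode, BLOCK_SIZE and the node names are
# hypothetical, and block replication is deliberately left out.

BLOCK_SIZE = 4  # bytes per block, tiny on purpose


class ToyNameNode:
    def __init__(self, nodes):
        self.nodes = nodes                      # names of the storage machines
        self.storage = {n: {} for n in nodes}   # node -> {block_id: bytes}
        self.index = {}                         # path -> [(node, block_id), ...]

    def write(self, path, data):
        placements = []
        for i in range(0, len(data), BLOCK_SIZE):
            block_no = i // BLOCK_SIZE
            node = self.nodes[block_no % len(self.nodes)]  # round-robin placement
            block_id = f"{path}#{block_no}"
            self.storage[node][block_id] = data[i:i + BLOCK_SIZE]
            placements.append((node, block_id))
        self.index[path] = placements

    def read(self, path):
        # The caller names only a path; the blocks are gathered behind the scenes.
        return b"".join(self.storage[node][bid] for node, bid in self.index[path])


fs = ToyNameNode(["node-a", "node-b", "node-c"])
fs.write("/hdfs/tmp/file1", b"hello big data")
data = fs.read("/hdfs/tmp/file1")
nodes_used = sorted({node for node, _ in fs.index["/hdfs/tmp/file1"]})
```

The reader of /hdfs/tmp/file1 gets the original bytes back without ever naming a machine, even though the blocks sit on three different nodes.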
Once the data is stored, you start thinking about how to process it.
Although HDFS can manage data across different machines as a whole for you, the data is simply too big. A single machine reading terabytes or petabytes of data (this is big data, after all: think of something larger than the complete high-definition back catalogue of a prolific film studio) might grind away for days or even weeks. For many companies, single-machine processing is intolerable. Weibo, for example, has to refresh its 24-hour trending posts, so the job must finish within 24 hours. But once I use many machines, I face new problems: how to divide up the work, how to restart a task when a machine dies, how the machines communicate and exchange data to complete a complex computation, and so on. That is what MapReduce / Tez / Spark are for. MapReduce is the first-generation computing engine; Tez and Spark are the second generation. MapReduce adopts a drastically simplified computation model with only two steps, Map and Reduce (joined in the middle by a Shuffle), and with this model a large share of big data problems can already be handled.
So what is Map, and what is Reduce?
Suppose you want to count word frequencies in a huge text file stored on something like HDFS. You start a MapReduce program. In the Map stage, hundreds of machines read different parts of the file at the same time, and each counts the word frequencies in the part it read, producing pairs such as (hello, 12100 times), (world, 15214 times) and so on (I am folding Map and Combine together here to keep things simple). Each of these hundreds of machines produces a set like this, and then hundreds of machines start the Reduce stage. Reducer machine A receives from the Mapper machines all the counts for words starting with A, and machine B receives the counts for words starting with B. (In practice the split is not really by first letter; a hash function is applied to the word and its hash value picks the reducer, to avoid data skew. Words starting with X are surely much rarer than others, and you don't want the workload to differ wildly from machine to machine.) The Reducers then aggregate once more: (hello, 12100) + (hello, 12311) + (hello, 345881) = (hello, 370292). Once every Reducer has done this, you have the word frequencies for the whole file.
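The word count above can be sketched on a single machine in plain Python. This is only an illustration of the Map, Shuffle and Reduce steps, not Hadoop's API: the function names are made up, the "machines" are list entries, and zlib.crc32 stands in for whatever hash function a real engine would use to pick a reducer.

```python
from collections import Counter, defaultdict
import zlib


def map_phase(chunk):
    """Map (+ Combine): count the words within one chunk of the file."""
    return Counter(chunk.split())


def partition(word, num_reducers):
    """Pick a reducer by hashing the word rather than by its first letter."""
    return zlib.crc32(word.encode("utf-8")) % num_reducers


def shuffle(mapper_outputs, num_reducers):
    """Route every (word, count) pair to the reducer that owns the word."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for counts in mapper_outputs:
        for word, n in counts.items():
            buckets[partition(word, num_reducers)][word].append(n)
    return buckets


def reduce_phase(bucket):
    """Reduce: sum the partial counts collected for each word."""
    return {word: sum(partials) for word, partials in bucket.items()}


# Each string plays the role of the part of the file read by one mapper.
chunks = ["hello world hello", "world of data", "hello data"]
mapper_outputs = [map_phase(c) for c in chunks]
buckets = shuffle(mapper_outputs, num_reducers=2)

word_counts = {}
for bucket in buckets:
    word_counts.update(reduce_phase(bucket))
```

Because a given word always hashes to the same reducer, each reducer sees every partial count for its words and can sum them without talking to the others.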
This looks like a very simple model, but many algorithms can be described with it.
The simple Map + Reduce model is crude and brute-force: it works, but it is heavy. The second-generation engines, Tez and Spark, add features such as in-memory caching, but in essence they make the Map/Reduce model more general: the line between Map and Reduce becomes blurrier, data exchange becomes more flexible, and there are fewer disk reads and writes, so that complex algorithms are easier to describe and throughput is higher.
With MapReduce, Tez and Spark in hand, programmers found that writing MapReduce programs was a real hassle. They wanted to simplify the process. It is like having assembly language: you can do almost anything with it, but it still feels cumbersome. You want a higher-level, more abstract language layer for describing algorithms and data processing flows. And so came Pig and Hive.
Pig describes MapReduce in something close to a scripting language, while Hive uses SQL. They translate the scripts and SQL into MapReduce programs and hand them to the computing engine, so you are freed from tedious MapReduce programs and can write in a simpler, more intuitive language.
With Hive, people found that SQL has huge advantages over Java. One is that it is so easy to write. The word frequency job just described takes only a line or two in SQL, versus tens or hundreds of lines of MapReduce. More importantly, users without a computing background finally felt the love: I can write SQL too! So data analysts were finally freed from begging engineers for help, and engineers were freed from writing odd one-off processing programs. Everyone was happy. Hive gradually grew into the core component of the big data warehouse. Many companies even describe their entire pipeline job sets in SQL, because it is easy to write, easy to change, understandable at a glance, and easy to maintain.
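To give a feel for how short the SQL version of the word count is, here is a small sketch. It uses Python's built-in sqlite3 module as a single-machine stand-in for Hive; the table name words and its single column are hypothetical, and Hive would compile a comparable query into distributed jobs rather than run it in-process.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a Hive table.
# The table name and column are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")
conn.executemany(
    "INSERT INTO words (word) VALUES (?)",
    [(w,) for w in "hello world hello data world hello".split()],
)

# The whole word count is a single GROUP BY query.
query = """
    SELECT word, COUNT(*) AS freq
    FROM words
    GROUP BY word
    ORDER BY freq DESC, word
"""
rows = conn.execute(query).fetchall()
conn.close()
```

The query says what result is wanted and leaves the splitting, shuffling and summing to the engine, which is why analysts can write it without knowing how the job is executed.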
Once data analysts started using Hive to analyze data, they found that Hive running on MapReduce is painfully slow! For some jobs that hardly matters: a recommendation list refreshed every 24 hours only has to finish within 24 hours. But for data analysis, people always want things to run faster. Say I want to know how many people lingered on the inflatable doll page in the past hour and how long each of them stayed. For a huge website with massive data, that query may take tens of minutes or even hours. And it may be only the first step of your long march: you also want to know how many people browsed other products and how many looked at a Rachmaninoff CD, so that you can report to the boss whether our users are mostly the lecherous type or mostly arty young people. You can't stand the wait, so all you can do is plead with the handsome engineers: faster, faster, a little faster!
So Impala, Presto and Drill were born (along with countless lesser-known interactive SQL engines that I won't list one by one). The core idea of these three systems is that the MapReduce engine is too slow because it is too general-purpose, too robust and too conservative. SQL needs something lighter, more aggressive in grabbing resources, optimized more specifically for SQL, and without so many fault-tolerance guarantees (if something fails, just rerun the task; that is fine when the whole job is short, say within a few minutes). These systems let users run SQL tasks much faster, at the cost of generality, stability and similar properties. If MapReduce is a machete that isn't afraid to chop anything, these three are boning knives: nimble and sharp, but not for anything too big or too hard.
To tell the truth, these systems never became as popular as people expected, because two other oddballs appeared around then: Hive on Tez / Spark, and SparkSQL. Their design philosophy is: MapReduce is slow, but if I run SQL on a new-generation general-purpose engine such as Tez or Spark, it runs faster, and users do not have to maintain two systems. It is like having a small kitchen, being lazy, and not being too fussy about how refined the food is: you can buy a rice cooker that steams, boils and braises, and save yourself a lot of cookware.
What has been described so far is basically the architecture of a data warehouse. HDFS sits at the bottom, MapReduce / Tez / Spark run on top of it, and Hive and Pig run on top of those. Alternatively, Impala, Drill or Presto run directly on HDFS. This covers low- and medium-speed data processing requirements.
What if I need to process data even faster?
Suppose I am a company like Weibo, and what I want to show is not the 24-hour trending list but a constantly changing one, updated with less than a minute of delay. None of the approaches above can do that. So another computation model was developed: streaming computation.
Storm is the most popular stream computing platform. The idea of streaming is: if I want more real-time updates, why not process the data the moment it arrives? Take word frequency again: my data stream is a sequence of words, and I count them as they flow past me. Stream computing is impressive, with essentially no delay, but its weakness is inflexibility. You must know in advance what you want to count, because once the data has flowed past it is gone, and you cannot go back for what you did not count. So it is a fine thing, but it cannot replace the data warehouse and batch processing systems above.
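The streaming word count can be sketched with a Python generator. This is not Storm's API; running_word_counts and the sample events are made up for illustration, and a real stream would be unbounded rather than a five-item iterator.

```python
from collections import Counter


def running_word_counts(stream):
    """Consume words one at a time and yield the updated count after each."""
    counts = Counter()
    for word in stream:
        counts[word] += 1          # update state the moment the event arrives
        yield word, counts[word]   # the fresh result is available immediately


# A one-shot iterator stands in for the live stream of words.
events = iter(["hello", "world", "hello", "data", "hello"])
updates = list(running_word_counts(events))

# The stream has been consumed: nothing is left to answer a new question.
leftover = next(events, None)
```

Every event produces an up-to-date count straight away, but the only thing kept is the counter that was decided on in advance. Any statistic that was not tracked while the words flowed past cannot be recovered afterwards.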
Another, fairly independent, module is the KV store, such as Cassandra, HBase, MongoDB and many, many others (more than you can imagine). A KV store means: I have a pile of keys, and I can very quickly fetch the data bound to a given key. For example, given an ID card number I can fetch your identity data. MapReduce could do this too, but it would probably have to scan the whole data set, whereas a KV store is dedicated to this operation, with all of its storage and retrieval optimized for it. Finding one ID number in several petabytes of data may take just a few seconds. This greatly speeds up some specialized operations at big data companies. For example, my website has a page that looks up an order by its order number, but the full set of orders cannot fit in a single-machine database, so I would consider storing it in a KV store. The trade-off with a KV store is that it basically cannot handle complex computation: most cannot do a JOIN, many cannot aggregate, and there is often no strong consistency guarantee (the data is spread over different machines, so successive reads may return different results, which rules out operations with strong consistency requirements such as bank transfers). But it is fast. Extremely fast.
Each KV store makes different design trade-offs: some are faster, some have higher capacity, and some support more complex operations. There is bound to be one that suits you.
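The difference between a keyed lookup and a full scan can be sketched as follows. ToyKVStore is a hypothetical in-memory model, not the API of Cassandra, HBase or MongoDB: each shard is a Python dict standing in for one machine, and zlib.crc32 is an assumed choice of hash for routing keys.

```python
import zlib


class ToyKVStore:
    """Hash-partitioned key-value store; each dict stands in for one machine."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # The key alone decides which shard to ask; no other shard is touched.
        return self.shards[zlib.crc32(key.encode("utf-8")) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key).get(key)


def scan_lookup(records, key):
    """Batch-style alternative: walk every record until the key turns up."""
    steps = 0
    for k, v in records:
        steps += 1
        if k == key:
            return v, steps
    return None, steps


# Hypothetical order data: order number -> order content.
orders = [(f"order-{i}", {"item": f"item-{i}"}) for i in range(1000)]
store = ToyKVStore(num_shards=4)
for k, v in orders:
    store.put(k, v)

by_key = store.get("order-999")
by_scan, steps = scan_lookup(orders, "order-999")
```

Both approaches return the same order, but the scan had to walk all 1000 records while the keyed lookup went straight to one shard. The same design is why there is no natural way to express a JOIN here: the store only knows how to answer "what is bound to this key?".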
Beyond these there are more specialized systems and components: Mahout is a distributed machine learning library, Protobuf is an encoding and library for data exchange, ZooKeeper is a highly consistent distributed coordination system, and so on.
With so many assorted tools all running on the same cluster, they need to respect one another and work in an orderly way. So another important component is the scheduling system. The most popular one today is Yarn. You can think of it as central management, like your mother supervising the kitchen: hey, your sister has finished chopping the vegetables, so you can take the knife and kill the chicken. As long as everyone obeys your mother's assignments, everyone can cook happily.
You can think of the big data ecosystem as an ecosystem of kitchen tools. To make different dishes, whether Chinese, Japanese or French, you need different tools. The guests' demands keep getting more complex, new kitchenware keeps being invented, and no single universal tool can handle every situation, so the whole thing will only grow more complicated.