One, What is big data? What are the characteristics of big data?
Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within a reasonable period of time. It is a massive, fast-growing, and diversified information asset that requires new processing models in order to deliver stronger decision-making power, insight discovery, and process optimization.
The 5V characteristics of big data (proposed by IBM): Volume (large amounts of data), Velocity (high speed), Variety (many types), Value (low value density), and Veracity (authenticity).
Two, What is a data warehouse (Data Warehouse)? What are its characteristics? What are OLTP and OLAP? What is the difference between them?
A data warehouse is a strategic collection that provides all types of data support for decision-making processes at every level of the enterprise. It is a single data store created for analytical reporting and decision support. For enterprises that need business intelligence, it provides guidance for improving business processes and for monitoring time, cost, quality, and control.
A data warehouse has the following characteristics:
1. Efficient enough.
2. Data quality.
3. Scalability.
4. Subject-oriented.
Operational databases organize data around transaction-processing tasks, and each business system is separate from the others. Data in a data warehouse, by contrast, is organized by subject area. A subject, as opposed to the application orientation of a traditional database, is an abstract concept: it is the abstraction obtained by integrating, classifying, and analyzing the data in enterprise information systems at a higher level. Each subject corresponds to a macro-level field of analysis. The data warehouse excludes data that is useless for decision-making and provides a concise view of a specific subject.
OLTP (online transaction processing) is also known as a transaction-oriented processing system. Its basic feature is that the customer's original data is immediately transmitted to the computing center for processing, and the result is returned within a short time. The biggest advantage is that input data is processed instantly and answered promptly, which is why such a system is also called a real-time system (Real Time System). One of the most important performance indicators of an online transaction processing system is its real-time response time (Response Time), that is, the time the computer takes to respond to a request after the user submits data at a terminal. OLTP is performed by the database engine.
OLTP databases are designed so that transactional applications write only the data they need, in order to process a single transaction as quickly as possible.
Online analytical processing (OLAP) is the main application of a data warehouse system. It is designed specifically to support complex analytical operations and focuses on decision support for decision makers and senior managers. It can run complex queries over large amounts of data quickly and flexibly according to the analyst's requirements, and it presents the results to decision makers in an intuitive, understandable form, so that they can accurately grasp the business status of the enterprise (company), understand what their customers need, and make the right plans.
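The contrast between the two workloads can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the `orders` table, its columns, and the helper names `place_order` and `sales_by_region` are hypothetical, and a real data warehouse would normally be a separate store rather than the same database that handles transactions.

```python
import sqlite3

# In-memory database; the "orders" table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)


def place_order(region, amount):
    """OLTP-style operation: one short transaction that writes a single row."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT INTO orders (region, amount) VALUES (?, ?)",
            (region, amount),
        )
    return cur.lastrowid


def sales_by_region():
    """OLAP-style operation: scan and aggregate many rows for a report."""
    return conn.execute(
        "SELECT region, COUNT(*), SUM(amount) FROM orders "
        "GROUP BY region ORDER BY region"
    ).fetchall()


# Many small writes (OLTP) followed by one analytical read (OLAP).
for region, amount in [("east", 10.0), ("west", 20.0), ("east", 5.0)]:
    place_order(region, amount)

report = sales_by_region()
print(report)
```

Each call to `place_order` touches one row and returns quickly, while `sales_by_region` reads the whole table to produce a summary, which is roughly the difference in access pattern described above.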
Three, ETL (Extract-Transform-Load) and data mining (Data Mining)
Data analysis starts from the purpose of the analysis and uses appropriate statistical methods and tools to process and analyze the collected data, extract valuable information, and put the data to use. It has three main functions: current situation analysis, cause analysis, and forecast analysis (quantitative). The goal of data analysis is clear: a hypothesis is made first, the hypothesis is then verified through analysis, and a conclusion is drawn. The methods most commonly used are comparative analysis, group analysis, cross analysis, and regression analysis. The result of data analysis is usually a set of summary indicators, such as a sum or an average, and these indicators must be interpreted together with the business context in order to realize the value of the data.
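As a small illustration of group analysis producing summary indicators, the sketch below computes a count, sum, and average per group using only the Python standard library. The sample records and the function name `summarize_by_group` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample data: (customer group, order amount).
orders = [
    ("new", 20.0),
    ("new", 40.0),
    ("returning", 100.0),
    ("returning", 60.0),
    ("returning", 80.0),
]


def summarize_by_group(records):
    """Group the records by label and compute count, total, and average."""
    groups = defaultdict(list)
    for group, amount in records:
        groups[group].append(amount)
    return {
        group: {
            "count": len(values),
            "total": sum(values),
            "average": mean(values),
        }
        for group, values in groups.items()
    }


summary = summarize_by_group(orders)
for group, stats in summary.items():
    print(group, stats)
```

The numbers alone do not say why returning customers spend more in this made-up sample; that interpretation has to come from the business context.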
Data mining is the process of digging previously unknown and valuable information and knowledge out of large amounts of data by means of statistics, artificial intelligence, machine learning, and similar methods. It mainly addresses four kinds of problems: classification, clustering, association, and prediction (quantitative and qualitative). The focus of data mining is on finding unknown patterns and rules. The cases people often cite, such as beer and diapers or condoms and chocolate, are examples of information that was unknown beforehand but turned out to be very valuable. The main techniques are decision trees, neural networks, association rules, cluster analysis, and other methods from statistics, artificial intelligence, and machine learning. The output is a model or a set of rules, from which model scores or labels can be obtained. Model scores include churn probability, total score, similarity, and predicted values; labels include high-, medium-, and low-value users, churned or not churned, and good, medium, or poor credit.
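The beer-and-diapers case is usually explained with association rules, whose two basic measures are support (how often the items appear together) and confidence (how often the consequent appears when the antecedent does). The following is a minimal sketch of those two measures only, not a full mining algorithm such as Apriori; the baskets and function names are hypothetical.

```python
# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from the transactions."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


rule_support = support({"diapers", "beer"}, baskets)
rule_confidence = confidence({"diapers"}, {"beer"}, baskets)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

In this made-up data, two of the five baskets contain both items and two of the three diaper baskets also contain beer, so the rule "diapers implies beer" has a support of 0.4 and a confidence of about 0.67.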
Four, What is Hadoop?
Hadoop is a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without understanding the underlying details of the distribution, making full use of the power of a cluster for high-speed computation and storage. Hadoop implements a distributed file system (Hadoop Distributed File System), abbreviated HDFS. HDFS is highly fault tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suited to applications with very large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream (streaming access). The core of the Hadoop framework design is HDFS and MapReduce: HDFS provides storage for massive data, while MapReduce provides computation over that data.
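The MapReduce computation model can be illustrated with a word count that runs in a single Python process. This sketch does not use the Hadoop API; it only imitates the three stages (map, shuffle, reduce) that the framework would distribute across a cluster, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big ideas", "data moves fast"])
print(counts)
```

In a real cluster the input lines would come from blocks stored in HDFS, the map and reduce functions would run on many machines in parallel, and the framework would handle the shuffle between them.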