One. What is big data? What are its characteristics?

Big data refers to data sets that conventional software tools cannot capture, manage, and process within a reasonable period of time. It is a massive, fast-growing, and diversified information asset that requires new processing models in order to deliver stronger decision-making power, insight discovery, and process optimization.

    
The 5V characteristics of big data (proposed by IBM): Volume (large amounts of data), Velocity (high speed), Variety (many types of data), Value (low value density), and Veracity (authenticity).

Two. What is a data warehouse? What are its characteristics? What are OLTP and OLAP, and how do they differ?


A data warehouse is a strategic collection of data that supports decision making at every level of an enterprise with all types of data. It is a single data store created for analytical reporting and decision support. For enterprises that need business intelligence, it provides guidance for improving business processes and for monitoring and controlling time, cost, and quality.

A data warehouse has four main characteristics:

1. High efficiency.
2. Data quality.
3. Scalability.
4. Subject orientation.


     
Subject orientation: an operational database organizes its data around transaction-processing tasks, and each business system is kept separate from the others. The data in a data warehouse, by contrast, is organized by subject area. A subject is the counterpart of the application orientation of a traditional database. It is an abstract concept: the data in the enterprise information systems is integrated, classified, and analyzed at a higher level of abstraction. Each subject corresponds to one macro-level field of analysis. The data warehouse excludes data that is useless for decision making and provides a concise view of a specific subject.

OLTP:


Online transaction processing (OLTP) is also known as transaction-oriented processing. Its basic feature is that the customer's original data is transmitted immediately to the computing center for processing, and the result is returned within a short time.

The biggest advantage is that input data is processed instantly and answered promptly, which is why such a system is also called a real-time system. The most important performance metric of an OLTP system is its response time, that is, the time the computer needs to respond to a request after the user submits data at the terminal. OLTP is carried out by the database engine. OLTP databases are designed so that transactional applications write only the data they need, which lets each individual transaction be processed as quickly as possible.




OLAP:


Online analytical processing (OLAP) is the main application of a data warehouse system. It is designed specifically to support complex analytical operations and focuses on decision support for decision makers and senior managers. It can run complex queries over large amounts of data quickly and flexibly according to the analyst's requirements, and it presents the results in an intuitive, understandable form, so that decision makers can accurately grasp the business status of the enterprise, understand customer needs, and make the right plans.

In short, OLTP handles many small, short transactions that read and write individual records with low response times, while OLAP handles fewer but far more complex queries that aggregate large volumes of historical data for analysis.
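The contrast between the two workloads can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module purely for convenience; the `orders` table, its columns, and the function names are assumptions made for this example, and a real system would normally keep the OLTP database and the data warehouse separate.

```python
import sqlite3


def build_orders_db():
    """Create an in-memory database with a hypothetical `orders` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    return conn


def record_order(conn, region, amount):
    """OLTP-style operation: one short transaction that writes a single row."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO orders (region, amount) VALUES (?, ?)", (region, amount)
        )


def sales_by_region(conn):
    """OLAP-style operation: an aggregate query that scans many rows."""
    rows = conn.execute(
        "SELECT region, COUNT(*), SUM(amount) FROM orders "
        "GROUP BY region ORDER BY region"
    )
    return rows.fetchall()


conn = build_orders_db()
for region, amount in [("east", 10.0), ("west", 5.0), ("east", 2.5)]:
    record_order(conn, region, amount)

report = sales_by_region(conn)
print(report)
```

Each call to `record_order` touches one row and should return quickly, whereas `sales_by_region` reads the whole table to produce a summary for analysis.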

Three. What is the difference between ETL (Extract-Transform-Load) and data mining?


     
Data analysis starts from the purpose of the analysis and uses appropriate statistical methods and tools to process and analyze the collected data, extract valuable information, and put the data to use. It has three main functions: current-situation analysis, cause analysis, and predictive analysis (quantitative). The goal of data analysis is clear: a hypothesis is made first and then verified through the analysis, which leads to the corresponding conclusions. The most common methods are comparative analysis, group analysis, cross analysis, and regression analysis. The result of data analysis is usually a set of statistical indicators, such as sums and averages, and these indicators must be interpreted in the business context before the data delivers its value.
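As a minimal sketch of group analysis producing indicator statistics, the following computes a count, sum, and average per group. The records, the field names `channel` and `amount`, and the function `summarize_by_group` are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_group(records, group_key, value_key):
    """Group records by one field and compute count, sum, and mean of another."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])
    return {
        group: {"count": len(values), "sum": sum(values), "mean": mean(values)}
        for group, values in groups.items()
    }


# Toy data, invented for this example.
orders = [
    {"channel": "web", "amount": 120},
    {"channel": "store", "amount": 80},
    {"channel": "web", "amount": 60},
]

summary = summarize_by_group(orders, "channel", "amount")
print(summary)
```

The numbers themselves say little until they are read against the business: whether an average order of 90 on the web channel is good depends on the target that was set for it.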

     
Data mining is the process of discovering previously unknown and valuable information and knowledge in large amounts of data by means of statistics, artificial intelligence, machine learning, and similar methods. It mainly addresses four kinds of problems: classification, clustering, association, and prediction (quantitative and qualitative). Its focus is on finding unknown patterns and rules. The frequently cited data mining cases, such as beer and diapers or condoms and chocolate, are examples of information that was unknown beforehand but turned out to be very valuable. The main techniques are decision trees, neural networks, association rules, and cluster analysis. The output is a model or a set of rules, from which model scores or labels can be derived. Examples of model scores are churn probability, total score, similarity, and predicted values; examples of labels are high-, medium-, and low-value users, churned versus retained, and good, medium, or poor credit.
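To make the association-rule idea behind the beer-and-diapers case concrete, here is a small sketch that computes the two standard measures, support and confidence. The baskets are toy data invented for this example, and the function names are illustrative rather than taken from any library.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in transactions if items <= set(basket))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of `consequent` given `antecedent`."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base


# Toy shopping baskets, invented for this example.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
]

print(support(baskets, {"beer", "diapers"}))
print(confidence(baskets, {"diapers"}, {"beer"}))
```

A rule such as "diapers implies beer" is usually kept only if both measures exceed chosen thresholds; real association-rule mining also needs an algorithm to search the candidate item sets, which this sketch leaves out.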

Four. What is Hadoop?

      
Hadoop is a distributed system infrastructure developed by the Apache Foundation. It lets users develop distributed programs without understanding the underlying details of the distribution, and make full use of the power of a cluster for high-speed computation and storage. Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data, which makes it suitable for applications with very large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream (streaming access). The core of the Hadoop framework is HDFS and MapReduce: HDFS provides storage for massive data, while MapReduce provides computation over it.
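The MapReduce computation model can be sketched in plain Python with the classic word-count example. This is a single-process illustration of the map, shuffle, and reduce steps and not the Hadoop API; the function names are chosen for this example only. In a real cluster, the map and reduce tasks run in parallel on many nodes and read their input from HDFS.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big", "data warehouse"])
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both steps can be spread across machines, which is what allows the model to scale to massive data.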