One, What is big data? What are its characteristics?

Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within an acceptable time frame. It is a massive, fast-growing, and diversified information asset that needs new processing models before it can deliver stronger decision-making power, insight discovery, and process optimization.

The 5V characteristics of big data (proposed by IBM): Volume (large amounts of data), Velocity (high speed), Variety (many kinds of data), Value (low value density), Veracity (authenticity).

Two, What is a data warehouse? What are its characteristics? What are OLTP and OLAP, and what is the difference between them?

A data warehouse is a strategic collection of data that supports decision-making at every level of the enterprise and across all types of data. It is a single data store created for analytical reporting and decision support. For businesses that need business intelligence, it guides business process improvement and helps monitor and control time, cost, and quality.

A data warehouse has the following characteristics:

1. Efficiency
2. Data quality
3. Scalability
4. Subject orientation

Operational databases organize data around transaction-processing tasks, with each business system kept separate. Data in a data warehouse, by contrast, is organized by subject area. A subject is the counterpart of the application orientation of a traditional database. It is an abstract concept: a higher-level abstraction that consolidates, classifies, and analyzes the data in the enterprise information systems. Each subject corresponds to a macro-level analysis domain. The data warehouse excludes data that is useless for decision-making and provides a concise view of a specific subject.


OLTP (online transaction processing) is also known as a transaction-oriented processing system. Its basic feature is that the customer's original data is transferred immediately to the computing center for processing, and the result is returned within a very short time.

The biggest advantage is that input data is processed in real time and answered promptly, which is why such a system is also called a real-time system. A key performance metric of an OLTP system is its response time: the time from when the user submits data at the terminal to when the computer responds to the request. OLTP is carried out by the database engine. An OLTP database is designed so that transactional applications write only the data they need, so that individual transactions are processed as quickly as possible.


Online analytical processing (OLAP) is the most important application of a data warehouse system. It is designed to support complex analytical operations and focuses on decision support for decision-makers and senior management. It can run complex queries over large amounts of data quickly and flexibly according to the analyst's requirements, and it presents the results to decision-makers in an intuitive, understandable form, so that they can accurately grasp the business status of the enterprise, understand the needs of those they serve, and make the right plans.
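The contrast between the two workloads can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the `orders` table, its columns, and the helper names `place_order` and `revenue_by_region` are hypothetical, and a real deployment would normally run OLTP and OLAP on separate systems rather than one in-memory database.

```python
import sqlite3

# Hypothetical schema for illustration; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)

def place_order(region, amount):
    """OLTP-style operation: one small write, committed as a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO orders (region, amount) VALUES (?, ?)", (region, amount)
        )

def revenue_by_region():
    """OLAP-style operation: a read-only aggregate that scans many rows."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    ).fetchall()
    return dict(rows)

# Many short transactions, each touching a single row.
for region, amount in [("north", 10.0), ("south", 25.0), ("north", 5.0)]:
    place_order(region, amount)

# One analytical query summarizing all rows for a decision-maker.
print(revenue_by_region())
```

The write path touches one row per transaction and must respond quickly, while the analytical query reads the whole table and returns a summary.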

Three, ETL (Extract-Transform-Load) and data mining (Data Mining)


Data analysis starts from the purpose of the analysis and uses appropriate statistical methods and tools to process and analyze the collected data, extract valuable information, and put the data to use. It mainly serves three functions: analysis of the current situation, analysis of causes, and predictive (quantitative) analysis. Data analysis has a clear goal: first make a hypothesis, then use the data to verify whether the hypothesis holds, and draw the corresponding conclusion. The common methods include comparative analysis, group analysis, cross analysis, and regression analysis. The output of data analysis is generally a statistical indicator such as a sum or an average, and these indicators must be interpreted in the business context before the data can deliver its value.
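As a minimal sketch of group analysis and hypothesis checking, the following uses only the Python standard library. The sample records, the channel names, and the function `summarize_by_group` are made up for illustration and do not come from any real data set.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample records of (sales channel, order amount); purely illustrative.
orders = [
    ("web", 120.0), ("web", 80.0),
    ("store", 50.0), ("store", 70.0), ("store", 60.0),
]

def summarize_by_group(records):
    """Group analysis: compute count, total, and average per group."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {
        key: {"count": len(values), "total": sum(values), "average": mean(values)}
        for key, values in groups.items()
    }

summary = summarize_by_group(orders)

# Hypothesis to verify: web orders are larger on average than store orders.
hypothesis_holds = summary["web"]["average"] > summary["store"]["average"]
print(summary)
print("hypothesis holds:", hypothesis_holds)
```

The numbers alone do not explain why the channels differ; that interpretation still has to come from the business context.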

Data mining is the process of extracting previously unknown, valuable information and knowledge from large amounts of data by means of statistics, artificial intelligence, machine learning, and similar methods. It focuses on four kinds of problems: classification, clustering, association, and prediction (quantitative and qualitative). Its emphasis is on finding unknown patterns and rules. The often-cited data mining cases, such as beer and diapers or condoms and chocolate, are relationships that nobody knew in advance but that turn out to be very valuable. Data mining mainly uses decision trees, neural networks, association rules, cluster analysis, and other methods from statistics, artificial intelligence, and machine learning. Its output is a model or a set of rules, from which model scores or labels can be obtained. Examples of model scores are churn probability, total score, similarity, and predicted value. Examples of labels are high-, medium-, and low-value users, churned and not churned, and good, medium, and poor credit.
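The association problem behind the beer-and-diapers case is usually measured with support and confidence. The sketch below computes both for a tiny, invented set of shopping baskets; the data and the function names `support` and `confidence` are illustrative assumptions, and a real miner would use an algorithm such as Apriori over far more transactions.

```python
# Invented shopping baskets for illustration only.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

A rule is normally kept only if both values exceed thresholds chosen by the analyst.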

Four, What is Hadoop?

Hadoop is a distributed system infrastructure developed by the Apache Foundation. It lets users develop distributed programs without knowing the underlying details of the distribution, and make full use of a cluster's power for high-speed computation and storage. Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data, which suits applications with large data sets. HDFS relaxes some POSIX requirements so that data in the file system can be accessed as a stream (streaming access). The core of the Hadoop framework is HDFS and MapReduce: HDFS provides storage for massive data, and MapReduce provides computation over massive data.
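The MapReduce programming model can be sketched in plain Python with the classic word-count example. This does not use the Hadoop API and runs in a single process; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and only mirror the phases that Hadoop would distribute across a cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

print(word_count(["big data", "Big cluster"]))
```

In a real Hadoop job, the input lines would be read from HDFS blocks, the map and reduce functions would run in parallel on different nodes, and the framework would perform the shuffle over the network.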