1. What skills should data analysts have?

Demand for data analysis talent is running high. On the one hand, the volume of data inside companies is growing rapidly, and with it the demand for data analysis; on the other hand, compared with other technical positions, there are far fewer qualified candidates for data analyst roles.


The most effective way to map out a clear learning path is to look at specific job postings and the skills they actually require.
We picked some representative data analyst postings from recruitment sites to see what skills well-paid data analysts are expected to have.





In practice, the basic skills required of data analysts vary little from posting to posting. They can be summarized as follows:

* Basic database operations and data management with SQL
* Basic data analysis and presentation with Excel/SQL
* Data analysis with a scripting language, Python or R
* Ability to acquire external data, for example with web crawlers
* Basic data visualization skills and the ability to write data reports
* Familiarity with common data mining algorithms: mainly regression analysis, decision trees, logistic regression, SVM, random forests, etc.
2. The data analysis process


Next comes the data analysis process. In general, a data analysis project can be carried out in the order "data acquisition - data storage and extraction - data preprocessing - data modeling and analysis - data visualization". Following this process, the detailed knowledge points to master in each part are as follows:



What is an efficient learning path? It is exactly this data analysis process. Working through it step by step, you will know what each part needs to accomplish, which knowledge points you need to learn, and which you can safely skip for now.

3. How to learn data analysis skills?
Next, let's look at what to learn in each part and how to learn it.

3.1 Data acquisition: public datasets, Python crawlers

There are two main ways to obtain external data.


The first is to download public datasets. Some research institutions, companies, and government agencies open up their data, which you can download from their websites. These datasets are usually fairly complete and of relatively high quality. Some common sites for obtaining datasets:

UCI: classic open datasets from the University of California, Irvine, used by many data mining labs.
http://archive.ics.uci.edu/ml/datasets.html

National Data: data from China's National Bureau of Statistics, covering the national economy and people's livelihood.
http://data.stats.gov.cn/

CEIC: economic data for more than 128 countries, where you can find detailed series such as GDP, imports and exports, and retail sales.
http://www.ceicdata.com/zh-hans

China Statistical Information Network: run under the National Bureau of Statistics, it collects statistics on national economic and social development.
http://www.tjcn.org/

Youedata: initiated by the State Information Center, a leading data trading platform in China with plenty of free data.
http://www.youedata.com/

The other way to obtain external data is web crawling.


For example, with a crawler you can collect job postings for a given position from recruitment websites, rental listings for a city from housing sites, the highest-rated movies on Douban, the most-upvoted answers on Zhihu, or the most-commented playlists on NetEase Cloud Music. Based on data crawled from the web, you can analyze a particular industry or a particular group of users.

Before crawling, you need some basic Python knowledge: data structures (lists, dictionaries, tuples, etc.), variables, loops, and functions.

You also need to know how to implement a web crawler with Python libraries (urllib, BeautifulSoup, requests, scrapy). For beginners, starting with urllib + BeautifulSoup is recommended.
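As a rough sketch of what a urllib + BeautifulSoup crawler looks like (the URL and the tags being selected here are placeholders, not a real page structure):

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# Placeholder URL: replace with the page you actually want to crawl.
url = "https://example.com/listings"

# Many sites reject requests without a User-Agent header.
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
html = urlopen(req).read()

soup = BeautifulSoup(html, "html.parser")

# Hypothetical selection: adjust the tag/class to match the real page.
for link in soup.find_all("a"):
    print(link.get_text(strip=True), link.get("href"))
```

Once the parsed fields print correctly, you would typically write them to a CSV file or a database for later analysis.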

Popular e-commerce sites, Q&A sites, second-hand trading sites, dating sites, recruitment sites, and so on can all yield very valuable data.

3.2 Data access: the SQL language


When the data runs to no more than about ten thousand rows, Excel has no problem with general analysis; once the volume gets larger, it struggles, and a database solves this problem well. Most companies store their data in SQL databases, so as an analyst you need at least to understand SQL operations and be able to query and extract the company's data.

As the most classic database tool, SQL makes it possible to store and manage massive amounts of data and greatly improves the efficiency of data extraction. You need to master the following skills:


Extracting data under specific conditions: the data in an enterprise database is bound to be large and messy, and you need to extract just the part you need. For example, you might pull all sales data for 2017, the 50 best-selling items of the year, or the consumption data of users in Shanghai and Guangdong. SQL can do all of this with simple statements.

Inserting, deleting, querying, and updating: these are the most basic database operations, each done with a simple statement, so you just need to remember the commands.


Grouping and aggregating data, and joining multiple tables: this is the more advanced part of SQL. Joins across multiple tables are very useful when you work with multiple dimensions or multiple datasets, and they let you handle more complex data.

This part of SQL is relatively simple; it is mainly about mastering a few basic statements. Even so, it is worth finding some datasets to practice on, starting with the most basic queries and extractions.
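As a small sketch of these basic operations, using Python's built-in sqlite3 module and a made-up sales table (the schema and values are purely illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Create and populate a toy sales table (hypothetical schema).
cur.execute("CREATE TABLE sales (item TEXT, city TEXT, year INTEGER, amount REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("A", "Shanghai", 2017, 120.0),
        ("B", "Guangdong", 2017, 80.0),
        ("A", "Shanghai", 2016, 60.0),
    ],
)

# Extract data under a specific condition: all sales for 2017.
cur.execute("SELECT * FROM sales WHERE year = 2017")
print(cur.fetchall())

# Group aggregation: total sales per city, largest first.
cur.execute(
    "SELECT city, SUM(amount) FROM sales GROUP BY city ORDER BY SUM(amount) DESC"
)
print(cur.fetchall())

conn.close()
```

The same SELECT / WHERE / GROUP BY / JOIN statements apply essentially unchanged in MySQL, PostgreSQL, or whatever database your company uses.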

3.3 Data preprocessing: Python (pandas)

Most of the time, the data we get is not clean: there are duplicates, missing values, outliers, and so on. It needs to be cleaned, with the data that would distort the analysis handled properly, in order to obtain more accurate results.

For example, in sales data, some channels may not record sales in time and some records may be duplicated. In user behavior data, there are many invalid operations that are meaningless for the analysis and need to be deleted.

Each problem then calls for an appropriate treatment. For incomplete data, for instance, do we drop those rows or fill them in from neighboring values? These are all questions to consider.

For data preprocessing, learning how to use pandas (a Python package) is enough to handle general data cleaning. The knowledge points to master are as follows:

Selection: accessing data by label, by value, by boolean index, etc.
Missing values: deleting or filling in rows with missing data
Duplicates: identifying and removing duplicate values
Outliers and noise: stripping unnecessary whitespace, extreme values, and abnormal data
Related operations: descriptive statistics, apply, histograms, etc.
Merging: combining data according to various logical relationships
Grouping: splitting data, applying functions to each group, recombining the results
Reshaping: quickly generating pivot tables

There are plenty of pandas tutorials on the Internet, mostly covering how to apply individual functions; they are quite simple, and the official pandas documentation is a reliable reference.
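As a minimal sketch of these cleaning steps in pandas (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy data with stray whitespace, a duplicate row, and a missing value.
df = pd.DataFrame({
    "channel": [" online", "store", "store", "store"],
    "amount": [100.0, np.nan, 80.0, 80.0],
})

df["channel"] = df["channel"].str.strip()                   # clean up whitespace
df = df.drop_duplicates()                                   # remove duplicate rows
df["amount"] = df["amount"].fillna(df["amount"].median())   # fill missing values

# Selection and grouping: rows with amount >= 80, totalled per channel.
big = df[df["amount"] >= 80]
print(big.groupby("channel")["amount"].sum())
```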

3.4 Probability theory and statistics


What is the overall distribution of the data? What are a population and a sample? How do you apply basic statistics such as the median, mode, mean, and variance? How do you run hypothesis tests in different scenarios? Most data analysis methods come from statistical concepts, so statistical knowledge is essential. The knowledge points to master are as follows:

Basic statistics: mean, median, mode, percentiles, extreme values, etc.
Other descriptive statistics: skewness, variance, standard deviation, significance, etc.
Other statistical knowledge: populations and samples, parameters and statistics, error bars
Probability distributions and hypothesis testing: common distributions, the hypothesis-testing workflow
Other probability theory: conditional probability, Bayes' theorem, etc.


With this basic statistical knowledge, you can already do a lot of basic analysis. By describing the key indicators of the data visually, you can draw many conclusions: what makes up the top 100, what the average is, how the trend has changed in recent years, and so on.

You can use Seaborn, matplotlib, and other Python packages for visual analysis, producing various statistical charts and drawing instructive conclusions from them.
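As a small sketch of this kind of descriptive and visual analysis (the data is randomly generated to stand in for a real metric):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Simulated data standing in for a real metric such as order amounts.
rng = np.random.default_rng(0)
amounts = pd.Series(rng.normal(loc=100, scale=20, size=1000))

# Basic descriptive statistics: mean, std, percentiles, skewness.
print(amounts.describe())
print("skewness:", amounts.skew())

# Histogram of the distribution.
sns.histplot(amounts, bins=30)
plt.xlabel("amount")
plt.title("Distribution of the simulated metric")
plt.show()
```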

3.5 Data analysis with Python

If you have looked around, you know there are plenty of Python data analysis books on the market, but each one is thick and daunting to work through. In fact, the genuinely useful material makes up only a small part of those books.

For example, once you master regression analysis, with linear regression and logistic regression you can model most data and draw reasonably accurate conclusions. The knowledge points to master in this part are as follows:

Regression analysis: linear regression, logistic regression
Basic classification algorithms: decision trees, random forests, etc.
Basic clustering algorithms: k-means, etc.
Feature engineering basics: how to use feature selection to improve a model
Parameter tuning: how to tune parameters to optimize a model
Python data analysis packages: scipy, numpy, scikit-learn, etc.

At this stage of data analysis, focus on regression analysis; it can handle most problems. Combining descriptive statistics with regression analysis is already enough for a solid piece of analysis.


Over time you will learn which algorithms and models suit which types of problems. To optimize a model, you need to learn feature extraction and parameter tuning to improve predictive accuracy. This shades into data mining and machine learning; in fact, a good data analyst should be able to work as a junior data mining engineer.

You can use the scikit-learn library in Python to carry out the whole process of data mining, modeling, and analysis.
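As a minimal sketch of that workflow in scikit-learn, using one of its built-in toy datasets in place of real business data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy dataset standing in for real business data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale the features, then fit a logistic regression model.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000)
model.fit(scaler.transform(X_train), y_train)

# Evaluate on the held-out test set.
pred = model.predict(scaler.transform(X_test))
print("accuracy:", accuracy_score(y_test, pred))
```

Swapping in a decision tree, random forest, or k-means model is mostly a matter of changing one import and one constructor call; the train / fit / evaluate pattern stays the same.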

4. Join us

To make it easier to communicate, we have set up a data analysis group; you are welcome to join the data analysis exchange group.