1. What skills should a data analyst have?

Demand for data analysis talent is high. On the one hand, enterprise data volumes are growing rapidly, and the demand for data analysis is growing with them; on the other hand, compared with other technical positions, there are far fewer candidates for data analyst roles.

To chart a clear learning path, the most effective approach is to look at the skills that specific jobs actually require.
We picked some of the most representative data analyst positions from Lagou. Let's look at what skills well-paid data analysts need.

In fact, the basic skills required of data analysts vary little from job to job, and can be summarized as follows:

* SQL: basic database operations and basic data management
* Basic data analysis and presentation with Excel/SQL
* Data analysis with a scripting language, Python or R
* The ability to acquire external data, for example with a web crawler
* Basic data visualization skills and the ability to write data reports
* Familiarity with common data mining algorithms: mainly regression analysis, decision trees, logistic regression, SVM, random forests, etc.

2. The data analysis process

Next is the data analysis process. A data analysis project can generally be carried out in the order "data acquisition - data storage and extraction - data preprocessing - data modeling and analysis - data visualization". Each stage of this process has its own knowledge points to master.

What is an efficient learning path? It is the data analysis process itself. Working through it step by step in this order, you will know what each part needs to accomplish, which knowledge points you need to learn, and which ones you can set aside for now.

3. How to learn data analysis skills
Next, we will go through what to learn in each part and how to learn it.

3.1 Data acquisition: open data and Python crawlers

There are two main ways to obtain external data.

The first is to use public external data sets. Some scientific research institutions, enterprises, and governments open up part of their data, which you download from specific websites. These data sets are usually relatively complete and of relatively high quality. Some common sites where you can get data sets:

UCI: the classic open data sets from the University of California, Irvine, used by many data mining laboratories.

National data: data from the National Bureau of Statistics of China, covering China's economy and people's livelihood.

CEIC: economic data for more than 128 countries, where you can find precise in-depth data such as GDP, imports and exports, retail, and sales.

China Statistical Information Network: the official website of the National Bureau of Statistics, which collects statistics on national economic and social development.

Youyi data: initiated by the National Information Center, a leading data trading platform in China, with a lot of free data.

The other way to get external data is a web crawler.

For example, with a crawler you can collect the postings for a certain position on a recruitment website, the rental listings for a city on a rental website, the list of top-rated movies on Douban, the ranking of Zhihu answers by likes, or the ranking of NetEase Cloud Music comments. With data crawled from the Internet, you can analyze a particular industry or a particular group of people.

Before you start crawling, you need some basic Python knowledge: basic elements (lists, dictionaries, tuples, etc.), variables, loops, functions, and so on.

You also need to know how to use Python libraries (urllib, BeautifulSoup, requests, scrapy) to implement a web crawler. If you are a beginner, it is recommended to start with urllib + BeautifulSoup.
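
As a rough illustration of the urllib + BeautifulSoup combination, here is a minimal sketch. The page structure (`li.job`, `.title`, `.salary`), the sample HTML, and the helper names `fetch_html` and `parse_jobs` are all made up for this example; a real site will need its own selectors, and you should check its terms of use before crawling it.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def fetch_html(url):
    """Download a page with urllib and return its text (UTF-8 is assumed here)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_jobs(html):
    """Extract title and salary from every <li class="job"> element."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for item in soup.select("li.job"):
        jobs.append({
            "title": item.select_one(".title").get_text(strip=True),
            "salary": item.select_one(".salary").get_text(strip=True),
        })
    return jobs


# Hypothetical page content, used instead of a live request so the sketch runs offline.
SAMPLE_HTML = """
<ul>
  <li class="job"><span class="title">Data Analyst</span><span class="salary">15k-25k</span></li>
  <li class="job"><span class="title">BI Analyst</span><span class="salary">12k-20k</span></li>
</ul>
"""

jobs = parse_jobs(SAMPLE_HTML)
print(jobs)
```

To crawl a live page you would pass the result of `fetch_html(url)` to `parse_jobs` instead of the sample string.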

Popular e-commerce sites, Q&A sites, second-hand trading sites, matchmaking sites, recruitment sites, and so on are all places where you can crawl very valuable data.

3.2 Data access: the SQL language

When you are dealing with fewer than ten thousand rows of data, Excel is fine for general analysis, but once the data volume grows it can no longer cope, and a database solves this problem well. Most companies also store their data in SQL databases, so as an analyst you need to at least understand SQL operations and be able to query and extract company data.

As the most classic database tool, SQL makes it possible to store and manage massive amounts of data, and greatly improves the efficiency of data extraction. You need to master the following skills:

Extracting data in specific situations: the data in an enterprise database is bound to be large and complex, and you need to extract just the part you need. For example, you might extract all sales data for 2017, the 50 best-selling items this year, or the consumption data of users in Shanghai and Guangdong. SQL can do each of these with simple commands.
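
The three extractions just mentioned can be sketched with Python's built-in `sqlite3` module. The `sales` table, its columns, and its rows are invented for illustration, and the top-N query uses 2 instead of 50 because the toy table is tiny.

```python
import sqlite3

# In-memory database with a made-up sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, region TEXT, sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("laptop", "Shanghai", "2017-03-01", 8000.0),
        ("phone", "Guangdong", "2017-06-15", 5000.0),
        ("tablet", "Beijing", "2017-09-20", 3000.0),
        ("laptop", "Guangdong", "2018-01-05", 8200.0),
    ],
)

# All sales made in 2017.
sales_2017 = conn.execute(
    "SELECT item, amount FROM sales "
    "WHERE sale_date BETWEEN '2017-01-01' AND '2017-12-31' "
    "ORDER BY sale_date"
).fetchall()

# The rows with the largest amounts (LIMIT 50 in a real table).
top_items = conn.execute(
    "SELECT item, amount FROM sales ORDER BY amount DESC LIMIT 2"
).fetchall()

# Sales to users in Shanghai or Guangdong.
regional = conn.execute(
    "SELECT item, region FROM sales "
    "WHERE region IN ('Shanghai', 'Guangdong') ORDER BY sale_date"
).fetchall()

print(sales_2017, top_items, regional, sep="\n")
```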

Inserting, deleting, querying, and updating data: these are the most basic database operations, and each is done with a simple command, so you just need to remember the commands.
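
A minimal sketch of these four operations, again using `sqlite3` with a hypothetical `users` table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Insert two rows; the id column is filled in automatically.
conn.execute("INSERT INTO users (name, city) VALUES ('Li', 'Shanghai')")
conn.execute("INSERT INTO users (name, city) VALUES ('Wang', 'Beijing')")

# Update one row.
conn.execute("UPDATE users SET city = 'Guangzhou' WHERE name = 'Wang'")

# Delete one row.
conn.execute("DELETE FROM users WHERE name = 'Li'")

# Query what is left.
rows = conn.execute("SELECT id, name, city FROM users").fetchall()
print(rows)  # [(2, 'Wang', 'Guangzhou')]
```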

Grouping and aggregating data, and relating multiple tables: this is the more advanced part of SQL. Joins between tables are very useful when you work with multiple dimensions and multiple data sets, and they let you handle more complex data.
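
As a small sketch of both ideas together, the query below joins a hypothetical `orders` table to a hypothetical `users` table and aggregates the result per city. The table names, columns, and values are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
INSERT INTO users VALUES (1, 'Shanghai'), (2, 'Guangdong'), (3, 'Shanghai');
INSERT INTO orders VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 80.0), (4, 3, 20.0);
""")

# JOIN links each order to its user's city; GROUP BY then aggregates per city.
city_totals = conn.execute("""
    SELECT u.city, COUNT(*) AS n_orders, SUM(o.amount) AS total
    FROM orders AS o
    JOIN users AS u ON u.user_id = o.user_id
    GROUP BY u.city
    ORDER BY total DESC
""").fetchall()

print(city_totals)  # [('Shanghai', 3, 170.0), ('Guangdong', 1, 80.0)]
```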

This part of SQL is relatively simple; it is mainly a matter of mastering some basic statements. Still, I suggest you find some data sets and practice on them, even if only with the most basic queries and extractions.

3.3 Data preprocessing: Python (pandas)

Much of the time the data we get is not clean: it contains duplicates, missing values, outliers, and so on. Data cleaning is then needed to deal with the records that would distort the analysis, so that the results are more accurate.

Take sales data: some channels' sales are not entered in time, and some records are duplicated. Or take user behavior data: it contains many invalid operations that are meaningless for the analysis and need to be deleted.

We then need to handle each problem appropriately. For incomplete data, for example, do we drop the records, or fill the gaps with adjacent values? These are all questions to consider.
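
The two options for incomplete data can be compared side by side in pandas. This is only a sketch on a made-up four-row table; which option is right depends on the data and on the analysis.

```python
import pandas as pd

# Made-up daily sales with two missing amounts.
sales = pd.DataFrame({
    "day": [1, 2, 3, 4],
    "amount": [100.0, None, 120.0, None],
})

# Option 1: drop the incomplete rows.
dropped = sales.dropna(subset=["amount"])

# Option 2: fill each gap with the previous (adjacent) value.
filled = sales.assign(amount=sales["amount"].ffill())

print(dropped)
print(filled)
```

Dropping keeps only observed values but shrinks the data set; forward filling keeps every row but assumes the value did not change since the last observation.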

For data preprocessing, learning how to use pandas (a Python package) is enough to handle general data cleaning. The knowledge points to master are as follows:

* Selection: accessing data (by label, by specific value, by Boolean index, etc.)
* Missing value handling: deleting or filling rows with missing data
* Duplicate handling: detecting and deleting duplicate values
* Outlier handling: removing unnecessary spaces and extreme or abnormal values
* Related operations: descriptive statistics, apply, histograms, etc.
* Merging: merge operations for various logical relationships
* Grouping: splitting data, applying functions to each group, and recombining the results
* Reshaping: quickly generating pivot tables
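
Most of the points above can be seen in one short pandas sketch. The `orders` and `channels` tables are invented, and the cut-off of 10000 for "abnormal" amounts is an arbitrary assumption for this toy data, not a general rule.

```python
import pandas as pd

# Made-up orders: stray spaces, one duplicate row, one missing and one extreme amount.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "channel": [" web", "store ", "store ", "web", "app"],
    "amount": [100.0, 250.0, 250.0, None, 99999.0],
})
channels = pd.DataFrame({
    "channel": ["web", "store", "app"],
    "manager": ["Ann", "Bo", "Cy"],
})

clean = orders.copy()
clean["channel"] = clean["channel"].str.strip()                      # remove unnecessary spaces
clean = clean.drop_duplicates()                                      # duplicate handling
clean["amount"] = clean["amount"].fillna(clean["amount"].median())   # missing value handling
clean = clean[clean["amount"] < 10000]                               # Boolean index drops the extreme value

merged = clean.merge(channels, on="channel", how="left")             # merging
totals = merged.groupby("channel")["amount"].sum()                   # grouping
pivot = merged.pivot_table(index="channel", values="amount", aggfunc="mean")  # reshaping

print(merged)
print(totals)
print(pivot)
print(merged["amount"].describe())                                   # descriptive statistics
```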

There are many pandas tutorials online, mostly about applying particular functions, and they are quite simple. You can also refer to the official pandas documentation.

3.4 Probability theory and statistics

What is the overall distribution of the data? What are a population and a sample? How do you apply basic statistics such as the median, mode, mean, and variance? How do you do hypothesis testing in different scenarios? Most data analysis methods come from statistics, so statistical knowledge is essential. The knowledge points to master are as follows:

* Basic statistics: mean, median, mode, percentiles, extreme values, etc.
* Other descriptive statistics: skewness, variance, standard deviation, significance, etc.
* Other statistical knowledge: population and sample, parameters and statistics, error bars
* Probability distributions and hypothesis testing: the various distributions and the hypothesis testing process
* Other probability theory: conditional probability, Bayes' theorem, etc.
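
A few of these quantities can be computed with nothing more than Python's standard `statistics` module (`quantiles` needs Python 3.8 or later). The scores and the probabilities in the Bayes example are made-up numbers chosen for illustration.

```python
import statistics

scores = [62, 70, 70, 75, 81, 88, 93]  # a made-up sample

mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)
variance = statistics.variance(scores)         # sample variance (divides by n - 1)
stdev = statistics.stdev(scores)               # sample standard deviation
quartiles = statistics.quantiles(scores, n=4)  # 25th, 50th, and 75th percentiles

print(mean, median, mode, variance, round(stdev, 2), quartiles)

# Conditional probability with Bayes' theorem:
# P(A | B) = P(B | A) * P(A) / P(B)
p_condition = 0.01           # P(A): prior probability of the condition
p_pos_given_condition = 0.95 # P(B | A): test is positive when the condition is present
p_pos_given_healthy = 0.05   # P(B | not A): false positive rate

p_positive = (p_pos_given_condition * p_condition
              + p_pos_given_healthy * (1 - p_condition))
p_condition_given_pos = p_pos_given_condition * p_condition / p_positive

print(round(p_condition_given_pos, 3))
```

Even with a fairly accurate test, the posterior probability stays low here because the condition is rare, which is the kind of result that conditional probability helps make visible.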

With basic statistical knowledge, you can use these statistics for basic analysis. Describing the indicators of the data visually already yields many conclusions: which items are in the top 100, what the average level is, how things have changed in recent years, and so on.

You can use Python packages such as Seaborn and matplotlib for visual analysis, and obtain instructive results from the various statistical charts.
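
As a minimal sketch with matplotlib only, the script below draws a trend line and a histogram. The revenue and order figures are invented, and the output file name `overview.png` is arbitrary.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the script also runs without a display
import matplotlib.pyplot as plt

# Made-up yearly revenue and order values.
years = [2019, 2020, 2021, 2022, 2023]
revenue = [120, 135, 150, 170, 210]
order_values = [12, 15, 18, 22, 25, 25, 30, 41, 45, 60]

fig, (ax_trend, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Line chart: how has the indicator changed in recent years?
ax_trend.plot(years, revenue, marker="o")
ax_trend.set_xlabel("Year")
ax_trend.set_ylabel("Revenue")
ax_trend.set_title("Revenue trend")

# Histogram: how are the order values distributed?
counts, bin_edges, _ = ax_hist.hist(order_values, bins=4)
ax_hist.set_xlabel("Order value")
ax_hist.set_ylabel("Number of orders")
ax_hist.set_title("Order value distribution")

fig.tight_layout()
fig.savefig("overview.png")
```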

3.5 Python data analysis

If you have looked around, you will know that there are many books on Python data analysis on the market, but each one is thick and the barrier to learning is high. In fact, the most useful information is only a small part of these books.

For example, once you master regression analysis, linear regression and logistic regression alone let you run a regression analysis on most data and draw relatively accurate conclusions. The knowledge points to master in this part are as follows:

* Regression analysis: linear regression, logistic regression
* Basic classification algorithms: decision trees, random forests, etc.
* Basic clustering algorithms: k-means, etc.
* Feature engineering basics: how to use feature selection to optimize a model
* Parameter tuning: how to adjust parameters to optimize a model
* Python data analysis packages: scipy, numpy, scikit-learn, etc.
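
The two regression methods named above can be tried in a few lines of scikit-learn. This is a sketch on synthetic data: the linear example uses points that lie exactly on y = 3x + 2, and the logistic example uses an invented pass/fail label, so real data will be noisier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# One feature with values 1..10, shaped as a column because scikit-learn expects 2-D input.
X = np.arange(1, 11).reshape(-1, 1)

# Linear regression: predict a continuous value.
y = 3 * X.ravel() + 2
linear = LinearRegression().fit(X, y)
print(linear.coef_[0], linear.intercept_)  # close to 3 and 2

# Logistic regression: predict a class (here 1 when x > 5, otherwise 0).
passed = (X.ravel() > 5).astype(int)
logistic = LogisticRegression().fit(X, passed)
pred = logistic.predict([[2], [9]])
print(pred)
print(logistic.predict_proba([[2], [9]]))  # class probabilities for the same inputs
```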

At this stage of data analysis, focus on regression analysis, which solves most problems. With descriptive statistics and regression analysis, you can already reach good conclusions.

Later you will learn which algorithm suits which type of problem. For model optimization, you need to learn how to extract features and tune parameters to improve prediction accuracy. This touches on data mining and machine learning; in fact, a good data analyst should also count as a junior data mining engineer.
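
One possible way to combine feature selection and parameter tuning in scikit-learn is a `Pipeline` searched with `GridSearchCV`. The sketch below uses the bundled iris data set and a decision tree; the data set, the candidate values in `param_grid`, and the 5-fold cross-validation are choices made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Step 1 keeps the k most informative features; step 2 fits a decision tree on them.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Candidate settings: how many features to keep and how deep the tree may grow.
param_grid = {
    "select__k": [1, 2, 3, 4],
    "tree__max_depth": [2, 3, 4],
}

# Try every combination with 5-fold cross-validation and keep the best one.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Putting the selection step inside the pipeline means it is refitted on each training fold, so the cross-validated score is not inflated by features chosen with knowledge of the validation data.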

You can use the scikit-learn library in Python to carry out the whole process of data analysis and data mining modeling.
