# An Efficient Learning Path for Python Data Analysis

One, What skills should a data analyst have?

Demand for data analysis talent is high. On the one hand, enterprise data volumes are growing rapidly, and the need for analysis grows with them; on the other hand, there are far fewer candidates for data analyst positions than for other technical roles.

To find a clear learning path, the most effective approach is to look at specific jobs and the skills they actually require.

We picked some of the most representative data analyst postings from the recruitment site Lagou. Let's look at what skills the highly paid data analyst positions require.

In fact, the basic skills required of data analysts vary little from job to job and can be summarized as follows:

* SQL database basics and basic data management

* Basic data analysis and presentation with Excel/SQL

* Data analysis with a scripting language, Python or R

* Ability to acquire external data, for example with web crawlers

* Basic data visualization skills and the ability to write data reports

* Familiarity with common data mining algorithms: mainly regression analysis, decision trees, logistic regression, SVM, random forests, etc.

Two, The data analysis process

A data analysis project generally follows the sequence "data acquisition - data storage and extraction - data preprocessing - data modeling and analysis - data visualization". Each stage has its own knowledge points to master, which the sections below go through.

What is an efficient learning path? It is the data analysis process itself. Working through it step by step in this order, you will know what each part needs to accomplish, which knowledge points to learn, and which can wait.

Three, How to learn data analysis skills

Next, we look at what to learn in each part and how to learn it.

3.1, Data acquisition: open data and Python crawlers

There are two main ways to obtain external data .

The first is to use public external datasets. Some research institutions, enterprises, and governments open up part of their data, which you download from specific websites. These datasets are usually fairly complete and of relatively high quality. Some common sites for obtaining datasets:

UCI: classic open datasets from the University of California, Irvine, used by many data mining laboratories.

http://archive.ics.uci.edu/ml/datasets.html

National data: data from China's National Bureau of Statistics, covering China's economy and people's livelihood.

http://data.stats.gov.cn/

CEIC: economic data for more than 128 countries, with precise, in-depth figures on GDP, imports and exports, retail, sales, and more.

http://www.ceicdata.com/zh-hans

China Statistical Information Network: official website of the National Bureau of Statistics, collecting statistics on national economic and social development.

http://www.tjcn.org/

Youyi data: initiated by the National Information Center, a leading data trading platform in China with lots of free data.

http://www.youedata.com/

The other way to get external data is with a web crawler.

For example, a crawler can collect the job postings for a given position from a recruitment site, the rental listings for a city from a rental site, the list of top-rated movies on Douban, the ranking of most-liked answers on Zhihu, or the review rankings on NetEase Cloud Music. With data crawled from the Internet, you can analyze a particular industry or a particular group of people.

Before crawling, you need some basic Python knowledge: data structures (lists, dictionaries, tuples, etc.), variables, loops, functions...

You also need to know how to use Python libraries (urllib, BeautifulSoup, requests, scrapy) to implement a web crawler. Beginners are advised to start with urllib + BeautifulSoup.

Popular e-commerce sites, Q&A sites, second-hand trading sites, matchmaking sites, recruitment sites, and so on can all yield very valuable data.
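
As a rough illustration of the urllib + BeautifulSoup combination, here is a minimal sketch. The page structure (`li.job`, `.title`, `.salary`), the function names, and the sample HTML are all assumptions made up for this example; a real site will need its own selectors.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text."""
    # Many sites reject requests that lack a browser-like User-Agent.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def parse_jobs(html):
    """Extract (title, salary) pairs from a hypothetical job-listing page."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for item in soup.find_all("li", class_="job"):  # assumed markup
        title = item.find(class_="title").get_text(strip=True)
        salary = item.find(class_="salary").get_text(strip=True)
        jobs.append((title, salary))
    return jobs


# Inline sample standing in for a downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<ul>
  <li class="job"><span class="title">Data Analyst</span>
      <span class="salary">15k-25k</span></li>
  <li class="job"><span class="title">BI Engineer</span>
      <span class="salary">12k-20k</span></li>
</ul>
"""

jobs = parse_jobs(SAMPLE_HTML)
print(jobs)
```

For a live page, the same parser would be fed by the downloader, for example `parse_jobs(fetch_html(url))` with a URL of your choosing.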

3.2, Data storage and extraction: the SQL language

When working with fewer than about ten thousand records, Excel is fine for general analysis. Once the data volume grows, it can no longer cope, and a database solves this problem well. Most companies also store their data in SQL databases, so as an analyst you need at least to understand SQL operations and be able to query and extract company data.

As the most classic database tool, SQL makes it possible to store and manage massive amounts of data and greatly improves the efficiency of data extraction. You need to master the following skills:

Extracting data in specific situations: the data in an enterprise database is inevitably large and complex, and you need to extract the part you need. For example, you might extract all sales data for 2017, the 50 best-selling items of the year, or the consumption data of users in Shanghai and Guangdong... SQL can do all of this with simple commands.

Inserting, deleting, querying, and updating: these are the most basic database operations, and they can all be done with simple commands, so you only need to remember the commands.

Grouping and aggregating data, and relating multiple tables: this is the more advanced part of SQL. Joins between tables are very useful when you work with multiple dimensions and multiple datasets, and they let you handle more complex data.

This part of SQL is relatively simple; it is mainly a matter of mastering some basic statements. I still suggest finding some datasets and practicing on them, even if only with the most basic queries and extractions.
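
The operations listed above can be tried without installing a database server. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `users` and `sales` tables, their columns, and the rows are invented for illustration, and `LIMIT 2` stands in for the "top 50" query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, province TEXT);
CREATE TABLE sales (
    sale_id   INTEGER PRIMARY KEY,
    user_id   INTEGER REFERENCES users(user_id),
    item      TEXT,
    amount    REAL,
    sale_date TEXT  -- ISO format, e.g. '2017-03-01'
);
INSERT INTO users VALUES (1, 'Shanghai'), (2, 'Guangdong'), (3, 'Beijing');
INSERT INTO sales VALUES
    (1, 1, 'keyboard',  300.0, '2017-03-01'),
    (2, 2, 'monitor',  1200.0, '2017-06-15'),
    (3, 3, 'mouse',      80.0, '2017-09-09'),
    (4, 1, 'monitor',  1100.0, '2016-12-30'),
    (5, 2, 'mouse',      90.0, '2017-11-11');
""")

# Extraction: all sales made in 2017.
sales_2017 = conn.execute(
    "SELECT item, amount FROM sales "
    "WHERE sale_date BETWEEN '2017-01-01' AND '2017-12-31' "
    "ORDER BY sale_date"
).fetchall()

# Grouping and aggregation: best-selling items of 2017 by total amount.
top_items = conn.execute(
    "SELECT item, SUM(amount) AS total FROM sales "
    "WHERE sale_date BETWEEN '2017-01-01' AND '2017-12-31' "
    "GROUP BY item ORDER BY total DESC LIMIT 2"
).fetchall()

# Multi-table join: total spending of users in Shanghai and Guangdong.
by_province = conn.execute(
    "SELECT u.province, SUM(s.amount) AS total "
    "FROM sales AS s JOIN users AS u ON u.user_id = s.user_id "
    "WHERE u.province IN ('Shanghai', 'Guangdong') "
    "GROUP BY u.province ORDER BY u.province"
).fetchall()

# Update and delete.
conn.execute("UPDATE sales SET amount = 85.0 WHERE sale_id = 3")
conn.execute("DELETE FROM sales WHERE sale_id = 4")
remaining = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]

print(sales_2017, top_items, by_province, remaining, sep="\n")
```

The SQL statements themselves carry over to other databases with little change, although date handling and some syntax details differ between systems.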

3.3, Data preprocessing: Python (pandas)

Most of the time, the data we get is not clean: it contains duplicates, missing values, outliers, and so on. It then needs to be cleaned, handling the records that would distort the analysis, in order to obtain more accurate results.

Take sales data: some channels' sales are not entered in time, and some records are duplicated. Or take user behavior data: it contains many invalid operations that are meaningless for analysis and need to be removed.

We then need to handle each problem in an appropriate way. For incomplete data, for example, should we drop the records or fill them in from adjacent values? These are all questions to consider.

For data preprocessing, once you learn how to use pandas (a Python package), general data cleaning is no problem. The knowledge points to master are as follows:

* Selection: accessing data (by label, by specific value, by boolean index, etc.)

* Missing value handling: dropping or filling rows with missing data

* Duplicate handling: detecting and removing duplicate values

* Outlier handling: removing unnecessary whitespace and extreme or abnormal values

* Related operations: descriptive statistics, apply, histograms, etc.

* Merging: merge operations for various logical relationships

* Grouping: splitting data, applying functions to each group, and recombining the results

* Reshaping: quickly generating pivot tables

There are many pandas tutorials online, mostly covering how to apply individual functions, and they are quite simple. If in doubt, refer to the official pandas documentation.
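
To make these knowledge points concrete, here is a small cleaning sketch on made-up order data. The column names, the managers table, and the 5000 cutoff for "abnormal" amounts are assumptions for illustration only; real thresholds depend on the business context.

```python
import numpy as np
import pandas as pd

# Made-up raw data with a duplicate row, stray spaces, a gap and an outlier.
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4, 5],
    "channel": [" online", "store", "store", "online ", "store", "online"],
    "amount": [120.0, 80.0, 80.0, np.nan, 95.0, 10000.0],
})

clean = raw.drop_duplicates().copy()             # order 2 was entered twice
clean["channel"] = clean["channel"].str.strip()  # remove stray whitespace
# Fill the missing amount with the median, which is robust to the outlier.
clean["amount"] = clean["amount"].fillna(clean["amount"].median())
# Boolean indexing: keep rows below an (assumed) plausibility threshold.
clean = clean[clean["amount"] < 5000]

# Grouping: split by channel, aggregate, recombine.
per_channel = clean.groupby("channel")["amount"].agg(["count", "mean"])

# Merging: attach a hypothetical lookup table of channel managers.
managers = pd.DataFrame({"channel": ["online", "store"],
                         "manager": ["Li", "Wang"]})
merged = clean.merge(managers, on="channel", how="left")

# Reshaping: a pivot table of total amount per manager.
pivot = merged.pivot_table(index="manager", values="amount", aggfunc="sum")

print(clean, per_channel, pivot, sep="\n\n")
```

Whether to drop or fill a missing value is a judgment call; the median fill here is just one common choice.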

3.4, Probability theory and statistics

How is the data distributed overall? What are a population and a sample? How are basic statistics such as the median, mode, mean, and variance applied? How do you run hypothesis tests in different scenarios? Most data analysis methods come from statistics, so statistical knowledge is essential. The knowledge points to master are as follows:

* Basic statistics: mean, median, mode, percentiles, extreme values, etc.

* Other descriptive statistics: skewness, variance, standard deviation, significance, etc.

* Other statistical knowledge: population and sample, parameters and statistics, error bars

* Probability distributions and hypothesis testing: the various distributions and the hypothesis testing process

* Other probability theory: conditional probability, Bayes' theorem, etc.
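
A few of these ideas can be tried in a handful of lines. The sketch below computes descriptive statistics with NumPy, runs a two-sample test with SciPy, and applies Bayes' theorem by hand. The two samples are synthetic, and the prior and error rates in the Bayes part are made-up numbers chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
# Synthetic "order value" samples for two page designs (made-up data).
group_a = rng.normal(loc=100.0, scale=15.0, size=200)
group_b = rng.normal(loc=105.0, scale=15.0, size=200)

summary = {
    "mean": np.mean(group_a),
    "median": np.median(group_a),
    "std": np.std(group_a, ddof=1),   # ddof=1 gives the sample std
    "p90": np.percentile(group_a, 90),
    "skewness": stats.skew(group_a),
}

# Hypothesis test (Welch's t-test): H0 says the two population means are equal.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
reject_h0 = p_value < 0.05            # 5% significance level

# Bayes' theorem: P(fraud | flagged) from assumed rates.
p_fraud = 0.01                 # prior: 1% of orders are fraudulent
p_flag_given_fraud = 0.95      # detector catches 95% of fraud
p_flag_given_ok = 0.05         # and wrongly flags 5% of normal orders
p_flag = p_flag_given_fraud * p_fraud + p_flag_given_ok * (1 - p_fraud)
p_fraud_given_flag = p_flag_given_fraud * p_fraud / p_flag

print(summary)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, reject H0: {reject_h0}")
print(f"P(fraud | flagged) = {p_fraud_given_flag:.3f}")
```

The Bayes result is worth pausing on: even with a detector that catches 95% of fraud, most flagged orders are legitimate under these assumed rates, because fraud is rare to begin with.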

With basic statistical knowledge, you can use these statistics for basic analysis. Describing the data's indicators visually already yields many conclusions: which items are in the top 100, what the average is, how things have trended in recent years...

You can use Seaborn, matplotlib, and other Python packages for visual analysis, and draw instructive results from various statistical charts.
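
As a small example of this kind of visual analysis, the sketch below draws a distribution and a yearly trend side by side with matplotlib. All numbers are synthetic or made up, and the output file name `overview.png` is arbitrary; Seaborn offers higher-level equivalents of both charts.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
values = rng.normal(loc=50.0, scale=10.0, size=500)    # synthetic metric
years = np.arange(2013, 2018)
yearly_avg = np.array([42.0, 45.5, 47.0, 51.2, 55.8])  # made-up trend

fig, (ax_hist, ax_trend) = plt.subplots(1, 2, figsize=(10, 4))

# Left: how the metric is distributed, with the mean marked.
ax_hist.hist(values, bins=30)
ax_hist.axvline(values.mean(), linestyle="--", color="black", label="mean")
ax_hist.set_title("Distribution")
ax_hist.legend()

# Right: how the yearly average has changed.
ax_trend.plot(years, yearly_avg, marker="o")
ax_trend.set_xticks(years)
ax_trend.set_title("Yearly average")

fig.tight_layout()
fig.savefig("overview.png")
```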

3.5, Python data analysis

If you have looked around, you will know there are many books on Python data analysis on the market, but each one is thick and the barrier to learning is high. In fact, the most useful information makes up only a small part of these books.

For example, by mastering linear regression and logistic regression, you can run regression analysis on most data and draw relatively accurate conclusions. The knowledge points to master in this part are as follows:

* Regression analysis: linear regression, logistic regression

* Basic classification algorithms: decision trees, random forests...

* Basic clustering algorithms: k-means...

* Feature engineering basics: how to use feature selection to optimize a model

* Parameter tuning: how to adjust parameters to optimize a model

* Python data analysis packages: scipy, numpy, scikit-learn, etc.

At this stage, focus on regression analysis methods and most problems can be solved: with descriptive statistics and regression analysis, you can already produce a good analysis.
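
Here is a minimal sketch of both regression methods with scikit-learn. The data is synthetic: the "true" relationships (revenue of about 3 times ad spend plus 20, and a purchase probability that rises with visits) are assumptions built into the generator, so the example only shows the workflow, not a real finding.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Linear regression: predict a number (revenue) from ad spend.
ad_spend = rng.uniform(0, 100, size=(300, 1))
revenue = 3.0 * ad_spend[:, 0] + 20.0 + rng.normal(0, 5, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    ad_spend, revenue, test_size=0.25, random_state=0)
lin = LinearRegression().fit(X_train, y_train)
r2 = lin.score(X_test, y_test)  # R^2 on held-out data

# Logistic regression: predict a yes/no outcome (did the customer buy?).
visits = rng.uniform(0, 10, size=(300, 1))
buy_prob = 1.0 / (1.0 + np.exp(-(visits[:, 0] - 5.0)))
bought = (rng.uniform(size=300) < buy_prob).astype(int)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    visits, bought, test_size=0.25, random_state=0)
clf = LogisticRegression().fit(Xc_train, yc_train)
accuracy = clf.score(Xc_test, yc_test)  # share of correct predictions

print(f"slope={lin.coef_[0]:.2f}, intercept={lin.intercept_:.1f}, R^2={r2:.3f}")
print(f"classification accuracy={accuracy:.2f}")
```

Holding out a test set, as `train_test_split` does here, is what lets you judge whether the fitted model generalizes beyond the data it was trained on.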

You will then come to know which algorithm models suit which kinds of problems. To optimize a model, you need to learn how to improve prediction accuracy through feature extraction and parameter tuning. This touches on data mining and machine learning; in fact, a good data analyst should also qualify as a junior data mining engineer.

You can use Python's scikit-learn library to carry out the whole process of data mining modeling and analysis.
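
One possible shape for such a workflow is sketched below: feature selection and a random forest are chained in a pipeline, and a grid search tunes both. It uses the iris dataset bundled with scikit-learn; the parameter grid is a small arbitrary choice for illustration, not a recommended setting.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Step 1 keeps the k most informative features; step 2 classifies.
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("forest", RandomForestClassifier(random_state=0)),
])

# Candidate settings to try (an arbitrary, deliberately small grid).
param_grid = {
    "select__k": [2, 4],
    "forest__n_estimators": [25, 50],
    "forest__max_depth": [2, None],
}

# 5-fold cross-validation on the training set picks the best combination.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
test_accuracy = search.score(X_test, y_test)

print("best parameters:", search.best_params_)
print(f"held-out accuracy: {test_accuracy:.2f}")
```

Putting the feature selector inside the pipeline means it is refit on each cross-validation fold, which avoids leaking information from the validation data into the selection step.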
