An Efficient Learning Path for Python Data Analysis
1. What skills should a data analyst have?
Demand for data analysis talent is strong: on the one hand, the volume of data inside companies keeps growing, and with it the demand for analysis; on the other hand, compared with other technical positions, there are far fewer candidates for data analyst roles.
To map out a learning path, the most effective approach is to look at specific jobs and the skills they actually require.
We pulled some representative data analyst postings from recruitment sites to see what skills the well-paid positions demand.
In fact, the baseline skill requirements for data analysts vary little from posting to posting. They can be summarized as:
* SQL: basic database operations and basic data management
* Excel/SQL for basic data analysis and presentation
* A scripting language for analysis: Python or R
* The ability to collect external data, e.g. with a web crawler
* Basic data visualization skills and the ability to write data reports
* Familiarity with common data mining algorithms: mainly regression analysis, decision trees, logistic regression, SVM, random forests, etc.
2. The data analysis workflow
Next comes the workflow itself. A data analysis project can generally be carried out in the sequence "data acquisition - data storage and extraction - data preprocessing - data modeling and analysis - data visualization". Following this workflow, the detailed knowledge points each stage requires are laid out below.
What makes a learning path efficient? Exactly this workflow. Work through it step by step in this order, and you will know what each stage needs to accomplish, which knowledge points you must learn, and which you can safely postpone.
3. How to learn each data analysis skill
Next, let's go through what to learn in each stage and how to learn it.
3.1 Data acquisition: open datasets and Python crawlers
There are two main ways to obtain external data.
The first is public datasets. Research institutions, companies, and government agencies open up some of their data, which you can download from dedicated websites. These datasets are usually fairly complete and of relatively high quality. Some common sources:
UCI: classic open datasets from the University of California, Irvine, used by many data mining labs.
National Data: data from China's National Bureau of Statistics, covering the Chinese economy and people's livelihood.
CEIC: economic data for more than 128 countries; you can find detailed figures such as GDP, imports and exports, retail, and sales.
China Statistical Information Network: the official website of the National Bureau of Statistics, collecting statistics on national economic and social development.
Youyi Data: initiated by the National Information Center, a leading Chinese data trading platform with plenty of free data.
The other way to obtain external data is with a web crawler.
For example, a crawler can collect job postings for a given position from recruitment sites, rental listings for a city, Douban's highest-rated movies, Zhihu answers ranked by upvotes, or NetEase Cloud Music's most-commented tracks. With data crawled from the web, you can analyze a particular industry or a particular population of users.
Before crawling, you need some Python basics: data types (lists, dictionaries, tuples, etc.), variables, loops, functions, and so on, as well as how to implement a crawler with Python libraries (urllib, BeautifulSoup, requests, Scrapy). If you are a beginner, start with urllib + BeautifulSoup.
Popular e-commerce sites, Q&A sites, second-hand marketplaces, dating sites, recruitment sites, and more all hold very valuable data that you can crawl.
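As a taste of what a crawler does, here is a minimal sketch using only the Python standard library: it parses an HTML snippet and collects every link, which is the core step of any crawler. (In a real crawler, the HTML would come from urllib.request.urlopen and BeautifulSoup would replace the hand-rolled parser; the snippet and URLs below are made up for illustration.)

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# In a real crawler this HTML would be fetched from a URL;
# a hard-coded snippet keeps the example self-contained.
html = '<ul><li><a href="/movie/1">Movie A</a></li><li><a href="/movie/2">Movie B</a></li></ul>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/movie/1', '/movie/2']
```

Once you are comfortable with this idea, BeautifulSoup gives you the same link extraction in a couple of lines, plus robust handling of messy real-world HTML.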
3.2 Data access: SQL
For datasets up to around ten thousand rows, Excel is fine for general analysis; once the data volume grows larger, it no longer copes, and a database solves the problem well. Most companies store their data with SQL, so if you are an analyst, you need to understand at least enough SQL to query and extract the company's data.
SQL, the classic database language, makes the storage and management of massive data possible and greatly improves the efficiency of data extraction. You need to master the following skills:
Extracting data for a specific situation: an enterprise database is inevitably large and complex, and you need to pull out just the part you need. For example, you might extract all sales data for 2017, the 50 best-selling items of the year, or the consumption data of users in Shanghai and Guangdong. SQL does all of this with simple commands.
Inserting, deleting, querying, and updating: these are the most basic database operations, each achievable with a simple command, so you only need to remember the syntax.
Grouping and aggregating data, and relating multiple tables: this is the more advanced part of SQL. Joins between tables are very useful when you work with multiple dimensions and multiple datasets, and they let you handle more complex data.
This part of SQL is relatively simple, mostly a matter of mastering a few basic statements. Still, I suggest you find some datasets to practice on, even for the most basic queries and extractions.
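To see these operations concretely, here is a small sketch that runs the same kinds of statements against an in-memory SQLite database via Python's built-in sqlite3 module (the table names and figures are invented for illustration):

```python
import sqlite3

# An in-memory SQLite database stands in for a company data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("widget", "Shanghai", 120.0),
     ("gadget", "Shanghai", 80.0),
     ("widget", "Guangdong", 250.0)],
)

# Extraction for a specific situation: total sales per region, largest first
# (grouping and aggregation).
totals = conn.execute(
    "SELECT region, SUM(amount) AS total FROM sales "
    "GROUP BY region ORDER BY total DESC"
).fetchall()
print(totals)  # [('Guangdong', 250.0), ('Shanghai', 200.0)]

# Relating two tables with a join.
conn.execute("CREATE TABLE managers (region TEXT, manager TEXT)")
conn.execute("INSERT INTO managers VALUES ('Shanghai', 'Li')")
joined = conn.execute(
    "SELECT s.item, m.manager FROM sales s "
    "JOIN managers m ON s.region = m.region"
).fetchall()
print(joined)
```

The same SELECT, GROUP BY, and JOIN statements carry over unchanged to MySQL, PostgreSQL, and the other databases companies actually run.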
3.3 Data preprocessing: Python (pandas)
Most of the time, the data we get is not clean: there are duplicates, missing values, outliers, and so on. It needs cleaning, handling the records that would distort the analysis, so that the results are more accurate.
For example, in sales data, some channels may not have entered their figures on time, and some records are duplicated; in user behavior data, there are many invalid operations that are meaningless for analysis and need to be deleted.
Each problem then needs an appropriate treatment. For incomplete records, for instance, do we discard them or fill them in from adjacent values? These are the questions to consider.
For data preprocessing, learning how to use pandas (a Python package) is enough to handle general data cleaning. The knowledge points to master are:
Selection: data access by label, by value, by boolean index, etc.
Missing values: deleting or filling in rows with missing data
Duplicates: detecting and removing duplicate values
Outliers: clearing unnecessary whitespace, extreme values, and abnormal data
Common operations: descriptive statistics, apply, histograms, etc.
Merging: merge operations for various logical relationships
Grouping: splitting the data, applying functions per group, recombining the results
Reshaping: quickly generating pivot tables
There are plenty of pandas tutorials online, mostly covering the application of individual functions. They are simple enough, and the official pandas documentation is the authoritative reference.
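As a sketch of what the cleaning steps above look like in practice, here is a minimal pandas example on an invented, deliberately messy sales table (a duplicate row, a missing value, and an implausible extreme):

```python
import numpy as np
import pandas as pd

# Hypothetical sales records: row 1 duplicates row 0, one amount is
# missing, and 99999.0 is an obvious outlier.
df = pd.DataFrame({
    "channel": ["web", "web", "store", "store", "app"],
    "amount": [100.0, 100.0, np.nan, 250.0, 99999.0],
})

df = df.drop_duplicates()                                  # duplicate handling
df["amount"] = df["amount"].fillna(df["amount"].median())  # fill the missing value
df = df[df["amount"] < 10000]                              # drop the extreme value
print(df["amount"].tolist())  # [100.0, 250.0, 250.0]
```

Selection, merging, grouping, and pivot tables follow the same pattern: one short, readable method call per cleaning decision.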
3.4 Probability and statistics
What is the overall distribution of the data? What are a population and a sample? How do you apply basic statistics such as the median, mode, mean, and variance? How do you run hypothesis tests in different scenarios? Most data analysis methods come from statistical concepts, so statistical knowledge is essential. The knowledge points to master are:
Basic statistics: mean, median, mode, percentiles, extremes, etc.
Other descriptive statistics: skewness, variance, standard deviation, significance, etc.
Other statistical concepts: population and sample, parameters and statistics, error bars
Probability distributions and hypothesis testing: common distributions, the hypothesis-testing workflow
Other probability theory: conditional probability, Bayes' theorem, etc.
With this basic statistical knowledge, you can use these statistics for basic analysis. By describing the data's indicators, often visually, you can already draw many conclusions: what the top 100 look like, what the average value is, how things have trended over recent years, and so on.
You can use Seaborn, matplotlib, and similar Python packages for visual analysis; the various statistical charts will lead you to instructive results.
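Even before reaching for charts, the basic statistics tell a story on their own. Here is a minimal sketch with Python's built-in statistics module, on invented daily sales figures, showing how the mean and the median disagree when an outlier is present:

```python
import statistics as st

# Hypothetical daily sales counts; 95 is a suspicious outlier.
sales = [12, 15, 15, 18, 20, 22, 95]

print(st.mean(sales))    # pulled upward by the outlier (~28.14)
print(st.median(sales))  # 18, robust to the outlier
print(st.mode(sales))    # 15, the most frequent value
print(st.pstdev(sales))  # population standard deviation
```

A large gap between mean and median like this one is itself a finding: it hints at a skewed distribution or bad data, and tells you which summary statistic to trust.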
3.5 Python data analysis
If you have looked around, you know the market is full of books on Python data analysis, each of them thick and daunting to work through. In fact, the most useful material amounts to only a small fraction of those books.
For example, once you master regression analysis, you can run linear and logistic regressions on most datasets and reach reasonably accurate conclusions. The knowledge points to master in this part are:
Regression analysis: linear regression, logistic regression
Basic classification algorithms: decision trees, random forests, etc.
Basic clustering algorithms: k-means, etc.
Feature engineering basics: optimizing a model through feature selection
Parameter tuning: optimizing a model by adjusting its hyperparameters
Python data analysis packages: scipy, numpy, scikit-learn, etc.
At this stage of data analysis, focus on regression methods: they solve most problems, and descriptive statistics combined with regression analysis already yields good results.
Later you will learn which algorithm models suit which types of problems. To optimize a model, you will need to learn feature extraction and parameter tuning to improve prediction accuracy. This edges into data mining and machine learning; in fact, a good data analyst should effectively be a junior data mining engineer.
You can use Python's scikit-learn library to carry out the whole process of data mining modeling and analysis.
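As a sketch of the regression-analysis step, here is a minimal linear regression with scikit-learn on invented advertising-spend-vs-sales data (chosen to be nearly linear so the fit is easy to check):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: advertising spend (feature) vs. sales (target),
# approximately sales = 2 * spend.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.1, 8.0, 9.9])

model = LinearRegression().fit(X, y)
print(round(model.coef_[0], 2), round(model.intercept_, 2))  # slope and intercept

# Predict sales for a spend of 6.0 using the fitted line.
pred = model.predict(np.array([[6.0]]))[0]
print(round(pred, 1))
```

The same fit/predict pattern carries over to scikit-learn's classifiers (decision trees, random forests) and clusterers (k-means), which is what makes the library such an effective way to work through the modeling stage.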
4. Join us
To make communication easier, we have set up a data analysis group; you are welcome to join the data analysis exchange group.