Udacity Data Analysis (Introductory): Exploring US Bikeshare Data

* Overview <https://blog.csdn.net/u010606346/article/details/84196562#_2>
* Bikeshare data <https://blog.csdn.net/u010606346/article/details/84196562#_5>
* Datasets <https://blog.csdn.net/u010606346/article/details/84196562#_16>
* Questions <https://blog.csdn.net/u010606346/article/details/84196562#_31>
* Project code <https://blog.csdn.net/u010606346/article/details/84196562#_44>
* Importing libraries and datasets <https://blog.csdn.net/u010606346/article/details/84196562#_46>
* Input helper function <https://blog.csdn.net/u010606346/article/details/84196562#_56>
* Selecting the data <https://blog.csdn.net/u010606346/article/details/84196562#_74>
* Getting the city, month, and day to analyze from user input <https://blog.csdn.net/u010606346/article/details/84196562#__104>
* Loading the data for the chosen city, month, and day <https://blog.csdn.net/u010606346/article/details/84196562#___146>
* Computing and displaying the most popular stations and trip <https://blog.csdn.net/u010606346/article/details/84196562#_185>
* Computing and displaying the total and average trip duration <https://blog.csdn.net/u010606346/article/details/84196562#_209>
* Computing and displaying user statistics <https://blog.csdn.net/u010606346/article/details/84196562#_229>
* Main function <https://blog.csdn.net/u010606346/article/details/84196562#_263>

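Overview

This project uses Python to explore data from the bikeshare systems of three major US cities: Chicago, New York City, and Washington, DC. The code imports the data and answers interesting questions about it by computing descriptive statistics. A script takes raw input from the user and creates an interactive experience in the terminal to present these statistics.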

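Bikeshare data

Over the past decade, bikeshare systems have grown in number and in popularity in cities around the world. A bikeshare system lets users rent a bicycle for a short time for a fee. Users can pick up a bike at point A and return it at point B, or return it to the same location if they just want to go for a ride. Each bike can serve several users per day.

Thanks to the rapid development of information technology, users can easily access a dock in the system to unlock or return a bike. These technologies also produce a wealth of data that can be used to explore how bikeshare systems are used.

In this project, you will use data provided by Motivate, a bikeshare system provider for many large US cities, to explore bikeshare usage patterns. You will compare system usage across three cities: Chicago, New York City, and Washington, DC.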

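Datasets

Data for the first six months of 2017 is provided for all three cities. All three data files contain the same six core columns:

* Start Time (e.g., 2017-01-01 00:07:57)
* End Time (e.g., 2017-01-01 00:20:53)
* Trip Duration (in seconds, e.g., 776)
* Start Station (e.g., Broadway & Barry Ave)
* End Station (e.g., Sedgwick St & North Ave)
* User Type (Subscriber/Registered or Customer/Casual)

The Chicago and New York City files also contain the following two columns (the original post shows the data format in an image):

* Gender
* Birth Year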
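
To make the column layout concrete, the following minimal sketch parses one made-up row in the shape described above. The row reuses the example values from the list; the `Gender` and `Birth Year` values and the names `sample_csv` and `core_columns` are illustrative assumptions, not taken from the real files.

```python
import io

import pandas as pd

# One illustrative row; the Gender and Birth Year values are invented.
sample_csv = io.StringIO(
    "Start Time,End Time,Trip Duration,Start Station,End Station,"
    "User Type,Gender,Birth Year\n"
    "2017-01-01 00:07:57,2017-01-01 00:20:53,776,Broadway & Barry Ave,"
    "Sedgwick St & North Ave,Subscriber,Male,1992.0\n"
)

df = pd.read_csv(sample_csv, parse_dates=["Start Time", "End Time"])

# The six core columns shared by all three city files
core_columns = ["Start Time", "End Time", "Trip Duration",
                "Start Station", "End Station", "User Type"]
missing = [col for col in core_columns if col not in df.columns]

# Trip Duration is in seconds; for this row it equals End Time - Start Time
elapsed = (df["End Time"] - df["Start Time"]).dt.total_seconds()
print("Missing core columns:", missing)
print("Elapsed seconds:", elapsed.iloc[0])
```

Running the same column check against a real file, before any analysis, is a cheap way to catch a wrong or truncated CSV.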

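Questions

1. Which month is most common in the start times (the Start Time column)?
2. Which day of the week (e.g., Monday, Tuesday) is most common in the start times?
3. Which hour of the day is most common in the start times?
4. What is the total trip duration (Trip Duration), and what is the average trip duration?
5. Which start station (Start Station) is most popular, and which end station (End Station) is most popular?
6. Which trip is most popular (that is, which combination of start station and end station)?
7. How many users are there of each user type?
8. How many users are there of each gender?
9. What are the earliest, the most recent, and the most common birth years?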

Project code

Importing libraries and datasets

import time

import pandas as pd
import numpy as np

# Map each city name (lower case) to its data file
CITY_DATA = {
    'chicago': 'chicago.csv',
    'new york city': 'new_york_city.csv',
    'washington': 'washington.csv'
}

Input helper function

def input_mod(input_print, enterable_list):
    """
    Prompt repeatedly until the user enters a valid city or month.

    Args:
        (str) input_print - the question shown to the user
        (list) enterable_list - accepted values (cities or months), in title case
    Returns:
        (str) ret - the user's choice, in lower case
    """
    while True:
        # Title-case the input so that matching ignores the user's capitalization
        ret = input(input_print).strip().title()
        if ret in enterable_list:
            return ret.lower()
        print('Sorry, please enter one of: {}.'.format(', '.join(enterable_list)))

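
Because `input_mod` calls `input` directly, it is awkward to exercise without a terminal. One possible variation, sketched here, passes the reading function in as a parameter so that the same validation loop can be driven by canned answers. The names `ask_choice` and `reader` are hypothetical and do not appear in the original project.

```python
def ask_choice(prompt, valid_choices, reader=input):
    """Keep asking until the answer (ignoring case) is one of valid_choices.

    `reader` defaults to the built-in input(); any function that takes the
    prompt and returns a string can be substituted.
    """
    valid = {choice.lower() for choice in valid_choices}
    while True:
        answer = reader(prompt).strip().lower()
        if answer in valid:
            return answer
        print('Sorry, please enter one of: {}.'.format(', '.join(valid_choices)))


# Simulate a user who mistypes once before giving a valid city
answers = iter(['Boston', '  new york CITY '])
city = ask_choice('Which city?\n',
                  ['Chicago', 'New York City', 'Washington'],
                  reader=lambda prompt: next(answers))
print(city)
```

The first answer is rejected with the "Sorry" message, and the second is accepted after whitespace and case are normalized.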
Selecting the data

def see_datas(data):
    """
    Ask the user to choose a city, a month, or a day.

    Args:
        (str) data - what to ask for: 'cities', 'months', or 'days'
    Returns:
        (str) the chosen city or month (lower case), or the chosen day name
    """
    # Build the lists and dictionary (cities, months, and days) of valid choices
    cities = ['Chicago', 'New York City', 'Washington']
    months = ['January', 'February', 'March', 'April', 'May', 'June']
    days = {'1': 'Sunday', '2': 'Monday', '3': 'Tuesday', '4': 'Wednesday',
            '5': 'Thursday', '6': 'Friday', '7': 'Saturday'}

    # Get user input about cities
    if data == 'cities':
        return input_mod('Would you like to see data for Chicago, New York City or Washington?\n', cities)
    # Get user input about months
    if data == 'months':
        return input_mod('Which month? January, February, March, April, May or June?\n', months)
    # Get user input about weekdays
    if data == 'days':
        while True:
            day = input('Which day? Please type an integer (e.g., 1=Sunday):\n').strip()
            if day in days:
                return days[day]
            print('Sorry, please enter a correct integer (e.g., 1=Sunday).')
    # Any other argument used to loop forever; fail loudly instead
    raise ValueError("data must be 'cities', 'months', or 'days'")

Getting the city, month, and day to analyze from user input

def get_filters():
    """
    Asks user to specify a city, month, and day to analyze.

    Returns:
        (str) city - name of the city to analyze
        (str) month - name of the month to filter by, or "all" to apply no month filter
        (str) day - name of the day of week to filter by, or "all" to apply no day filter
    """
    print('Hello! Let\'s explore some US bikeshare data!')

    # Get user input for city (chicago, new york city, washington);
    # see_datas() keeps asking until the input is valid
    city = see_datas('cities')

    # Get user input for month (january ... june) and/or day of week,
    # depending on which filter the user asks for
    while True:
        enter = input('Would you like to filter the data by month, day, both, or not at all? '
                      'Type "none" for no time filter.\n').strip().lower()
        if enter == 'none':
            month = 'all'
            day = 'all'
            break
        elif enter == 'both':
            month = see_datas('months')
            day = see_datas('days')
            break
        elif enter == 'month':
            month = see_datas('months')
            day = 'all'
            break
        elif enter == 'day':
            month = 'all'
            day = see_datas('days')
            break
        else:
            print('Sorry, please enter month, day, both, or none.')

    print('-'*40)
    return city, month, day

Loading the data for the chosen city, month, and day

def load_data(city, month, day):
    """
    Loads data for the specified city and filters by month and day if applicable.

    Args:
        (str) city - name of the city to analyze
        (str) month - name of the month to filter by, or "all" to apply no month filter
        (str) day - name of the day of week to filter by, or "all" to apply no day filter
    Returns:
        df - Pandas DataFrame containing city data filtered by month and day
    """
    # load data file into a dataframe
    df = pd.read_csv(CITY_DATA[city])

    # convert the Start Time column to datetime
    df['Start Time'] = pd.to_datetime(df['Start Time'])

    # extract month and day of week from Start Time to create new columns
    # (dt.day_name() replaces dt.weekday_name, which was removed in pandas 1.0)
    df['month'] = df['Start Time'].dt.month
    df['day_of_week'] = df['Start Time'].dt.day_name()

    # filter by month if applicable
    if month != 'all':
        # use the index of the months list to get the corresponding int
        months = ['january', 'february', 'march', 'april', 'may', 'june']
        month = months.index(month) + 1

        # filter by month to create the new dataframe
        df = df[df['month'] == month]

    # filter by day of week if applicable
    if day != 'all':
        # filter by day of week to create the new dataframe
        df = df[df['day_of_week'] == day.title()]

    return df

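
The main function calls `time_stats(df)`, but the post does not list that function. A minimal sketch of what it might look like, answering questions 1 to 3, follows. It assumes the `month` and `day_of_week` columns created by `load_data`; the body is a guess at the missing function rather than the author's code, and the return value is added here only to make the result easy to check.

```python
import time

import pandas as pd


def time_stats(df):
    """Displays statistics on the most frequent times of travel."""
    print('\nCalculating The Most Frequent Times of Travel...\n')
    start_time = time.time()

    # Most common month: df['month'] holds 1..6, so map it back to a name
    months = ['January', 'February', 'March', 'April', 'May', 'June']
    common_month = months[int(df['month'].mode()[0]) - 1]
    print('Most common month: {}.'.format(common_month))

    # Most common day of week
    common_day = df['day_of_week'].mode()[0]
    print('Most common day of week: {}.'.format(common_day))

    # Most common start hour (0-23), taken from Start Time
    common_hour = int(df['Start Time'].dt.hour.mode()[0])
    print('Most common start hour: {}.'.format(common_hour))

    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)
    return common_month, common_day, common_hour


# Tiny invented frame with the columns load_data() would produce
demo = pd.DataFrame({'Start Time': pd.to_datetime([
    '2017-01-01 09:07:57', '2017-01-02 09:20:53', '2017-02-06 17:00:00'])})
demo['month'] = demo['Start Time'].dt.month
demo['day_of_week'] = demo['Start Time'].dt.day_name()
result = time_stats(demo)
```

When several values tie for most common, `mode()` returns them all in sorted order, so `[0]` picks the smallest; that tie-breaking rule is a property of this sketch, not something the post specifies.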
Computing and displaying the most popular stations and trip

def station_stats(df):
    """Displays statistics on the most popular stations and trip."""
    print('\nCalculating The Most Popular Stations and Trip...\n')
    start_time = time.time()

    # display most commonly used start station
    # (value_counts() sorts by count, so the first index is the most common)
    common_start = df['Start Station'].value_counts().index[0]
    print('Most commonly used start station: {}.'.format(common_start))

    # display most commonly used end station
    common_end = df['End Station'].value_counts().index[0]
    print('Most commonly used end station: {}.'.format(common_end))

    # display most frequent combination of start station and end station trip;
    # a local Series is used so the caller's dataframe is not modified
    combination = df['Start Station'] + ' / ' + df['End Station']
    common_combine = combination.value_counts().index[0]
    print('Most frequent combination of start and end station trip: {}.'.format(common_combine))

    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)

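
Joining the two station names into one string works, but counting the pairs directly keeps the start and end stations as separate values. A minimal sketch of that alternative with `groupby` follows; the station names are invented and `most_popular_trip` is a hypothetical helper name.

```python
import pandas as pd


def most_popular_trip(df):
    """Return ((start, end), count) for the most frequent station pair."""
    # size() counts the rows in each (Start Station, End Station) group
    pair_counts = df.groupby(['Start Station', 'End Station']).size()
    return pair_counts.idxmax(), int(pair_counts.max())


trips = pd.DataFrame({
    'Start Station': ['A St', 'A St', 'B Ave', 'A St'],
    'End Station': ['B Ave', 'B Ave', 'A St', 'C Rd'],
})
(start, end), count = most_popular_trip(trips)
print('{} -> {} ({} trips)'.format(start, end, count))
```

Because the pair comes back as a tuple, the two names can be formatted or compared separately without splitting a string.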
Computing and displaying the total and average trip duration

def trip_duration_stats(df):
    """Displays statistics on the total and average trip duration."""
    print('\nCalculating Trip Duration...\n')
    start_time = time.time()

    # display total travel time (Trip Duration is in seconds)
    total_time = df['Trip Duration'].sum()
    print('Total travel time: {} seconds.'.format(total_time))

    # display mean travel time
    mean_time = df['Trip Duration'].mean()
    print('Mean travel time: {} seconds.'.format(mean_time))

    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)

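
The total travel time is printed as a raw number of seconds, which is hard to read when it covers months of trips. One way to present it, sketched here with the built-in `divmod`, is to break it into days, hours, minutes, and seconds. The helper name `format_duration` is made up for this example.

```python
def format_duration(total_seconds):
    """Format a number of seconds as 'Dd HHh MMm SSs'."""
    seconds = int(round(total_seconds))
    # Peel off each unit in turn: seconds -> minutes -> hours -> days
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    return '{}d {:02d}h {:02d}m {:02d}s'.format(days, hours, minutes, seconds)


print(format_duration(776))     # the example trip from the dataset description
print(format_duration(90061))   # one day, one hour, one minute, one second
```

The same helper could wrap both `total_time` and `mean_time`, since it rounds fractional seconds before splitting them.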
Computing and displaying user statistics

def user_stats(df):
    """Displays statistics on bikeshare users."""
    print('\nCalculating User Stats...\n')
    start_time = time.time()

    # Display counts of user types; looping handles any number of types
    print('User type')
    for user_type, count in df['User Type'].value_counts().items():
        print('{}: {}'.format(user_type, count))

    # Display counts of gender (only some city files have this column)
    if 'Gender' in df.columns:
        user_gender = df['Gender'].value_counts()
        # .get() returns 0 instead of raising if a value is absent after filtering
        print('Male: {}\nFemale: {}'.format(user_gender.get('Male', 0),
                                            user_gender.get('Female', 0)))
    else:
        print("Sorry, this city doesn't have gender data.")

    # Display earliest, most recent, and most common year of birth
    if 'Birth Year' in df.columns and df['Birth Year'].notna().any():
        earliest_birth = df['Birth Year'].min()
        recent_birth = df['Birth Year'].max()
        common_birth = df['Birth Year'].value_counts().index[0]
        print('Earliest user year of birth: %i.' % earliest_birth)
        print('Most recent user year of birth: %i.' % recent_birth)
        print('Most common user year of birth: %i.' % common_birth)
    else:
        print("Sorry, this city doesn't have birth year data.")

    print("\nThis took %s seconds." % (time.time() - start_time))
    print('-'*40)

Main function

def main():
    while True:
        city, month, day = get_filters()
        df = load_data(city, month, day)

        # NOTE: time_stats() is not listed in this post. It must be defined
        # (it covers questions 1-3), or this call raises NameError.
        time_stats(df)
        station_stats(df)
        trip_duration_stats(df)
        user_stats(df)

        restart = input('\nWould you like to restart? Enter yes or no.\n')
        if restart.lower() != 'yes':
            break


if __name__ == "__main__":
    main()


Link: https://pan.baidu.com/s/1sSgbXBaSy1IxIfJqoMil2w
Password: m55o
