Pandas in Detail, Part 2: The DataFrame Object

Conventions used throughout:
    import pandas as pd
    from pandas import DataFrame
    import numpy as np
DataFrame

A DataFrame is a tabular data structure that has both a row index (stored in index) and a column index (stored in columns).

I. Common attributes of the DataFrame object:

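* There are many ways to create a DataFrame (covered later). The most common is to pass a dict of equal-length lists or NumPy arrays:
    dict1 = {"Province": ["Guangdong", "Beijing", "Qinghai", "Fujiang"],
             "year": [2018] * 4,
             "pop": [1.3, 2.5, 1.1, 0.7]}
    df1 = DataFrame(dict1)
    df1
Output (from an older pandas that sorts the dict keys; pandas 0.23+ on Python 3.6+ keeps the insertion order Province, year, pop):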
Province pop year
0 Guangdong 1.3 2018
1 Beijing 2.5 2018
2 Qinghai 1.1 2018
3 Fujiang 0.7 2018
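* As with Series, you can specify the column order and the row index at creation time; a column that is missing from the dict is filled with NaN:
    df2 = DataFrame(dict1, columns=['year', 'Province', 'pop', 'debt'],
                    index=['one', 'two', 'three', 'four'])
    df2
Output: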
year Province pop debt
one 2018 Guangdong 1.3 NaN
two 2018 Beijing 2.5 NaN
three 2018 Qinghai 1.1 NaN
four 2018 Fujiang 0.7 NaN
* As with Series, the index and columns of a DataFrame each have a name attribute:
    df2
Output:
year Province pop debt
one 2018 Guangdong 1.3 NaN
two 2018 Beijing 2.5 NaN
three 2018 Qinghai 1.1 NaN
four 2018 Fujiang 0.7 NaN
    df2.index.name = 'English'
    df2.columns.name = 'Province'
    df2
Output:
Province year Province pop debt
English
one 2018 Guangdong 1.3 NaN
two 2018 Beijing 2.5 NaN
three 2018 Qinghai 1.1 NaN
four 2018 Fujiang 0.7 NaN
* The shape attribute gives the number of rows and columns of the DataFrame:
    df2.shape
Output:
(4, 4)
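* The values attribute returns the data as a two-dimensional ndarray; because the columns hold mixed types here, the array dtype is object:
    df2.values
Output:
array([[2018, 'Guangdong', 1.3, nan],
       [2018, 'Beijing', 2.5, nan],
       [2018, 'Qinghai', 1.1, nan],
       [2018, 'Fujiang', 0.7, nan]], dtype=object)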
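* Column labels are also exposed as attributes of the DataFrame, provided the label is a valid Python identifier and does not clash with an existing attribute or method (df2.pop, for example, is the DataFrame.pop method, not the pop column):
    df2.Province
Output:
English
one      Guangdong
two        Beijing
three      Qinghai
four       Fujiang
Name: Province, dtype: object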
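The attribute shortcut has a catch worth seeing in code. The sketch below uses a small hypothetical frame (not df2 itself) to show that a column named pop is shadowed by the DataFrame.pop method, while bracket access always reaches the column.

```python
import pandas as pd

# Hypothetical two-row frame, used only for illustration.
df = pd.DataFrame({"Province": ["Guangdong", "Beijing"],
                   "pop": [1.3, 2.5]})

# "Province" is a valid identifier and clashes with nothing,
# so attribute access returns the column as a Series.
province = df.Province

# "pop" clashes with the DataFrame.pop method: df.pop is the method,
# not the column. Bracket access still returns the column.
pop_is_method = callable(df.pop)
pop_column = df["pop"]

print(type(province).__name__, pop_is_method, list(pop_column))
```

For that reason bracket access is the safer default whenever column names come from external data.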
II. Common ways to access, assign, and delete DataFrame data:

* DataFrame_object[ ] selects by column label. A single label returns a Series; a list of labels returns a DataFrame:
    df2['Province']
Output:
English
one      Guangdong
two        Beijing
three      Qinghai
four       Fujiang
Name: Province, dtype: object
    df2[['Province', 'pop']]
Output:
Province Province pop
English
one Guangdong 1.3
two Beijing 2.5
three Qinghai 1.1
four Fujiang 0.7
* DataFrame_object.loc[ ] selects rows by row label:
    df2.loc['one']
Output:
Province
year             2018
Province    Guangdong
pop               1.3
debt              NaN
Name: one, dtype: object
A label slice includes both endpoints:
    df2.loc['one':'three']
Output:
Province year Province pop debt
English
one 2018 Guangdong 1.3 NaN
two 2018 Beijing 2.5 NaN
three 2018 Qinghai 1.1 NaN
* It can also fetch a single value, given a row label and a column label:
    df2.loc['one', 'Province']
Output:
'Guangdong'
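The difference between label slicing with loc and ordinary Python slicing is easy to miss, so here is a minimal sketch on a hypothetical one-column frame: the loc slice keeps its end label, whereas a positional slice drops its stop position.

```python
import pandas as pd

# Hypothetical frame with the same row labels as df2.
df = pd.DataFrame({"pop": [1.3, 2.5, 1.1, 0.7]},
                  index=["one", "two", "three", "four"])

# Label slice: both 'one' and 'three' are included.
by_label = df.loc["one":"three"]

# Ordinary Python slice, for comparison: the stop position is excluded.
labels = list(df.index)[0:2]

# A single cell: row label first, then column label.
cell = df.loc["two", "pop"]

print(list(by_label.index), labels, cell)
```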
* A DataFrame column can be modified by assignment, using either a single value or a sequence with one value per row:
    df2["debt"] = np.arange(2, 3, 0.25)
    df2
Output:
Province year Province pop debt
English
one 2018 Guangdong 1.3 2.00
two 2018 Beijing 2.5 2.25
three 2018 Qinghai 1.1 2.50
four 2018 Fujiang 0.7 2.75
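To make the "single value or a sequence" distinction concrete, the sketch below (on a hypothetical frame; the column name other is made up) assigns a scalar, then an array of matching length, and finally shows that a sequence of the wrong length is rejected with a ValueError.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with four rows.
df = pd.DataFrame({"pop": [1.3, 2.5, 1.1, 0.7]},
                  index=["one", "two", "three", "four"])

# A scalar is broadcast to every row.
df["year"] = 2018

# An array or list must supply exactly one value per row.
df["debt"] = np.arange(2, 3, 0.25)

# A sequence of the wrong length is rejected.
try:
    df["other"] = [1.0, 2.0]
    length_error = None
except ValueError as exc:
    length_error = exc

print(df)
print("error:", length_error)
```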
* Assigning to a column that does not exist creates a new column; del removes a column:
    df2['eastern'] = df2.Province == 'Guangdong'
    df2
Output:
Province year Province pop debt eastern
English
one 2018 Guangdong 1.3 2.00 True
two 2018 Beijing 2.5 2.25 False
three 2018 Qinghai 1.1 2.50 False
four 2018 Fujiang 0.7 2.75 False
    del df2['eastern']
    df2.columns
Output:
Index(['year', 'Province', 'pop', 'debt'], dtype='object', name='Province')
* And of course a DataFrame can be transposed:
    df2.T
Output:
English one two three four
Province
year 2018 2018 2018 2018
Province Guangdong Beijing Qinghai Fujiang
pop 1.3 2.5 1.1 0.7
debt 2 2.25 2.5 2.75
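One side effect of transposing is visible in the output above, where debt prints as 2 rather than 2.00. A plausible reading is that each transposed column now mixes numbers and strings and therefore falls back to the object dtype. The sketch below checks this on a hypothetical three-row frame.

```python
import pandas as pd

# Hypothetical frame with one integer column and one string column.
df = pd.DataFrame({"year": [2018, 2018, 2018],
                   "Province": ["Guangdong", "Beijing", "Qinghai"]},
                  index=["one", "two", "three"])

t = df.T

# Rows and columns swap places.
shape_before, shape_after = df.shape, t.shape

# Every transposed column holds one int and one string, so it is object.
all_object = bool((t.dtypes == object).all())

print(shape_before, shape_after, all_object)
```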
III. Multiple ways to create a DataFrame

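* Calling DataFrame() converts data in many formats into a DataFrame object. Its three parameters data, index, and columns supply the data, the row index, and the column index respectively. data can be:
1 A two-dimensional array
    df3 = pd.DataFrame(np.random.randint(0, 10, (4, 4)), index=[1, 2, 3, 4],
                       columns=['A', 'B', 'C', 'D'])
    df3
Output (the values are random and differ on every run):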
A B C D
1 9 8 4 6
2 5 7 7 4
3 6 3 0 2
4 4 6 9 8
2 A dict

The row index is set by index; the column index comes from the dict keys.
    dict1
Output:
{'Province': ['Guangdong', 'Beijing', 'Qinghai', 'Fujiang'], 'pop': [1.3, 2.5,
1.1, 0.7], 'year': [2018, 2018, 2018, 2018]}
    df4 = pd.DataFrame(dict1, index=[1, 2, 3, 4])
    df4
Output:
Province pop year
1 Guangdong 1.3 2018
2 Beijing 2.5 2018
3 Qinghai 1.1 2018
4 Fujiang 0.7 2018
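3 A structured array

The column index comes from the field names of the structured array. The name field is declared as a 10-byte string ("S10"), so its values display as bytes:
    arr = np.array([('item1', 10), ('item2', 20), ('item3', 30), ('item4', 40)],
                   dtype=[("name", "S10"), ("count", int)])
    df5 = pd.DataFrame(arr)
    df5
Output:
name count
0 b'item1' 10
1 b'item2' 20
2 b'item3' 30
3 b'item4' 40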
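The b'...' prefix in that output means the name column holds bytes, not text. If ordinary strings are wanted, two options are sketched below under the assumption that the data is ASCII or UTF-8: decode the column after building the frame, or declare the field with a Unicode dtype ("U10") in the first place.

```python
import numpy as np
import pandas as pd

arr = np.array([("item1", 10), ("item2", 20)],
               dtype=[("name", "S10"), ("count", int)])
df = pd.DataFrame(arr)

# Option 1: decode the bytes column after the fact.
df["name"] = df["name"].str.decode("utf-8")

# Option 2: declare the field as Unicode so no decoding is needed.
arr_u = np.array([("item1", 10), ("item2", 20)],
                 dtype=[("name", "U10"), ("count", int)])
df_u = pd.DataFrame(arr_u)

print(list(df["name"]), list(df_u["name"]))
```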
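* You can also call the class methods whose names start with from_ to convert specific kinds of data into a DataFrame. For example, from_dict() takes an orient parameter that says which axis the dict keys map to; the default is "columns":
    dict2 = {"a": [1, 2, 3], "b": [4, 5, 6]}
    df6 = pd.DataFrame.from_dict(dict2)
    df6
Output: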
a b
0 1 4
1 2 5
2 3 6
    df7 = pd.DataFrame.from_dict(dict2, orient="index")
    df7
Output:
0 1 2
a 1 2 3
b 4 5 6
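With orient="index" the columns default to 0, 1, 2, as in the output above. from_dict() also accepts a columns argument in that mode, so the columns can be named at creation. The labels x, y, z below are made up for illustration.

```python
import pandas as pd

dict2 = {"a": [1, 2, 3], "b": [4, 5, 6]}

# Keys become row labels; columns= names the positions inside each list.
df = pd.DataFrame.from_dict(dict2, orient="index", columns=["x", "y", "z"])

print(df)
print(df.loc["b", "y"])
```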
IV. Converting a DataFrame to other formats

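* to_dict() converts a DataFrame to a dict; the orient parameter determines the shape of the result (the default, "dict", maps each column label to a {row label: value} dict):
    df7.to_dict()
Output: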
{0: {'a': 1, 'b': 4}, 1: {'a': 2, 'b': 5}, 2: {'a': 3, 'b': 6}}
    df7.to_dict(orient="records")
Output:
[{0: 1, 1: 2, 2: 3}, {0: 4, 1: 5, 2: 6}]
    df7.to_dict(orient="list")
Output:
{0: [1, 4], 1: [2, 5], 2: [3, 6]}
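Besides "records" and "list", to_dict() also accepts orient values such as "index" and "split". The sketch below uses a hypothetical 2x2 frame with string labels so that rows and columns are easy to tell apart in the results.

```python
import pandas as pd

# Hypothetical frame; r1/r2 are row labels, a/b are column labels.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])

# Default orient="dict": column label -> {row label: value}
as_dict = df.to_dict()

# orient="index": row label -> {column label: value}
as_index = df.to_dict(orient="index")

# orient="split": separate lists for the index, the columns, and the data
as_split = df.to_dict(orient="split")

print(as_dict)
print(as_index)
print(as_split)
```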
* Similar methods include to_records(), to_csv(), and others.
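As a brief illustration of those two methods, the sketch below runs them on a hypothetical 2x2 frame. It relies on two behaviours: to_csv() returns the CSV text when called without a path, and to_records() stores an unnamed index in a field called index.

```python
import pandas as pd

# Hypothetical frame, used only for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])

# With no path argument, to_csv() returns the CSV text instead of writing a file.
csv_lines = df.to_csv().splitlines()

# to_records() returns a NumPy record array that includes the index.
records = df.to_records()
field_names = records.dtype.names

print(csv_lines)
print(field_names, records["a"].tolist())
```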
Thank you for reading.
I hope my efforts are of some help to you.
Let's keep improving together!
