pandas Data Visualization (4): pandas.DataFrame Operations (Summary)

https://www.scipy.org/

pandas.DataFrame: creating, indexing, adding, deleting - CSDN blog
https://blog.csdn.net/haruhi330/article/details/60872526

[python DataFrame] DataFrame, the dragon-slaying sword of pandas - CSDN blog
https://blog.csdn.net/u013421629/article/details/72843957

In pandas, a DataFrame is a two-dimensional table.

Intro to Data Structures — pandas 0.22.0 documentation
http://pandas.pydata.org/pandas-docs/version/0.22/dsintro.html

Column selection, addition, deletion
You can treat a DataFrame semantically like a dict of like-indexed Series
objects. Getting, setting, and deleting columns works with the same syntax as
the analogous dict operations:
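
The dict-like column operations can be sketched as follows. This is a minimal illustration; the column names one, two, and three are made up for the example.

```python
import pandas as pd

# Hypothetical data; the column names are chosen only for illustration.
df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]})

col = df["one"]                      # get: returns a Series
df["three"] = df["one"] * df["two"]  # set: adds a new column
del df["two"]                        # delete: removes the column in place
popped = df.pop("three")             # pop: removes the column and returns it

print(list(df.columns))
```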

Indexing / Selection

The basics of indexing are as follows:
Operation                        Syntax         Result
Select column                    df[col]        Series
Select row by label              df.loc[label]  Series
Select row by integer location   df.iloc[loc]   Series
Slice rows                       df[5:10]       DataFrame
Select rows by boolean vector    df[bool_vec]   DataFrame
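
Each row of the table can be tried on a small frame. The sketch below assumes a hypothetical DataFrame with string row labels, so that label-based and position-based access are easy to tell apart.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame(
    {"a": [10, 20, 30, 40], "b": [1.5, 2.5, 3.5, 4.5]},
    index=["w", "x", "y", "z"],
)

col = df["a"]                # select column -> Series
row_by_label = df.loc["x"]   # select row by label -> Series
row_by_pos = df.iloc[1]      # select row by integer location -> Series
sliced = df[1:3]             # slice rows -> DataFrame (positions 1 and 2)
filtered = df[df["a"] > 20]  # select rows by boolean vector -> DataFrame
```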

API Reference — pandas 0.22.0 documentation
http://pandas.pydata.org/pandas-docs/version/0.22/api.html#attributes-and-underlying-data


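A DataFrame resembles an Excel sheet or a SQL table with no fixed limit on the number of rows or columns.

pandas is a library dedicated to data processing. Compared with general-purpose SQL, a DataFrame offers more flexible and more customizable data handling and supports many complex operations: creating, reading (from MySQL or CSV), exporting (to CSV or HTML), indexing, adding, deleting, and so on.

In a DataFrame, index identifies the rows, columns identifies the columns, and shape gives the dimensions.

size gives the total number of elements: df.size
df.index       # row index information
df.columns     # column index information
df.shape       # shape of df as a (rows, columns) tuple
df.shape[0]    # number of rows in df
df.shape[1]    # number of columns in df
df.values      # the values in df
df.describe()  # summary statistics of the data in df
pandas.DataFrame.shape: Return a tuple representing the dimensionality of the DataFrame.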
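
A quick way to see these attributes is to inspect a small frame. This is a sketch on made-up data; the column names A to D are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x4 frame holding the integers 0..11.
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("ABCD"))

n_rows, n_cols = df.shape  # (3, 4)
n_cells = df.size          # rows * columns
row_labels = list(df.index)
col_labels = list(df.columns)
raw = df.values            # underlying NumPy array
summary = df.describe()    # count, mean, std, min, quartiles, max per column
```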

Details: merging and concatenating tables
http://pandas.pydata.org/pandas-docs/version/0.22/merging.html#merging-concatenation

Detailed and handy: 10 Minutes to pandas - pandas 0.22.0 documentation
http://pandas.pydata.org/pandas-docs/version/0.22/10min.html

Python: functions for operating on DataFrame data in pandas - CSDN blog
https://blog.csdn.net/ly_ysys629/article/details/54428838

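2. Selecting, deleting, and updating data.

Select by column name or row label:
df[0]        # select the column labeled 0
df["姓名"]   # select the entire column named 姓名 ("name")
df.loc[0]    # select the row whose index label is 0
df[:3]       # slice by row position: the first 3 rows

Select by row and column position:
df["姓名"]                  # select by column name
df.loc["姓名"]              # select by row label
df.iloc[3]                  # the row at position 3 (the fourth row)
df.iloc[2:4]                # the rows at positions 2 and 3
df.iloc[0, 1]               # the element at row 0, column 1
df.iloc[:2, :3]             # rows 0 to 1, columns 0 to 2
df.iloc[[1, 3, 5], [1, 3]]  # rows 1, 3, 5 and columns 1, 3

Delete:
del df[0]    # delete the column labeled 0 in place
df.drop(0)   # return a copy without the row labeled 0; df itself is unchanged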
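
The difference between in-place deletion and drop, along with a simple update, can be sketched like this. The frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical frame with the default integer row labels 0, 1, 2.
df = pd.DataFrame({"name": ["Ann", "Bob", "Cid"], "age": [30, 25, 41]})

cell = df.iloc[0, 1]                    # element at row 0, column 1
df.loc[1, "age"] = 26                   # update a single cell by label
without_row = df.drop(0)                # new frame without row label 0
without_col = df.drop(columns=["age"])  # new frame without the age column
del df["age"]                           # removes the column from df itself
```

Because drop returns a new object, the result has to be assigned to a name if it is needed later.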

3. Operations.
# Basic arithmetic:
df[4] = df[1] + df[2]
# map, which is similar to Python's built-in map:
df[4].map(int)
# apply:
df.apply(sum)
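
A minimal sketch of the three operations on made-up numbers, chosen so that the floating-point sums are exact. The integer column labels 1, 2, and 4 mirror the snippet above.

```python
import pandas as pd

# Hypothetical data with integer column labels.
df = pd.DataFrame({1: [1.5, 2.25, 3.0], 2: [0.25, 0.5, 0.75]})

df[4] = df[1] + df[2]             # element-wise addition of two columns
as_int = df[4].map(int)           # apply int() to each element (truncates)
col_sums = df.apply(sum)          # sum of each column
row_sums = df.apply(sum, axis=1)  # sum of each row
```

For plain sums, df.sum() and df.sum(axis=1) give the same results and are usually faster than apply.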
4. Group-by operations.

pandas supports flexible group-by operations directly, without first loading the data into Excel or MySQL.

Group By: split-apply-combine — pandas 0.22.0 documentation
http://pandas.pydata.org/pandas-docs/stable/groupby.html
df[0] = ['A', 'A', 'B']
df
          1         2         3         4  0
0 -0.394792 -0.171866  0.304012 -0.566659  A
1  0.989046  0.160389  0.482936  1.149435  A
2  0.401105 -0.492714 -1.220438 -0.091609  B
g = df.groupby([0])
g.size()
A    2
B    1
g.sum()
          1         2         3         4
0
A  0.594254 -0.011478  0.786948  0.582776
B  0.401105 -0.492714 -1.220438 -0.091609
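
The same split-apply-combine pattern can be reproduced with fixed numbers. A small sketch, assuming a hypothetical frame with a string column named group:

```python
import pandas as pd

# Hypothetical data: two rows in group A, one row in group B.
df = pd.DataFrame(
    {"group": ["A", "A", "B"], "x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]}
)

g = df.groupby("group")
sizes = g.size()  # number of rows per group
sums = g.sum()    # per-group sum of every numeric column
```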
5. Summing rows and columns, and adding new rows and columns

Summing rows and columns of a pandas.DataFrame and adding new rows and columns - 东围居士 - cnblogs
https://www.cnblogs.com/wuzhiblog/p/python_new_row_or_col.html
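
One possible way to do this, not necessarily the approach of the linked post, is to store the row sums in a new column and the column sums in a new row. The labels row_total and col_total are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# New column: the sum across each row.
df["row_total"] = df.sum(axis=1)

# New row: the sum down each column (including the new row_total column).
df.loc["col_total"] = df.sum(axis=0)

print(df)
```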

(1) Ways to create a DataFrame:

Directly with the constructor, from:

A dict of lists, Series (pandas.Series), or numpy.ndarray
A two-dimensional numpy.ndarray
Another DataFrame

Structured records (structured arrays)
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 4))
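
Each input type in the list can be passed straight to the constructor. The following sketch uses made-up names and values throughout.

```python
import numpy as np
import pandas as pd

# From a dict whose values are a list and a Series.
from_dict = pd.DataFrame({"x": [1, 2, 3], "y": pd.Series([4.0, 5.0, 6.0])})

# From a two-dimensional ndarray, with explicit column names.
from_array = pd.DataFrame(np.zeros((2, 3)), columns=["a", "b", "c"])

# From another DataFrame, keeping only one of its columns.
from_frame = pd.DataFrame(from_dict, columns=["y"])

# From a NumPy structured array; field names become column names.
records = np.array([(1, 2.5), (2, 3.5)], dtype=[("id", "i4"), ("score", "f8")])
from_records = pd.DataFrame(records)
```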

By reading external data:

http://pandas.pydata.org/pandas-docs/version/0.22/io.html

1. Reading MySQL data with pandas

To read data from MySQL, first install the MySQLdb package. Suppose the database runs locally, the user name is myusername, the password is mypassword, and the data to read is in the mydb database. The corresponding code is:
import pandas as pd
import MySQLdb
mysql_cn = MySQLdb.connect(host='localhost', port=3306, user='myusername', passwd='mypassword', db='mydb')
df = pd.read_sql('select * from test;', con=mysql_cn)
mysql_cn.close()
The code above reads all rows of the test table into df, which is a DataFrame.

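2. Reading CSV data with pandas
Reading a CSV file is much simpler and needs no extra package. Suppose loggerfile holds the path of the file to read, for example test.csv. The corresponding code is:
df = pd.read_csv(loggerfile, header=None, sep=',')
header=None means the file has no header row, and sep=',' means the fields are separated by commas.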
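
The effect of header=None can be checked without a file on disk by reading from an in-memory buffer. A sketch with made-up CSV text; the names passed via names= are assumptions.

```python
import io

import pandas as pd

# Hypothetical CSV content with no header row.
csv_text = "1,alice,3.5\n2,bob,4.0\n"

# header=None: columns are labeled 0, 1, 2 and the first line is treated as data.
df = pd.read_csv(io.StringIO(csv_text), header=None, sep=",")

# names= assigns column labels when the file itself has none.
named = pd.read_csv(io.StringIO(csv_text), header=None, names=["id", "name", "score"])
```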

pandas raises "UnicodeDecodeError: 'utf8' codec can't decode byte" when reading a CSV file - CSDN blog
https://blog.csdn.net/sjpljr/article/details/79865532

3. Reading SQLite data with pandas

import sqlite3
import pandas as pd
conn = sqlite3.connect("dbname.db")
mysql = "select * from tablename"
df = pd.read_sql(mysql, conn, index_col='id')  # load the rows of tablename into the DataFrame df
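
To try the same call without an existing database file, an in-memory SQLite database works. In this sketch the table layout and its rows are invented for illustration.

```python
import sqlite3

import pandas as pd

# In-memory database with a small hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (id INTEGER PRIMARY KEY, name TEXT, score REAL)")
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?, ?)",
    [(1, "ann", 3.5), (2, "bob", 4.0)],
)
conn.commit()

# index_col='id' turns the id column into the row index of the result.
df = pd.read_sql("select * from tablename", conn, index_col="id")
conn.close()
```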

4. Reading Excel with pandas - https://blog.csdn.net/sjpljr/article/details/80168955

(2) Writing a pandas DataFrame to files and databases

Writing pandas DataFrame data to files and databases - http://www.dcharm.com/?p=584

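1. Writing a DataFrame to a CSV file
df.to_csv('D:\\a.csv', sep=',', header=True, index=True)
df.to_csv(file_path, encoding='utf-8', index=False)
df.to_csv(file_path, index=False)

In the first call, the first argument writes the DataFrame to the file a.csv on drive D, sep separates the fields with ',', header controls whether the header row is written, and index controls whether the row labels are written.

If the data contains Chinese text, encoding is usually set to "utf-8"; otherwise the export can raise an exception because the characters cannot be encoded. This matters mainly on Python 2, since on Python 3 to_csv already defaults to UTF-8. index=False means the DataFrame's index is not exported.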
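
The effect of the index argument is easiest to see by writing to an in-memory buffer instead of a file. A sketch on made-up data:

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["ann", "bob"], "score": [90, 85]})

# index=False: only the columns are written.
buf = io.StringIO()
df.to_csv(buf, sep=",", header=True, index=False)
lines_without_index = buf.getvalue().splitlines()

# index=True: the row labels become the first field of every line.
buf = io.StringIO()
df.to_csv(buf, sep=",", header=True, index=True)
lines_with_index = buf.getvalue().splitlines()
```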

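2. Writing a DataFrame to a JSON file
df.to_json('D:\\a.json')  # write the DataFrame to the file a.json on drive D
3. Writing a DataFrame to an HTML file
df.to_html('D:\\a.html')
4. Copying a DataFrame to the clipboard
I find this the most thoughtful feature: a single line of code copies the contents of the DataFrame to the clipboard, from where it can be pasted anywhere.
df.to_clipboard()
5. Writing a DataFrame to a database
df.to_sql('tableName', con=dbcon, flavor='mysql')
The first argument is the name of the table to write to, the second is a SQLAlchemy database connection object, and the third gives the database type, here "mysql". The flavor argument was deprecated in pandas 0.19 and later removed, so on current versions the call is df.to_sql('tableName', con=dbcon).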
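
As a runnable stand-in for the MySQL case, the sketch below writes to an in-memory SQLite database through a plain sqlite3 connection, which pandas accepts without SQLAlchemy. The data is made up, and no flavor argument is passed.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"name": ["ann", "bob"], "score": [90, 85]})

conn = sqlite3.connect(":memory:")

# index=False keeps the row labels out of the table.
df.to_sql("tableName", con=conn, index=False)

# Read the table back to confirm what was written.
back = pd.read_sql("select * from tableName", conn)
conn.close()
```

For MySQL, con would instead be a SQLAlchemy engine or connection.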

Application example: query a MySQL database into a DataFrame and write the result into a tableWidget.
print("Main checkpoint E1: personal profit contribution")
if inputText != "":
    # f_name_t2 = r'*'
    f_name_t2 = r'姓名,公司名称, 年份, 任职开始期间, 任职结束时间, 任务目标, 审计利润'
    t_profit = "t_profit"
    sql_profit = r'''select %s from %s where 证件号码='%s' ''' % (f_name_t2, t_profit, str(inputText))
    # db_b = pd.read_sql(sql_a, conn, index_col="id")  # DataFrame data
    print("Main checkpoint E2:")
    # conn = pymysql.connect("localhost", "pd", "pd123", "cems", charset="utf8")
    db_b = pd.read_sql(sql_profit, conn)  # DataFrame data
    print("Main checkpoint E3, profit area:\n", db_b)
    if not db_b.empty:
        row_num3 = db_b.shape[0]  # number of rows
        col_num3 = db_b.shape[1]  # number of columns
        self.tableWidget_profit.setRowCount(row_num3)
        self.tableWidget_profit.setColumnCount(col_num3)
        for i in range(row_num3):
            for j in range(col_num3):
                temp_db_a2 = db_b.iloc[i, j]
                # The value must be converted with str() before it is wrapped in an item.
                db_a2 = QTableWidgetItem(str(temp_db_a2))
                self.tableWidget_profit.setItem(i, j, db_a2)
        col_name3 = list(db_b.columns)
        # print(col_name3)
        self.tableWidget_profit.setHorizontalHeaderLabels(col_name3)  # set the column names
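
The cell-by-cell str() conversion can be tried without a GUI or a database. A minimal sketch, assuming a small hypothetical DataFrame in place of the query result:

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by pd.read_sql.
db_b = pd.DataFrame(
    {"name": ["ann", "bob"], "year": [2017, 2018], "profit": [12.5, 8.0]}
)

# Convert every cell to text, row by row, as the table widget loop does.
cells = [
    [str(db_b.iloc[i, j]) for j in range(db_b.shape[1])]
    for i in range(db_b.shape[0])
]

# Column labels, converted to strings for use as header labels.
headers = [str(c) for c in db_b.columns]
```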

Special topic: indexing and selecting data in pandas (Indexing and Selecting Data)

Indexing and Selecting Data — pandas 0.23.0 documentation
http://pandas.pydata.org/pandas-docs/stable/indexing.html

The axis labeling information in pandas objects serves many purposes:

Identifies data (i.e. provides metadata, that is, data about the data) using known indicators, important for analysis, visualization, and interactive console display.

Enables automatic and explicit data alignment.

Allows intuitive getting and setting of subsets of the data set.

I. Different Choices for Indexing

Pandas now supports three types of multi-axis indexing.

1. .loc is primarily label based, but may also be used with a boolean array. Allowed inputs are:

A single label, e.g. 5 or 'a' (Note that 5 is interpreted as a label of the
index. This use is not an integer position along the index.).

A list or array of labels ['a', 'b', 'c'].
A slice object with labels 'a':'f' (Note that contrary to usual python
slices, both the start and the stop are included, when present in the index!
See Slicing with labels.).
A boolean array
A callable function with one argument (the calling Series, DataFrame or
Panel) and that returns valid output for indexing (one of the above).

2. .iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.
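
The contrast between label-based and position-based access can be sketched on small objects. The Series and DataFrame below are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"v": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])
s = pd.Series([10, 20, 30], index=[5, 6, 7])

by_label = s.loc[5]  # 5 is an index label here, not a position
by_pos = s.iloc[0]   # position 0 holds the same element

label_slice = df.loc["a":"c"]  # label slice: both endpoints included
pos_slice = df.iloc[0:2]       # position slice: stop is excluded

mask_rows = df.loc[df["v"] > 2]                      # boolean array with .loc
callable_rows = df.loc[lambda d: d["v"] % 2 == 0]    # callable returning a mask
bool_iloc = df.iloc[[True, False, True, False]]      # boolean array with .iloc
```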

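On converting data types in a DataFrame:

After pandas reads a CSV file into a DataFrame df, running df.dtypes may show that a column has type object.

object stands for strings, int for integers, float for floating-point numbers, datetime for timestamps, and bool for booleans; everything else is also shown as object.

pandas dtypes correspond to NumPy dtypes. NumPy dtypes fall into two broad classes: value types provided by NumPy's C extension library, and Python's own object types. For value types, the C extension library keeps computation fast.

df_fra2.loc['currentratio':'adratio'].astype(float) sometimes fails to convert the data to a numeric type.

float('quickratio') does not work either, because it tries to parse the literal string 'quickratio' rather than the values in the column.

df_fra2['quickratio'] = pd.to_numeric(df_fra2['quickratio'], errors='ignore')
# This converts an object column to a numeric type when every value parses as a number. If any value cannot be parsed, errors='ignore' returns the column unchanged, whereas errors='coerce' turns such values into NaN.

If the dtype is still object, run df['quickratio'].value_counts() to see which values are not numbers.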
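
One way to locate the offending values directly is to convert with errors='coerce' and look at the entries that became NaN. A sketch on a made-up column, where the string "n/a" stands in for a non-numeric entry:

```python
import pandas as pd

# Hypothetical column read as text, with one value that is not a number.
df = pd.DataFrame({"quickratio": ["1.5", "2.0", "n/a", "0.8"]})

# Unparseable values become NaN instead of blocking the conversion.
converted = pd.to_numeric(df["quickratio"], errors="coerce")

# The original strings that could not be parsed.
bad_values = df.loc[converted.isna(), "quickratio"]

df["quickratio"] = converted
```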

Changing the data type of a column in pandas (summary of methods) - xitingxie - cnblogs
https://www.cnblogs.com/xitingxie/p/8426340.html
