数据科学 | 十分钟入门pandas/常用方法笔记

DataScience

发布日期: 2020-04-17

文章字数: 11.9k

阅读时长: 67 分

pandas

0x00 前言

最近对数据科学有点感兴趣，众所周知，在 python 中用来做数据处理常用的有 numpy、pandas 等模块，它们功能非常强大。

考虑到有项目正好要用到 pandas，这次就来学一下吧。（挺久之前就想学的，一直咕

看到网上推荐了官方的 10 Minutes to pandas，就去看了一下，翻译了一下官方的教程，自己跑了代码，顺便再记了点笔记吧。

下面的内容主要参考的是10 Minutes to pandas（~~然而看了一晚上~~），运行环境为 Jupyter Notebook。

这篇博客也是在从 Jupyter Notebook 里导出的 markdown 基础上完成的，挺方便的呢。（有些地方可能由于博客样式的原因显示会有点问题，凑合着看一看吧。

~~后面有空再整理一下~~，放个 Jupyter Notebook 的.ipynb版本吧。

—->详见文末的更新！

0x01 导入模块

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

这些都是常用别名了呢（

0x02 创建对象 Object Creation

传递一个list对象来创建一个Series，pandas会默认创建整型索引(index)。

s = pd.Series([1,3,5,np.nan,6,8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

传递一个numpy array，时间索引以及列标签来创建一个DataFrame

dates = pd.date_range('20200416', periods=6)
dates

DatetimeIndex(['2020-04-16', '2020-04-17', '2020-04-18', '2020-04-19',
               '2020-04-20', '2020-04-21'],
              dtype='datetime64[ns]', freq='D')

df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

	A	B	C	D
2020-04-16	-0.228084	0.712521	0.743378	0.526823
2020-04-17	-0.423923	0.564346	-2.318931	0.080931
2020-04-18	-2.032243	2.046827	1.176614	-0.365556
2020-04-19	-1.200005	1.109991	-1.527804	1.068642
2020-04-20	-0.535031	1.351420	0.907865	-1.074013
2020-04-21	-0.614734	0.027143	-0.420875	-0.333819

传递一个能够被转换成类似序列结构的字典对象来创建一个DataFrame

df2 = pd.DataFrame({ 'A' : 1.,
 'B' : pd.Timestamp('20200416183800'),
 'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
 'D' : np.array([3] * 4,dtype='int32'),
 'E' : pd.Categorical(["test","train","test","train"]),
 'F' : 'foo' })
df2

	A	B	C	D	E	F
0	1.0	2020-04-16 18:38:00	1.0	3	test	foo
1	1.0	2020-04-16 18:38:00	1.0	3	train	foo
2	1.0	2020-04-16 18:38:00	1.0	3	test	foo
3	1.0	2020-04-16 18:38:00	1.0	3	train	foo

查看他们的数据类型(dtypes).

df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

在 IPython 里输入 df2.，按下 TAB 键，会有常用的代码提示。

这里我是在 Jupyter Notebook 中跑，也有类似效果。为了方便展示这里用dir来列出来吧。

dir(df2)

['A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'T',
 ######################...###########
 'to_feather',
 'to_gbq',
 'to_hdf',
 'to_html',
 'to_json',
 'to_latex',
 'to_msgpack',
 'to_numpy',
 'to_panel',
 'to_parquet',
 'to_period',
 'to_pickle',
 'to_records',
 'to_sparse',
 'to_sql',
 'to_stata',
 'to_string',
 'to_timestamp',
 'to_xarray',
 'transform',
 'transpose',
 'truediv',
 'truncate',
 'tshift',
 'tz_convert',
 'tz_localize',
 'unstack',
 'update',
 'values',
 'var',
 'where',
 'xs']

0x03 查看数据 Viewing Data

查看 frame 中头部和尾部的行

df.head()  # 查看前五行数据

	A	B	C	D
2020-04-16	-0.228084	0.712521	0.743378	0.526823
2020-04-17	-0.423923	0.564346	-2.318931	0.080931
2020-04-18	-2.032243	2.046827	1.176614	-0.365556
2020-04-19	-1.200005	1.109991	-1.527804	1.068642
2020-04-20	-0.535031	1.351420	0.907865	-1.074013

df.head(2)  # 前两行

	A	B	C	D
2020-04-16	-0.228084	0.712521	0.743378	0.526823
2020-04-17	-0.423923	0.564346	-2.318931	0.080931

df.tail(3)  # 倒数三行

	A	B	C	D
2020-04-19	-1.200005	1.109991	-1.527804	1.068642
2020-04-20	-0.535031	1.351420	0.907865	-1.074013
2020-04-21	-0.614734	0.027143	-0.420875	-0.333819

显示索引、列和底层的(underlying) numpy 数据

df

	A	B	C	D
2020-04-16	-0.228084	0.712521	0.743378	0.526823
2020-04-17	-0.423923	0.564346	-2.318931	0.080931
2020-04-18	-2.032243	2.046827	1.176614	-0.365556
2020-04-19	-1.200005	1.109991	-1.527804	1.068642
2020-04-20	-0.535031	1.351420	0.907865	-1.074013
2020-04-21	-0.614734	0.027143	-0.420875	-0.333819

df.index

DatetimeIndex(['2020-04-16', '2020-04-17', '2020-04-18', '2020-04-19',
               '2020-04-20', '2020-04-21'],
              dtype='datetime64[ns]', freq='D')

df2.index

Int64Index([0, 1, 2, 3], dtype='int64')

df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

df.values

array([[-0.22808386,  0.71252123,  0.74337822,  0.52682341],
       [-0.42392275,  0.56434629, -2.31893066,  0.08093058],
       [-2.03224294,  2.04682736,  1.17661398, -0.3655562 ],
       [-1.20000469,  1.10999124, -1.52780407,  1.06864157],
       [-0.53503099,  1.35142044,  0.90786509, -1.07401348],
       [-0.61473412,  0.02714293, -0.42087502, -0.33381946]])

df2.values

array([[1.0, Timestamp('2020-04-16 18:38:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2020-04-16 18:38:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2020-04-16 18:38:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2020-04-16 18:38:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

print(type(df))
print(type(df.columns))
print(type(df.values))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.indexes.base.Index'>
<class 'numpy.ndarray'>

describe 函数对于数据的快速统计汇总

df.describe()

	A	B	C	D
count	6.000000	6.000000	6.000000	6.000000
mean	-0.839003	0.968708	-0.239959	-0.016166
std	0.669679	0.699208	1.435586	0.751411
min	-2.032243	0.027143	-2.318931	-1.074013
25%	-1.053687	0.601390	-1.251072	-0.357622
50%	-0.574883	0.911256	0.161252	-0.126444
75%	-0.451700	1.291063	0.866743	0.415350
max	-0.228084	2.046827	1.176614	1.068642

df2.describe()

	A	C	D
count	4.0	4.0	4.0
mean	1.0	1.0	3.0
std	0.0	0.0	0.0
min	1.0	1.0	3.0
25%	1.0	1.0	3.0
50%	1.0	1.0	3.0
75%	1.0	1.0	3.0
max	1.0	1.0	3.0

对数据进行转置(Transposing)

df.T

	2020-04-16 00:00:00	2020-04-17 00:00:00	2020-04-18 00:00:00	2020-04-19 00:00:00	2020-04-20 00:00:00	2020-04-21 00:00:00
A	-0.228084	-0.423923	-2.032243	-1.200005	-0.535031	-0.614734
B	0.712521	0.564346	2.046827	1.109991	1.351420	0.027143
C	0.743378	-2.318931	1.176614	-1.527804	0.907865	-0.420875
D	0.526823	0.080931	-0.365556	1.068642	-1.074013	-0.333819

df2.T

	0	1	2	3
A	1	1	1	1
B	2020-04-16 18:38:00	2020-04-16 18:38:00	2020-04-16 18:38:00	2020-04-16 18:38:00
C	1	1	1	1
D	3	3	3	3
E	test	train	test	train
F	foo	foo	foo	foo

按照某一坐标轴进行排序

df.sort_index(axis=1, ascending=False)  # 降序

	D	C	B	A
2020-04-16	0.526823	0.743378	0.712521	-0.228084
2020-04-17	0.080931	-2.318931	0.564346	-0.423923
2020-04-18	-0.365556	1.176614	2.046827	-2.032243
2020-04-19	1.068642	-1.527804	1.109991	-1.200005
2020-04-20	-1.074013	0.907865	1.351420	-0.535031
2020-04-21	-0.333819	-0.420875	0.027143	-0.614734

df.sort_index(axis=1, ascending=True)  # 升序（默认）

	A	B	C	D
2020-04-16	-0.228084	0.712521	0.743378	0.526823
2020-04-17	-0.423923	0.564346	-2.318931	0.080931
2020-04-18	-2.032243	2.046827	1.176614	-0.365556
2020-04-19	-1.200005	1.109991	-1.527804	1.068642
2020-04-20	-0.535031	1.351420	0.907865	-1.074013
2020-04-21	-0.614734	0.027143	-0.420875	-0.333819

df.sort_index(axis=0, ascending=False)  # 以第0轴（行）来排序，降序

	A	B	C	D
2020-04-21	-0.614734	0.027143	-0.420875	-0.333819
2020-04-20	-0.535031	1.351420	0.907865	-1.074013
2020-04-19	-1.200005	1.109991	-1.527804	1.068642
2020-04-18	-2.032243	2.046827	1.176614	-0.365556
2020-04-17	-0.423923	0.564346	-2.318931	0.080931
2020-04-16	-0.228084	0.712521	0.743378	0.526823

按值进行排序，默认为升序

df.sort_values(by='B')

	A	B	C	D
2020-04-21	-0.614734	0.027143	-0.420875	-0.333819
2020-04-17	-0.423923	0.564346	-2.318931	0.080931
2020-04-16	-0.228084	0.712521	0.743378	0.526823
2020-04-19	-1.200005	1.109991	-1.527804	1.068642
2020-04-20	-0.535031	1.351420	0.907865	-1.074013
2020-04-18	-2.032243	2.046827	1.176614	-0.365556

0x04 选择数据 Selection

注意： 虽然标准的 Python/Numpy 的选择和设置表达式都非常易懂且方便用于交互使用，但是作为工程使用的代码，推荐使用经过优化的 pandas 数据访问方法： .at, .iat, .loc, .iloc 和 .ix。

0x04-1 获取 Getting

选择一个单独的列，这将会返回一个 Series，等效于 df.A

df['A']

2020-04-16   -0.228084
2020-04-17   -0.423923
2020-04-18   -2.032243
2020-04-19   -1.200005
2020-04-20   -0.535031
2020-04-21   -0.614734
Freq: D, Name: A, dtype: float64

通过[]进行选择，可以用来行进行切片(slice)

df[1:3]

	A	B	C	D
2020-04-17	-0.423923	0.564346	-2.318931	0.080931
2020-04-18	-2.032243	2.046827	1.176614	-0.365556

df['20200418':'20200420']

	A	B	C	D
2020-04-18	-2.032243	2.046827	1.176614	-0.365556
2020-04-19	-1.200005	1.109991	-1.527804	1.068642
2020-04-20	-0.535031	1.351420	0.907865	-1.074013

0x4-2 通过标签选择 Selection by Label

利用 pandas 的 .loc, .at 方法

使用标签来获取一个交叉的区域

dates[0]

Timestamp('2020-04-16 00:00:00', freq='D')

df.loc[dates[0]]

A   -0.228084
B    0.712521
C    0.743378
D    0.526823
Name: 2020-04-16 00:00:00, dtype: float64

用标签来在多个轴上进行选择

df.loc[:,['A','B']]

	A	B
2020-04-16	-0.228084	0.712521
2020-04-17	-0.423923	0.564346
2020-04-18	-2.032243	2.046827
2020-04-19	-1.200005	1.109991
2020-04-20	-0.535031	1.351420
2020-04-21	-0.614734	0.027143

利用标签来切片

注意此处的切片两端都包含（左闭右闭）

df.loc['20200418':'20200420',['A','B']]

	A	B
2020-04-18	-2.032243	2.046827
2020-04-19	-1.200005	1.109991
2020-04-20	-0.535031	1.351420

df.loc[dates[2]:,:'C']

	A	B	C
2020-04-18	-2.032243	2.046827	1.176614
2020-04-19	-1.200005	1.109991	-1.527804
2020-04-20	-0.535031	1.351420	0.907865
2020-04-21	-0.614734	0.027143	-0.420875

对于返回的对象进行维度缩减

df.loc['20200419',['A','B']]

A   -1.200005
B    1.109991
Name: 2020-04-19 00:00:00, dtype: float64

获取一个标量

df.loc[dates[2],'A']

-2.0322429408627056

快速访问一个标量（与上一个方法等价）

df.at[dates[2],'A']

-2.0322429408627056

0x04-3 通过位置进行选择 Selection by Position

利用的是 pandas 里的 .iloc, .iat 方法

通过传递整数值进行位置选择

df.iloc[3]

A   -1.200005
B    1.109991
C   -1.527804
D    1.068642
Name: 2020-04-19 00:00:00, dtype: float64

df.iloc[:,2]  # [行，列] 或 [第0维，第1维]

2020-04-16    0.743378
2020-04-17   -2.318931
2020-04-18    1.176614
2020-04-19   -1.527804
2020-04-20    0.907865
2020-04-21   -0.420875
Freq: D, Name: C, dtype: float64

利用数值进行切片，类似于 numpy/python

注意一下，这里的切片是普通的切片，即左闭右开。

df.iloc[3:5,0:2]

	A	B
2020-04-19	-1.200005	1.109991
2020-04-20	-0.535031	1.351420

可以指定一个位置的列表来选择数据区域

df.iloc[[1,2,4],[0,2]]

	A	C
2020-04-17	-0.423923	-2.318931
2020-04-18	-2.032243	1.176614
2020-04-20	-0.535031	0.907865

对行进行切片

df.iloc[1:3,:]

	A	B	C	D
2020-04-17	-0.423923	0.564346	-2.318931	0.080931
2020-04-18	-2.032243	2.046827	1.176614	-0.365556

对列进行切片

df.iloc[:,1:3]

	B	C
2020-04-16	0.712521	0.743378
2020-04-17	0.564346	-2.318931
2020-04-18	2.046827	1.176614
2020-04-19	1.109991	-1.527804
2020-04-20	1.351420	0.907865
2020-04-21	0.027143	-0.420875

利用位置快速得到一个位置对应的值

df.iloc[1,1]

0.5643462917985026

df.iat[1,1]

0.5643462917985026

.ix 用法（Deprecated）

df.ix[] 既可以通过整数索引进行数据选取，也可以通过标签索引进行数据选取。

也就是说，df.ix[] 是 df.loc[] 和 df.iloc[] 的功能集合，且在同义词选取中，可以同时使用整数索引和标签索引。

这个方法在官方的 10 Minutes to pandas 中没有专门说明，不建议用这一方法，请用.loc和.iloc来索引，那这里就提一下好了。

df.ix[1,'A']

D:\Programs\Anaconda\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.





-1.363418018143355

df.ix[dates[1:3],[0,2,3]]

D:\Programs\Anaconda\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.

	A	C	D
2020-04-17	-11.332834	7.659436	-20.925183
2020-04-18	-11.040688	6.867701	-20.157355

从 DeprecationWarning 看来的确已经 deprecated 了，那就不要用了！

0x04-4 布尔索引 Boolean Indexing

使用一个单独列的值来选择数据

df[df.D > 0]

	A	B	C	D
2020-04-16	-0.228084	0.712521	0.743378	0.526823
2020-04-17	-0.423923	0.564346	-2.318931	0.080931
2020-04-19	-1.200005	1.109991	-1.527804	1.068642

利用布尔条件从一个 DataFrame 中选择数据

df>0

	A	B	C	D
2020-04-16	False	True	True	True
2020-04-17	False	True	False	True
2020-04-18	False	True	True	False
2020-04-19	False	True	False	True
2020-04-20	False	True	True	False
2020-04-21	False	True	False	False

df[df > 0]

	A	B	C	D
2020-04-16	NaN	0.712521	0.743378	0.526823
2020-04-17	NaN	0.564346	NaN	0.080931
2020-04-18	NaN	2.046827	1.176614	NaN
2020-04-19	NaN	1.109991	NaN	1.068642
2020-04-20	NaN	1.351420	0.907865	NaN
2020-04-21	NaN	0.027143	NaN	NaN

使用 isin() 方法来过滤数据

(is in)

df3 = df.copy()
df3['E'] = ['one', 'one','two','three','four','three']
df3

	A	B	C	D	E
2020-04-16	-0.228084	0.712521	0.743378	0.526823	one
2020-04-17	-0.423923	0.564346	-2.318931	0.080931	one
2020-04-18	-2.032243	2.046827	1.176614	-0.365556	two
2020-04-19	-1.200005	1.109991	-1.527804	1.068642	three
2020-04-20	-0.535031	1.351420	0.907865	-1.074013	four
2020-04-21	-0.614734	0.027143	-0.420875	-0.333819	three

df3[df3['E'].isin(['two','four'])]

	A	B	C	D	E
2020-04-18	-2.032243	2.046827	1.176614	-0.365556	two
2020-04-20	-0.535031	1.351420	0.907865	-1.074013	four

0x05 设置数据 Setting

设置一个新的列，自动按照索引来对其数据

s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20200416', periods=6))
s1

2020-04-16    1
2020-04-17    2
2020-04-18    3
2020-04-19    4
2020-04-20    5
2020-04-21    6
Freq: D, dtype: int64

df['F'] = s1

通过标签设置新的值

df.at[dates[0],'A'] = 0

通过标签设置新的值

df.iat[0,1] = 0

通过一个 numpy 数组设置一组新值

df.loc[:,'D'] = np.array([5] * len(df))

上述操作结果如下

df

	A	B	C	D	F
2020-04-16	0.000000	0.000000	0.743378	5	1
2020-04-17	-0.423923	0.564346	-2.318931	5	2
2020-04-18	-2.032243	2.046827	1.176614	5	3
2020-04-19	-1.200005	1.109991	-1.527804	5	4
2020-04-20	-0.535031	1.351420	0.907865	5	5
2020-04-21	-0.614734	0.027143	-0.420875	5	6

通过 where 操作来设置新的值

df2 = df.copy()
df2[df2 > 0] = -df2
df2

	A	B	C	D	F
2020-04-16	0.000000	0.000000	-0.743378	-5	-1
2020-04-17	-0.423923	-0.564346	-2.318931	-5	-2
2020-04-18	-2.032243	-2.046827	-1.176614	-5	-3
2020-04-19	-1.200005	-1.109991	-1.527804	-5	-4
2020-04-20	-0.535031	-1.351420	-0.907865	-5	-5
2020-04-21	-0.614734	-0.027143	-0.420875	-5	-6

0x06 缺失值处理 Missing Data

在pandas中，使用np.nan来代替缺失值，这些值将默认不会包含在计算中。

reindex()方法可以对指定轴上的索引进行改变/增加/删除操作，返回原始数据的一个拷贝。

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1

	A	B	C	D	F	E
2020-04-16	0.000000	0.000000	0.743378	5	1	NaN
2020-04-17	-0.423923	0.564346	-2.318931	5	2	NaN
2020-04-18	-2.032243	2.046827	1.176614	5	3	NaN
2020-04-19	-1.200005	1.109991	-1.527804	5	4	NaN

df1.loc[dates[0]:dates[1],'E'] = 1
df1

	A	B	C	D	F	E
2020-04-16	0.000000	0.000000	0.743378	5	1	1.0
2020-04-17	-0.423923	0.564346	-2.318931	5	2	1.0
2020-04-18	-2.032243	2.046827	1.176614	5	3	NaN
2020-04-19	-1.200005	1.109991	-1.527804	5	4	NaN

去掉包含缺失值的行

df1.dropna(how='any')

	A	B	C	D	F	E
2020-04-16	0.000000	0.000000	0.743378	5	1	1.0
2020-04-17	-0.423923	0.564346	-2.318931	5	2	1.0

填充缺失的值，利用value来指定

print(df1)
df1.fillna(value=5)

                   A         B         C  D  F    E
2020-04-16  0.000000  0.000000  0.743378  5  1  1.0
2020-04-17 -0.423923  0.564346 -2.318931  5  2  1.0
2020-04-18 -2.032243  2.046827  1.176614  5  3  NaN
2020-04-19 -1.200005  1.109991 -1.527804  5  4  NaN

	A	B	C	D	F	E
2020-04-16	0.000000	0.000000	0.743378	5	1	1.0
2020-04-17	-0.423923	0.564346	-2.318931	5	2	1.0
2020-04-18	-2.032243	2.046827	1.176614	5	3	5.0
2020-04-19	-1.200005	1.109991	-1.527804	5	4	5.0

获取值为nan的布尔值(boolean mask)

pd.isna(df1)

	A	B	C	D	F	E
2020-04-16	False	False	False	False	False	False
2020-04-17	False	False	False	False	False	False
2020-04-18	False	False	False	False	False	True
2020-04-19	False	False	False	False	False	True

0x07 一些常用的操作 Operations

0x07-1 统计 Stats

这些操作不包括缺失数据

执行描述性统计：

df.mean()  # 均值，默认为列的

A   -0.800989
B    0.849955
C   -0.239959
D    5.000000
F    3.500000
dtype: float64

df.mean(1)  # 指定坐标轴（此处为行）

2020-04-16    1.348676
2020-04-17    0.964299
2020-04-18    1.838240
2020-04-19    1.476436
2020-04-20    2.344851
2020-04-21    1.998307
Freq: D, dtype: float64

对于拥有不同维度，需要对齐对象再进行操作。

Pandas 会自动的沿着指定的维度进行广播。

s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
s

2020-04-16    NaN
2020-04-17    NaN
2020-04-18    1.0
2020-04-19    3.0
2020-04-20    5.0
2020-04-21    NaN
Freq: D, dtype: float64

print(df)
df.sub(s, axis='index')  # df-s

                   A         B         C  D  F
2020-04-16  0.000000  0.000000  0.743378  5  1
2020-04-17 -0.423923  0.564346 -2.318931  5  2
2020-04-18 -2.032243  2.046827  1.176614  5  3
2020-04-19 -1.200005  1.109991 -1.527804  5  4
2020-04-20 -0.535031  1.351420  0.907865  5  5
2020-04-21 -0.614734  0.027143 -0.420875  5  6

	A	B	C	D	F
2020-04-16	NaN	NaN	NaN	NaN	NaN
2020-04-17	NaN	NaN	NaN	NaN	NaN
2020-04-18	-3.032243	1.046827	0.176614	4.0	2.0
2020-04-19	-4.200005	-1.890009	-4.527804	2.0	1.0
2020-04-20	-5.535031	-3.648580	-4.092135	0.0	0.0
2020-04-21	NaN	NaN	NaN	NaN	NaN

0x07-2 应用方法到数据 Apply

Applying functions to the data

df.apply(np.cumsum)  # cumsum 默认按照行累加

	A	B	C	D	F
2020-04-16	0.000000	0.000000	0.743378	5	1
2020-04-17	-0.423923	0.564346	-1.575552	10	3
2020-04-18	-2.456166	2.611174	-0.398938	15	6
2020-04-19	-3.656170	3.721165	-1.926743	20	10
2020-04-20	-4.191201	5.072585	-1.018877	25	15
2020-04-21	-4.805935	5.099728	-1.439752	30	21

df.apply(np.cumsum,axis=1)  # 指定 cumsum 的 axis 参数，按照列累加

	A	B	C	D	F
2020-04-16	0.000000	0.000000	0.743378	5.743378	6.743378
2020-04-17	-0.423923	0.140424	-2.178507	2.821493	4.821493
2020-04-18	-2.032243	0.014584	1.191198	6.191198	9.191198
2020-04-19	-1.200005	-0.090013	-1.617818	3.382182	7.382182
2020-04-20	-0.535031	0.816389	1.724255	6.724255	11.724255
2020-04-21	-0.614734	-0.587591	-1.008466	3.991534	9.991534

df.apply(lambda x: x.max() - x.min())

A    2.032243
B    2.046827
C    3.495545
D    0.000000
F    5.000000
dtype: float64

0x07-3 直方图 Histogramming

s = pd.Series(np.random.randint(0, 7, size=10))
s

0    0
1    4
2    5
3    0
4    3
5    2
6    2
7    0
8    5
9    1
dtype: int32

s.value_counts()

0    3
5    2
2    2
4    1
3    1
1    1
dtype: int64

0x07-4 字符串方法 String Methods

Series 在str属性中准备了一组字符串处理方法，可以方便地对数组的每个元素进行操作，如下面的代码片段所示。

注意str中的模式匹配通常默认使用’正则表达式（在某些情况下总是使用它们）。

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

0x08 合并 Merge

pandas 提供了一系列的方法，在 join/merge 操作中，可以轻松地对Series, DataFrame和Panel对象进行各种集合逻辑以及关系代数的操作。

0x08-1 Concat

df = pd.DataFrame(np.random.randn(10, 4))
df

	0	1	2	3
0	-0.089143	0.087569	0.218078	-0.336815
1	-0.694835	0.108324	-1.882762	-0.220210
2	-2.069784	-0.788424	-1.694595	0.569263
3	2.433036	0.719851	-0.109935	0.177864
4	-0.838252	1.117676	0.880839	-1.118529
5	1.706060	0.290877	-0.907941	-0.532450
6	0.118169	-0.496975	-0.190422	0.783875
7	-1.213645	-0.489622	-0.798443	-0.471611
8	-0.314598	-2.658906	-1.139778	0.374437
9	0.229281	-1.372309	0.996916	-1.013462

# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pieces

[          0         1         2         3
 0 -0.089143  0.087569  0.218078 -0.336815
 1 -0.694835  0.108324 -1.882762 -0.220210
 2 -2.069784 -0.788424 -1.694595  0.569263,
           0         1         2         3
 3  2.433036  0.719851 -0.109935  0.177864
 4 -0.838252  1.117676  0.880839 -1.118529
 5  1.706060  0.290877 -0.907941 -0.532450
 6  0.118169 -0.496975 -0.190422  0.783875,
           0         1         2         3
 7 -1.213645 -0.489622 -0.798443 -0.471611
 8 -0.314598 -2.658906 -1.139778  0.374437
 9  0.229281 -1.372309  0.996916 -1.013462]

pd.concat(pieces)

	0	1	2	3
0	-0.089143	0.087569	0.218078	-0.336815
1	-0.694835	0.108324	-1.882762	-0.220210
2	-2.069784	-0.788424	-1.694595	0.569263
3	2.433036	0.719851	-0.109935	0.177864
4	-0.838252	1.117676	0.880839	-1.118529
5	1.706060	0.290877	-0.907941	-0.532450
6	0.118169	-0.496975	-0.190422	0.783875
7	-1.213645	-0.489622	-0.798443	-0.471611
8	-0.314598	-2.658906	-1.139778	0.374437
9	0.229281	-1.372309	0.996916	-1.013462

0x08-2 Join

类似于 SQL 类型的合并。

下面是 key 相同的情况。

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
left

	key	lval
0	foo	1
1	foo	2

right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
right

	key	rval
0	foo	4
1	foo	5

pd.merge(left, right, on='key')

	key	lval	rval
0	foo	1	4
1	foo	1	5
2	foo	2	4
3	foo	2	5

另一个例子，key 不同的情况。

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
left

	key	lval
0	foo	1
1	bar	2

right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
right

	key	rval
0	foo	4
1	bar	5

pd.merge(left, right, on='key')

	key	lval	rval
0	foo	1	4
1	bar	2	5

0x08-3 Append

将一行加入到一个 DataFrame 上

df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df

	A	B	C	D
0	-2.727829	-1.382168	0.961920	0.083472
1	-0.297131	-0.294723	0.303267	-0.539177
2	-0.226735	-2.157170	1.559418	0.102023
3	0.569927	-0.800855	0.918515	1.630873
4	0.402999	-2.238052	2.181635	1.950669
5	1.238971	-0.165159	0.688200	0.793499
6	1.356265	-0.034859	-0.020774	0.485487
7	-1.793398	0.900625	0.365789	-2.089145

s = df.iloc[3]
df.append(s, ignore_index=True)  # 忽略 index

	A	B	C	D
0	-2.727829	-1.382168	0.961920	0.083472
1	-0.297131	-0.294723	0.303267	-0.539177
2	-0.226735	-2.157170	1.559418	0.102023
3	0.569927	-0.800855	0.918515	1.630873
4	0.402999	-2.238052	2.181635	1.950669
5	1.238971	-0.165159	0.688200	0.793499
6	1.356265	-0.034859	-0.020774	0.485487
7	-1.793398	0.900625	0.365789	-2.089145
8	0.569927	-0.800855	0.918515	1.630873

s = df.iloc[4]
df.append(s, ignore_index=False)

	A	B	C	D
0	-2.727829	-1.382168	0.961920	0.083472
1	-0.297131	-0.294723	0.303267	-0.539177
2	-0.226735	-2.157170	1.559418	0.102023
3	0.569927	-0.800855	0.918515	1.630873
4	0.402999	-2.238052	2.181635	1.950669
5	1.238971	-0.165159	0.688200	0.793499
6	1.356265	-0.034859	-0.020774	0.485487
7	-1.793398	0.900625	0.365789	-2.089145
4	0.402999	-2.238052	2.181635	1.950669

0x09 分组 Grouping

对于分组(group by)，通常有下面几个操作步骤：

Splitting 按照一些规则将数据分为不同的组
Applying 对于每组数据分别执行一个函数
Combining 将结果组合到一个数据结构中

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                  'B' : ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                  'C' : np.random.randn(8),
                  'D' : np.random.randn(8)})
df

	A	B	C	D
0	foo	one	1.038586	2.040014
1	bar	one	0.454452	-0.553457
2	foo	two	-0.730273	-0.207136
3	bar	three	0.435136	0.443462
4	foo	two	2.557988	-1.831259
5	bar	two	1.131084	0.915332
6	foo	one	1.078171	0.037355
7	foo	three	2.057203	1.209778

分组并对得到的分组执行sum函数。

df.groupby('A').sum()

	C	D
A
bar	-1.001520	1.880284
foo	1.365107	1.459170

通过多个列进行分组形成一个层次(hierarchical)索引，然后执行函数。

df.groupby(['A','B']).sum()

		C	D
A	B
bar	one	0.598883	1.667737
	three	-0.605689	-0.531403
	two	-0.994715	0.743951
foo	one	-0.090250	0.694644
	three	0.738067	1.498865
	two	0.717290	-0.734339

0x0A 重塑/改变形状 Reshaping

0x0A-1 Stack

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                      'foo', 'foo', 'qux', 'qux'],
                     ['one', 'two', 'one', 'two',
                      'one', 'two', 'one', 'two']]))
tuples

[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df

		A	B
first	second
bar	one	1.402402	-0.215913
bar	two	-0.305333	-0.679614
baz	one	0.745205	0.872718
baz	two	0.421718	-0.662528
foo	one	-1.433628	0.627898
foo	two	2.129267	-0.260206
qux	one	0.498752	-0.622814
qux	two	-0.356492	1.364184

df2 = df[:4]
df2

		A	B
first	second
bar	one	1.402402	-0.215913
bar	two	-0.305333	-0.679614
baz	one	0.745205	0.872718
baz	two	0.421718	-0.662528

stack()方法在 DataFrame 列的层次上进行压缩

stacked = df2.stack()
stacked

first  second   
bar    one     A    1.402402
               B   -0.215913
       two     A   -0.305333
               B   -0.679614
baz    one     A    0.745205
               B    0.872718
       two     A    0.421718
               B   -0.662528
dtype: float64

对于一个经过了 stack 的 DataFrame 或者 Series（以 MultiIndex 作为索引），stack()的逆运算是isunstack()，默认情况下是 unstack 最后一层。

就是说，bar、baz是第0级，one、two是第1级，A、B是第2级。

stacked.unstack()

		A	B
first	second
bar	one	1.402402	-0.215913
bar	two	-0.305333	-0.679614
baz	one	0.745205	0.872718
baz	two	0.421718	-0.662528

stacked.unstack(1)

	second	one	two
first
bar	A	1.402402	-0.305333
bar	B	-0.215913	-0.679614
baz	A	0.745205	0.421718
baz	B	0.872718	-0.662528

stacked.unstack(0)

	first	bar	baz
second
one	A	1.402402	0.745205
one	B	-0.215913	0.872718
two	A	-0.305333	0.421718
two	B	-0.679614	-0.662528

0x0A-2 数据透视表 Pivot Tables

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df

	A	B	C	D	E
0	one	A	foo	0.072696	0.305015
1	one	B	foo	1.428383	-1.108178
2	two	C	foo	1.239483	0.001580
3	three	A	bar	-0.953547	1.061740
4	one	B	bar	0.762730	1.161542
5	one	C	bar	-0.372740	-0.480251
6	two	A	foo	0.765181	-1.034218
7	three	B	foo	-1.858334	0.359220
8	one	C	foo	-0.952807	1.067494
9	one	A	bar	0.541013	-0.280833
10	two	B	bar	-1.810560	-0.829793
11	three	C	bar	-1.049108	-0.295998

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

	C	bar	foo
A	B
one	A	0.541013	0.072696
	B	0.762730	1.428383
	C	-0.372740	-0.952807
three	A	-0.953547	NaN
	B	NaN	-1.858334
	C	-1.049108	NaN
two	A	NaN	0.765181
	B	-1.810560	NaN
	C	NaN	1.239483

0x0B 时间序列 Time Series

Pandas 在对频率转换进行重新采样时拥有简单、强大且高效的功能（如将按秒采样的数据转换为按5分钟为单位进行采样的数据）。这种操作在金融领域非常常见，但不局限于此。

哇，原来还可以这样玩唉（

这个在信号处理里面也很常用的呢！

rng = pd.date_range('1/1/2020', periods=200, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample('1Min').sum()

2020-01-01 00:00:00    13244
2020-01-01 00:01:00    14799
2020-01-01 00:02:00    15489
2020-01-01 00:03:00     3894
Freq: T, dtype: int32

时区表示

rng = pd.date_range('4/16/2020 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts

2020-04-16    0.872043
2020-04-17   -0.305200
2020-04-18    1.681828
2020-04-19    0.163184
2020-04-20    1.415422
Freq: D, dtype: float64

ts_utc = ts.tz_localize('UTC')
ts_utc

2020-04-16 00:00:00+00:00    0.872043
2020-04-17 00:00:00+00:00   -0.305200
2020-04-18 00:00:00+00:00    1.681828
2020-04-19 00:00:00+00:00    0.163184
2020-04-20 00:00:00+00:00    1.415422
Freq: D, dtype: float64

转换到另外一个时区

ts_utc.tz_convert('Asia/Shanghai')

2020-04-16 08:00:00+08:00    0.872043
2020-04-17 08:00:00+08:00   -0.305200
2020-04-18 08:00:00+08:00    1.681828
2020-04-19 08:00:00+08:00    0.163184
2020-04-20 08:00:00+08:00    1.415422
Freq: D, dtype: float64

时间跨度转换

rng = pd.date_range('1/1/2020', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts

2020-01-31    0.259847
2020-02-29   -0.092969
2020-03-31    0.525985
2020-04-30    0.090362
2020-05-31    1.707569
Freq: M, dtype: float64

ps = ts.to_period()
ps

2020-01    0.259847
2020-02   -0.092969
2020-03    0.525985
2020-04    0.090362
2020-05    1.707569
Freq: M, dtype: float64

ps.to_timestamp()

2020-01-01    0.259847
2020-02-01   -0.092969
2020-03-01    0.525985
2020-04-01    0.090362
2020-05-01    1.707569
Freq: MS, dtype: float64

时期和时间戳之间的转换使得可以使用一些方便的算术函数。

下面这个例子，转换一个每年以11月结束的季度频率为季度结束后的月末的上午9点。（有点懵

prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
ts.head()

1990-03-01 09:00   -1.287411
1990-06-01 09:00    1.744485
1990-09-01 09:00    0.336561
1990-12-01 09:00   -1.206576
1991-03-01 09:00    0.108631
Freq: H, dtype: float64

0x0C 类别 Categoricals

pandas 可以在 DataFrame 中支持 category 类型的数据。

df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})

将 raw grades 转换为 category 数据类型。

df["grade"] = df["raw_grade"].astype("category")
df["grade"]

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

将 category 类型数据重命名为更有意义的名称，利用Series.cat.categories方法。

df["grade"].cat.categories = ["very good", "good", "very bad"]
df["grade"]

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (3, object): [very good, good, very bad]

对类别进行重新排序，同时增加缺失的类别。

Series.cat下的方法默认返回一个新的序列。

df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

排序是按照 category 的顺序进行的，而不是按照词汇顺序(lexical order)。

df.sort_values(by="grade")

	id	raw_grade	grade
5	6	e	very bad
1	2	b	good
2	3	b	good
0	1	a	very good
3	4	a	very good
4	5	a	very good

对一个 Category 类型的列进行分组时显示了空的类别。

df.groupby("grade").size()

grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

0x0D 作图 Plotting

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2020', periods=1000))
ts = ts.cumsum()
ts.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x250ccebcb70>

png

对于 DataFrame 而言，plot()方法是一种将所有带有标签的列进行绘制的简便方法。

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])
# print(df)
df = df.cumsum()
plt.figure()
df.plot()
plt.legend(loc='best')

<matplotlib.legend.Legend at 0x250ce7620b8>




<Figure size 432x288 with 0 Axes>

png

ts.index

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10',
               ...
               '2022-09-17', '2022-09-18', '2022-09-19', '2022-09-20',
               '2022-09-21', '2022-09-22', '2022-09-23', '2022-09-24',
               '2022-09-25', '2022-09-26'],
              dtype='datetime64[ns]', length=1000, freq='D')

0x0E 导入导出数据 Getting Data In/Out

0x0E-1 CSV

写入 csv 文件

df.to_csv('foo.csv')  # 保存在当前目录下

从 csv 文件中导入数据

pd.read_csv('foo.csv')

	Unnamed: 0	A	B	C	D
0	2020-01-01	-1.397481	-0.453722	1.374079	-2.095767
1	2020-01-02	-1.363418	0.529293	0.178497	-2.364989
2	2020-01-03	-2.047267	-0.333020	0.379978	-2.628204
3	2020-01-04	-2.305870	-0.145598	-0.232491	-2.812290
4	2020-01-05	-1.851390	-1.378215	-1.202399	-4.574516
5	2020-01-06	-3.438928	-0.779975	-1.864979	-5.666847
6	2020-01-07	-4.201270	-0.827366	-0.796513	-5.267628
7	2020-01-08	-5.682762	-1.009404	-0.760801	-4.696243
8	2020-01-09	-4.362126	-0.768552	-0.747517	-5.562345
9	2020-01-10	-5.226357	-1.791920	0.290504	-5.214061
10	2020-01-11	-4.429691	-2.026951	-0.821212	-6.621030
11	2020-01-12	-4.222744	-2.875169	-0.654313	-5.926621
12	2020-01-13	-2.413314	-3.329878	-1.595397	-5.585677
13	2020-01-14	-2.769944	-3.536045	-1.463671	-5.098753
14	2020-01-15	-3.599683	-4.973067	-3.859425	-5.823875
15	2020-01-16	-3.486154	-4.792330	-2.866047	-6.312086
16	2020-01-17	-4.232714	-5.082053	-3.572924	-8.478300
17	2020-01-18	-5.346857	-5.787422	-4.313125	-8.595854
18	2020-01-19	-5.720494	-5.620498	-3.699862	-9.678113
19	2020-01-20	-5.264735	-4.938944	-2.354040	-8.942524
20	2020-01-21	-5.407854	-5.379811	-2.284447	-8.068545
21	2020-01-22	-5.793401	-5.501279	-2.254013	-7.997644
22	2020-01-23	-6.796539	-4.692782	-2.330458	-7.449650
23	2020-01-24	-7.502072	-4.085084	-1.810495	-6.540544
24	2020-01-25	-7.639024	-3.791416	1.214301	-7.599166
25	2020-01-26	-7.783544	-2.180918	1.628560	-7.928116
26	2020-01-27	-7.673699	-3.485888	0.640970	-8.124872
27	2020-01-28	-7.435098	-4.480212	-0.409425	-7.943383
28	2020-01-29	-6.843557	-5.069680	-1.650871	-7.311741
29	2020-01-30	-7.647267	-4.115332	-1.558887	-6.308304
...	...	...	...	...	...
970	2022-08-28	17.157914	-23.306973	-12.461653	-2.478731
971	2022-08-29	18.207251	-23.725280	-12.912440	-2.664828
972	2022-08-30	17.968302	-21.485150	-15.044877	-4.034060
973	2022-08-31	18.933869	-21.587237	-16.046081	-4.014128
974	2022-09-01	18.337284	-21.451739	-15.028398	-4.147177
975	2022-09-02	18.783912	-21.096423	-15.506137	-3.858413
976	2022-09-03	18.828378	-19.514168	-16.924864	-2.559397
977	2022-09-04	18.570511	-21.482747	-17.147155	-4.267082
978	2022-09-05	17.212119	-22.331837	-16.496095	-5.990857
979	2022-09-06	15.726498	-23.219830	-19.321023	-5.281962
980	2022-09-07	16.053999	-22.681168	-19.139536	-5.587482
981	2022-09-08	17.335456	-23.320514	-21.271891	-4.048170
982	2022-09-09	16.925567	-24.175782	-22.318101	-4.890968
983	2022-09-10	18.652048	-25.436089	-22.213175	-6.330027
984	2022-09-11	19.535978	-25.673818	-23.064392	-5.673334
985	2022-09-12	19.632355	-26.369119	-22.953900	-3.463010
986	2022-09-13	18.792587	-25.430236	-23.672206	-2.292810
987	2022-09-14	21.200064	-26.299390	-23.567706	-2.753986
988	2022-09-15	22.890536	-27.813985	-23.938164	-3.195772
989	2022-09-16	23.138447	-25.855344	-24.390078	-3.612660
990	2022-09-17	23.163567	-26.157774	-24.309178	-3.155928
991	2022-09-18	20.973996	-27.319631	-22.948785	-4.487101
992	2022-09-19	21.090756	-26.712847	-21.722990	-4.110810
993	2022-09-20	21.848391	-25.018472	-21.850579	-4.360523
994	2022-09-21	22.898217	-26.005364	-20.914269	-5.357187
995	2022-09-22	21.978310	-25.001244	-20.881626	-5.606849
996	2022-09-23	24.132628	-25.692639	-20.310265	-6.258977
997	2022-09-24	23.777794	-25.485527	-21.305882	-5.330284
998	2022-09-25	25.105611	-25.485135	-22.407657	-6.600228
999	2022-09-26	25.207099	-25.512208	-23.434955	-7.275999

1000 rows × 5 columns

0x0E-2 HDF5

写入 HDF5 存储

df.to_hdf('foo.h5','df')

从 HDF5 存储中读取数据

pd.read_hdf('foo.h5','df')

	A	B	C	D
2020-01-01	-1.397481	-0.453722	1.374079	-2.095767
2020-01-02	-1.363418	0.529293	0.178497	-2.364989
2020-01-03	-2.047267	-0.333020	0.379978	-2.628204
2020-01-04	-2.305870	-0.145598	-0.232491	-2.812290
2020-01-05	-1.851390	-1.378215	-1.202399	-4.574516
2020-01-06	-3.438928	-0.779975	-1.864979	-5.666847
2020-01-07	-4.201270	-0.827366	-0.796513	-5.267628
2020-01-08	-5.682762	-1.009404	-0.760801	-4.696243
2020-01-09	-4.362126	-0.768552	-0.747517	-5.562345
2020-01-10	-5.226357	-1.791920	0.290504	-5.214061
2020-01-11	-4.429691	-2.026951	-0.821212	-6.621030
2020-01-12	-4.222744	-2.875169	-0.654313	-5.926621
2020-01-13	-2.413314	-3.329878	-1.595397	-5.585677
2020-01-14	-2.769944	-3.536045	-1.463671	-5.098753
2020-01-15	-3.599683	-4.973067	-3.859425	-5.823875
2020-01-16	-3.486154	-4.792330	-2.866047	-6.312086
2020-01-17	-4.232714	-5.082053	-3.572924	-8.478300
2020-01-18	-5.346857	-5.787422	-4.313125	-8.595854
2020-01-19	-5.720494	-5.620498	-3.699862	-9.678113
2020-01-20	-5.264735	-4.938944	-2.354040	-8.942524
2020-01-21	-5.407854	-5.379811	-2.284447	-8.068545
2020-01-22	-5.793401	-5.501279	-2.254013	-7.997644
2020-01-23	-6.796539	-4.692782	-2.330458	-7.449650
2020-01-24	-7.502072	-4.085084	-1.810495	-6.540544
2020-01-25	-7.639024	-3.791416	1.214301	-7.599166
2020-01-26	-7.783544	-2.180918	1.628560	-7.928116
2020-01-27	-7.673699	-3.485888	0.640970	-8.124872
2020-01-28	-7.435098	-4.480212	-0.409425	-7.943383
2020-01-29	-6.843557	-5.069680	-1.650871	-7.311741
2020-01-30	-7.647267	-4.115332	-1.558887	-6.308304
...	...	...	...	...
2022-08-28	17.157914	-23.306973	-12.461653	-2.478731
2022-08-29	18.207251	-23.725280	-12.912440	-2.664828
2022-08-30	17.968302	-21.485150	-15.044877	-4.034060
2022-08-31	18.933869	-21.587237	-16.046081	-4.014128
2022-09-01	18.337284	-21.451739	-15.028398	-4.147177
2022-09-02	18.783912	-21.096423	-15.506137	-3.858413
2022-09-03	18.828378	-19.514168	-16.924864	-2.559397
2022-09-04	18.570511	-21.482747	-17.147155	-4.267082
2022-09-05	17.212119	-22.331837	-16.496095	-5.990857
2022-09-06	15.726498	-23.219830	-19.321023	-5.281962
2022-09-07	16.053999	-22.681168	-19.139536	-5.587482
2022-09-08	17.335456	-23.320514	-21.271891	-4.048170
2022-09-09	16.925567	-24.175782	-22.318101	-4.890968
2022-09-10	18.652048	-25.436089	-22.213175	-6.330027
2022-09-11	19.535978	-25.673818	-23.064392	-5.673334
2022-09-12	19.632355	-26.369119	-22.953900	-3.463010
2022-09-13	18.792587	-25.430236	-23.672206	-2.292810
2022-09-14	21.200064	-26.299390	-23.567706	-2.753986
2022-09-15	22.890536	-27.813985	-23.938164	-3.195772
2022-09-16	23.138447	-25.855344	-24.390078	-3.612660
2022-09-17	23.163567	-26.157774	-24.309178	-3.155928
2022-09-18	20.973996	-27.319631	-22.948785	-4.487101
2022-09-19	21.090756	-26.712847	-21.722990	-4.110810
2022-09-20	21.848391	-25.018472	-21.850579	-4.360523
2022-09-21	22.898217	-26.005364	-20.914269	-5.357187
2022-09-22	21.978310	-25.001244	-20.881626	-5.606849
2022-09-23	24.132628	-25.692639	-20.310265	-6.258977
2022-09-24	23.777794	-25.485527	-21.305882	-5.330284
2022-09-25	25.105611	-25.485135	-22.407657	-6.600228
2022-09-26	25.207099	-25.512208	-23.434955	-7.275999

1000 rows × 4 columns

0x0E-3 Excel

看网上说，相对于 xlwt、xlrd 之类的库，用 pandas 来对 excel 进行数据处理和分析挺不错的，正好最近也想试试呢。

写入 Excel

df.to_excel('foo.xlsx', sheet_name='Sheet1')

从 excel 中读取数据

pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])

	Unnamed: 0	A	B	C	D
0	2020-01-01	-1.397481	-0.453722	1.374079	-2.095767
1	2020-01-02	-1.363418	0.529293	0.178497	-2.364989
2	2020-01-03	-2.047267	-0.333020	0.379978	-2.628204
3	2020-01-04	-2.305870	-0.145598	-0.232491	-2.812290
4	2020-01-05	-1.851390	-1.378215	-1.202399	-4.574516
5	2020-01-06	-3.438928	-0.779975	-1.864979	-5.666847
6	2020-01-07	-4.201270	-0.827366	-0.796513	-5.267628
7	2020-01-08	-5.682762	-1.009404	-0.760801	-4.696243
8	2020-01-09	-4.362126	-0.768552	-0.747517	-5.562345
9	2020-01-10	-5.226357	-1.791920	0.290504	-5.214061
10	2020-01-11	-4.429691	-2.026951	-0.821212	-6.621030
11	2020-01-12	-4.222744	-2.875169	-0.654313	-5.926621
12	2020-01-13	-2.413314	-3.329878	-1.595397	-5.585677
13	2020-01-14	-2.769944	-3.536045	-1.463671	-5.098753
14	2020-01-15	-3.599683	-4.973067	-3.859425	-5.823875
15	2020-01-16	-3.486154	-4.792330	-2.866047	-6.312086
16	2020-01-17	-4.232714	-5.082053	-3.572924	-8.478300
17	2020-01-18	-5.346857	-5.787422	-4.313125	-8.595854
18	2020-01-19	-5.720494	-5.620498	-3.699862	-9.678113
19	2020-01-20	-5.264735	-4.938944	-2.354040	-8.942524
20	2020-01-21	-5.407854	-5.379811	-2.284447	-8.068545
21	2020-01-22	-5.793401	-5.501279	-2.254013	-7.997644
22	2020-01-23	-6.796539	-4.692782	-2.330458	-7.449650
23	2020-01-24	-7.502072	-4.085084	-1.810495	-6.540544
24	2020-01-25	-7.639024	-3.791416	1.214301	-7.599166
25	2020-01-26	-7.783544	-2.180918	1.628560	-7.928116
26	2020-01-27	-7.673699	-3.485888	0.640970	-8.124872
27	2020-01-28	-7.435098	-4.480212	-0.409425	-7.943383
28	2020-01-29	-6.843557	-5.069680	-1.650871	-7.311741
29	2020-01-30	-7.647267	-4.115332	-1.558887	-6.308304
...	...	...	...	...	...
970	2022-08-28	17.157914	-23.306973	-12.461653	-2.478731
971	2022-08-29	18.207251	-23.725280	-12.912440	-2.664828
972	2022-08-30	17.968302	-21.485150	-15.044877	-4.034060
973	2022-08-31	18.933869	-21.587237	-16.046081	-4.014128
974	2022-09-01	18.337284	-21.451739	-15.028398	-4.147177
975	2022-09-02	18.783912	-21.096423	-15.506137	-3.858413
976	2022-09-03	18.828378	-19.514168	-16.924864	-2.559397
977	2022-09-04	18.570511	-21.482747	-17.147155	-4.267082
978	2022-09-05	17.212119	-22.331837	-16.496095	-5.990857
979	2022-09-06	15.726498	-23.219830	-19.321023	-5.281962
980	2022-09-07	16.053999	-22.681168	-19.139536	-5.587482
981	2022-09-08	17.335456	-23.320514	-21.271891	-4.048170
982	2022-09-09	16.925567	-24.175782	-22.318101	-4.890968
983	2022-09-10	18.652048	-25.436089	-22.213175	-6.330027
984	2022-09-11	19.535978	-25.673818	-23.064392	-5.673334
985	2022-09-12	19.632355	-26.369119	-22.953900	-3.463010
986	2022-09-13	18.792587	-25.430236	-23.672206	-2.292810
987	2022-09-14	21.200064	-26.299390	-23.567706	-2.753986
988	2022-09-15	22.890536	-27.813985	-23.938164	-3.195772
989	2022-09-16	23.138447	-25.855344	-24.390078	-3.612660
990	2022-09-17	23.163567	-26.157774	-24.309178	-3.155928
991	2022-09-18	20.973996	-27.319631	-22.948785	-4.487101
992	2022-09-19	21.090756	-26.712847	-21.722990	-4.110810
993	2022-09-20	21.848391	-25.018472	-21.850579	-4.360523
994	2022-09-21	22.898217	-26.005364	-20.914269	-5.357187
995	2022-09-22	21.978310	-25.001244	-20.881626	-5.606849
996	2022-09-23	24.132628	-25.692639	-20.310265	-6.258977
997	2022-09-24	23.777794	-25.485527	-21.305882	-5.330284
998	2022-09-25	25.105611	-25.485135	-22.407657	-6.600228
999	2022-09-26	25.207099	-25.512208	-23.434955	-7.275999

1000 rows × 5 columns

0x0F 陷阱 Gotchas

如果你在运行中看到例如下面的异常。

if pd.Series([False, True, False]):
    print("I was true")

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-304-5c782b38cd2f> in <module>
----> 1 if pd.Series([False, True, False]):
      2     print("I was true")


D:\Programs\Anaconda\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
   1476         raise ValueError("The truth value of a {0} is ambiguous. "
   1477                          "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
-> 1478                          .format(self.__class__.__name__))
   1479 
   1480     __bool__ = __nonzero__


ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

可以参考 Comparisons 这里来寻求解释和下一步的方案。

也可以参考 Comparisons。

（上面都是官方的文档地址）

（官方文档到这里就结束了）

0x10 小结

关于选取：

选择整行/整列，或多个整行/多个整列的数据，可以用df[]、df.loc[]、df.iloc[]这三种用法。
区域选取，分为标签索引和整数索引
- 标签索引：df.loc[]
- 整数索引：df.iloc[]
选取单元格：df.at[]、df.iat[]、df.loc[]、df.iloc[] 都可以，要注意参数。df[]不行！
关于选取的返回值：
- 返回值包括单行多列或多行单列时，返回值为 Series 对象
- 返回值包括多行多列时，返回值为 DataFrame 对象
- 返回值仅为一个单元格（单行单列）时，返回值为（可能是 numpy 的）基本数据类型，例如 str, float64, int64 等
- df[]因为不能精确到单元格，所以返回值一定 DataFrame 或 Series 对象
当使用 DataFrame 的默认索引（整数索引）时，整数索引即为标签索引。