数据科学 | 十分钟入门pandas/常用方法笔记


pandas

0x00 前言

最近对数据科学有点感兴趣,众所周知,在 python 中用来做数据处理常用的有 numpy、pandas 等模块,它们功能非常强大。

考虑到有项目正好要用到 pandas,这次就来学一下吧。(挺久之前就想学的,一直咕

看到网上推荐了官方的 10 Minutes to pandas,就去看了一下,翻译了一下官方的教程,自己跑了代码,顺便再记了点笔记吧。

下面的内容主要参考的是10 Minutes to pandas然而看了一晚上),运行环境为 Jupyter Notebook

这篇博客也是在从 Jupyter Notebook 里导出的 markdown 基础上完成的,挺方便的呢。(有些地方可能由于博客样式的原因显示会有点问题,凑合着看一看吧。

后面有空再整理一下,放个 Jupyter Notebook 的.ipynb版本吧。

—->详见文末的更新!


0x01 导入模块

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

这些都是常用别名了呢(

0x02 创建对象 Object Creation

传递一个list对象来创建一个Series,pandas会默认创建整型索引(index)。

s = pd.Series([1,3,5,np.nan,6,8])
s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

传递一个numpy array,时间索引以及列标签来创建一个DataFrame

dates = pd.date_range('20200416', periods=6)
dates
DatetimeIndex(['2020-04-16', '2020-04-17', '2020-04-18', '2020-04-19',
               '2020-04-20', '2020-04-21'],
              dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

A B C D
2020-04-16 -0.228084 0.712521 0.743378 0.526823
2020-04-17 -0.423923 0.564346 -2.318931 0.080931
2020-04-18 -2.032243 2.046827 1.176614 -0.365556
2020-04-19 -1.200005 1.109991 -1.527804 1.068642
2020-04-20 -0.535031 1.351420 0.907865 -1.074013
2020-04-21 -0.614734 0.027143 -0.420875 -0.333819

传递一个能够被转换成类似序列结构的字典对象来创建一个DataFrame

df2 = pd.DataFrame({ 'A' : 1.,
 'B' : pd.Timestamp('20200416183800'),
 'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
 'D' : np.array([3] * 4,dtype='int32'),
 'E' : pd.Categorical(["test","train","test","train"]),
 'F' : 'foo' })
df2

A B C D E F
0 1.0 2020-04-16 18:38:00 1.0 3 test foo
1 1.0 2020-04-16 18:38:00 1.0 3 train foo
2 1.0 2020-04-16 18:38:00 1.0 3 test foo
3 1.0 2020-04-16 18:38:00 1.0 3 train foo

查看他们的数据类型(dtypes).

df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

在 IPython 里输入 df2.,按下 TAB 键,会有常用的代码提示。

这里我是在 Jupyter Notebook 中跑,也有类似效果。为了方便展示这里用dir来列出来吧。

dir(df2)
['A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'T',
 ######################...###########
 'to_feather',
 'to_gbq',
 'to_hdf',
 'to_html',
 'to_json',
 'to_latex',
 'to_msgpack',
 'to_numpy',
 'to_panel',
 'to_parquet',
 'to_period',
 'to_pickle',
 'to_records',
 'to_sparse',
 'to_sql',
 'to_stata',
 'to_string',
 'to_timestamp',
 'to_xarray',
 'transform',
 'transpose',
 'truediv',
 'truncate',
 'tshift',
 'tz_convert',
 'tz_localize',
 'unstack',
 'update',
 'values',
 'var',
 'where',
 'xs']

0x03 查看数据 Viewing Data

查看 frame 中头部和尾部的行

df.head()  # 查看前五行数据

A B C D
2020-04-16 -0.228084 0.712521 0.743378 0.526823
2020-04-17 -0.423923 0.564346 -2.318931 0.080931
2020-04-18 -2.032243 2.046827 1.176614 -0.365556
2020-04-19 -1.200005 1.109991 -1.527804 1.068642
2020-04-20 -0.535031 1.351420 0.907865 -1.074013
df.head(2)  # 前两行

A B C D
2020-04-16 -0.228084 0.712521 0.743378 0.526823
2020-04-17 -0.423923 0.564346 -2.318931 0.080931
df.tail(3)  # 倒数三行

A B C D
2020-04-19 -1.200005 1.109991 -1.527804 1.068642
2020-04-20 -0.535031 1.351420 0.907865 -1.074013
2020-04-21 -0.614734 0.027143 -0.420875 -0.333819

显示索引、列和底层的(underlying) numpy 数据

df

A B C D
2020-04-16 -0.228084 0.712521 0.743378 0.526823
2020-04-17 -0.423923 0.564346 -2.318931 0.080931
2020-04-18 -2.032243 2.046827 1.176614 -0.365556
2020-04-19 -1.200005 1.109991 -1.527804 1.068642
2020-04-20 -0.535031 1.351420 0.907865 -1.074013
2020-04-21 -0.614734 0.027143 -0.420875 -0.333819
df.index
DatetimeIndex(['2020-04-16', '2020-04-17', '2020-04-18', '2020-04-19',
               '2020-04-20', '2020-04-21'],
              dtype='datetime64[ns]', freq='D')
df2.index
Int64Index([0, 1, 2, 3], dtype='int64')
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
df.values
array([[-0.22808386,  0.71252123,  0.74337822,  0.52682341],
       [-0.42392275,  0.56434629, -2.31893066,  0.08093058],
       [-2.03224294,  2.04682736,  1.17661398, -0.3655562 ],
       [-1.20000469,  1.10999124, -1.52780407,  1.06864157],
       [-0.53503099,  1.35142044,  0.90786509, -1.07401348],
       [-0.61473412,  0.02714293, -0.42087502, -0.33381946]])
df2.values
array([[1.0, Timestamp('2020-04-16 18:38:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2020-04-16 18:38:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2020-04-16 18:38:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2020-04-16 18:38:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
print(type(df))
print(type(df.columns))
print(type(df.values))
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.indexes.base.Index'>
<class 'numpy.ndarray'>

describe 函数对于数据的快速统计汇总

df.describe()

A B C D
count 6.000000 6.000000 6.000000 6.000000
mean -0.839003 0.968708 -0.239959 -0.016166
std 0.669679 0.699208 1.435586 0.751411
min -2.032243 0.027143 -2.318931 -1.074013
25% -1.053687 0.601390 -1.251072 -0.357622
50% -0.574883 0.911256 0.161252 -0.126444
75% -0.451700 1.291063 0.866743 0.415350
max -0.228084 2.046827 1.176614 1.068642
df2.describe()

A C D
count 4.0 4.0 4.0
mean 1.0 1.0 3.0
std 0.0 0.0 0.0
min 1.0 1.0 3.0
25% 1.0 1.0 3.0
50% 1.0 1.0 3.0
75% 1.0 1.0 3.0
max 1.0 1.0 3.0

对数据进行转置(Transposing)

df.T

2020-04-16 00:00:00 2020-04-17 00:00:00 2020-04-18 00:00:00 2020-04-19 00:00:00 2020-04-20 00:00:00 2020-04-21 00:00:00
A -0.228084 -0.423923 -2.032243 -1.200005 -0.535031 -0.614734
B 0.712521 0.564346 2.046827 1.109991 1.351420 0.027143
C 0.743378 -2.318931 1.176614 -1.527804 0.907865 -0.420875
D 0.526823 0.080931 -0.365556 1.068642 -1.074013 -0.333819
df2.T

0 1 2 3
A 1 1 1 1
B 2020-04-16 18:38:00 2020-04-16 18:38:00 2020-04-16 18:38:00 2020-04-16 18:38:00
C 1 1 1 1
D 3 3 3 3
E test train test train
F foo foo foo foo

按照某一坐标轴进行排序

df.sort_index(axis=1, ascending=False)  # 降序

D C B A
2020-04-16 0.526823 0.743378 0.712521 -0.228084
2020-04-17 0.080931 -2.318931 0.564346 -0.423923
2020-04-18 -0.365556 1.176614 2.046827 -2.032243
2020-04-19 1.068642 -1.527804 1.109991 -1.200005
2020-04-20 -1.074013 0.907865 1.351420 -0.535031
2020-04-21 -0.333819 -0.420875 0.027143 -0.614734
df.sort_index(axis=1, ascending=True)  # 升序(默认)

A B C D
2020-04-16 -0.228084 0.712521 0.743378 0.526823
2020-04-17 -0.423923 0.564346 -2.318931 0.080931
2020-04-18 -2.032243 2.046827 1.176614 -0.365556
2020-04-19 -1.200005 1.109991 -1.527804 1.068642
2020-04-20 -0.535031 1.351420 0.907865 -1.074013
2020-04-21 -0.614734 0.027143 -0.420875 -0.333819
df.sort_index(axis=0, ascending=False)  # 以第0轴(行)来排序,降序

A B C D
2020-04-21 -0.614734 0.027143 -0.420875 -0.333819
2020-04-20 -0.535031 1.351420 0.907865 -1.074013
2020-04-19 -1.200005 1.109991 -1.527804 1.068642
2020-04-18 -2.032243 2.046827 1.176614 -0.365556
2020-04-17 -0.423923 0.564346 -2.318931 0.080931
2020-04-16 -0.228084 0.712521 0.743378 0.526823

按值进行排序,默认为升序

df.sort_values(by='B')

A B C D
2020-04-21 -0.614734 0.027143 -0.420875 -0.333819
2020-04-17 -0.423923 0.564346 -2.318931 0.080931
2020-04-16 -0.228084 0.712521 0.743378 0.526823
2020-04-19 -1.200005 1.109991 -1.527804 1.068642
2020-04-20 -0.535031 1.351420 0.907865 -1.074013
2020-04-18 -2.032243 2.046827 1.176614 -0.365556

0x04 选择数据 Selection

注意: 虽然标准的 Python/Numpy 的选择和设置表达式都非常易懂且方便用于交互使用,但是作为工程使用的代码,推荐使用经过优化的 pandas 数据访问方法: .at, .iat, .loc, .iloc.ix

0x04-1 获取 Getting

选择一个单独的列,这将会返回一个 Series,等效于 df.A

df['A']
2020-04-16   -0.228084
2020-04-17   -0.423923
2020-04-18   -2.032243
2020-04-19   -1.200005
2020-04-20   -0.535031
2020-04-21   -0.614734
Freq: D, Name: A, dtype: float64

通过[]进行选择,可以用来行进行切片(slice)

df[1:3]

A B C D
2020-04-17 -0.423923 0.564346 -2.318931 0.080931
2020-04-18 -2.032243 2.046827 1.176614 -0.365556
df['20200418':'20200420']

A B C D
2020-04-18 -2.032243 2.046827 1.176614 -0.365556
2020-04-19 -1.200005 1.109991 -1.527804 1.068642
2020-04-20 -0.535031 1.351420 0.907865 -1.074013

0x4-2 通过标签选择 Selection by Label

利用 pandas 的 .loc, .at 方法

使用标签来获取一个交叉的区域

dates[0]
Timestamp('2020-04-16 00:00:00', freq='D')
df.loc[dates[0]]
A   -0.228084
B    0.712521
C    0.743378
D    0.526823
Name: 2020-04-16 00:00:00, dtype: float64

用标签来在多个轴上进行选择

df.loc[:,['A','B']]

A B
2020-04-16 -0.228084 0.712521
2020-04-17 -0.423923 0.564346
2020-04-18 -2.032243 2.046827
2020-04-19 -1.200005 1.109991
2020-04-20 -0.535031 1.351420
2020-04-21 -0.614734 0.027143

利用标签来切片

注意此处的切片两端都包含(左闭右闭)

df.loc['20200418':'20200420',['A','B']]

A B
2020-04-18 -2.032243 2.046827
2020-04-19 -1.200005 1.109991
2020-04-20 -0.535031 1.351420
df.loc[dates[2]:,:'C']

A B C
2020-04-18 -2.032243 2.046827 1.176614
2020-04-19 -1.200005 1.109991 -1.527804
2020-04-20 -0.535031 1.351420 0.907865
2020-04-21 -0.614734 0.027143 -0.420875

对于返回的对象进行维度缩减

df.loc['20200419',['A','B']]
A   -1.200005
B    1.109991
Name: 2020-04-19 00:00:00, dtype: float64

获取一个标量

df.loc[dates[2],'A']
-2.0322429408627056

快速访问一个标量(与上一个方法等价)

df.at[dates[2],'A']
-2.0322429408627056

0x04-3 通过位置进行选择 Selection by Position

利用的是 pandas 里的 .iloc, .iat 方法

通过传递整数值进行位置选择

df.iloc[3] 
A   -1.200005
B    1.109991
C   -1.527804
D    1.068642
Name: 2020-04-19 00:00:00, dtype: float64
df.iloc[:,2]  # [行,列] 或 [第0维,第1维]
2020-04-16    0.743378
2020-04-17   -2.318931
2020-04-18    1.176614
2020-04-19   -1.527804
2020-04-20    0.907865
2020-04-21   -0.420875
Freq: D, Name: C, dtype: float64

利用数值进行切片,类似于 numpy/python

注意一下,这里的切片是普通的切片,即左闭右开。

df.iloc[3:5,0:2]

A B
2020-04-19 -1.200005 1.109991
2020-04-20 -0.535031 1.351420

可以指定一个位置的列表来选择数据区域

df.iloc[[1,2,4],[0,2]]

A C
2020-04-17 -0.423923 -2.318931
2020-04-18 -2.032243 1.176614
2020-04-20 -0.535031 0.907865

对行进行切片

df.iloc[1:3,:]

A B C D
2020-04-17 -0.423923 0.564346 -2.318931 0.080931
2020-04-18 -2.032243 2.046827 1.176614 -0.365556

对列进行切片

df.iloc[:,1:3]

B C
2020-04-16 0.712521 0.743378
2020-04-17 0.564346 -2.318931
2020-04-18 2.046827 1.176614
2020-04-19 1.109991 -1.527804
2020-04-20 1.351420 0.907865
2020-04-21 0.027143 -0.420875

利用位置快速得到一个位置对应的值

df.iloc[1,1]
0.5643462917985026
df.iat[1,1]
0.5643462917985026

.ix 用法(Deprecated)

df.ix[] 既可以通过整数索引进行数据选取,也可以通过标签索引进行数据选取。

也就是说,df.ix[]df.loc[]df.iloc[] 的功能集合,且在同义词选取中,可以同时使用整数索引和标签索引。

这个方法在官方的 10 Minutes to pandas 中没有专门说明,不建议用这一方法,请用.loc.iloc来索引,那这里就提一下好了。

df.ix[1,'A']
D:\Programs\Anaconda\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.





-1.363418018143355
df.ix[dates[1:3],[0,2,3]]
D:\Programs\Anaconda\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: 
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  """Entry point for launching an IPython kernel.

A C D
2020-04-17 -11.332834 7.659436 -20.925183
2020-04-18 -11.040688 6.867701 -20.157355

从 DeprecationWarning 看来的确已经 deprecated 了,那就不要用了!

0x04-4 布尔索引 Boolean Indexing

使用一个单独列的值来选择数据

df[df.D > 0]

A B C D
2020-04-16 -0.228084 0.712521 0.743378 0.526823
2020-04-17 -0.423923 0.564346 -2.318931 0.080931
2020-04-19 -1.200005 1.109991 -1.527804 1.068642

利用布尔条件从一个 DataFrame 中选择数据

df>0

A B C D
2020-04-16 False True True True
2020-04-17 False True False True
2020-04-18 False True True False
2020-04-19 False True False True
2020-04-20 False True True False
2020-04-21 False True False False
df[df > 0]

A B C D
2020-04-16 NaN 0.712521 0.743378 0.526823
2020-04-17 NaN 0.564346 NaN 0.080931
2020-04-18 NaN 2.046827 1.176614 NaN
2020-04-19 NaN 1.109991 NaN 1.068642
2020-04-20 NaN 1.351420 0.907865 NaN
2020-04-21 NaN 0.027143 NaN NaN

使用 isin() 方法来过滤数据

(is in)

df3 = df.copy()
df3['E'] = ['one', 'one','two','three','four','three']
df3

A B C D E
2020-04-16 -0.228084 0.712521 0.743378 0.526823 one
2020-04-17 -0.423923 0.564346 -2.318931 0.080931 one
2020-04-18 -2.032243 2.046827 1.176614 -0.365556 two
2020-04-19 -1.200005 1.109991 -1.527804 1.068642 three
2020-04-20 -0.535031 1.351420 0.907865 -1.074013 four
2020-04-21 -0.614734 0.027143 -0.420875 -0.333819 three
df3[df3['E'].isin(['two','four'])]

A B C D E
2020-04-18 -2.032243 2.046827 1.176614 -0.365556 two
2020-04-20 -0.535031 1.351420 0.907865 -1.074013 four

0x05 设置数据 Setting

设置一个新的列,自动按照索引来对其数据

s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20200416', periods=6))
s1
2020-04-16    1
2020-04-17    2
2020-04-18    3
2020-04-19    4
2020-04-20    5
2020-04-21    6
Freq: D, dtype: int64
df['F'] = s1

通过标签设置新的值

df.at[dates[0],'A'] = 0

通过标签设置新的值

df.iat[0,1] = 0

通过一个 numpy 数组设置一组新值

df.loc[:,'D'] = np.array([5] * len(df))

上述操作结果如下

df

A B C D F
2020-04-16 0.000000 0.000000 0.743378 5 1
2020-04-17 -0.423923 0.564346 -2.318931 5 2
2020-04-18 -2.032243 2.046827 1.176614 5 3
2020-04-19 -1.200005 1.109991 -1.527804 5 4
2020-04-20 -0.535031 1.351420 0.907865 5 5
2020-04-21 -0.614734 0.027143 -0.420875 5 6

通过 where 操作来设置新的值

df2 = df.copy()
df2[df2 > 0] = -df2
df2

A B C D F
2020-04-16 0.000000 0.000000 -0.743378 -5 -1
2020-04-17 -0.423923 -0.564346 -2.318931 -5 -2
2020-04-18 -2.032243 -2.046827 -1.176614 -5 -3
2020-04-19 -1.200005 -1.109991 -1.527804 -5 -4
2020-04-20 -0.535031 -1.351420 -0.907865 -5 -5
2020-04-21 -0.614734 -0.027143 -0.420875 -5 -6

0x06 缺失值处理 Missing Data

在pandas中,使用np.nan来代替缺失值,这些值将默认不会包含在计算中。

reindex()方法可以对指定轴上的索引进行改变/增加/删除操作,返回原始数据的一个拷贝。

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1

A B C D F E
2020-04-16 0.000000 0.000000 0.743378 5 1 NaN
2020-04-17 -0.423923 0.564346 -2.318931 5 2 NaN
2020-04-18 -2.032243 2.046827 1.176614 5 3 NaN
2020-04-19 -1.200005 1.109991 -1.527804 5 4 NaN
df1.loc[dates[0]:dates[1],'E'] = 1
df1

A B C D F E
2020-04-16 0.000000 0.000000 0.743378 5 1 1.0
2020-04-17 -0.423923 0.564346 -2.318931 5 2 1.0
2020-04-18 -2.032243 2.046827 1.176614 5 3 NaN
2020-04-19 -1.200005 1.109991 -1.527804 5 4 NaN

去掉包含缺失值的行

df1.dropna(how='any')

A B C D F E
2020-04-16 0.000000 0.000000 0.743378 5 1 1.0
2020-04-17 -0.423923 0.564346 -2.318931 5 2 1.0

填充缺失的值,利用value来指定

print(df1)
df1.fillna(value=5)
                   A         B         C  D  F    E
2020-04-16  0.000000  0.000000  0.743378  5  1  1.0
2020-04-17 -0.423923  0.564346 -2.318931  5  2  1.0
2020-04-18 -2.032243  2.046827  1.176614  5  3  NaN
2020-04-19 -1.200005  1.109991 -1.527804  5  4  NaN

A B C D F E
2020-04-16 0.000000 0.000000 0.743378 5 1 1.0
2020-04-17 -0.423923 0.564346 -2.318931 5 2 1.0
2020-04-18 -2.032243 2.046827 1.176614 5 3 5.0
2020-04-19 -1.200005 1.109991 -1.527804 5 4 5.0

获取值为nan的布尔值(boolean mask)

pd.isna(df1)

A B C D F E
2020-04-16 False False False False False False
2020-04-17 False False False False False False
2020-04-18 False False False False False True
2020-04-19 False False False False False True

0x07 一些常用的操作 Operations

0x07-1 统计 Stats

这些操作不包括缺失数据

执行描述性统计:

df.mean()  # 均值,默认为列的
A   -0.800989
B    0.849955
C   -0.239959
D    5.000000
F    3.500000
dtype: float64
df.mean(1)  # 指定坐标轴(此处为行)
2020-04-16    1.348676
2020-04-17    0.964299
2020-04-18    1.838240
2020-04-19    1.476436
2020-04-20    2.344851
2020-04-21    1.998307
Freq: D, dtype: float64

对于拥有不同维度,需要对齐对象再进行操作。

Pandas 会自动的沿着指定的维度进行广播。

s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)
s
2020-04-16    NaN
2020-04-17    NaN
2020-04-18    1.0
2020-04-19    3.0
2020-04-20    5.0
2020-04-21    NaN
Freq: D, dtype: float64
print(df)
df.sub(s, axis='index')  # df-s
                   A         B         C  D  F
2020-04-16  0.000000  0.000000  0.743378  5  1
2020-04-17 -0.423923  0.564346 -2.318931  5  2
2020-04-18 -2.032243  2.046827  1.176614  5  3
2020-04-19 -1.200005  1.109991 -1.527804  5  4
2020-04-20 -0.535031  1.351420  0.907865  5  5
2020-04-21 -0.614734  0.027143 -0.420875  5  6

A B C D F
2020-04-16 NaN NaN NaN NaN NaN
2020-04-17 NaN NaN NaN NaN NaN
2020-04-18 -3.032243 1.046827 0.176614 4.0 2.0
2020-04-19 -4.200005 -1.890009 -4.527804 2.0 1.0
2020-04-20 -5.535031 -3.648580 -4.092135 0.0 0.0
2020-04-21 NaN NaN NaN NaN NaN

0x07-2 应用方法到数据 Apply

Applying functions to the data

df.apply(np.cumsum)  # cumsum 默认按照行累加

A B C D F
2020-04-16 0.000000 0.000000 0.743378 5 1
2020-04-17 -0.423923 0.564346 -1.575552 10 3
2020-04-18 -2.456166 2.611174 -0.398938 15 6
2020-04-19 -3.656170 3.721165 -1.926743 20 10
2020-04-20 -4.191201 5.072585 -1.018877 25 15
2020-04-21 -4.805935 5.099728 -1.439752 30 21
df.apply(np.cumsum,axis=1)  # 指定 cumsum 的 axis 参数,按照列累加

A B C D F
2020-04-16 0.000000 0.000000 0.743378 5.743378 6.743378
2020-04-17 -0.423923 0.140424 -2.178507 2.821493 4.821493
2020-04-18 -2.032243 0.014584 1.191198 6.191198 9.191198
2020-04-19 -1.200005 -0.090013 -1.617818 3.382182 7.382182
2020-04-20 -0.535031 0.816389 1.724255 6.724255 11.724255
2020-04-21 -0.614734 -0.587591 -1.008466 3.991534 9.991534
df.apply(lambda x: x.max() - x.min())
A    2.032243
B    2.046827
C    3.495545
D    0.000000
F    5.000000
dtype: float64

0x07-3 直方图 Histogramming

s = pd.Series(np.random.randint(0, 7, size=10))
s
0    0
1    4
2    5
3    0
4    3
5    2
6    2
7    0
8    5
9    1
dtype: int32
s.value_counts()
0    3
5    2
2    2
4    1
3    1
1    1
dtype: int64

0x07-4 字符串方法 String Methods

Series 在str属性中准备了一组字符串处理方法,可以方便地对数组的每个元素进行操作,如下面的代码片段所示。

注意str中的模式匹配通常默认使用’正则表达式(在某些情况下总是使用它们)。

s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

0x08 合并 Merge

pandas 提供了一系列的方法,在 join/merge 操作中,可以轻松地对Series, DataFramePanel对象进行各种集合逻辑以及关系代数的操作。

0x08-1 Concat

df = pd.DataFrame(np.random.randn(10, 4))
df

0 1 2 3
0 -0.089143 0.087569 0.218078 -0.336815
1 -0.694835 0.108324 -1.882762 -0.220210
2 -2.069784 -0.788424 -1.694595 0.569263
3 2.433036 0.719851 -0.109935 0.177864
4 -0.838252 1.117676 0.880839 -1.118529
5 1.706060 0.290877 -0.907941 -0.532450
6 0.118169 -0.496975 -0.190422 0.783875
7 -1.213645 -0.489622 -0.798443 -0.471611
8 -0.314598 -2.658906 -1.139778 0.374437
9 0.229281 -1.372309 0.996916 -1.013462
# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
pieces
[          0         1         2         3
 0 -0.089143  0.087569  0.218078 -0.336815
 1 -0.694835  0.108324 -1.882762 -0.220210
 2 -2.069784 -0.788424 -1.694595  0.569263,
           0         1         2         3
 3  2.433036  0.719851 -0.109935  0.177864
 4 -0.838252  1.117676  0.880839 -1.118529
 5  1.706060  0.290877 -0.907941 -0.532450
 6  0.118169 -0.496975 -0.190422  0.783875,
           0         1         2         3
 7 -1.213645 -0.489622 -0.798443 -0.471611
 8 -0.314598 -2.658906 -1.139778  0.374437
 9  0.229281 -1.372309  0.996916 -1.013462]
pd.concat(pieces)

0 1 2 3
0 -0.089143 0.087569 0.218078 -0.336815
1 -0.694835 0.108324 -1.882762 -0.220210
2 -2.069784 -0.788424 -1.694595 0.569263
3 2.433036 0.719851 -0.109935 0.177864
4 -0.838252 1.117676 0.880839 -1.118529
5 1.706060 0.290877 -0.907941 -0.532450
6 0.118169 -0.496975 -0.190422 0.783875
7 -1.213645 -0.489622 -0.798443 -0.471611
8 -0.314598 -2.658906 -1.139778 0.374437
9 0.229281 -1.372309 0.996916 -1.013462

0x08-2 Join

类似于 SQL 类型的合并。

下面是 key 相同的情况。

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
left

key lval
0 foo 1
1 foo 2
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
right

key rval
0 foo 4
1 foo 5
pd.merge(left, right, on='key')

key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5

另一个例子,key 不同的情况。

left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
left

key lval
0 foo 1
1 bar 2
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
right

key rval
0 foo 4
1 bar 5
pd.merge(left, right, on='key')

key lval rval
0 foo 1 4
1 bar 2 5

0x08-3 Append

将一行加入到一个 DataFrame 上

df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df

A B C D
0 -2.727829 -1.382168 0.961920 0.083472
1 -0.297131 -0.294723 0.303267 -0.539177
2 -0.226735 -2.157170 1.559418 0.102023
3 0.569927 -0.800855 0.918515 1.630873
4 0.402999 -2.238052 2.181635 1.950669
5 1.238971 -0.165159 0.688200 0.793499
6 1.356265 -0.034859 -0.020774 0.485487
7 -1.793398 0.900625 0.365789 -2.089145
s = df.iloc[3]
df.append(s, ignore_index=True)  # 忽略 index

A B C D
0 -2.727829 -1.382168 0.961920 0.083472
1 -0.297131 -0.294723 0.303267 -0.539177
2 -0.226735 -2.157170 1.559418 0.102023
3 0.569927 -0.800855 0.918515 1.630873
4 0.402999 -2.238052 2.181635 1.950669
5 1.238971 -0.165159 0.688200 0.793499
6 1.356265 -0.034859 -0.020774 0.485487
7 -1.793398 0.900625 0.365789 -2.089145
8 0.569927 -0.800855 0.918515 1.630873
s = df.iloc[4]
df.append(s, ignore_index=False)

A B C D
0 -2.727829 -1.382168 0.961920 0.083472
1 -0.297131 -0.294723 0.303267 -0.539177
2 -0.226735 -2.157170 1.559418 0.102023
3 0.569927 -0.800855 0.918515 1.630873
4 0.402999 -2.238052 2.181635 1.950669
5 1.238971 -0.165159 0.688200 0.793499
6 1.356265 -0.034859 -0.020774 0.485487
7 -1.793398 0.900625 0.365789 -2.089145
4 0.402999 -2.238052 2.181635 1.950669

0x09 分组 Grouping

对于分组(group by),通常有下面几个操作步骤:

  • Splitting 按照一些规则将数据分为不同的组

  • Applying 对于每组数据分别执行一个函数

  • Combining 将结果组合到一个数据结构中

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                  'B' : ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                  'C' : np.random.randn(8),
                  'D' : np.random.randn(8)})
df

A B C D
0 foo one 1.038586 2.040014
1 bar one 0.454452 -0.553457
2 foo two -0.730273 -0.207136
3 bar three 0.435136 0.443462
4 foo two 2.557988 -1.831259
5 bar two 1.131084 0.915332
6 foo one 1.078171 0.037355
7 foo three 2.057203 1.209778

分组并对得到的分组执行sum函数。

df.groupby('A').sum()

C D
A
bar -1.001520 1.880284
foo 1.365107 1.459170

通过多个列进行分组形成一个层次(hierarchical)索引,然后执行函数。

df.groupby(['A','B']).sum()

C D
A B
bar one 0.598883 1.667737
three -0.605689 -0.531403
two -0.994715 0.743951
foo one -0.090250 0.694644
three 0.738067 1.498865
two 0.717290 -0.734339

0x0A 重塑/改变形状 Reshaping

0x0A-1 Stack

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                      'foo', 'foo', 'qux', 'qux'],
                     ['one', 'two', 'one', 'two',
                      'one', 'two', 'one', 'two']]))
tuples
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df

A B
first second
bar one 1.402402 -0.215913
two -0.305333 -0.679614
baz one 0.745205 0.872718
two 0.421718 -0.662528
foo one -1.433628 0.627898
two 2.129267 -0.260206
qux one 0.498752 -0.622814
two -0.356492 1.364184
df2 = df[:4]
df2

A B
first second
bar one 1.402402 -0.215913
two -0.305333 -0.679614
baz one 0.745205 0.872718
two 0.421718 -0.662528

stack()方法在 DataFrame 列的层次上进行压缩

stacked = df2.stack()
stacked
first  second   
bar    one     A    1.402402
               B   -0.215913
       two     A   -0.305333
               B   -0.679614
baz    one     A    0.745205
               B    0.872718
       two     A    0.421718
               B   -0.662528
dtype: float64

对于一个经过了 stack 的 DataFrame 或者 Series(以 MultiIndex 作为索引),stack()的逆运算是isunstack(),默认情况下是 unstack 最后一层。

就是说,bar、baz是第0级,one、two是第1级,A、B是第2级。

stacked.unstack()

A B
first second
bar one 1.402402 -0.215913
two -0.305333 -0.679614
baz one 0.745205 0.872718
two 0.421718 -0.662528
stacked.unstack(1)

second one two
first
bar A 1.402402 -0.305333
B -0.215913 -0.679614
baz A 0.745205 0.421718
B 0.872718 -0.662528
stacked.unstack(0)

first bar baz
second
one A 1.402402 0.745205
B -0.215913 0.872718
two A -0.305333 0.421718
B -0.679614 -0.662528

0x0A-2 数据透视表 Pivot Tables

df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df

A B C D E
0 one A foo 0.072696 0.305015
1 one B foo 1.428383 -1.108178
2 two C foo 1.239483 0.001580
3 three A bar -0.953547 1.061740
4 one B bar 0.762730 1.161542
5 one C bar -0.372740 -0.480251
6 two A foo 0.765181 -1.034218
7 three B foo -1.858334 0.359220
8 one C foo -0.952807 1.067494
9 one A bar 0.541013 -0.280833
10 two B bar -1.810560 -0.829793
11 three C bar -1.049108 -0.295998
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

C bar foo
A B
one A 0.541013 0.072696
B 0.762730 1.428383
C -0.372740 -0.952807
three A -0.953547 NaN
B NaN -1.858334
C -1.049108 NaN
two A NaN 0.765181
B -1.810560 NaN
C NaN 1.239483

0x0B 时间序列 Time Series

Pandas 在对频率转换进行重新采样时拥有简单、强大且高效的功能(如将按秒采样的数据转换为按5分钟为单位进行采样的数据)。这种操作在金融领域非常常见,但不局限于此。

哇,原来还可以这样玩唉(

这个在信号处理里面也很常用的呢!

rng = pd.date_range('1/1/2020', periods=200, freq='S')
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample('1Min').sum()
2020-01-01 00:00:00    13244
2020-01-01 00:01:00    14799
2020-01-01 00:02:00    15489
2020-01-01 00:03:00     3894
Freq: T, dtype: int32

时区表示

rng = pd.date_range('4/16/2020 00:00', periods=5, freq='D')
ts = pd.Series(np.random.randn(len(rng)), rng)
ts
2020-04-16    0.872043
2020-04-17   -0.305200
2020-04-18    1.681828
2020-04-19    0.163184
2020-04-20    1.415422
Freq: D, dtype: float64
ts_utc = ts.tz_localize('UTC')
ts_utc
2020-04-16 00:00:00+00:00    0.872043
2020-04-17 00:00:00+00:00   -0.305200
2020-04-18 00:00:00+00:00    1.681828
2020-04-19 00:00:00+00:00    0.163184
2020-04-20 00:00:00+00:00    1.415422
Freq: D, dtype: float64

转换到另外一个时区

ts_utc.tz_convert('Asia/Shanghai')
2020-04-16 08:00:00+08:00    0.872043
2020-04-17 08:00:00+08:00   -0.305200
2020-04-18 08:00:00+08:00    1.681828
2020-04-19 08:00:00+08:00    0.163184
2020-04-20 08:00:00+08:00    1.415422
Freq: D, dtype: float64

时间跨度转换

rng = pd.date_range('1/1/2020', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
2020-01-31    0.259847
2020-02-29   -0.092969
2020-03-31    0.525985
2020-04-30    0.090362
2020-05-31    1.707569
Freq: M, dtype: float64
ps = ts.to_period()
ps
2020-01    0.259847
2020-02   -0.092969
2020-03    0.525985
2020-04    0.090362
2020-05    1.707569
Freq: M, dtype: float64
ps.to_timestamp()
2020-01-01    0.259847
2020-02-01   -0.092969
2020-03-01    0.525985
2020-04-01    0.090362
2020-05-01    1.707569
Freq: MS, dtype: float64

时期和时间戳之间的转换使得可以使用一些方便的算术函数。

下面这个例子,转换一个每年以11月结束的季度频率为 季度结束后的月末的上午9点。(有点懵

prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
ts.head()
1990-03-01 09:00   -1.287411
1990-06-01 09:00    1.744485
1990-09-01 09:00    0.336561
1990-12-01 09:00   -1.206576
1991-03-01 09:00    0.108631
Freq: H, dtype: float64

0x0C 类别 Categoricals

pandas 可以在 DataFrame 中支持 category 类型的数据。

df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})

将 raw grades 转换为 category 数据类型。

df["grade"] = df["raw_grade"].astype("category")
df["grade"]
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

将 category 类型数据重命名为更有意义的名称,利用Series.cat.categories方法。

df["grade"].cat.categories = ["very good", "good", "very bad"]
df["grade"]
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (3, object): [very good, good, very bad]

对类别进行重新排序,同时增加缺失的类别。

Series.cat下的方法默认返回一个新的序列。

df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df["grade"]
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

排序是按照 category 的顺序进行的,而不是按照词汇顺序(lexical order)。

df.sort_values(by="grade")

id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good

对一个 Category 类型的列进行分组时显示了空的类别。

df.groupby("grade").size()
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

0x0D 作图 Plotting

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2020', periods=1000))
ts = ts.cumsum()
ts.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x250ccebcb70>

png

对于 DataFrame 而言,plot()方法是一种将所有带有标签的列进行绘制的简便方法。

df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])
# print(df)
df = df.cumsum()
plt.figure()
df.plot()
plt.legend(loc='best')
<matplotlib.legend.Legend at 0x250ce7620b8>




<Figure size 432x288 with 0 Axes>

png

ts.index
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10',
               ...
               '2022-09-17', '2022-09-18', '2022-09-19', '2022-09-20',
               '2022-09-21', '2022-09-22', '2022-09-23', '2022-09-24',
               '2022-09-25', '2022-09-26'],
              dtype='datetime64[ns]', length=1000, freq='D')

0x0E 导入导出数据 Getting Data In/Out

0x0E-1 CSV

写入 csv 文件

df.to_csv('foo.csv')  # 保存在当前目录下

从 csv 文件中导入数据

pd.read_csv('foo.csv')

Unnamed: 0 A B C D
0 2020-01-01 -1.397481 -0.453722 1.374079 -2.095767
1 2020-01-02 -1.363418 0.529293 0.178497 -2.364989
2 2020-01-03 -2.047267 -0.333020 0.379978 -2.628204
3 2020-01-04 -2.305870 -0.145598 -0.232491 -2.812290
4 2020-01-05 -1.851390 -1.378215 -1.202399 -4.574516
5 2020-01-06 -3.438928 -0.779975 -1.864979 -5.666847
6 2020-01-07 -4.201270 -0.827366 -0.796513 -5.267628
7 2020-01-08 -5.682762 -1.009404 -0.760801 -4.696243
8 2020-01-09 -4.362126 -0.768552 -0.747517 -5.562345
9 2020-01-10 -5.226357 -1.791920 0.290504 -5.214061
10 2020-01-11 -4.429691 -2.026951 -0.821212 -6.621030
11 2020-01-12 -4.222744 -2.875169 -0.654313 -5.926621
12 2020-01-13 -2.413314 -3.329878 -1.595397 -5.585677
13 2020-01-14 -2.769944 -3.536045 -1.463671 -5.098753
14 2020-01-15 -3.599683 -4.973067 -3.859425 -5.823875
15 2020-01-16 -3.486154 -4.792330 -2.866047 -6.312086
16 2020-01-17 -4.232714 -5.082053 -3.572924 -8.478300
17 2020-01-18 -5.346857 -5.787422 -4.313125 -8.595854
18 2020-01-19 -5.720494 -5.620498 -3.699862 -9.678113
19 2020-01-20 -5.264735 -4.938944 -2.354040 -8.942524
20 2020-01-21 -5.407854 -5.379811 -2.284447 -8.068545
21 2020-01-22 -5.793401 -5.501279 -2.254013 -7.997644
22 2020-01-23 -6.796539 -4.692782 -2.330458 -7.449650
23 2020-01-24 -7.502072 -4.085084 -1.810495 -6.540544
24 2020-01-25 -7.639024 -3.791416 1.214301 -7.599166
25 2020-01-26 -7.783544 -2.180918 1.628560 -7.928116
26 2020-01-27 -7.673699 -3.485888 0.640970 -8.124872
27 2020-01-28 -7.435098 -4.480212 -0.409425 -7.943383
28 2020-01-29 -6.843557 -5.069680 -1.650871 -7.311741
29 2020-01-30 -7.647267 -4.115332 -1.558887 -6.308304
... ... ... ... ... ...
970 2022-08-28 17.157914 -23.306973 -12.461653 -2.478731
971 2022-08-29 18.207251 -23.725280 -12.912440 -2.664828
972 2022-08-30 17.968302 -21.485150 -15.044877 -4.034060
973 2022-08-31 18.933869 -21.587237 -16.046081 -4.014128
974 2022-09-01 18.337284 -21.451739 -15.028398 -4.147177
975 2022-09-02 18.783912 -21.096423 -15.506137 -3.858413
976 2022-09-03 18.828378 -19.514168 -16.924864 -2.559397
977 2022-09-04 18.570511 -21.482747 -17.147155 -4.267082
978 2022-09-05 17.212119 -22.331837 -16.496095 -5.990857
979 2022-09-06 15.726498 -23.219830 -19.321023 -5.281962
980 2022-09-07 16.053999 -22.681168 -19.139536 -5.587482
981 2022-09-08 17.335456 -23.320514 -21.271891 -4.048170
982 2022-09-09 16.925567 -24.175782 -22.318101 -4.890968
983 2022-09-10 18.652048 -25.436089 -22.213175 -6.330027
984 2022-09-11 19.535978 -25.673818 -23.064392 -5.673334
985 2022-09-12 19.632355 -26.369119 -22.953900 -3.463010
986 2022-09-13 18.792587 -25.430236 -23.672206 -2.292810
987 2022-09-14 21.200064 -26.299390 -23.567706 -2.753986
988 2022-09-15 22.890536 -27.813985 -23.938164 -3.195772
989 2022-09-16 23.138447 -25.855344 -24.390078 -3.612660
990 2022-09-17 23.163567 -26.157774 -24.309178 -3.155928
991 2022-09-18 20.973996 -27.319631 -22.948785 -4.487101
992 2022-09-19 21.090756 -26.712847 -21.722990 -4.110810
993 2022-09-20 21.848391 -25.018472 -21.850579 -4.360523
994 2022-09-21 22.898217 -26.005364 -20.914269 -5.357187
995 2022-09-22 21.978310 -25.001244 -20.881626 -5.606849
996 2022-09-23 24.132628 -25.692639 -20.310265 -6.258977
997 2022-09-24 23.777794 -25.485527 -21.305882 -5.330284
998 2022-09-25 25.105611 -25.485135 -22.407657 -6.600228
999 2022-09-26 25.207099 -25.512208 -23.434955 -7.275999

1000 rows × 5 columns

0x0E-2 HDF5

写入 HDF5 存储

df.to_hdf('foo.h5','df')

从 HDF5 存储中读取数据

pd.read_hdf('foo.h5','df')

A B C D
2020-01-01 -1.397481 -0.453722 1.374079 -2.095767
2020-01-02 -1.363418 0.529293 0.178497 -2.364989
2020-01-03 -2.047267 -0.333020 0.379978 -2.628204
2020-01-04 -2.305870 -0.145598 -0.232491 -2.812290
2020-01-05 -1.851390 -1.378215 -1.202399 -4.574516
2020-01-06 -3.438928 -0.779975 -1.864979 -5.666847
2020-01-07 -4.201270 -0.827366 -0.796513 -5.267628
2020-01-08 -5.682762 -1.009404 -0.760801 -4.696243
2020-01-09 -4.362126 -0.768552 -0.747517 -5.562345
2020-01-10 -5.226357 -1.791920 0.290504 -5.214061
2020-01-11 -4.429691 -2.026951 -0.821212 -6.621030
2020-01-12 -4.222744 -2.875169 -0.654313 -5.926621
2020-01-13 -2.413314 -3.329878 -1.595397 -5.585677
2020-01-14 -2.769944 -3.536045 -1.463671 -5.098753
2020-01-15 -3.599683 -4.973067 -3.859425 -5.823875
2020-01-16 -3.486154 -4.792330 -2.866047 -6.312086
2020-01-17 -4.232714 -5.082053 -3.572924 -8.478300
2020-01-18 -5.346857 -5.787422 -4.313125 -8.595854
2020-01-19 -5.720494 -5.620498 -3.699862 -9.678113
2020-01-20 -5.264735 -4.938944 -2.354040 -8.942524
2020-01-21 -5.407854 -5.379811 -2.284447 -8.068545
2020-01-22 -5.793401 -5.501279 -2.254013 -7.997644
2020-01-23 -6.796539 -4.692782 -2.330458 -7.449650
2020-01-24 -7.502072 -4.085084 -1.810495 -6.540544
2020-01-25 -7.639024 -3.791416 1.214301 -7.599166
2020-01-26 -7.783544 -2.180918 1.628560 -7.928116
2020-01-27 -7.673699 -3.485888 0.640970 -8.124872
2020-01-28 -7.435098 -4.480212 -0.409425 -7.943383
2020-01-29 -6.843557 -5.069680 -1.650871 -7.311741
2020-01-30 -7.647267 -4.115332 -1.558887 -6.308304
... ... ... ... ...
2022-08-28 17.157914 -23.306973 -12.461653 -2.478731
2022-08-29 18.207251 -23.725280 -12.912440 -2.664828
2022-08-30 17.968302 -21.485150 -15.044877 -4.034060
2022-08-31 18.933869 -21.587237 -16.046081 -4.014128
2022-09-01 18.337284 -21.451739 -15.028398 -4.147177
2022-09-02 18.783912 -21.096423 -15.506137 -3.858413
2022-09-03 18.828378 -19.514168 -16.924864 -2.559397
2022-09-04 18.570511 -21.482747 -17.147155 -4.267082
2022-09-05 17.212119 -22.331837 -16.496095 -5.990857
2022-09-06 15.726498 -23.219830 -19.321023 -5.281962
2022-09-07 16.053999 -22.681168 -19.139536 -5.587482
2022-09-08 17.335456 -23.320514 -21.271891 -4.048170
2022-09-09 16.925567 -24.175782 -22.318101 -4.890968
2022-09-10 18.652048 -25.436089 -22.213175 -6.330027
2022-09-11 19.535978 -25.673818 -23.064392 -5.673334
2022-09-12 19.632355 -26.369119 -22.953900 -3.463010
2022-09-13 18.792587 -25.430236 -23.672206 -2.292810
2022-09-14 21.200064 -26.299390 -23.567706 -2.753986
2022-09-15 22.890536 -27.813985 -23.938164 -3.195772
2022-09-16 23.138447 -25.855344 -24.390078 -3.612660
2022-09-17 23.163567 -26.157774 -24.309178 -3.155928
2022-09-18 20.973996 -27.319631 -22.948785 -4.487101
2022-09-19 21.090756 -26.712847 -21.722990 -4.110810
2022-09-20 21.848391 -25.018472 -21.850579 -4.360523
2022-09-21 22.898217 -26.005364 -20.914269 -5.357187
2022-09-22 21.978310 -25.001244 -20.881626 -5.606849
2022-09-23 24.132628 -25.692639 -20.310265 -6.258977
2022-09-24 23.777794 -25.485527 -21.305882 -5.330284
2022-09-25 25.105611 -25.485135 -22.407657 -6.600228
2022-09-26 25.207099 -25.512208 -23.434955 -7.275999

1000 rows × 4 columns

0x0E-3 Excel

看网上说,相对于 xlwt、xlrd 之类的库,用 pandas 来对 excel 进行数据处理和分析挺不错的,正好最近也想试试呢。

写入 Excel

df.to_excel('foo.xlsx', sheet_name='Sheet1')

从 excel 中读取数据

pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])

Unnamed: 0 A B C D
0 2020-01-01 -1.397481 -0.453722 1.374079 -2.095767
1 2020-01-02 -1.363418 0.529293 0.178497 -2.364989
2 2020-01-03 -2.047267 -0.333020 0.379978 -2.628204
3 2020-01-04 -2.305870 -0.145598 -0.232491 -2.812290
4 2020-01-05 -1.851390 -1.378215 -1.202399 -4.574516
5 2020-01-06 -3.438928 -0.779975 -1.864979 -5.666847
6 2020-01-07 -4.201270 -0.827366 -0.796513 -5.267628
7 2020-01-08 -5.682762 -1.009404 -0.760801 -4.696243
8 2020-01-09 -4.362126 -0.768552 -0.747517 -5.562345
9 2020-01-10 -5.226357 -1.791920 0.290504 -5.214061
10 2020-01-11 -4.429691 -2.026951 -0.821212 -6.621030
11 2020-01-12 -4.222744 -2.875169 -0.654313 -5.926621
12 2020-01-13 -2.413314 -3.329878 -1.595397 -5.585677
13 2020-01-14 -2.769944 -3.536045 -1.463671 -5.098753
14 2020-01-15 -3.599683 -4.973067 -3.859425 -5.823875
15 2020-01-16 -3.486154 -4.792330 -2.866047 -6.312086
16 2020-01-17 -4.232714 -5.082053 -3.572924 -8.478300
17 2020-01-18 -5.346857 -5.787422 -4.313125 -8.595854
18 2020-01-19 -5.720494 -5.620498 -3.699862 -9.678113
19 2020-01-20 -5.264735 -4.938944 -2.354040 -8.942524
20 2020-01-21 -5.407854 -5.379811 -2.284447 -8.068545
21 2020-01-22 -5.793401 -5.501279 -2.254013 -7.997644
22 2020-01-23 -6.796539 -4.692782 -2.330458 -7.449650
23 2020-01-24 -7.502072 -4.085084 -1.810495 -6.540544
24 2020-01-25 -7.639024 -3.791416 1.214301 -7.599166
25 2020-01-26 -7.783544 -2.180918 1.628560 -7.928116
26 2020-01-27 -7.673699 -3.485888 0.640970 -8.124872
27 2020-01-28 -7.435098 -4.480212 -0.409425 -7.943383
28 2020-01-29 -6.843557 -5.069680 -1.650871 -7.311741
29 2020-01-30 -7.647267 -4.115332 -1.558887 -6.308304
... ... ... ... ... ...
970 2022-08-28 17.157914 -23.306973 -12.461653 -2.478731
971 2022-08-29 18.207251 -23.725280 -12.912440 -2.664828
972 2022-08-30 17.968302 -21.485150 -15.044877 -4.034060
973 2022-08-31 18.933869 -21.587237 -16.046081 -4.014128
974 2022-09-01 18.337284 -21.451739 -15.028398 -4.147177
975 2022-09-02 18.783912 -21.096423 -15.506137 -3.858413
976 2022-09-03 18.828378 -19.514168 -16.924864 -2.559397
977 2022-09-04 18.570511 -21.482747 -17.147155 -4.267082
978 2022-09-05 17.212119 -22.331837 -16.496095 -5.990857
979 2022-09-06 15.726498 -23.219830 -19.321023 -5.281962
980 2022-09-07 16.053999 -22.681168 -19.139536 -5.587482
981 2022-09-08 17.335456 -23.320514 -21.271891 -4.048170
982 2022-09-09 16.925567 -24.175782 -22.318101 -4.890968
983 2022-09-10 18.652048 -25.436089 -22.213175 -6.330027
984 2022-09-11 19.535978 -25.673818 -23.064392 -5.673334
985 2022-09-12 19.632355 -26.369119 -22.953900 -3.463010
986 2022-09-13 18.792587 -25.430236 -23.672206 -2.292810
987 2022-09-14 21.200064 -26.299390 -23.567706 -2.753986
988 2022-09-15 22.890536 -27.813985 -23.938164 -3.195772
989 2022-09-16 23.138447 -25.855344 -24.390078 -3.612660
990 2022-09-17 23.163567 -26.157774 -24.309178 -3.155928
991 2022-09-18 20.973996 -27.319631 -22.948785 -4.487101
992 2022-09-19 21.090756 -26.712847 -21.722990 -4.110810
993 2022-09-20 21.848391 -25.018472 -21.850579 -4.360523
994 2022-09-21 22.898217 -26.005364 -20.914269 -5.357187
995 2022-09-22 21.978310 -25.001244 -20.881626 -5.606849
996 2022-09-23 24.132628 -25.692639 -20.310265 -6.258977
997 2022-09-24 23.777794 -25.485527 -21.305882 -5.330284
998 2022-09-25 25.105611 -25.485135 -22.407657 -6.600228
999 2022-09-26 25.207099 -25.512208 -23.434955 -7.275999

1000 rows × 5 columns

0x0F 陷阱 Gotchas

如果你在运行中看到例如下面的异常。

if pd.Series([False, True, False]):
    print("I was true")
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-304-5c782b38cd2f> in <module>
----> 1 if pd.Series([False, True, False]):
      2     print("I was true")


D:\Programs\Anaconda\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
   1476         raise ValueError("The truth value of a {0} is ambiguous. "
   1477                          "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
-> 1478                          .format(self.__class__.__name__))
   1479 
   1480     __bool__ = __nonzero__


ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

可以参考 Comparisons 这里来寻求解释和下一步的方案。

也可以参考 Comparisons

(上面都是官方的文档地址)

(官方文档到这里就结束了)



0x10 小结

关于选取:

  • 选择整行/整列,或多个整行/多个整列的数据,可以用df[]df.loc[]df.iloc[]这三种用法。
  • 区域选取,分为标签索引整数索引
    • 标签索引:df.loc[]
    • 整数索引:df.iloc[]
  • 选取单元格df.at[]df.iat[]df.loc[]df.iloc[] 都可以,要注意参数。df[]不行!
  • 关于选取的返回值:
    • 返回值包括单行多列多行单列时,返回值为 Series 对象
    • 返回值包括多行多列时,返回值为 DataFrame 对象
    • 返回值仅为一个单元格(单行单列)时,返回值为(可能是 numpy 的)基本数据类型,例如 str, float64, int64 等
    • df[]因为不能精确到单元格,所以返回值一定 DataFrame 或 Series 对象
  • 当使用 DataFrame 的默认索引(整数索引)时,整数索引即为标签索引。

其实 pandas 还有很多有用的 API 这里面还没介绍,后面有空再慢慢了解呢。

做项目的时候看看别人做过的实例,再深入看看相关的 API 好了。

大概就这些内容了吧。

(溜了溜了

20200418 更新:

Jupyter Notebook 文档 (.ipynb) 放到 GitHub 上了,详见这里:https://github.com/miaotony/DataScience/blob/master/10_Minutes_to_pandas.ipynb

下载文档的话 戳这里(CDN,推荐) 呢!或者 也可以戳这里(GitHub raw)


0xFF References & Extensive Reading

官方的 10 Minutes to pandas

[python数据分析之pandas数据选取:df] df.loc[] df.iloc[] df.ix[] df.at[] df.iat[]

学习python中的pandas有没有好的教程推荐?

从Excel到Python:最常用的36个Pandas函数!

etc.


文章作者: MiaoTony
版权声明: 本博客所有文章除特别声明外,均采用 CC BY-NC-SA 4.0 许可协议。转载请注明来源 MiaoTony !
评论
 上一篇
CTF | 2020 Ha1cyon_CTF公开赛 WriteUp CTF | 2020 Ha1cyon_CTF公开赛 WriteUp
周末看了一下Ha1cyon_CTF“萌新”公开赛的题目,发现是一个难到爆炸的蹲WP劝退比赛。就写写部分WP,再来点小体会吧。
2020-04-21
下一篇 
基于acme.sh从Let's encrypt生成免费且自动更新的SSL证书 基于acme.sh从Let's encrypt生成免费且自动更新的SSL证书
通过Let's encrypt可以获得90天免费且可续期的SSL证书,而利用acme.sh可以自动生成和更新,这篇文章就来介绍一下配置的过程吧。
2020-03-28
  目录