Python基础(十一) | 超详细的Pandas库三万字总结(四)

举报
timerring 发表于 2022/10/07 09:37:36 2022/10/07
【摘要】 11.6 分组和数据透视表df = pd.DataFrame({"key":["A", "B", "C", "C", "B", "A"], "data1": range(6), "data2": np.random.randint(0, 10, size=6)})df<style scoped> .dataframe...

11.6 分组和数据透视表

image-20221002211059551

df = pd.DataFrame({"key":["A", "B", "C", "C", "B", "A"],
                  "data1": range(6),
                  "data2": np.random.randint(0, 10, size=6)})
df
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
key data1 data2
0 A 0 1
1 B 1 4
2 C 2 9
3 C 3 9
4 B 4 1
5 A 5 9

(1)分组

  • 延迟计算
df.groupby("key")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002276795A240>

这说明已经分好了,等待我们用什么样的方法进行处理后,再显示。

df.groupby("key").sum()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
data1 data2
key
A 5 10
B 5 6
C 5 11
df.groupby("key").mean()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
data1 data2
key
A 2.5 5.0
B 2.5 3.0
C 2.5 5.5

可以打印看看这是什么东西:

for i in df.groupby("key"):
    print(str(i))
('A',   key  data1  data2
0   A      0      2
5   A      5      8)
('B',   key  data1  data2
1   B      1      2
4   B      4      4)
('C',   key  data1  data2
2   C      2      8
3   C      3      3)
  • 按列取值
df.groupby("key")["data2"].sum()
key
A    10
B     6
C    11
Name: data2, dtype: int32
  • 按组迭代
for data, group in df.groupby("key"):
    print("{0:5} shape={1}".format(data, group.shape))
A     shape=(2, 3)
B     shape=(2, 3)
C     shape=(2, 3)
  • 调用方法
df.groupby("key")["data1"].describe()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
count mean std min 25% 50% 75% max
key
A 2.0 2.5 3.535534 0.0 1.25 2.5 3.75 5.0
B 2.0 2.5 2.121320 1.0 1.75 2.5 3.25 4.0
C 2.0 2.5 0.707107 2.0 2.25 2.5 2.75 3.0
  • 支持更复杂的操作
df.groupby("key").aggregate(["min", "median", "max"])
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead tr th {
    text-align: left;
}

.dataframe thead tr:last-of-type th {
    text-align: right;
}
</style>
data1 data2
min median max min median max
key
A 0 2.5 5 2 5.0 8
B 1 2.5 4 2 3.0 4
C 2 2.5 3 3 5.5 8
  • 过滤
def filter_func(x):
    return x["data2"].std() > 3
df.groupby("key")["data2"].std()
key
A    4.242641
B    1.414214
C    3.535534
Name: data2, dtype: float64
df.groupby("key").filter(filter_func)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
key data1 data2
0 A 0 2
2 C 2 8
3 C 3 3
5 A 5 8
  • 转换
df
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
key data1 data2
0 A 0 2
1 B 1 2
2 C 2 8
3 C 3 3
4 B 4 4
5 A 5 8
df.groupby("key").transform(lambda x: x-x.mean())
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
data1 data2
0 -2.5 -3.0
1 -1.5 -1.0
2 -0.5 2.5
3 0.5 -2.5
4 1.5 1.0
5 2.5 3.0
df
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
key data1 data2
0 A 0 1
1 B 1 4
2 C 2 9
3 C 3 9
4 B 4 1
5 A 5 9
df.groupby("key").apply(lambda x: x-x.mean())
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
data1 data2
0 -2.5 -4.0
1 -1.5 1.5
2 -0.5 0.0
3 0.5 0.0
4 1.5 -1.5
5 2.5 4.0
  • apply()方法
df
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
key data1 data2
0 A 0 2
1 B 1 2
2 C 2 8
3 C 3 3
4 B 4 4
5 A 5 8
def norm_by_data2(x):
    x["data1"] /= x["data2"].sum()
    return x
df.groupby("key").apply(norm_by_data2)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
key data1 data2
0 A 0.000000 2
1 B 0.166667 2
2 C 0.181818 8
3 C 0.272727 3
4 B 0.666667 4
5 A 0.500000 8
  • 将列表、数组设为分组键

这里的L相当于一个新的标签替代原来的行标签。

L = [0, 1, 0, 1, 2, 0]
df
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
key data1 data2
0 A 0 2
1 B 1 2
2 C 2 8
3 C 3 3
4 B 4 4
5 A 5 8
df.groupby(L).sum()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
data1 data2
0 7 18
1 4 5
2 4 4
  • 用字典将索引映射到分组
df2 = df.set_index("key")
df2
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
data1 data2
key
A 0 2
B 1 2
C 2 8
C 3 3
B 4 4
A 5 8
mapping = {"A": "first", "B": "constant", "C": "constant"}
df2.groupby(mapping).sum()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
data1 data2
constant 10 17
first 5 10
  • 任意Python函数
df2.groupby(str.lower).mean()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
data1 data2
a 2.5 5.0
b 2.5 3.0
c 2.5 5.5
  • 多个有效值组成的列表

只有这两个数都相等,才会分到同一个组。

df2.groupby([str.lower, mapping]).mean()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
data1 data2
a first 2.5 5.0
b constant 2.5 3.0
c constant 2.5 5.5

【例1】 行星观测数据处理

import seaborn as sns

planets = sns.load_dataset("planets")
planets.shape
(1035, 6)
planets.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
planets.describe()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
number orbital_period mass distance year
count 1035.000000 992.000000 513.000000 808.000000 1035.000000
mean 1.785507 2002.917596 2.638161 264.069282 2009.070531
std 1.240976 26014.728304 3.818617 733.116493 3.972567
min 1.000000 0.090706 0.003600 1.350000 1989.000000
25% 1.000000 5.442540 0.229000 32.560000 2007.000000
50% 1.000000 39.979500 1.260000 55.250000 2010.000000
75% 2.000000 526.005000 3.040000 178.500000 2012.000000
max 7.000000 730000.000000 25.000000 8500.000000 2014.000000
planets.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
decade = 10 * (planets["year"] // 10)
decade.head()
0    2000
1    2000
2    2010
3    2000
4    2000
Name: year, dtype: int64
decade = decade.astype(str) + "s"
decade.name = "decade"
decade.head()
0    2000s
1    2000s
2    2010s
3    2000s
4    2000s
Name: decade, dtype: object
planets.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
planets.groupby(["method", decade]).sum()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
number orbital_period mass distance year
method decade
Astrometry 2010s 2 1.262360e+03 0.00000 35.75 4023
Eclipse Timing Variations 2000s 5 1.930800e+04 6.05000 261.44 6025
2010s 10 2.345680e+04 4.20000 1000.00 12065
Imaging 2000s 29 1.350935e+06 0.00000 956.83 40139
2010s 21 6.803750e+04 0.00000 1210.08 36208
Microlensing 2000s 12 1.732500e+04 0.00000 0.00 20070
2010s 15 4.750000e+03 0.00000 41440.00 26155
Orbital Brightness Modulation 2010s 5 2.127920e+00 0.00000 2360.00 6035
Pulsar Timing 1990s 9 1.900153e+02 0.00000 0.00 5978
2000s 1 3.652500e+04 0.00000 0.00 2003
2010s 1 9.070629e-02 0.00000 1200.00 2011
Pulsation Timing Variations 2000s 1 1.170000e+03 0.00000 0.00 2007
Radial Velocity 1980s 1 8.388800e+01 11.68000 40.57 1989
1990s 52 1.091561e+04 68.17820 723.71 55943
2000s 475 2.633526e+05 945.31928 15201.16 619775
2010s 424 1.809630e+05 316.47890 11382.67 432451
Transit 2000s 64 2.897102e+02 0.00000 31823.31 124462
2010s 712 8.087813e+03 1.47000 102419.46 673999
Transit Timing Variations 2010s 9 2.393505e+02 0.00000 3313.00 8050

这里使用两个中括号[[]],取出来是DF类型的数据,而一个中括号[]取出来是Serios的数据,前者更美观一点。

planets.groupby(["method", decade])[["number"]].sum().unstack().fillna(0)
【版权声明】本文为华为云社区用户原创内容,未经允许不得转载,如需转载请自行联系原作者进行授权。如果您发现本社区中有涉嫌抄袭的内容,欢迎发送邮件进行举报,并提供相关证据,一经查实,本社区将立刻删除涉嫌侵权内容,举报邮箱: cloudbbs@huaweicloud.com
  • 点赞
  • 收藏
  • 关注作者

评论(0

0/1000
抱歉,系统识别当前为高风险访问,暂不支持该操作

全部回复

上滑加载中

设置昵称

在此一键设置昵称,即可参与社区互动!

*长度不超过10个汉字或20个英文字符,设置后3个月内不可修改。

*长度不超过10个汉字或20个英文字符,设置后3个月内不可修改。