Python基础(十一) | 超详细的Pandas库三万字总结(四)
【摘要】 11.6 分组和数据透视表df = pd.DataFrame({"key":["A", "B", "C", "C", "B", "A"], "data1": range(6), "data2": np.random.randint(0, 10, size=6)})df<style scoped> .dataframe...
11.6 分组和数据透视表
df = pd.DataFrame({"key":["A", "B", "C", "C", "B", "A"],
"data1": range(6),
"data2": np.random.randint(0, 10, size=6)})
df
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key | data1 | data2 | |
---|---|---|---|
0 | A | 0 | 1 |
1 | B | 1 | 4 |
2 | C | 2 | 9 |
3 | C | 3 | 9 |
4 | B | 4 | 1 |
5 | A | 5 | 9 |
(1)分组
- 延迟计算
df.groupby("key")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002276795A240>
这说明已经分好了,等待我们用什么样的方法进行处理后,再显示。
df.groupby("key").sum()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
data1 | data2 | |
---|---|---|
key | ||
A | 5 | 10 |
B | 5 | 6 |
C | 5 | 11 |
df.groupby("key").mean()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
data1 | data2 | |
---|---|---|
key | ||
A | 2.5 | 5.0 |
B | 2.5 | 3.0 |
C | 2.5 | 5.5 |
可以打印看看这是什么东西:
for i in df.groupby("key"):
print(str(i))
('A', key data1 data2
0 A 0 2
5 A 5 8)
('B', key data1 data2
1 B 1 2
4 B 4 4)
('C', key data1 data2
2 C 2 8
3 C 3 3)
- 按列取值
df.groupby("key")["data2"].sum()
key
A 10
B 6
C 11
Name: data2, dtype: int32
- 按组迭代
for data, group in df.groupby("key"):
print("{0:5} shape={1}".format(data, group.shape))
A shape=(2, 3)
B shape=(2, 3)
C shape=(2, 3)
- 调用方法
df.groupby("key")["data1"].describe()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
key | ||||||||
A | 2.0 | 2.5 | 3.535534 | 0.0 | 1.25 | 2.5 | 3.75 | 5.0 |
B | 2.0 | 2.5 | 2.121320 | 1.0 | 1.75 | 2.5 | 3.25 | 4.0 |
C | 2.0 | 2.5 | 0.707107 | 2.0 | 2.25 | 2.5 | 2.75 | 3.0 |
- 支持更复杂的操作
df.groupby("key").aggregate(["min", "median", "max"])
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
</style>
data1 | data2 | |||||
---|---|---|---|---|---|---|
min | median | max | min | median | max | |
key | ||||||
A | 0 | 2.5 | 5 | 2 | 5.0 | 8 |
B | 1 | 2.5 | 4 | 2 | 3.0 | 4 |
C | 2 | 2.5 | 3 | 3 | 5.5 | 8 |
- 过滤
def filter_func(x):
return x["data2"].std() > 3
df.groupby("key")["data2"].std()
key
A 4.242641
B 1.414214
C 3.535534
Name: data2, dtype: float64
df.groupby("key").filter(filter_func)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key | data1 | data2 | |
---|---|---|---|
0 | A | 0 | 2 |
2 | C | 2 | 8 |
3 | C | 3 | 3 |
5 | A | 5 | 8 |
- 转换
df
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key | data1 | data2 | |
---|---|---|---|
0 | A | 0 | 2 |
1 | B | 1 | 2 |
2 | C | 2 | 8 |
3 | C | 3 | 3 |
4 | B | 4 | 4 |
5 | A | 5 | 8 |
df.groupby("key").transform(lambda x: x-x.mean())
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
data1 | data2 | |
---|---|---|
0 | -2.5 | -3.0 |
1 | -1.5 | -1.0 |
2 | -0.5 | 2.5 |
3 | 0.5 | -2.5 |
4 | 1.5 | 1.0 |
5 | 2.5 | 3.0 |
df
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key | data1 | data2 | |
---|---|---|---|
0 | A | 0 | 1 |
1 | B | 1 | 4 |
2 | C | 2 | 9 |
3 | C | 3 | 9 |
4 | B | 4 | 1 |
5 | A | 5 | 9 |
df.groupby("key").apply(lambda x: x-x.mean())
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
data1 | data2 | |
---|---|---|
0 | -2.5 | -4.0 |
1 | -1.5 | 1.5 |
2 | -0.5 | 0.0 |
3 | 0.5 | 0.0 |
4 | 1.5 | -1.5 |
5 | 2.5 | 4.0 |
- apply()方法
df
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key | data1 | data2 | |
---|---|---|---|
0 | A | 0 | 2 |
1 | B | 1 | 2 |
2 | C | 2 | 8 |
3 | C | 3 | 3 |
4 | B | 4 | 4 |
5 | A | 5 | 8 |
def norm_by_data2(x):
x["data1"] /= x["data2"].sum()
return x
df.groupby("key").apply(norm_by_data2)
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key | data1 | data2 | |
---|---|---|---|
0 | A | 0.000000 | 2 |
1 | B | 0.166667 | 2 |
2 | C | 0.181818 | 8 |
3 | C | 0.272727 | 3 |
4 | B | 0.666667 | 4 |
5 | A | 0.500000 | 8 |
- 将列表、数组设为分组键
这里的L相当于一个新的标签替代原来的行标签。
L = [0, 1, 0, 1, 2, 0]
df
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key | data1 | data2 | |
---|---|---|---|
0 | A | 0 | 2 |
1 | B | 1 | 2 |
2 | C | 2 | 8 |
3 | C | 3 | 3 |
4 | B | 4 | 4 |
5 | A | 5 | 8 |
df.groupby(L).sum()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
data1 | data2 | |
---|---|---|
0 | 7 | 18 |
1 | 4 | 5 |
2 | 4 | 4 |
- 用字典将索引映射到分组
df2 = df.set_index("key")
df2
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
data1 | data2 | |
---|---|---|
key | ||
A | 0 | 2 |
B | 1 | 2 |
C | 2 | 8 |
C | 3 | 3 |
B | 4 | 4 |
A | 5 | 8 |
mapping = {"A": "first", "B": "constant", "C": "constant"}
df2.groupby(mapping).sum()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
data1 | data2 | |
---|---|---|
constant | 10 | 17 |
first | 5 | 10 |
- 任意Python函数
df2.groupby(str.lower).mean()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
data1 | data2 | |
---|---|---|
a | 2.5 | 5.0 |
b | 2.5 | 3.0 |
c | 2.5 | 5.5 |
- 多个有效值组成的列表
只有这两个数都相等,才会分到同一个组。
df2.groupby([str.lower, mapping]).mean()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
data1 | data2 | ||
---|---|---|---|
a | first | 2.5 | 5.0 |
b | constant | 2.5 | 3.0 |
c | constant | 2.5 | 5.5 |
【例1】 行星观测数据处理
import seaborn as sns
planets = sns.load_dataset("planets")
planets.shape
(1035, 6)
planets.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
method | number | orbital_period | mass | distance | year | |
---|---|---|---|---|---|---|
0 | Radial Velocity | 1 | 269.300 | 7.10 | 77.40 | 2006 |
1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
2 | Radial Velocity | 1 | 763.000 | 2.60 | 19.84 | 2011 |
3 | Radial Velocity | 1 | 326.030 | 19.40 | 110.62 | 2007 |
4 | Radial Velocity | 1 | 516.220 | 10.50 | 119.47 | 2009 |
planets.describe()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
number | orbital_period | mass | distance | year | |
---|---|---|---|---|---|
count | 1035.000000 | 992.000000 | 513.000000 | 808.000000 | 1035.000000 |
mean | 1.785507 | 2002.917596 | 2.638161 | 264.069282 | 2009.070531 |
std | 1.240976 | 26014.728304 | 3.818617 | 733.116493 | 3.972567 |
min | 1.000000 | 0.090706 | 0.003600 | 1.350000 | 1989.000000 |
25% | 1.000000 | 5.442540 | 0.229000 | 32.560000 | 2007.000000 |
50% | 1.000000 | 39.979500 | 1.260000 | 55.250000 | 2010.000000 |
75% | 2.000000 | 526.005000 | 3.040000 | 178.500000 | 2012.000000 |
max | 7.000000 | 730000.000000 | 25.000000 | 8500.000000 | 2014.000000 |
planets.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
method | number | orbital_period | mass | distance | year | |
---|---|---|---|---|---|---|
0 | Radial Velocity | 1 | 269.300 | 7.10 | 77.40 | 2006 |
1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
2 | Radial Velocity | 1 | 763.000 | 2.60 | 19.84 | 2011 |
3 | Radial Velocity | 1 | 326.030 | 19.40 | 110.62 | 2007 |
4 | Radial Velocity | 1 | 516.220 | 10.50 | 119.47 | 2009 |
decade = 10 * (planets["year"] // 10)
decade.head()
0 2000
1 2000
2 2010
3 2000
4 2000
Name: year, dtype: int64
decade = decade.astype(str) + "s"
decade.name = "decade"
decade.head()
0 2000s
1 2000s
2 2010s
3 2000s
4 2000s
Name: decade, dtype: object
planets.head()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
method | number | orbital_period | mass | distance | year | |
---|---|---|---|---|---|---|
0 | Radial Velocity | 1 | 269.300 | 7.10 | 77.40 | 2006 |
1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
2 | Radial Velocity | 1 | 763.000 | 2.60 | 19.84 | 2011 |
3 | Radial Velocity | 1 | 326.030 | 19.40 | 110.62 | 2007 |
4 | Radial Velocity | 1 | 516.220 | 10.50 | 119.47 | 2009 |
planets.groupby(["method", decade]).sum()
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
number | orbital_period | mass | distance | year | ||
---|---|---|---|---|---|---|
method | decade | |||||
Astrometry | 2010s | 2 | 1.262360e+03 | 0.00000 | 35.75 | 4023 |
Eclipse Timing Variations | 2000s | 5 | 1.930800e+04 | 6.05000 | 261.44 | 6025 |
2010s | 10 | 2.345680e+04 | 4.20000 | 1000.00 | 12065 | |
Imaging | 2000s | 29 | 1.350935e+06 | 0.00000 | 956.83 | 40139 |
2010s | 21 | 6.803750e+04 | 0.00000 | 1210.08 | 36208 | |
Microlensing | 2000s | 12 | 1.732500e+04 | 0.00000 | 0.00 | 20070 |
2010s | 15 | 4.750000e+03 | 0.00000 | 41440.00 | 26155 | |
Orbital Brightness Modulation | 2010s | 5 | 2.127920e+00 | 0.00000 | 2360.00 | 6035 |
Pulsar Timing | 1990s | 9 | 1.900153e+02 | 0.00000 | 0.00 | 5978 |
2000s | 1 | 3.652500e+04 | 0.00000 | 0.00 | 2003 | |
2010s | 1 | 9.070629e-02 | 0.00000 | 1200.00 | 2011 | |
Pulsation Timing Variations | 2000s | 1 | 1.170000e+03 | 0.00000 | 0.00 | 2007 |
Radial Velocity | 1980s | 1 | 8.388800e+01 | 11.68000 | 40.57 | 1989 |
1990s | 52 | 1.091561e+04 | 68.17820 | 723.71 | 55943 | |
2000s | 475 | 2.633526e+05 | 945.31928 | 15201.16 | 619775 | |
2010s | 424 | 1.809630e+05 | 316.47890 | 11382.67 | 432451 | |
Transit | 2000s | 64 | 2.897102e+02 | 0.00000 | 31823.31 | 124462 |
2010s | 712 | 8.087813e+03 | 1.47000 | 102419.46 | 673999 | |
Transit Timing Variations | 2010s | 9 | 2.393505e+02 | 0.00000 | 3313.00 | 8050 |
这里使用两个中括号[[]],取出来是DF类型的数据,而一个中括号[]取出来是Serios的数据,前者更美观一点。
planets.groupby(["method", decade])[["number"]].sum().unstack().fillna(0)
【版权声明】本文为华为云社区用户原创内容,未经允许不得转载,如需转载请自行联系原作者进行授权。如果您发现本社区中有涉嫌抄袭的内容,欢迎发送邮件进行举报,并提供相关证据,一经查实,本社区将立刻删除涉嫌侵权内容,举报邮箱:
cloudbbs@huaweicloud.com
- 点赞
- 收藏
- 关注作者
评论(0)