How to Combine Groupby and Multiple Aggregate Functions in Pandas

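In Pandas, the groupby method splits the rows of a DataFrame into groups, and an aggregate function then reduces each group to a summary value. A single function is often enough, but when several statistics are needed for each group, multiple aggregate functions can be applied in one call.

The steps below show how to combine groupby with multiple aggregate functions in Pandas: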

1. Create a DataFrame and import the required libraries

First, import Pandas and NumPy and create a sample DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({
                    'A' : ['foo', 'bar', 'foo', 'bar',
                            'foo', 'bar', 'foo', 'foo'],
                    'B' : ['one', 'one', 'two', 'three',
                            'two', 'two', 'one', 'three'],
                    'C' : [1.0, 2.0, 3.0, 4.0,
                            5.0, 6.0, 7.0, 8.0],
                    'D' : [10, 20, 30, 40, 50, 60, 70, 80]
                  })
df

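2. Group by one key and apply multiple aggregate functions

Suppose we want to group by column A and then compute several summary statistics (for example the mean, minimum, maximum, and sum). We can use the following code:

grouped = df.groupby('A').agg({'C': 'mean', 'D': ['min', 'max', 'sum']})
grouped

This code groups the rows by column A, then computes the mean of column C and the minimum, maximum, and sum of column D in a single agg call. The functions are given by their string names rather than as NumPy functions such as np.mean: recent Pandas releases emit a FutureWarning when NumPy functions are passed to agg, and the string names give the predictable column labels mean, min, max, and sum.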
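Because column D receives a list of functions, the result has a two-level column header. The following is a minimal sketch of how to inspect it, assuming the sample data from step 1 (column B is left out for brevity) and string function names; the variable names column_labels and foo_d_sum are only illustrative.

```python
import pandas as pd

# Sample data from step 1, without column B, which is not aggregated.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'C': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    'D': [10, 20, 30, 40, 50, 60, 70, 80],
})

grouped = df.groupby('A').agg({'C': 'mean', 'D': ['min', 'max', 'sum']})

# Each column label is a (source column, function name) tuple.
column_labels = grouped.columns.tolist()
print(column_labels)

# A single cell is addressed by its group key and the column tuple.
foo_d_sum = grouped.loc['foo', ('D', 'sum')]
print(foo_d_sum)
```

These tuples are what the next step joins with an underscore to build flat column names.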

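3. Rename the columns and flatten the multi-level header

Finally, we can merge the multi-level column header into single names and reset the index.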

grouped.columns = ['_'.join(col).strip() for col in grouped.columns.values]
grouped.reset_index(inplace=True)
grouped

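This code joins the two levels of each column label with an underscore to produce flat names (for example C_mean and D_sum). It then calls reset_index() to move the group key A out of the index and back into an ordinary column, leaving a default integer index.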
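As an alternative to flattening the header afterwards, Pandas 0.25 and later support named aggregation, where each output column is given as a keyword mapped to a (column, function) pair. The sketch below assumes the same sample data; the output names C_mean, D_min, D_max, and D_sum are chosen here for illustration.

```python
import pandas as pd

# Sample data from step 1, without column B, which is not aggregated.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'C': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    'D': [10, 20, 30, 40, 50, 60, 70, 80],
})

# Named aggregation: output_name=(source column, function name).
# The result already has flat column names, so no joining is needed.
flat = (
    df.groupby('A')
      .agg(
          C_mean=('C', 'mean'),
          D_min=('D', 'min'),
          D_max=('D', 'max'),
          D_sum=('D', 'sum'),
      )
      .reset_index()  # turn the group key A back into a regular column
)
print(flat)
```

This form keeps the output names explicit at the point where the aggregation is defined, which can be easier to read when many columns are involved.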

The complete code is as follows:

import pandas as pd

# Sample data: A is the grouping key, C and D are the columns to aggregate
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar',
          'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three',
          'two', 'two', 'one', 'three'],
    'C': [1.0, 2.0, 3.0, 4.0,
          5.0, 6.0, 7.0, 8.0],
    'D': [10, 20, 30, 40, 50, 60, 70, 80]
})

# Mean of C, plus min, max, and sum of D, for each value of A
grouped = df.groupby('A').agg({'C': 'mean', 'D': ['min', 'max', 'sum']})
# Join the two header levels into flat names such as D_sum
grouped.columns = ['_'.join(col).strip() for col in grouped.columns.values]
# Move the group key A from the index back into a column
grouped.reset_index(inplace=True)
grouped

The output is as follows:

     A  C_mean  D_min  D_max  D_sum
0  bar     4.0     20     60    120
1  foo     4.8     10     80    240

The result shows the data grouped by column A, with the mean of column C and the minimum, maximum, and sum of column D for each group.
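One way to sanity-check an aggregated table is to recompute a few values for a single group by filtering the rows directly. The sketch below does this for the foo group, assuming the same sample data; the variable names are illustrative.

```python
import pandas as pd

# Sample data from step 1, without column B, which is not aggregated.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'C': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    'D': [10, 20, 30, 40, 50, 60, 70, 80],
})

# Select only the rows whose key is 'foo' and aggregate them by hand.
foo_rows = df[df['A'] == 'foo']
foo_c_mean = foo_rows['C'].mean()  # (1 + 3 + 5 + 7 + 8) / 5
foo_d_sum = foo_rows['D'].sum()    # 10 + 30 + 50 + 70 + 80

print(foo_c_mean, foo_d_sum)
```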
