#### [SOLVED] Get statistics for each group (such as count, mean, etc) using pandas GroupBy?

By Roman

I have a data frame `df` and I use several columns from it to `groupby`:

``````df[['col1','col2','col3','col4']].groupby(['col1','col2']).mean()
``````

In the above way I almost get the table (data frame) that I need. What is missing is an additional column that contains the number of rows in each group. In other words, I have the mean, but I would also like to know how many values were used to compute these means. For example, in the first group there are 8 values, in the second one 10, and so on.

In short: How do I get group-wise statistics for a dataframe?

# One Function to Rule Them All: `GroupBy.describe`

Returns `count`, `mean`, `std`, and other useful statistics per-group.

``````df.groupby(['col1', 'col2'])[['col3', 'col4']].describe()
``````

``````# Setup
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the numbers below are reproducible
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
``````

``````from IPython.display import display

with pd.option_context('display.precision', 2):
    display(df.groupby(['A', 'B'])['C'].describe())

           count  mean   std   min   25%   50%   75%   max
A   B
bar one      1.0  0.40   NaN  0.40  0.40  0.40  0.40  0.40
    three    1.0  2.24   NaN  2.24  2.24  2.24  2.24  2.24
    two      1.0 -0.98   NaN -0.98 -0.98 -0.98 -0.98 -0.98
foo one      2.0  1.36  0.58  0.95  1.15  1.36  1.56  1.76
    three    1.0 -0.15   NaN -0.15 -0.15 -0.15 -0.15 -0.15
    two      2.0  1.42  0.63  0.98  1.20  1.42  1.65  1.87
``````

To get specific statistics, just select them:

``````df.groupby(['A', 'B'])['C'].describe()[['count', 'mean']]

           count      mean
A   B
bar one      1.0  0.400157
    three    1.0  2.240893
    two      1.0 -0.977278
foo one      2.0  1.357070
    three    1.0 -0.151357
    two      2.0  1.423148
``````

`describe` works for multiple columns (change `['C']` to `[['C', 'D']]`, or remove it altogether, and see what happens; the result is a dataframe with MultiIndex columns).
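As a rough sketch of what that multi-column result looks like and how to pull pieces out of it, the snippet below rebuilds a frame of the same shape as the setup (the random generator and the variable names `desc`, `c_mean` and `counts` are assumptions made for this illustration, not part of the answer):

```python
import numpy as np
import pandas as pd

# Same layout as the setup frame; values are random, so no output is shown.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': rng.standard_normal(8),
    'D': rng.standard_normal(8),
})

desc = df.groupby(['A', 'B'])[['C', 'D']].describe()

# The columns form a two-level MultiIndex: (original column, statistic).
print(desc.columns.nlevels)

# Select one statistic of one column with a tuple key ...
c_mean = desc[('C', 'mean')]

# ... or the same statistic for every column with xs() on the inner level.
counts = desc.xs('count', axis=1, level=1)
print(counts)
```

Selecting with `xs` leaves a frame with plain `C` and `D` columns, which is often easier to work with than the full nested result.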

You also get different statistics for string data. Here's an example:

``````df2 = df.assign(D=list('aaabbccc')).sample(n=100, replace=True)
``````

``````with pd.option_context('display.precision', 2):
    display(df2.groupby(['A', 'B'])
               .describe(include='all')
               .dropna(how='all', axis=1))

              C                                                   D
          count  mean       std   min   25%   50%   75%   max count unique top freq
A   B
bar one    14.0  0.40  5.76e-17  0.40  0.40  0.40  0.40  0.40    14      1   a   14
    three  14.0  2.24  4.61e-16  2.24  2.24  2.24  2.24  2.24    14      1   b   14
    two     9.0 -0.98  0.00e+00 -0.98 -0.98 -0.98 -0.98 -0.98     9      1   c    9
foo one    22.0  1.43  4.10e-01  0.95  0.95  1.76  1.76  1.76    22      2   a   13
    three  15.0 -0.15  0.00e+00 -0.15 -0.15 -0.15 -0.15 -0.15    15      1   c   15
    two    26.0  1.49  4.48e-01  0.98  0.98  1.87  1.87  1.87    26      2   b   15
``````

For more information, see the documentation.

#### @Mahendra 2019-04-11 14:05:06

Create a groupby object and call methods on it, as in the example below:

``````grp = df.groupby(['col1',  'col2',  'col3'])

grp.max()
grp.mean()
grp.describe()
``````

## Quick Answer:

The simplest way to get row counts per group is by calling `.size()`, which returns a `Series`:

``````df.groupby(['col1','col2']).size()
``````

Usually you want this result as a `DataFrame` (instead of a `Series`) so you can do:

``````df.groupby(['col1', 'col2']).size().reset_index(name='counts')
``````

If you want to find out how to calculate the row counts and other statistics for each group, continue reading below.

## Detailed example:

Consider the following example dataframe:

``````In : df
Out:
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17
``````

First let's use `.size()` to get the row counts:

``````In : df.groupby(['col1', 'col2']).size()
Out:
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64
``````

Then let's use `.size().reset_index(name='counts')` to get the row counts as a `DataFrame`:

``````In : df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out:
  col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1
``````

### Including results for more statistics

When you want to calculate statistics on grouped data, it usually looks like this:

``````In : (df
...: .groupby(['col1', 'col2'])
...: .agg({
...:     'col3': ['mean', 'count'],
...:     'col4': ['median', 'min', 'count']
...: }))
Out:
            col4                  col3
          median   min count      mean count
col1 col2
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1
``````

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.
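To see why per-column counts can differ from the group size, here is a small sketch on a hypothetical toy frame (the data and the names `sizes`, `non_null` and `means` are made up for illustration). `size()` counts rows, while `count()` counts non-null values in each column, and `mean()` skips `NaN` silently:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in col3.
df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'C', 'C'],
    'col3': [1.0, np.nan, 3.0, 4.0, 6.0],
    'col4': [10.0, 20.0, 30.0, 40.0, 50.0],
})

gb = df.groupby('col1')

sizes = gb.size()                         # rows per group, NaN rows included
non_null = gb[['col3', 'col4']].count()   # non-null values per column
means = gb['col3'].mean()                 # NaN entries are skipped

# Group A has 3 rows, but only 2 of them contribute to the col3 mean.
print(sizes['A'], non_null.loc['A', 'col3'], means['A'])  # 3 2 2.0
```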

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using `join`. It looks like this:

``````In : gb = df.groupby(['col1', 'col2'])
...: counts = gb.size().to_frame(name='counts')
...: (counts
...:  .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
...:  .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
...:  .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
...:  .reset_index()
...: )
...:
Out:
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63
``````
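A similar flat table can probably be produced in a single call with named aggregation, assuming pandas 0.25 or newer. This is an alternative to the `join` approach above, not what the answer itself uses, and the toy data below is hypothetical:

```python
import pandas as pd

# Hypothetical toy data reusing the column names from the example.
df = pd.DataFrame({
    'col1': ['A', 'A', 'C', 'C', 'C'],
    'col2': ['B', 'B', 'D', 'D', 'D'],
    'col3': [1.0, 3.0, 2.0, 4.0, 6.0],
    'col4': [5.0, 7.0, 1.0, 2.0, 9.0],
})

# Each keyword names an output column: name=(input column, aggregation).
result = (
    df.groupby(['col1', 'col2'])
      .agg(
          counts=('col3', 'size'),          # rows per group, NaN included
          col3_mean=('col3', 'mean'),
          col4_median=('col4', 'median'),
          col4_min=('col4', 'min'),
      )
      .reset_index()
)
print(result)
```

The output columns come out already flat and already named, so no `rename` or `join` step is needed.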

### Footnotes

The code used to generate the test data is shown below:

``````In : import numpy as np
...: import pandas as pd
...:
...: keys = np.array([
...:         ['A', 'B'],
...:         ['A', 'B'],
...:         ['A', 'B'],
...:         ['A', 'B'],
...:         ['C', 'D'],
...:         ['C', 'D'],
...:         ['C', 'D'],
...:         ['E', 'F'],
...:         ['E', 'F'],
...:         ['G', 'H']
...:         ])
...:
...: df = pd.DataFrame(
...:     np.hstack([keys,np.random.randn(10,4).round(2)]),
...:     columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
...: )
...:
...: df[['col3', 'col4', 'col5', 'col6']] = \
...:     df[['col3', 'col4', 'col5', 'col6']].astype(float)
...:
``````

Disclaimer:

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop `NaN` entries in the mean calculation without telling you about it.

#### @Quickbeam2k1 2016-08-17 11:26:44

Hey, I really like your solution, particularly the last one, where you use method chaining. However, since it is often necessary to apply different aggregation functions to different columns, one could also concatenate the resulting data frames using `pd.concat`. This may be easier to read than subsequent chaining.

#### @LancelotHolmes 2017-02-28 02:35:19

Nice solution, but for `In : counts_df = pd.DataFrame(df.groupby('col1').size().rename('counts'))`, maybe it's better to set the `size()` result as a new column if you'd like to manipulate the dataframe for further analysis, which would be `counts_df = pd.DataFrame(df.groupby('col1').size().reset_index(name='counts'))`

#### @Nickolay 2018-05-28 08:17:22

Thanks for the "Including results for more statistics" bit! Since my next search was about flattening the resulting multiindex on columns, I'll link to the answer here: stackoverflow.com/a/50558529/1026

#### @Peter.k 2019-01-18 10:31:03

Great! Could you please give me a hint how to add `isnull` to this query to have it in one column as well? `'col4': ['median', 'min', 'count', 'isnull']`

#### @Nimesh 2017-11-27 09:17:56

We can do this easily with `groupby` and `count`, but we should remember to call `reset_index()`.

``````df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().\
reset_index()
``````

#### @Adrien Pacifico 2018-07-09 00:59:40

This solution works as long as there are no null values in the columns; otherwise it can be misleading (the count will be lower than the actual number of observations per group).

#### @Boud 2013-10-15 15:49:28

On a `groupby` object, the `agg` function can take a list to apply several aggregation methods at once. This should give you the result you need:

``````df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
``````

#### @rysqui 2014-12-17 06:14:56

I think you need the column reference to be a list. Do you perhaps mean: `df[['col1','col2','col3','col4']].groupby(['col1','col2']).agg(['mean', 'count'])`

#### @Jaan 2015-07-22 06:58:09

This creates four count columns, but how to get only one? (The question asks for "an additional column" and that's what I would like too.)

#### @Pedro M Duarte 2015-09-26 19:43:37

Please see my answer if you want to get only one `count` column per group.

#### @Abhishek Bhatia 2017-10-02 21:28:32

What if I have a separate column called Counts and, instead of counting the rows in each group, I need to sum along the Counts column?
