By David D


2014-02-17 11:54:08 8 Comments

I can't figure out the difference between the pandas .aggregate and .apply functions.
Take the following as an example: I load a dataset, do a groupby, define a simple function, and use either .agg or .apply.

As you can see, the print statement inside my function produces the same output with .agg and .apply. The result, on the other hand, is different. Why is that?

import pandas as pd

iris = pd.read_csv('iris.csv')
by_species = iris.groupby('Species')

def f(x):
    # Show what pandas passes to the function, then return a constant.
    print(type(x))
    print(x.head(3))
    return 1

Using apply:

by_species.apply(f)
#<class 'pandas.core.frame.DataFrame'>
#   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
#0           5.1          3.5           1.4          0.2  setosa
#1           4.9          3.0           1.4          0.2  setosa
#2           4.7          3.2           1.3          0.2  setosa
#<class 'pandas.core.frame.DataFrame'>
#   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
#0           5.1          3.5           1.4          0.2  setosa
#1           4.9          3.0           1.4          0.2  setosa
#2           4.7          3.2           1.3          0.2  setosa
#<class 'pandas.core.frame.DataFrame'>
#    Sepal.Length  Sepal.Width  Petal.Length  Petal.Width     Species
#50           7.0          3.2           4.7          1.4  versicolor
#51           6.4          3.2           4.5          1.5  versicolor
#52           6.9          3.1           4.9          1.5  versicolor
#<class 'pandas.core.frame.DataFrame'>
#     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
#100           6.3          3.3           6.0          2.5  virginica
#101           5.8          2.7           5.1          1.9  virginica
#102           7.1          3.0           5.9          2.1  virginica
#Out[33]: 
#Species
#setosa        1
#versicolor    1
#virginica     1
#dtype: int64

Using agg:

by_species.agg(f)
#<class 'pandas.core.frame.DataFrame'>
#   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
#0           5.1          3.5           1.4          0.2  setosa
#1           4.9          3.0           1.4          0.2  setosa
#2           4.7          3.2           1.3          0.2  setosa
#<class 'pandas.core.frame.DataFrame'>
#    Sepal.Length  Sepal.Width  Petal.Length  Petal.Width     Species
#50           7.0          3.2           4.7          1.4  versicolor
#51           6.4          3.2           4.5          1.5  versicolor
#52           6.9          3.1           4.9          1.5  versicolor
#<class 'pandas.core.frame.DataFrame'>
#     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
#100           6.3          3.3           6.0          2.5  virginica
#101           5.8          2.7           5.1          1.9  virginica
#102           7.1          3.0           5.9          2.1  virginica
#Out[34]: 
#           Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
#Species                                                         
#setosa                 1            1             1            1
#versicolor             1            1             1            1
#virginica              1            1             1            1

4 comments

@Kunal 2019-10-26 10:32:15

The main difference between apply and aggregate is:

apply()-
    cannot take a dict of per-column functions and apply it to all groups at once
    For apply() - we have to get_group() first
    ERROR: iris.groupby('Species').apply({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})  # throws an error
    Works fine: iris.groupby('Species').get_group('setosa').apply({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
        # because the functions are applied to a single data frame

agg()-
    can be applied to all groups at once
    For agg() - we do not have to get_group()
    iris.groupby('Species').agg({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
    iris.groupby('Species').get_group('versicolor').agg({'Sepal.Length':['min','max'],'Sepal.Width':['mean','min']})
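
The contrast above can be sketched on a small made-up frame. The column names and values below are illustrative stand-ins, not the iris data, and this assumes a reasonably recent pandas; the exact error message from the groupby apply call may vary between versions.

```python
import pandas as pd

# Hypothetical stand-in for the iris data.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
    "length": [5.1, 4.9, 7.0, 6.4],
    "width": [3.5, 3.0, 3.2, 3.2],
})
spec = {"length": ["min", "max"], "width": ["mean", "min"]}

# agg accepts the dict directly on the groupby object: one row per group.
all_groups = df.groupby("species").agg(spec)

# On a single group (a plain DataFrame), agg accepts the same dict.
one_group = df.groupby("species").get_group("setosa").agg(spec)

# groupby apply expects a callable, so passing the dict fails.
try:
    df.groupby("species").apply(spec)
    apply_failed = False
except TypeError:
    apply_failed = True

print(all_groups)
print(one_group)
print("groupby apply rejected the dict:", apply_failed)
```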

@Martin Alexandersson 2018-08-17 15:59:33

When using apply on a groupby, I have found that .apply returns the grouped columns. There is a note in the documentation (pandas.pydata.org/pandas-docs/stable/groupby.html):

"...Thus the grouped columns(s) may be included in the output as well as set the indices."

.aggregate will not return the grouped columns.

@Surya 2016-05-03 05:18:57

(Note: These comparisons apply to DataFrameGroupBy objects.)

Some plausible advantages of .agg() over .apply() for DataFrameGroupBy objects:

  1. .agg() lets you apply several functions at once, or pass a list of functions for each column.

  2. It also lets you apply different functions to different columns of the DataFrame in one call.

That means you have a good deal of control over which operation is applied to each column.

Here is the link for more details: http://pandas.pydata.org/pandas-docs/version/0.13.1/groupby.html


However, .apply() can be limited to applying one function to each column of the DataFrame at a time, so you might have to call it repeatedly to run different operations on the same column.

Here are some example comparisons of .apply() and .agg() on DataFrameGroupBy objects:

Given the following dataframe:

In [261]: df = pd.DataFrame({"name":["Foo", "Baar", "Foo", "Baar"], "score_1":[5,10,15,10], "score_2" :[10,15,10,25], "score_3" : [10,20,30,40]})

In [262]: df
Out[262]: 
   name  score_1  score_2  score_3
0   Foo        5       10       10
1  Baar       10       15       20
2   Foo       15       10       30
3  Baar       10       25       40

Let's first look at the operations using .apply():

In [263]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.sum())
Out[263]: 
name  score_1
Baar  10         40
Foo   5          10
      15         10
Name: score_2, dtype: int64

In [264]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.min())
Out[264]: 
name  score_1
Baar  10         15
Foo   5          10
      15         10
Name: score_2, dtype: int64

In [265]: df.groupby(["name", "score_1"])["score_2"].apply(lambda x : x.mean())
Out[265]: 
name  score_1
Baar  10         20.0
Foo   5          10.0
      15         10.0
Name: score_2, dtype: float64

Now look at how .agg() does the same kind of work in a single call (np here is NumPy, imported with import numpy as np):

In [276]: df.groupby(["name", "score_1"]).agg({"score_3" :[np.sum, np.min, np.mean, np.max], "score_2":lambda x : x.mean()})
Out[276]: 
              score_2 score_3               
             <lambda>     sum amin mean amax
name score_1                                
Baar 10            20      60   20   30   40
Foo  5             10      10   10   10   10
     15            10      30   30   30   30
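
As far as I know, recent pandas releases prefer string function names over NumPy callables in .agg() and may warn about the latter. A minimal sketch of the same call with string names, assuming the df defined above (rebuilt here so the snippet runs on its own):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Foo", "Baar", "Foo", "Baar"],
    "score_1": [5, 10, 15, 10],
    "score_2": [10, 15, 10, 25],
    "score_3": [10, 20, 30, 40],
})

# Each column maps to a list of aggregation names; the result has
# two-level columns of the form (column, function).
summary = df.groupby(["name", "score_1"]).agg(
    {"score_3": ["sum", "min", "mean", "max"], "score_2": ["mean"]}
)
print(summary)
```

Individual cells can then be read with a (row key, column key) pair, for example summary.loc[("Baar", 10), ("score_3", "sum")].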

So .agg() can be really handy for DataFrameGroupBy objects compared to .apply(). But if you are working with plain DataFrame objects rather than DataFrameGroupBy objects, apply() can be very useful, since it can apply a function along either axis of the DataFrame.

(For example, with plain DataFrame objects, axis=0, the default, passes each column to the function, and axis=1 passes each row.)
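
A short sketch of the axis behaviour on a plain DataFrame. The two-row frame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"score_1": [5, 10], "score_2": [10, 15]})

# axis=0 (the default): the function receives each column as a Series.
col_sums = df.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives each row as a Series.
row_sums = df.apply(lambda row: row.sum(), axis=1)

print(col_sums)  # indexed by column name
print(row_sums)  # indexed by row label
```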

@Allen Wang 2019-06-17 20:12:38

Also, you can use apply if you need a function that accesses more than one column at once.
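
A weighted mean is one case where the function needs two columns of each group together. This is a minimal sketch; the frame, the column names and the weighted_mean helper are all hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0],
    "weight": [1.0, 3.0, 1.0, 1.0],
})

def weighted_mean(frame):
    # frame is the sub-DataFrame for one group, so both columns are available.
    return (frame["value"] * frame["weight"]).sum() / frame["weight"].sum()

# Select the needed columns explicitly, then apply once per group.
result = df.groupby("group")[["value", "weight"]].apply(weighted_mean)
print(result)
```

With .agg() and a plain function, each call would see only one column at a time, so the weights would not be visible while averaging the values.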

@TomAugspurger 2014-02-17 14:26:24

apply applies the function to each group (your Species). Your function returns 1, so you end up with 1 value for each of 3 groups.

agg aggregates each column (feature) for each group, so you end up with one value per column per group.

Do read the groupby docs; they're quite helpful. There are also a bunch of tutorials floating around the web.
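
The shape difference can be seen with a function that does real work instead of returning 1. This is a sketch on a made-up frame, assuming a recent pandas; value_range is a hypothetical helper.

```python
import pandas as pd

# Small made-up frame standing in for the iris data.
df = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "length": [1.0, 3.0, 5.0, 9.0],
    "width": [2.0, 4.0, 6.0, 8.0],
})
# Select the value columns explicitly so only they reach the function.
grouped = df.groupby("species")[["length", "width"]]

def value_range(x):
    # Works on a whole group (DataFrame) or on a single column (Series).
    values = x.to_numpy()
    return values.max() - values.min()

per_group = grouped.apply(value_range)  # one value per group
per_column = grouped.agg(value_range)   # one value per column per group

print(per_group)
print(per_column)
```

Here per_group is a Series with one entry per species, while per_column is a DataFrame with one row per species and one column per feature.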

@QM.py 2017-12-05 06:46:05

So if I want my function to work on each whole group, I should choose apply, and if I want it to work on a single column within each group, agg is the better choice.
