By jojo12


2013-03-29 14:48:13 8 Comments

How do I find all rows in a pandas DataFrame which have the max value for the count column, after grouping by the ['Sp','Mt'] columns?

Example 1: the following DataFrame, which I group by ['Sp','Mt']:

    Sp  Mt Value  count
0  MM1  S1     a      3
1  MM1  S1     n      2
2  MM1  S3    cb      5
3  MM2  S3    mk      8
4  MM2  S4    bg     10
5  MM2  S4   dgd      1
6  MM4  S2    rd      2
7  MM4  S2    cb      2
8  MM4  S2   uyi      7

Expected output: the rows whose count is the max within each group, like:

0  MM1  S1     a      3
3  MM2  S3    mk      8
4  MM2  S4    bg     10
8  MM4  S2   uyi      7

Example 2: this dataframe, which I group by ['Sp','Mt']:

    Sp  Mt Value  count
4  MM2  S4    bg     10
5  MM2  S4   dgd      1
6  MM4  S2    rd      2
7  MM4  S2    cb      8
8  MM4  S2   uyi      8

For this example, I want to get all the rows where count equals the max within each group, e.g.:

MM2  S4   bg     10
MM4  S2   cb     8
MM4  S2   uyi    8
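
For reference, a minimal snippet that builds the Example 1 DataFrame with the column names used in the question (several answers below rebuild it with lowercase column names):

import pandas as pd

df = pd.DataFrame({
    'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'Value': ['a', 'n', 'cb', 'mk', 'bg', 'dgd', 'rd', 'cb', 'uyi'],
    'count': [3, 2, 5, 8, 10, 1, 2, 2, 7]
})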


@escha 2019-07-08 16:37:59

Try using nlargest on the groupby object. The advantage of nlargest is that it returns the index of the rows the largest item(s) were fetched from. Note: we take the second element (index 1) of each index entry, since the index in this case consists of tuples (e.g. ('S1', 0)).

import pandas as pd

df = pd.DataFrame({
    'sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'val': ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
    'count': [3, 2, 5, 8, 10, 1, 2, 2, 7]
})

d = df.groupby('mt')['count'].nlargest(1) # pass 1 since we want the max

df.iloc[[i[1] for i in d.index], :] # take the original row positions from the index of d

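If you group by both columns, as in the question, the group keys add extra index levels; a small sketch of that adaptation (not part of the original answer), taking the last index level instead:

d = df.groupby(['sp', 'mt'])['count'].nlargest(1)  # index entries are (sp, mt, original row label)

df.iloc[[i[-1] for i in d.index], :]  # the last element is the original row label (positional here, since df uses the default RangeIndex)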

@Surya 2019-04-10 02:38:11

Applying nlargest to the groupby object works just as well.

An additional advantage: it can also fetch the top n values if required:

In [85]: import pandas as pd

In [86]: df = pd.DataFrame({
    ...: 'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
    ...: 'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    ...: 'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
    ...: 'count' : [3,2,5,8,10,1,2,2,7]
    ...: })

## Apply nlargest(1) to find the max row per group; nlargest(n) gives the top n rows per group:
In [87]: df.groupby(["sp", "mt"]).apply(lambda x: x.nlargest(1, "count")).reset_index(drop=True)
Out[87]:
   count  mt   sp  val
0      3  S1  MM1    a
1      5  S3  MM1   cb
2      8  S3  MM2   mk
3     10  S4  MM2   bg
4      7  S2  MM4  uyi
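
For instance, a quick sketch (same df as above) fetching the top two rows per group rather than just the max:

df.groupby(["sp", "mt"]).apply(lambda x: x.nlargest(2, "count")).reset_index(drop=True)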

@WeNYoBen 2019-01-04 14:55:23

You may not need groupby at all; use sort_values + drop_duplicates:

df.sort_values('count').drop_duplicates(['Sp','Mt'],keep='last')
Out[190]: 
    Sp  Mt Value  count
0  MM1  S1     a      3
2  MM1  S3    cb      5
8  MM4  S2   uyi      7
3  MM2  S3    mk      8
4  MM2  S4    bg     10

Almost the same logic, using tail:

df.sort_values('count').groupby(['Sp', 'Mt']).tail(1)
Out[52]: 
    Sp  Mt Value  count
0  MM1  S1     a      3
2  MM1  S3    cb      5
8  MM4  S2   uyi      7
3  MM2  S3    mk      8
4  MM2  S4    bg     10

@joh-mue 2019-01-14 10:03:03

I've been using this functional style for many group operations:

import pandas as pd

df = pd.DataFrame({
   'Sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
   'Mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
   'Val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
   'Count' : [3,2,5,8,10,1,2,2,7]
})

df.groupby('Mt')\
  .apply(lambda group: group[group.Count == group.Count.max()])\
  .reset_index(drop=True)

    Sp  Mt  Val  Count
0  MM1  S1    a      3
1  MM4  S2  uyi      7
2  MM2  S3   mk      8
3  MM2  S4   bg     10

.reset_index(drop=True) drops the group index that apply adds (along with the original index) and leaves a plain RangeIndex.
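
The same pattern works with both grouping columns from the question; a small variant (not part of the original answer):

df.groupby(['Sp', 'Mt'])\
  .apply(lambda group: group[group.Count == group.Count.max()])\
  .reset_index(drop=True)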

@George Liu 2018-08-08 18:25:07

import pandas as pd

df = pd.DataFrame({
    'sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'val': ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
    'count': [3, 2, 5, 8, 10, 1, 2, 2, 7]
})

df.groupby(['sp', 'mt']).apply(lambda grp: grp.nlargest(1, 'count'))
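
The result of this apply carries the group keys as extra index levels on top of the original row labels; a hedged sketch (assuming pandas >= 0.24, which added DataFrame.droplevel) to strip them again:

df.groupby(['sp', 'mt']).apply(lambda grp: grp.nlargest(1, 'count')).droplevel(['sp', 'mt'])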

@blueear 2018-07-24 10:45:21

Use groupby and idxmax methods:

  1. convert the date column to datetime:

    df['date']=pd.to_datetime(df['date'])
    
  2. get the index of the max of the date column, after groupby on ad_id:

    idx=df.groupby(by='ad_id')['date'].idxmax()
    
  3. get the wanted data:

    df_max = df.loc[idx, :]
    

Out[54]:

   ad_id  price       date
7     22      2 2018-06-11
6     23      2 2018-06-22
2     24      2 2018-06-30
3     28      5 2018-06-22

@Surya 2017-07-07 00:40:35

An easy solution would be to apply the idxmax() function to get the indices of rows with max values. This selects the rows with the max value in each group.

In [365]: import pandas as pd

In [366]: df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})

In [367]: df                                                                                                       
Out[367]: 
   count  mt   sp  val
0      3  S1  MM1    a
1      2  S1  MM1    n
2      5  S3  MM1   cb
3      8  S3  MM2   mk
4     10  S4  MM2   bg
5      1  S4  MM2  dgb
6      2  S2  MM4   rd
7      2  S2  MM4   cb
8      7  S2  MM4  uyi


### Apply idxmax() and use .loc[] on the dataframe to select the rows with max values:
In [368]: df.loc[df.groupby(["sp", "mt"])["count"].idxmax()]                                                       
Out[368]: 
   count  mt   sp  val
0      3  S1  MM1    a
2      5  S3  MM1   cb
3      8  S3  MM2   mk
4     10  S4  MM2   bg
8      7  S2  MM4  uyi

### Just to show what values are returned by .idxmax() above:
In [369]: df.groupby(["sp", "mt"])["count"].idxmax().values                                                        
Out[369]: array([0, 2, 3, 4, 8])

@Max Power 2017-12-19 11:55:23

The questioner here specified "I want to get ALL the rows where count equals max in each group", while idxmax "return[s] index of first occurrence of maximum over requested axis" according to the docs (0.21).
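
A small sketch of that difference, using the data from the question's second example, where the max is tied (df2 is just a local name for that data; the transform approach appears in other answers on this page):

import pandas as pd

df2 = pd.DataFrame({
    'Sp': ['MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'Mt': ['S4', 'S4', 'S2', 'S2', 'S2'],
    'Value': ['bg', 'dgd', 'rd', 'cb', 'uyi'],
    'count': [10, 1, 2, 8, 8]
})

df2.loc[df2.groupby(['Sp', 'Mt'])['count'].idxmax()]                      # keeps only the first of the tied rows

df2[df2['count'] == df2.groupby(['Sp', 'Mt'])['count'].transform('max')]  # keeps both tied rows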

@Rani 2016-11-16 10:14:22

You can sort the DataFrame by count and then remove duplicates. I think it's easier:

df.sort_values('count', ascending=False).drop_duplicates(['Sp','Mt'])

@Qy Zuo 2017-07-20 08:14:05

Awesome! It works well with NaNs.

@Nolan Conaway 2017-09-27 18:23:53

Very nice! Fast with largish frames (25k rows)

@Samir 2018-12-27 17:14:52

For those who are somewhat new to Python: you will need to assign this to a new variable; it doesn't change the current df variable.

@TMrtSmith 2019-02-04 13:11:47

@Samir or use inplace = True as an argument to drop_duplicates

@Woods Chen 2019-04-10 02:50:40

This is a great answer when you need only one of the rows sharing the max value; however, it won't work as expected if I need all the rows with max values.

@Rani 2019-04-10 04:28:11

@WoodsChen it should work. It keeps the max value per key of [sp,mt]

@Woods Chen 2019-04-11 09:37:07

I mean, if the dataframe is pd.DataFrame({'sp': [1, 1, 2], 'mt': [1, 1, 2], 'value': [2, 2, 3]}), then there will be 2 rows with the same max value 2 in the group where sp==1 and mt==1. @Rani

@Rani 2019-04-11 15:54:37

@WoodsChen, it drops duplicates of [sp, mt], therefore in your example, the output should be only one row.
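
A quick sketch of the example under discussion (data taken from the comment above), confirming that drop_duplicates keeps a single row per (sp, mt) key even when the max is tied:

import pandas as pd

d = pd.DataFrame({'sp': [1, 1, 2], 'mt': [1, 1, 2], 'value': [2, 2, 3]})

d.sort_values('value', ascending=False).drop_duplicates(['sp', 'mt'])  # one row per group; the tie is collapsed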

@PAC 2015-07-02 12:52:33

For me, the easiest solution is to keep the rows where count is equal to the group maximum. The following one-line command is enough:

df[df['count'] == df.groupby(['Mt'])['count'].transform(max)]

@landewednack 2014-02-11 17:54:50

Having tried the solution suggested by Zelazny7 on a relatively large DataFrame (~400k rows), I found it to be very slow. Here is an alternative that runs orders of magnitude faster on my data set.

import pandas as pd

df = pd.DataFrame({
    'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
    'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
    'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
    'count' : [3,2,5,8,10,1,2,2,7]
    })

# compute the max count per (sp, mt) group
df_grouped = df.groupby(['sp', 'mt']).agg({'count': 'max'})

# turn the group keys back into ordinary columns
df_grouped = df_grouped.reset_index()

# rename so the merge below doesn't clash with the original 'count' column
df_grouped = df_grouped.rename(columns={'count': 'count_max'})

# attach each row's group maximum to the original frame
df = pd.merge(df, df_grouped, how='left', on=['sp', 'mt'])

# keep only the rows that reach their group maximum
df = df[df['count'] == df['count_max']]

@goh 2014-07-11 06:30:52

Indeed, this is much faster; transform seems to be slow for large datasets.

@tommy.carstensen 2017-03-26 00:47:26

Can you add comments to explain what each line does?

@Roland 2017-05-04 21:25:45

FWIW: I found the more elegant-looking solution from @Zelazny7 took a long time to execute for my set of ~100K rows, but this one ran pretty quickly. (I'm running a now way-obsolete 0.13.0, which might account for the slowness.)

@Qy Zuo 2017-07-20 07:38:28

But doing df[df['count'] == df['count_max']] will lose NaN rows, as will the answers above.

@Touya D. Serdan 2018-09-18 05:37:42

I highly suggest using this approach; for bigger data frames it is much faster than using .apply() or .agg().

@Zelazny7 2013-03-29 15:09:40

In [1]: df
Out[1]:
    Sp  Mt Value  count
0  MM1  S1     a      3
1  MM1  S1     n      2
2  MM1  S3    cb      5
3  MM2  S3    mk      8
4  MM2  S4    bg     10
5  MM2  S4   dgd      1
6  MM4  S2    rd      2
7  MM4  S2    cb      2
8  MM4  S2   uyi      7

In [2]: df.groupby(['Mt'], sort=False)['count'].max()
Out[2]:
Mt
S1     3
S3     8
S4    10
S2     7
Name: count

To select the matching rows of the original DF you can do:

In [3]: idx = df.groupby(['Mt'])['count'].transform(max) == df['count']

In [4]: df[idx]
Out[4]:
    Sp  Mt Value  count
0  MM1  S1     a      3
3  MM2  S3    mk      8
4  MM2  S4    bg     10
8  MM4  S2   uyi      7

Note that if you have multiple max values per group, all will be returned.

Update

On a hail mary chance that this is what the OP is requesting:

In [5]: df['count_max'] = df.groupby(['Mt'])['count'].transform(max)

In [6]: df
Out[6]:
    Sp  Mt Value  count  count_max
0  MM1  S1     a      3          3
1  MM1  S1     n      2          3
2  MM1  S3    cb      5          8
3  MM2  S3    mk      8          8
4  MM2  S4    bg     10         10
5  MM2  S4   dgd      1         10
6  MM4  S2    rd      2          7
7  MM4  S2    cb      2          7
8  MM4  S2   uyi      7          7
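
Filtering on that column then yields the same rows as the boolean mask above:

df[df['count'] == df['count_max']]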

@Zelazny7 2013-03-29 15:37:28

Can you please edit your question using code formatting to clarify your request?

@jojo12 2013-03-29 16:19:17

I'm sorry, but I don't know how to format code in a comment, so I posted my details following the first question.

@mathlover 2017-11-15 20:35:01

@Zelazny7, is there a way to adapt this answer to grouping by a column, then looking at 2 columns and taking the max of them to get the greater of the two? I can't get that to work. What I currently have is: def Greater(Merge, maximumA, maximumB): a = Merge[maximumA]; b = Merge[maximumB]; return max(a, b) and then Merger.groupby("Search_Term").apply(Greater, "Ratio_x", "Ratio_y")

@Mike Palmice 2018-01-03 20:36:10

@Zelazny7 I'm using the second, idx approach. But I can only afford a single maximum for each group (and my data has a few duplicate maxes). Is there a way to get around this with your solution?

@Ladenkov Vladislav 2018-02-18 18:09:48

Actually, that does not work for me. I cannot track down the problem, because the dataframe is quite big, but the solution by @Rani works well.

@Zephyr 2018-11-13 15:51:32

Hi Zelazny, if I want to take the top 3 maximum rows instead of one max value, how can I tweak your code?

@Woods Chen 2019-04-10 02:54:17

The transform method may have poor performance when the data set is large; getting the max value first and then merging the dataframes will be better.
