By PointXIV


2014-04-30 17:29:50 8 Comments

I'm using groupby on a pandas dataframe to drop all rows that don't have the minimum of a specific column. Something like this:

df1 = df.groupby("item", as_index=False)["diff"].min()

However, if I have more than those two columns, the other columns (e.g. otherstuff in my example) get dropped. Can I keep those columns using groupby, or am I going to have to find a different way to drop the rows?

My data looks like:

    item    diff   otherstuff
   0   1       2            1
   1   1       1            2
   2   1       3            7
   3   2      -1            0
   4   2       1            3
   5   2       4            9
   6   2      -6            2
   7   3       0            0
   8   3       2            9

and should end up like:

    item   diff  otherstuff
   0   1      1           2
   1   2     -6           2
   2   3      0           0

but what I'm getting is:

    item   diff
   0   1      1           
   1   2     -6           
   2   3      0                 

I've been looking through the documentation and can't find anything. I tried:

df1 = df.groupby(["item", "otherstuff"], as_index=false)["diff"].min()

df1 = df.groupby("item", as_index=false)["diff"].min()["otherstuff"]

df1 = df.groupby("item", as_index=false)["otherstuff", "diff"].min()

But none of those work (I realized with the last one that the syntax is meant for aggregating after a group is created).

4 comments

@jezrael 2019-01-08 07:43:51

You can use DataFrame.sort_values with DataFrame.drop_duplicates:

df = df.sort_values(by='diff').drop_duplicates(subset='item')
print (df)
   item  diff  otherstuff
6     2    -6           2
7     3     0           0
1     1     1           2

If possible multiple minimal values per groups and want all min rows use boolean indexing with transform for minimal values per groups:

print (df)
   item  diff  otherstuff
0     1     2           1
1     1     1           2 <-multiple min
2     1     1           7 <-multiple min
3     2    -1           0
4     2     1           3
5     2     4           9
6     2    -6           2
7     3     0           0
8     3     2           9

print (df.groupby("item")["diff"].transform('min'))
0    1
1    1
2    1
3   -6
4   -6
5   -6
6   -6
7    0
8    0
Name: diff, dtype: int64

df = df[df.groupby("item")["diff"].transform('min') == df['diff']]
print (df)
   item  diff  otherstuff
1     1     1           2
2     1     1           7
6     2    -6           2
7     3     0           0

@Nic Scozzaro 2018-09-17 22:01:59

If you know that all of your "items" have more than one record you can sort, then use duplicated:

df.sort_values(by='diff').duplicated(subset='item', keep='first')

@citynorman 2018-03-25 13:04:15

The above answer worked great if there is / you want one min. In my case there could be multiple mins and I wanted all rows equal to min which .idxmin() doesn't give you. This worked

def filter_group(dfg, col):
    return dfg[dfg[col] == dfg[col].min()]

df = pd.DataFrame({'g': ['a'] * 6 + ['b'] * 6, 'v1': (list(range(3)) + list(range(3))) * 2, 'v2': range(12)})
df.groupby('g',group_keys=False).apply(lambda x: filter_group(x,'v1'))

As an aside, .filter() is also relevant to this question but didn't work for me.

@DSM 2014-04-30 17:43:00

Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:

>>> df.loc[df.groupby("item")["diff"].idxmin()]
   item  diff  otherstuff
1     1     1           2
6     2    -6           2
7     3     0           0

[3 rows x 3 columns]

Method #2: sort by diff, and then take the first element in each item group:

>>> df.sort_values("diff").groupby("item", as_index=False).first()
   item  diff  otherstuff
0     1     1           2
1     2    -6           2
2     3     0           0

[3 rows x 3 columns]

Note that the resulting indices are different even though the row content is the same.

@PointXIV 2014-04-30 17:48:27

The first returns all NaN, but the second method works. Thank you!

@sfortney 2014-08-21 20:57:37

Thank you for providing two ways of doing this. I have found multiple instances that one works when the other does not and visa versa. Seems like it depends on if your elements are numerical or not. Good to have both.

@Fractaly 2019-01-26 04:59:23

Thanks for this! I found the first one to be very slow and incorrect, but the second one was fast and did just what I wanted. I was using multiple indices simultaneously, and that may have messed with things.

Related Questions

Sponsored Content

27 Answered Questions

[SOLVED] Renaming columns in pandas

18 Answered Questions

[SOLVED] Selecting multiple columns in a pandas dataframe

15 Answered Questions

[SOLVED] Delete column from pandas DataFrame

7 Answered Questions

[SOLVED] Apply multiple functions to multiple groupby columns

2 Answered Questions

[SOLVED] How to pivot a dataframe

11 Answered Questions

[SOLVED] Get the Row(s) which have the max count in groups using groupby

16 Answered Questions

[SOLVED] "Large data" work flows using pandas

10 Answered Questions

[SOLVED] How to select rows from a DataFrame based on column values?

6 Answered Questions

[SOLVED] Pandas DataFrame Groupby two columns and get counts

3 Answered Questions

[SOLVED] Pandas sum by groupby, but exclude certain columns

Sponsored Content