By Abhishek Thakur


2014-03-06 08:31:09 8 Comments

I have a pandas data frame df like:

a b
A 1
A 2
B 5
B 5
B 4
C 6

I want to group by the first column and get second column as lists in rows:

A [1,2]
B [5,5,4]
C [6]

Is it possible to do something like this using pandas groupby?

12 comments

@Abhilash Awasthi 2020-08-23 08:56:46

This answer is based on @EdChum's comment on his own answer. The comment says:

groupby is notoriously slow and memory hungry. What you could do is sort by column A, then find the idxmin and idxmax (probably store these in a dict) and use them to slice your dataframe, which would be faster, I think.

Let's first create a dataframe with 500k categories in the first column and 20 million rows in total, similar in scale to what is mentioned in the question.

import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['a', 'b'])
df['a'] = (np.random.randint(low=0, high=500000, size=(20000000,))).astype(str)
df['b'] = list(range(20000000))
print(df.shape)
df.head()
# Sort data by first column 
df.sort_values(by=['a'], ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)

# Create a temp column
df['temp_idx'] = list(range(df.shape[0]))

# Take all values of b in a separate list
all_values_b = list(df.b.values)
print(len(all_values_b))
# For each category in column a, find min and max indexes
gp_df = df.groupby(['a']).agg({'temp_idx': ['min', 'max']})
gp_df.reset_index(inplace=True)
gp_df.columns = ['a', 'temp_idx_min', 'temp_idx_max']

# Now create the final list_b column by slicing the list of b values with each category's min and max indexes.
gp_df['list_b'] = gp_df[['temp_idx_min', 'temp_idx_max']].apply(lambda x: all_values_b[x.iloc[0]:x.iloc[1]+1], axis=1)

print(gp_df.shape)
gp_df.head()

The above code takes 2 minutes for 20 million rows and 500k categories in the first column.

@Metrd 2020-05-22 12:34:23

The easiest way I have seen to achieve most of the same thing, at least for one column, is similar to Anamika's answer, just with the tuple syntax for the aggregate function.

df.groupby('a').agg(b=('b','unique'), c=('c','unique'))
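A minimal sketch of the tuple (named aggregation) syntax above, using a small example frame that is assumed here rather than taken from this answer. Note that 'unique' drops repeated values and returns arrays, so an extra step is needed if plain Python lists are wanted:

```python
import pandas as pd

# Example data (assumed for illustration)
df = pd.DataFrame({
    'a': ['A', 'A', 'B', 'B', 'B', 'C'],
    'b': [1, 2, 5, 5, 4, 6],
    'c': ['x', 'y', 'z', 'x', 'y', 'z'],
})

# Named aggregation: output_column=(input_column, aggregation)
uniques = df.groupby('a').agg(b=('b', 'unique'), c=('c', 'unique'))

# 'unique' yields array objects; convert each cell to a plain list
as_lists = uniques.apply(lambda col: col.map(list))
print(as_lists)
```

If duplicates must be kept, 'unique' is the wrong aggregation and list should be used instead.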

@Mithril 2020-05-06 08:22:39

It is time to use agg instead of apply.

Given

df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6], 'c': [1,2,5,5,4,6]})

If you want multiple columns stacked into lists, with the result as a pd.DataFrame:

df.groupby('a')[['b', 'c']].agg(list)
# or 
df.groupby('a').agg(list)

If you want a single column as lists, with the result as a pd.Series:

df.groupby('a')['b'].agg(list)
#or
df.groupby('a')['b'].apply(list)

Note: producing a pd.DataFrame is about 10x slower than producing a pd.Series when you aggregate only a single column, so use the DataFrame form only in the multi-column case.

@Ganesh Kharad 2019-06-10 11:33:24

Here I have grouped elements with "|" as a separator

    import pandas as pd

    df = pd.read_csv('input.csv')

    df
    Out[1]:
      Area  Keywords
    0  A  1
    1  A  2
    2  B  5
    3  B  5
    4  B  4
    5  C  6

    df.dropna(inplace =  True)
    df['Area']=df['Area'].apply(lambda x:x.lower().strip())
    print(df.columns)
    # Cast to str before joining, since Keywords holds integers
    df_op = df.groupby('Area').agg({"Keywords": lambda x: "|".join(x.astype(str))})

    df_op.to_csv('output.csv')
    Out[2]:
    df_op
    Area  Keywords

    A       [1| 2]
    B    [5| 5| 4]
    C          [6]
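A self-contained sketch of the separator idea, assuming the Keywords column holds integers as in the table shown (the frame is built inline instead of read from input.csv). The values are cast to strings first, because str.join only accepts strings:

```python
import pandas as pd

# Inline stand-in for the CSV contents (assumed for illustration)
df = pd.DataFrame({
    'Area': ['A', 'A', 'B', 'B', 'B', 'C'],
    'Keywords': [1, 2, 5, 5, 4, 6],
})

# One "|"-separated string per Area
joined = df.groupby('Area')['Keywords'].agg(lambda s: '|'.join(s.astype(str)))
print(joined)
```

The result is one string per group rather than a list, which is convenient when the output goes straight to a CSV file.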

@EdChum 2014-03-06 10:28:32

You can do this using groupby to group on the column of interest and then apply list to every group:

In [1]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
        df

Out[1]: 
   a  b
0  A  1
1  A  2
2  B  5
3  B  5
4  B  4
5  C  6

In [2]: df.groupby('a')['b'].apply(list)
Out[2]: 
a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

In [3]: df1 = df.groupby('a')['b'].apply(list).reset_index(name='new')
        df1
Out[3]: 
   a        new
0  A     [1, 2]
1  B  [5, 5, 4]
2  C        [6]

@Abhishek Thakur 2014-03-06 11:12:19

This takes a lot of time if the dataset is huge, say 10million rows. Is there any faster way to do this? The number of uniques in 'a' is however around 500k

@EdChum 2014-03-06 11:32:33

groupby is notoriously slow and memory hungry. What you could do is sort by column A, then find the idxmin and idxmax (probably store these in a dict) and use them to slice your dataframe, which would be faster, I think.
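One possible reading of this suggestion, as a rough sketch and not necessarily what the commenter had in mind: sort by the key column, locate where each key's run starts and ends, store those bounds in a dict, and slice the value array. The variable names are hypothetical and the frame is assumed to be non-empty:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6]})

# A stable sort keeps the original row order inside each key
sorted_df = df.sort_values('a', kind='stable').reset_index(drop=True)
keys = sorted_df['a'].to_numpy()
values = sorted_df['b'].to_numpy()

# Positions where a new key starts, and where each run ends (exclusive)
starts = np.flatnonzero(np.r_[True, keys[1:] != keys[:-1]])
ends = np.r_[starts[1:], len(keys)]

# Dict of key -> (start, end) slice bounds, then slice the values
bounds = {keys[s]: (s, e) for s, e in zip(starts, ends)}
lists = {k: values[s:e].tolist() for k, (s, e) in bounds.items()}
print(lists)
```

Whether this beats groupby depends on the data and the pandas version, so it is worth timing on the real dataset.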

@Andarin 2016-06-24 10:54:24

When I tried this solution with my problem (having multiple columns to group by and to group), it didn't work: pandas raised 'Function does not reduce'. Then I used tuple, following the second answer here: stackoverflow.com/questions/19530568/… . See the second answer in stackoverflow.com/questions/27439023/… for an explanation.
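A small sketch of the tuple variant described in this comment, assuming two grouping columns; the column names and data are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['A', 'A', 'B', 'B', 'B', 'C'],
    'c': [3, 3, 3, 4, 4, 4],
    'b': [1, 2, 5, 5, 4, 6],
})

# Group by two keys and collect each group's values as a tuple
tuples = df.groupby(['a', 'c'])['b'].apply(tuple)
print(tuples)
```

Tuples are immutable, so if lists are needed afterwards, tuples.map(list) converts them.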

@Sriram Arvind Lakshmanakumar 2019-01-18 10:59:09

This solution is good, but is there a way to store the set of each list, meaning can I remove the duplicates and then store it?

@EdChum 2019-01-18 11:02:43

you mean df.groupby('a')['b'].apply(lambda x:list(set(x)))
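One thing to keep in mind with list(set(x)) is that a set does not promise to keep the original order. If order matters, a possible alternative (a sketch, not part of the original comment) is dict.fromkeys, which keeps the first occurrence of each value:

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6]})

# Drop duplicates within each group while keeping first-seen order
deduped = df.groupby('a')['b'].agg(lambda s: list(dict.fromkeys(s)))
print(deduped)
```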

@Catiger3331 2019-04-18 14:58:17

You don't need to use groupby. Just take a set on column 'a', and do a subset to the dataframe of 'A', 'B', etc. Then fetch column 'b' in the subset and put those values in a list.
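The approach described in this comment could look roughly like the sketch below. It filters the frame once per distinct key, so it is likely to be slow when there are many keys (such as the 500k mentioned earlier in the thread):

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6]})

# For every distinct key, subset the frame and collect column 'b'
result = {key: df.loc[df['a'] == key, 'b'].tolist()
          for key in df['a'].unique()}
print(result)
```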

@Outcast 2019-06-07 15:19:26

But how would the code be written if you had another column c, also containing numbers, that had to be put in a list?

@EdChum 2019-06-07 15:31:31

@PoeteMaudit Sorry I don't understand what you're asking and asking questions in comments is bad form in SO. Are you asking how to concatenate multiple columns into a single list?

@Outcast 2019-06-07 15:32:36

No worries, I was asking for this basically: stackoverflow.com/a/53088007/9024698

@Edward Aung 2019-07-08 00:35:27

In pandas 0.23.x, apply does not work. I needed to use the 'agg' function.

@EdChum 2019-07-08 11:06:25

@EdwardAung this still works for me using pandas version '0.24.2', you'd have to post an example where this fails

@ic_fl2 2019-10-22 06:59:07

If you make it df.groupby('a')['b'].apply(list).apply(pd.Series), you get columns with one entry each instead of one column with lists, which can be very useful.
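To illustrate what this expansion produces, here is a short sketch on the example data. Groups with fewer values are padded with NaN, so the integer values come back as floats. The second variant, building the frame from the lists directly, is an assumption about an equivalent alternative and not part of the comment:

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6]})

lists = df.groupby('a')['b'].apply(list)

# Variant from the comment: one column per list position
wide = lists.apply(pd.Series)

# Alternative sketch: build the frame from the lists directly
wide_alt = pd.DataFrame(lists.tolist(), index=lists.index)
print(wide_alt)
```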

@Dave Liu 2019-11-12 00:39:04

Empirically, I found .apply(np.array) to be slightly faster on my 25K dataset.

@Vanshika 2019-07-04 17:07:02

If looking for a unique list while grouping multiple columns this could probably help:

df.groupby('a').agg(lambda x: list(set(x))).reset_index()

@cs95 2019-04-24 22:35:32

Use any of the following groupby and agg recipes.

# Setup
df = pd.DataFrame({
  'a': ['A', 'A', 'B', 'B', 'B', 'C'],
  'b': [1, 2, 5, 5, 4, 6],
  'c': ['x', 'y', 'z', 'x', 'y', 'z']
})
df

   a  b  c
0  A  1  x
1  A  2  y
2  B  5  z
3  B  5  x
4  B  4  y
5  C  6  z

To aggregate multiple columns as lists, use any of the following:

df.groupby('a').agg(list)
df.groupby('a').agg(pd.Series.tolist)

           b          c
a                      
A     [1, 2]     [x, y]
B  [5, 5, 4]  [z, x, y]
C        [6]        [z]

To group-listify a single column only, convert the groupby to a SeriesGroupBy object, then call SeriesGroupBy.agg. Use,

df.groupby('a').agg({'b': list})  # 4.42 ms 
df.groupby('a')['b'].agg(list)    # 2.76 ms - faster

a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

@Kai 2019-05-02 15:51:32

are the methods above guaranteed to preserve order? meaning that elements from the same row (but different columns, b and c in your code above) will have the same index in the resulting lists?

@cs95 2019-05-02 16:37:23

@Kai oh, good question. Yes and no. GroupBy sorts the output by the grouper key values. However the sort is generally stable so the relative ordering per group is preserved. To disable the sorting behavior entirely, use groupby(..., sort=False). Here, it'd make no difference since I'm grouping on column A which is already sorted.
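A small sketch of the difference, using made-up data in which the keys are not already sorted. With the default sort=True the groups come out in key order; with sort=False they come out in order of first appearance. In both cases the values inside each list keep their original row order:

```python
import pandas as pd

# Keys deliberately out of order (data assumed for illustration)
df = pd.DataFrame({'a': ['B', 'A', 'B', 'A', 'C'],
                   'b': [1, 2, 3, 4, 5]})

sorted_keys = df.groupby('a')['b'].agg(list)
first_seen = df.groupby('a', sort=False)['b'].agg(list)

print(list(sorted_keys.index))  # key order
print(list(first_seen.index))   # order of first appearance
```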

@Kai 2019-05-02 17:28:17

I'm sorry, I don't understand your answer. Can you explain in more detail? I think this deserves its own question.

@Federico Gentile 2019-12-05 12:24:58

This is a very good answer! Is there also a way to make the values of the list unique? something like .agg(pd.Series.tolist.unique) maybe?

@cs95 2019-12-05 14:48:24

@FedericoGentile you can use a lambda. Here's one way: df.groupby('a')['b'].agg(lambda x: list(set(x)))

@Moondra 2020-06-30 22:44:39

@cs95 Hi ColdSpeed! Is there a way to aggregate the list values of each column into one column? Instead of separate b and c columns, just create a single column that has all the values. Thank you.

@cs95 2020-06-30 23:07:14

@Moondra Not sure, perhaps you want df.groupby('a').agg(lambda x: x.to_numpy().ravel().tolist())
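Since agg works column by column, one possible way to merge the values of several columns into a single list per group is to apply a function to each group's sub-frame instead. This is only a sketch under that interpretation of the request; the values are interleaved row by row:

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['A', 'A', 'B', 'B', 'B', 'C'],
    'b': [1, 2, 5, 5, 4, 6],
    'c': ['x', 'y', 'z', 'x', 'y', 'z'],
})

# Flatten each group's b and c values into one list (row-major order)
combined = df.groupby('a')[['b', 'c']].apply(
    lambda g: g.to_numpy().ravel().tolist()
)
print(combined)
```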

@B. M. 2017-03-02 08:42:03

If performance is important go down to numpy level:

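import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 60, 600), 'b': [1, 2, 5, 5, 4, 6]*100})

def f(df):
    # Sort by key so that each group's rows are contiguous
    keys, values = df.sort_values('a').values.T
    # First position of each unique key in the sorted array
    ukeys, index = np.unique(keys, return_index=True)
    # Split the values at the group boundaries
    arrays = np.split(values, index[1:])
    df2 = pd.DataFrame({'a': ukeys, 'b': [list(a) for a in arrays]})
    return df2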

Tests:

In [301]: %timeit f(df)
1000 loops, best of 3: 1.64 ms per loop

In [302]: %timeit df.groupby('a')['b'].apply(list)
100 loops, best of 3: 5.26 ms per loop

@ru111 2019-03-12 17:35:59

How could we use this if we are grouping by two or more keys e.g. with .groupby([df.index.month, df.index.day]) instead of just .groupby('a')?

@BEN_YO 2018-11-30 20:59:27

Let us use df.groupby with tolist and the Series constructor:

pd.Series({x : y.b.tolist() for x , y in df.groupby('a')})
Out[664]: 
A       [1, 2]
B    [5, 5, 4]
C          [6]
dtype: object

@Markus Dutschke 2018-10-31 16:25:24

To solve this for several columns of a dataframe:

In [5]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6],'c'
   ...: :[3,3,3,4,4,4]})

In [6]: df
Out[6]: 
   a  b  c
0  A  1  3
1  A  2  3
2  B  5  3
3  B  5  4
4  B  4  4
5  C  6  4

In [7]: df.groupby('a').agg(lambda x: list(x))
Out[7]: 
           b          c
a                      
A     [1, 2]     [3, 3]
B  [5, 5, 4]  [3, 4, 4]
C        [6]        [4]

This answer was inspired by Anamika Modi's answer. Thank you!

@Anamika Modi 2018-09-27 06:28:03

A handy way to achieve this would be:

df.groupby('a').agg({'b':lambda x: list(x)})

Look into writing Custom Aggregations: https://www.kaggle.com/akshaysehgal/how-to-group-by-aggregate-using-py

@BallpointBen 2018-10-11 17:43:13

lambda args: f(args) is equivalent to f

@cs95 2019-06-07 15:31:34

Actually, just agg(list) is enough. Also see here.

@Akshay Sehgal 2020-04-08 01:11:00

I was just googling for some syntax and realised my own notebook was referenced for the solution, lol. Thanks for linking this. Just to add: since list is a Python builtin rather than a Series method, you can use it with apply on the selected column, df.groupby('a')['b'].apply(list), or with agg as part of a dict, df.groupby('a').agg({'b': list}). You could also use a lambda (which I recommend), since you can do so much more with it. Example: df.groupby('a').agg({'c': 'first', 'b': lambda x: x.unique().tolist()}), which applies a Series function to column c and a unique-then-list function to column b.
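For reference, the mixed aggregation mentioned in this comment could be run like this on a small example frame (the data is assumed for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['A', 'A', 'B', 'B', 'B', 'C'],
    'b': [1, 2, 5, 5, 4, 6],
    'c': ['x', 'y', 'z', 'x', 'y', 'z'],
})

# 'first' for column c, unique values as a list for column b
summary = df.groupby('a').agg({
    'c': 'first',
    'b': lambda x: x.unique().tolist(),
})
print(summary)
```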

@Acorbe 2014-03-06 10:12:46

As you were saying, the groupby method of a pd.DataFrame object can do the job.

Example

 import pandas as pd

 L = ['A','A','B','B','B','C']
 N = [1,2,5,5,4,6]

 df = pd.DataFrame(list(zip(L,N)), columns = list('LN'))


 groups = df.groupby(df.L)

 groups.groups
      {'A': [0, 1], 'B': [2, 3, 4], 'C': [5]}

which gives an index-wise description of the groups.

To get elements of single groups, you can do, for instance

 groups.get_group('A')

     L  N
  0  A  1
  1  A  2

  groups.get_group('B')

     L  N
  2  B  5
  3  B  5
  4  B  4
