By Abhishek Thakur


2014-03-06 08:31:09

I have a pandas data frame like:

a b
A 1
A 2
B 5
B 5
B 4
C 6

I want to group by the first column and get second column as lists in rows:

A [1,2]
B [5,5,4]
C [6]

Is it possible to do something like this using pandas groupby?


@EdChum - Reinstate Monica 2014-03-06 10:28:32

You can do this using groupby to group on the column of interest and then apply list to every group:

In [1]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
        df

Out[1]: 
   a  b
0  A  1
1  A  2
2  B  5
3  B  5
4  B  4
5  C  6

In [2]: df.groupby('a')['b'].apply(list)
Out[2]: 
a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

In [3]: df1 = df.groupby('a')['b'].apply(list).reset_index(name='new')
        df1
Out[3]: 
   a        new
0  A     [1, 2]
1  B  [5, 5, 4]
2  C        [6]

@Abhishek Thakur 2014-03-06 11:12:19

This takes a lot of time if the dataset is huge, say 10 million rows. Is there any faster way to do this? The number of unique values in 'a' is around 500k, though.

@EdChum - Reinstate Monica 2014-03-06 11:32:33

groupby is notoriously slow and memory hungry. What you could do is sort by column a, then find the idxmin and idxmax (probably store these in a dict) and use them to slice your dataframe; that would be faster, I think.

@EdChum - Reinstate Monica 2014-03-06 11:40:03

@AbhishekThakur actually that won't work, as idxmin will not work for strings; you would need to store the beginning and end index values.
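The sort-then-slice idea suggested above can be sketched roughly like this (a sketch, not EdChum's exact method: after sorting by the key column, np.unique with return_index=True gives each key block's starting position, so the value array can be sliced without a groupby):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6]})

# Sort by the key column so each key's rows form one contiguous block
sdf = df.sort_values('a')
keys = sdf['a'].to_numpy()
vals = sdf['b'].to_numpy()

# First index of each block marks the slice boundaries
ukeys, starts = np.unique(keys, return_index=True)

# Slice the value array at the boundaries instead of grouping
lists = {k: v.tolist() for k, v in zip(ukeys, np.split(vals, starts[1:]))}
```

This is essentially the approach B. M.'s numpy answer below benchmarks against groupby.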

@DSM 2014-03-06 12:21:49

Unless I'm missing something (no morning coffee yet) you're doing a separate groupby for each row.

@EdChum - Reinstate Monica 2014-03-06 13:00:14

@DSM Hmm, yes, that does appear to be the case.

@Andarin 2016-06-24 10:54:24

When I tried this solution with my problem (having multiple columns to group by and to collect), it didn't work: pandas raised 'Function does not reduce'. Then I used tuple, following the second answer here: stackoverflow.com/questions/19530568/… . See the second answer in stackoverflow.com/questions/27439023/… for an explanation.

@Poka 2018-08-12 03:26:08

@EdChum how do I select all rows if I have one more column, e.g. df.groupby('a')['b']['c']?

@Sriram Arvind Lakshmanakumar 2019-01-18 10:59:09

This solution is good, but is there a way to store a set instead of a list, meaning can I remove the duplicates and then store it?

@EdChum - Reinstate Monica 2019-01-18 11:02:43

you mean df.groupby('a')['b'].apply(lambda x: list(set(x)))
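Note that set() does not preserve the original row order within each group. If order matters, one possible alternative (a sketch, relying on Python 3.7+ insertion-ordered dicts) is dict.fromkeys:

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [2, 1, 5, 5, 4, 6]})

# dict.fromkeys deduplicates while keeping first-seen order;
# set() may reorder the values arbitrarily
deduped = df.groupby('a')['b'].apply(lambda x: list(dict.fromkeys(x)))
```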

@Catiger3331 2019-04-18 14:58:17

You don't need to use groupby. Just take a set of column 'a', subset the dataframe for 'A', 'B', etc., then fetch column 'b' in each subset and put those values in a list.

@Poete Maudit 2019-06-07 15:19:26

But how would the code be written if you had another column c whose numbers also had to be put in a list?

@EdChum - Reinstate Monica 2019-06-07 15:31:31

@PoeteMaudit Sorry, I don't understand what you're asking, and asking questions in comments is bad form on SO. Are you asking how to concatenate multiple columns into a single list?

@Poete Maudit 2019-06-07 15:32:36

No worries, I was asking for this basically: stackoverflow.com/a/53088007/9024698

@Edward Aung 2019-07-08 00:35:27

In pandas 0.23.x, apply does not work; I needed to use the agg function.

@EdChum - Reinstate Monica 2019-07-08 11:06:25

@EdwardAung this still works for me using pandas version '0.24.2', you'd have to post an example where this fails

@ic_fl2 2019-10-22 06:59:07

If you make it df.groupby('a')['b'].apply(list).apply(pd.Series) you get columns with one entry each instead of one column with lists, which can be very useful.

@Dave Liu 2019-11-12 00:39:04

Empirically, I found .apply(np.array) to be slightly faster on my 25K dataset.

@Vanshika 2019-07-04 17:07:02

If you are looking for a unique list while grouping multiple columns, this could probably help:

df.groupby('a').agg(lambda x: list(set(x))).reset_index()

@Ganesh Kharad 2019-06-10 11:33:24

Here the grouped elements are joined with "|" as a separator:

import pandas as pd

df = pd.read_csv('input.csv')

df
Out[1]:
  Area  Keywords
0    A         1
1    A         2
2    B         5
3    B         5
4    B         4
5    C         6

df.dropna(inplace=True)
df['Area'] = df['Area'].apply(lambda x: x.lower().strip())
print(df.columns)
df_op = df.groupby('Area').agg({'Keywords': lambda x: '|'.join(x.astype(str))})

df_op.to_csv('output.csv')

df_op
Out[2]:
     Keywords
Area
a         1|2
b       5|5|4
c           6

@cs95 2019-04-24 22:35:32

Use any of the following groupby and agg recipes.

# Setup
df = pd.DataFrame({
  'a': ['A', 'A', 'B', 'B', 'B', 'C'],
  'b': [1, 2, 5, 5, 4, 6],
  'c': ['x', 'y', 'z', 'x', 'y', 'z']
})
df

   a  b  c
0  A  1  x
1  A  2  y
2  B  5  z
3  B  5  x
4  B  4  y
5  C  6  z

To aggregate multiple columns as lists, use any of the following:

df.groupby('a').agg(list)
df.groupby('a').agg(pd.Series.tolist)

           b          c
a                      
A     [1, 2]     [x, y]
B  [5, 5, 4]  [z, x, y]
C        [6]        [z]

To group-listify a single column only, convert the groupby to a SeriesGroupBy object, then call SeriesGroupBy.agg:

df.groupby('a').agg({'b': list})  # 4.42 ms 
df.groupby('a')['b'].agg(list)    # 2.76 ms - faster

a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object
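As a variant of the dict form above, pandas 0.25+ also supports named aggregation, which lets you name the output column in the same call (a sketch; the output name b_list is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6]})

# Named aggregation: output column name on the left,
# (source column, aggregation func) tuple on the right
out = df.groupby('a').agg(b_list=('b', list)).reset_index()
```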

@Kai 2019-05-02 15:51:32

are the methods above guaranteed to preserve order? meaning that elements from the same row (but different columns, b and c in your code above) will have the same index in the resulting lists?

@cs95 2019-05-02 16:37:23

@Kai oh, good question. Yes and no. GroupBy sorts the output by the grouper key values. However the sort is generally stable so the relative ordering per group is preserved. To disable the sorting behavior entirely, use groupby(..., sort=False). Here, it'd make no difference since I'm grouping on column A which is already sorted.
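A quick way to see the sort behaviour described above: with unsorted keys, sort=False yields the groups in order of first appearance rather than sorted key order (a minimal demonstration):

```python
import pandas as pd

df = pd.DataFrame({'a': ['C', 'A', 'C', 'B'], 'b': [1, 2, 3, 4]})

# Default: group keys come out sorted -> A, B, C
sorted_keys = df.groupby('a')['b'].apply(list).index.tolist()

# sort=False: group keys in order of first appearance -> C, A, B
appearance = df.groupby('a', sort=False)['b'].apply(list).index.tolist()
```

In both cases the relative order of values within each group is preserved.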

@Kai 2019-05-02 17:28:17

I'm sorry, I don't understand your answer. Can you explain in more detail? I think this deserves its own question.

@Federico Gentile 2019-12-05 12:24:58

This is a very good answer! Is there also a way to make the values of the list unique? Something like .agg(pd.Series.tolist.unique), maybe?

@cs95 2019-12-05 14:48:24

@FedericoGentile you can use a lambda. Here's one way: df.groupby('a')['b'].agg(lambda x: list(set(x)))

@B. M. 2017-03-02 08:42:03

If performance is important, go down to the numpy level:

import numpy as np

df = pd.DataFrame({'a': np.random.randint(0, 60, 600), 'b': [1, 2, 5, 5, 4, 6]*100})

def f(df):
    keys, values = df.sort_values('a').values.T
    ukeys, index = np.unique(keys, True)
    arrays = np.split(values, index[1:])
    df2 = pd.DataFrame({'a': ukeys, 'b': [list(a) for a in arrays]})
    return df2

Tests:

In [301]: %timeit f(df)
1000 loops, best of 3: 1.64 ms per loop

In [302]: %timeit df.groupby('a')['b'].apply(list)
100 loops, best of 3: 5.26 ms per loop

@ru111 2019-03-12 17:35:59

How could we use this if we are grouping by two or more keys, e.g. with .groupby([df.index.month, df.index.day]) instead of just .groupby('a')?
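The numpy split trick above relies on a single sortable key column, so for multiple keys the simplest fallback is to pass a list of keys to groupby, giving up the numpy speedup (a sketch with two hypothetical grouping columns a and c):

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B'],
                   'c': ['x', 'y', 'x', 'x'],
                   'b': [1, 2, 5, 4]})

# Grouping on two keys produces a MultiIndex (a, c) on the result
out = df.groupby(['a', 'c'])['b'].apply(list)
```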

@YO and BEN_W 2018-11-30 20:59:27

Let us use df.groupby with a dict comprehension and the Series constructor:

pd.Series({x : y.b.tolist() for x , y in df.groupby('a')})
Out[664]: 
A       [1, 2]
B    [5, 5, 4]
C          [6]
dtype: object

@Markus Dutschke 2018-10-31 16:25:24

To solve this for several columns of a dataframe:

In [5]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6],'c'
   ...: :[3,3,3,4,4,4]})

In [6]: df
Out[6]: 
   a  b  c
0  A  1  3
1  A  2  3
2  B  5  3
3  B  5  4
4  B  4  4
5  C  6  4

In [7]: df.groupby('a').agg(lambda x: list(x))
Out[7]: 
           b          c
a                      
A     [1, 2]     [3, 3]
B  [5, 5, 4]  [3, 4, 4]
C        [6]        [4]

This answer was inspired from Anamika Modi's answer. Thank you!

@Anamika Modi 2018-09-27 06:28:03

A handy way to achieve this would be:

df.groupby('a').agg({'b':lambda x: list(x)})

Look into writing Custom Aggregations: https://www.kaggle.com/akshaysehgal/how-to-group-by-aggregate-using-py

@BallpointBen 2018-10-11 17:43:13

lambda args: f(args) is equivalent to f

@cs95 2019-06-07 15:31:34

Actually, just agg(list) is enough. Also see here.

@Acorbe 2014-03-06 10:12:46

As you were saying, the groupby method of a pd.DataFrame object can do the job.

Example

 L = ['A', 'A', 'B', 'B', 'B', 'C']
 N = [1, 2, 5, 5, 4, 6]

 import pandas as pd
 df = pd.DataFrame(list(zip(L, N)), columns=list('LN'))

 groups = df.groupby(df.L)

 groups.groups
      {'A': [0, 1], 'B': [2, 3, 4], 'C': [5]}

which gives an index-wise description of the groups.

To get elements of single groups, you can do, for instance

 groups.get_group('A')

     L  N
  0  A  1
  1  A  2

  groups.get_group('B')

     L  N
  2  B  5
  3  B  5
  4  B  4
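To go from these groups to the list-per-key layout asked for in the question, one option (a sketch reusing the example above) is to iterate over the groupby object, pulling each group's N column out as a list:

```python
import pandas as pd

L = ['A', 'A', 'B', 'B', 'B', 'C']
N = [1, 2, 5, 5, 4, 6]
df = pd.DataFrame(list(zip(L, N)), columns=list('LN'))
groups = df.groupby('L')

# Iterating a groupby yields (key, sub-DataFrame) pairs
lists = {key: grp['N'].tolist() for key, grp in groups}
```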
