By Abhishek Thakur

2014-03-06 08:31:09 8 Comments

I have a pandas data frame like:

a b
A 1
A 2
B 5
B 5
B 4
C 6

I want to group by the first column and get second column as lists in rows:

A [1,2]
B [5,5,4]
C [6]

Is it possible to do something like this using pandas groupby?


@cs95 2019-04-24 22:35:32

Use any of the following groupby and agg recipes.

# Setup
df = pd.DataFrame({
    'a': ['A', 'A', 'B', 'B', 'B', 'C'],
    'b': [1, 2, 5, 5, 4, 6],
    'c': ['x', 'y', 'z', 'x', 'y', 'z']})

   a  b  c
0  A  1  x
1  A  2  y
2  B  5  z
3  B  5  x
4  B  4  y
5  C  6  z

To aggregate multiple columns as lists, use any of the following:

df.groupby('a').agg(list)
df.groupby('a').agg(lambda x: list(x))

           b          c
A     [1, 2]     [x, y]
B  [5, 5, 4]  [z, x, y]
C        [6]        [z]

To group-listify a single column only, convert the groupby to a SeriesGroupBy object, then call SeriesGroupBy.agg:

df.groupby('a').agg({'b': list})  # 4.42 ms 
df.groupby('a')['b'].agg(list)    # 2.76 ms - faster

A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object
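
The same agg pattern takes any callable; for instance (a sketch) aggregating with set instead of list drops duplicate values within each group:

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6]})

# Any callable works with agg; set drops duplicate values per group
out = df.groupby('a')['b'].agg(set)
print(out['B'])  # the two distinct values 4 and 5
```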

@Kai 2019-05-02 15:51:32

Are the methods above guaranteed to preserve order? Meaning, will elements from the same row (but different columns, b and c in your code above) have the same index in the resulting lists?

@cs95 2019-05-02 16:37:23

@Kai oh, good question. Yes and no. GroupBy sorts the output by the grouper key values. However, the sort is generally stable, so the relative ordering per group is preserved. To disable the sorting behavior entirely, use groupby(..., sort=False). Here, it'd make no difference since I'm grouping on column 'a', which is already sorted.
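
A quick sketch of that behavior with deliberately unsorted keys: sort=False keeps groups in first-appearance order, while values within each group keep their row order either way:

```python
import pandas as pd

df = pd.DataFrame({'a': ['B', 'A', 'B', 'A'], 'b': [1, 2, 3, 4]})

# Default: group labels come out sorted by key ('A' before 'B')
sorted_out = df.groupby('a')['b'].agg(list)
print(sorted_out.index.tolist())  # ['A', 'B']

# sort=False: groups appear in order of first occurrence ('B' first)
unsorted_out = df.groupby('a', sort=False)['b'].agg(list)
print(unsorted_out.index.tolist())  # ['B', 'A']

# Within each group, values keep their original row order in both cases
print(sorted_out['B'])  # [1, 3]
```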

@Kai 2019-05-02 17:28:17

I'm sorry, I don't understand your answer. Can you explain in more detail? I think this deserves its own question.

@B. M. 2017-03-02 08:42:03

If performance is important go down to numpy level:

import numpy as np

df = pd.DataFrame({'a': np.random.randint(0, 60, 600), 'b': [1, 2, 5, 5, 4, 6]*100})

def f(df):
    keys, values = df.sort_values('a').values.T
    ukeys, index = np.unique(keys, return_index=True)
    arrays = np.split(values, index[1:])
    df2 = pd.DataFrame({'a': ukeys, 'b': [list(a) for a in arrays]})
    return df2


In [301]: %timeit f(df)
1000 loops, best of 3: 1.64 ms per loop

In [302]: %timeit df.groupby('a')['b'].apply(list)
100 loops, best of 3: 5.26 ms per loop
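
A self-contained sketch of the same numpy approach on the question's small frame, using a stable sort (kind='mergesort') so within-group row order is guaranteed, with a check against the plain groupby result:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6]})

def f(df):
    # Stable sort keeps within-group row order intact
    keys, values = df.sort_values('a', kind='mergesort').values.T
    # Start index of each distinct key in the sorted array
    ukeys, index = np.unique(keys, return_index=True)
    # Split the value array at those group boundaries
    arrays = np.split(values, index[1:])
    return pd.DataFrame({'a': ukeys, 'b': [list(a) for a in arrays]})

result = f(df).set_index('a')['b']
expected = df.groupby('a')['b'].apply(list)
print((result == expected).all())  # True
```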

@ru111 2019-03-12 17:35:59

How could we use this if we are grouping by two or more keys e.g. with .groupby([df.index.month,]) instead of just .groupby('a')?
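
With plain groupby (as opposed to the numpy function above), grouping by two or more keys is just a matter of passing a list of keys; a sketch with made-up columns:

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B'],
                   'c': ['x', 'y', 'x', 'x'],
                   'b': [1, 2, 3, 4]})

# Pass a list of keys to group by several columns at once;
# the result is indexed by a MultiIndex of (a, c) pairs
out = df.groupby(['a', 'c'])['b'].apply(list)
print(out[('B', 'x')])  # [3, 4]
```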

@WeNYoBen 2018-11-30 20:59:27

Let's use df.groupby with a dict comprehension and the Series constructor:

pd.Series({x : y.b.tolist() for x , y in df.groupby('a')})
A       [1, 2]
B    [5, 5, 4]
C          [6]
dtype: object

@Markus Dutschke 2018-10-31 16:25:24

To solve this for several columns of a dataframe:

In [5]: df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
   ...:                    'b': [1, 2, 5, 5, 4, 6], 'c': [3, 3, 3, 4, 4, 4]})

In [6]: df
   a  b  c
0  A  1  3
1  A  2  3
2  B  5  3
3  B  5  4
4  B  4  4
5  C  6  4

In [7]: df.groupby('a').agg(lambda x: list(x))
           b          c
A     [1, 2]     [3, 3]
B  [5, 5, 4]  [3, 4, 4]
C        [6]        [4]

This answer was inspired from Anamika Modi's answer. Thank you!

@Anamika Modi 2018-09-27 06:28:03

A handy way to achieve this would be:

df.groupby('a').agg({'b':lambda x: list(x)})

Look into writing custom aggregations.
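
On pandas 0.25+, named aggregation is another spelling of the same idea (a sketch; the output column name b_list is made up here):

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6]})

# Named aggregation: output_name=(input_column, aggregation_func)
out = df.groupby('a').agg(b_list=('b', list))
print(out.loc['B', 'b_list'])  # [5, 5, 4]
```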

@BallpointBen 2018-10-11 17:43:13

lambda args: f(args) is equivalent to f

@jchaykow 2019-01-24 21:28:29

Also, just df.groupby('a').agg({'b': list}) works.

@EdChum 2014-03-06 10:28:32

You can do this using groupby to group on the column of interest and then apply list to every group:

In [1]:
# create the dataframe    
df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
   a  b
0  A  1
1  A  2
2  B  5
3  B  5
4  B  4
5  C  6

[6 rows x 2 columns]

In [76]:
df.groupby('a')['b'].apply(list)

A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

@Abhishek Thakur 2014-03-06 11:12:19

This takes a lot of time if the dataset is huge, say 10 million rows. Is there any faster way to do this? The number of uniques in 'a' is, however, around 500k.

@EdChum 2014-03-06 11:32:33

groupby is notoriously slow and memory hungry. What you could do is sort by column 'a', then find the idxmin and idxmax (probably store these in a dict) and use them to slice your dataframe; that would be faster, I think.

@EdChum 2014-03-06 11:40:03

@AbhishekThakur actually that won't work, as idxmin will not work for strings; you would need to store the beginning and end index values.

@DSM 2014-03-06 12:21:49

Unless I'm missing something (no morning coffee yet) you're doing a separate groupby for each row.

@EdChum 2014-03-06 13:00:14

@DSM Hmm. yes that does appear to be the case

@Abhishek Thakur 2014-03-06 15:17:16

Let me try the new method

@Andarin 2016-06-24 10:54:24

When I tried this solution with my problem (having multiple columns to group by and to group), it didn't work; pandas raised 'Function does not reduce'. Then I used tuple, following the second answer here: … (see the second answer in … for an explanation).

@slhck 2016-09-28 12:06:17

Would be nice to put the simplest solution at the top of your answer :) Just found it very helpful.

@EdChum 2016-09-28 12:07:23

@slhck sure will do, I tend to keep earlier attempts to show my train of thought but this adds noise

@Poka 2018-08-12 03:26:08

@EdChum how do I select all rows if I have one more column, e.g. by('a')['b']['c']?

@user1893354 2018-09-05 21:50:53

I spent WAY too much time trying to figure this out

@Sriram Arvind Lakshmanakumar 2019-01-18 10:59:09

This solution is good, but is there a way to store a set instead of a list, meaning can I remove the duplicates and then store it?

@EdChum 2019-01-18 11:02:43

You mean df.groupby('a')['b'].apply(lambda x: list(set(x)))?
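
Worth noting that set throws away order; a sketch using dict.fromkeys to deduplicate while keeping first-seen order:

```python
import pandas as pd

df = pd.DataFrame({'a': ['A', 'A', 'B', 'B', 'B', 'C'],
                   'b': [1, 2, 5, 5, 4, 6]})

# dict.fromkeys drops duplicates but keeps first-seen order (unlike set)
out = df.groupby('a')['b'].apply(lambda x: list(dict.fromkeys(x)))
print(out['B'])  # [5, 4]
```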

@Jinhua Wang 2019-03-31 13:11:48

This answer is so simple and elegant!

@Catiger3331 2019-04-18 14:58:17

You don't need to use groupby. Just take the set of column 'a', subset the dataframe for 'A', 'B', etc., then fetch column 'b' in each subset and put those values in a list.

@Acorbe 2014-03-06 10:12:46

As you were saying, the groupby method of a pd.DataFrame object can do the job.


 L = ['A','A','B','B','B','C']
 N = [1,2,5,5,4,6]

 import pandas as pd
 df = pd.DataFrame(list(zip(L, N)), columns=list('LN'))

 groups = df.groupby(df.L)
 groups.groups

      {'A': [0, 1], 'B': [2, 3, 4], 'C': [5]}

which gives an index-wise description of the groups.

To get the elements of a single group, you can use get_group, for instance

  groups.get_group('A')

     L  N
  0  A  1
  1  A  2

  groups.get_group('B')

     L  N
  2  B  5
  3  B  5
  4  B  4
