By Abhishek Thakur


2014-03-06 08:31:09 8 Comments

I have a pandas data frame like:

a b
A 1
A 2
B 5
B 5
B 4
C 6

I want to group by the first column and get second column as lists in rows:

A [1,2]
B [5,5,4]
C [6]

Is it possible to do something like this using pandas groupby?

9 comments

@Vanshika 2019-07-04 17:07:02

If looking for a unique list while grouping multiple columns this could probably help:

df.groupby('a').agg(lambda x: list(set(x))).reset_index()

@Ganesh Kharad 2019-06-10 11:33:24

Here I have grouped elements with "|" as a separator import pandas as pd

df = pd.read_csv('input.csv')

df
Out[1]:
  Area  Keywords
0  A  1
1  A  2
2  B  5
3  B  5
4  B  4
5  C  6

df.dropna(inplace =  True)
df['Area']=df['Area'].apply(lambda x:x.lower().strip())
print df.columns
df_op = df.groupby('Area').agg({"Keywords":lambda x : "|".join(x)})

df_op.to_csv('output.csv')
Out[2]:
df_op
Area  Keywords

A       [1| 2]
B    [5| 5| 4]
C          [6]

@cs95 2019-04-24 22:35:32

Use any of the following groupby and agg recipes.

# Setup
df = pd.DataFrame({
  'a': ['A', 'A', 'B', 'B', 'B', 'C'],
  'b': [1, 2, 5, 5, 4, 6],
  'c': ['x', 'y', 'z', 'x', 'y', 'z']
})
df

   a  b  c
0  A  1  x
1  A  2  y
2  B  5  z
3  B  5  x
4  B  4  y
5  C  6  z

To aggregate multiple columns as lists, use any of the following:

df.groupby('a').agg(list)
df.groupby('a').agg(pd.Series.tolist)

           b          c
a                      
A     [1, 2]     [x, y]
B  [5, 5, 4]  [z, x, y]
C        [6]        [z]

To group-listify a single column only, convert the groupby to a SeriesGroupBy object, then call SeriesGroupBy.agg. Use,

df.groupby('a').agg({'b': list})  # 4.42 ms 
df.groupby('a')['b'].agg(list)    # 2.76 ms - faster

a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

@Kai 2019-05-02 15:51:32

are the methods above guaranteed to preserve order? meaning that elements from the same row (but different columns, b and c in your code above) will have the same index in the resulting lists?

@cs95 2019-05-02 16:37:23

@Kai oh, good question. Yes and no. GroupBy sorts the output by the grouper key values. However the sort is generally stable so the relative ordering per group is preserved. To disable the sorting behavior entirely, use groupby(..., sort=False). Here, it'd make no difference since I'm grouping on column A which is already sorted.

@Kai 2019-05-02 17:28:17

i'm sorry, i don't understand your answer. Can you explain in more detail. I think this deserves it's own question..

@B. M. 2017-03-02 08:42:03

If performance is important go down to numpy level:

import numpy as np

df = pd.DataFrame({'a': np.random.randint(0, 60, 600), 'b': [1, 2, 5, 5, 4, 6]*100})

def f(df):
         keys, values = df.sort_values('a').values.T
         ukeys, index = np.unique(keys, True)
         arrays = np.split(values, index[1:])
         df2 = pd.DataFrame({'a':ukeys, 'b':[list(a) for a in arrays]})
         return df2

Tests:

In [301]: %timeit f(df)
1000 loops, best of 3: 1.64 ms per loop

In [302]: %timeit df.groupby('a')['b'].apply(list)
100 loops, best of 3: 5.26 ms per loop

@ru111 2019-03-12 17:35:59

How could we use this if we are grouping by two or more keys e.g. with .groupby([df.index.month, df.index.day]) instead of just .groupby('a')?

@WeNYoBen 2018-11-30 20:59:27

Let us using df.groupby with list and Series constructor

pd.Series({x : y.b.tolist() for x , y in df.groupby('a')})
Out[664]: 
A       [1, 2]
B    [5, 5, 4]
C          [6]
dtype: object

@Markus Dutschke 2018-10-31 16:25:24

To solve this for several columns of a dataframe:

In [5]: df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6],'c'
   ...: :[3,3,3,4,4,4]})

In [6]: df
Out[6]: 
   a  b  c
0  A  1  3
1  A  2  3
2  B  5  3
3  B  5  4
4  B  4  4
5  C  6  4

In [7]: df.groupby('a').agg(lambda x: list(x))
Out[7]: 
           b          c
a                      
A     [1, 2]     [3, 3]
B  [5, 5, 4]  [3, 4, 4]
C        [6]        [4]

This answer was inspired from Anamika Modi's answer. Thank you!

@Anamika Modi 2018-09-27 06:28:03

A handy way to achieve this would be:

df.groupby('a').agg({'b':lambda x: list(x)})

Look into writing Custom Aggregations: https://www.kaggle.com/akshaysehgal/how-to-group-by-aggregate-using-py

@BallpointBen 2018-10-11 17:43:13

lambda args: f(args) is equivalent to f

@jchaykow 2019-01-24 21:28:29

Also just df.groupby('a').agg({'b':list}) works

@cs95 2019-06-07 15:31:34

Actually, just agg(list) is enough. Also see here.

@EdChum 2014-03-06 10:28:32

You can do this using groupby to group on the column of interest and then apply list to every group:

In [1]:
# create the dataframe    
df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,5,4,6]})
df
Out[1]:
   a  b
0  A  1
1  A  2
2  B  5
3  B  5
4  B  4
5  C  6

[6 rows x 2 columns]

In [76]:
df.groupby('a')['b'].apply(list)

Out[76]:
a
A       [1, 2]
B    [5, 5, 4]
C          [6]
Name: b, dtype: object

@Abhishek Thakur 2014-03-06 11:12:19

This takes a lot of time if the dataset is huge, say 10million rows. Is there any faster way to do this? The number of uniques in 'a' is however around 500k

@EdChum 2014-03-06 11:32:33

groupby is notoriously slow and memory hungry, what you could do is sort by column A, then find the idxmin and idxmax (probably store this in a dict) and use this to slice your dataframe would be faster I think

@EdChum 2014-03-06 11:40:03

@AbhishekThakur actually that won't work as idxmin will not work for strings, you would need to store the beginning and end indices values

@DSM 2014-03-06 12:21:49

Unless I'm missing something (no morning coffee yet) you're doing a separate groupby for each row.

@EdChum 2014-03-06 13:00:14

@DSM Hmm. yes that does appear to be the case

@Andarin 2016-06-24 10:54:24

When I tried this solution with my problem (having multiple columns to groupBy and to group), it didn't work - pandas sent 'Function does not reduce'. Then I used tuplefollowing the second answer here: stackoverflow.com/questions/19530568/… . See second answer in stackoverflow.com/questions/27439023/… for explanation.

@slhck 2016-09-28 12:06:17

Would be nice to put the simplest solution at the top of your answer :) Just found it very helpful.

@EdChum 2016-09-28 12:07:23

@slhck sure will do, I tend to keep earlier attempts to show my train of thought but this adds noise

@Poka 2018-08-12 03:26:08

@EdChum how to select all rows if I have one more column . df.group by('a')['b']['c']

@Sriram Arvind Lakshmanakumar 2019-01-18 10:59:09

This solution is good, but is there a way to store set of list, meaning can i remove the duplicates and then store it?

@EdChum 2019-01-18 11:02:43

you mean df.groupby('a')['b'].apply(lambda x:list(set(x)))

@Catiger3331 2019-04-18 14:58:17

You don't need to use groupby. Just take a set on column 'a', and do a subset to the dataframe of 'A', 'B', etc. Then fetch column 'b' in the subset and put those values in a list.

@Poete Maudit 2019-06-07 15:19:26

But how the code is written if you had another column c which also had numbers which had to be put in a list?

@EdChum 2019-06-07 15:31:31

@PoeteMaudit Sorry I don't understand what you're asking and asking questions in comments is bad form in SO. Are you asking how to concatenate multiple columns into a single list?

@Poete Maudit 2019-06-07 15:32:36

No worries, I was asking for this basically: stackoverflow.com/a/53088007/9024698

@Edward Aung 2019-07-08 00:35:27

in pandas 0.23.x, apply does not work. I needed to use 'agg' function.

@EdChum 2019-07-08 11:06:25

@EdwardAung this still works for me using pandas version '0.24.2', you'd have to post an example where this fails

@Acorbe 2014-03-06 10:12:46

As you were saying the groupby method of a pd.DataFrame object can do the job.

Example

 L = ['A','A','B','B','B','C']
 N = [1,2,5,5,4,6]

 import pandas as pd
 df = pd.DataFrame(zip(L,N),columns = list('LN'))


 groups = df.groupby(df.L)

 groups.groups
      {'A': [0, 1], 'B': [2, 3, 4], 'C': [5]}

which gives and index-wise description of the groups.

To get elements of single groups, you can do, for instance

 groups.get_group('A')

     L  N
  0  A  1
  1  A  2

  groups.get_group('B')

     L  N
  2  B  5
  3  B  5
  4  B  4

Related Questions

Sponsored Content

60 Answered Questions

[SOLVED] How do you split a list into evenly sized chunks?

22 Answered Questions

[SOLVED] Renaming columns in pandas

23 Answered Questions

[SOLVED] Adding new column to existing DataFrame in Python pandas

5 Answered Questions

17 Answered Questions

[SOLVED] How to iterate over rows in a DataFrame in Pandas?

18 Answered Questions

[SOLVED] Get list from pandas DataFrame column headers

22 Answered Questions

[SOLVED] How do I list all files of a directory?

  • 2010-07-08 19:31:22
  • duhhunjonn
  • 3363205 View
  • 3474 Score
  • 22 Answer
  • Tags:   python directory

14 Answered Questions

[SOLVED] "Large data" work flows using pandas

9 Answered Questions

[SOLVED] Converting a Pandas GroupBy output from Series to DataFrame

4 Answered Questions

[SOLVED] Convert list of dictionaries to a pandas DataFrame

Sponsored Content