By user2938093


2016-02-18 20:01:07 8 Comments

I have a pandas DataFrame with one column that looks like the following:

In [207]:df2.teams
Out[207]: 
0         [SF, NYG]
1         [SF, NYG]
2         [SF, NYG]
3         [SF, NYG]
4         [SF, NYG]
5         [SF, NYG]
6         [SF, NYG]
7         [SF, NYG]

I need to split this column of lists into 2 columns named team1 and team2 using pandas.

5 comments

@ailurid 2019-11-03 06:07:35

Based on the previous answers, here is another solution which returns the same result as df2.teams.apply(pd.Series) with a much faster run time:

pd.DataFrame([{x: y for x, y in enumerate(item)} for item in df2['teams'].values.tolist()], index=df2.index)

Timings:

In [1]:
import pandas as pd
d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
                ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
df2 = pd.concat([df2]*1000).reset_index(drop=True)

In [2]: %timeit df2['teams'].apply(pd.Series)

8.27 s ± 2.73 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit pd.DataFrame([{x: y for x, y in enumerate(item)} for item in df2['teams'].values.tolist()], index=df2.index)

35.4 ms ± 5.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@Kevin Markham 2019-08-14 16:24:16

This solution preserves the index of the df2 DataFrame, unlike any solution that uses tolist():

df3 = df2.teams.apply(pd.Series)
df3.columns = ['team1', 'team2']

Here's the result:

  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

@Erfan 2019-10-17 11:43:10

Also one of the slowest apply you can do in pandas. You should avoid this method and use the accepted answer. In the timings of the top answer, this method is approx 1400 x slower @rajan

@Kevin Markham 2019-10-17 13:17:31

@Erfan Yes, but sometimes the user doesn't care whether an operation takes 1s or 1ms, and instead they care most about writing the simplest, most readable code! I acknowledge that readability/simplicity is subjective, but my point is simply that speed is not a priority for all users at all times.

@moritz 2020-01-22 15:58:49

Furthermore, I found out that the apply method works more reliably for expanding large arrays (1000+ items) on large data sets. The tolist() method killed my process when the data set exceeded 500k rows.

@Joseph Davison 2018-06-15 17:03:07

Much simpler solution:

pd.DataFrame(df2.teams.tolist(), columns=['team1', 'team2'])

Yields,

  team1 team2
-------------
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG
7    SF   NYG

If you wanted to split a column of delimited strings rather than lists, you could similarly do:

pd.DataFrame(df.teams.str.split('<delim>', expand=True).values,
             columns=['team1', 'team2'])

@ikel 2019-11-03 17:00:29

what if each list has uneven number of elements?

@mikkokotila 2018-01-09 11:53:39

There seems to be a syntactically simpler way, and therefore easier to remember, as opposed to the proposed solutions. I'm assuming that the column is called 'meta' in a dataframe df:

df2 = pd.DataFrame(df['meta'].str.split().values.tolist())

@otteheng 2018-01-11 16:29:36

I got an error but I resolved it by removing the str.split(). This was much simpler and has the advantage if you don't know the number of items in your list.

@jezrael 2016-02-18 20:06:49

You can use DataFrame constructor with lists created by converting to numpy array by values with tolist:

import pandas as pd

d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
                ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
print (df2)
       teams
0  [SF, NYG]
1  [SF, NYG]
2  [SF, NYG]
3  [SF, NYG]
4  [SF, NYG]
5  [SF, NYG]
6  [SF, NYG]

df2[['team1','team2']] = pd.DataFrame(df2.teams.values.tolist(), index= df2.index)
print (df2)
       teams team1 team2
0  [SF, NYG]    SF   NYG
1  [SF, NYG]    SF   NYG
2  [SF, NYG]    SF   NYG
3  [SF, NYG]    SF   NYG
4  [SF, NYG]    SF   NYG
5  [SF, NYG]    SF   NYG
6  [SF, NYG]    SF   NYG

And for new DataFrame:

df3 = pd.DataFrame(df2['teams'].values.tolist(), columns=['team1','team2'])
print (df3)
  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

Solution with apply(pd.Series) is very slow:

#7k rows
df2 = pd.concat([df2]*1000).reset_index(drop=True)

In [89]: %timeit df2['teams'].apply(pd.Series)
1 loop, best of 3: 1.15 s per loop

In [90]: %timeit pd.DataFrame(df2['teams'].values.tolist(), columns=['team1','team2'])
1000 loops, best of 3: 820 µs per loop

@user1700890 2017-11-06 15:16:42

Minor caveat, if you are using it on existing dataframe, make sure to reset index, otherwise it will not assign correctly.

@jezrael 2017-11-06 15:18:53

@user1700890 - yes, or specify index in DataFrame constructor df2[['team1','team2']] = pd.DataFrame(df2.teams.values.tolist(), index= df2.index)

@dondapati 2018-02-14 13:43:46

@jezrael Why should we use values method could you provide the explanation.

@dondapati 2018-02-14 13:45:02

because pd.DataFrame(df2.teams.tolist(),columns=(['team1','team2'])) it gives same answer. is there any reason to use values

@jezrael 2018-02-14 13:47:16

@user7462639 - Because better performance. Check this - generally converting numpy arrays should be faster as convert Series - pandas functions are obviously slowier as numpy.

@Catbuilts 2018-11-20 11:08:23

It seems to me that apply() is likely to be slow for the most part. Should I avoid using the function if I have alternatives ?

@jezrael 2018-11-20 11:08:59

@Catbuilts - yes, if exist vectorize solution the best avoid it.

@Catbuilts 2018-11-20 11:19:39

@jezrael: Thanks for the advice. Btw, what do you mean by vectorize solution. numpy is a way of vectorize solution, right? What else can be considered as vectorize solution ? Thanks

@jezrael 2018-11-20 11:21:24

@Catbuilts - yes, obviously. Vectorized means generally no loops, so no apply, no for, no list comprehensions. But it depends what need exactly. Maybe also help this

@CheTesta 2019-02-11 09:31:10

@Catbuilts Indeed apply() might be slower but is the go-to method when input string and values are not equal across rows of the original Series!

@QHarr 2019-06-22 08:00:41

Here you know what the list length is and supply the new column names. What do you do if the list length varies? Do you find the max length of a list within the column then add enough columns for that and find a way to both generate new names and assign values for each row across as many columns as applicable? Do you know of an example of how to solve this please? +1 btw

@jezrael 2019-06-22 08:02:16

@QHarr - Then use pd.DataFrame(df2['teams'].values.tolist()).add_prefix('col')

Related Questions

Sponsored Content

40 Answered Questions

[SOLVED] How to make a flat list out of list of lists?

7 Answered Questions

[SOLVED] Change data type of columns in Pandas

20 Answered Questions

[SOLVED] How to iterate over rows in a DataFrame in Pandas?

19 Answered Questions

[SOLVED] Get list from pandas DataFrame column headers

32 Answered Questions

[SOLVED] How do I split a string on a delimiter in Bash?

62 Answered Questions

[SOLVED] How do you split a list into evenly sized chunks?

27 Answered Questions

[SOLVED] How do I check if a list is empty?

  • 2008-09-10 06:20:11
  • Ray Vega
  • 2630695 View
  • 3234 Score
  • 27 Answer
  • Tags:   python list

23 Answered Questions

[SOLVED] Adding new column to existing DataFrame in Python pandas

23 Answered Questions

[SOLVED] Renaming columns in pandas

5 Answered Questions

[SOLVED] Convert list of dictionaries to a pandas DataFrame

Sponsored Content