By user2938093


2016-02-18 20:01:07 8 Comments

I have a pandas DataFrame with one column:

import pandas as pd

df = pd.DataFrame(
    data={
        "teams": [
            ["SF", "NYG"],
            ["SF", "NYG"],
            ["SF", "NYG"],
            ["SF", "NYG"],
            ["SF", "NYG"],
            ["SF", "NYG"],
            ["SF", "NYG"],
        ]
    }
)

print(df)

Output:

       teams
0  [SF, NYG]
1  [SF, NYG]
2  [SF, NYG]
3  [SF, NYG]
4  [SF, NYG]
5  [SF, NYG]
6  [SF, NYG]

How can split this column of lists into 2 columns?

8 comments

@Sayandip Dutta 2020-07-31 20:28:05

Here's another solution using df.transform and df.set_index:

>>> (df['teams']
       .transform([lambda x:x[0], lambda x:x[1]])
       .set_axis(['team1','team2'],
                  axis=1,
                  inplace=False)
    )

  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

@raphaelauv 2020-09-10 10:57:23

That's really nice

@Talis 2020-05-24 09:28:09

list comprehension

simple implementation with list comprehension ( my favorite)

df = pd.DataFrame([pd.Series(x) for x in df.teams])
df.columns = ['team_{}'.format(x+1) for x in df.columns]

timing on output:

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 2.71 ms

output:

team_1  team_2
0   SF  NYG
1   SF  NYG
2   SF  NYG
3   SF  NYG
4   SF  NYG
5   SF  NYG
6   SF  NYG

@Isaac 2020-05-26 10:47:18

This kind of handles lists of different lengths - which is an improvement over many other answers, but results in items not being in their own columns.

@raphaelauv 2020-09-10 11:01:28

This solution is not based on pandas

@jezrael 2016-02-18 20:06:49

You can use DataFrame constructor with lists created by to_list:

import pandas as pd

d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
                ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
print (df2)
       teams
0  [SF, NYG]
1  [SF, NYG]
2  [SF, NYG]
3  [SF, NYG]
4  [SF, NYG]
5  [SF, NYG]
6  [SF, NYG]

df2[['team1','team2']] = pd.DataFrame(df2.teams.tolist(), index= df2.index)
print (df2)
       teams team1 team2
0  [SF, NYG]    SF   NYG
1  [SF, NYG]    SF   NYG
2  [SF, NYG]    SF   NYG
3  [SF, NYG]    SF   NYG
4  [SF, NYG]    SF   NYG
5  [SF, NYG]    SF   NYG
6  [SF, NYG]    SF   NYG

And for new DataFrame:

df3 = pd.DataFrame(df2['teams'].to_list(), columns=['team1','team2'])
print (df3)
  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

Solution with apply(pd.Series) is very slow:

#7k rows
df2 = pd.concat([df2]*1000).reset_index(drop=True)

In [121]: %timeit df2['teams'].apply(pd.Series)
1.79 s ± 52.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [122]: %timeit pd.DataFrame(df2['teams'].to_list(), columns=['team1','team2'])
1.63 ms ± 54.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@user1700890 2017-11-06 15:16:42

Minor caveat, if you are using it on existing dataframe, make sure to reset index, otherwise it will not assign correctly.

@jezrael 2017-11-06 15:18:53

@user1700890 - yes, or specify index in DataFrame constructor df2[['team1','team2']] = pd.DataFrame(df2.teams.values.tolist(), index= df2.index)

@dondapati 2018-02-14 13:43:46

@jezrael Why should we use values method could you provide the explanation.

@dondapati 2018-02-14 13:45:02

because pd.DataFrame(df2.teams.tolist(),columns=(['team1','team2'])) it gives same answer. is there any reason to use values

@jezrael 2018-02-14 13:47:16

@user7462639 - Because better performance. Check this - generally converting numpy arrays should be faster as convert Series - pandas functions are obviously slowier as numpy.

@Catbuilts 2018-11-20 11:08:23

It seems to me that apply() is likely to be slow for the most part. Should I avoid using the function if I have alternatives ?

@jezrael 2018-11-20 11:08:59

@Catbuilts - yes, if exist vectorize solution the best avoid it.

@Catbuilts 2018-11-20 11:19:39

@jezrael: Thanks for the advice. Btw, what do you mean by vectorize solution. numpy is a way of vectorize solution, right? What else can be considered as vectorize solution ? Thanks

@jezrael 2018-11-20 11:21:24

@Catbuilts - yes, obviously. Vectorized means generally no loops, so no apply, no for, no list comprehensions. But it depends what need exactly. Maybe also help this

@CheTesta 2019-02-11 09:31:10

@Catbuilts Indeed apply() might be slower but is the go-to method when input string and values are not equal across rows of the original Series!

@QHarr 2019-06-22 08:00:41

Here you know what the list length is and supply the new column names. What do you do if the list length varies? Do you find the max length of a list within the column then add enough columns for that and find a way to both generate new names and assign values for each row across as many columns as applicable? Do you know of an example of how to solve this please? +1 btw

@jezrael 2019-06-22 08:02:16

@QHarr - Then use pd.DataFrame(df2['teams'].values.tolist()).add_prefix('col')

@AMC 2020-05-01 11:21:32

@jezrael Can you update this to reflect the fact that Series have their own to_list() method?

@jezrael 2020-05-01 11:30:51

@AMC - Yop, there is also tolist

@AMC 2020-05-01 11:32:52

@jezrael I believe that to_list is currently just an alias for tolist anyway.

@jezrael 2020-05-01 11:34:11

@AMC - hmmm, there is some reason for prioritize this one? What so you think? Btw, downvoting if you it was not so nice :(

@jezrael 2020-05-01 11:39:57

@AMC - Also not this ?

@AMC 2020-05-01 11:40:40

@Shubham Sharma 2020-06-04 13:41:11

Read your post regarding the performance of df.apply(pd.Series), Nice post, Thanks for suggesting the post, its always good to learn stuff from you +1.

@jezrael 2020-06-04 13:41:35

@ShubhamSharma - Thank you.

@MasayoMusic 2020-07-16 05:49:54

@jezrael Using tolist vs apply seems to be converting my np.nan into None? It seems I can't do a simple replace df.replace(None, np.nan)

@Joseph Davison 2018-06-15 17:03:07

Much simpler solution:

pd.DataFrame(df2["teams"].to_list(), columns=['team1', 'team2'])

Yields,

  team1 team2
-------------
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG
7    SF   NYG

If you wanted to split a column of delimited strings rather than lists, you could similarly do:

pd.DataFrame(df["teams"].str.split('<delim>', expand=True).values,
             columns=['team1', 'team2'])

@ikel 2019-11-03 17:00:29

what if each list has uneven number of elements?

@AMC 2020-05-01 11:11:14

If you wanted to split a column of delimited strings rather than lists, you could similarly do: df["teams"].str.split('<delim>', expand=True) already returns a DataFrame, so it would probably be simpler to just rename the columns.

@Lucas 2020-03-02 12:48:13

The above solutions didn't work for me since I have nan observations in my dataframe. In my case df2[['team1','team2']] = pd.DataFrame(df2.teams.values.tolist(), index= df2.index) yields:

object of type 'float' has no len()

I solve this using list comprehension. Here the replicable example:

import pandas as pd
import numpy as np
d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
            ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
df2.loc[2,'teams'] = np.nan
df2.loc[4,'teams'] = np.nan
df2

output:

        teams
0   [SF, NYG]
1   [SF, NYG]
2   NaN
3   [SF, NYG]
4   NaN
5   [SF, NYG]
6   [SF, NYG]

df2['team1']=np.nan
df2['team2']=np.nan

solving with list comprehension:

for i in [0,1]:
    df2['team{}'.format(str(i+1))]=[k[i] if isinstance(k,list) else k for k in df2['teams']]

df2

yields:

    teams   team1   team2
0   [SF, NYG]   SF  NYG
1   [SF, NYG]   SF  NYG
2   NaN        NaN  NaN
3   [SF, NYG]   SF  NYG
4   NaN        NaN  NaN
5   [SF, NYG]   SF  NYG
6   [SF, NYG]   SF  NYG

@ailurid 2019-11-03 06:07:35

Based on the previous answers, here is another solution which returns the same result as df2.teams.apply(pd.Series) with a much faster run time:

pd.DataFrame([{x: y for x, y in enumerate(item)} for item in df2['teams'].values.tolist()], index=df2.index)

Timings:

In [1]:
import pandas as pd
d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
                ['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
df2 = pd.concat([df2]*1000).reset_index(drop=True)

In [2]: %timeit df2['teams'].apply(pd.Series)

8.27 s ± 2.73 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit pd.DataFrame([{x: y for x, y in enumerate(item)} for item in df2['teams'].values.tolist()], index=df2.index)

35.4 ms ± 5.22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@Kevin Markham 2019-08-14 16:24:16

This solution preserves the index of the df2 DataFrame, unlike any solution that uses tolist():

df3 = df2.teams.apply(pd.Series)
df3.columns = ['team1', 'team2']

Here's the result:

  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

@Erfan 2019-10-17 11:43:10

Also one of the slowest apply you can do in pandas. You should avoid this method and use the accepted answer. In the timings of the top answer, this method is approx 1400 x slower @rajan

@Kevin Markham 2019-10-17 13:17:31

@Erfan Yes, but sometimes the user doesn't care whether an operation takes 1s or 1ms, and instead they care most about writing the simplest, most readable code! I acknowledge that readability/simplicity is subjective, but my point is simply that speed is not a priority for all users at all times.

@moritz 2020-01-22 15:58:49

Furthermore, I found out that the apply method works more reliably for expanding large arrays (1000+ items) on large data sets. The tolist() method killed my process when the data set exceeded 500k rows.

@dasilvadaniel 2020-02-16 19:47:47

This is a great solution because it works well with lists of different sizes.

@AMC 2020-05-01 11:17:31

@KevinMarkham they care most about writing the simplest, most readable code Is pd.DataFrame(df["teams"].to_list(), columns=["team_1", "team_2"]) really so much more complicated?

@mikkokotila 2018-01-09 11:53:39

There seems to be a syntactically simpler way, and therefore easier to remember, as opposed to the proposed solutions. I'm assuming that the column is called 'meta' in a dataframe df:

df2 = pd.DataFrame(df['meta'].str.split().values.tolist())

@otteheng 2018-01-11 16:29:36

I got an error but I resolved it by removing the str.split(). This was much simpler and has the advantage if you don't know the number of items in your list.

@AMC 2020-05-01 11:18:36

There seems to be a syntactically simpler way, and therefore easier to remember, as opposed to the proposed solutions. Really? Because this is practically identical to the top answer which was posted years earlier. The only difference is the part which isn't related to this specific question.

Related Questions

Sponsored Content

64 Answered Questions

[SOLVED] How do you split a list into evenly sized chunks?

7 Answered Questions

[SOLVED] Convert list of dictionaries to a pandas DataFrame

9 Answered Questions

[SOLVED] Change column type from string to float in Pandas

22 Answered Questions

[SOLVED] How to iterate over rows in a DataFrame in Pandas

34 Answered Questions

[SOLVED] How do I sort a dictionary by value?

19 Answered Questions

[SOLVED] Get list from pandas DataFrame column headers

27 Answered Questions

[SOLVED] Renaming columns in pandas

18 Answered Questions

[SOLVED] Selecting multiple columns in a pandas dataframe

27 Answered Questions

[SOLVED] Add one row to pandas DataFrame

24 Answered Questions

[SOLVED] Adding new column to existing DataFrame in Python pandas

Sponsored Content