By Vincent

2012-10-01 20:42:16 8 Comments

I have a pandas dataframe in which one column of text strings contains comma-separated values. I want to split each CSV field and create a new row per entry (assume that CSV are clean and need only be split on ','). For example, a should become b:

In [7]: a
    var1  var2
0  a,b,c     1
1  d,e,f     2

In [8]: b
  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

So far, I have tried various simple functions, but the .apply method seems to only accept one row as return value when it is used on an axis, and I can't get .transform to work. Any suggestions would be much appreciated!

Example data:

from pandas import DataFrame
import numpy as np
a = DataFrame([{'var1': 'a,b,c', 'var2': 1},
               {'var1': 'd,e,f', 'var2': 2}])
b = DataFrame([{'var1': 'a', 'var2': 1},
               {'var1': 'b', 'var2': 1},
               {'var1': 'c', 'var2': 1},
               {'var1': 'd', 'var2': 2},
               {'var1': 'e', 'var2': 2},
               {'var1': 'f', 'var2': 2}])

I know this won't work because we lose DataFrame meta-data by going through numpy, but it should give you a sense of what I tried to do:

def fun(row):
    letters = row['var1']
    letters = letters.split(',')
    out = np.array([row] * len(letters))
    out['var1'] = letters
a['idx'] = range(a.shape[0])
z = a.groupby('idx')


@Naga Kiran 2018-10-24 16:29:52

There is a possibility to split and explode the dataframe without changing the structure of dataframe


    var1    var2
0   a,b,c   1
1   d,e,f   2

#Get the indexes which are repetative with the split 
df = df.reindex(df.index.repeat(df.var1.str.split(',').apply(len)))
#Assign the split values to dataframe column  
df['var1'] = np.hstack(df['var1'].drop_duplicates().str.split(','))


    var1    var2
0   a   1
0   b   1
0   c   1
1   d   2
1   e   2
1   f   2

@MaxU 2016-11-06 13:12:51

UPDATE2: more generic vectorized function, which will work for multiple normal and multiple list columns

def explode(df, lst_cols, fill_value='', preserve_index=False):
    # make sure `lst_cols` is list-alike
    if (lst_cols is not None
        and len(lst_cols) > 0
        and not isinstance(lst_cols, (list, tuple, np.ndarray, pd.Series))):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)
    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()
    # preserve original index values    
    idx = np.repeat(df.index.values, lens)
    # create "exploded" DF
    res = (pd.DataFrame({
                col:np.repeat(df[col].values, lens)
                for col in idx_cols},
             .assign(**{col:np.concatenate(df.loc[lens>0, col].values)
                            for col in lst_cols}))
    # append those rows that have empty lists
    if (lens == 0).any():
        # at least one list in cells is empty
        res = (res.append(df.loc[lens==0, idx_cols], sort=False)
    # revert the original index order
    res = res.sort_index()
    # reset index if requested
    if not preserve_index:        
        res = res.reset_index(drop=True)
    return res


Multiple list columns - all list columns must have the same # of elements in each row:

In [134]: df
   aaa  myid        num          text
0   10     1  [1, 2, 3]  [aa, bb, cc]
1   11     2         []            []
2   12     3     [1, 2]      [cc, dd]
3   13     4         []            []

In [135]: explode(df, ['num','text'], fill_value='')
   aaa  myid num text
0   10     1   1   aa
1   10     1   2   bb
2   10     1   3   cc
3   11     2
4   12     3   1   cc
5   12     3   2   dd
6   13     4

preserving original index values:

In [136]: explode(df, ['num','text'], fill_value='', preserve_index=True)
   aaa  myid num text
0   10     1   1   aa
0   10     1   2   bb
0   10     1   3   cc
1   11     2
2   12     3   1   cc
2   12     3   2   dd
3   13     4


df = pd.DataFrame({
 'aaa': {0: 10, 1: 11, 2: 12, 3: 13},
 'myid': {0: 1, 1: 2, 2: 3, 3: 4},
 'num': {0: [1, 2, 3], 1: [], 2: [1, 2], 3: []},
 'text': {0: ['aa', 'bb', 'cc'], 1: [], 2: ['cc', 'dd'], 3: []}

CSV column:

In [46]: df
        var1  var2 var3
0      a,b,c     1   XX
1  d,e,f,x,y     2   ZZ

In [47]: explode(df.assign(var1=df.var1.str.split(',')), 'var1')
  var1  var2 var3
0    a     1   XX
1    b     1   XX
2    c     1   XX
3    d     2   ZZ
4    e     2   ZZ
5    f     2   ZZ
6    x     2   ZZ
7    y     2   ZZ

using this little trick we can convert CSV-like column to list column:

In [48]: df.assign(var1=df.var1.str.split(','))
              var1  var2 var3
0        [a, b, c]     1   XX
1  [d, e, f, x, y]     2   ZZ

UPDATE: generic vectorized approach (will work also for multiple columns):

Original DF:

In [177]: df
        var1  var2 var3
0      a,b,c     1   XX
1  d,e,f,x,y     2   ZZ


first let's convert CSV strings to lists:

In [178]: lst_col = 'var1' 

In [179]: x = df.assign(**{lst_col:df[lst_col].str.split(',')})

In [180]: x
              var1  var2 var3
0        [a, b, c]     1   XX
1  [d, e, f, x, y]     2   ZZ

Now we can do this:

In [181]: pd.DataFrame({
     ...:     col:np.repeat(x[col].values, x[lst_col].str.len())
     ...:     for col in x.columns.difference([lst_col])
     ...: }).assign(**{lst_col:np.concatenate(x[lst_col].values)})[x.columns.tolist()]
  var1  var2 var3
0    a     1   XX
1    b     1   XX
2    c     1   XX
3    d     2   ZZ
4    e     2   ZZ
5    f     2   ZZ
6    x     2   ZZ
7    y     2   ZZ

OLD answer:

Inspired by @AFinkelstein solution, i wanted to make it bit more generalized which could be applied to DF with more than two columns and as fast, well almost, as fast as AFinkelstein's solution):

In [2]: df = pd.DataFrame(
   ...:    [{'var1': 'a,b,c', 'var2': 1, 'var3': 'XX'},
   ...:     {'var1': 'd,e,f,x,y', 'var2': 2, 'var3': 'ZZ'}]
   ...: )

In [3]: df
        var1  var2 var3
0      a,b,c     1   XX
1  d,e,f,x,y     2   ZZ

In [4]: (df.set_index(df.columns.drop('var1',1).tolist())
   ...:    .var1.str.split(',', expand=True)
   ...:    .stack()
   ...:    .reset_index()
   ...:    .rename(columns={0:'var1'})
   ...:    .loc[:, df.columns]
   ...: )
  var1  var2 var3
0    a     1   XX
1    b     1   XX
2    c     1   XX
3    d     2   ZZ
4    e     2   ZZ
5    f     2   ZZ
6    x     2   ZZ
7    y     2   ZZ

@WeNYoBen 2017-09-01 16:45:34

dude, if you can open a discussion in Git pandas , I think we do need a build in function like this !!! I have seen so many question about unlistify and unnesting in SO for pandas

@Jaskaran Singh Puri 2018-08-21 13:52:46

how to use this for multiple columns. Like if I have comma separated data in 2 columns and want to do it in sequence?

@MaxU 2018-08-21 15:14:49

@JaskaranSinghPuri, you want to convert all CSV columns to lists first.

@Guido 2018-11-28 14:12:52

Unfornately, it doesn't work if your list elements are tuples. But after converting the entire tuple to string, it works like a charm!

@krassowski 2019-01-23 16:53:20

This solution appears to be the fastest one except for the case there are many very short lists, see for benchmarks.

@krassowski 2019-01-22 23:45:09

I have been struggling with out-of-memory experience using various way to explode my lists so I prepared some benchmarks to help me decide which answers to upvote. I tested five scenarios with varying proportions of the list length to the number of lists. Sharing the results below:

Time: (less is better, click to view large version)


Peak memory usage: (less is better)

Peak memory usage


  • @MaxU's answer (update 2), codename concatenate offers the best speed in almost every case, while keeping the peek memory usage low,
  • see @DMulligan's answer (codename stack) if you need to process lots of rows with relatively small lists and can afford increased peak memory,
  • the accepted @Chang's answer works well for data frames that have a few rows but very large lists.

Full details (functions and benchmarking code) are in this GitHub gist. Please note that the benchmark problem was simplified and did not include splitting of strings into the list - which most solutions performed in a similar fashion.

@MaxU 2019-01-23 17:03:07

Nice comparison! Do you mind to post a code, that you used for plotting the benchmarks ?

@krassowski 2019-01-23 17:14:45

Please see this link: (already included in the answer) - IMO it would be a bit too long to paste it all here.

@piRSquared 2018-08-08 17:23:44


import pandas as pd
import numpy as np

def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

def explode_list(df, col):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.len())
    return df.iloc[i].assign(**{col: np.concatenate(s)})


explode_str(a, 'var1', ',')

  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2

Let's create a new dataframe d that has lists

d = a.assign(var1=lambda d: d.var1.str.split(','))

explode_list(d, 'var1')

  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2

General Comments

I'll use np.arange with repeat to produce dataframe index positions that I can use with iloc.


Why don't I use loc?

Because the index may not be unique and using loc will return every row that matches a queried index.

Why don't you use the values attribute and slice that?

When calling values, if the entirety of the the dataframe is in one cohesive "block", Pandas will return a view of the array that is the "block". Otherwise Pandas will have to cobble together a new array. When cobbling, that array must be of a uniform dtype. Often that means returning an array with dtype that is object. By using iloc instead of slicing the values attribute, I alleviate myself from having to deal with that.

Why do you use assign?

When I use assign using the same column name that I'm exploding, I overwrite the existing column and maintain its position in the dataframe.

Why are the index values repeat?

By virtue of using iloc on repeated positions, the resulting index shows the same repeated pattern. One repeat for each element the list or string.
This can be reset with reset_index(drop=True)

For Strings

I don't want to have to split the strings prematurely. So instead I count the occurrences of the sep argument assuming that if I were to split, the length of the resulting list would be one more than the number of separators.

I then use that sep to join the strings then split.

def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

For Lists

Similar as for strings except I don't need to count occurrences of sep because its already split.

I use Numpy's concatenate to jam the lists together.

import pandas as pd
import numpy as np

def explode_list(df, col):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.len())
    return df.iloc[i].assign(**{col: np.concatenate(s)})

@Tim 2019-03-26 21:51:17

I like this one. Really concise and the performance should be really good too. One question though: is df.iloc[i] the same as repeating rows of the dataframe or is it more efficient than that? Thanks!

@Erfan 2019-05-14 22:53:24

This answer should be higher.

@piRSquared 2019-05-14 23:24:02

@Erfan thanks, that’s kind of you.

@cgels 2018-06-05 23:42:09

The string function split can take an option boolean argument 'expand'.

Here is a solution using this argument:

a.var1.str.split(",",expand=True).set_index(a.var2).stack().reset_index(level=1, drop=True).reset_index().rename(columns={0:"var1"})

@Dennis Golomazov 2018-01-05 20:16:01

Based on the excellent @DMulligan's solution, here is a generic vectorized (no loops) function which splits a column of a dataframe into multiple rows, and merges it back to the original dataframe. It also uses a great generic change_column_order function from this answer.

def change_column_order(df, col_name, index):
    cols = df.columns.tolist()
    cols.insert(index, col_name)
    return df[cols]

def split_df(dataframe, col_name, sep):
    orig_col_index = dataframe.columns.tolist().index(col_name)
    orig_index_name =
    orig_columns = dataframe.columns
    dataframe = dataframe.reset_index()  # we need a natural 0-based index for proper merge
    index_col_name = (set(dataframe.columns) - set(orig_columns)).pop()
    df_split = pd.DataFrame(
        .stack().reset_index(level=1, drop=1), columns=[col_name])
    df = dataframe.drop(col_name, axis=1)
    df = pd.merge(df, df_split, left_index=True, right_index=True, how='inner')
    df = df.set_index(index_col_name) = orig_index_name
    # merge adds the column to the last place, so we need to move it back
    return change_column_order(df, col_name, orig_col_index)


df = pd.DataFrame([['a:b', 1, 4], ['c:d', 2, 5], ['e:f:g:h', 3, 6]], 
                  columns=['Name', 'A', 'B'], index=[10, 12, 13])
        Name    A   B
    10   a:b     1   4
    12   c:d     2   5
    13   e:f:g:h 3   6

split_df(df, 'Name', ':')
    Name    A   B
10   a       1   4
10   b       1   4
12   c       2   5
12   d       2   5
13   e       3   6
13   f       3   6    
13   g       3   6    
13   h       3   6    

Note that it preserves the original index and order of the columns. It also works with dataframes which have non-sequential index.

@Evan 2018-02-01 03:07:31

this cracked this one for me, nice work:

@Ted Petrou 2017-11-04 17:34:22

Here is a fairly straightforward message that uses the split method from pandas str accessor and then uses NumPy to flatten each row into a single array.

The corresponding values are retrieved by repeating the non-split column the correct number of times with np.repeat.

var1 = df.var1.str.split(',', expand=True).values.ravel()
var2 = np.repeat(df.var2.values, len(var1) / len(df))

pd.DataFrame({'var1': var1,
              'var2': var2})

  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5    f     2

@Michael Dorner 2018-06-21 14:08:05

That could be a very beautiful answer. Unfortunately, it does not scale for lots of columns, does it?

@Ankit Maheshwari 2017-06-18 10:27:08

Another solution that uses python copy package

import copy
new_observations = list()
def pandas_explode(df, column_to_explode):
    new_observations = list()
    for row in df.to_dict(orient='records'):
        explode_values = row[column_to_explode]
        del row[column_to_explode]
        if type(explode_values) is list or type(explode_values) is tuple:
            for explode_value in explode_values:
                new_observation = copy.deepcopy(row)
                new_observation[column_to_explode] = explode_value
            new_observation = copy.deepcopy(row)
            new_observation[column_to_explode] = explode_values
    return_df = pd.DataFrame(new_observations)
    return return_df

df = pandas_explode(df, column_name)

@Daniel Himmelstein 2016-10-09 17:57:42

Here's a function I wrote for this common task. It's more efficient than the Series/stack methods. Column order and names are retained.

def tidy_split(df, column, sep='|', keep=False):
    Split the values of a column and expand so the new DataFrame has one split
    value per row. Filters rows where the column is missing.

    df : pandas.DataFrame
        dataframe with the column to split and expand
    column : str
        the column to split and expand
    sep : str
        the string used to split the column's values
    keep : bool
        whether to retain the presplit value as it's own row

        Returns a dataframe with the same columns as `df`.
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
        for value in values:
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df

With this function, the original question is as simple as:

tidy_split(a, 'var1', sep=',')

@bold 2016-12-12 00:09:38

Really efficient & great help for my problem !

@Harry M 2018-07-24 19:43:13

This worked the best for me!

@Derryn Webster-Knife 2016-06-19 15:42:16

Just used jiln's excellent answer from above, but needed to expand to split multiple columns. Thought I would share.

def splitDataFrameList(df,target_column,separator):
''' df = dataframe to split,
target_column = the column containing the values to split
separator = the symbol used to perform the split

returns: a dataframe with each entry for the target column separated, with each element moved into a new row. 
The values in the other columns are duplicated across the newly divided rows.
def splitListToRows(row, row_accumulator, target_columns, separator):
    split_rows = []
    for target_column in target_columns:
    # Seperate for multiple columns
    for i in range(len(split_rows[0])):
        new_row = row.to_dict()
        for j in range(len(split_rows)):
            new_row[target_columns[j]] = split_rows[j][i]
new_rows = []
df.apply(splitListToRows,axis=1,args = (new_rows,target_column,separator))
new_df = pd.DataFrame(new_rows)
return new_df

@inodb 2015-06-24 21:01:57

Similar question as: pandas: How do I split text in a column into multiple rows?

You could do:

>> a=pd.DataFrame({"var1":"a,b,c d,e,f".split(),"var2":[1,2]})
>> s = a.var1.str.split(",").apply(pd.Series, 1).stack()
>> s.index = s.index.droplevel(-1)
>> del a['var1']
>> a.join(s)
   var2 var1
0     1    a
0     1    b
0     1    c
1     2    d
1     2    e
1     2    f

@Jesse 2017-06-04 07:13:49

It works after add one more rename code = 'var1'

@jlln 2015-04-21 09:02:49

I came up with a solution for dataframes with arbitrary numbers of columns (while still only separating one column's entries at a time).

def splitDataFrameList(df,target_column,separator):
    ''' df = dataframe to split,
    target_column = the column containing the values to split
    separator = the symbol used to perform the split

    returns: a dataframe with each entry for the target column separated, with each element moved into a new row. 
    The values in the other columns are duplicated across the newly divided rows.
    def splitListToRows(row,row_accumulator,target_column,separator):
        split_row = row[target_column].split(separator)
        for s in split_row:
            new_row = row.to_dict()
            new_row[target_column] = s
    new_rows = []
    df.apply(splitListToRows,axis=1,args = (new_rows,target_column,separator))
    new_df = pandas.DataFrame(new_rows)
    return new_df

@KWubbufetowicz 2016-06-22 18:51:20

nice but sadly slow because of this todict() conversion :(

@Pavel 2015-03-17 21:07:42

I have come up with the following solution to this problem:

def iter_var1(d):
    for _, row in d.iterrows():
        for v in row["var1"].split(","):
            yield (v, row["var2"])

new_a = DataFrame.from_records([i for i in iter_var1(a)],
        columns=["var1", "var2"])

@DMulligan 2015-01-28 00:28:46

After painful experimentation to find something faster than the accepted answer, I got this to work. It ran around 100x faster on the dataset I tried it on.

If someone knows a way to make this more elegant, by all means please modify my code. I couldn't find a way that works without setting the other columns you want to keep as the index and then resetting the index and re-naming the columns, but I'd imagine there's something else that works.

b = DataFrame(a.var1.str.split(',').tolist(), index=a.var2).stack()
b = b.reset_index()[[0, 'var2']] # var1 variable is currently labeled 0
b.columns = ['var1', 'var2'] # renaming var1

@cyril 2017-04-15 00:06:43

This solution worked significantly faster and appears to use less memory,

@Dennis Golomazov 2018-01-05 19:20:50

This is a nice vectorized pandas solution, I was looking for that. Thanks!

@user5359531 2018-08-23 22:10:07

When I try this on my own dataset, I keep getting TypeError: object of type 'float' has no len() at the very first step (DataFrame(df.var1.str.split(',').tolist()))

@Flair 2018-10-01 23:00:20

@user5359531 your dataset probably has some NaN in that column, so the replacement is b = DataFrame(a.var1.str.split(',').values.tolist(), index=a.var2).stack()

@Chang She 2012-10-01 21:15:03

How about something like this:

In [55]: pd.concat([Series(row['var2'], row['var1'].split(','))              
                    for _, row in a.iterrows()]).reset_index()
  index  0
0     a  1
1     b  1
2     c  1
3     d  2
4     e  2
5     f  2

Then you just have to rename the columns

@Vincent 2012-10-02 00:22:15

Looks like this is going to work. Thanks for your help! In general, though, is there a prefered approach to Split-Apply-Combine where Apply returns a dataframe of arbitrary size (but consistent for all chunks), and Combine just vstacks the returned DFs?

@Chang She 2012-10-02 01:43:48

GroupBy.apply should work (I just tried it against master). However, in this case you don't really need to go through the extra step of grouping since you're generating the data by row right?

@Vincent 2012-10-02 03:00:45

Yes, that's right. Thanks for the tip. iterrows is nice.

@horatio1701d 2014-06-25 20:20:25

Hey guys. Sorry to jump into this so late but wondering if there is not a better solution to this. I'm trying to experiment with iterrows for the first time since that seems like the ticket for this. I'm also confused by the solution proposed. What does the "_" represent? Can you possibly explain how the solution works? --Thank you

@horatio1701d 2014-06-25 21:54:00

Can the solution be extended to more than two columns?

@horatio1701d 2014-09-03 18:06:26

Is there an implementation with the new API enhancements that might make this a little more performant? This implementation takes far too long to be practical on my large datasets.

@MaxU 2017-02-02 22:30:29

please check this vectorized approach...

@Ando Jurai 2017-05-30 11:43:48

@horatio1701d, _ is a nameplace meaning that you will not take the output into account. here the output is the unpacking of iterrows() for each line, hence both the index and the content of the line. Hence you ignore the index and use the content of the line. The solution works by using a trick; the repeated content is actually passed as "data", the second parameter is a list and represent the index of the Series. I tried it out because I was a little puzzled too but this is clever, while not so efficient computationally-wise.

@krassowski 2019-01-23 16:55:28

This approach seems to be very memory efficient, especially for big lists, see for benchmarks.

Related Questions

Sponsored Content

16 Answered Questions

[SOLVED] Selecting multiple columns in a pandas dataframe

18 Answered Questions

[SOLVED] Get list from pandas DataFrame column headers

17 Answered Questions

[SOLVED] How to iterate over rows in a DataFrame in Pandas?

14 Answered Questions

[SOLVED] Select rows from a DataFrame based on values in a column in pandas

23 Answered Questions

[SOLVED] Adding new column to existing DataFrame in Python pandas

18 Answered Questions

[SOLVED] Add one row to pandas DataFrame

15 Answered Questions

[SOLVED] Delete column from pandas DataFrame by column name

13 Answered Questions

[SOLVED] "Large data" work flows using pandas

12 Answered Questions

[SOLVED] How do I get the row count of a pandas DataFrame?

13 Answered Questions

[SOLVED] How to drop rows of Pandas DataFrame whose value in certain columns is NaN

Sponsored Content