By JD Long


2010-05-17 17:38:24 8 Comments

I have code that at one place ends up with a list of data frames which I really want to convert to a single big data frame.

I got some pointers from an earlier question which was trying to do something similar but more complex.

Here's an example of what I am starting with (this is grossly simplified for illustration):

listOfDataFrames <- vector(mode = "list", length = 100)

for (i in 1:100) {
    listOfDataFrames[[i]] <- data.frame(a = sample(letters, 500, replace = TRUE),
                                        b = rnorm(500), c = rnorm(500))
}

I am currently using this:

  df <- do.call("rbind", listOfDataFrames)

9 comments

@joeklieg 2018-02-27 20:05:08

Use bind_rows() from the dplyr package:

bind_rows(list_of_dataframes, .id = "column_label")
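A minimal sketch of what .id does, assuming dplyr is installed. The list `dfs`, its element names `first` and `second`, and the column name `source` are made up for illustration. With a named list the new column holds the element names; with an unnamed list it holds the positions instead.

```r
library(dplyr)

# Hypothetical two-element named list, for illustration only
dfs <- list(
  first  = data.frame(a = c("x", "y"), b = c(1, 2)),
  second = data.frame(a = "z",         b = 3)
)

# .id names a new column that records which list element each row came from
combined <- bind_rows(dfs, .id = "source")

nrow(combined)            # 3 rows: 2 from "first", 1 from "second"
unique(combined$source)   # "first" "second"
```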

@Sibo Jiang 2018-04-29 20:49:20

Nice solution. .id = "column_label" adds a column that identifies the source of each row, based on the list element names.

@JD Long 2019-01-12 12:03:47

since it's 2018 and dplyr is both fast and a solid tool to use, I've changed this to the accepted answer. The years, they fly by!

@David Arenburg 2019-03-25 14:22:46

This was posted 3 times on this very same thread, in 2015, in 2016, and in 2017, so accepting/upvoting this answer makes absolutely no sense.

@rmf 2016-07-21 16:32:13

[Plot: microbenchmark timings for the row-binding methods benchmarked in the code that follows]

Code:

library(microbenchmark)

dflist <- vector(length=100,mode="list")
for(i in 1:100)
{
  dflist[[i]] <- data.frame(a=runif(n=260),b=runif(n=260),
                            c=rep(LETTERS,10),d=rep(LETTERS,10))
}


mb <- microbenchmark(
plyr::rbind.fill(dflist),
dplyr::bind_rows(dflist),
data.table::rbindlist(dflist),
plyr::ldply(dflist,data.frame),
do.call("rbind",dflist),
times=1000)

ggplot2::autoplot(mb)

Session:

R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

> packageVersion("plyr")
[1] ‘1.8.4’
> packageVersion("dplyr")
[1] ‘0.5.0’
> packageVersion("data.table")
[1] ‘1.9.6’

UPDATE: Rerun 31-Jan-2018. Ran on the same computer. New versions of packages. Added seed for seed lovers.


set.seed(21)
library(microbenchmark)

dflist <- vector(length=100,mode="list")
for(i in 1:100)
{
  dflist[[i]] <- data.frame(a=runif(n=260),b=runif(n=260),
                            c=rep(LETTERS,10),d=rep(LETTERS,10))
}


mb <- microbenchmark(
  plyr::rbind.fill(dflist),
  dplyr::bind_rows(dflist),
  data.table::rbindlist(dflist),
  plyr::ldply(dflist,data.frame),
  do.call("rbind",dflist),
  times=1000)

ggplot2::autoplot(mb) + ggplot2::theme_bw()


R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

> packageVersion("plyr")
[1] ‘1.8.4’
> packageVersion("dplyr")
[1] ‘0.7.2’
> packageVersion("data.table")
[1] ‘1.10.4’

@C8H10N4O2 2016-10-19 13:46:31

This is a great answer. I ran the same thing (same OS, same packages, different randomization because you don't set.seed) but saw some differences in worst-case performance. rbindlist actually had the best worst-case as well as the best typical-case performance in my results.

@Nova 2017-08-22 17:04:43

An updated visual for those wanting to compare some of the recent answers (I wanted to compare the purrr to dplyr solution). Basically I combined answers from @TheVTM and @rmf.


Code:

library(microbenchmark)
library(data.table)
library(tidyverse)

dflist <- vector(length=100,mode="list")
for(i in 1:100)
{
  dflist[[i]] <- data.frame(a=runif(n=260),b=runif(n=260),
                            c=rep(LETTERS,10),d=rep(LETTERS,10))
}


mb <- microbenchmark(
  dplyr::bind_rows(dflist),
  data.table::rbindlist(dflist),
  purrr::map_df(dflist, bind_rows),
  do.call("rbind",dflist),
  times=500)

ggplot2::autoplot(mb)

Session Info:

sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Package Versions:

> packageVersion("tidyverse")
[1] ‘1.1.1’
> packageVersion("data.table")
[1] ‘1.10.0’

@f0nzie 2017-06-22 15:30:44

The only thing the data.table solutions are missing is an identifier column that shows which data frame in the list each row comes from.

Something like this:

df_id <- data.table::rbindlist(listOfDataFrames, idcol = TRUE)

The idcol parameter adds a column (.id) identifying which data frame in the list each row came from. The result looks something like this:

.id a           b           c
1   u -0.05315128 -1.31975849
1   b -1.00404849  1.15257952
1   y  1.17478229 -0.91043925
1   q -1.65488899  0.05846295
1   c -1.43730524  0.95245909
1   b  0.56434313  0.93813197
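As a small sketch of how the identifier column behaves (assuming data.table is installed; the list `dfs` and the column name `source` are invented for illustration): passing a string to idcol names the column, and the values come from the list element names when there are any, or from the integer positions when there are none.

```r
library(data.table)

# Hypothetical named list, for illustration only
dfs <- list(
  first  = data.frame(a = c("x", "y"), b = c(1, 2)),
  second = data.frame(a = "z",         b = 3)
)

# idcol = "source" names the identifier column instead of the default ".id"
combined <- rbindlist(dfs, idcol = "source")

# For an unnamed list the identifier is the position in the list
combined_pos <- rbindlist(unname(dfs), idcol = "source")

combined$source       # element names: first, first, second
combined_pos$source   # positions: 1, 1, 2
```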

@yeedle 2017-05-17 02:26:22

Here's another way this can be done. I'm adding it to the answers because reduce is a very effective functional tool that is often overlooked as a replacement for loops. In this particular case, neither of these is significantly faster than do.call.

using base R:

df <- Reduce(rbind, listOfDataFrames)

or, using the tidyverse:

library(tidyverse) # or, library(dplyr); library(purrr)
df <- listOfDataFrames %>% reduce(bind_rows)
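To make the difference from do.call concrete, here is a base R sketch on a small made-up list (`dfs` and its columns are illustrative only). Reduce folds the list pairwise, so it builds an intermediate data frame at every step, whereas do.call hands all the elements to a single rbind call. That is one plausible reason the reduce versions are not faster here.

```r
# Small illustrative list: 4 data frames of 2 rows each
dfs <- lapply(1:4, function(i) data.frame(id = i, x = c(i, i * 10)))

# Reduce folds pairwise:
#   rbind(rbind(rbind(dfs[[1]], dfs[[2]]), dfs[[3]]), dfs[[4]])
by_reduce <- Reduce(rbind, dfs)

# do.call makes one call:
#   rbind(dfs[[1]], dfs[[2]], dfs[[3]], dfs[[4]])
by_docall <- do.call(rbind, dfs)

# Compare the contents of the two results
all.equal(by_reduce, by_docall, check.attributes = FALSE)
```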

@Nick 2017-05-16 13:27:19

How it should be done in the tidyverse:

df.dplyr.purrr <- listOfDataFrames %>% map_df(bind_rows)

@yeedle 2017-05-17 02:16:59

df_dplyr_purrr if you want to be a tidyverse purist...

@Nick 2017-05-17 11:08:15

@yeedle Thanks - almost let that one slip ;)

@see24 2018-07-24 14:51:39

Why would you use map if bind_rows can take a list of dataframes?

@Nick 2018-07-25 15:12:15

@see24 bind_rows has since been updated.

@andrekos 2013-08-28 13:49:09

For the purpose of completeness, I thought the answers to this question required an update. "My guess is that using do.call("rbind", ...) is going to be the fastest approach that you will find..." It was probably true for May 2010 and some time after, but in about Sep 2011 a new function rbindlist was introduced in the data.table package version 1.8.2, with a remark that "This does the same as do.call("rbind",l), but much faster". How much faster?

library(rbenchmark)
benchmark(
  do.call = do.call("rbind", listOfDataFrames),
  plyr_rbind.fill = plyr::rbind.fill(listOfDataFrames), 
  plyr_ldply = plyr::ldply(listOfDataFrames, data.frame),
  data.table_rbindlist = as.data.frame(data.table::rbindlist(listOfDataFrames)),
  replications = 100, order = "relative", 
  columns=c('test','replications', 'elapsed','relative')
  ) 

                  test replications elapsed relative
4 data.table_rbindlist          100    0.11    1.000
1              do.call          100    9.39   85.364
2      plyr_rbind.fill          100   12.08  109.818
3           plyr_ldply          100   15.14  137.636

@KarateSnowMachine 2013-09-18 05:52:42

Thank you so much for this -- I was pulling my hair out because my data sets were getting too big for ldplying a bunch of long, molten data frames. Anyways, I got an incredible speedup by using your rbindlist suggestion.

@andyteucher 2014-07-15 22:56:47

And one more for completeness: dplyr::rbind_all(listOfDataFrames) will do the trick as well.

@rafa.pereira 2015-09-14 15:37:16

Is there an equivalent to rbindlist that appends the data frames by column? Something like a cbindlist?

@Henrik 2018-02-26 13:26:52

@rafa.pereira There is a recent feature request: add function cbindlist

@Graeme Frost 2019-04-02 14:52:24

I was also pulling my hair out because do.call() had been running on a list of data frames for 18 hours, and still hadn't finished, thank you!!!

@TheVTM 2015-04-29 00:32:15

There is also bind_rows(x, ...) in dplyr.

> system.time({ df.Base <- do.call("rbind", listOfDataFrames) })
   user  system elapsed 
   0.08    0.00    0.07 
> 
> system.time({ df.dplyr <- as.data.frame(bind_rows(listOfDataFrames)) })
   user  system elapsed 
   0.01    0.00    0.02 
> 
> identical(df.Base, df.dplyr)
[1] TRUE

@user1617979 2015-06-01 18:06:29

Technically speaking, you do not need the as.data.frame. All it does is make the result exclusively a data.frame, as opposed to also a tbl_df (from dplyr).

@Shane 2010-05-17 17:54:31

One other option is to use a plyr function:

df <- ldply(listOfDataFrames, data.frame)

This is a little slower than the original:

> system.time({ df <- do.call("rbind", listOfDataFrames) })
   user  system elapsed 
   0.25    0.00    0.25 
> system.time({ df2 <- ldply(listOfDataFrames, data.frame) })
   user  system elapsed 
   0.30    0.00    0.29
> identical(df, df2)
[1] TRUE

My guess is that using do.call("rbind", ...) is going to be the fastest approach that you will find unless you can (a) use matrices instead of data.frames and (b) preallocate the final matrix and assign to it rather than growing it.
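A rough sketch of options (a) and (b), under the assumption that every piece is numeric and has the same number of columns (a matrix holds a single type, so the letter column from the question's example would not fit). The object names `pieces`, `starts`, `ends`, and `result` are illustrative.

```r
set.seed(1)

# Hypothetical all-numeric pieces: 5 matrices of 3 rows x 2 columns each
pieces <- lapply(1:5, function(i) matrix(rnorm(6), nrow = 3, ncol = 2))

# Work out which rows of the result each piece will occupy
rows_per_piece <- vapply(pieces, nrow, integer(1))
ends   <- cumsum(rows_per_piece)
starts <- ends - rows_per_piece + 1

# Allocate the full result once...
result <- matrix(NA_real_, nrow = sum(rows_per_piece), ncol = ncol(pieces[[1]]))

# ...then assign each piece into its slot instead of growing the object
for (i in seq_along(pieces)) {
  result[starts[i]:ends[i], ] <- pieces[[i]]
}

# Check against binding everything in one call
all.equal(result, do.call(rbind, pieces))
```

Whether this beats do.call in practice depends on the data, so it is worth timing on your own list before committing to it.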

Edit 1:

Based on Hadley's comment, here's the latest version of rbind.fill from CRAN:

> system.time({ df3 <- rbind.fill(listOfDataFrames) })
   user  system elapsed 
   0.24    0.00    0.23 
> identical(df, df3)
[1] TRUE

This is easier than rbind, and marginally faster (these timings hold up over multiple runs). And as far as I understand it, the version of plyr on github is even faster than this.

@hadley 2010-05-18 00:34:29

rbind.fill in the latest version of plyr is considerably faster than do.call and rbind

@Matt Bannert 2010-11-29 15:32:34

Interesting. For me rbind.fill was the fastest. Weirdly enough, do.call / rbind did not return identical TRUE, even though I could not find a difference. The other two were equal, but plyr was slower.

@baptiste 2013-08-28 15:13:52

I() could replace data.frame in your ldply call

@baptiste 2013-08-28 15:14:25

there's also melt.list in reshape(2)

@smci 2018-03-16 02:47:04

do.call(function(...) rbind(..., make.row.names = FALSE), listOfDataFrames) is useful if you don't want the automatically generated unique row names.
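A short base R sketch of the effect, using a made-up named list `dfs`. It also shows an alternative spelling that appends the argument to the list, so that do.call passes it to rbind as a named argument; this is offered as an equivalent variant, not as the commenter's own code.

```r
# Named list, so rbind builds row names from the element names
dfs <- list(
  first  = data.frame(x = 1:2),
  second = data.frame(x = 3:4)
)

with_names <- do.call(rbind, dfs)
rownames(with_names)      # "first.1" "first.2" "second.1" "second.2"

# Appending the argument to the list passes it through do.call by name
without_names <- do.call(rbind, c(dfs, make.row.names = FALSE))
rownames(without_names)   # "1" "2" "3" "4"
```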

@see24 2018-07-25 16:39:51

bind_rows() is fastest according to rmf's answer, and I think it is the most straightforward. It also has the feature of adding an id column.
