By bshor


2011-11-11 08:16:54 8 Comments

I have a list of many data.frames that I want to merge. The issue here is that each data.frame differs in terms of the number of rows and columns, but they all share the key variables (which I've called "var1" and "var2" in the code below). If the data.frames were identical in terms of columns, I could merely rbind, for which plyr's rbind.fill would do the job, but that's not the case with these data.

Because the merge command only works on 2 data.frames, I turned to the Internet for ideas. I got this one from here, which worked perfectly in R 2.7.2, which is what I had at the time:

merge.rec <- function(.list, ...){
    if(length(.list)==1) return(.list[[1]])
    Recall(c(list(merge(.list[[1]], .list[[2]], ...)), .list[-(1:2)]), ...)
}

And I would call the function like so:

df <- merge.rec(my.list, by.x = c("var1", "var2"), 
                by.y = c("var1", "var2"), all = T, suffixes=c("", ""))

But in any R version after 2.7.2, including 2.11 and 2.12, this code fails with the following error:

Error in match.names(clabs, names(xi)) : 
  names do not match previous names

(Incidently, I see other references to this error elsewhere with no resolution).

Is there any way to solve this?

7 comments

@Moody_Mudskipper 2019-03-03 12:44:12

The function eat of my package safejoin has such feature, if you give it a list of data.frames as a second input it will join them recursively to the first input.

Borrowing and extending the accepted answer's data :

x <- data_frame(i = c("a","b","c"), j = 1:3)
y <- data_frame(i = c("b","c","d"), k = 4:6)
z <- data_frame(i = c("c","d","a"), l = 7:9)
z2 <- data_frame(i = c("a","b","c"), l = rep(100L,3),l2 = rep(100L,3)) # for later

# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
eat(x, list(y,z), .by = "i")
# # A tibble: 3 x 4
#   i         j     k     l
#   <chr> <int> <int> <int>
# 1 a         1    NA     9
# 2 b         2     4    NA
# 3 c         3     5     7

We don't have to take all columns, we can use select helpers from tidyselect and choose (as we start from .x all .x columns are kept):

eat(x, list(y,z), starts_with("l") ,.by = "i")
# # A tibble: 3 x 3
#   i         j     l
#   <chr> <int> <int>
# 1 a         1     9
# 2 b         2    NA
# 3 c         3     7

or remove specific ones:

eat(x, list(y,z), -starts_with("l") ,.by = "i")
# # A tibble: 3 x 3
#   i         j     k
#   <chr> <int> <int>
# 1 a         1    NA
# 2 b         2     4
# 3 c         3     5

If the list is named the names will be used as prefixes :

eat(x, dplyr::lst(y,z), .by = "i")
# # A tibble: 3 x 4
#   i         j   y_k   z_l
#   <chr> <int> <int> <int>
# 1 a         1    NA     9
# 2 b         2     4    NA
# 3 c         3     5     7

If there are column conflicts the .conflict argument allows you to resolve it, for example by taking the first/second one, adding them, coalescing them, or nesting them.

keep first :

eat(x, list(y, z, z2), .by = "i", .conflict = ~.x)
# # A tibble: 3 x 4
#   i         j     k     l
#   <chr> <int> <int> <int>
# 1 a         1    NA     9
# 2 b         2     4    NA
# 3 c         3     5     7

keep last:

eat(x, list(y, z, z2), .by = "i", .conflict = ~.y)
# # A tibble: 3 x 4
#   i         j     k     l
#   <chr> <int> <int> <dbl>
# 1 a         1    NA   100
# 2 b         2     4   100
# 3 c         3     5   100

add:

eat(x, list(y, z, z2), .by = "i", .conflict = `+`)
# # A tibble: 3 x 4
#   i         j     k     l
#   <chr> <int> <int> <dbl>
# 1 a         1    NA   109
# 2 b         2     4    NA
# 3 c         3     5   107

coalesce:

eat(x, list(y, z, z2), .by = "i", .conflict = dplyr::coalesce)
# # A tibble: 3 x 4
#   i         j     k     l
#   <chr> <int> <int> <dbl>
# 1 a         1    NA     9
# 2 b         2     4   100
# 3 c         3     5     7

nest:

eat(x, list(y, z, z2), .by = "i", .conflict = ~tibble(first=.x, second=.y))
# # A tibble: 3 x 4
#   i         j     k l$first $second
#   <chr> <int> <int>   <int>   <int>
# 1 a         1    NA       9     100
# 2 b         2     4      NA     100
# 3 c         3     5       7     100

NA values can be replaced by using the .fill argument.

eat(x, list(y, z), .by = "i", .fill = 0)
# # A tibble: 3 x 4
#   i         j     k     l
#   <chr> <int> <dbl> <dbl>
# 1 a         1     0     9
# 2 b         2     4     0
# 3 c         3     5     7

By default it's an enhanced left_join but all dplyr joins are supported through the .mode argument, fuzzy joins are also supported through the match_fun argument (it's wrapped around the package fuzzyjoin) or giving a formula such as ~ X("var1") > Y("var2") & X("var3") < Y("var4") to the by argument.

@Elias EstatisticsEU 2018-10-17 12:32:07

I had a list of dataframes with no common id column.
I had missing data on many dfs. There were Null values. The dataframes were produced using table function. The Reduce, Merging, rbind, rbind.fill, and their like could not help me to my aim. My aim was to produce an understandable merged dataframe, irrelevant of the missing data and common id column.

Therefore, I made the following function. Maybe this function can help someone.

##########################################################
####             Dependencies                        #####
##########################################################

# Depends on Base R only

##########################################################
####             Example DF                          #####
##########################################################

# Example df
ex_df           <- cbind(c( seq(1, 10, 1), rep("NA", 0), seq(1,10, 1) ), 
                         c( seq(1, 7, 1),  rep("NA", 3), seq(1, 12, 1) ), 
                         c( seq(1, 3, 1),  rep("NA", 7), seq(1, 5, 1), rep("NA", 5) ))

# Making colnames and rownames
colnames(ex_df) <- 1:dim(ex_df)[2]
rownames(ex_df) <- 1:dim(ex_df)[1]

# Making an unequal list of dfs, 
# without a common id column
list_of_df      <- apply(ex_df=="NA", 2, ( table) )

it is following the function

##########################################################
####             The function                        #####
##########################################################


# The function to rbind it
rbind_null_df_lists <- function ( list_of_dfs ) {
  length_df     <- do.call(rbind, (lapply( list_of_dfs, function(x) length(x))))
  max_no        <- max(length_df[,1])
  max_df        <- length_df[max(length_df),]
  name_df       <- names(length_df[length_df== max_no,][1])
  names_list    <- names(list_of_dfs[ name_df][[1]])

  df_dfs <- list()
  for (i in 1:max_no ) {

    df_dfs[[i]]            <- do.call(rbind, lapply(1:length(list_of_dfs), function(x) list_of_dfs[[x]][i]))

  }

  df_cbind               <- do.call( cbind, df_dfs )
  rownames( df_cbind )   <- rownames (length_df)
  colnames( df_cbind )   <- names_list

  df_cbind

}

Running the example

##########################################################
####             Running the example                 #####
##########################################################

rbind_null_df_lists ( list_of_df )

@Paul Rougieux 2015-12-21 10:22:46

Another question asked specifically how to perform multiple left joins using dplyr in R . The question was marked as a duplicate of this one so I answer here, using the 3 sample data frames below:

library(dplyr)
x <- data_frame(i = c("a","b","c"), j = 1:3)
y <- data_frame(i = c("b","c","d"), k = 4:6)
z <- data_frame(i = c("c","d","a"), l = 7:9)

Update June 2018: I divided the answer in three sections representing three different ways to perform the merge. You probably want to use the purrr way if you are already using the tidyverse packages. For comparison purposes below, you'll find a base R version using the same sample dataset.

Join them with reduce from the purrr package

The purrr package provides a reduce function which has a concise syntax:

library(tidyverse)
list(x, y, z) %>% reduce(left_join, by = "i")
#  A tibble: 3 x 4
#  i       j     k     l
#  <chr> <int> <int> <int>
# 1 a      1    NA     9
# 2 b      2     4    NA
# 3 c      3     5     7

You can also perform other joins, such as a full_join or inner_join:

list(x, y, z) %>% reduce(full_join, by = "i")
# A tibble: 4 x 4
# i       j     k     l
# <chr> <int> <int> <int>
# 1 a     1     NA     9
# 2 b     2     4      NA
# 3 c     3     5      7
# 4 d     NA    6      8

list(x, y, z) %>% reduce(inner_join, by = "i")
# A tibble: 1 x 4
# i       j     k     l
# <chr> <int> <int> <int>
# 1 c     3     5     7

dplyr::left_join() with base R Reduce()

list(x,y,z) %>%
    Reduce(function(dtf1,dtf2) left_join(dtf1,dtf2,by="i"), .)

#   i j  k  l
# 1 a 1 NA  9
# 2 b 2  4 NA
# 3 c 3  5  7

Base R merge() with base R Reduce()

And for comparison purposes, here is a base R version of the left join

 Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "i", all.x = TRUE),
        list(x,y,z))
#   i j  k  l
# 1 a 1 NA  9
# 2 b 2  4 NA
# 3 c 3  5  7

@bshor 2016-12-21 04:44:50

The full_join variant works perfectly, and looks a lot less scary than the accepted answer. Not much of a speed difference, though.

@DaveRGP 2017-06-30 08:42:32

@Axeman is right, but you might be able to avoid (visibly) returning a list of data frames at all by using map_dfr() or map_dfc()

@R'n'E 2019-03-07 16:35:21

I though I could join a number of DF based on a pattern using ´ls(pattern = "DF_name_contains_this" )´, but no. Used ´noquote( paste(())´, but I'm still producing a character vector instead of a list of DF. I ended up typing the names, which is obnoxious.

@Paul Rougieux 2019-03-18 13:45:59

Another question provides a python implementation: list of pandas data frames dfs = [df1, df2, df3] then reduce(pandas.merge, dfs).

@dmi3kno 2017-07-28 10:59:54

I will reuse the data example from @PaulRougieux

x <- data_frame(i = c("a","b","c"), j = 1:3)
y <- data_frame(i = c("b","c","d"), k = 4:6)
z <- data_frame(i = c("c","d","a"), l = 7:9)

Here's a short and sweet solution using purrr and tidyr

library(tidyverse)

 list(x, y, z) %>% 
  map_df(gather, key=key, value=value, -i) %>% 
  spread(key, value)

@Charles 2011-11-11 17:12:36

Reduce makes this fairly easy:

merged.data.frame = Reduce(function(...) merge(..., all=T), list.of.data.frames)

Here's a fully example using some mock data:

set.seed(1)
list.of.data.frames = list(data.frame(x=1:10, a=1:10), data.frame(x=5:14, b=11:20), data.frame(x=sample(20, 10), y=runif(10)))
merged.data.frame = Reduce(function(...) merge(..., all=T), list.of.data.frames)
tail(merged.data.frame)
#    x  a  b         y
#12 12 NA 18        NA
#13 13 NA 19        NA
#14 14 NA 20 0.4976992
#15 15 NA NA 0.7176185
#16 16 NA NA 0.3841037
#17 19 NA NA 0.3800352

And here's an example using these data to replicate my.list:

merged.data.frame = Reduce(function(...) merge(..., by=match.by, all=T), my.list)
merged.data.frame[, 1:12]

#  matchname party st district chamber senate1993 name.x v2.x v3.x v4.x senate1994 name.y
#1   ALGIERE   200 RI      026       S         NA   <NA>   NA   NA   NA         NA   <NA>
#2     ALVES   100 RI      019       S         NA   <NA>   NA   NA   NA         NA   <NA>
#3    BADEAU   100 RI      032       S         NA   <NA>   NA   NA   NA         NA   <NA>

Note: It looks like this is arguably a bug in merge. The problem is there is no check that adding the suffixes (to handle overlapping non-matching names) actually makes them unique. At a certain point it uses [.data.frame which does make.unique the names, causing the rbind to fail.

# first merge will end up with 'name.x' & 'name.y'
merge(my.list[[1]], my.list[[2]], by=match.by, all=T)
# [1] matchname    party        st           district     chamber      senate1993   name.x      
# [8] votes.year.x senate1994   name.y       votes.year.y
#<0 rows> (or 0-length row.names)
# as there is no clash, we retain 'name.x' & 'name.y' and get 'name' again
merge(merge(my.list[[1]], my.list[[2]], by=match.by, all=T), my.list[[3]], by=match.by, all=T)
# [1] matchname    party        st           district     chamber      senate1993   name.x      
# [8] votes.year.x senate1994   name.y       votes.year.y senate1995   name         votes.year  
#<0 rows> (or 0-length row.names)
# the next merge will fail as 'name' will get renamed to a pre-existing field.

Easiest way to fix is to not leave the field renaming for duplicates fields (of which there are many here) up to merge. Eg:

my.list2 = Map(function(x, i) setNames(x, ifelse(names(x) %in% match.by,
      names(x), sprintf('%s.%d', names(x), i))), my.list, seq_along(my.list))

The merge/Reduce will then work fine.

@bshor 2011-11-11 21:49:57

Thanks! I saw this solution also on the link from Ramnath. Looks easy enough. But I get the following error: "Error in match.names(clabs, names(xi)) : names do not match previous names". The variables I'm matching on are all present in all the dataframes in the list, so I'm not catching what this error is telling me.

@bshor 2011-11-14 19:28:50

I tested this solution on R2.7.2 and I get the same match.names error. So there's some more fundamental problem with this solution and my data. I used the code: Reduce(function(x, y) merge(x, y, all=T,by.x=match.by, by.y=match.by), my.list, accumulate=F)

@Charles 2011-11-14 20:12:57

Strange, I added the code that I tested it with which runs fine. I guess there is some field-renaming occurring based on the merge args you're using? The merged result must still have the relevant keys in order to be merged with the subsequent data frame.

@Ben Bolker 2011-11-14 22:10:19

I suspect something happening with empty data frames. I tried out some examples like this: empty <- data.frame(x=numeric(0),a=numeric(0); L3 <- c(empty,empty,list.of.data.frames,empty,empty,empty) and got some weird stuff happening that I haven't figured out yet.

@bshor 2011-11-15 19:56:16

@Charles You're onto something. Your code runs fine above for me. And when I adapt it to mine, it runs fine too -- except that it does a merge ignoring the key variables I want. When I try to add key variables rather than leave them out, I get a new error "Error in is.null(x) : 'x' is missing". The code line is "test.reduce <- Reduce(function(...) merge(by=match.by, all=T), my.list)" where match.by are the vector of key variable names I want merged by.

@bshor 2011-11-15 20:08:14

@BenBolker No it can't be empty data frames; your code isn't right. It should be L3 <- list(empty,empty, data.frame(x=1:10, a=1:10), data.frame(x=5:14, b=11:20), data.frame(x=sample(20, 10), y=runif(10)),empty,empty,empty) and then m3 = Reduce(function(...) merge(..., all=T), L3) works just fine.

@bshor 2011-11-15 20:11:46

@Charles Sorry that last code line is wrong. When properly rewritten as test.reduce <- Reduce(function(...) merge(..., by=match.by, all=T), my.list) I get the same old "match.names" error.

@Charles 2011-11-15 20:28:54

Still not able to replicate the problem - see updated answer. Can you provide better sample data? Maybe just save(my.list, file='my.list.RData') and upload?

@bshor 2011-11-15 22:48:25

@Charles You've gone above and beyond -- thanks. I uploaded replication code and real data that is accessed via url to show you the problem. Thanks for showing me pastebin.

@Charles 2011-11-16 13:26:14

The reason it worked without match.by is because it wasn't doing field renaming on name, but rather including that in key.

@bshor 2011-11-16 18:51:13

@Charles Wow; this did it! Both "Recall" and "Reduce" solutions work fine now in 2.12. Thank you. I've never really run into a bug in core R code before... I do wonder why "Recall" worked in 2.7.2 but not now.

@Ramnath 2011-11-11 15:24:39

You can do it using merge_all in the reshape package. You can pass parameters to merge using the ... argument

reshape::merge_all(list_of_dataframes, ...)

Here is an excellent resource on different methods to merge data frames.

@SFun28 2011-11-11 15:29:02

looks like I just replicated merge_recurse =) good to know this function already exists.

@Ramnath 2011-11-11 15:33:36

yes. whenever i have an idea, i always check if @hadley has already done it, and most of the times he has :-)

@bshor 2011-11-11 21:42:43

I'm a little confused; should I do merge_all or merge_recurse? In any case, when I try to add in my additional arguments to either, I get the error "formal argument "all" matched by multiple actual arguments".

@Ramnath 2011-11-12 00:03:53

@bshor. it would be useful to post a few lines of your original data frames, so that your error is reproducible. you can easily do it using dput.

@hadley 2011-11-12 04:45:22

I think I dropped this from reshape2. Reduce + merge is just as simple.

@bshor 2011-11-14 18:39:17

@Ramnath Yikes. My list has 19 data frames, each about 48-50 rows and 600 columns! dput puts tons of data on screen. What's the best way to summarize?

@bshor 2011-11-14 18:53:30

@Ramnath I updated the original post with my attempt to use a real example from my code, but shortening up the data frames for exposition.

@Eduardo 2014-10-22 07:29:15

@Ramnath, link is dead, is there a mirror?

@SFun28 2011-11-11 15:13:55

You can use recursion to do this. I haven't verified the following, but it should give you the right idea:

MergeListOfDf = function( data , ... )
{
    if ( length( data ) == 2 ) 
    {
        return( merge( data[[ 1 ]] , data[[ 2 ]] , ... ) )
    }    
    return( merge( MergeListOfDf( data[ -1 ] , ... ) , data[[ 1 ]] , ... ) )
}

Related Questions

Sponsored Content

43 Answered Questions

[SOLVED] How to make a flat list out of list of lists

28 Answered Questions

[SOLVED] Finding the index of an item given a list containing it in Python

  • 2008-10-07 01:39:38
  • Eugene M
  • 3363418 View
  • 2777 Score
  • 28 Answer
  • Tags:   python list indexing

30 Answered Questions

[SOLVED] How do I check if a list is empty?

  • 2008-09-10 06:20:11
  • Ray Vega
  • 2361560 View
  • 3235 Score
  • 30 Answer
  • Tags:   python list

41 Answered Questions

[SOLVED] How to merge two dictionaries in a single expression?

20 Answered Questions

16 Answered Questions

[SOLVED] Remove rows with all or some NAs (missing values) in data.frame

13 Answered Questions

[SOLVED] How to join (merge) data frames (inner, outer, left, right)

14 Answered Questions

[SOLVED] How to clone or copy a list?

19 Answered Questions

[SOLVED] How to sort a dataframe by multiple column(s)

22 Answered Questions

[SOLVED] How do you merge two Git repositories?

Sponsored Content