How to Rbind Only the Common Columns of Two Data Sets

How to rbind only the common columns of two data sets

Use intersect to retrieve the common columns.

dfr1 <- data.frame(x = 1:5, y = runif(5), z = rnorm(5))
dfr2 <- data.frame(w = letters[1:5], x = 6:10, y = runif(5))
common_cols <- intersect(colnames(dfr1), colnames(dfr2))
rbind(
subset(dfr1, select = common_cols),
subset(dfr2, select = common_cols)
)

As pointed out in the comments, you can replace the last line with

rbind(
dfr1[, common_cols],
dfr2[, common_cols]
)

for a small performance and typing improvement.

rbind(
dfr1[common_cols],
dfr2[common_cols]
)

also works but I think that it's a tiny bit less clear.


You can also use dplyr equivalents for the last step.

library(dplyr)
bind_rows(
dfr1 %>% select({common_cols}),
dfr2 %>% select({common_cols})
)

How to rbind several named dataframes but keep only common columns?

  • Get dataframes in a list.
  • find out the common columns using Reduce + intersect
  • subset each dataframe from list with common columns
  • combine all the data together.
list_data <- mget(paste0("a",l))
common_cols <- Reduce(intersect, lapply(list_data, colnames))
result <- do.call(rbind, lapply(list_data, `[`, common_cols))

You can also make use of purrr::map_df which will make this shorter.

result <- purrr::map_df(list_data, `[`, common_cols)

Combine two data frames by rows (rbind) when they have different sets of columns

rbind.fill from the package plyr might be what you are looking for.

rbind data frames only for same columns

It's not the most elegant solution, but it works.

df <- data.frame()            # empty data.frame
base_names <- names(a) # base_names will reflect any data.frame that has 238 observations
list_df <- list(a, b, c) # list of all your data frames

for(item in list_df){ # create loop

items <- item[, base_names] # only select columns that match the 238 columns
df <- rbind(df, items) # append those to the data.frame

}

df # all data.frames rbinded

If you want to avoid loops, you can also use lapply

library(plyr)
library(dplyr)

df <- data.frame()
base_names <- names(a)
list_df <- list(a, b, c)

lapply(list_df,
function(x){

x_cols <- x[, base_names]
df <- rbind(df, x_cols)

}) %>% plyr::ldply(rbind)

How to rbind two data frames in R?

Use intersect to find the common columns and rbind them.

cols <- intersect(names(emp1), names(emp2))
rbind(emp1[cols], emp2[cols])

# emp_id salary
#1 1 63
#2 2 52
#3 3 6
#4 4 72
#5 5 85
#6 1 63
#7 2 52
#8 3 6
#9 4 72
#10 5 85

using rbind to combine all data sets the names of all data set start with common characters

Try first obtaining a vector of all matching objects using ls() with the pattern ^test:

dfs <- lapply(ls(pattern="^test"), function(x) get(x))
result <- rbindlist(dfs)

I am taking the suggestion by @Rohit to use rbindlist to make our lives easier to rbind together a list of data frames.

function to rbind list of dataframes different columns and rows

This can be achieved with a for loop (I think it could be achieved with mapply to, check ?mapply). The overall strategy is filling each df in the list with NAs (cbinding them) and then rbindlisting the resulting list:

library(data.table)

cols <- max(sapply(df, ncol))

# This is the length of the NA vectors that make the cbinding dfs:
lengths <- (cols - sapply(df, ncol))*sapply(df, nrow)

newdf <- list()

for (i in 1:length(df)){
if (ncol(df[[i]]) != cols){
newdf[[i]] <- cbind(df[[i]],
as.data.frame(matrix(rep(NA, lengths[i]),
ncol = lengths[i] / nrow(df[[i]]))))
} else {
newdf[[i]] <- df[[i]]
}
}

rbindlist(newdf, use.names = FALSE)

Which results in:

   d e    V1  V2
1: 4 c <NA> NA
2: 5 d <NA> NA
3: 1 a one NA
4: 2 b two NA
5: 3 c three NA
6: 6 e one 100
7: 7 f two 101
8: 8 g three 102

Efficient way to rbind data.frames with different columns

UPDATE: See this updated answer instead.

UPDATE (eddi): This has now been implemented in version 1.8.11 as a fill argument to rbind. For example:

DT1 = data.table(a = 1:2, b = 1:2)
DT2 = data.table(a = 3:4, c = 1:2)

rbind(DT1, DT2, fill = TRUE)
# a b c
#1: 1 1 NA
#2: 2 2 NA
#3: 3 NA 1
#4: 4 NA 2


FR #4790 added now - rbind.fill (from plyr) like functionality to merge list of data.frames/data.tables

Note 1:

This solution uses data.table's rbindlist function to "rbind" list of data.tables and for this, be sure to use version 1.8.9 because of this bug in versions < 1.8.9.

Note 2:

rbindlist when binding lists of data.frames/data.tables, as of now, will retain the data type of the first column. That is, if a column in first data.frame is character and the same column in the 2nd data.frame is "factor", then, rbindlist will result in this column being a character. So, if your data.frame consisted of all character columns, then, your solution with this method will be identical to the plyr method. If not, the values will still be the same, but some columns will be character instead of factor. You'll have to convert to "factor" yourself after. Hopefully this behaviour will change in the future.

And now here's using data.table (and benchmarking comparison with rbind.fill from plyr):

require(data.table)
rbind.fill.DT <- function(ll) {
# changed sapply to lapply to return a list always
all.names <- lapply(ll, names)
unq.names <- unique(unlist(all.names))
ll.m <- rbindlist(lapply(seq_along(ll), function(x) {
tt <- ll[[x]]
setattr(tt, 'class', c('data.table', 'data.frame'))
data.table:::settruelength(tt, 0L)
invisible(alloc.col(tt))
tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_]
setcolorder(tt, unq.names)
}))
}

rbind.fill.PLYR <- function(ll) {
rbind.fill(ll)
}

require(microbenchmark)
microbenchmark(t1 <- rbind.fill.DT(ll), t2 <- rbind.fill.PLYR(ll), times=10)
# Unit: seconds
# expr min lq median uq max neval
# t1 <- rbind.fill.DT(ll) 10.8943 11.02312 11.26374 11.34757 11.51488 10
# t2 <- rbind.fill.PLYR(ll) 121.9868 134.52107 136.41375 184.18071 347.74724 10

# for comparison change t2 to data.table
setattr(t2, 'class', c('data.table', 'data.frame'))
data.table:::settruelength(t2, 0L)
invisible(alloc.col(t2))
setcolorder(t2, unique(unlist(sapply(ll, names))))

identical(t1, t2) # [1] TRUE

It should be noted that plyr's rbind.fill edges past this particular data.table solution until list size of about 500.

Benchmarking plot:

Here's the plot on runs with list length of data.frames with seq(1000, 10000, by=1000). I've used microbenchmark with 10 reps on each of these different list lengths.

Sample Image

Benchmarking gist:

Here's the gist for benchmarking, in case anyone wants to replicate the results.



Related Topics



Leave a reply



Submit