Rbind Multiple Data Sets

Rbind multiple Data Frames in a loop

If file_list is a character vector of filenames that have since been loaded into variables in the local environment, then perhaps one of

do.call(rbind.data.frame, mget(ls(pattern = "^df\\s+\\.csv")))
do.call(rbind.data.frame, mget(paste0("df", seq_along(file_list), ".csv")))

The first assumes anything found (as df*.csv) in R's environment is appropriate to grab. It might not grab then in the correct order, so consider using sort or somehow ordering them yourself.

mget takes a string vector and retrieves the value of the object with each name from the given environment (current, by default), returning a list of values.

do.call(rbind.data.frame, ...) does one call to rbind, which is much much faster than iteratively rbinding.

elegant way to use rbind() on multiple dataframes with similar names?

Always try to rigorously capture relations between related instances of data, or related data and methods, or related methods. This generally helps ease aggregate manipulation such as your rbind requirement.

For your case, you should have defined your related data.frames as a single list from the beginning:

foo <- list(data.frame(...), data.frame(...), ... );

And then your requirement could be satisfied thusly:

do.call(rbind, foo );

If it's too late for that, then the solution involving repeated calls to get(), as described in the article to which you linked, can do the job.

How to rbind multiple dataframes with a while-loop?

We need replicate to return as a list

out <- setNames(replicate(10, test, simplify = FALSE), 
paste0("test", seq_len(10)))

If there are multiple datasets already created in the global env, get those in to a list and rbind within do.call

out <- do.call(rbind, mget(paste0("test", 1:10)))

Rbind multiple data.frames with row indicators

How about this one using base R functions:

cbind(indicator=c(rep("df1", nrow(df1)), rep("df2", nrow(df2))) ,dfTotal<-rbind(df1,df2))

would give you:

  indicator c.1..2..3..4.  rnorm.1.4. rnorm.1.4..1 rnorm.1.4..2
1 df1 1 -0.50219235 0.1169713 -0.82525943
2 df1 2 0.13153117 0.3186301 -0.35986213
3 df1 3 -0.07891709 -0.5817907 0.08988614
4 df1 4 0.88678481 0.7145327 0.09627446
5 df2 1 -0.20163395 -0.3888542 -0.43808998
6 df2 2 0.73984050 0.5108563 0.76406062
7 df2 3 0.12337950 -0.9138142 0.26196129
8 df2 4 -0.02931671 2.3102968 0.77340460

DATA

set.seed(100)
df1<-data.frame(c(1,2,3,4),rnorm(1:4),rnorm(1:4),rnorm(1:4))
df2<-data.frame(c(1,2,3,4),rnorm(1:4),rnorm(1:4),rnorm(1:4))
dfTotal<-rbind(df1,df2)

Combine multiple datasets in a more compact way

I like foreach for this purpose:

require(data.table)
require(foreach)

dt <- foreach( path = dataset_paths, .combine='rbind' ) %do% {

dt_t <- fread(path)

}

setkey(dt,frame)

Bind multiple datasets with multiple sheets in R

I'll create some sample xlsx files using openxlsx:

wb <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb, "tab1")
openxlsx::writeData(wb, "tab1", data.frame(a = 1101:1103, b = 1111:1113))
openxlsx::addWorksheet(wb, "tab2")
openxlsx::writeData(wb, "tab2", data.frame(a = 1201:1203, b = 1211:1213))
openxlsx::addWorksheet(wb, "tab3")
openxlsx::writeData(wb, "tab3", data.frame(a = 1301:1303, b = 1311:1313))
openxlsx::saveWorkbook(wb, "book1.xlsx")

wb <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb, "tab1")
openxlsx::writeData(wb, "tab1", data.frame(a = 2101:2103, b = 2111:2113))
openxlsx::addWorksheet(wb, "tab2")
openxlsx::writeData(wb, "tab2", data.frame(a = 2201:2203, b = 2211:2213))
openxlsx::addWorksheet(wb, "tab3")
openxlsx::writeData(wb, "tab3", data.frame(a = 2301:2303, b = 2311:2313))
openxlsx::saveWorkbook(wb, "book2.xlsx")

wb <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb, "tab1")
openxlsx::writeData(wb, "tab1", data.frame(a = 3101:3103, b = 3111:3113))
openxlsx::addWorksheet(wb, "tab2")
openxlsx::writeData(wb, "tab2", data.frame(a = 3201:3203, b = 3211:3213))
openxlsx::addWorksheet(wb, "tab3")
openxlsx::writeData(wb, "tab3", data.frame(a = 3301:3303, b = 3311:3313))
openxlsx::saveWorkbook(wb, "book3.xlsx")

General flow

I'm not sure why you prefer to keep one frame per sheet; if you're doing the same thing to different groups of data, it can still make a lot of sense to have a single frame, keeping as much of the context as possible so that grouping comes naturally.

While base R does do grouping operations, I find them to be slightly less intuitive/flexible than when using the data.table or dplyr packages, so I'll stick with those two for the processing here (and leave you to determine which if either you wish to use, and then adapt your processing to do it group-wise).

Either way, here's my flow:

  1. We need a function that reads in all sheets within a workbook, then iterate this over the vector of file names;
  2. I'll demonstrate putting all data into one frame (my recommendation); and then
  3. I'll demonstrate grouping them by worksheet.

I'll start with data.table, but I'll provide the equivalents in dplyr later.

basic read-all-sheets function

readOneBook <- function(fn) {
shtnms <- openxlsx::getSheetNames(fn)
sheets <- lapply(setNames(nm = shtnms), openxlsx::readWorkbook, xlsxFile = fn)
sheets
}
readOneBook("book1.xlsx")
# $tab1
# a b
# 1 1101 1111
# 2 1102 1112
# 3 1103 1113
# $tab2
# a b
# 1 1201 1211
# 2 1202 1212
# 3 1203 1213
# $tab3
# a b
# 1 1301 1311
# 2 1302 1312
# 3 1303 1313

So we'll create a list for workbooks (which are lists of sheets) with

workbooks <- lapply(setNames(nm = list.files(pattern = "\\.xlsx$")), readOneBook)

data.table

Here's a list where each element is a workbook:

library(data.table)
lapply(workbooks, rbindlist, idcol = "sheet")
# $book1.xlsx
# sheet a b
# 1: tab1 1101 1111
# 2: tab1 1102 1112
# 3: tab1 1103 1113
# 4: tab2 1201 1211
# 5: tab2 1202 1212
# 6: tab2 1203 1213
# 7: tab3 1301 1311
# 8: tab3 1302 1312
# 9: tab3 1303 1313
# $book2.xlsx
# sheet a b
# 1: tab1 2101 2111
# 2: tab1 2102 2112
# 3: tab1 2103 2113
# 4: tab2 2201 2211
# 5: tab2 2202 2212
# 6: tab2 2203 2213
# 7: tab3 2301 2311
# 8: tab3 2302 2312
# 9: tab3 2303 2313
# $book3.xlsx
# sheet a b
# 1: tab1 3101 3111
# 2: tab1 3102 3112
# 3: tab1 3103 3113
# 4: tab2 3201 3211
# 5: tab2 3202 3212
# 6: tab2 3203 3213
# 7: tab3 3301 3311
# 8: tab3 3302 3312
# 9: tab3 3303 3313

And then combining this into one big frame:

rbindlist(
lapply(workbooks, rbindlist, idcol = "sheet"),
idcol = "workbook"
)
# workbook sheet a b
# 1: book1.xlsx tab1 1101 1111
# 2: book1.xlsx tab1 1102 1112
# 3: book1.xlsx tab1 1103 1113
# 4: book1.xlsx tab2 1201 1211
# 5: book1.xlsx tab2 1202 1212
# ---
# 23: book3.xlsx tab2 3202 3212
# 24: book3.xlsx tab2 3203 3213
# 25: book3.xlsx tab3 3301 3311
# 26: book3.xlsx tab3 3302 3312
# 27: book3.xlsx tab3 3303 3313

The list of sheets is slightly different, requiring a bit of "transpose" functionality. This safeguards against (1) sheets that are not present in all workbooks; and (2) different order of sheets.

commonsheets <- Reduce(intersect, lapply(workbooks, names))
commonsheets
# [1] "tab1" "tab2" "tab3"
lapply(setNames(nm = commonsheets),
function(sht) rbindlist(lapply(workbooks, `[[`, sht), idcol = "workbook"))
# $tab1
# workbook a b
# 1: book1.xlsx 1101 1111
# 2: book1.xlsx 1102 1112
# 3: book1.xlsx 1103 1113
# 4: book2.xlsx 2101 2111
# 5: book2.xlsx 2102 2112
# 6: book2.xlsx 2103 2113
# 7: book3.xlsx 3101 3111
# 8: book3.xlsx 3102 3112
# 9: book3.xlsx 3103 3113
# $tab2
# workbook a b
# 1: book1.xlsx 1201 1211
# 2: book1.xlsx 1202 1212
# 3: book1.xlsx 1203 1213
# 4: book2.xlsx 2201 2211
# 5: book2.xlsx 2202 2212
# 6: book2.xlsx 2203 2213
# 7: book3.xlsx 3201 3211
# 8: book3.xlsx 3202 3212
# 9: book3.xlsx 3203 3213
# $tab3
# workbook a b
# 1: book1.xlsx 1301 1311
# 2: book1.xlsx 1302 1312
# 3: book1.xlsx 1303 1313
# 4: book2.xlsx 2301 2311
# 5: book2.xlsx 2302 2312
# 6: book2.xlsx 2303 2313
# 7: book3.xlsx 3301 3311
# 8: book3.xlsx 3302 3312
# 9: book3.xlsx 3303 3313

dplyr

Same functionality, same effective data, so I'll just show the commands (which are really just replacing rbindlist with bind_cols and an argument-name change).

library(dplyr)

# list, one workbook per element
lapply(workbooks, rbindlist, idcol = "sheet")

# one big frame
bind_rows(
lapply(workbooks, bind_rows, .id = "sheet"),
.id = "workbook"
)

# list, one common sheet per element
lapply(setNames(nm = commonsheets),
function(sht) bind_rows(lapply(workbooks, `[[`, sht), .id = "workbook"))


Related Topics



Leave a reply



Submit