How to Speed Up Rbind

How to speed up rbind?

Here are a few options that I'm sure could be better:

library(data.table)
library(microbenchmark)

# function to generate your data
getData <- function() {
  data.frame(x = rnorm(10000), y = rnorm(10000), z = rnorm(10000))
}

# using data.table's rbindlist each iteration
fDT1 <- function(n) {
  dat <- getData()
  for (i in 1:n) {
    dat <- rbindlist(list(dat, getData()))
  }
  return(data.frame(dat))
}

# using data.table's rbindlist all at once
fDT2 <- function(n) {
  return(data.frame(rbindlist(lapply(1:n, function(x) getData()))))
}

# pre-allocating a data frame
fPre <- function(n) {
  dat <- data.frame(x = rep(0, n * 10000), y = rep(0, n * 10000), z = rep(0, n * 10000))
  j <- 1
  for (i in 1:n) {
    dat[j:(j + 10000 - 1), ] <- getData()
    j <- j + 10000
  }
  return(dat)
}

# standard do.call rbind
f2 <- function(n) {
  return(do.call(rbind, lapply(1:n, function(x) getData())))
}

# current approach
f <- function(n) {
  dat <- getData()
  for (i in 1:n) {
    dat <- rbind(dat, getData())
  }
  return(dat)
}

As you can see, using data.table's rbindlist() is a big improvement over base R's rbind(), and there is a big benefit in appending all the rows at once instead of iteration by iteration; however, that may not be possible if there are memory concerns. You may also note that the speed improvements are nowhere near linear as the size of the data increases.

> microbenchmark(fDT2(5),  fDT1(5),  fPre(5),  f2(5),  f(5),
+                fDT2(25), fDT1(25), fPre(25), f2(25), f(25),
+                fDT2(75), fDT1(75), fPre(75), f2(75), f(75),
+                times = 10)
Unit: milliseconds
     expr        min         lq     median         uq         max neval
  fDT2(5)   18.31207   18.63969   24.09943   25.45590    72.01725    10
  fDT1(5)   27.65459   29.25147   36.34158   77.79446    88.82556    10
  fPre(5)   34.96257   39.39723   41.24445   43.30319    68.75897    10
    f2(5)   30.85883   33.00292   36.29100   43.53619    93.15869    10
     f(5)   87.40869   97.97500  134.50600  138.65354   147.67676    10
 fDT2(25)   89.42274   99.39819  103.90944  146.44160   156.01653    10
 fDT1(25)  224.65745  229.78129  261.52388  280.85499   300.93488    10
 fPre(25)  371.12569  412.79876  431.80571  485.37727  1046.96923    10
   f2(25)  221.03669  252.08998  265.17357  271.82414   281.47096    10
    f(25) 1446.32145 1481.01998 1491.59203 1634.99936  1849.00590    10
 fDT2(75)  326.66743  334.15669  367.83848  467.85480   520.27142    10
 fDT1(75) 1749.83842 1882.27091 2066.95241 2278.55589  2419.07205    10
 fPre(75) 3701.16220 3968.64643 4162.70585 4234.39716  4356.09462    10
   f2(75) 1174.47546 1183.98860 1314.64585 1421.09483  1537.42903    10
    f(75) 9139.36935 9349.24412 9510.90888 9977.24621 10861.51206    10

How can I prevent rbind() from getting really slow as the data frame grows larger?

You are in the 2nd circle of hell, namely failing to pre-allocate data structures.

Growing objects in this fashion is a Very Very Bad Thing in R. Either pre-allocate and insert:

df <- data.frame(x = rep(NA, 20000), y = rep(NA, 20000))

or restructure your code to avoid this sort of incremental addition of rows. As discussed in Circle 2 of The R Inferno, the reason for the slowness is that each time you add a row, R needs to find a new contiguous block of memory to fit the data frame in. Lots 'o copying.
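
To fill the pre-allocated frame, assign into successive row ranges rather than rbind-ing; here is a minimal sketch continuing the snippet above, assuming (for illustration) that the 20000 rows arrive in 200 chunks of 100:

# fill the pre-allocated df from above, one block of rows at a time
chunk_size <- 100
for (i in 1:200) {
  rows <- ((i - 1) * chunk_size + 1):(i * chunk_size)
  df[rows, ] <- data.frame(x = rnorm(chunk_size), y = rnorm(chunk_size))
}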

Fastest alternative to rbind.fill

Ashwin Malshé ran a speed comparison of rbind, bind_rows, and rbindlist in 2018: https://rstudio-pubs-static.s3.amazonaws.com/406521_7fc7b6c1dc374e9b8860e15a699d8bb0.html

His results, from fastest to slowest:

  1. rbindlist from data.table is the fastest: more than twice as fast as bind_rows from dplyr.

  2. bind_rows from dplyr, in turn more than 10 times faster than rbind from base R.

  3. rbind from base R, the slowest of the three.

There are certainly a few extreme values in all 3 simulations, but the medians are close to the means, suggesting the extreme values had little influence.
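
If you want to reproduce a comparison like this yourself, here is a minimal sketch (the list size and contents are arbitrary, and exact ratios will vary by machine and package version):

library(dplyr)
library(data.table)
library(microbenchmark)

# 100 small data frames to stack (arbitrary sizes)
dfs <- replicate(100, data.frame(x = rnorm(100), y = rnorm(100)),
                 simplify = FALSE)

microbenchmark(
  base_rbind = do.call(rbind, dfs),
  bind_rows  = bind_rows(dfs),
  rbindlist  = rbindlist(dfs),
  times = 10
)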

How can I speed up a function combining rbind and lapply?

Over large datasets, data.table will be a lot quicker than dplyr:

library(data.table)
setDT(df)[, lapply(.SD, toString), by = c("Case", "Month")][, .N, by = c("Fruits", "Month")]
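
A toy df with the assumed Case, Month, and Fruits columns makes the one-liner above runnable (the real data in the question will differ):

# hypothetical input: define this before running the chain above
df <- data.frame(Case   = c(1, 1, 2, 2),
                 Month  = c("Jan", "Jan", "Jan", "Feb"),
                 Fruits = c("apple", "pear", "apple", "pear"))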

Why would rbind work faster than set for growing a data table?

set is more often used as an alternative to := for fast assignment to elements of a data.table; a typical use looks like the sketch below.
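
A minimal illustration with made-up data, doubling a column in place one element at a time:

library(data.table)

DT <- data.table(a = 1:5, b = letters[1:5])

# set() assigns by reference, skipping the overhead of `[.data.table`,
# which is why it pays off inside tight loops
for (i in seq_len(nrow(DT))) {
  set(DT, i = i, j = "a", value = DT$a[i] * 2L)
}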

As chinsoon12 points out, rbindlist(lapply(filepaths, fread)) should be a faster solution here (see the sketch after the example below). In terms of the example given, one option would be to define a list of the correct length and use rbindlist:

list.way <- function() {
  wildfire_data_list <- vector("list", length = 3)
  for (tile in 1:3) {
    # Normally this data would be read in from an external file,
    # but we'll make some dummy data for this example
    new_wildfire_data <- data.table(
      x = sample(1:1e6, 1000), y = sample(1:1e6, 1000),
      total_PM10 = sample(1:1e6, 1000), total_PM2.5 = sample(1:1e6, 1000),
      total_CH4 = sample(1:1e6, 1000), total_CO = sample(1:1e6, 1000),
      total_CO2 = sample(1:1e6, 1000), total_NOx = sample(1:1e6, 1000),
      total_SO2 = sample(1:1e6, 1000), total_VOC = sample(1:1e6, 1000),
      total_char = sample(1:1e6, 1000)
    )
    wildfire_data_list[[tile]] <- new_wildfire_data
  }
  wildfire_data <- rbindlist(wildfire_data_list)
  return(wildfire_data)
}
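
For the actual use case of reading many files, chinsoon12's suggestion would look something like this, where filepaths stands in for your character vector of file paths:

library(data.table)

# filepaths is assumed to be a character vector of delimited files on disk
wildfire_data <- rbindlist(lapply(filepaths, fread))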

Faster way to rbind data frames from a list of lists

An option would be to transpose the big_list and use bind_rows:

library(dplyr)
library(purrr)

out_lst <- transpose(big_list) %>%
  map(bind_rows)
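
To make the shape concrete, here is a toy big_list, under the assumption that each inner list holds data frames in matching positions:

big_list <- list(
  list(data.frame(a = 1), data.frame(b = 2)),
  list(data.frame(a = 3), data.frame(b = 4))
)
# transpose() regroups by position, so out_lst[[1]] row-binds the "a" frames
# and out_lst[[2]] row-binds the "b" frames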

R - Expected speed of do.call(rbind,...)

This sounds like a scenario where data.table could be dramatically (100-1000x) faster.

https://www.r-bloggers.com/concatenating-a-list-of-data-frames/

Is there a higher order replacement for do.call(rbind, ...)?

Curious what your benchmarks say if you replace the above with:

allgamespbp <- data.table::rbindlist(prebindgames[1:1000]) 
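
A self-contained way to check, with prebindgames simulated here since the real play-by-play list isn't shown:

library(data.table)
library(microbenchmark)

# simulated stand-in for the real list of play-by-play data frames
prebindgames <- replicate(1000,
                          data.frame(play = 1:50, yards = rnorm(50)),
                          simplify = FALSE)

microbenchmark(
  do_call   = do.call(rbind, prebindgames[1:1000]),
  rbindlist = rbindlist(prebindgames[1:1000]),
  times = 5
)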

Increase speed with rbindlist does not work with two for loops

This should be fast, and is quite simple:

test[rep(1:.N,Weight)]
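
With a made-up test to show what that indexing does:

library(data.table)

# toy table: each row should be repeated Weight times
test <- data.table(id = c("a", "b"), Weight = c(2L, 3L))
test[rep(1:.N, Weight)]
# returns rows a, a, b, b, b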

Performance of rbind.data.frame

Can you build your matrices with numeric variables only and convert to a factor at the end? rbind is a lot faster on numeric matrices.

On my system, using data frames:

> system.time(result <- do.call(rbind, someParts))
   user  system elapsed 
  2.628   0.000   2.636 

Building the list with all numeric matrices instead:

onerowdfr2 <- matrix(as.numeric(onerowdfr), nrow = 1)
someParts2 <- lapply(rbinom(200, 1, 14/200) * 6 + 1,
                     function(reps) {onerowdfr2[rep(1, reps), ]})

results in a lot faster rbind.

> system.time(result2 <- do.call(rbind, someParts2))
   user  system elapsed 
  0.001   0.000   0.001 

EDIT: Here's another possibility; it just combines each column in turn.

> system.time({
+   n <- 1:ncol(someParts[[1]])
+   names(n) <- names(someParts[[1]])
+   result <- as.data.frame(lapply(n, function(i)
+     unlist(lapply(someParts, `[[`, i))))
+ })
   user  system elapsed 
  0.810   0.000   0.813 

Still not nearly as fast as using matrices though.

EDIT 2:

If you only have numerics and factors, it's not that hard to convert everything to numeric, rbind them, and convert the necessary columns back to factors. This assumes all factors have exactly the same levels. Converting to a factor from an integer is also faster than from a numeric so I force to integer first.

someParts2 <- lapply(someParts, function(x)
  matrix(unlist(x), ncol = ncol(x)))
result <- as.data.frame(do.call(rbind, someParts2))
a <- someParts[[1]]
f <- which(sapply(a, class) == "factor")
for (i in f) {
  lev <- levels(a[[i]])
  result[[i]] <- factor(as.integer(result[[i]]),
                        levels = seq_along(lev), labels = lev)
}

The timing on my system is:

   user  system elapsed 
  0.090   0.000   0.091 

