Replace rbind in for-loop with lapply? (2nd circle of hell)

The reason that using rbind in a loop like this is bad practice is that on each iteration you enlarge your solution data frame and then copy it to a new object, which is a very slow process and can also lead to memory problems. One way around this is to create a list whose ith component will store the output of the ith loop iteration. The final step is to call rbind on that list (just once, at the end). It will look something like this:

my.list <- vector("list", nrow(myframe))
for (i in 1:nrow(myframe)) {
  # Call all necessary commands to create "values" for iteration i
  my.list[[i]] <- values
}
solution <- do.call(rbind, my.list)  # bind once, after the loop
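As a concrete, runnable sketch of that pattern (`myframe` and the squared column are made-up illustration values):

```r
# Toy input: one output row per input row
myframe <- data.frame(x = 1:5)

my.list <- vector("list", nrow(myframe))
for (i in seq_len(nrow(myframe))) {
  # the "values" for iteration i: a one-row data frame
  my.list[[i]] <- data.frame(x = myframe$x[i], xsq = myframe$x[i]^2)
}
solution <- do.call(rbind, my.list)  # a single 5-row data frame
```

The list grows no faster than its pre-allocated length, and the expensive rbind happens exactly once.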

repeat the assigning of data frame in R

The R way to do this would be to keep all the data sets together in a named list. For that you can use the following, where n is the number of files.

nm <- paste0("P", 1:n)  ## create the names P1, P2, ..., Pn
dfList <- setNames(lapply(paste0(nm, "Rtest.txt"), read.delim), nm)

Now dfList will contain all the data sets. You can access them individually with dfList$P1 for P1, dfList$P2 for P2, and so on.
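A self-contained sketch of the same idea, writing two toy tab-delimited files to a temp directory first so there is something to read (the `P1Rtest.txt` naming mirrors the answer above; the file contents are invented):

```r
# Create two toy files mimicking P1Rtest.txt, P2Rtest.txt
tmp <- tempdir()
for (i in 1:2) {
  write.table(data.frame(v = i * 1:3),
              file.path(tmp, paste0("P", i, "Rtest.txt")),
              sep = "\t", row.names = FALSE)
}

n  <- 2
nm <- paste0("P", 1:n)              ## the names P1, P2
dfList <- setNames(lapply(file.path(tmp, paste0(nm, "Rtest.txt")),
                          read.delim), nm)

dfList$P2$v                         ## access one data set by name
```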

Difference between rbind() and bind_rows() in R

Apart from a few other differences, one of the main reasons for using bind_rows over rbind is to combine two data frames with different numbers of columns. rbind throws an error in that case, whereas bind_rows fills the columns missing from one of the data frames with NA.

Try out the following code to see the difference:

a <- data.frame(a = 1:2, b = 3:4, c = 5:6)
b <- data.frame(a = 7:8, b = 2:3, c = 3:4, d = 8:9)

Results for the two calls are as follows:

> rbind(a, b)
Error in rbind(deparse.level, ...) :
  numbers of columns of arguments do not match

> library(dplyr)
> bind_rows(a, b)
  a b c  d
1 1 3 5 NA
2 2 4 6 NA
3 7 2 3  8
4 8 3 4  9

How to make nested purrr map to extract rows based on dynamic variables instead of nested loop?

Since you mention other alternatives are also welcome, consider base R. Several issues arise from your initial (non-purrr) setup:

  1. One of the biggest issues with the original code is using rbind inside a loop, which leads to excessive copying in memory, as explained in the SO thread Replace rbind in for-loop with lapply? (2nd circle of hell) and in Patrick Burns's The R Inferno, Circle 2: Growing Objects. To resolve this, build a list of data frames and bind them together once, outside the loop.

  2. The repeated use of the scoping assignment operator, <<-, to affect the global environment from inside a local function appears unnecessary, especially since the temp objects are overwritten on each iteration, so only the last iteration's values survive. This operator is often discouraged because adjusting global variables makes code hard to debug. Functions are easiest to reason about when they return a single object.

  3. You initialize an empty data frame, df.exp, before calling calc() but then overwrite it inside the loop with <<-. Usually, after creating an empty matrix or data frame, one assigns into its rows inside the loop, but that is not done here.

  4. Looping through unique() values can be replaced with by() or split(), which also avoids calling dplyr::filter() inside the function. Note, too, that pipes (%>%) carry a performance cost inside loops.

  5. Rather than a for loop, use the apply family, such as lapply, to build a list of objects; this avoids the bookkeeping of for loops (initializing an empty list and assigning elements into it), though there is nothing wrong with that approach. It also avoids using <<- within the function.

Base R (using by, lapply, and do.call)

calc <- function(sub) {

  ## Extract records by "mid", excluding the first record
  temp <- sub[2:nrow(sub), ]

  ## Extract row numbers where aprps == 4
  r.aprps <- which(temp$aprps == 4)

  ## Store exp data frames in a list
  subdf_list <- lapply(1:length(r.aprps), function(j) {

    ## Extract movement by pairs of rows based on "r.aprps"
    temp2 <- temp[(r.aprps[j] - 1):r.aprps[j], ]

    ## Other operations on the actual data set (example shown)
    exp <- data.frame(mid = unique(temp2$mid), expsum = sum(temp2$exph))

    return(exp)
  })

  df.exp <- do.call(rbind, subdf_list)
  return(df.exp)
}

## subset by mid and pass subsets to calc()
df_list <- by(df, df$mid, calc)

## append all in final object
final_df <- do.call(rbind, df_list)

Because base::rbind.data.frame has some disadvantages, consider third-party replacements for do.call(rbind, ...) such as dplyr::bind_rows() and data.table::rbindlist().

df.exp  <- dplyr::bind_rows(subdf_list) 
...
final_df <- dplyr::bind_rows(df_list)

df.exp <- data.table::rbindlist(subdf_list)
...
final_df <- data.table::rbindlist(df_list)

How can I prevent rbind() from getting really slow as the dataframe grows larger?

You are in the 2nd circle of hell, namely failing to pre-allocate data structures.

Growing objects in this fashion is a Very Very Bad Thing in R. Either pre-allocate and insert:

df <- data.frame(x = rep(NA, 20000), y = rep(NA, 20000))

or restructure your code to avoid this sort of incremental addition of rows. As discussed at the link I cite, the reason for the slowness is that each time you add a row, R needs to find a new contiguous block of memory to fit the data frame in. Lots 'o copying.
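A minimal sketch of pre-allocate-and-insert (using a smaller n and invented x/y values purely for illustration):

```r
n <- 100
# Pre-allocate every row up front...
df <- data.frame(x = rep(NA_real_, n), y = rep(NA_real_, n))
# ...then fill rows in place; the frame is never reallocated or copied whole
for (i in seq_len(n)) {
  df[i, ] <- c(i, i^2)
}
```

Contrast this with `df <- rbind(df, newrow)` inside the loop, which copies the entire accumulated frame on every iteration.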

First circle of R hell. 0.1 != 0.3/3

See these questions:

  • In R, what is the difference between these two?
  • Numeric comparison difficulty in R

Generally speaking, you can deal with this by including a tolerance level as per the second link above.
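For instance, the comparison from the title behaves like this, and `all.equal` (or an explicit tolerance) is the usual fix:

```r
0.1 == 0.3 / 3                   # FALSE: binary floating-point rounding
isTRUE(all.equal(0.1, 0.3 / 3))  # TRUE: equal within a tolerance
abs(0.1 - 0.3 / 3) < 1e-9        # TRUE: an explicit tolerance check
```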

Memory efficient alternative to rbind - in-place rbind?

Right now I worked out the following solution:

nextrow <- nrow(df) + 1
df[nextrow:(nextrow + nrow(df.extension) - 1), ] <- df.extension
# we need to assure unique row names
row.names(df) <- 1:nrow(df)

Now I don't run out of memory. I think that's because I only need to store

object.size(df) + 2 * object.size(df.extension)

while with rbind R would need

object.size(rbind(df, df.extension)) + object.size(df) + object.size(df.extension)

After that I use

rm(df.extension)
gc(reset=TRUE)

to free the memory I don't need anymore.

This solved my problem for now, but I feel that there is a more advanced way to do a memory efficient rbind. I appreciate any comments on this solution.
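The indexed-assignment append from the answer can be wrapped in a small helper for reuse (a sketch; `append_rows` is an illustrative name, not an existing function, and the sample frames are invented):

```r
append_rows <- function(df, extension) {
  stopifnot(identical(names(df), names(extension)))
  nextrow <- nrow(df) + 1
  df[nextrow:(nextrow + nrow(extension) - 1), ] <- extension
  row.names(df) <- 1:nrow(df)  # assure unique row names
  df
}

df  <- data.frame(a = 1:3, b = letters[1:3])
ext <- data.frame(a = 4:5, b = letters[4:5])
df  <- append_rows(df, ext)
```

Note this still copies df on modification (R's copy-on-write), but it avoids allocating the extra intermediate that rbind would create.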


