Why Is Rbindlist "Better" Than Rbind

Why is rbindlist better than rbind?

rbindlist is an optimized version of do.call(rbind, list(...)), which is known to be slow when it ends up calling rbind.data.frame.


Where does it really excel?

Some questions that show where rbindlist shines are:

Fast vectorized merge of list of data.frames by row

Trouble converting long list of data.frames (~1 million) to single data.frame using do.call and ldply

These have benchmarks that show how fast it can be.


rbind.data.frame is slow, for a reason

rbind.data.frame does a lot of checking and matches columns by name, so it accounts for the fact that columns may be in different orders. rbindlist does not do this kind of checking by default and binds by position.

For example:

do.call(rbind, list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
##   a b
## 1 1 2
## 2 2 3
## 3 2 1
## 4 3 2

rbindlist(list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
##    a b
## 1: 1 2
## 2: 2 3
## 3: 1 2
## 4: 2 3
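
For comparison, newer data.table releases expose a use.names argument on rbindlist that switches between the two behaviours. The following is a minimal sketch, assuming a data.table version that supports use.names = TRUE/FALSE; the object names (parts, by_name, by_pos) are illustrative only.

```r
library(data.table)

# Two data.frames with the same columns in a different order
parts <- list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3))

# Match columns by name, as rbind.data.frame does
by_name <- rbindlist(parts, use.names = TRUE)

# Bind strictly by position, ignoring the names of the second item
by_pos <- rbindlist(parts, use.names = FALSE)

by_name$a  # 1 2 2 3
by_pos$a   # 1 2 1 2
```

Asking for the behaviour explicitly avoids depending on the default, which may differ between data.table versions.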

Some other limitations of rbindlist

It used to struggle to deal with factors, due to a bug that has since been fixed:

rbindlist two data.tables where one has factor and other has character type for a column (Bug #2650)

It has problems with duplicate column names; see:

Warning message: in rbindlist(allargs) : NAs introduced by coercion: possible bug in data.table? (Bug #2384)


rbind.data.frame rownames can be frustrating

rbindlist can handle lists, data.frames, and data.tables, and returns a data.table without rownames.

You can get into a muddle with rownames when using do.call(rbind, list(...)); see:

How to avoid renaming of rows when using rbind inside do.call?
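
A small sketch of the difference, assuming the input list is named. The names x and y and the id column name "source" are made up for illustration.

```r
library(data.table)

# A named list of data.frames
pieces <- list(x = data.frame(v = 1:2), y = data.frame(v = 3:4))

# Base R builds rownames out of the list names and the row indices
df <- do.call(rbind, pieces)
rownames(df)  # "x.1" "x.2" "y.1" "y.2"

# rbindlist returns no rownames; idcol keeps the list names as a regular column
dt <- rbindlist(pieces, idcol = "source")
dt$source     # "x" "x" "y" "y"
```

Keeping the origin in an ordinary column means it can be filtered or grouped on like any other variable.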


Memory efficiency

rbindlist is implemented in C, so it is memory efficient, and it uses setattr to set attributes by reference.

rbind.data.frame is implemented in R. It does a lot of assigning and uses attr<- (as well as class<- and rownames<-), all of which will (internally) create copies of the created data.frame.
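
To illustrate what "by reference" means here, the sketch below uses data.table's address() helper to check that setattr leaves the object at the same memory location. The vector x and the attribute name "units" are invented for the example, and it only demonstrates the setattr side of the comparison.

```r
library(data.table)

x <- c(1, 2, 3)
before <- address(x)       # memory address before the attribute is set

setattr(x, "units", "cm")  # sets the attribute in place
after <- address(x)        # memory address afterwards

same_object <- identical(before, after)
same_object                # TRUE: no copy of x was made
```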

Why would rbind work faster than set for growing a data table?

set is more often an alternative to := for fast assignment to elements of a data.table. This is one example of how it's normally used.
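
A minimal sketch of that typical use, with an invented two-column table: set() updates individual cells inside a loop without the overhead of calling [.data.table on every iteration.

```r
library(data.table)

# Hypothetical table whose value column is filled in one row at a time
DT <- data.table(id = 1:5, value = rep(0, 5))

for (i in seq_len(nrow(DT))) {
  # Assign by reference to row i of column "value"
  set(DT, i = i, j = "value", value = i^2)
}

DT$value  # 1 4 9 16 25
```

Note that set() changes existing cells; it does not add rows, which is why it is not a natural tool for growing a table.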

As chinsoon12 points out, rbindlist(lapply(filepaths, fread)) should be a faster solution here. In terms of the example given, one option would be to define a list of the correct dimensions and use rbindlist:

library(data.table)

list.way <- function() {
  # Pre-allocate a list with one slot per tile
  wildfire_data_list <- vector("list", length = 3)
  for (tile in 1:3) {
    # Normally this data would be read in from an external file, but we'll make some dummy data for this example
    new_wildfire_data <- data.table(
      x = sample(1:1e6, 1000), y = sample(1:1e6, 1000),
      total_PM10 = sample(1:1e6, 1000), total_PM2.5 = sample(1:1e6, 1000),
      total_CH4 = sample(1:1e6, 1000), total_CO = sample(1:1e6, 1000),
      total_CO2 = sample(1:1e6, 1000), total_NOx = sample(1:1e6, 1000),
      total_SO2 = sample(1:1e6, 1000), total_VOC = sample(1:1e6, 1000),
      total_char = sample(1:1e6, 1000)
    )

    wildfire_data_list[[tile]] <- new_wildfire_data
  }
  # Bind all tiles in a single call instead of growing the table inside the loop
  wildfire_data <- rbindlist(wildfire_data_list)
  return(wildfire_data)
}

Fastest alternative to rbind.fill

Ashwin Malshé ran a speed comparison of rbind, bind_rows, and rbindlist in 2018: https://rstudio-pubs-static.s3.amazonaws.com/406521_7fc7b6c1dc374e9b8860e15a699d8bb0.html

From fastest to slowest:

  1. rbindlist from data.table is the fastest. It’s more than twice as fast as bind_rows from dplyr.

  2. bind_rows from dplyr, which was more than 10 times faster than rbind from base R.

  3. rbind from base R.

There are certainly a few extreme values in all 3 simulations, but the medians are close to the means, suggesting that the extreme values have little influence.
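
Since the question is about replacing rbind.fill, it may help to show the rbindlist argument that covers the same ground: fill = TRUE pads columns missing from an item with NA. This is a sketch with made-up data, not part of the benchmark above.

```r
library(data.table)

# Two tables that share only column a
parts <- list(
  data.table(a = 1:2, b = c("x", "y")),
  data.table(a = 3L, c = TRUE)
)

# fill = TRUE matches columns by name and fills the gaps with NA
filled <- rbindlist(parts, fill = TRUE)

names(filled)  # "a" "b" "c"
filled$b       # "x" "y" NA
filled$c       # NA NA TRUE
```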

How to use rbindlist(data) instead of do.call(rbind, data) in this case

If you use tstrsplit rather than str_split, they will be columns already rather than rows, so you can use as.data.table rather than rbinding them together.

test = c('a1b1', 'a2b2', 'a3b3')

library(data.table)
as.data.table(tstrsplit(tstrsplit(test, 'a')[[2]], 'b'))
#>        V1     V2
#>    <char> <char>
#> 1:      1      1
#> 2:      2      2
#> 3:      3      3

Created on 2022-02-17 by the reprex package (v2.0.1)

This will be much faster, e.g. < 1 second vs 18 seconds if the vector has 100,000 elements.

test = c('a1b1', 'a2b2', 'a3b3')

library(data.table)
library(stringr)
library(bench)

test <- sample(test, 1e5, TRUE)

mark(
  tstrsplit =
    as.data.table(tstrsplit(tstrsplit(test, 'a')[[2]], 'b')),
  str_split = {
    test2 <- rbindlist(test %>% str_split("a") %>% lapply(., function(x)
      as.data.table(t(x))))

    rbindlist(as.matrix(test2) %>% .[,2] %>% str_split("b") %>% lapply(., function(x)
      as.data.table(t(x))))
  }
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tstrsplit   134.8ms  138.7ms    7.16      9.54MB     1.79
#> 2 str_split     18.8s    18.8s    0.0532    3.11GB     2.66


After doing bind_rows() and rbind() on same data.tables , identical() = FALSE?

identical() also compares attributes, and here the attributes are not the same. all.equal has an option to skip the attribute comparison (check.attributes):

all.equal(DT_bindrows, DT_rbind, check.attributes = FALSE)
#[1] TRUE

If we check the str of both datasets, the difference becomes clear:

str(DT_bindrows)
#Classes ‘data.table’ and 'data.frame':  2 obs. of  2 variables:
# $ a: num  1 4
# $ b: num  2 3
str(DT_rbind)
#Classes ‘data.table’ and 'data.frame':  2 obs. of  2 variables:
# $ a: num  1 4
# $ b: num  2 3
# - attr(*, ".internal.selfref")=<externalptr> # reference attribute

After setting that attribute to NULL, identical returns TRUE:

attr(DT_rbind, ".internal.selfref") <- NULL
identical(DT_bindrows, DT_rbind)
#[1] TRUE

Read files into R faster than while{rbind(read.table)}

Using data.table::rbindlist together with fread should be faster:

library(data.table)
beginning <- as.Date("2019-12-01")
ending <- as.Date("2019-12-31")

out <- rbindlist(lapply(paste0(path, "logfile_",
                               seq(beginning, ending, by = "1 day")), fread))

Alternatively, you can use dplyr::bind_rows:

out <- dplyr::bind_rows(lapply(paste0(path, "logfile_",
                                      seq(beginning, ending, by = "1 day")),
                               read.table, stringsAsFactors = FALSE))
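
The snippets above depend on a path and on log files that are not shown. The self-contained sketch below writes three small files to a temporary directory first so that the same pattern can be run end to end; the ".csv" extension, the column names, and the "file" id column are assumptions made for the example.

```r
library(data.table)

# Create a temporary directory holding one small dummy log file per day
tmp_dir <- tempfile("logs_")
dir.create(tmp_dir)
days <- seq(as.Date("2019-12-01"), as.Date("2019-12-03"), by = "1 day")
files <- file.path(tmp_dir, paste0("logfile_", days, ".csv"))
names(files) <- basename(files)

for (k in seq_along(files)) {
  fwrite(data.table(day = as.character(days[k]), value = k), files[k])
}

# Read every file with fread and bind once; idcol records the source file name
out <- rbindlist(lapply(files, fread), idcol = "file")

nrow(out)  # 3

# Remove the temporary files again
unlink(tmp_dir, recursive = TRUE)
```

Reading into a list and binding once avoids the repeated copying that a while loop around rbind(read.table(...)) incurs.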

rbind if column contains partial match R

This can be easily done with data.table.

library(data.table)

You'll need to change the csv reading line, so each object is loaded as a data.table (if your csv are big, you'll notice this is fast too):

# myfilelist <- lapply(anno_files, read.delim,header=T)
myfilelist <- lapply(anno_files, function(x) assign(gsub("(.*)(\\.txt)", "\\1", x), fread(x), .GlobalEnv))

# myfilelist2 <- lapply(cancer_files, read.csv,sep="\t",header=T)
myfilelist2 <- lapply(cancer_files, function(x) assign(gsub("(.*)(\\.txt\\.cancervar)", "\\1", x), fread(x), .GlobalEnv))

Once your objects are created, I'll assume that you know what the pattern on the name is, so you can do something like:

R100_total = rbindlist(lapply(ls(pattern = "R100"), get))
R1080_total = rbindlist(lapply(ls(pattern = "R1080"), get))

This works provided you don't have any other objects matching the same pattern that you don't want to rbind.
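
One way to sidestep that caveat is to keep the tables in a named list rather than assigning them into the global environment, and to select by name. This is a sketch of that alternative, not the approach used above; the table names and contents are hypothetical.

```r
library(data.table)

# Hypothetical tables keyed by file stem, standing in for the fread() results
tables <- list(
  R100_a  = data.table(v = 1),
  R100_b  = data.table(v = 2),
  R1080_a = data.table(v = 3)
)

# Keep only the entries whose name starts with "R100_" and bind them
R100_total <- rbindlist(tables[grepl("^R100_", names(tables))], idcol = "sample")

R100_total$sample  # "R100_a" "R100_b"
```

Because the selection only ever looks inside the list, unrelated objects in the workspace cannot be picked up by accident.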


