Why is rbindlist better than rbind?
rbindlist is an optimized version of do.call(rbind, list(...)), which is known for being slow when using rbind.data.frame.
Where does it really excel
Some questions that show where rbindlist shines are:
Fast vectorized merge of list of data.frames by row
Trouble converting long list of data.frames (~1 million) to single data.frame using do.call and ldply
These have benchmarks that show how fast it can be.
rbind.data.frame is slow, for a reason
rbind.data.frame does a lot of checking and matches columns by name, so it accounts for the fact that columns may be in different orders. rbindlist doesn't do this kind of checking and joins by position.
For example:
do.call(rbind, list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
## a b
## 1 1 2
## 2 2 3
## 3 2 1
## 4 3 2
rbindlist(list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
## a b
## 1: 1 2
## 2: 2 3
## 3: 1 2
## 4: 2 3
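Recent data.table releases expose this choice through the use.names argument of rbindlist. The following is a minimal sketch, assuming a data.table version that supports use.names, which contrasts name-based and positional binding on the same inputs (the object names are illustrative only):

```r
library(data.table)

# Two data.frames with the same columns in a different order
dfs <- list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3))

# Match columns by name, as rbind.data.frame does
by_name <- rbindlist(dfs, use.names = TRUE)

# Bind strictly by position, ignoring the names of later items
by_position <- rbindlist(dfs, use.names = FALSE)

by_name$a      # values taken from the columns named "a"
by_position$a  # values taken from the first column of each item
```

Setting use.names explicitly, rather than relying on the default, makes the intent clear to whoever reads the code next.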
Some other limitations of rbindlist
It used to struggle to deal with factors, due to a bug that has since been fixed:
rbindlist two data.tables where one has factor and other has character type for a column (Bug #2650)
It has problems with duplicate column names; see
Warning message: in rbindlist(allargs) : NAs introduced by coercion: possible bug in data.table? (Bug #2384)
rbind.data.frame rownames can be frustrating
rbindlist can handle lists, data.frames and data.tables, and will return a data.table without rownames.
You can get in a muddle of rownames using do.call(rbind, list(...)); see
How to avoid renaming of rows when using rbind inside do.call?
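As a small illustration of the rownames issue, the sketch below binds a named list both ways. The list and column names are made up for the example; idcol is the rbindlist argument that stores the list names in a column instead of in rownames.

```r
library(data.table)

# A named list of small data.frames (names are illustrative)
pieces <- list(x = data.frame(v = 1:2), y = data.frame(v = 3:4))

# Base R builds row names out of the list names
base_result <- do.call(rbind, pieces)
rownames(base_result)

# rbindlist returns no row names; idcol keeps the list names as a regular column
dt_result <- rbindlist(pieces, idcol = "source")
dt_result$source
```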
Memory efficiency
In terms of memory, rbindlist is implemented in C, so it is memory efficient; it uses setattr to set attributes by reference.
rbind.data.frame is implemented in R; it does lots of assigning and uses attr<- (and class<- and rownames<-), all of which will (internally) create copies of the created data.frame.
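To see what "by reference" means here, the following sketch checks the memory address of a vector before and after setattr, using data.table's address helper. The vector and the attribute name are arbitrary choices for the example.

```r
library(data.table)

x <- c(1, 2, 3)
before <- address(x)

# setattr modifies the object in place and returns it invisibly
setattr(x, "units", "cm")

after <- address(x)
identical(before, after)  # TRUE when the vector was not copied
attr(x, "units")
```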
Why would rbind work faster than set for growing a data table?
set is more often an alternative to := for fast assignment to elements of a data.table. This is one example of how it's normally used.
As chinsoon12 points out, rbindlist(lapply(filepaths, fread)) should be a faster solution here. In terms of the example given, one option would be to define a list of the correct dimensions and use rbindlist:
library(data.table)

list.way <- function() {
  # Preallocate a list with one slot per tile
  wildfire_data_list <- vector("list", length = 3)
  for (tile in 1:3) {
    # Normally this data would be read in from an external file, but we'll make some dummy data for this example
    new_wildfire_data <- data.table(
      x = sample(1:1e6, 1000), y = sample(1:1e6, 1000),
      total_PM10 = sample(1:1e6, 1000), total_PM2.5 = sample(1:1e6, 1000),
      total_CH4 = sample(1:1e6, 1000), total_CO = sample(1:1e6, 1000),
      total_CO2 = sample(1:1e6, 1000), total_NOx = sample(1:1e6, 1000),
      total_SO2 = sample(1:1e6, 1000), total_VOC = sample(1:1e6, 1000),
      total_char = sample(1:1e6, 1000)
    )
    wildfire_data_list[[tile]] <- new_wildfire_data
  }
  # Bind all tiles in a single call
  wildfire_data <- rbindlist(wildfire_data_list)
  return(wildfire_data)
}
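To compare the two strategies directly, the sketch below puts a loop that grows a table with rbind next to one that collects the pieces and binds once. make_tile is a hypothetical stand-in for reading one tile from disk, and n_tiles is an arbitrary size; timings will vary by machine, so none are quoted here.

```r
library(data.table)

# Hypothetical stand-in for reading one tile from an external file
make_tile <- function(n = 1000) {
  data.table(x = sample(1:1e6, n), y = sample(1:1e6, n))
}

n_tiles <- 50

# Growing: each iteration re-binds everything accumulated so far
grow_way <- function() {
  out <- data.table()
  for (tile in seq_len(n_tiles)) out <- rbind(out, make_tile())
  out
}

# Collecting: keep each tile in a list and bind once at the end
collect_way <- function() {
  rbindlist(lapply(seq_len(n_tiles), function(tile) make_tile()))
}

grown <- grow_way()
collected <- collect_way()
dim(grown)
dim(collected)

# Compare elapsed times on your own machine
system.time(grow_way())
system.time(collect_way())
```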
Fastest alternative to rbind.fill
In a 2018 speed comparison of rbind, bind_rows, and rbindlist by Ashwin Malshé (https://rstudio-pubs-static.s3.amazonaws.com/406521_7fc7b6c1dc374e9b8860e15a699d8bb0.html), the results were, in ascending order of run time:
rbindlist from data.table is the fastest. It's more than twice as fast as bind_rows from dplyr.
bind_rows from dplyr, which was more than 10 times faster than rbind from base R.
rbind from base R.
There are certainly a few extreme values in all 3 simulations, but the medians are close to the means, suggesting that the extreme values have little influence.
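Since the question is about replacing rbind.fill, it is worth noting that rbindlist can also pad missing columns through its fill argument. This is a minimal sketch with made-up inputs, not part of the benchmark above:

```r
library(data.table)

# The second table has a column "b" that the first one lacks
parts <- list(data.table(a = 1:2), data.table(a = 3L, b = "x"))

# fill = TRUE matches columns by name and pads missing ones with NA
filled <- rbindlist(parts, fill = TRUE)
filled$b
```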
How to use rbindlist(data) instead of do.call(rbind, data) in this case
If you use tstrsplit rather than str_split, the pieces will already be columns rather than rows, so you can use as.data.table rather than rbinding them together.
test = c('a1b1', 'a2b2', 'a3b3')
library(data.table)
as.data.table(tstrsplit(tstrsplit(test, 'a')[[2]], 'b'))
#> V1 V2
#> <char> <char>
#> 1: 1 1
#> 2: 2 2
#> 3: 3 3
Created on 2022-02-17 by the reprex package (v2.0.1)
This will be much faster, e.g. < 1 second vs about 18 seconds if the vector has 100,000 elements.
test = c('a1b1', 'a2b2', 'a3b3')
library(data.table)
library(stringr)
library(bench)
test <- sample(test, 1e5, TRUE)
mark(
  tstrsplit =
    as.data.table(tstrsplit(tstrsplit(test, 'a')[[2]], 'b')),
  str_split = {
    test2 <- rbindlist(test %>% str_split("a") %>% lapply(., function(x)
      as.data.table(t(x))))
    rbindlist(as.matrix(test2) %>% .[,2] %>% str_split("b") %>% lapply(., function(x)
      as.data.table(t(x))))
  }
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 tstrsplit 134.8ms 138.7ms 7.16 9.54MB 1.79
#> 2 str_split 18.8s 18.8s 0.0532 3.11GB 2.66
Created on 2022-02-17 by the reprex package (v2.0.1)
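If named or typed columns are wanted, tstrsplit can produce them in the same step through its names and type.convert arguments. The sketch below is a variation on the example above; the column names first and second are invented for illustration, and sub() is used here to drop the leading "a" instead of a first tstrsplit call.

```r
library(data.table)

test <- c("a1b1", "a2b2", "a3b3")

# Strip the leading "a", then split on "b" into named, type-converted columns
parts <- tstrsplit(sub("^a", "", test), "b", fixed = TRUE,
                   names = c("first", "second"), type.convert = TRUE)

result <- as.data.table(parts)
str(result)
```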
After doing bind_rows() and rbind() on same data.tables , identical() = FALSE?
identical also compares attributes, and these are not the same for the two objects. With all.equal, there is an option not to check the attributes (check.attributes):
all.equal(DT_bindrows, DT_rbind, check.attributes = FALSE)
#[1] TRUE
If we check the str of both datasets, it becomes clear:
str(DT_bindrows)
#Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
# $ a: num 1 4
# $ b: num 2 3
str(DT_rbind)
#Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
# $ a: num 1 4
# $ b: num 2 3
# - attr(*, ".internal.selfref")=<externalptr> # reference attribute
After assigning NULL to that attribute, identical returns TRUE:
attr(DT_rbind, ".internal.selfref") <- NULL
identical(DT_bindrows, DT_rbind)
#[1] TRUE
Read files into R faster than while{rbind(read.table)}
Using data.table::rbindlist along with fread should be faster:
library(data.table)
beginning <- as.Date("2019-12-01")
ending <- as.Date("2019-12-31")
out <- rbindlist(lapply(paste0(path, "logfile_",
seq(beginning, ending, by = "1 day")), fread))
Alternatively, you can use dplyr::bind_rows:
out <- dplyr::bind_rows(lapply(paste0(path, "logfile_",
seq(beginning, ending, by = "1 day")), read.table, stringsAsFactors = FALSE))
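The snippets above assume that path is already defined and that every daily file exists. A self-contained sketch of the same pattern follows; it writes a few dummy log files to a temporary directory (the directory, file contents, and column names are all invented for the example), skips any missing files, and records the source file in an id column.

```r
library(data.table)

# Hypothetical log directory with dummy files, for illustration only
path <- file.path(tempdir(), "logs")
dir.create(path, showWarnings = FALSE)

days <- seq(as.Date("2019-12-01"), as.Date("2019-12-03"), by = "1 day")
files <- file.path(path, paste0("logfile_", days))

for (i in seq_along(files)) {
  fwrite(data.table(day = as.character(days[i]), value = 1:2), files[i])
}

# Keep only files that exist, and name them so idcol can label the rows
files <- files[file.exists(files)]
names(files) <- basename(files)

# Read every file with fread and bind the results in one call
out <- rbindlist(lapply(files, fread), idcol = "file")
out
```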
rbind if column contains partial match R
This can be easily done with data.table.
library(data.table)
You'll need to change the csv reading line, so each object is loaded as a data.table (if your csv files are big, you'll notice this is fast too):
# myfilelist <- lapply(anno_files, read.delim,header=T)
myfilelist <- lapply(anno_files, function(x) assign(gsub("(.*)(\\.txt)", "\\1", x), fread(x), .GlobalEnv))
# myfilelist2 <- lapply(cancer_files, read.csv,sep="\t",header=T)
myfilelist2 <- lapply(cancer_files, function(x) assign(gsub("(.*)(\\.txt\\.cancervar)", "\\1", x), fread(x), .GlobalEnv))
Once your objects are created, I'll assume that you know what the pattern in the names is, so you can do something like:
R100_total = rbindlist(lapply(ls(pattern = "R100"), get))
R1080_total = rbindlist(lapply(ls(pattern = "R1080"), get))
This works provided you don't have any other objects whose names match the same pattern that you don't want to rbind.
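One way to avoid picking up unrelated objects is to keep the tables in a named list rather than assigning them into the global environment, and then select by name. This is a sketch of that alternative, not the approach used above; the table names, contents, and the bind_matching helper are hypothetical.

```r
library(data.table)

# Stand-in for the files read with fread, keyed by a name derived from the file
tables <- list(
  R100_a  = data.table(id = 1:2),
  R100_b  = data.table(id = 3:4),
  R1080_a = data.table(id = 5:6)
)

# Hypothetical helper: bind only the tables whose names match a pattern
bind_matching <- function(tbls, pattern) {
  rbindlist(tbls[grepl(pattern, names(tbls))], idcol = "source")
}

R100_total  <- bind_matching(tables, "^R100_")
R1080_total <- bind_matching(tables, "^R1080_")
R100_total
```

Anchoring the pattern with ^ and a trailing separator keeps "R100" from matching longer names that merely start with the same characters.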