Difference between rbind() and bind_rows() in R
Apart from a few other differences, one of the main reasons for using bind_rows over rbind is to combine two data frames that have different numbers of columns. rbind throws an error in such a case, whereas bind_rows fills in NA for any column that is missing from one of the data frames.
Try out the following code to see the difference:
a <- data.frame(a = 1:2, b = 3:4, c = 5:6)
b <- data.frame(a = 7:8, b = 2:3, c = 3:4, d = 8:9)
Results for the two calls are as follows:
> rbind(a, b)
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
library(dplyr)
> bind_rows(a, b)
  a b c  d
1 1 3 5 NA
2 2 4 6 NA
3 7 2 3  8
4 8 3 4  9
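If you want to stay in base R, one option is to add the missing columns yourself before calling rbind. The following is a minimal sketch under that assumption, not part of the original answer; the names missing_in_a, a_filled and combined are made up for illustration.

```r
# Same example data as above
a <- data.frame(a = 1:2, b = 3:4, c = 5:6)
b <- data.frame(a = 7:8, b = 2:3, c = 3:4, d = 8:9)

# Columns that exist in b but not in a (hypothetical helper name)
missing_in_a <- setdiff(names(b), names(a))

# Add those columns to a copy of a, filled with NA
a_filled <- a
a_filled[missing_in_a] <- NA

# Both data frames now have the same column names, so rbind no longer errors
combined <- rbind(a_filled, b)
print(combined)
```

This only pads the first data frame; if both sides can be missing columns, the same step has to be repeated in the other direction.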
rbind/bind_rows two unequal data.frames
A possible solution: rename the second column of dat2 with names(dat2)[2] <- names(dat1)[4] before binding the rows, because the column names did not match:
library(tidyverse)
fit <- lm(mpg ~ hp, data = mtcars)
dat1 <- as.data.frame(coef(summary(fit)))
dat2 <- data.frame(Estimate = 2, pr = 0.1234567901, row.names = "Q")
names(dat2)[2] <- names(dat1)[4] # <--- This is CRUCIAL
dat1 %>%
  bind_rows(dat2)
#>                Estimate Std. Error   t value     Pr(>|t|)
#> (Intercept) 30.09886054  1.6339210 18.421246 6.642736e-18
#> hp          -0.06822828  0.0101193 -6.742389 1.787835e-07
#> Q            2.00000000         NA        NA 1.234568e-01
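To see why the rename matters, here is a small sketch with made-up data frames (d1, d2 and the column names p_value and pr are hypothetical). bind_rows matches columns by name, so a name mismatch yields an extra column padded with NA rather than one shared column.

```r
library(dplyr)

# Two toy data frames whose second columns mean the same thing
# but carry different names (illustrative data only)
d1 <- data.frame(Estimate = 1, p_value = 0.5)
d2 <- data.frame(Estimate = 2, pr = 0.1)

# Without renaming: three columns, with NA where a column is absent
unmatched <- bind_rows(d1, d2)

# After renaming: the values land in the same column
names(d2)[2] <- names(d1)[2]
matched <- bind_rows(d1, d2)

print(unmatched)
print(matched)
```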
How to rbind() / dplyr::bind_rows() / data.table::rbindlist() data frames which contain data frame columns?
The problem seems to be that the bind functions have trouble with the row names of the data frame b inside x/y. We can avoid this in base R by renaming the rows (see below).
Important note: dplyr is able to handle this example by now. No workarounds are required anymore.
# Setup
x <- data.frame(a=1)
x$b <- data.frame(z=2)
y <- data.frame(a=3)
y$b <- data.frame(z=4)
rbind(x, y) # still does not work
#> Warning: non-unique value when setting 'row.names': '1'
#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
require(dplyr)
dplyr::bind_rows(x,y) # works!!!
#>   a z
#> 1 1 2
#> 2 3 4
# Avoid conflicting row names
row.names(x) <- seq(nrow(y)+1, nrow(y)+nrow(x))
row.names(x$b) <- seq(nrow(y)+1, nrow(y)+nrow(x))
rbind(x, y) #works now, too
#>   a z
#> 2 1 2
#> 1 3 4
Created on 2020-06-27 by the reprex package (v0.3.0)
Combine two data frames by rows (rbind) when they have different sets of columns
rbind.fill from the plyr package might be what you are looking for.
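A minimal sketch of how that could look, assuming plyr is installed and reusing the example data frames from the first answer on this page (the name filled is made up here):

```r
library(plyr)

# Data frames with different sets of columns
a <- data.frame(a = 1:2, b = 3:4, c = 5:6)
b <- data.frame(a = 7:8, b = 2:3, c = 3:4, d = 8:9)

# rbind.fill stacks the rows and pads columns missing from an input with NA
filled <- rbind.fill(a, b)
print(filled)
```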
bind_rows of different data types
We can use rbindlist from data.table:
library(data.table)
rbindlist(list(ds_a, ds_b))
#    x
#1:  1
#2:  2
#3:  3
#4:  4
#5:  5
#6:  6
#7: z1
#8: z2
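The answer does not show how ds_a and ds_b were built. The sketch below guesses at a pair of inputs that would produce output of that shape, so the data here is an assumption, not the original poster's.

```r
library(data.table)

# Hypothetical inputs: same column name, different types
ds_a <- data.frame(x = 1:6)                                   # integer column
ds_b <- data.frame(x = c("z1", "z2"), stringsAsFactors = FALSE) # character column

# rbindlist stacks them and coerces the column to a common type
combined <- rbindlist(list(ds_a, ds_b))

print(combined)
print(class(combined$x))
```

The integers end up stored as text, so convert back explicitly if you need numeric values later.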
After doing bind_rows() and rbind() on same data.tables , identical() = FALSE?
identical also compares attributes, and these are not the same in the two objects. all.equal has an option to skip the attribute check (check.attributes):
all.equal(DT_bindrows, DT_rbind, check.attributes = FALSE)
#[1] TRUE
If we check the str of both datasets, the difference becomes clear:
str(DT_bindrows)
#Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
# $ a: num 1 4
# $ b: num 2 3
str(DT_rbind)
#Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
# $ a: num 1 4
# $ b: num 2 3
# - attr(*, ".internal.selfref")=<externalptr> # reference attribute
After setting the attribute to NULL, identical returns TRUE:
attr(DT_rbind, ".internal.selfref") <- NULL
identical(DT_bindrows, DT_rbind)
#[1] TRUE
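The same behaviour can be reproduced in base R without data.table. In this sketch the attribute "note" is a made-up stand-in for .internal.selfref, and df1/df2 are illustrative objects only.

```r
# Two data frames with the same contents
df1 <- data.frame(a = c(1, 4), b = c(2, 3))
df2 <- df1

# Attach an extra attribute to one of them (hypothetical stand-in
# for the .internal.selfref attribute seen above)
attr(df2, "note") <- "extra"

strict_before <- identical(df1, df2)                                   # attributes are compared
loose <- isTRUE(all.equal(df1, df2, check.attributes = FALSE))         # attributes are skipped

# Remove the extra attribute and compare strictly again
attr(df2, "note") <- NULL
strict_after <- identical(df1, df2)

print(c(strict_before = strict_before, loose = loose, strict_after = strict_after))
```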
Why does rbind() work and bind_rows() not work in combining these sf objects?
I had the same issue before and found that rbindlist does help to combine the list, but you have to convert the result back to an sf object using st_as_sf(). This works for me:
p <- data.table::rbindlist(list(rtmp, rtmp2),
                           use.names = TRUE,
                           fill = TRUE,
                           idcol = NULL)
st_as_sf(p)
Simple feature collection with 2 features and 1 field
geometry type: POLYGON
dimension: XY
bbox: xmin: 7201955 ymin: 927094.3 xmax: 7212183 ymax: 937804.6
epsg (SRID): NA
proj4string: NA
PROVCODE geometry
1 ON POLYGON ((7201955 935407, 7...
2 ON POLYGON ((6914891 896361.6,...
Why is rbindlist better than rbind?
rbindlist is an optimized version of do.call(rbind, list(...)), which is known for being slow when using rbind.data.frame.
Where does it really excel
Some questions that show where rbindlist shines are:
Fast vectorized merge of list of data.frames by row
Trouble converting long list of data.frames (~1 million) to single data.frame using do.call and ldply
These have benchmarks that show how fast it can be.
rbind.data.frame is slow, for a reason
rbind.data.frame does lots of checking and will match by name (i.e. it accounts for the fact that columns may be in different orders and matches them up by name). rbindlist doesn't do this kind of checking and will join by position, e.g.
do.call(rbind, list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
## a b
## 1 1 2
## 2 2 3
## 3 2 1
## 4 3 2
rbindlist(list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
## a b
## 1: 1 2
## 2: 2 3
## 3: 1 2
## 4: 2 3
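As far as I know, rbindlist also accepts a use.names argument (the sf answer above passes use.names = TRUE), and recent data.table versions may warn when the default is used on mismatched column orders. The sketch below sets the argument explicitly both ways; df1, df2, by_position and by_name are illustrative names.

```r
library(data.table)

# Same columns, different order
df1 <- data.frame(a = 1:2, b = 2:3)
df2 <- data.frame(b = 1:2, a = 2:3)

# Bind strictly by position: df2's first column is stacked under a
by_position <- rbindlist(list(df1, df2), use.names = FALSE)

# Bind by column name: df2's a is stacked under a
by_name <- rbindlist(list(df1, df2), use.names = TRUE)

print(by_position)
print(by_name)
```

Binding by name gives the same values as the do.call(rbind, ...) call, at the cost of the extra name matching.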
Some other limitations of rbindlist
It used to struggle with factors, due to a bug that has since been fixed:
rbindlist two data.tables where one has factor and other has character type for a column (Bug #2650)
It has problems with duplicate column names; see
Warning message: in rbindlist(allargs) : NAs introduced by coercion: possible bug in data.table? (Bug #2384)
rbind.data.frame rownames can be frustrating
rbindlist can handle lists, data.frames and data.tables, and will return a data.table without rownames.
You can get into a muddle with rownames using do.call(rbind, list(...)); see
How to avoid renaming of rows when using rbind inside do.call?
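A small base R sketch of that muddle, using a made-up named list (pieces). The second call passes make.row.names = FALSE through to rbind.data.frame, which is one way to get plain sequential row names.

```r
# A named list of data frames (illustrative data)
pieces <- list(a = data.frame(x = 1:2), b = data.frame(x = 3:4))

# The list names leak into the row names of the result
muddled <- do.call(rbind, pieces)
print(rownames(muddled))

# Passing make.row.names = FALSE keeps automatic row names instead
clean <- do.call(rbind, c(pieces, make.row.names = FALSE))
print(rownames(clean))
```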
Memory efficiency
In terms of memory, rbindlist is implemented in C, so it is memory efficient; it uses setattr to set attributes by reference.
rbind.data.frame is implemented in R. It does lots of assigning and uses attr<- (and class<- and rownames<-), all of which will (internally) create copies of the created data.frame.
Taking the difference between ntiles and then bind_rows in dplyr pipe
This does not need any pre-processing.
library(dplyr)
df %>%
  group_by(date) %>%
  filter(ntile %in% c(1, 5)) %>%
  arrange(ntile) %>%
  summarise(ntile = paste(ntile[1], ntile[n()], sep = "-"),
            score = score[1] - score[n()]) %>%
  bind_rows({df %>% mutate(ntile = as.character(ntile))}, .) %>%
  select(date, ntile, score)
# # A tibble: 474 x 3
# date ntile score
# <date> <chr> <dbl>
# 1 2005-08-31 1 -2.39
# 2 2005-09-30 1 0.573
# 3 2005-10-31 1 -1.61
# 4 2005-11-30 1 5.43
# 5 2005-12-31 1 0.106
# 6 2006-01-31 1 6.66
# 7 2006-02-28 1 0.613
# 8 2006-03-31 1 4.21
# 9 2006-04-30 1 0.107
# 10 2006-05-31 1 -3.62
# # ... with 464 more rows
This is the tail of the data, showing the ntile 1 score minus the ntile 5 score (labelled 1-5) appended to the bottom:
.Last.value %>% tail %>% as.data.frame
# date ntile score
# 1 2018-07-31 1-5 -0.278
# 2 2018-08-31 1-5 -2.01
# 3 2018-09-30 1-5 0.307
# 4 2018-10-31 1-5 -1.36
# 5 2018-11-30 1-5 -1.33
# 6 2018-12-31 1-5 -1.44