Difference between rbind() and bind_rows() in R

Among other differences, one of the main reasons to use bind_rows() over rbind() is to combine two data frames that have different numbers of columns. rbind() throws an error in that case, whereas bind_rows() keeps every column and fills it with NA for the rows that come from a data frame lacking that column.

Try out the following code to see the difference:

a <- data.frame(a = 1:2, b = 3:4, c = 5:6)
b <- data.frame(a = 7:8, b = 2:3, c = 3:4, d = 8:9)

Results for the two calls are as follows:

> rbind(a, b)
Error in rbind(deparse.level, ...) :
  numbers of columns of arguments do not match
> library(dplyr)
> bind_rows(a, b)
  a b c  d
1 1 3 5 NA
2 2 4 6 NA
3 7 2 3  8
4 8 3 4  9
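To make the NA-filling behaviour concrete, here is a rough base R sketch that pads each data frame with the columns it lacks before calling rbind(). The helper name pad_cols is hypothetical (it is not part of base R or dplyr), and the sketch only approximates what bind_rows() does; it performs no type checking.

```r
# Hypothetical helper (not a base R or dplyr function): add any columns
# listed in all_cols that df lacks, filled with NA.
pad_cols <- function(df, all_cols) {
  for (col in setdiff(all_cols, names(df))) {
    df[[col]] <- NA
  }
  df[all_cols]  # return the columns in a common order
}

a <- data.frame(a = 1:2, b = 3:4, c = 5:6)
b <- data.frame(a = 7:8, b = 2:3, c = 3:4, d = 8:9)

all_cols <- union(names(a), names(b))
combined <- rbind(pad_cols(a, all_cols), pad_cols(b, all_cols))
combined
```

With both inputs padded to the same set of columns, rbind() no longer complains about the column count.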

rbind/bind_rows two unequal data.frames

A possible solution: the column names did not match, so names(dat2)[2] <- names(dat1)[4] is required before binding the rows:

library(tidyverse)

fit <- lm(mpg ~ hp, data = mtcars)

dat1 <- as.data.frame(coef(summary(fit)))

dat2 <- data.frame(Estimate = 2, pr = 0.1234567901, row.names = "Q")

names(dat2)[2] <- names(dat1)[4] # <--- This is CRUCIAL

dat1 %>%
  bind_rows(dat2)

#>                Estimate Std. Error   t value     Pr(>|t|)
#> (Intercept) 30.09886054  1.6339210 18.421246 6.642736e-18
#> hp          -0.06822828  0.0101193 -6.742389 1.787835e-07
#> Q            2.00000000         NA        NA 1.234568e-01
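Because bind_rows() matches columns by name, a mismatched name produces an extra column rather than an error. A quick check like the following (a sketch in base R, rebuilding the same dat1 and dat2 without the rename) lists the names in dat2 that have no counterpart in dat1 before binding:

```r
fit <- lm(mpg ~ hp, data = mtcars)
dat1 <- as.data.frame(coef(summary(fit)))
dat2 <- data.frame(Estimate = 2, pr = 0.1234567901, row.names = "Q")

# Columns of dat2 that have no identically named column in dat1
unmatched <- setdiff(names(dat2), names(dat1))
unmatched
```

Any name reported here needs renaming (or will end up as its own column) when the rows are bound.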

How to rbind() / dplyr::bind_rows() / data.table::rbindlist() data frames which contain data frame columns?

The problem seems to be that the bind functions have trouble with the row names of the data frame b nested inside x and y. We can avoid this in base R by renaming the rows (see below).

Important note: dplyr is able to handle this example by now. No workarounds are required anymore.

# Setup
x <- data.frame(a=1)
x$b <- data.frame(z=2)
y <- data.frame(a=3)
y$b <- data.frame(z=4)

rbind(x, y) # still does not work
#> Warning: non-unique value when setting 'row.names': '1'
#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
library(dplyr)
dplyr::bind_rows(x, y) # works!!!
#>   a z
#> 1 1 2
#> 2 3 4


# Avoid conflicting row names
row.names(x) <- seq(nrow(y) + 1, nrow(y) + nrow(x))
row.names(x$b) <- seq(nrow(y) + 1, nrow(y) + nrow(x))

rbind(x, y) # works now, too
#>   a z
#> 2 1 2
#> 1 3 4

Created on 2020-06-27 by the reprex package (v0.3.0)

Combine two data frames by rows (rbind) when they have different sets of columns

rbind.fill from the package plyr might be what you are looking for.
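As a minimal sketch, assuming the plyr package is installed, rbind.fill() can be applied to two data frames with different sets of columns (the data frames a and b here are small illustrative inputs):

```r
a <- data.frame(a = 1:2, b = 3:4, c = 5:6)
b <- data.frame(a = 7:8, b = 2:3, c = 3:4, d = 8:9)

# Columns missing from one input are filled with NA
filled <- plyr::rbind.fill(a, b)
filled
```

Calling the function as plyr::rbind.fill() avoids attaching plyr, which can mask dplyr functions when both packages are loaded.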

bind_rows of different data types

We can use rbindlist from data.table

library(data.table)
rbindlist(list(ds_a, ds_b))
#    x
#1:  1
#2:  2
#3:  3
#4:  4
#5:  5
#6:  6
#7: z1
#8: z2
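The data sets ds_a and ds_b are not shown in the original. The sketch below assumes an integer column in one and a character column in the other, which is consistent with the output above. Under that assumption, rbindlist() coerces the combined column to character:

```r
library(data.table)

# Assumed inputs (not from the original): same column name, different types
ds_a <- data.frame(x = 1:6)
ds_b <- data.frame(x = c("z1", "z2"), stringsAsFactors = FALSE)

combined <- rbindlist(list(ds_a, ds_b))
class(combined$x)  # the integer values have been coerced to character
```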

After doing bind_rows() and rbind() on same data.tables , identical() = FALSE?

identical() also compares attributes, and these differ between the two objects. all.equal() has an option to skip the attribute check (check.attributes = FALSE):

all.equal(DT_bindrows, DT_rbind, check.attributes = FALSE)
#[1] TRUE

Checking str() on both datasets makes the difference clear:

str(DT_bindrows)
#Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
# $ a: num 1 4
# $ b: num 2 3
str(DT_rbind)
#Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
# $ a: num 1 4
# $ b: num 2 3
# - attr(*, ".internal.selfref")=<externalptr> # reference attribute

Setting that attribute to NULL makes identical() return TRUE:

attr(DT_rbind, ".internal.selfref") <- NULL
identical(DT_bindrows, DT_rbind)
#[1] TRUE
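The same effect can be reproduced in base R without data.table. In this sketch the attribute name "note" is made up purely for illustration; any extra attribute has the same consequence:

```r
df1 <- data.frame(a = c(1, 4), b = c(2, 3))
df2 <- df1
attr(df2, "note") <- "extra attribute"  # hypothetical attribute for illustration

same_strict <- identical(df1, df2)                                     # FALSE: attributes differ
same_values <- isTRUE(all.equal(df1, df2, check.attributes = FALSE))   # TRUE: values match

attr(df2, "note") <- NULL
same_after <- identical(df1, df2)                                      # TRUE once the attribute is removed
```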

why does rbind() work and bind_rows() not work in combining these sf objects?

I had the same issue before and found that rbindlist() does combine the list, but you have to convert the result back to an sf object using st_as_sf().

This works for me:

p <- data.table::rbindlist(list(rtmp, rtmp2),
                           use.names = TRUE,
                           fill = TRUE,
                           idcol = NULL)
st_as_sf(p)

Simple feature collection with 2 features and 1 field
geometry type:  POLYGON
dimension:      XY
bbox:           xmin: 7201955 ymin: 927094.3 xmax: 7212183 ymax: 937804.6
epsg (SRID):    NA
proj4string:    NA
  PROVCODE                       geometry
1       ON POLYGON ((7201955 935407, 7...
2       ON POLYGON ((6914891 896361.6,...

Why is rbindlist better than rbind?

rbindlist is an optimized version of do.call(rbind, list(...)), which is known for being slow when using rbind.data.frame


Where does it really excel

Some questions that show where rbindlist shines are

Fast vectorized merge of list of data.frames by row

Trouble converting long list of data.frames (~1 million) to single data.frame using do.call and ldply

These have benchmarks that show how fast it can be.


rbind.data.frame is slow, for a reason

rbind.data.frame does lots of checking and matches columns by name (i.e. it accounts for the fact that columns may be in different orders). rbindlist doesn't do this kind of checking by default and binds by position; pass use.names = TRUE to match columns by name.

For example:

do.call(rbind, list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
##   a b
## 1 1 2
## 2 2 3
## 3 2 1
## 4 3 2

rbindlist(list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
##    a b
## 1: 1 2
## 2: 2 3
## 3: 1 2
## 4: 2 3
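For comparison, here is a short sketch of the same two data frames bound with use.names = TRUE, an rbindlist() argument that matches columns by name rather than by position:

```r
library(data.table)

dfs <- list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3))

# Match columns by name, as rbind.data.frame does
by_name <- rbindlist(dfs, use.names = TRUE)
by_name
```

With name matching, column a holds 1, 2, 2, 3 and column b holds 2, 3, 1, 2, the same values as the do.call(rbind, ...) result.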

Some other limitations of rbindlist

It used to struggle to deal with factors, due to a bug that has since been fixed:

rbindlist two data.tables where one has factor and other has character type for a column (Bug #2650)

It has problems with duplicate column names

see
Warning message: in rbindlist(allargs) : NAs introduced by coercion: possible bug in data.table? (Bug #2384)


rbind.data.frame rownames can be frustrating

rbindlist can handle lists, data.frames, and data.tables, and will return a data.table without rownames.

You can get into a muddle with rownames when using do.call(rbind, list(...)); see:

How to avoid renaming of rows when using rbind inside do.call?
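A small base R sketch of the row name issue: when the list passed to do.call(rbind, ...) is named, the list names are pasted into the row names of the result. The list names first and second are arbitrary illustrative choices.

```r
parts <- list(first = data.frame(x = 1:2), second = data.frame(x = 3:4))

combined <- do.call(rbind, parts)
built_names <- rownames(combined)  # "first.1" "first.2" "second.1" "second.2"

# Reset to the default row names 1..n
rownames(combined) <- NULL
```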


Memory efficiency

In terms of memory, rbindlist is implemented in C, so it is memory efficient; it uses setattr to set attributes by reference.

rbind.data.frame is implemented in R; it does lots of assigning and uses attr<- (and class<- and rownames<-), all of which will (internally) create copies of the created data.frame.

Taking the difference between ntiles and then bind_rows in dplyr pipe

This does not need any pre-processing.

library(dplyr)

df %>%
  group_by(date) %>%
  filter(ntile %in% c(1, 5)) %>%
  arrange(ntile) %>%
  summarise(ntile = paste(ntile[1], ntile[n()], sep = "-"),
            score = score[1] - score[n()]) %>%
  bind_rows({df %>% mutate(ntile = as.character(ntile))}, .) %>%
  select(date, ntile, score)
# # A tibble: 474 x 3
#    date       ntile  score
#    <date>     <chr>  <dbl>
#  1 2005-08-31 1     -2.39
#  2 2005-09-30 1      0.573
#  3 2005-10-31 1     -1.61
#  4 2005-11-30 1      5.43
#  5 2005-12-31 1      0.106
#  6 2006-01-31 1      6.66
#  7 2006-02-28 1      0.613
#  8 2006-03-31 1      4.21
#  9 2006-04-30 1      0.107
# 10 2006-05-31 1     -3.62
# # ... with 464 more rows

This is the tail of the data, showing the difference between the ntile 1 and ntile 5 scores (labelled 1-5) appended at the bottom:

.Last.value %>% tail %>% as.data.frame

#         date ntile  score
# 1 2018-07-31   1-5 -0.278
# 2 2018-08-31   1-5  -2.01
# 3 2018-09-30   1-5  0.307
# 4 2018-10-31   1-5  -1.36
# 5 2018-11-30   1-5  -1.33
# 6 2018-12-31   1-5  -1.44
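The original df is not shown, so the following sketch runs the same steps on a small made-up data frame (two dates, five ntiles each, arbitrary scores) to make the logic easy to follow. The intermediate names spread and result are introduced here for readability only.

```r
library(dplyr)

# Hypothetical toy data: two dates, five ntiles each
df <- data.frame(
  date  = as.Date(rep(c("2020-01-31", "2020-02-29"), each = 5)),
  ntile = rep(1:5, times = 2),
  score = c(1, 2, 3, 4, 5, 10, 8, 6, 4, 2)
)

# Per date: score of ntile 1 minus score of ntile 5, labelled "1-5"
spread <- df %>%
  group_by(date) %>%
  filter(ntile %in% c(1, 5)) %>%
  arrange(ntile) %>%
  summarise(ntile = paste(ntile[1], ntile[n()], sep = "-"),
            score = score[1] - score[n()])

# ntile must be character in both inputs before the rows are bound
result <- bind_rows(mutate(df, ntile = as.character(ntile)), spread)
tail(result, 2)
```

For this toy data the two appended rows carry the label "1-5" with scores 1 - 5 = -4 and 10 - 2 = 8.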

