rbindlist two data.tables where one has factor and other has character type for a column
UPDATE - This bug (#2650) was fixed on 17 May 2013 in v1.8.9
I believe that rbindlist
when applied to factors is combining the numerical values of the factors and using only the levels associated with the first list element.
As in this bug report:
http://r-forge.r-project.org/tracker/index.php?func=detail&aid=2650&group_id=240&atid=975
# Temporary workaround:
levs <- c(as.character(DT.1$x), as.character(DT.2$x))
DT.1[, x := factor(x, levels=levs)]
DT.2[, x := factor(x, levels=levs)]
rbindlist(list(DT.1, DT.2))
As another view of whats going on:
DT3 <- data.table(x=c("1st", "2nd"), y=1:2)
DT4 <- copy(DT3)
DT3[, x := factor(x, levels=x)]
DT4[, x := factor(x, levels=x, labels=rev(x))]
DT3
DT4
# Have a look at the difference:
rbindlist(list(DT3, DT4))$x
# [1] 1st 2nd 1st 2nd
# Levels: 1st 2nd
do.call(rbind, list(DT3, DT4))$x
# [1] 1st 2nd 2nd 1st
# Levels: 1st 2nd
Edit as per comments:
as for observation 1, what's happening is similar to:
x <- factor(LETTERS[1:5])
x[6:10] <- letters[1:5]
x
# Notice however, if you are assigning a value that is already present
x[11] <- "S" # warning, since `S` is not one of the levels of x
x[12] <- "D" # all good, since `D` *is* one of the levels of x
data.table assignment involving factors
Here's a workaround:
dt1[dt2, z := i.y][!is.na(z), y := z][, z := NULL]
Note that z
is a character column and the second assignment works as expected, not really sure why the OP one doesn't.
Make rbindlist skip, ignore or change class attribute of the column
I came up with this inelegant solution that bypasses the problem. Basically, What I am doing is to assign the attributes of the columns of the first item of the list to the columns with the same names of all the other items of the list. Keep in mind that this solution is problematic and, depending on the project, it could be a very wrong practice as it has the potential to mess up your data. However, if what you need is to use rbindlist
to combine your dataframes, this makes the trick
dfs <- list(df1, df2)
varnames <- names(dfs[[1]]) # variable names
vattr <- purrr::map_chr(varnames, ~class(dfs[[1]][[.x]])) # variable attributes
for (i in seq_along(dfs)) {
# assign the same attributes of list 1 to the rest of the lists
for (j in seq_along(varnames)) {
if (varnames[[j]] %in% names(dfs[[i]])) {
class(dfs[[i]][[varnames[[j]]]]) <- vattr[[j]]
}
}
}
df_merged <- data.table::rbindlist(dfs, fill=TRUE, use.names=TRUE)
Best,
Why is rbindlist better than rbind?
rbindlist
is an optimized version of do.call(rbind, list(...))
, which is known for being slow when using rbind.data.frame
Where does it really excel
Some questions that show where rbindlist
shines are
Fast vectorized merge of list of data.frames by row
Trouble converting long list of data.frames (~1 million) to single data.frame using do.call and ldply
These have benchmarks that show how fast it can be.
rbind.data.frame is slow, for a reason
rbind.data.frame
does lots of checking, and will match by name. (i.e. rbind.data.frame will account for the fact that columns may be in different orders, and match up by name), rbindlist
doesn't do this kind of checking, and will join by position
eg
do.call(rbind, list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
## a b
## 1 1 2
## 2 2 3
## 3 2 1
## 4 3 2
rbindlist(list(data.frame(a = 1:5, b = 2:6), data.frame(b = 1:5, a = 2:6)))
## a b
## 1: 1 2
## 2: 2 3
## 3: 1 2
## 4: 2 3
Some other limitations of rbindlist
It used to struggle to deal with factors
, due to a bug that has since been fixed:
rbindlist two data.tables where one has factor and other has character type for a column (Bug #2650)
It has problems with duplicate column names
see
Warning message: in rbindlist(allargs) : NAs introduced by coercion: possible bug in data.table? (Bug #2384)
rbind.data.frame rownames can be frustrating
rbindlist
can handle lists
data.frames
and data.tables
, and will return a data.table without rownames
you can get in a muddle of rownames using do.call(rbind, list(...))
see
How to avoid renaming of rows when using rbind inside do.call?
Memory efficiency
In terms of memory rbindlist
is implemented in C
, so is memory efficient, it uses setattr
to set attributes by reference
rbind.data.frame
is implemented in R
, it does lots of assigning, and uses attr<-
(and class<-
and rownames<-
all of which will (internally) create copies of the created data.frame.
rbindlist for factors with missing levels
I guess rbindlist
is faster because it doesn't do the checking of do.call(rbind.data.frame,...)
Why not to set the levels after binding?
Dt <- rbindlist(list(dt1, dt1))
setattr(Dt$x,"levels",letters) ## set attribute without a copy
from the ?setattr
:
setattr() is useful in many situations to set attributes by reference and can be used on any object or part of an object, not just data.tables.
rbind a list of data frames with different columns
You can use data.table
:
library(data.table)
rbindlist(myList, fill = TRUE)
# x1 x3 x4 x2
#1: 1 2 7 NA
#2: 3 3 8 4
#3: 9 2 9 5
Related Topics
How to Overlay an Image on to a Ggplot
How to Apply a Gradient Fill to a Geom_Rect Object in Ggplot2
R - Scaling Numeric Values Only in a Dataframe with Mixed Types
Applying Rolling Mean by Group in R
Loading Dplyr After Plyr Is Causing Issues
Problems with Dplyr and Posixlt Data
How to Show Every Second R Ggplot2 X-Axis Label Value
Predicting Probabilities for Gbm with Caret Library
Italic Greek Letters in R Plot
Integrate a Very Peaked Function
Rbindlist Two Data.Tables Where One Has Factor and Other Has Character Type for a Column
Shiny Dashboard Mainpanel Height Issue
Fill in Data Frame with Values from Rows Above
Ggplot2 One Line Per Each Row Dataframe
Error: Maximal Number of Dlls Reached