Rbindlist Two Data.Tables Where One Has Factor and Other Has Character Type for a Column


UPDATE - This bug (#2650) was fixed on 17 May 2013 in v1.8.9


I believe that rbindlist, when applied to factors, combines the underlying integer codes of the factors and uses only the levels of the first list element.

As in this bug report:
http://r-forge.r-project.org/tracker/index.php?func=detail&aid=2650&group_id=240&atid=975


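# Temporary workaround: give the column the same levels in both tables

# unique() is needed because factor() rejects duplicated levels
levs <- unique(c(as.character(DT.1$x), as.character(DT.2$x)))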

DT.1[, x := factor(x, levels=levs)]
DT.2[, x := factor(x, levels=levs)]

rbindlist(list(DT.1, DT.2))
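
The answer does not show DT.1 and DT.2, so here is a self-contained sketch of the same workaround with made-up inputs. The two tables and their contents are assumptions for illustration only:

```r
library(data.table)

# Hypothetical inputs: x is a factor in DT.1 and a character column in DT.2
DT.1 <- data.table(x = factor(c("a", "b")), y = 1:2)
DT.2 <- data.table(x = c("c", "a"), y = 3:4)

# Shared level set, in order of first appearance
levs <- unique(c(as.character(DT.1$x), as.character(DT.2$x)))

# Rebuild x in both tables as a factor over the same levels
DT.1[, x := factor(x, levels = levs)]
DT.2[, x := factor(x, levels = levs)]

res <- rbindlist(list(DT.1, DT.2))
as.character(res$x)
levels(res$x)
```

Because both columns share one level set before binding, the result should not depend on how rbindlist reconciles differing levels.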

As another view of what's going on:

DT3 <- data.table(x=c("1st", "2nd"), y=1:2)
DT4 <- copy(DT3)

DT3[, x := factor(x, levels=x)]
DT4[, x := factor(x, levels=x, labels=rev(x))]

DT3
DT4

# Have a look at the difference:
rbindlist(list(DT3, DT4))$x
# [1] 1st 2nd 1st 2nd
# Levels: 1st 2nd

do.call(rbind, list(DT3, DT4))$x
# [1] 1st 2nd 2nd 1st
# Levels: 1st 2nd

Edit as per comments:

As for observation 1, what's happening is similar to:

x <- factor(LETTERS[1:5])

x[6:10] <- letters[1:5] # warning: lowercase letters are not levels of x, so NAs are generated
x

# Notice, however, what happens if you assign a value that is already a level
x[11] <- "S" # warning, since `S` is not one of the levels of x
x[12] <- "D" # all good, since `D` *is* one of the levels of x
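
If the goal is to actually store the new value, one option (not part of the original answer) is to register the level before assigning, or to rebuild the factor from character values. A minimal sketch:

```r
x <- factor(LETTERS[1:5])

# Assigning a value that is not a level yields NA (warning suppressed here)
y <- x
suppressWarnings(y[6] <- "a")

# Option 1: add the level first, then assign
z <- x
levels(z) <- c(levels(z), "a")
z[6] <- "a"

# Option 2: go through character and rebuild the factor
w <- factor(c(as.character(x), "a"))
```

Option 1 keeps the existing level order and appends the new level at the end; option 2 recomputes the levels from scratch, so their order follows the locale's sort order.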

data.table assignment involving factors

Here's a workaround:

dt1[dt2, z := i.y][!is.na(z), y := z][, z := NULL]

Note that z is a character column and the second assignment works as expected; I'm not really sure why the OP's version doesn't.

Make rbindlist skip, ignore or change class attribute of the column

I came up with this inelegant solution that bypasses the problem. Basically, I assign the class of each column in the first item of the list to the columns with the same names in all the other items. Keep in mind that this solution is problematic and, depending on the project, it could be very bad practice, as it has the potential to mess up your data. However, if all you need is to use rbindlist to combine your data frames, this does the trick:


dfs <- list(df1, df2)
varnames <- names(dfs[[1]]) # variable names
# class of each variable in the first item; lapply (rather than
# purrr::map_chr) also copes with columns that have more than one class
vattr <- lapply(varnames, function(v) class(dfs[[1]][[v]]))

for (i in seq_along(dfs)) {
  # assign the classes from the first item to the matching columns of every item
  for (j in seq_along(varnames)) {
    if (varnames[[j]] %in% names(dfs[[i]])) {
      class(dfs[[i]][[varnames[[j]]]]) <- vattr[[j]]
    }
  }
}

df_merged <- data.table::rbindlist(dfs, fill=TRUE, use.names=TRUE)
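
A less risky variant, offered only as a sketch and not as the answer's method, is to convert factor columns to character with as.character() before binding, so values are translated through their labels rather than relabelled in place. The data frames df1 and df2 and the helper to_chr below are hypothetical:

```r
library(data.table)

# Hypothetical inputs: g is a factor in df1 and a character column in df2
df1 <- data.frame(id = 1:2, g = factor(c("a", "b")))
df2 <- data.frame(id = 3:4, g = c("b", "c"), extra = c(TRUE, FALSE),
                  stringsAsFactors = FALSE)

# Hypothetical helper: turn every factor column into a character column
to_chr <- function(df) {
  for (nm in names(df)) {
    if (is.factor(df[[nm]])) df[[nm]] <- as.character(df[[nm]])
  }
  df
}

df_merged <- rbindlist(lapply(list(df1, df2), to_chr),
                       fill = TRUE, use.names = TRUE)
df_merged
```

The column can be turned back into a factor after binding if that is what the downstream code expects.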


Why is rbindlist better than rbind?

rbindlist is an optimized version of do.call(rbind, list(...)), which is known to be slow when it dispatches to rbind.data.frame.


Where does it really excel

Some questions that show where rbindlist shines are

Fast vectorized merge of list of data.frames by row

Trouble converting long list of data.frames (~1 million) to single data.frame using do.call and ldply

These have benchmarks that show how fast it can be.
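
For a rough local comparison, something like the following sketch could be used. The list size and column contents are arbitrary assumptions, and timings will vary by machine and package version, so no figures are quoted here:

```r
library(data.table)

set.seed(1)
# Hypothetical workload: 2000 small data.frames with identical columns
dfs <- replicate(
  2000,
  data.frame(a = runif(5), b = sample(letters, 5), stringsAsFactors = FALSE),
  simplify = FALSE
)

t_base <- system.time(r1 <- do.call(rbind, dfs))  # base R, rbind.data.frame
t_dt   <- system.time(r2 <- rbindlist(dfs))       # data.table

t_base[["elapsed"]]
t_dt[["elapsed"]]
```

Both calls should produce the same rows in the same order; only the class of the result (data.frame versus data.table) and the time taken differ.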


rbind.data.frame is slow, for a reason

rbind.data.frame does lots of checking and matches columns by name, so it accounts for columns that appear in different orders. As called below, rbindlist doesn't do this kind of checking and binds columns by position, unless use.names = TRUE is passed.

e.g.

do.call(rbind, list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
##   a b
## 1 1 2
## 2 2 3
## 3 2 1
## 4 3 2

rbindlist(list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
##    a b
## 1: 1 2
## 2: 2 3
## 3: 1 2
## 4: 2 3
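
The difference between binding by position and binding by name can be made explicit with the use.names argument. This is a small sketch assuming a data.table version that supports use.names (another answer on this page already uses it):

```r
library(data.table)

l <- list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3))

# Bind by position: the second item's b values land in column a
by_pos <- rbindlist(l, use.names = FALSE)

# Bind by name: columns are matched up like rbind.data.frame does
by_name <- rbindlist(l, use.names = TRUE)

by_pos
by_name
```

Passing use.names explicitly documents the intent and avoids depending on the default, which has changed between data.table releases.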

Some other limitations of rbindlist

It used to struggle to deal with factors, due to a bug that has since been fixed:

rbindlist two data.tables where one has factor and other has character type for a column (Bug #2650)

It has problems with duplicate column names; see:

Warning message: in rbindlist(allargs) : NAs introduced by coercion: possible bug in data.table? (Bug #2384)


rbind.data.frame rownames can be frustrating

rbindlist can handle lists, data.frames and data.tables, and will return a data.table without rownames.

You can get into a muddle with rownames when using do.call(rbind, list(...)); see:

How to avoid renaming of rows when using rbind inside do.call?
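
To illustrate the rowname behaviour, here is a short sketch with a made-up named list. The idcol argument of rbindlist is used to keep the list names as a regular column, assuming a data.table version that provides it:

```r
library(data.table)

# Hypothetical named list of data.frames
l <- list(first = data.frame(v = 1:2), second = data.frame(v = 3:4))

# Base R builds rownames from the list names and the row index
df <- do.call(rbind, l)
rownames(df)

# rbindlist returns no rownames; idcol stores the list names in a column instead
dt <- rbindlist(l, idcol = "source")
dt
```

Keeping the origin in a real column tends to be easier to filter and group on than parsing it back out of rownames.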


Memory efficiency

In terms of memory, rbindlist is implemented in C, so it is memory efficient; it uses setattr to set attributes by reference.

rbind.data.frame is implemented in R; it does lots of assigning and uses attr<- (and class<- and rownames<-), all of which will (internally) create copies of the created data.frame.

rbindlist for factors with missing levels

I guess rbindlist is faster because it doesn't do the checking of do.call(rbind.data.frame,...)

Why not set the levels after binding?

Dt <- rbindlist(list(dt1, dt1))
setattr(Dt$x, "levels", letters) ## set attribute without a copy

From ?setattr:

setattr() is useful in many situations to set attributes by reference and can be used on any object or part of an object, not just data.tables.
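
The following sketch shows the by-reference behaviour on a made-up table (dt1 and its levels are assumptions). Note that replacing the levels attribute relabels the existing integer codes by position, so the new vector should keep the current levels in their current positions and only append extra ones:

```r
library(data.table)

# Hypothetical input: a factor column that is missing some levels
dt1 <- data.table(x = factor(c("a", "c"), levels = c("a", "c")))
Dt <- rbindlist(list(dt1, dt1))

before <- address(Dt$x)

# Existing levels stay first and in order; "unused" is appended
setattr(Dt$x, "levels", c("a", "c", "unused"))

after <- address(Dt$x)

identical(before, after)  # same memory address means no copy was made
levels(Dt$x)
```

If the new levels were given in a different order (for example letters when the data only has "a" and "c"), the stored codes would silently point at different labels.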

rbind a list of data frames with different columns

You can use data.table:

library(data.table)
rbindlist(myList, fill = TRUE)
#    x1 x3 x4 x2
# 1:  1  2  7 NA
# 2:  3  3  8  4
# 3:  9  2  9  5
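
The answer does not show myList, so the list below is a hypothetical input chosen to be consistent with the output above. It sketches how fill = TRUE pads columns that are missing from some items:

```r
library(data.table)

# Hypothetical input: the first data.frame has no x2 column
myList <- list(
  data.frame(x1 = 1, x3 = 2, x4 = 7),
  data.frame(x1 = c(3, 9), x2 = c(4, 5), x3 = c(3, 2), x4 = c(8, 9))
)

# fill = TRUE matches columns by name and fills missing ones with NA
res <- rbindlist(myList, fill = TRUE)
res
```

Columns keep the order in which they are first seen, which is why x2 ends up last here.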

