Dplyr Join Warning: Joining Factors with Different Levels

dplyr join warning: joining factors with different levels

That's not an error, that's a warning. And it's telling you that one of the columns you used in your join was a factor and that factor had different levels in the different datasets. In order not to lose any information, the factors were converted to character values. For example:

library(dplyr)
x<-data.frame(a=letters[1:7])
y<-data.frame(a=letters[4:10])

class(x$a)
# [1] "factor"

# NOTE these are different
levels(x$a)
# [1] "a" "b" "c" "d" "e" "f" "g"
levels(y$a)
# [1] "d" "e" "f" "g" "h" "i" "j"

m <- left_join(x,y)
# Joining by: "a"
# Warning message:
# joining factors with different levels, coercing to character vector

class(m$a)
# [1] "character"

You can make sure that both factors have the same levels before merging

combined <- sort(union(levels(x$a), levels(y$a)))
n <- left_join(mutate(x, a=factor(a, levels=combined)),
mutate(y, a=factor(a, levels=combined)))
# Joining by: "a"
class(n$a)
#[1] "factor"

dplyr::left_join produce NA values for new joined columns

There are two problems.

  1. Not specifying the by argument in left_join: In this case, by default all the columns are used as the variables to join by. If we look at the columns - "Gain.Month.1", "Last.Price", "Vol.Month.1" - all numeric class and do not have a matching value in each of the datasets. So, it is better to join by "Firm"

    left_join(Avanza.XML, checkpoint, by = "Firm")
  2. The "Firm" column class - factor: We get warning when there is difference in the levels of the factor column (if it is the variable that we join by). In order to remove the warning, we can either convert the "Firm" column in both datasets to character class

    Avanza.XML$Firm <- as.character(Avanza.XML$Firm)
    checkpoint$Firm <- as.character(checkpoint$Firm)

Or if we still want to keep the columns as factor, then change the levels in the "Firm" to include all the levels in both the datasets

lvls <- sort(unique(c(levels(Avanza.XML$Firm), 
levels(checkpoint$Firm))))
Avanza.XML$Firm <- factor(Avanza.XML$Firm, levels=lvls)
checkpoint$Firm <- factor(checkpoint$Firm, levels=lvls)

and then do the left_join.

dplyr join define NA values

First off, I would like to recommend not to use the combination data.frame(cbind(...)). Here's why: cbind creates a matrix by default if you only pass atomic vectors to it. And matrices in R can only have one type of data (think of matrices as a vector with dimension attribute, i.e. number of rows and columns). Therefore, your code

cbind(c("USD","MYR"),c(0.9,1.1))

creates a character matrix:

str(cbind(c("USD","MYR"),c(0.9,1.1)))
# chr [1:2, 1:2] "USD" "MYR" "0.9" "1.1"

although you probably expected a final data frame with a character or factor column (rate) and a numeric column (value). But what you get is:

str(data.frame(cbind(c("USD","MYR"),c(0.9,1.1))))
#'data.frame': 2 obs. of 2 variables:
# $ X1: Factor w/ 2 levels "MYR","USD": 2 1
# $ X2: Factor w/ 2 levels "0.9","1.1": 1 2

because strings (characters) are converted to factors when using data.frame by default (You can circumvent this by specifying stringsAsFactors = FALSE in the data.frame() call).

I suggest the following alternative approach to create the sample data (also note that you can easily specify the column names in the same call):

lookup <- data.frame(rate = c("USD","MYR"), 
value = c(0.9,1.1))

fx <- data.frame(rate = c("USD","MYR","USD","MYR","XXX","YYY"))

Now, for you actual question, if I understand correctly, you want to replace all NAs with a 1 in the joined data. If that's correct, here's a custom function using left_join and mutate_each to do that:

library(dplyr)
left_join_NA <- function(x, y, ...) {
left_join(x = x, y = y, by = ...) %>%
mutate_each(funs(replace(., which(is.na(.)), 1)))
}

Now you can apply it to your data like this:

> left_join_NA(x = fx, y = lookup, by = "rate")
# rate value
#1 USD 0.9
#2 MYR 1.1
#3 USD 0.9
#4 MYR 1.1
#5 XXX 1.0
#6 YYY 1.0
#Warning message:
#joining factors with different levels, coercing to character vector

Note that you end up with a character column (rate) and a numeric column (value) and all NAs are replaced by 1.

str(left_join_NA(x = fx, y = lookup, by = "rate"))
#'data.frame': 6 obs. of 2 variables:
# $ rate : chr "USD" "MYR" "USD" "MYR" ...
# $ value: num 0.9 1.1 0.9 1.1 1 1

Conditional Join with DPLYR

You could use the devel version of data.table

library(data.table)#v1.9.5+
setDT(df1)[df2, on=c('t'='t.nr')][year!=2011, value_1:='0'][]
# t value_1 year value
#1: 1 0 2010 0.2
#2: 1 0.9 2011 0.5
#3: 2 0 2012 0.7
#4: 7 0 2013 0.3


Related Topics



Leave a reply



Submit