dplyr join warning: joining factors with different levels
That's not an error, that's a warning. And it's telling you that one of the columns you used in your join was a factor and that factor had different levels in the different datasets. In order not to lose any information, the factors were converted to character values. For example:
library(dplyr)
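# stringsAsFactors = TRUE is needed on R >= 4.0.0, where strings are no longer converted to factors by default
x <- data.frame(a = letters[1:7], stringsAsFactors = TRUE)
y <- data.frame(a = letters[4:10], stringsAsFactors = TRUE)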
class(x$a)
# [1] "factor"
# NOTE these are different
levels(x$a)
# [1] "a" "b" "c" "d" "e" "f" "g"
levels(y$a)
# [1] "d" "e" "f" "g" "h" "i" "j"
m <- left_join(x,y)
# Joining by: "a"
# Warning message:
# joining factors with different levels, coercing to character vector
class(m$a)
# [1] "character"
You can make sure that both factors have the same levels before merging:
combined <- sort(union(levels(x$a), levels(y$a)))
n <- left_join(mutate(x, a = factor(a, levels = combined)),
               mutate(y, a = factor(a, levels = combined)))
# Joining by: "a"
class(n$a)
#[1] "factor"
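If you need this in more than one place, the level-harmonising step can be wrapped in a small function. The following is only a sketch: harmonise_levels is a hypothetical helper name, not part of dplyr, and it assumes the key column has the same name in both data frames.

```r
# Hypothetical helper (not part of dplyr): give the key column `col`
# the same set of levels in both data frames.
harmonise_levels <- function(x, y, col) {
  # Use the declared levels for factors, the distinct values otherwise
  lv <- function(v) if (is.factor(v)) levels(v) else unique(as.character(v))
  combined <- sort(union(lv(x[[col]]), lv(y[[col]])))
  x[[col]] <- factor(x[[col]], levels = combined)
  y[[col]] <- factor(y[[col]], levels = combined)
  list(x = x, y = y)
}

x <- data.frame(a = letters[1:7], stringsAsFactors = TRUE)
y <- data.frame(a = letters[4:10], stringsAsFactors = TRUE)
h <- harmonise_levels(x, y, "a")

# Both key columns now share one set of levels
identical(levels(h$x$a), levels(h$y$a))
```

The two elements h$x and h$y can then be passed to left_join in place of x and y.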
dplyr::left_join produce NA values for new joined columns
There are two problems.
1. Not specifying the by argument in left_join: in this case, by default all the columns common to both datasets are used as the variables to join by. If we look at the other shared columns - "Gain.Month.1", "Last.Price", "Vol.Month.1" - they are all of numeric class and do not have matching values in the two datasets. So, it is better to join by "Firm":
left_join(Avanza.XML, checkpoint, by = "Firm")
2. The "Firm" column is of class factor: we get a warning when the levels of a factor column differ between the datasets (if it is the variable that we join by). To remove the warning, we can either convert the "Firm" column in both datasets to character class:
Avanza.XML$Firm <- as.character(Avanza.XML$Firm)
checkpoint$Firm <- as.character(checkpoint$Firm)
Or, if we still want to keep the columns as factor, change the levels of "Firm" to include all the levels in both datasets:
lvls <- sort(unique(c(levels(Avanza.XML$Firm),
                      levels(checkpoint$Firm))))
Avanza.XML$Firm <- factor(Avanza.XML$Firm, levels = lvls)
checkpoint$Firm <- factor(checkpoint$Firm, levels = lvls)
and then do the left_join.
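To see both fixes together, here is a small self-contained sketch. The data frames avanza and checkpoint_demo and their values are made up for illustration (the original Avanza.XML and checkpoint data are not shown); only the column names "Firm" and "Last.Price" are taken from the question.

```r
library(dplyr)

# Made-up stand-ins for the two datasets in the question
avanza <- data.frame(Firm = c("A", "B", "C"),
                     Last.Price = c(10, 20, 30),
                     stringsAsFactors = TRUE)
checkpoint_demo <- data.frame(Firm = c("B", "C", "D"),
                              Last.Price = c(11, 21, 31),
                              stringsAsFactors = TRUE)

# Fix 2: turn the factor key into character in both datasets
avanza$Firm <- as.character(avanza$Firm)
checkpoint_demo$Firm <- as.character(checkpoint_demo$Firm)

# Fix 1: join on "Firm" only; the other shared column gets .x/.y suffixes
joined <- left_join(avanza, checkpoint_demo, by = "Firm")
names(joined)
```

Firm "A" has no match in checkpoint_demo, so its Last.Price.y is NA, which is the expected result of a left join rather than a sign of a broken key.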
dplyr join define NA values
First off, I would recommend not using the combination data.frame(cbind(...)). Here's why: cbind creates a matrix by default if you only pass atomic vectors to it. And matrices in R can only hold one type of data (think of a matrix as a vector with a dimension attribute, i.e. number of rows and columns). Therefore, your code
cbind(c("USD","MYR"),c(0.9,1.1))
creates a character matrix:
str(cbind(c("USD","MYR"),c(0.9,1.1)))
# chr [1:2, 1:2] "USD" "MYR" "0.9" "1.1"
although you probably expected a final data frame with a character or factor column (rate) and a numeric column (value). But what you get is:
str(data.frame(cbind(c("USD","MYR"),c(0.9,1.1))))
#'data.frame': 2 obs. of 2 variables:
# $ X1: Factor w/ 2 levels "MYR","USD": 2 1
# $ X2: Factor w/ 2 levels "0.9","1.1": 1 2
because data.frame converts strings (characters) to factors by default in R versions before 4.0.0 (you can circumvent this by specifying stringsAsFactors = FALSE in the data.frame() call).
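The difference between the two construction styles can be checked directly. This is a minimal base R sketch; the object names m, df_cbind and df_direct are chosen here for illustration.

```r
# cbind on atomic vectors gives a matrix, so the numbers become strings
m <- cbind(c("USD", "MYR"), c(0.9, 1.1))
is.character(m)

# Even with stringsAsFactors = FALSE the second column stays character
df_cbind <- data.frame(m, stringsAsFactors = FALSE)
class(df_cbind$X2)

# Building the data frame directly keeps each column's own type
df_direct <- data.frame(rate = c("USD", "MYR"),
                        value = c(0.9, 1.1),
                        stringsAsFactors = FALSE)
class(df_direct$value)
```

So stringsAsFactors = FALSE only avoids the factor conversion; it does not bring back the numeric type that cbind already discarded.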
I suggest the following alternative approach to create the sample data (also note that you can easily specify the column names in the same call):
lookup <- data.frame(rate = c("USD","MYR"),
                     value = c(0.9,1.1))
fx <- data.frame(rate = c("USD","MYR","USD","MYR","XXX","YYY"))
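Now, for your actual question: if I understand correctly, you want to replace all NAs with a 1 in the joined data. If that's correct, here's a custom function using left_join and mutate with across to do that for the numeric columns (across requires dplyr 1.0.0 or later and takes the place of the deprecated mutate_each and funs):
library(dplyr)
left_join_NA <- function(x, y, ...) {
  # Forward the join arguments (e.g. by = "rate") to left_join
  left_join(x = x, y = y, ...) %>%
    # Replace NA with 1 in every numeric column of the joined result
    mutate(across(where(is.numeric), ~ replace(.x, is.na(.x), 1)))
}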
Now you can apply it to your data like this:
> left_join_NA(x = fx, y = lookup, by = "rate")
# rate value
#1 USD 0.9
#2 MYR 1.1
#3 USD 0.9
#4 MYR 1.1
#5 XXX 1.0
#6 YYY 1.0
#Warning message:
#joining factors with different levels, coercing to character vector
Note that you end up with a character column (rate) and a numeric column (value) and all NAs are replaced by 1.
str(left_join_NA(x = fx, y = lookup, by = "rate"))
#'data.frame': 6 obs. of 2 variables:
# $ rate : chr "USD" "MYR" "USD" "MYR" ...
# $ value: num 0.9 1.1 0.9 1.1 1 1
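If only one known column needs a default, a narrower variant is to fill just that column with dplyr's coalesce. This is a sketch, assuming the lookup and fx data from above are built with character keys so that no factor warning is involved; the name filled is chosen here for illustration.

```r
library(dplyr)

lookup <- data.frame(rate = c("USD", "MYR"),
                     value = c(0.9, 1.1),
                     stringsAsFactors = FALSE)
fx <- data.frame(rate = c("USD", "MYR", "USD", "MYR", "XXX", "YYY"),
                 stringsAsFactors = FALSE)

filled <- fx %>%
  left_join(lookup, by = "rate") %>%
  # coalesce keeps the first non-missing value, so unmatched rates get 1
  mutate(value = coalesce(value, 1))

filled$value
```

Naming the column explicitly avoids touching other columns whose missing values may mean something different.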
Conditional Join with DPLYR
You could use the devel version of data.table:
library(data.table) # v1.9.5+
setDT(df1)[df2, on=c('t'='t.nr')][year!=2011, value_1:='0'][]
# t value_1 year value
#1: 1 0 2010 0.2
#2: 1 0.9 2011 0.5
#3: 2 0 2012 0.7
#4: 7 0 2013 0.3
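For comparison, roughly the same result can be sketched in dplyr itself. The inputs below are hypothetical, reconstructed only from the output shown above (the original df1 and df2 are not given), and value_1 is assumed to be a character column. Note that the key column keeps the name t.nr here instead of t, and the column order differs.

```r
library(dplyr)

# Hypothetical inputs shaped to resemble the output shown above
df1 <- data.frame(t = c(1, 2, 7),
                  value_1 = c("0.9", "0.4", "0.1"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(t.nr = c(1, 1, 2, 7),
                  year = c(2010, 2011, 2012, 2013),
                  value = c(0.2, 0.5, 0.7, 0.3))

res <- df2 %>%
  # Keep every row of df2 and look up value_1 by t.nr = t
  left_join(df1, by = c("t.nr" = "t")) %>%
  # Keep the looked-up value only for 2011, otherwise set it to "0"
  mutate(value_1 = if_else(year == 2011, value_1, "0"))

res$value_1
```

The join is unconditional and the condition is applied afterwards, which mirrors the chained data.table calls.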