Dplyr Left_Join by Less Than, Greater Than Condition

dplyr left_join by less than, greater than condition

Use a filter. (But note that this answer does not produce a correct LEFT JOIN; but the MWE gives the right result with an INNER JOIN instead.)

The dplyr package isn't happy if asked merge two tables without something to merge on, so in the following, I make a dummy variable in both tables for this purpose, then filter, then drop dummy:

fdata %>% 
mutate(dummy=TRUE) %>%
left_join(sdata %>% mutate(dummy=TRUE)) %>%
filter(fyear >= byear, fyear < eyear) %>%
select(-dummy)

And note that if you do this in PostgreSQL (for example), the query optimizer sees through the dummy variable as evidenced by the following two query explanations:

> fdata %>% 
+ mutate(dummy=TRUE) %>%
+ left_join(sdata %>% mutate(dummy=TRUE)) %>%
+ filter(fyear >= byear, fyear < eyear) %>%
+ select(-dummy) %>%
+ explain()
Joining by: "dummy"
<SQL>
SELECT "id" AS "id", "fyear" AS "fyear", "byear" AS "byear", "eyear" AS "eyear", "val" AS "val"
FROM (SELECT * FROM (SELECT "id", "fyear", TRUE AS "dummy"
FROM "fdata") AS "zzz136"

LEFT JOIN

(SELECT "byear", "eyear", "val", TRUE AS "dummy"
FROM "sdata") AS "zzz137"

USING ("dummy")) AS "zzz138"
WHERE "fyear" >= "byear" AND "fyear" < "eyear"


<PLAN>
Nested Loop (cost=0.00..50886.88 rows=322722 width=40)
Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
-> Seq Scan on fdata (cost=0.00..28.50 rows=1850 width=16)
-> Materialize (cost=0.00..33.55 rows=1570 width=24)
-> Seq Scan on sdata (cost=0.00..25.70 rows=1570 width=24)

and doing it more cleanly with SQL gives exactly the same result:

> tbl(pg, sql("
+ SELECT *
+ FROM fdata
+ LEFT JOIN sdata
+ ON fyear >= byear AND fyear < eyear")) %>%
+ explain()
<SQL>
SELECT "id", "fyear", "byear", "eyear", "val"
FROM (
SELECT *
FROM fdata
LEFT JOIN sdata
ON fyear >= byear AND fyear < eyear) AS "zzz140"


<PLAN>
Nested Loop Left Join (cost=0.00..50886.88 rows=322722 width=40)
Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
-> Seq Scan on fdata (cost=0.00..28.50 rows=1850 width=16)
-> Materialize (cost=0.00..33.55 rows=1570 width=24)
-> Seq Scan on sdata (cost=0.00..25.70 rows=1570 width=24)

Left join with multiple conditions in R

You can join on more than one variable. The example df you give would actually make a suitable lookup table for this:

value_lookup <- data.frame(
type = c('q1', 'q1', 'q2', 'q2', 'q3', 'q3'),
id = c(1, 2, 1, 3, 1, 2),
value = c('yes', 'no', 'one hour', 'more than two hours', 'blue', 'yellow')
)

Then you just merge on both type and id:

df <- left_join(df, value_lookup, by = c('type', 'id'))  

Usually when I need a lookup table like that I store it in a CSV rather than write it all out in the code, but do whatever suits you.

How to do a conditional join in R with dplyr?

left_join(df1, df2, by = c("X1", "X2", "X3", "X4")) %>%
filter(X5.x != X5.y | is.na(X5.x) | is.na(X5.y))
# X1 X2 X3 X4 X5.x X5.y
# 1 Apple Belgium Red Purchase 100 10000
# 2 Guava Germany Green Sale 200 20000
# 3 Grape Italy Purple Purchase 500 NA

Is there a conditional join available in R that picks only the mismatches and ignores when the target column is same?

Yes, I think you could do this with non-equi joins in data.table. Or sqldf, as you mention.

I want to do this in dplyr.

dplyr only joins on equality. So you join and then filter.


Using this data:

df1 = read.table(text = "X1         X2     X3     X4         X5
Apple Belgium Red Purchase 100
Guava Germany Green Sale 200
Grape Italy Purple Purchase 500
Orange India Orange Sale 2000", header = T)

df2 = read.table(text = "X1 X2 X3 X4 X5
Apple Belgium Red Purchase 10000
Guava Germany Green Sale 20000
Grape Italy Purple Purchase NA
Orange India Orange Sale 2000", header = T)

left_join by categories and range values

You should be able to join and then filter the joined data frame:

library(dplyr)

left_join(pensions, pension_factors, by=c('gender', 'type')) %>%
filter(age >= min_age,
age <= max_age)

gender age type min_age max_age factor
1 Female 66 widow 65 75 435
2 Female 69 widow 65 75 435
3 Female 72 disability 65 75 10
4 Male 85 disability 76 85 95

R dplyr join on range of dates

First of all, thank you for trying to help me. I realize my question is incomplete. I moved away from fuzzyjoin because of all the bioconductor dependencies.

I used sqldf instead to accomplish the task:

library(sqldf)
sqldf("SELECT * FROM xxx
LEFT JOIN yyy
ON xxx.ID = yyy.ID
AND xxx.NRA = yyy.NRA
AND yyy.date BETWEEN xxx.date_low AND xxx.date_high")

The result is almost identical to this question but I suspect it can also be solved with that question as per Uwe's data.table solution.

I am also linking this rstudio response

left_join (dplyr) using a function

Should be simple if you just add it as a column before left joining

t <- t %>% mutate(monthinc = monthinc(Date,-1)) 

left_join(t,t,by=c("Product","Date"="monthinc"))

Conditional joins in data.table - left join with column selection

You can also include the selection of variables within the first query.

DT2[DT1, .(x, y, z, t), on = .(a = x, y_start <= y, y_end >= y)]
# x y z t
#1: 1 15 -0.6264538 NA
#2: 2 25 0.1836433 1206
#3: 3 35 -0.8356286 NA
#4: 4 45 1.5952808 NA
#5: 5 55 0.3295078 1686

Regards!

Specifying Logical Conditions (e.g. greater than and less than) in R

Possible Answer:

#load libraries
library(dplyr)
library(mco)

#define function

funct_set <- function (x) {
x1 <- x[1]; x2 <- x[2]; x3 <- x[3] ; x4 <- x[4]; x5 <- x[5]; x6 <- x[6]; x[7] <- x[7]
f <- numeric(4)


#bin data according to random criteria
train_data <- train_data %>%
mutate(cat = ifelse(a1 <= x1 & b1 <= x3, "a",
ifelse(a1 <= x2 & b1 <= x4, "b", "c")))

train_data$cat = as.factor(train_data$cat)

#new splits
a_table = train_data %>%
filter(cat == "a") %>%
select(a1, b1, c1, cat)

b_table = train_data %>%
filter(cat == "b") %>%
select(a1, b1, c1, cat)

c_table = train_data %>%
filter(cat == "c") %>%
select(a1, b1, c1, cat)



#calculate quantile ("quant") for each bin

table_a = data.frame(a_table%>% group_by(cat) %>%
mutate(quant = ifelse(c1 > x[5],1,0 )))

table_b = data.frame(b_table%>% group_by(cat) %>%
mutate(quant = ifelse(c1 > x[6],1,0 )))

table_c = data.frame(c_table%>% group_by(cat) %>%
mutate(quant = ifelse(c1 > x[7],1,0 )))

f[1] = mean(table_a$quant)
f[2] = mean(table_b$
quant)
f[3] = mean(table_c$quant)


#group all tables

final_table = rbind(table_a, table_b, table_c)
# calculate the total mean : this is what needs to be optimized

f[4] = mean(final_table$quant)


return (f);
}


gn <- function(x) {
g1 <- x[3] - x[1]
g2<- x[4] - x[2]
g3 <- x[7] - x[6]
g4 <- x[6] - x[5]
return(c(g1,g2,g3,g4))
}

optimization <- nsga2(funct_set, idim = 7, odim = 4 , constraints = gn, cdim = 4,

generations=150,
popsize=100,
cprob=0.7,
cdist=20,
mprob=0.2,
mdist=20,
lower.bounds=rep(80,80,80,80, 100,200,300),
upper.bounds=rep(120,120,120,120,200,300,400)
)

Now, if we take a look at the output:

#view output
optimization

Sample Image

For any given row, all the logical conditions seem to be respected!

Left Join in R (dplyr) - Too many observations?

With left_join(A, B) new rows will be added wherever there are multiple rows in B for which the key columns (same-name columns by default) match the same, single row in A. For example:

library(dplyr)
df1 <- data.frame(col1 = LETTERS[1:4],
col2 = 1:4)
df2 <- data.frame(col1 = rep(LETTERS[1:2], 2),
col3 = 4:1)

left_join(df1, df2) # has 6 rows rather than 4


Related Topics



Leave a reply



Submit