dplyr left_join by less than, greater than condition
Use a filter. (But note that this approach does not produce a correct LEFT JOIN; the MWE gives the right result with an INNER JOIN instead.)
The dplyr package isn't happy if asked to merge two tables without something to merge on, so in the following I make a dummy variable in both tables for this purpose, then filter, then drop dummy:
fdata %>%
mutate(dummy=TRUE) %>%
left_join(sdata %>% mutate(dummy=TRUE)) %>%
filter(fyear >= byear, fyear < eyear) %>%
select(-dummy)
And note that if you do this in PostgreSQL (for example), the query optimizer sees through the dummy variable, as evidenced by the following two query explanations:
> fdata %>%
+ mutate(dummy=TRUE) %>%
+ left_join(sdata %>% mutate(dummy=TRUE)) %>%
+ filter(fyear >= byear, fyear < eyear) %>%
+ select(-dummy) %>%
+ explain()
Joining by: "dummy"
<SQL>
SELECT "id" AS "id", "fyear" AS "fyear", "byear" AS "byear", "eyear" AS "eyear", "val" AS "val"
FROM (SELECT * FROM (SELECT "id", "fyear", TRUE AS "dummy"
FROM "fdata") AS "zzz136"
LEFT JOIN
(SELECT "byear", "eyear", "val", TRUE AS "dummy"
FROM "sdata") AS "zzz137"
USING ("dummy")) AS "zzz138"
WHERE "fyear" >= "byear" AND "fyear" < "eyear"
<PLAN>
Nested Loop (cost=0.00..50886.88 rows=322722 width=40)
Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
-> Seq Scan on fdata (cost=0.00..28.50 rows=1850 width=16)
-> Materialize (cost=0.00..33.55 rows=1570 width=24)
-> Seq Scan on sdata (cost=0.00..25.70 rows=1570 width=24)
and doing it more cleanly with SQL gives an essentially identical plan:
> tbl(pg, sql("
+ SELECT *
+ FROM fdata
+ LEFT JOIN sdata
+ ON fyear >= byear AND fyear < eyear")) %>%
+ explain()
<SQL>
SELECT "id", "fyear", "byear", "eyear", "val"
FROM (
SELECT *
FROM fdata
LEFT JOIN sdata
ON fyear >= byear AND fyear < eyear) AS "zzz140"
<PLAN>
Nested Loop Left Join (cost=0.00..50886.88 rows=322722 width=40)
Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
-> Seq Scan on fdata (cost=0.00..28.50 rows=1850 width=16)
-> Materialize (cost=0.00..33.55 rows=1570 width=24)
-> Seq Scan on sdata (cost=0.00..25.70 rows=1570 width=24)
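As an aside: newer dplyr versions (1.1.0 and later) support inequality joins directly via join_by(), which avoids the dummy-variable trick entirely. A sketch, assuming that version is available:

```r
library(dplyr)

# dplyr >= 1.1.0: express the range condition in the join itself
fdata %>%
  left_join(sdata, join_by(fyear >= byear, fyear < eyear))
```

This also produces a true LEFT JOIN, unlike the cross-join-plus-filter workaround above.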
Left join with multiple conditions in R
You can join on more than one variable. The example df you give would actually make a suitable lookup table for this:
value_lookup <- data.frame(
type = c('q1', 'q1', 'q2', 'q2', 'q3', 'q3'),
id = c(1, 2, 1, 3, 1, 2),
value = c('yes', 'no', 'one hour', 'more than two hours', 'blue', 'yellow')
)
Then you just merge on both type and id:
df <- left_join(df, value_lookup, by = c('type', 'id'))
Usually when I need a lookup table like that I store it in a CSV rather than write it all out in the code, but do whatever suits you.
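For a concrete sketch, suppose df holds responses with a type and an id column (hypothetical data, since the question's df isn't shown):

```r
library(dplyr)

# hypothetical responses table (an assumption; the original df isn't shown)
df <- data.frame(
  type = c('q1', 'q2', 'q3'),
  id   = c(1, 3, 2)
)

left_join(df, value_lookup, by = c('type', 'id'))
#   type id               value
# 1   q1  1                 yes
# 2   q2  3 more than two hours
# 3   q3  2              yellow
```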
How to do a conditional join in R with dplyr?
left_join(df1, df2, by = c("X1", "X2", "X3", "X4")) %>%
filter(X5.x != X5.y | is.na(X5.x) | is.na(X5.y))
# X1 X2 X3 X4 X5.x X5.y
# 1 Apple Belgium Red Purchase 100 10000
# 2 Guava Germany Green Sale 200 20000
# 3 Grape Italy Purple Purchase 500 NA
Is there a conditional join available in R that picks only the mismatches and ignores when the target column is same?
Yes, I think you could do this with non-equi joins in data.table, or with sqldf, as you mention.
I want to do this in dplyr.
dplyr only joins on equality, so you join and then filter.
Using this data:
df1 = read.table(text = "X1 X2 X3 X4 X5
Apple Belgium Red Purchase 100
Guava Germany Green Sale 200
Grape Italy Purple Purchase 500
Orange India Orange Sale 2000", header = T)
df2 = read.table(text = "X1 X2 X3 X4 X5
Apple Belgium Red Purchase 10000
Guava Germany Green Sale 20000
Grape Italy Purple Purchase NA
Orange India Orange Sale 2000", header = T)
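For comparison, the same mismatch report can be sketched in data.table with an equality join followed by the filter (a sketch, assuming the df1 and df2 defined above):

```r
library(data.table)

setDT(df1); setDT(df2)

# join df1 onto df2 by the four key columns; df1's X5 arrives as i.X5
df2[df1, on = .(X1, X2, X3, X4)][X5 != i.X5 | is.na(X5) | is.na(i.X5)]
```

Here X5 holds df2's values and i.X5 holds df1's, mirroring X5.y and X5.x in the dplyr output.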
left_join by categories and range values
You should be able to join and then filter the joined data frame:
library(dplyr)
left_join(pensions, pension_factors, by=c('gender', 'type')) %>%
filter(age >= min_age,
age <= max_age)
gender age type min_age max_age factor
1 Female 66 widow 65 75 435
2 Female 69 widow 65 75 435
3 Female 72 disability 65 75 10
4 Male 85 disability 76 85 95
R dplyr join on range of dates
First of all, thank you for trying to help me. I realize my question is incomplete. I moved away from fuzzyjoin because of all the bioconductor dependencies.
I used sqldf instead to accomplish the task:
library(sqldf)
sqldf("SELECT * FROM xxx
LEFT JOIN yyy
ON xxx.ID = yyy.ID
AND xxx.NRA = yyy.NRA
AND yyy.date BETWEEN xxx.date_low AND xxx.date_high")
The result is almost identical to this question, but I suspect it can also be solved as in that question, per Uwe's data.table solution. I am also linking this rstudio response.
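A sketch of the equivalent non-equi join in data.table, assuming the table and column names from the sqldf query above:

```r
library(data.table)

setDT(xxx); setDT(yyy)

# keep every row of xxx; match yyy rows where ID and NRA agree
# and yyy's date falls inside xxx's [date_low, date_high] range
yyy[xxx, on = .(ID, NRA, date >= date_low, date <= date_high)]
```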
left_join (dplyr) using a function
This should be simple if you just add it as a column before left joining:
t <- t %>% mutate(monthinc = monthinc(Date, -1))
left_join(t, t, by = c("Product", "Date" = "monthinc"))
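Note that monthinc() itself isn't defined in the snippet; a hypothetical implementation that shifts a Date vector by n months, using lubridate (an assumption, not the original author's helper):

```r
library(lubridate)

# hypothetical stand-in for the undefined monthinc():
# shift a Date vector by n months, rolling back at month ends
monthinc <- function(date, n) date %m+% months(n)
```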
Conditional joins in data.table - left join with column selection
You can also include the selection of variables within the first query.
DT2[DT1, .(x, y, z, t), on = .(a = x, y_start <= y, y_end >= y)]
# x y z t
#1: 1 15 -0.6264538 NA
#2: 2 25 0.1836433 1206
#3: 3 35 -0.8356286 NA
#4: 4 45 1.5952808 NA
#5: 5 55 0.3295078 1686
Regards!
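The snippet assumes DT1 and DT2 already exist; a reconstruction of tables with the right shape (hypothetical data, chosen only to illustrate the query) might look like:

```r
library(data.table)
set.seed(1)

# hypothetical inputs (assumptions; the original tables aren't shown)
DT1 <- data.table(x = 1:5, y = c(15, 25, 35, 45, 55), z = rnorm(5))
DT2 <- data.table(a = c(2, 5), y_start = c(20, 50), y_end = c(30, 60),
                  t = c(1206, 1686))

# left join DT1 against DT2's ranges, selecting columns in one step
DT2[DT1, .(x, y, z, t), on = .(a = x, y_start <= y, y_end >= y)]
```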
Specifying Logical Conditions (e.g. greater than and less than) in R
Possible Answer:
#load libraries
library(dplyr)
library(mco)
#define function
funct_set <- function (x) {
x1 <- x[1]; x2 <- x[2]; x3 <- x[3]; x4 <- x[4]; x5 <- x[5]; x6 <- x[6]; x7 <- x[7]
f <- numeric(4)
#bin data according to random criteria
train_data <- train_data %>%
mutate(cat = ifelse(a1 <= x1 & b1 <= x3, "a",
ifelse(a1 <= x2 & b1 <= x4, "b", "c")))
train_data$cat = as.factor(train_data$cat)
#new splits
a_table = train_data %>%
filter(cat == "a") %>%
select(a1, b1, c1, cat)
b_table = train_data %>%
filter(cat == "b") %>%
select(a1, b1, c1, cat)
c_table = train_data %>%
filter(cat == "c") %>%
select(a1, b1, c1, cat)
#calculate quantile ("quant") for each bin
table_a = data.frame(a_table%>% group_by(cat) %>%
mutate(quant = ifelse(c1 > x[5],1,0 )))
table_b = data.frame(b_table%>% group_by(cat) %>%
mutate(quant = ifelse(c1 > x[6],1,0 )))
table_c = data.frame(c_table%>% group_by(cat) %>%
mutate(quant = ifelse(c1 > x[7],1,0 )))
f[1] = mean(table_a$quant)
f[2] = mean(table_b$quant)
f[3] = mean(table_c$quant)
#group all tables
final_table = rbind(table_a, table_b, table_c)
# calculate the total mean : this is what needs to be optimized
f[4] = mean(final_table$quant)
return (f);
}
gn <- function(x) {
g1 <- x[3] - x[1]
g2<- x[4] - x[2]
g3 <- x[7] - x[6]
g4 <- x[6] - x[5]
return(c(g1,g2,g3,g4))
}
optimization <- nsga2(funct_set, idim = 7, odim = 4 , constraints = gn, cdim = 4,
generations=150,
popsize=100,
cprob=0.7,
cdist=20,
mprob=0.2,
mdist=20,
lower.bounds = c(80, 80, 80, 80, 100, 200, 300),
upper.bounds = c(120, 120, 120, 120, 200, 300, 400)
)
Now, if we take a look at the output:
#view output
optimization
For any given row, all the logical conditions seem to be respected!
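One way to verify that claim programmatically (a sketch, assuming nsga2 returned its usual result list with a par matrix of decision variables):

```r
# check every gn() constraint (feasible when >= 0) across all solutions
par <- optimization$par
stopifnot(
  all(par[, 3] >= par[, 1]),  # g1: x3 - x1 >= 0
  all(par[, 4] >= par[, 2]),  # g2: x4 - x2 >= 0
  all(par[, 7] >= par[, 6]),  # g3: x7 - x6 >= 0
  all(par[, 6] >= par[, 5])   # g4: x6 - x5 >= 0
)
```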
Left Join in R (dplyr) - Too many observations?
With left_join(A, B), new rows will be added wherever multiple rows in B match, on the key columns (same-name columns by default), the same single row in A. For example:
library(dplyr)
df1 <- data.frame(col1 = LETTERS[1:4],
col2 = 1:4)
df2 <- data.frame(col1 = rep(LETTERS[1:2], 2),
col3 = 4:1)
left_join(df1, df2) # has 6 rows rather than 4
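If the duplication is unexpected, newer dplyr versions (1.1.0 and later) can turn it into an explicit error via the relationship argument (assuming that version is available):

```r
# errors, because "A" and "B" in df1 each match two rows of df2
left_join(df1, df2, by = "col1", relationship = "one-to-one")
```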