R Dplyr Join on Range of Dates

First of all, thank you for trying to help me. I realize my question was incomplete. I moved away from fuzzyjoin because of all the Bioconductor dependencies it pulls in.

I used sqldf instead to accomplish the task:

library(sqldf)
sqldf("SELECT * FROM xxx
       LEFT JOIN yyy
         ON xxx.ID = yyy.ID
        AND xxx.NRA = yyy.NRA
        AND yyy.date BETWEEN xxx.date_low AND xxx.date_high")

The result is almost identical to this question, and I suspect my problem could also be solved that way, e.g. with Uwe's data.table solution there.

I am also linking this RStudio Community response:

join data frame based on date ranges not exact values

Here's a demonstration of using dbplyr to effect non-equi joins. I'll use a temporary (in-memory) SQLite database, which has the side effect of converting dates to numbers; this is reversible (if you know the data came from SQLite) and should not be a factor for DBMSes that distinguish between Date and numeric types.
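As an aside, the numeric dates that come back from SQLite after collect() can be restored with base R's as.Date(), since SQLite hands back days since the Unix epoch (the value below matches the first output row further down):

```r
# SQLite stores Date columns as days since 1970-01-01 (the Unix epoch);
# convert a collected numeric column back to Date like this:
as.Date(18266, origin = "1970-01-01")
# [1] "2020-01-05"
```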

Setup:

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# start with a clean slate, in case the tables already exist
DBI::dbExecute(con, "drop table if exists table1")
DBI::dbExecute(con, "drop table if exists table2")
library(dplyr)
library(dbplyr)
copy_to(con, df1, "table1")
copy_to(con, df2, "table2")

The work:

tbl1 <- tbl(con, "table1")
tbl2 <- tbl(con, "table2")
tbl1 %>%
group_by(id) %>%
transmute(id, date1 = date, date2 = lead(date - 1, default = 99999999), thing1) %>%
left_join(tbl2, ., sql_on = "LHS.id = RHS.id and (RHS.date1 <= LHS.date and LHS.date < RHS.date2)") %>%
select(-date1, -date2, -id.y) %>%
rename(id = id.x) %>%
collect()
# # A tibble: 10 x 4
#       id  date thing2 thing1
#    <dbl> <dbl> <chr>  <chr>
#  1     1 18266 v      a
#  2     1 18312 w      a
#  3     1 18326 x      b
#  4     1 18331 y      b
#  5     1 18387 z      c
#  6     2 18297 v      a
#  7     2 18302 w      a
#  8     2 18357 x      b
#  9     2 18362 y      b
# 10     2 18418 z      c

Edit: the same approach, using a Postgres container instead.

tbl1 %>%
  group_by(id) %>%
  transmute(id, date1 = date, date2 = lead(date), thing1) %>%
  mutate(date2 = if_else(is.na(date2), as.Date("2099-01-01"), date2)) %>%
  left_join(tbl2, ., sql_on = '"LHS".id = "RHS".id and ("RHS".date1 <= "LHS".date and "LHS".date < "RHS".date2)') %>%
  select(-date1, -date2, -id.y) %>%
  rename(id = id.x) %>%
  collect()
# # A tibble: 10 x 4
#       id date       thing2 thing1
#    <dbl> <date>     <chr>  <chr>
#  1     1 2020-01-05 v      a
#  2     1 2020-02-20 w      a
#  3     1 2020-03-05 x      b
#  4     1 2020-03-10 y      b
#  5     1 2020-05-05 z      c
#  6     2 2020-02-05 v      a
#  7     2 2020-02-10 w      a
#  8     2 2020-04-05 x      b
#  9     2 2020-04-10 y      b
# 10     2 2020-06-05 z      c

Some of the problems I ran into with this:

  • date math is something I should have expected problems with, since most DBMSes have specific functions for date/time arithmetic and I made no effort to address that (I was being lazy with SQLite's numeric date alias); additionally, I don't know how to get lead(..., default=) working here, so I trimmed the lead(...) call and added a mutate instead;

  • postgres took issue with the fact that dbplyr identifier-quotes all of the table names (e.g., FROM "table" AS "LHS"), yet my sql_on was using unquoted LHS; when I changed the sql_on to use quoted table names, things stopped breaking; you can see how this query pans out by replacing collect() with show_query(), showing the change from

    ON (LHS.id = RHS.id and (RHS.date1 <= LHS.date and LHS.date < RHS.date2))

    to

    ON ("LHS".id = "RHS".id and ("RHS".date1 <= "LHS".date and "LHS".date < "RHS".date2))

Join tables by date range

I know the following looks horrible in base-style R, but here's what I came up with. It's better to use the sqldf package (see below).

library(data.table)
data1 <- data.table(date = c('2010-01-21', '2010-01-25', '2010-02-02', '2010-02-09'),
                    name = c('id1', 'id2', 'id3', 'id4'))

data2 <- data.table(beginning = c('2010-01-15', '2010-01-23', '2010-01-30', '2010-02-05'),
                    ending    = c('2010-01-22', '2010-01-29', '2010-02-04', '2010-02-13'),
                    class     = c(1, 2, 3, 4))

# note: the ISO-formatted date strings happen to compare correctly as plain strings
result <- cbind(data1,
                "beginning" = sapply(1:nrow(data2), function(x) data2$beginning[data2$beginning[x] < data1$date & data2$ending[x] > data1$date]),
                "ending"    = sapply(1:nrow(data2), function(x) data2$ending[data2$beginning[x] < data1$date & data2$ending[x] > data1$date]),
                "class"     = sapply(1:nrow(data2), function(x) data2$class[data2$beginning[x] < data1$date & data2$ending[x] > data1$date]))

Using the package sqldf:

library(sqldf)
result <- sqldf("select * from data1
                 left join data2
                 on data1.date between data2.beginning and data2.ending")

Using data.table, this is simply

data1[data2, on = .(date >= beginning, date <= ending)]
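A self-contained sketch of that one-liner with the toy data above, converting the strings to IDate so the range condition is a genuine date comparison:

```r
library(data.table)

data1 <- data.table(date = as.IDate(c('2010-01-21', '2010-01-25', '2010-02-02', '2010-02-09')),
                    name = c('id1', 'id2', 'id3', 'id4'))
data2 <- data.table(beginning = as.IDate(c('2010-01-15', '2010-01-23', '2010-01-30', '2010-02-05')),
                    ending    = as.IDate(c('2010-01-22', '2010-01-29', '2010-02-04', '2010-02-13')),
                    class     = c(1, 2, 3, 4))

# non-equi join: each data2 interval is matched to the data1 row whose date it contains
data1[data2, on = .(date >= beginning, date <= ending)]
```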

Left join on date range by group ID

Using the solution from @mt1002, with one simple addition to obtain the txs column. I have obtained a single value via sum(txs); this could instead be min(txs) or max(txs), depending on your needs.

Data %>%
  group_by(num_c) %>%
  summarise(week = list(seq(min(week), max(week), by = 'week')),
            txs  = sum(txs)) %>%
  unnest(week)

After the clarification, this is the solution I came up with: it gives NA values for weeks in which there were no orders, in addition to the number of orders per user per week. You could also left-join the list of weeks with orders by num_c, combining the query above with the original df.

library(lubridate)
a <- data.frame(week = seq(1, 52, 1))
Data %>%
  group_by(num_c) %>%
  mutate(week_num = week(week)) %>%
  group_by(num_c, week_num) %>%
  summarise(txs = sum(txs),
            number_orders = n()) %>%
  full_join(a, by = c("week_num" = "week")) %>%
  ungroup() %>%
  arrange(week_num)

Is there any way to join two data frames by date ranges?

You could also try fuzzy_join, as suggested by @Gregor Thomas. I added a row-number column to make sure you have unique rows, independent of item and date ranges (though this may not be needed).

library(fuzzyjoin)
library(dplyr)

daily_forecast %>%
  mutate(rn = row_number()) %>%
  fuzzy_left_join(actual_orders,
                  by = c("item" = "item",
                         "date_fcsted" = "order_date",
                         "extended_date" = "order_date"),
                  match_fun = list(`==`, `<=`, `>=`)) %>%
  group_by(rn, item.x, date_fcsted, extended_date, fcsted_qty) %>%
  summarise(actual_total_demand = sum(order_qty))

Output

     rn item.x date_fcsted extended_date fcsted_qty actual_total_demand
  <int> <chr>  <date>      <date>             <dbl>               <dbl>
1     1 A      2020-08-01  2020-08-28           100                 221
2     2 B      2020-08-01  2020-08-28           200                 219
3     3 A      2020-08-15  2020-09-11           200                 212
4     4 B      2020-08-15  2020-09-11           100                 216

dplyr left_join by less than, greater than condition

Use a filter. (Note that this does not produce a true LEFT JOIN; the MWE gives the right result only because an INNER JOIN happens to suffice here.)

The dplyr package isn't happy if asked to merge two tables without something to merge on, so in the following I make a dummy variable in both tables for this purpose, then filter, then drop the dummy:

fdata %>%
  mutate(dummy = TRUE) %>%
  left_join(sdata %>% mutate(dummy = TRUE)) %>%
  filter(fyear >= byear, fyear < eyear) %>%
  select(-dummy)
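The same cross-join-then-filter idea can be sketched in base R with merge(by = NULL), which produces the Cartesian product directly (shown here with hypothetical toy data, not the original fdata/sdata):

```r
# hypothetical toy data
fdata <- data.frame(id = 1:3, fyear = c(1995, 2001, 2010))
sdata <- data.frame(byear = c(1990, 2000, 2005),
                    eyear = c(2000, 2005, 2015),
                    val   = c("a", "b", "c"))

res <- merge(fdata, sdata, by = NULL)                         # Cartesian product (9 rows)
res <- res[res$fyear >= res$byear & res$fyear < res$eyear, ]  # keep rows in range
```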

And note that if you do this in PostgreSQL (for example), the query optimizer sees through the dummy variable, as evidenced by the following two query explanations:

> fdata %>% 
+ mutate(dummy=TRUE) %>%
+ left_join(sdata %>% mutate(dummy=TRUE)) %>%
+ filter(fyear >= byear, fyear < eyear) %>%
+ select(-dummy) %>%
+ explain()
Joining by: "dummy"
<SQL>
SELECT "id" AS "id", "fyear" AS "fyear", "byear" AS "byear", "eyear" AS "eyear", "val" AS "val"
FROM (SELECT * FROM (SELECT "id", "fyear", TRUE AS "dummy"
FROM "fdata") AS "zzz136"

LEFT JOIN

(SELECT "byear", "eyear", "val", TRUE AS "dummy"
FROM "sdata") AS "zzz137"

USING ("dummy")) AS "zzz138"
WHERE "fyear" >= "byear" AND "fyear" < "eyear"

<PLAN>
Nested Loop (cost=0.00..50886.88 rows=322722 width=40)
Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
-> Seq Scan on fdata (cost=0.00..28.50 rows=1850 width=16)
-> Materialize (cost=0.00..33.55 rows=1570 width=24)
-> Seq Scan on sdata (cost=0.00..25.70 rows=1570 width=24)

and doing it more cleanly with SQL gives exactly the same result:

> tbl(pg, sql("
+ SELECT *
+ FROM fdata
+ LEFT JOIN sdata
+ ON fyear >= byear AND fyear < eyear")) %>%
+ explain()
<SQL>
SELECT "id", "fyear", "byear", "eyear", "val"
FROM (
SELECT *
FROM fdata
LEFT JOIN sdata
ON fyear >= byear AND fyear < eyear) AS "zzz140"

<PLAN>
Nested Loop Left Join (cost=0.00..50886.88 rows=322722 width=40)
Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
-> Seq Scan on fdata (cost=0.00..28.50 rows=1850 width=16)
-> Materialize (cost=0.00..33.55 rows=1570 width=24)
-> Seq Scan on sdata (cost=0.00..25.70 rows=1570 width=24)

