R dplyr join on range of dates
First of all, thank you for trying to help me. I realize my question is incomplete. I moved away from fuzzyjoin because of all its Bioconductor dependencies, and used sqldf instead to accomplish the task:
library(sqldf)
sqldf("SELECT * FROM xxx
LEFT JOIN yyy
ON xxx.ID = yyy.ID
AND xxx.NRA = yyy.NRA
AND yyy.date BETWEEN xxx.date_low AND xxx.date_high")
The result is almost identical to this question, and I suspect it can also be solved with Uwe's data.table answer to that question. I am also linking this RStudio Community thread:
join data frame based on date ranges not exact values
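For reference, here is a sketch of the equivalent data.table non-equi join, assuming xxx and yyy are the same tables as in the sqldf query above:

```r
library(data.table)
setDT(xxx)
setDT(yyy)

# For each row of xxx, find the yyy rows with matching ID/NRA
# whose date falls within [date_low, date_high]
yyy[xxx, on = .(ID, NRA, date >= date_low, date <= date_high)]
```

Note that in the result the range columns take their names from the on= expressions, so some renaming may be needed afterwards.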
Here's a demonstration of using dbplyr to effect non-equi joins. I'll use a temporary (in-memory) SQLite database, which has the side-effect of converting dates to numbers; this is reversible (if you stay with SQLite) and should not be a factor for DBMSes that distinguish between Date and numeric types.
Setup:
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
# rename them in the db for unambiguity here in code
DBI::dbExecute(con, "drop table if exists table1")
DBI::dbExecute(con, "drop table if exists table2")
library(dplyr)
library(dbplyr)
copy_to(con, df1, "table1")
copy_to(con, df2, "table2")
The work:
tbl1 <- tbl(con, "table1")
tbl2 <- tbl(con, "table2")
tbl1 %>%
group_by(id) %>%
transmute(id, date1 = date, date2 = lead(date - 1, default = 99999999), thing1) %>%
left_join(tbl2, ., sql_on = "LHS.id = RHS.id and (RHS.date1 <= LHS.date and LHS.date < RHS.date2)") %>%
select(-date1, -date2, -id.y) %>%
rename(id = id.x) %>%
collect()
# # A tibble: 10 x 4
# id date thing2 thing1
# <dbl> <dbl> <chr> <chr>
# 1 1 18266 v a
# 2 1 18312 w a
# 3 1 18326 x b
# 4 1 18331 y b
# 5 1 18387 z c
# 6 2 18297 v a
# 7 2 18302 w a
# 8 2 18357 x b
# 9 2 18362 y b
# 10 2 18418 z c
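Since SQLite stored the dates as days-since-epoch numbers, the date column can be restored after collect(); for instance, 18266 maps back to 2020-01-05:

```r
library(dplyr)

# res is the collected tibble from above; convert SQLite's numeric
# day counts back to Date (R's epoch is 1970-01-01)
res <- res %>%
  mutate(date = as.Date(date, origin = "1970-01-01"))
```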
Edit: the same approach against a postgres container.
tbl1 %>%
group_by(id) %>%
transmute(id, date1 = date, date2 = lead(date), thing1) %>%
mutate(date2 = if_else(is.na(date2), as.Date("2099-01-01"), date2)) %>%
left_join(tbl2, ., sql_on = '"LHS".id = "RHS".id and ("RHS".date1 <= "LHS".date and "LHS".date < "RHS".date2)') %>%
select(-date1, -date2, -id.y) %>%
rename(id = id.x) %>%
collect()
# # A tibble: 10 x 4
# id date thing2 thing1
# <dbl> <date> <chr> <chr>
# 1 1 2020-01-05 v a
# 2 1 2020-02-20 w a
# 3 1 2020-03-05 x b
# 4 1 2020-03-10 y b
# 5 1 2020-05-05 z c
# 6 2 2020-02-05 v a
# 7 2 2020-02-10 w a
# 8 2 2020-04-05 x b
# 9 2 2020-04-10 y b
# 10 2 2020-06-05 z c
Some of the problems I ran into with this:

- Date math: I should have expected problems here, since most DBMSes have specific functions for date/time arithmetic, and I made no effort to address that (I was being lazy with SQLite's numeric date alias). In addition, I couldn't get lead(..., default=) working, so I trimmed the lead(...) call and added a mutate instead.
- Identifier quoting: postgres took issue with the fact that dbplyr quotes all table identifiers (e.g., FROM "table" AS "LHS"), while my sql_on used an unquoted LHS. Once I changed the sql_on to use quoted table names, things stopped breaking. You can see how the query pans out by replacing collect() with show_query(), which shows the change from
  ON (LHS.id = RHS.id and (RHS.date1 <= LHS.date and LHS.date < RHS.date2))
  to
  ON ("LHS".id = "RHS".id and ("RHS".date1 <= "LHS".date and "LHS".date < "RHS".date2))
Join tables by date range
I know the following looks horrible in base R, but here's what I came up with. It's better to use the sqldf package (see below).
library(data.table)
data1 <- data.table(date = c('2010-01-21', '2010-01-25', '2010-02-02', '2010-02-09'),
                    name = c('id1', 'id2', 'id3', 'id4'))
data2 <- data.table(beginning = c('2010-01-15', '2010-01-23', '2010-01-30', '2010-02-05'),
                    ending    = c('2010-01-22', '2010-01-29', '2010-02-04', '2010-02-13'),
                    class     = c(1, 2, 3, 4))

# For each row of data1, pick out the row of data2 whose range contains its date.
# Note: these string comparisons only work because ISO-8601 dates sort lexicographically.
result <- cbind(data1,
  "beginning" = sapply(1:nrow(data1), function(x) data2$beginning[data2$beginning < data1$date[x] & data2$ending > data1$date[x]]),
  "ending"    = sapply(1:nrow(data1), function(x) data2$ending[data2$beginning < data1$date[x] & data2$ending > data1$date[x]]),
  "class"     = sapply(1:nrow(data1), function(x) data2$class[data2$beginning < data1$date[x] & data2$ending > data1$date[x]]))
Using the package sqldf:
library(sqldf)
result = sqldf("select * from data1
left join data2
on data1.date between data2.beginning and data2.ending")
Using data.table, this is simply
data1[data2, on = .(date >= beginning, date <= ending)]
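With the data1/data2 defined above, it is safest to convert the character columns to Date class first, so the range comparisons are done on dates rather than strings (a sketch):

```r
library(data.table)

# convert the character columns to data.table's integer date class
data1[, date := as.IDate(date)]
data2[, `:=`(beginning = as.IDate(beginning), ending = as.IDate(ending))]

# non-equi join: for each data2 range, attach the data1 rows whose
# date it contains; the range columns inherit names from the on= clause
data1[data2, on = .(date >= beginning, date <= ending)]
```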
Left join on date range by group ID
Using the solution from @mt1002, with one simple addition to obtain the txs column. I have obtained a single value with sum(txs); this could instead be min(txs) or max(txs), depending on your needs.
library(dplyr)
library(tidyr)  # for unnest()

Data %>%
  group_by(num_c) %>%
  summarise(week = list(seq(min(week), max(week), by = 'week')),
            txs = sum(txs)) %>%
  unnest(week)
After the clarification, this is the solution I came up with: it yields NA values for weeks in which there were no orders, in addition to the number of orders per user per week. You could also left-join the list of weeks that had orders, by num_c, using the query above against the df.
library(dplyr)
library(lubridate)

a <- data.frame(week = seq(1, 52, 1))
Data %>%
group_by(num_c) %>%
mutate(week_num = week(week)) %>%
group_by(num_c, week_num) %>%
summarise(txs = sum(txs),
number_orders = n()) %>%
full_join(a, by = c("week_num"="week")) %>%
ungroup() %>%
arrange(week_num)
Is there any way to join two data frames by date ranges?
You could also try fuzzy_join, as suggested by @Gregor Thomas. I added a row-number column to make sure you have unique rows independent of item and date ranges (though this may not be needed).
library(fuzzyjoin)
library(dplyr)
daily_forecast %>%
mutate(rn = row_number()) %>%
fuzzy_left_join(actual_orders,
by = c("item" = "item",
"date_fcsted" = "order_date",
"extended_date" = "order_date"),
match_fun = list(`==`, `<=`, `>=`)) %>%
group_by(rn, item.x, date_fcsted, extended_date, fcsted_qty) %>%
summarise(actual_total_demand = sum(order_qty))
Output
rn item.x date_fcsted extended_date fcsted_qty actual_total_demand
<int> <chr> <date> <date> <dbl> <dbl>
1 1 A 2020-08-01 2020-08-28 100 221
2 2 B 2020-08-01 2020-08-28 200 219
3 3 A 2020-08-15 2020-09-11 200 212
4 4 B 2020-08-15 2020-09-11 100 216
dplyr left_join by less than, greater than condition
Use a filter. (But note that this answer does not produce a correct LEFT JOIN; it effectively performs an INNER JOIN, which happens to give the right result for this MWE.)
The dplyr package isn't happy if asked to merge two tables without something to merge on, so in the following I make a dummy variable in both tables for that purpose, then filter, then drop dummy:
fdata %>%
mutate(dummy=TRUE) %>%
left_join(sdata %>% mutate(dummy=TRUE)) %>%
filter(fyear >= byear, fyear < eyear) %>%
select(-dummy)
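Note that since dplyr 1.1.0 you can express this directly as a non-equi join with join_by(), avoiding the dummy-variable trick entirely (a sketch, assuming the same fdata/sdata):

```r
library(dplyr)  # requires dplyr >= 1.1.0 for non-equi join_by()

fdata %>%
  left_join(sdata, join_by(fyear >= byear, fyear < eyear))
```

Unlike the filter approach, this preserves genuine LEFT JOIN semantics: fdata rows with no matching range are kept with NAs.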
And note that if you do this in PostgreSQL (for example), the query optimizer sees through the dummy
variable as evidenced by the following two query explanations:
> fdata %>%
+ mutate(dummy=TRUE) %>%
+ left_join(sdata %>% mutate(dummy=TRUE)) %>%
+ filter(fyear >= byear, fyear < eyear) %>%
+ select(-dummy) %>%
+ explain()
Joining by: "dummy"
<SQL>
SELECT "id" AS "id", "fyear" AS "fyear", "byear" AS "byear", "eyear" AS "eyear", "val" AS "val"
FROM (SELECT * FROM (SELECT "id", "fyear", TRUE AS "dummy"
FROM "fdata") AS "zzz136"
LEFT JOIN
(SELECT "byear", "eyear", "val", TRUE AS "dummy"
FROM "sdata") AS "zzz137"
USING ("dummy")) AS "zzz138"
WHERE "fyear" >= "byear" AND "fyear" < "eyear"
<PLAN>
Nested Loop (cost=0.00..50886.88 rows=322722 width=40)
Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
-> Seq Scan on fdata (cost=0.00..28.50 rows=1850 width=16)
-> Materialize (cost=0.00..33.55 rows=1570 width=24)
-> Seq Scan on sdata (cost=0.00..25.70 rows=1570 width=24)
and doing it more cleanly with SQL gives exactly the same result:
> tbl(pg, sql("
+ SELECT *
+ FROM fdata
+ LEFT JOIN sdata
+ ON fyear >= byear AND fyear < eyear")) %>%
+ explain()
<SQL>
SELECT "id", "fyear", "byear", "eyear", "val"
FROM (
SELECT *
FROM fdata
LEFT JOIN sdata
ON fyear >= byear AND fyear < eyear) AS "zzz140"
<PLAN>
Nested Loop Left Join (cost=0.00..50886.88 rows=322722 width=40)
Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
-> Seq Scan on fdata (cost=0.00..28.50 rows=1850 width=16)
-> Materialize (cost=0.00..33.55 rows=1570 width=24)
-> Seq Scan on sdata (cost=0.00..25.70 rows=1570 width=24)