Range Join Data.Frames - Specific Date Column with Date Ranges/Intervals in R

Update: overlap joins are implemented in data.table v1.9.3+. This is a special case where the start and end dates are identical in Speeches. We can accomplish it using foverlaps() as follows:

require(data.table) ## 1.9.3+
setDT(Speeches)
setDT(History)

Speeches[, `:=`(Date2 = Date, id = .I)]  # duplicate Date so each speech is a zero-width interval
setkey(History, Name, Role.Start, Role.End)

ans = foverlaps(Speeches, History, by.x=c("Name", "Date", "Date2"))[, Date2 := NULL]
ans = ans[order(id, Value)][, N := 1:.N, by=list(Name, Date, Role, id)]
ans = dcast.data.table(ans, id+Name+Date ~ Role+N, value.var="Value")

This is a case for range/interval join.

Here's the data.table way. It uses two rolling joins.

require(data.table) ## 1.9.2+
dt1 = as.data.table(Speeches)
dt2 = as.data.table(History)

# first rolling join - to get end indices
setkey(dt2, Name, Role.Start)
tmp1 = dt2[dt1, roll=Inf, which=TRUE]

# second rolling join - to get start indices
setkey(dt2, Name, Role.End)
tmp2 = dt2[dt1, roll=-Inf, which=TRUE]

# generate dt1's and dt2's corresponding row indices
idx = tmp1-tmp2+1L
idx1 = rep(seq_len(nrow(dt1)), idx)
idx2 = data.table:::vecseq(tmp2, idx, sum(idx))
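As an aside, data.table:::vecseq() is an unexported internal. If you'd rather not rely on it, base R's sequence() builds the same run-of-sequences vector (assuming R >= 4.0, where sequence() gained a from argument). The names below are hypothetical stand-ins for tmp2 and idx:

```r
# Hypothetical start indices and run lengths, standing in for tmp2 and idx above
starts  <- c(1L, 4L)
lengths <- c(2L, 3L)

sequence(lengths, from = starts)
#> [1] 1 2 4 5 6
```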

dt1[, id := 1:.N] ## needed for casting later

# subset using idx1 and idx2 and bind them colwise
ans = cbind(dt1[idx1], dt2[idx2, -1L, with=FALSE])

# a little reordering to get the output correctly (factors are a pain!)
ans = ans[order(id,Value)][, N := 1:.N, by=list(Name, Date, Role, id)]

# finally cast them.
f_ans = dcast.data.table(ans, id+Name+Date ~ Role+N, value.var="Value")

Here's the output:

    id Name       Date Political groups_1 National parties_1 Member_1 Member_2 Member_3 Substitute_1
 1:  1  AAA 2004-05-05                  j                  l        c        f       NA            d
 2:  2  AAA 2003-12-18                  j                  l        c        f        h            d
 3:  3  AAA 2003-12-18                  j                  l        c        f        h            d
 4:  4  AAA 2003-12-18                  j                  l        c        f        h            d
 5:  5  AAA 2003-11-17                  j                  l        c        f        h            d
 6:  6  AAA 2003-11-06                  j                  l        c        f        h            d
 7:  7  AAA 2003-10-20                  j                  l        c        f        h            d
 8:  8  AAA 2003-09-25                  j                  l        c        f        h            d
 9:  9  AAA 2003-06-04                  j                  l        c        f        h            d
10: 10  BBB 2012-04-20                  i                  k        b        g       NA           NA
11: 11  BBB 2012-04-19                  i                  k        b        g       NA           NA
12: 12  BBB 2012-04-19                  i                  k        b        g       NA           NA
13: 13  BBB 2012-04-19                  i                  k        b        g       NA           NA
14: 14  BBB 2012-04-19                  i                  k        b        g       NA           NA
15: 15  BBB 2012-04-19                  i                  k        b        g       NA           NA
16: 16  BBB 2012-04-19                  i                  k        b        g       NA           NA
17: 17  BBB 2012-04-19                  i                  k        b        g       NA           NA
18: 18  BBB 2012-04-18                  i                  k        b        g       NA           NA
19: 19  BBB 2012-04-18                  i                  k        b        g       NA           NA
20: 20  BBB 2012-04-18                  i                  k        b        g       NA           NA

Alternatively, you can accomplish this using the GenomicRanges package from Bioconductor, which handles ranges quite nicely, especially when you need an additional column to join by (Name) in addition to the ranges. You can install it from Bioconductor.

require(GenomicRanges)
require(data.table)
dt1 <- as.data.table(Speeches)
dt2 <- as.data.table(History)
gr1 = GRanges(Rle(dt1$Name), IRanges(as.numeric(dt1$Date), as.numeric(dt1$Date)))
gr2 = GRanges(Rle(dt2$Name), IRanges(as.numeric(dt2$Role.Start), as.numeric(dt2$Role.End)))

olaps = findOverlaps(gr1, gr2, type="within")
idx1 = queryHits(olaps)
idx2 = subjectHits(olaps)

# from here, you can do exactly as above
dt1[, id := 1:.N]
...
...
dcast.data.table(ans, id+Name+Date ~ Role+N, value.var="Value")

Gives the same result as above.

Join within date range

Your logic is not as simple as "between": it appears that you want any kind of overlap, whether one interval is a superset of the other or otherwise. For that we need a slightly different query (and, I'm inferring, the left join should include ID as well).

sqldf::sqldf("
  select h.*, o.DT_START_OUT, o.DT_END_OUT
  from HOSP h
  left join OUT o on h.ID = o.ID
    and h.DT_START_HOSP < o.DT_END_OUT
    and h.DT_END_HOSP > o.DT_START_OUT")
#    ID DT_START_HOSP DT_END_HOSP DT_START_OUT DT_END_OUT
# 1 111    2021-01-07  2021-01-10         <NA>       <NA>
# 2 222    2021-01-11  2021-01-20   2021-01-15 2021-01-15
# 3 333    2021-01-21  2021-01-25         <NA>       <NA>
# 4 444    2021-01-21  2021-02-01         <NA>       <NA>
# 5 555    2021-01-21  2021-01-29         <NA>       <NA>
# 6 666    2021-01-22  2021-02-02   2021-01-25 2021-01-25
# 7 666    2021-01-22  2021-02-02   2021-01-28 2021-01-30

(Thank you for fixing your data from the previous question and the first draft of this one. For the record, you may want this, some handy code that deals well with vectors of dates in inconsistent/different formats.)

Is there any way to join two data frames by date ranges?

You could also try fuzzyjoin, as suggested by @Gregor Thomas. I added a row-number column to make sure you have unique rows independent of item and date ranges (though this may not be needed).

library(fuzzyjoin)
library(dplyr)

daily_forecast %>%
  mutate(rn = row_number()) %>%
  fuzzy_left_join(actual_orders,
                  by = c("item" = "item",
                         "date_fcsted" = "order_date",
                         "extended_date" = "order_date"),
                  match_fun = list(`==`, `<=`, `>=`)) %>%
  group_by(rn, item.x, date_fcsted, extended_date, fcsted_qty) %>%
  summarise(actual_total_demand = sum(order_qty))

Output

     rn item.x date_fcsted extended_date fcsted_qty actual_total_demand
  <int> <chr>  <date>      <date>             <dbl>               <dbl>
1     1 A      2020-08-01  2020-08-28           100                 221
2     2 B      2020-08-01  2020-08-28           200                 219
3     3 A      2020-08-15  2020-09-11           200                 212
4     4 B      2020-08-15  2020-09-11           100                 216

merge dataframes based on date range in R

Assuming you are using lubridate for the DateTime column, you can do:

df1 %>% left_join(df2, by = c("SiteID" = "Site.ID")) %>%
  filter(DateTime %within% interval(Start.date, End.date))
#> # A tibble: 8 × 4
#> DateTime SiteID Start.date End.date
#> <dttm> <chr> <date> <date>
#> 1 2010-07-25 01:06:55 B04 2010-07-18 2010-08-24
#> 2 2011-05-10 23:52:14 B04 2011-02-22 2011-07-23
#> 3 2011-09-17 01:14:30 B04 2011-08-30 2012-10-03
#> 4 2012-04-04 02:55:29 B05 2011-08-30 2012-08-21
#> 5 2013-01-05 23:03:06 B05 2012-12-08 2013-01-21
#> 6 2011-03-09 20:39:46 B06 2011-02-12 2011-04-18
#> 7 2012-07-25 23:17:19 B07 2011-12-29 2012-12-02
#> 8 2011-03-03 00:46:45 B08 2011-02-12 2011-04-01

Full Example:

library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union

df1 <- tribble(~DateTime, ~SiteID,
               "2010-07-25 01:06:55", "B04",
               "2011-05-10 23:52:14", "B04",
               "2011-09-17 01:14:30", "B04",
               "2012-04-04 02:55:29", "B05",
               "2013-01-05 23:03:06", "B05",
               "2011-03-09 20:39:46", "B06",
               "2012-07-25 23:17:19", "B07",
               "2011-03-03 00:46:45", "B08") %>%
  mutate(DateTime = as_datetime(DateTime))

df2 <- tribble(~Site.ID, ~Start.date, ~End.date,
               "B04", "2010-07-18", "2010-08-24",
               "B04", "2011-02-22", "2011-07-23",
               "B04", "2011-08-30", "2012-10-03",
               "B04", "2012-10-20", "2013-04-08",
               "B05", "2011-08-30", "2012-08-21",
               "B05", "2012-12-08", "2013-01-21",
               "B05", "2013-02-08", "2013-04-08",
               "B06", "2010-07-20", "2010-09-03",
               "B06", "2011-02-12", "2011-04-18",
               "B06", "2011-05-13", "2011-05-16",
               "B07", "2011-10-24", "2011-11-29",
               "B07", "2011-12-29", "2012-12-02",
               "B08", "2011-02-12", "2011-04-01",
               "B08", "2011-10-24", "2011-12-24") %>%
  mutate(Start.date = as_date(Start.date), End.date = as_date(End.date))

df1 %>% left_join(df2, by = c("SiteID" = "Site.ID")) %>%
  filter(DateTime %within% interval(Start.date, End.date))
#> # A tibble: 8 × 4
#> DateTime SiteID Start.date End.date
#> <dttm> <chr> <date> <date>
#> 1 2010-07-25 01:06:55 B04 2010-07-18 2010-08-24
#> 2 2011-05-10 23:52:14 B04 2011-02-22 2011-07-23
#> 3 2011-09-17 01:14:30 B04 2011-08-30 2012-10-03
#> 4 2012-04-04 02:55:29 B05 2011-08-30 2012-08-21
#> 5 2013-01-05 23:03:06 B05 2012-12-08 2013-01-21
#> 6 2011-03-09 20:39:46 B06 2011-02-12 2011-04-18
#> 7 2012-07-25 23:17:19 B07 2011-12-29 2012-12-02
#> 8 2011-03-03 00:46:45 B08 2011-02-12 2011-04-01

Created on 2022-06-09 by the reprex package (v2.0.1)

merging tables based on time ranges/intervals using lubridate

This may be inefficient for large data frames (you materialize the full user-level merge before subsetting), and I'm sure there's a more elegant way, but this works:

merged <- merge(df1, df2, by = "User")
output <- merged[merged$Date %within% merged$interval, ]

Or you could use a loop:

for (x in seq_along(df1$User)) {
  df1$Price[x] <- df2[(df1$Date[x] %within% df2$interval) & df1$User[x] == df2$User, ]$Price
}

I'm sure you could also make a function and use apply...
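To sketch that apply-based idea: the toy data below is made up (the original df1/df2 aren't shown here), and like the loop above it assumes exactly one match per row of df1.

```r
library(lubridate)

# Hypothetical stand-ins for the df1/df2 in the answer above
df2 <- data.frame(User = c("A", "B"), Price = c(10, 20))
df2$interval <- interval(ymd(c("2021-01-01", "2021-02-01")),
                         ymd(c("2021-01-31", "2021-02-28")))

df1 <- data.frame(User = c("A", "B"),
                  Date = ymd(c("2021-01-15", "2021-02-10")))

# One price lookup per row of df1, assuming exactly one match each
df1$Price <- vapply(seq_len(nrow(df1)), function(i) {
  df2$Price[df1$User[i] == df2$User & df1$Date[i] %within% df2$interval]
}, numeric(1))

df1$Price
#> [1] 10 20
```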

R - Summarize values between specific date range

A couple of alternatives to consider. I assume your dates are actual dates and not character values.

You can try using fuzzyjoin to merge the two data.frames, including rows where the dates fall between start_dates and end_dates.

library(tidyverse)
library(fuzzyjoin)

fuzzy_left_join(
  date_df,
  df,
  by = c("start_dates" = "dates", "end_dates" = "dates"),
  match_fun = list(`<=`, `>=`)
) %>%
  group_by(start_dates, end_dates) %>%
  summarise(new_goal_column = sum(x))

Output

  start_dates end_dates  new_goal_column
  <date>      <date>               <dbl>
1 2021-01-01  2021-01-06              19
2 2021-01-07  2021-01-10               6

You can also try using data.table and joining.

library(data.table)

setDT(date_df)
setDT(df)

df[date_df, .(start_dates, end_dates, x), on = .(dates >= start_dates, dates <= end_dates)][
, .(new_goal_column = sum(x)), by = .(start_dates, end_dates)
]

Output

   start_dates  end_dates new_goal_column
1:  2021-01-01 2021-01-06              19
2:  2021-01-07 2021-01-10               6

Merging two dataframes on a date range in R

In general, it's not a good idea to use POSIXlt columns in data frames; use POSIXct instead:

df1a <- transform(df1,
                  StartDateTime = as.POSIXct(StartDateTime),
                  EndDateTime = as.POSIXct(EndDateTime))
df2a <- transform(df2, dateTime = as.POSIXct(dateTime))

The SQL statement in the question also has an extraneous comma before FROM.

Here is a slightly simplified statement. It uses a left join to ensure that all IDs from df1a are included even if they have no matches in df2a.

sqldf("SELECT df1a.ID, PtID, dateTime, lat, lon 
FROM df1a LEFT JOIN df2a
ON df1a.ID = df2a.ID AND dateTime BETWEEN StartDateTime AND EndDateTime")

How to perform join over date ranges using data.table?

You can use the foverlaps() function, which implements joins over intervals efficiently. In your case, we just need a dummy column for measurments.

Note 1: You should install the development version of data.table, v1.9.5, as a bug with foverlaps() has been fixed there. You can find the installation instructions here.

Note 2: I'll call whatWasMeasured = dt1 and measurments = dt2 here for convenience.

require(data.table) ## 1.9.5+
dt2[, dummy := time]

setkey(dt1, start, end)
ans = foverlaps(dt2, dt1, by.x=c("time", "dummy"), nomatch=0L)[, dummy := NULL]

See ?foverlaps for more info and this post for a performance comparison.
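Since whatWasMeasured and measurments aren't shown here, the following is a minimal self-contained sketch of the same pattern with made-up tables (dt1 holds the intervals, dt2 the point measurements):

```r
library(data.table)  # 1.9.5+

# Made-up stand-ins: dt1 = intervals of what was measured, dt2 = point events
dt1 <- data.table(start = c(1L, 10L), end = c(5L, 15L), what = c("a", "b"))
dt2 <- data.table(time = c(2L, 12L, 99L), value = c(0.1, 0.2, 0.3))

dt2[, dummy := time]     # turn each point into a zero-width interval
setkey(dt1, start, end)  # foverlaps() requires the y table keyed on the interval

ans <- foverlaps(dt2, dt1, by.x = c("time", "dummy"), nomatch = 0L)[, dummy := NULL]
ans$what
#> [1] "a" "b"
```

The unmatched point (time = 99) is dropped because of nomatch = 0L; use nomatch = NA to keep it.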

Join tables by date range

I know the following looks horrible in base R, but here's what I came up with. It's better to use the sqldf package (see below).

library(data.table)
data1 <- data.table(date = c('2010-01-21', '2010-01-25', '2010-02-02', '2010-02-09'),
                    name = c('id1', 'id2', 'id3', 'id4'))

data2 <- data.table(beginning = c('2010-01-15', '2010-01-23', '2010-01-30', '2010-02-05'),
                    ending    = c('2010-01-22', '2010-01-29', '2010-02-04', '2010-02-13'),
                    class     = c(1, 2, 3, 4))

in_range <- function(x) data2$beginning[x] < data1$date & data2$ending[x] > data1$date
result <- cbind(data1,
                beginning = sapply(1:nrow(data2), function(x) data2$beginning[in_range(x)]),
                ending    = sapply(1:nrow(data2), function(x) data2$ending[in_range(x)]),
                class     = sapply(1:nrow(data2), function(x) data2$class[in_range(x)]))

Using the package sqldf:

library(sqldf)
result = sqldf("select * from data1
                left join data2
                on data1.date between data2.beginning and data2.ending")

Using data.table, this is simply:

data1[data2, on = .(date >= beginning, date <= ending)]
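To see the non-equi join in action, here is a self-contained run on a two-row subset of the toy data above (dates converted to IDate so the range comparisons are date comparisons rather than string comparisons):

```r
library(data.table)

# Two-row subset of the toy tables above, with real date columns
data1 <- data.table(date = as.IDate(c('2010-01-21', '2010-01-25')),
                    name = c('id1', 'id2'))
data2 <- data.table(beginning = as.IDate(c('2010-01-15', '2010-01-23')),
                    ending    = as.IDate(c('2010-01-22', '2010-01-29')),
                    class     = c(1, 2))

# For each interval in data2, find the data1 rows whose date falls inside it
res <- data1[data2, on = .(date >= beginning, date <= ending)]
res$name
#> [1] "id1" "id2"
```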

