Range Join Data.Frames - Specific Date Column with Date Ranges/Intervals in R

Update: overlap joins are implemented in data.table v1.9.3+. This is a special case where the start and end dates are identical in Speeches. We can accomplish it using foverlaps() as follows:

require(data.table) ## 1.9.3+
setDT(Speeches)
setDT(History)

Speeches[, `:=`(Date2 = Date, id = .I)]  # duplicate Date so each speech is a zero-width interval
setkey(History, Name, Role.Start, Role.End)

ans = foverlaps(Speeches, History, by.x=c("Name", "Date", "Date2"))[, Date2 := NULL]
ans = ans[order(id, Value)][, N := 1:.N, by=list(Name, Date, Role, id)]
ans = dcast.data.table(ans, id+Name+Date ~ Role+N, value.var="Value")

This is a case for range/interval join.

Here's the data.table way. It uses two rolling joins.

require(data.table) ## 1.9.2+
dt1 = as.data.table(Speeches)
dt2 = as.data.table(History)

# first rolling join - to get end indices
setkey(dt2, Name, Role.Start)
tmp1 = dt2[dt1, roll=Inf, which=TRUE]

# second rolling join - to get start indices
setkey(dt2, Name, Role.End)
tmp2 = dt2[dt1, roll=-Inf, which=TRUE]

# generate dt1's and dt2's corresponding row indices
idx = tmp1-tmp2+1L
idx1 = rep(seq_len(nrow(dt1)), idx)
idx2 = data.table:::vecseq(tmp2, idx, sum(idx))
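As an aside, data.table:::vecseq() is an unexported internal. If you'd rather not rely on it, base R's sequence() builds the same run-of-sequences vector (assuming R >= 4.0, where sequence() gained a from argument). The names below are hypothetical stand-ins for tmp2 and idx:

```r
# Hypothetical start indices and run lengths, standing in for tmp2 and idx above
starts  <- c(1L, 4L)
lengths <- c(2L, 3L)

sequence(lengths, from = starts)
#> [1] 1 2 4 5 6
```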

dt1[, id := 1:.N] ## needed for casting later

# subset using idx1 and idx2 and bind them colwise
ans = cbind(dt1[idx1], dt2[idx2, -1L, with=FALSE])

# a little reordering to get the output correctly (factors are a pain!)
ans = ans[order(id,Value)][, N := 1:.N, by=list(Name, Date, Role, id)]

# finally cast them.
f_ans = dcast.data.table(ans, id+Name+Date ~ Role+N, value.var="Value")

Here's the output:

    id Name       Date Political groups_1 National parties_1 Member_1 Member_2 Member_3 Substitute_1
 1:  1  AAA 2004-05-05                  j                  l        c        f       NA            d
 2:  2  AAA 2003-12-18                  j                  l        c        f        h            d
 3:  3  AAA 2003-12-18                  j                  l        c        f        h            d
 4:  4  AAA 2003-12-18                  j                  l        c        f        h            d
 5:  5  AAA 2003-11-17                  j                  l        c        f        h            d
 6:  6  AAA 2003-11-06                  j                  l        c        f        h            d
 7:  7  AAA 2003-10-20                  j                  l        c        f        h            d
 8:  8  AAA 2003-09-25                  j                  l        c        f        h            d
 9:  9  AAA 2003-06-04                  j                  l        c        f        h            d
10: 10  BBB 2012-04-20                  i                  k        b        g       NA           NA
11: 11  BBB 2012-04-19                  i                  k        b        g       NA           NA
12: 12  BBB 2012-04-19                  i                  k        b        g       NA           NA
13: 13  BBB 2012-04-19                  i                  k        b        g       NA           NA
14: 14  BBB 2012-04-19                  i                  k        b        g       NA           NA
15: 15  BBB 2012-04-19                  i                  k        b        g       NA           NA
16: 16  BBB 2012-04-19                  i                  k        b        g       NA           NA
17: 17  BBB 2012-04-19                  i                  k        b        g       NA           NA
18: 18  BBB 2012-04-18                  i                  k        b        g       NA           NA
19: 19  BBB 2012-04-18                  i                  k        b        g       NA           NA
20: 20  BBB 2012-04-18                  i                  k        b        g       NA           NA

Alternatively, you can accomplish this using the GenomicRanges package from Bioconductor, which handles ranges quite nicely, especially when you need an additional column to join by (Name) in addition to the ranges. You can install it from Bioconductor.

require(GenomicRanges)
require(data.table)
dt1 <- as.data.table(Speeches)
dt2 <- as.data.table(History)
gr1 = GRanges(Rle(dt1$Name), IRanges(as.numeric(dt1$Date), as.numeric(dt1$Date)))
gr2 = GRanges(Rle(dt2$Name), IRanges(as.numeric(dt2$Role.Start), as.numeric(dt2$Role.End)))

olaps = findOverlaps(gr1, gr2, type="within")
idx1 = queryHits(olaps)
idx2 = subjectHits(olaps)

# from here, you can do exactly as above
dt1[, id := 1:.N]
...
...
dcast.data.table(ans, id+Name+Date ~ Role+N, value.var="Value")

Gives the same result as above.

Join within date range

Your logic is not as simple as "between": it appears that you want any kind of overlap, whether one interval is a superset of the other or otherwise. For that we need a slightly different query (and, I'm inferring, the left join should include ID as well).

sqldf::sqldf("
  select h.*, o.DT_START_OUT, o.DT_END_OUT
  from HOSP h
  left join OUT o on h.ID = o.ID
    and h.DT_START_HOSP < o.DT_END_OUT
    and h.DT_END_HOSP > o.DT_START_OUT")
#    ID DT_START_HOSP DT_END_HOSP DT_START_OUT DT_END_OUT
# 1 111    2021-01-07  2021-01-10         <NA>       <NA>
# 2 222    2021-01-11  2021-01-20   2021-01-15 2021-01-15
# 3 333    2021-01-21  2021-01-25         <NA>       <NA>
# 4 444    2021-01-21  2021-02-01         <NA>       <NA>
# 5 555    2021-01-21  2021-01-29         <NA>       <NA>
# 6 666    2021-01-22  2021-02-02   2021-01-25 2021-01-25
# 7 666    2021-01-22  2021-02-02   2021-01-28 2021-01-30

(Thank you for fixing your data from the previous question and the first draft of this one. For the record, you may want this, some handy code that deals well with vectors of dates in inconsistent/different formats.)

Is there any way to join two data frames by date ranges?

You could also try fuzzyjoin, as suggested by @Gregor Thomas. I added a row-number column to make sure you have unique rows independent of item and date ranges (though this may not be needed).

library(fuzzyjoin)
library(dplyr)

daily_forecast %>%
  mutate(rn = row_number()) %>%
  fuzzy_left_join(actual_orders,
                  by = c("item" = "item",
                         "date_fcsted" = "order_date",
                         "extended_date" = "order_date"),
                  match_fun = list(`==`, `<=`, `>=`)) %>%
  group_by(rn, item.x, date_fcsted, extended_date, fcsted_qty) %>%
  summarise(actual_total_demand = sum(order_qty))

Output

     rn item.x date_fcsted extended_date fcsted_qty actual_total_demand
  <int> <chr>  <date>      <date>             <dbl>               <dbl>
1     1 A      2020-08-01  2020-08-28           100                 221
2     2 B      2020-08-01  2020-08-28           200                 219
3     3 A      2020-08-15  2020-09-11           200                 212
4     4 B      2020-08-15  2020-09-11           100                 216

merge dataframes based on date range in R

Assuming you are using lubridate for the DateTime column, you can do:

df1 %>% left_join(df2, by = c("SiteID" = "Site.ID")) %>%
  filter(DateTime %within% interval(Start.date, End.date))
#> # A tibble: 8 × 4
#> DateTime SiteID Start.date End.date
#> <dttm> <chr> <date> <date>
#> 1 2010-07-25 01:06:55 B04 2010-07-18 2010-08-24
#> 2 2011-05-10 23:52:14 B04 2011-02-22 2011-07-23
#> 3 2011-09-17 01:14:30 B04 2011-08-30 2012-10-03
#> 4 2012-04-04 02:55:29 B05 2011-08-30 2012-08-21
#> 5 2013-01-05 23:03:06 B05 2012-12-08 2013-01-21
#> 6 2011-03-09 20:39:46 B06 2011-02-12 2011-04-18
#> 7 2012-07-25 23:17:19 B07 2011-12-29 2012-12-02
#> 8 2011-03-03 00:46:45 B08 2011-02-12 2011-04-01

Full Example:

library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union

df1 <- tribble(~DateTime, ~SiteID,
               "2010-07-25 01:06:55", "B04",
               "2011-05-10 23:52:14", "B04",
               "2011-09-17 01:14:30", "B04",
               "2012-04-04 02:55:29", "B05",
               "2013-01-05 23:03:06", "B05",
               "2011-03-09 20:39:46", "B06",
               "2012-07-25 23:17:19", "B07",
               "2011-03-03 00:46:45", "B08") %>%
  mutate(DateTime = as_datetime(DateTime))

df2 <- tribble(~Site.ID, ~Start.date, ~End.date,
               "B04", "2010-07-18", "2010-08-24",
               "B04", "2011-02-22", "2011-07-23",
               "B04", "2011-08-30", "2012-10-03",
               "B04", "2012-10-20", "2013-04-08",
               "B05", "2011-08-30", "2012-08-21",
               "B05", "2012-12-08", "2013-01-21",
               "B05", "2013-02-08", "2013-04-08",
               "B06", "2010-07-20", "2010-09-03",
               "B06", "2011-02-12", "2011-04-18",
               "B06", "2011-05-13", "2011-05-16",
               "B07", "2011-10-24", "2011-11-29",
               "B07", "2011-12-29", "2012-12-02",
               "B08", "2011-02-12", "2011-04-01",
               "B08", "2011-10-24", "2011-12-24") %>%
  mutate(Start.date = as_date(Start.date), End.date = as_date(End.date))

df1 %>% left_join(df2, by = c("SiteID" = "Site.ID")) %>%
  filter(DateTime %within% interval(Start.date, End.date))
#> # A tibble: 8 × 4
#> DateTime SiteID Start.date End.date
#> <dttm> <chr> <date> <date>
#> 1 2010-07-25 01:06:55 B04 2010-07-18 2010-08-24
#> 2 2011-05-10 23:52:14 B04 2011-02-22 2011-07-23
#> 3 2011-09-17 01:14:30 B04 2011-08-30 2012-10-03
#> 4 2012-04-04 02:55:29 B05 2011-08-30 2012-08-21
#> 5 2013-01-05 23:03:06 B05 2012-12-08 2013-01-21
#> 6 2011-03-09 20:39:46 B06 2011-02-12 2011-04-18
#> 7 2012-07-25 23:17:19 B07 2011-12-29 2012-12-02
#> 8 2011-03-03 00:46:45 B08 2011-02-12 2011-04-01

Created on 2022-06-09 by the reprex package (v2.0.1)

merging tables based on time ranges/intervals using lubridate

This may be inefficient for large data frames (you materialize the full user-level merge before subsetting), and I'm sure there's a more elegant way, but this works:

merged <- merge(df1, df2, by = "User")
output <- merged[merged$Date %within% merged$interval, ]

Or you could use a loop:

for (x in seq_along(df1$User)) {
  df1$Price[x] <- df2[(df1$Date[x] %within% df2$interval) & df1$User[x] == df2$User, ]$Price
}

I'm sure you could also make a function and use apply...
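To sketch that apply-based idea: the toy data below is made up (the original df1/df2 aren't shown here), and like the loop above it assumes exactly one match per row of df1.

```r
library(lubridate)

# Hypothetical stand-ins for the df1/df2 in the answer above
df2 <- data.frame(User = c("A", "B"), Price = c(10, 20))
df2$interval <- interval(ymd(c("2021-01-01", "2021-02-01")),
                         ymd(c("2021-01-31", "2021-02-28")))

df1 <- data.frame(User = c("A", "B"),
                  Date = ymd(c("2021-01-15", "2021-02-10")))

# One price lookup per row of df1, assuming exactly one match each
df1$Price <- vapply(seq_len(nrow(df1)), function(i) {
  df2$Price[df1$User[i] == df2$User & df1$Date[i] %within% df2$interval]
}, numeric(1))

df1$Price
#> [1] 10 20
```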

R - Summarize values between specific date range

A couple of alternatives to consider. I assume your dates are actual dates and not character values.

You can try using fuzzyjoin to merge the two data.frames, including rows where the dates fall between start_dates and end_dates.

library(tidyverse)
library(fuzzyjoin)

fuzzy_left_join(
  date_df,
  df,
  by = c("start_dates" = "dates", "end_dates" = "dates"),
  match_fun = list(`<=`, `>=`)
) %>%
  group_by(start_dates, end_dates) %>%
  summarise(new_goal_column = sum(x))

Output

  start_dates end_dates  new_goal_column
  <date>      <date>               <dbl>
1 2021-01-01  2021-01-06              19
2 2021-01-07  2021-01-10               6

You can also try using data.table and joining.

library(data.table)

setDT(date_df)
setDT(df)

df[date_df, .(start_dates, end_dates, x), on = .(dates >= start_dates, dates <= end_dates)][
, .(new_goal_column = sum(x)), by = .(start_dates, end_dates)
]

Output

   start_dates  end_dates new_goal_column
1:  2021-01-01 2021-01-06              19
2:  2021-01-07 2021-01-10               6

Merging two dataframes on a date range in R

In general, it's not a good idea to use POSIXlt columns in data frames; use POSIXct instead:

df1a <- transform(df1,
                  StartDateTime = as.POSIXct(StartDateTime),
                  EndDateTime = as.POSIXct(EndDateTime))
df2a <- transform(df2, dateTime = as.POSIXct(dateTime))

The SQL statement in the question also has an extraneous comma before FROM.

Here is a slightly simplified statement. It uses a left join to ensure that all IDs from df1a are included even if they have no matches in df2a.

sqldf("SELECT df1a.ID, PtID, dateTime, lat, lon 
FROM df1a LEFT JOIN df2a
ON df1a.ID = df2a.ID AND dateTime BETWEEN StartDateTime AND EndDateTime")

How to perform join over date ranges using data.table?

You can use the foverlaps() function, which implements joins over intervals efficiently. In your case, we just need a dummy column for measurments.

Note 1: You should install the development version of data.table, v1.9.5, as a bug with foverlaps() has been fixed there. You can find the installation instructions here.

Note 2: I'll call whatWasMeasured = dt1 and measurments = dt2 here for convenience.

require(data.table) ## 1.9.5+
dt2[, dummy := time]

setkey(dt1, start, end)
ans = foverlaps(dt2, dt1, by.x=c("time", "dummy"), nomatch=0L)[, dummy := NULL]

See ?foverlaps for more info and this post for a performance comparison.
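Since whatWasMeasured and measurments aren't shown here, the following is a minimal self-contained sketch of the same pattern with made-up tables (dt1 holds the intervals, dt2 the point measurements):

```r
library(data.table)  # 1.9.5+

# Made-up stand-ins: dt1 = intervals of what was measured, dt2 = point events
dt1 <- data.table(start = c(1L, 10L), end = c(5L, 15L), what = c("a", "b"))
dt2 <- data.table(time = c(2L, 12L, 99L), value = c(0.1, 0.2, 0.3))

dt2[, dummy := time]     # turn each point into a zero-width interval
setkey(dt1, start, end)  # foverlaps() requires the y table keyed on the interval

ans <- foverlaps(dt2, dt1, by.x = c("time", "dummy"), nomatch = 0L)[, dummy := NULL]
ans$what
#> [1] "a" "b"
```

The unmatched point (time = 99) is dropped because of nomatch = 0L; use nomatch = NA to keep it.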

Join tables by date range

I know the following looks horrible in base R, but here's what I came up with. It's better to use the sqldf package (see below).

library(data.table)
data1 <- data.table(date = c('2010-01-21', '2010-01-25', '2010-02-02', '2010-02-09'),
                    name = c('id1', 'id2', 'id3', 'id4'))

data2 <- data.table(beginning = c('2010-01-15', '2010-01-23', '2010-01-30', '2010-02-05'),
                    ending    = c('2010-01-22', '2010-01-29', '2010-02-04', '2010-02-13'),
                    class     = c(1, 2, 3, 4))

in_range <- function(x) data2$beginning[x] < data1$date & data2$ending[x] > data1$date
result <- cbind(data1,
                beginning = sapply(1:nrow(data2), function(x) data2$beginning[in_range(x)]),
                ending    = sapply(1:nrow(data2), function(x) data2$ending[in_range(x)]),
                class     = sapply(1:nrow(data2), function(x) data2$class[in_range(x)]))

Using the package sqldf:

library(sqldf)
result = sqldf("select * from data1
                left join data2
                on data1.date between data2.beginning and data2.ending")

Using data.table, this is simply:

data1[data2, on = .(date >= beginning, date <= ending)]
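To see the non-equi join in action, here is a self-contained run on a two-row subset of the toy data above (dates converted to IDate so the range comparisons are date comparisons rather than string comparisons):

```r
library(data.table)

# Two-row subset of the toy tables above, with real date columns
data1 <- data.table(date = as.IDate(c('2010-01-21', '2010-01-25')),
                    name = c('id1', 'id2'))
data2 <- data.table(beginning = as.IDate(c('2010-01-15', '2010-01-23')),
                    ending    = as.IDate(c('2010-01-22', '2010-01-29')),
                    class     = c(1, 2))

# For each interval in data2, find the data1 rows whose date falls inside it
res <- data1[data2, on = .(date >= beginning, date <= ending)]
res$name
#> [1] "id1" "id2"
```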

