Find All Date Ranges for Overlapping Start and End Dates in R

Find all date ranges for overlapping start and end dates in R

Try this:

w[] <- lapply(w, function(x) as.Date(x, '%Y-%m-%d'))
w <- w[order(w$start.date),] # sort the data by start dates if already not sorted
w$group <- 1:nrow(w) # common intervals should belong to same group
merge.indices <- lapply(2:nrow(w), function(x) {
indices <- which(findInterval(w$end.date[1:(x-1)], w$start.date[x])==1)
if (length(indices) > 0) indices <- c(indices, x)
indices})
# assign the intervals the right groups
for (i in 1:length(merge.indices)) {
if (length(merge.indices[[i]]) > 0) {
w$group[merge.indices[[i]]] <- min(w$group[merge.indices[[i]]])
}
}

do.call(rbind, lapply(split(w, w$group), function(x) data.frame(start.date=min(x[,1]), end.date=max(x[,2]))))

It conceptually merges overlapping intervals into the same group as shown below:
Sample Image

with output:

   start.date   end.date
1 2006-01-19 2006-01-20
2 2006-01-25 2006-01-29
3 2006-02-24 2006-02-25
4 2006-03-15 2006-03-22
5 2006-04-29 2006-04-30
6 2006-05-24 2006-05-25
7 2006-06-26 2006-08-16
11 2006-08-18 2006-08-19
12 2006-08-28 2006-09-02

Calculating overlapping dates in R (dplyr)

Here is another option using the IRanges package on Bioconductor. The collapse_date_ranges function is taken from here, and I just adjusted according

library(data.table)
library(tidyverse)

collapse_date_ranges <- function(w, min.gapwidth = 1L) {
IRanges::IRanges(start = as.integer(as.Date(w$First_day)),
end = as.integer(as.Date(w$Last_day))) %>%
IRanges::reduce(min.gapwidth = min.gapwidth) %>%
as.data.table() %>%
.[, lapply(.SD, lubridate::as_date),
.SDcols = c("start", "end")]
}

split(df, df$CSN_id) %>%
map(., ~collapse_date_ranges(., 0L)) %>%
bind_rows(., .id = 'id')

Output

   id      start        end
1: 1 2018-01-03 2018-01-11
2: 1 2018-01-15 2018-01-22
3: 2 2018-01-17 2018-01-19
4: 3 2018-01-17 2018-02-04
5: 4 2018-01-31 2018-02-17
6: 4 2018-02-22 2018-02-27
7: 4 2018-03-02 2018-03-27

If you want to have this in the original dataframe, then we can join the data back to the original dataframe, then use fill to add the dates to each row.

split(df, df$CSN_id) %>% 
map(., ~collapse_date_ranges(., 0L)) %>%
bind_rows(., .id = 'CSN_id2') %>%
data.frame %>%
mutate(CSN_id2 = as.integer(CSN_id2)) %>%
full_join(df, ., by = c("CSN_id" = "CSN_id2", "First_day" = "start"), keep = TRUE) %>%
select(-CSN_id2) %>%
group_by(CSN_id) %>%
fill(start, end, .direction = "down")

Output

   CSN_id First_day           Last_day            start      end       
<int> <dttm> <dttm> <date> <date>
1 1 2018-01-03 00:00:00 2018-01-11 00:00:00 2018-01-03 2018-01-11
2 1 2018-01-03 00:00:00 2018-01-05 00:00:00 2018-01-03 2018-01-11
3 1 2018-01-04 00:00:00 2018-01-10 00:00:00 2018-01-03 2018-01-11
4 1 2018-01-04 00:00:00 2018-01-10 00:00:00 2018-01-03 2018-01-11
5 1 2018-01-05 00:00:00 2018-01-11 00:00:00 2018-01-03 2018-01-11
6 1 2018-01-15 00:00:00 2018-01-22 00:00:00 2018-01-15 2018-01-22
7 2 2018-01-17 00:00:00 2018-01-19 00:00:00 2018-01-17 2018-01-19
8 3 2018-01-17 00:00:00 2018-01-29 00:00:00 2018-01-17 2018-02-04
9 3 2018-01-17 00:00:00 2018-01-18 00:00:00 2018-01-17 2018-02-04
10 3 2018-01-18 00:00:00 2018-01-29 00:00:00 2018-01-17 2018-02-04
# … with 11 more rows

Determine Whether Two Date Ranges Overlap

(StartA <= EndB) and (EndA >= StartB)

Proof:

Let ConditionA Mean that DateRange A Completely After DateRange B

_                        |---- DateRange A ------|
|---Date Range B -----| _

(True if StartA > EndB)

Let ConditionB Mean that DateRange A is Completely Before DateRange B

|---- DateRange A -----|                        _ 
_ |---Date Range B ----|

(True if EndA < StartB)

Then Overlap exists if Neither A Nor B is true -

(If one range is neither completely after the other,

nor completely before the other,
then they must overlap.)

Now one of De Morgan's laws says that:

Not (A Or B) <=> Not A And Not B

Which translates to: (StartA <= EndB) and (EndA >= StartB)


NOTE: This includes conditions where the edges overlap exactly. If you wish to exclude that,

change the >= operators to >, and <= to <


NOTE2. Thanks to @Baodad, see this blog, the actual overlap is least of:

{ endA-startA, endA - startB, endB-startA, endB - startB }

(StartA <= EndB) and (EndA >= StartB)
(StartA <= EndB) and (StartB <= EndA)


NOTE3. Thanks to @tomosius, a shorter version reads:

DateRangesOverlap = max(start1, start2) < min(end1, end2)

This is actually a syntactical shortcut for what is a longer implementation, which includes extra checks to verify that the start dates are on or before the endDates. Deriving this from above:

If start and end dates can be out of order, i.e., if it is possible that startA > endA or startB > endB, then you also have to check that they are in order, so that means you have to add two additional validity rules:

(StartA <= EndB) and (StartB <= EndA) and (StartA <= EndA) and (StartB <= EndB)
or:

(StartA <= EndB) and (StartA <= EndA) and (StartB <= EndA) and (StartB <= EndB)
or,

(StartA <= Min(EndA, EndB) and (StartB <= Min(EndA, EndB))
or:

(Max(StartA, StartB) <= Min(EndA, EndB)

But to implement Min() and Max(), you have to code, (using C ternary for terseness),:

(StartA > StartB? Start A: StartB) <= (EndA < EndB? EndA: EndB)

Identify overlapping date ranges by ID R

First convert the dates to Date class. Then a self join on id and the intersection criteria will join all relevant overlapping rows. overlap is 1 if that row has an overlap and 0 otherwise. overlaps lists the row numbers of the overlaps for that row. We used row numbers rowid but we could replace each occurrence of it in the code below with row_n if desired.

library(sqldf)

fmt <- "%m/%d/%Y"
eg2 <- transform(eg_data,
start_dt = as.Date(start_dt, fmt),
end_dt = as.Date(end_dt, fmt))

sqldf("select
a.*,
count(b.rowid) > 0 as overlap,
coalesce(group_concat(b.rowid), '') as overlaps
from eg2 a
left join eg2 b on a.id = b.id and
not a.rowid = b.rowid and
((a.start_dt between b.start_dt and b.end_dt) or
(b.start_dt between a.start_dt and a.end_dt))
group by a.rowid
order by a.rowid")

giving:

   id   start_dt     end_dt row_n overlap overlaps
1 1 2016-01-01 2016-12-01 1 0
2 1 2016-12-02 2017-03-14 2 1 3
3 1 2017-03-12 2017-05-15 3 1 2
4 2 2016-02-01 2016-05-15 4 0
5 2 2016-08-12 2016-12-29 5 0
6 3 2016-01-01 2016-03-02 6 0
7 3 2016-03-05 2016-04-29 7 0
8 3 2016-05-07 2016-06-29 8 0
9 3 2016-07-01 2016-08-31 9 0
10 3 2016-09-04 2016-09-25 10 0
11 3 2016-10-10 2016-11-29 11 0
12 4 2016-01-01 2016-05-31 12 1 13
13 4 2016-05-28 2016-08-19 13 1 12
14 5 2016-01-01 2016-06-10 14 1 15
15 5 2016-06-05 2016-07-25 15 1 14
16 5 2016-08-25 2016-08-29 16 0
17 5 2016-11-01 2016-12-30 17 0

Calculate each overlapping date ranges from two independent databases in r

An option using data.table::foverlaps:

foverlaps(data1, data2)[, 
sum(1L + pmin(Drugend, FUend) - pmax(Drugstart, FUstart)),
.(ID, FUstart, FUend)]

output and I am also getting slightly diff numbers from OP's expected output:

   ID    FUstart      FUend V1
1: 1 2019-01-01 2019-03-31 16
2: 1 2019-04-01 2019-06-30 10
3: 1 2019-07-01 2019-09-30 92
4: 1 2019-10-01 2019-12-31 5
5: 2 2019-04-01 2019-06-30 91
6: 2 2019-07-01 2019-09-30 92
7: 2 2019-10-01 2019-12-31 10

data:

library(data.table)
setDT(data1)
cols <- paste0("FU", c("start","end"))
data1[, (cols) := lapply(.SD, as.IDate, format="%Y-%m-%d"), .SDcols=cols]
setkeyv(data1, c("ID", cols))

#too lazy to generalize and hence copy paste
setDT(data2)
cols <- paste0("Drug", c("start","end"))
data2[, (cols) := lapply(.SD, as.IDate, format="%Y-%m-%d"), .SDcols=cols]
setkeyv(data2, c("ID", cols))

Count how many times date ranges overlap for each category (machine)

First I would recommend using lubridate since this involves time intervals and purrr since you are essentially needing iterate through the individual intervals to compare against the list.

Two notes based on your desired outcome. First, lubridate views intervals as inclusive of their boundaries, i.e. if two intervals share a boundary they overlap. This means, for example, 5A and 9A overlap because one starts at 00:35:00 and the other ends at 00:35:00. Second, while it's not clear from your problem description, it appears you only want to compare across the same machine. For example, it doesn't count that 4A is running at the same time as 3C.

library(dplyr)
library(lubridate)
library(purrr)

datateste %>%
mutate(
run_interval = interval(dmy_hms(start_time), dmy_hms(end_time)),
numb_times_with_overlap = imap_int(
run_interval,
~sum(
int_overlaps(.x, run_interval) &
(machine_id == machine_id[.y])
) - 1L
)
) %>%
select(-run_interval)

#> id machine_id start_time end_time numb_times_with_overlap
#> 1 1 A 01/12/2021 00:00:00 01/12/2021 00:10:10 1
#> 2 2 B 01/12/2021 04:15:10 01/12/2021 04:45:03 1
#> 3 3 C 01/12/2021 00:15:00 01/12/2021 00:30:53 0
#> 4 4 A 01/12/2021 00:05:07 01/12/2021 00:30:02 2
#> 5 5 A 01/12/2021 00:35:00 01/12/2021 00:39:00 1
#> 6 6 B 01/12/2021 04:00:00 01/12/2021 04:12:45 0
#> 7 7 C 01/12/2021 04:07:00 01/12/2021 04:34:00 0
#> 8 8 B 01/12/2021 04:44:34 01/12/2021 05:06:34 1
#> 9 9 A 01/12/2021 00:15:00 01/12/2021 00:35:00 2

What's going on here is first, the we create a column of intervals (run_interval). Then we can then iterate through this column using a map function, in this case imap_int. This compares the row to the entire column to using int_overlap, then includes only rows where machine_id matches that row (the .y argument is the index of the current row). Summing a logical vector gives you a count and you need to subtract 1 to account for self-matching.

The counts are different from your expected results because of the previously mentioned way lubridate determines overlaps. I suppose a work around would be to add a microsecond to each start time if you don't want this definition of an interval.

In R: is there a way to flag overlapping date ranges within each specific group in a table? (i.e. by patient ID)

does this answer your question ?

library(data.table)
admissions <- data.table(
ID = c(1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 5),
admdate = c("2001-10-03", "2001-10-05", "2003-10-04", "2006-02-03", "2006-05-27", "2006-07-01", "2001-08-02", "2008-10-11", "2008-11-01", "2009-01-09", "2009-02-18"),
dischdate = c("2001-10-05", "2001-12-08", "2003-10-04", "2006-05-29", "2006-06-01", "2006-07-07", "2001-08-11", "2008-10-14", "2009-01-13", "2009-01-21", "2009-02-26")
)

# Non equi joins are only possible with numeric fields
admissions[,c('start','end'):=.(as.POSIXct(admdate),
as.POSIXct(dischdate))]

admissions[admissions, on = .(ID=ID,start<start,end>start ),nomatch = NULL]

How to find overlapping between date and time?

Here is an approach which does not use any additional packages:

Firstly, we merge all start and end dates and then we arrange them in an increasing order. Like this, we get a table of all time slots.

df$Start_Date = strptime(as.character(df$Start_Date),format = "%d/%m/%Y %H:%M")
df$End_Date = strptime(as.character(df$End_Date),format = "%d/%m/%Y %H:%M")

dates=sort(unique(c(df[,2],df[,3])))

df2=data.frame(start=dates[1:length(dates)-1],end=dates[2:length(dates)])

For each time slot we check which product was active during the given time period. If there is no such product, then "NA" is returned.

active<-function(start,end){
tmp<-as.vector(df[df$Start_Date<=start & df$End_Date>= end,1])
if(length(tmp)>0){
paste(tmp, collapse="/")
}
else{
return(NA)
}
}

activeproducts <- mapply(active,df2[,1],df2[,2])
df2<-transform(df2,products=activeproducts)

To get rid of the time slots where no product was active, we use the complete.cases function.

df2<- df2[complete.cases(df2),]

>df2

start end products
1 2015-01-01 08:00:00 2015-01-02 09:00:00 x
3 2015-01-03 10:00:00 2015-01-04 09:00:00 y
4 2015-01-04 09:00:00 2015-01-04 12:34:00 y/z
5 2015-01-04 12:34:00 2015-01-05 12:00:00 y/z/x
6 2015-01-05 12:00:00 2015-01-05 13:00:00 z/x
7 2015-01-05 13:00:00 2015-01-07 11:23:00 x


Related Topics



Leave a reply



Submit