Find all date ranges for overlapping start and end dates in R
Try this:
w[] <- lapply(w, function(x) as.Date(x, '%Y-%m-%d'))
w <- w[order(w$start.date),] # sort the data by start dates if already not sorted
w$group <- 1:nrow(w) # common intervals should belong to same group
merge.indices <- lapply(2:nrow(w), function(x) {
indices <- which(findInterval(w$end.date[1:(x-1)], w$start.date[x])==1)
if (length(indices) > 0) indices <- c(indices, x)
indices})
# assign the intervals the right groups
for (i in 1:length(merge.indices)) {
if (length(merge.indices[[i]]) > 0) {
w$group[merge.indices[[i]]] <- min(w$group[merge.indices[[i]]])
}
}
do.call(rbind, lapply(split(w, w$group), function(x) data.frame(start.date=min(x[,1]), end.date=max(x[,2]))))
It conceptually merges overlapping intervals into the same group as shown below:
with output:
start.date end.date
1 2006-01-19 2006-01-20
2 2006-01-25 2006-01-29
3 2006-02-24 2006-02-25
4 2006-03-15 2006-03-22
5 2006-04-29 2006-04-30
6 2006-05-24 2006-05-25
7 2006-06-26 2006-08-16
11 2006-08-18 2006-08-19
12 2006-08-28 2006-09-02
Calculating overlapping dates in R (dplyr)
Here is another option using the IRanges
package on Bioconductor. The collapse_date_ranges
function is taken from here, and I just adjusted according
library(data.table)
library(tidyverse)
collapse_date_ranges <- function(w, min.gapwidth = 1L) {
IRanges::IRanges(start = as.integer(as.Date(w$First_day)),
end = as.integer(as.Date(w$Last_day))) %>%
IRanges::reduce(min.gapwidth = min.gapwidth) %>%
as.data.table() %>%
.[, lapply(.SD, lubridate::as_date),
.SDcols = c("start", "end")]
}
split(df, df$CSN_id) %>%
map(., ~collapse_date_ranges(., 0L)) %>%
bind_rows(., .id = 'id')
Output
id start end
1: 1 2018-01-03 2018-01-11
2: 1 2018-01-15 2018-01-22
3: 2 2018-01-17 2018-01-19
4: 3 2018-01-17 2018-02-04
5: 4 2018-01-31 2018-02-17
6: 4 2018-02-22 2018-02-27
7: 4 2018-03-02 2018-03-27
If you want to have this in the original dataframe, then we can join the data back to the original dataframe, then use fill
to add the dates to each row.
split(df, df$CSN_id) %>%
map(., ~collapse_date_ranges(., 0L)) %>%
bind_rows(., .id = 'CSN_id2') %>%
data.frame %>%
mutate(CSN_id2 = as.integer(CSN_id2)) %>%
full_join(df, ., by = c("CSN_id" = "CSN_id2", "First_day" = "start"), keep = TRUE) %>%
select(-CSN_id2) %>%
group_by(CSN_id) %>%
fill(start, end, .direction = "down")
Output
CSN_id First_day Last_day start end
<int> <dttm> <dttm> <date> <date>
1 1 2018-01-03 00:00:00 2018-01-11 00:00:00 2018-01-03 2018-01-11
2 1 2018-01-03 00:00:00 2018-01-05 00:00:00 2018-01-03 2018-01-11
3 1 2018-01-04 00:00:00 2018-01-10 00:00:00 2018-01-03 2018-01-11
4 1 2018-01-04 00:00:00 2018-01-10 00:00:00 2018-01-03 2018-01-11
5 1 2018-01-05 00:00:00 2018-01-11 00:00:00 2018-01-03 2018-01-11
6 1 2018-01-15 00:00:00 2018-01-22 00:00:00 2018-01-15 2018-01-22
7 2 2018-01-17 00:00:00 2018-01-19 00:00:00 2018-01-17 2018-01-19
8 3 2018-01-17 00:00:00 2018-01-29 00:00:00 2018-01-17 2018-02-04
9 3 2018-01-17 00:00:00 2018-01-18 00:00:00 2018-01-17 2018-02-04
10 3 2018-01-18 00:00:00 2018-01-29 00:00:00 2018-01-17 2018-02-04
# … with 11 more rows
Determine Whether Two Date Ranges Overlap
(StartA <= EndB) and (EndA >= StartB)
Proof:
Let ConditionA Mean that DateRange A Completely After DateRange B
_ |---- DateRange A ------|
|---Date Range B -----| _
(True if StartA > EndB
)
Let ConditionB Mean that DateRange A is Completely Before DateRange B
|---- DateRange A -----| _
_ |---Date Range B ----|
(True if EndA < StartB
)
Then Overlap exists if Neither A Nor B is true -
(If one range is neither completely after the other,
nor completely before the other,
then they must overlap.)
Now one of De Morgan's laws says that:
Not (A Or B)
<=> Not A And Not B
Which translates to: (StartA <= EndB) and (EndA >= StartB)
NOTE: This includes conditions where the edges overlap exactly. If you wish to exclude that,
change the >=
operators to >
, and <=
to <
NOTE2. Thanks to @Baodad, see this blog, the actual overlap is least of:
{ endA-startA
, endA - startB
, endB-startA
, endB - startB
}
(StartA <= EndB) and (EndA >= StartB)
(StartA <= EndB) and (StartB <= EndA)
NOTE3. Thanks to @tomosius, a shorter version reads:DateRangesOverlap = max(start1, start2) < min(end1, end2)
This is actually a syntactical shortcut for what is a longer implementation, which includes extra checks to verify that the start dates are on or before the endDates. Deriving this from above:
If start and end dates can be out of order, i.e., if it is possible that startA > endA
or startB > endB
, then you also have to check that they are in order, so that means you have to add two additional validity rules:(StartA <= EndB) and (StartB <= EndA) and (StartA <= EndA) and (StartB <= EndB)
or:(StartA <= EndB) and (StartA <= EndA) and (StartB <= EndA) and (StartB <= EndB)
or,(StartA <= Min(EndA, EndB) and (StartB <= Min(EndA, EndB))
or:(Max(StartA, StartB) <= Min(EndA, EndB)
But to implement Min()
and Max()
, you have to code, (using C ternary for terseness),:(StartA > StartB? Start A: StartB) <= (EndA < EndB? EndA: EndB)
Identify overlapping date ranges by ID R
First convert the dates to Date
class. Then a self join on id
and the intersection criteria will join all relevant overlapping rows. overlap
is 1 if that row has an overlap and 0 otherwise. overlaps
lists the row numbers of the overlaps for that row. We used row numbers rowid
but we could replace each occurrence of it in the code below with row_n
if desired.
library(sqldf)
fmt <- "%m/%d/%Y"
eg2 <- transform(eg_data,
start_dt = as.Date(start_dt, fmt),
end_dt = as.Date(end_dt, fmt))
sqldf("select
a.*,
count(b.rowid) > 0 as overlap,
coalesce(group_concat(b.rowid), '') as overlaps
from eg2 a
left join eg2 b on a.id = b.id and
not a.rowid = b.rowid and
((a.start_dt between b.start_dt and b.end_dt) or
(b.start_dt between a.start_dt and a.end_dt))
group by a.rowid
order by a.rowid")
giving:
id start_dt end_dt row_n overlap overlaps
1 1 2016-01-01 2016-12-01 1 0
2 1 2016-12-02 2017-03-14 2 1 3
3 1 2017-03-12 2017-05-15 3 1 2
4 2 2016-02-01 2016-05-15 4 0
5 2 2016-08-12 2016-12-29 5 0
6 3 2016-01-01 2016-03-02 6 0
7 3 2016-03-05 2016-04-29 7 0
8 3 2016-05-07 2016-06-29 8 0
9 3 2016-07-01 2016-08-31 9 0
10 3 2016-09-04 2016-09-25 10 0
11 3 2016-10-10 2016-11-29 11 0
12 4 2016-01-01 2016-05-31 12 1 13
13 4 2016-05-28 2016-08-19 13 1 12
14 5 2016-01-01 2016-06-10 14 1 15
15 5 2016-06-05 2016-07-25 15 1 14
16 5 2016-08-25 2016-08-29 16 0
17 5 2016-11-01 2016-12-30 17 0
Calculate each overlapping date ranges from two independent databases in r
An option using data.table::foverlaps
:
foverlaps(data1, data2)[,
sum(1L + pmin(Drugend, FUend) - pmax(Drugstart, FUstart)),
.(ID, FUstart, FUend)]
output and I am also getting slightly diff numbers from OP's expected output:
ID FUstart FUend V1
1: 1 2019-01-01 2019-03-31 16
2: 1 2019-04-01 2019-06-30 10
3: 1 2019-07-01 2019-09-30 92
4: 1 2019-10-01 2019-12-31 5
5: 2 2019-04-01 2019-06-30 91
6: 2 2019-07-01 2019-09-30 92
7: 2 2019-10-01 2019-12-31 10
data:
library(data.table)
setDT(data1)
cols <- paste0("FU", c("start","end"))
data1[, (cols) := lapply(.SD, as.IDate, format="%Y-%m-%d"), .SDcols=cols]
setkeyv(data1, c("ID", cols))
#too lazy to generalize and hence copy paste
setDT(data2)
cols <- paste0("Drug", c("start","end"))
data2[, (cols) := lapply(.SD, as.IDate, format="%Y-%m-%d"), .SDcols=cols]
setkeyv(data2, c("ID", cols))
Count how many times date ranges overlap for each category (machine)
First I would recommend using lubridate
since this involves time intervals and purrr
since you are essentially needing iterate through the individual intervals to compare against the list.
Two notes based on your desired outcome. First, lubridate
views intervals as inclusive of their boundaries, i.e. if two intervals share a boundary they overlap. This means, for example, 5A and 9A overlap because one starts at 00:35:00 and the other ends at 00:35:00. Second, while it's not clear from your problem description, it appears you only want to compare across the same machine. For example, it doesn't count that 4A is running at the same time as 3C.
library(dplyr)
library(lubridate)
library(purrr)
datateste %>%
mutate(
run_interval = interval(dmy_hms(start_time), dmy_hms(end_time)),
numb_times_with_overlap = imap_int(
run_interval,
~sum(
int_overlaps(.x, run_interval) &
(machine_id == machine_id[.y])
) - 1L
)
) %>%
select(-run_interval)
#> id machine_id start_time end_time numb_times_with_overlap
#> 1 1 A 01/12/2021 00:00:00 01/12/2021 00:10:10 1
#> 2 2 B 01/12/2021 04:15:10 01/12/2021 04:45:03 1
#> 3 3 C 01/12/2021 00:15:00 01/12/2021 00:30:53 0
#> 4 4 A 01/12/2021 00:05:07 01/12/2021 00:30:02 2
#> 5 5 A 01/12/2021 00:35:00 01/12/2021 00:39:00 1
#> 6 6 B 01/12/2021 04:00:00 01/12/2021 04:12:45 0
#> 7 7 C 01/12/2021 04:07:00 01/12/2021 04:34:00 0
#> 8 8 B 01/12/2021 04:44:34 01/12/2021 05:06:34 1
#> 9 9 A 01/12/2021 00:15:00 01/12/2021 00:35:00 2
What's going on here is first, the we create a column of intervals (run_interval
). Then we can then iterate through this column using a map function, in this case imap_int
. This compares the row to the entire column to using int_overlap
, then includes only rows where machine_id
matches that row (the .y
argument is the index of the current row). Summing a logical vector gives you a count and you need to subtract 1 to account for self-matching.
The counts are different from your expected results because of the previously mentioned way lubridate
determines overlaps. I suppose a work around would be to add a microsecond to each start time if you don't want this definition of an interval.
In R: is there a way to flag overlapping date ranges within each specific group in a table? (i.e. by patient ID)
does this answer your question ?
library(data.table)
admissions <- data.table(
ID = c(1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 5),
admdate = c("2001-10-03", "2001-10-05", "2003-10-04", "2006-02-03", "2006-05-27", "2006-07-01", "2001-08-02", "2008-10-11", "2008-11-01", "2009-01-09", "2009-02-18"),
dischdate = c("2001-10-05", "2001-12-08", "2003-10-04", "2006-05-29", "2006-06-01", "2006-07-07", "2001-08-11", "2008-10-14", "2009-01-13", "2009-01-21", "2009-02-26")
)
# Non equi joins are only possible with numeric fields
admissions[,c('start','end'):=.(as.POSIXct(admdate),
as.POSIXct(dischdate))]
admissions[admissions, on = .(ID=ID,start<start,end>start ),nomatch = NULL]
How to find overlapping between date and time?
Here is an approach which does not use any additional packages:
Firstly, we merge all start and end dates and then we arrange them in an increasing order. Like this, we get a table of all time slots.
df$Start_Date = strptime(as.character(df$Start_Date),format = "%d/%m/%Y %H:%M")
df$End_Date = strptime(as.character(df$End_Date),format = "%d/%m/%Y %H:%M")
dates=sort(unique(c(df[,2],df[,3])))
df2=data.frame(start=dates[1:length(dates)-1],end=dates[2:length(dates)])
For each time slot we check which product was active during the given time period. If there is no such product, then "NA" is returned.
active<-function(start,end){
tmp<-as.vector(df[df$Start_Date<=start & df$End_Date>= end,1])
if(length(tmp)>0){
paste(tmp, collapse="/")
}
else{
return(NA)
}
}
activeproducts <- mapply(active,df2[,1],df2[,2])
df2<-transform(df2,products=activeproducts)
To get rid of the time slots where no product was active, we use the complete.cases function.
df2<- df2[complete.cases(df2),]
>df2
start end products
1 2015-01-01 08:00:00 2015-01-02 09:00:00 x
3 2015-01-03 10:00:00 2015-01-04 09:00:00 y
4 2015-01-04 09:00:00 2015-01-04 12:34:00 y/z
5 2015-01-04 12:34:00 2015-01-05 12:00:00 y/z/x
6 2015-01-05 12:00:00 2015-01-05 13:00:00 z/x
7 2015-01-05 13:00:00 2015-01-07 11:23:00 x
Related Topics
Get a List of the Data Sets in a Particular Package
Row Operations in Data.Table Using 'By = .I'
Using R to Download Gzipped Data File, Extract, and Import Data
Shinydashboard Some Font Awesome Icons Not Working
Shiny: Merge Cells in Dt::Datatable
R: Replacing Na Values by Mean of Hour with Dplyr
Calculating Time Difference Between Two Columns
Ggplot2, Geom_Bar, Dodge, Order of Bars
R: How to Sum Columns Grouped by a Factor
Solving Non-Square Linear System with R
How to Facet a Plot_Ly() Chart
Geom_Tile and Facet_Grid/Facet_Wrap for Same Height of Tiles
Assign a Value, If a Number Is in Between Two Numbers
Operations on Multiple Tables/Datasets with Edit Queries and R in Power Bi