How to flatten / merge overlapping time periods
Here's a possible solution. The basic idea here is to compare lagged start
date with the maximum end date "until now" using the cummax
function and create an index that will separate the data into groups
data %>%
arrange(ID, start) %>% # as suggested by @Jonno in case the data is unsorted
group_by(ID) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
cummax(as.numeric(end)))[-n()])) %>%
group_by(ID, indx) %>%
summarise(start = first(start), end = last(end))
# Source: local data frame [3 x 4]
# Groups: ID
#
# ID indx start end
# 1 A 0 2013-01-01 2013-01-06
# 2 A 1 2013-01-07 2013-01-11
# 3 A 2 2013-01-12 2013-01-15
Merge overlapping time periods with milliseconds in R
I've found it is quite easy to preserve milliseconds if you use POSIXlt
format. Although there are faster ways to calculate the overlap, it's fast enough for most purposes to just loop through the data frame.
Here's a reproducible example.
start <- c("2019-07-15 21:32:43.565",
"2019-07-15 21:32:43.634",
"2019-07-15 21:32:54.301",
"2019-07-15 21:34:08.506",
"2019-07-15 21:34:09.957")
end <- c("2019-07-15 21:32:48.445",
"2019-07-15 21:32:49.045",
"2019-07-15 21:32:54.801",
"2019-07-15 21:34:10.111",
"2019-07-15 21:34:10.236")
df <- data.frame(start = as.POSIXlt(start), end = as.POSIXlt(end))
i <- 1
df <- data.frame(start = as.POSIXlt(start), end = as.POSIXlt(end))
while(i < nrow(df))
{
overlaps <- which(df$start < df$end[i] & df$end > df$start[i])
if(length(overlaps) > 1)
{
df$end[i] <- max(df$end[overlaps])
df <- df[-overlaps[-which(overlaps == i)], ]
i <- i - 1
}
i <- i + 1
}
So now our data frame doesn't have overlaps:
df
#> start end
#> 1 2019-07-15 21:32:43 2019-07-15 21:32:49
#> 3 2019-07-15 21:32:54 2019-07-15 21:32:54
#> 4 2019-07-15 21:34:08 2019-07-15 21:34:10
Although it appears we have lost the milliseconds, this is just a display issue, as we can show by doing this:
df$end - df$start
#> Time differences in secs
#> [1] 5.48 0.50 1.73
as.numeric(df$end - df$start)
#> [1] 5.48 0.50 1.73
Created on 2020-02-20 by the reprex package (v0.3.0)
How to separate overlapping time periods into overlapping and non-overlapping periods in R
My first answer assumes only overlapping two periods. This means it can use a single join for each comparison. Attempting to repeat this process for more than two time periods results in increasing numbers of joins, leading to an inefficient mess.
To handled joining an arbitrary (or unknown) number of overlaps we need a very different method. Hence I am providing this as a separate answer.
Step 1: Obtain a list of all possible start and end dates
all_start = df %>%
select(id, start)
all_end = df %>%
select(id, start = end)
all_start_and_end = rbind(all_start, all_end) %>%
distinct()
Step 2: Create a list of all possible periods
all_periods = all_start_and_end %>%
group_by(id) %>%
mutate(end = lead(start, 1, order_by = start))
Step 3: Overlap original data with all periods and summarise
overlapped = all_periods %>%
left_join(df, by = "id", suffix = c("_1","_2")) %>%
filter(start_1 <= end_2,
start_2 <= end_1) %>%
select(id, part_2, start = start_1, end = end_1) %>%
group_by(id, start, end) %>%
summarise(part = toString(part_2))
Depending on your exact data and situation:
- You may want to change "<=" to "<" or subtract 1 day from end dates to ensure periods do not overlap. This depends on how you are handling the boundary conditions of your time periods.
- You may want to remove the
distinct
in step 1 to allow for periods that are only a single day long. - In step 1 you can add a very early date (e.g. 0000-01-01) and a very late date (e.g. 9999-12-31) if you want the output to include all the time periods with
part = NA
. - Once step three completes you may want to filter out any periods with
part = NA
. - Depending on your input data you may observe adjacent output periods with the same
part
. E.g. in row 1: part A has end date 2020-01-01 and in row 2: part A has start date 2020-01-02. Take a look at thegaps-and-islands
tag for ways to solve this problem.
Collapse and merge overlapping time intervals
my_time_intervals %>%
group_by(group) %>% arrange(start_time, by_group = TRUE) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
cummax(as.numeric(end_time)))[-n()])) %>%
group_by(group, indx) %>%
summarise(start_time = min(start_time),
end_time = max(end_time)) %>%
select(-indx)
# # A tibble: 5 x 3
# # Groups: group [3]
# group start_time end_time
# <int> <dttm> <dttm>
# 1 1 2018-04-12 11:15:03 2018-05-23 08:13:06
# 2 1 2018-07-04 02:53:20 2018-07-14 18:09:01
# 3 2 2018-02-28 17:43:29 2018-08-12 12:56:37
# 4 2 2018-10-02 14:08:03 2018-11-08 00:01:23
# 5 3 2018-03-11 22:30:51 2018-10-20 21:01:42
Explanation per OP's request:
I am making another dataset which has more overlapping times within each group so the solution would get more exposure and hopefully will be grasped better;
my_time_intervals <- tribble(
~id, ~group, ~start_time, ~end_time,
1L, 1L, ymd_hms("2018-04-12 11:15:03"), ymd_hms("2018-05-14 02:32:10"),
2L, 1L, ymd_hms("2018-07-04 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
3L, 1L, ymd_hms("2018-07-05 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
4L, 1L, ymd_hms("2018-07-15 02:53:20"), ymd_hms("2018-07-16 18:09:01"),
5L, 1L, ymd_hms("2018-07-15 01:53:20"), ymd_hms("2018-07-19 18:09:01"),
6L, 1L, ymd_hms("2018-07-20 02:53:20"), ymd_hms("2018-07-22 18:09:01"),
7L, 1L, ymd_hms("2018-05-07 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
8L, 1L, ymd_hms("2018-05-10 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
9L, 2L, ymd_hms("2018-02-28 17:43:29"), ymd_hms("2018-04-20 03:48:40"),
10L, 2L, ymd_hms("2018-04-20 01:19:52"), ymd_hms("2018-08-12 12:56:37"),
11L, 2L, ymd_hms("2018-04-18 20:47:22"), ymd_hms("2018-04-19 16:07:29"),
12L, 2L, ymd_hms("2018-10-02 14:08:03"), ymd_hms("2018-11-08 00:01:23"),
13L, 3L, ymd_hms("2018-03-11 22:30:51"), ymd_hms("2018-10-20 21:01:42")
)
So let's look at the indx
column for this dataset. I am adding arrange
by group
column to see all the same grouped rows together; but, as you know because we have group_by(group)
we do not actually need that.
my_time_intervals %>%
group_by(group) %>% arrange(group,start_time) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
cummax(as.numeric(end_time)))[-n()]))
# # A tibble: 13 x 5
# # Groups: group [3]
# id group start_time end_time indx
# <int> <int> <dttm> <dttm> <dbl>
# 1 1 1 2018-04-12 11:15:03 2018-05-14 02:32:10 0
# 2 7 1 2018-05-07 13:02:04 2018-05-23 08:13:06 0
# 3 8 1 2018-05-10 13:02:04 2018-05-23 08:13:06 0
# 4 2 1 2018-07-04 02:53:20 2018-07-14 18:09:01 1
# 5 3 1 2018-07-05 02:53:20 2018-07-14 18:09:01 1
# 6 5 1 2018-07-15 01:53:20 2018-07-19 18:09:01 2
# 7 4 1 2018-07-15 02:53:20 2018-07-16 18:09:01 2
# 8 6 1 2018-07-20 02:53:20 2018-07-22 18:09:01 3
# 9 9 2 2018-02-28 17:43:29 2018-04-20 03:48:40 0
# 10 11 2 2018-04-18 20:47:22 2018-04-19 16:07:29 0
# 11 10 2 2018-04-20 01:19:52 2018-08-12 12:56:37 0
# 12 12 2 2018-10-02 14:08:03 2018-11-08 00:01:23 1
# 13 13 3 2018-03-11 22:30:51 2018-10-20 21:01:42 0
As you can see, in the group one we have 3 distinct period of times with overlapping datapoints and one datapoint which has no overlapped entry within that group. The indx
column divided those data points to 4 groups (i.e. 0, 1, 2, 3
). Later in the solution, when we group_by(indx,group)
we get each of these overlapping ones together and we get the first starting time and last ending time to make the desired output.
Just to make the solution more prone to errors (in case we had a datapoint which was starting sooner but ending later than the whole other ones in one group (group and index) like what we have in the datapooints with the id of 6 and 7) I changed first()
and last()
to min()
and max()
.
So...
my_time_intervals %>%
group_by(group) %>% arrange(group,start_time) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
cummax(as.numeric(end_time)))[-n()])) %>%
group_by(group, indx) %>%
summarise(start_time = min(start_time), end_time = max(end_time))
# # A tibble: 7 x 4
# # Groups: group [?]
# group indx start_time end_time
# <int> <dbl> <dttm> <dttm>
# 1 1 0 2018-04-12 11:15:03 2018-05-23 08:13:06
# 2 1 1 2018-07-04 02:53:20 2018-07-14 18:09:01
# 3 1 2 2018-07-15 01:53:20 2018-07-19 18:09:01
# 4 1 3 2018-07-20 02:53:20 2018-07-22 18:09:01
# 5 2 0 2018-02-28 17:43:29 2018-08-12 12:56:37
# 6 2 1 2018-10-02 14:08:03 2018-11-08 00:01:23
# 7 3 0 2018-03-11 22:30:51 2018-10-20 21:01:42
We used the unique index of each overlapping time and date to get the period (start and end) for each of them.
Beyond this point, you need to read about cumsum
and cummax
and also look at the output of these two functions for this specific problem to understand why the comparison that I made, ended up giving us unique identifiers for each of the overlapping time and dates.
Hope this helps, as it is my best.
union/merge overlapping time-ranges
If you arrange on group and start (in that order) and unselect the indx column, this solution posted by David Arenburg works perfectly: How to flatten/merge overlapping time periods in R
library(dplyr)
df1 %>%
group_by(group) %>%
arrange(group, start) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
cummax(as.numeric(end)))[-n()])) %>%
group_by(group, indx) %>%
summarise(start = first(start), end = last(end)) %>%
select(-indx)
group start end
<chr> <dttm> <dttm>
1 A 2018-01-01 08:00:00 2018-01-01 08:20:00
2 A 2018-01-01 08:30:00 2018-01-01 09:00:00
3 A 2018-01-01 09:15:00 2018-01-01 09:30:00
4 B 2018-01-01 14:00:00 2018-01-01 15:30:00
Find overlapping intervals in groups and retain largest non-overlapping periods
Having searched for related problems on stackoverflow, I found that the following approaches (here: Collapse and merge overlapping time intervals) and (here: How to flatten / merge overlapping time periods) could be adapted to my issue.
# Solution adapted from:
# here https://stackoverflow.com/questions/53213418/collapse-and-merge-overlapping-time-intervals
# and here: https://stackoverflow.com/questions/28938147/how-to-flatten-merge-overlapping-time-periods/28938694#28938694
# Note: df and df1 created in the initial reprex (above)
df2 <- df %>%
group_by(group) %>%
arrange(group, start) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) > # find overlaps
cummax(as.numeric(end)))[-n()])) %>%
ungroup() %>%
group_by(group, indx) %>%
arrange(desc(intval_length)) %>% # retain largest interval
filter(row_number() == 1) %>%
ungroup() %>%
select(-indx) %>%
arrange(group, start)
# Desired output?
identical(df1, df2)
#> [1] TRUE
Merge overlapping time intervals, how?
You may also try this query (once more solutions beside those given by PM 77-1
in the comment above) :
WITH RECURSIVE cte( id, date_start, date_end ) AS
(
SELECT id, date_start, date_end
FROM evento
UNION
SELECT e.id,
least( c.date_start, e.date_start ),
greatest( c.date_end, e.date_end )
FROM cte c
JOIN evento e
ON e.date_start between c.date_start and c.date_end
OR
e.date_end between c.date_start and c.date_end
)
SELECT distinct date_start, date_end
FROM (
SELECT id,
min( date_start) date_start,
max( date_end ) date_end
FROM cte
GROUP BY id
) xx
ORDER BY date_start;
Demo ---> http://www.sqlfiddle.com/#!12/bdf7e/9
however for huge table the performance of this query could be horribly slow, and some procedural approach might perform better.
Related Topics
Pass Arguments to Dplyr Functions
Concatenate Strings by Group With Dplyr
Offline Install of R Package and Dependencies
Add Correct Century to Dates With Year Provided as "Year Without Century", %Y
Ggplot, Drawing Line Between Points Across Facets
How to Send an Email With Attachment from R in Windows
Ggplot2 - Jitter and Position Dodge Together
Remove Extra Legends in Ggplot2
How to Sum a Numeric List Elements
Memory Allocation "Error: Cannot Allocate Vector of Size 75.1 Mb"
How to Replace Na With Mean by Group/Subset
How to Change the Order of Facet Labels in Ggplot (Custom Facet Wrap Labels)
Number of Months Between Two Dates
Increment by 1 For Every Change in Column
How to Calculate Cumulative Sum
Replace X-Axis With Own Values
How to Read a CSV File in R With Different Number of Columns