How to Flatten/Merge Overlapping Time Periods

How to flatten / merge overlapping time periods

Here's a possible solution. The basic idea is to compare each row's next start date (via lead) with the maximum end date seen so far (via cummax), and to use the result to build an index that separates the data into groups of overlapping periods.

library(dplyr)

data %>%
  arrange(ID, start) %>% # as suggested by @Jonno in case the data is unsorted
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                     cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = first(start), end = max(end)) # max(), not last(): an earlier row may end latest

# Source: local data frame [3 x 4]
# Groups: ID
# 
#   ID indx      start        end
# 1  A    0 2013-01-01 2013-01-06
# 2  A    1 2013-01-07 2013-01-11
# 3  A    2 2013-01-12 2013-01-15
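
For readers who want to try the index trick without dplyr, here is a minimal base R sketch of the same idea. The toy data and the helper name merge_intervals are made up for illustration and are not part of the original answer; the sketch assumes columns named ID, start and end.

```r
# Made-up toy data: ID "B" has an interval fully contained in an earlier one
toy <- data.frame(
  ID    = c("A", "A", "A", "A", "A", "A", "B", "B"),
  start = as.Date(c("2013-01-01", "2013-01-02", "2013-01-04", "2013-01-07",
                    "2013-01-08", "2013-01-12", "2013-02-01", "2013-02-03")),
  end   = as.Date(c("2013-01-05", "2013-01-03", "2013-01-06", "2013-01-09",
                    "2013-01-11", "2013-01-15", "2013-02-10", "2013-02-04"))
)

# Hypothetical helper: merge overlapping [start, end] intervals within each ID
merge_intervals <- function(df) {
  # Sort first so that "next start" and "running max end" are meaningful
  df <- df[order(df$ID, df$start), ]
  pieces <- lapply(split(df, df$ID), function(g) {
    n <- nrow(g)
    s <- as.numeric(g$start)
    e <- as.numeric(g$end)
    # A new group begins whenever a start lies beyond the running max end
    indx <- c(0, cumsum(s[-1] > cummax(e)[-n]))
    # First row of each group holds the earliest start (data is sorted)
    first <- which(!duplicated(indx))
    # Row holding the latest end within each group
    latest <- as.vector(tapply(seq_len(n), indx,
                               function(i) i[which.max(e[i])]))
    data.frame(ID = g$ID[first], start = g$start[first], end = g$end[latest])
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

merged <- merge_intervals(toy)
merged
```

On this toy data the result has four rows: A from 2013-01-01 to 2013-01-06, A from 2013-01-07 to 2013-01-11, A from 2013-01-12 to 2013-01-15, and B from 2013-02-01 to 2013-02-10. The B row is the case where taking the end of the last row in the group would give the wrong answer.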

Merge overlapping time periods with milliseconds in R

I've found it is quite easy to preserve milliseconds if you use POSIXlt format. Although there are faster ways to calculate the overlap, it's fast enough for most purposes to just loop through the data frame.

Here's a reproducible example.

start <- c("2019-07-15 21:32:43.565",
           "2019-07-15 21:32:43.634",
           "2019-07-15 21:32:54.301",
           "2019-07-15 21:34:08.506",
           "2019-07-15 21:34:09.957")

end <- c("2019-07-15 21:32:48.445",
         "2019-07-15 21:32:49.045",
         "2019-07-15 21:32:54.801",
         "2019-07-15 21:34:10.111",
         "2019-07-15 21:34:10.236")

df    <- data.frame(start = as.POSIXlt(start), end = as.POSIXlt(end))

i <- 1

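# Assumes df is sorted by start, as in this example
while(i < nrow(df))
{
  # rows that overlap row i (this always includes row i itself)
  overlaps <- which(df$start < df$end[i] & df$end > df$start[i])
  if(length(overlaps) > 1)
  {
    df$end[i] <- max(df$end[overlaps])           # extend row i to cover all of them
    df <- df[-overlaps[-which(overlaps == i)], ] # drop the other overlapping rows
    i <- i - 1                                   # re-check row i against the remaining rows
  }
  i <- i + 1
}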

So now our data frame doesn't have overlaps:

df
#>                 start                 end
#> 1 2019-07-15 21:32:43 2019-07-15 21:32:49
#> 3 2019-07-15 21:32:54 2019-07-15 21:32:54
#> 4 2019-07-15 21:34:08 2019-07-15 21:34:10

Although it appears we have lost the milliseconds, this is just a display issue, as we can show by doing this:

df$end - df$start
#> Time differences in secs
#> [1] 5.48 0.50 1.73

as.numeric(df$end - df$start)
#> [1] 5.48 0.50 1.73

Created on 2020-02-20 by the reprex package (v0.3.0)
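
If you want the milliseconds to be visible when printing, base R offers two options, sketched below on a single made-up timestamp: the %OS3 format specifier, and the digits.secs option. Note that R truncates rather than rounds fractional seconds, so the last printed digit can be one lower than the value you typed because the time is stored as a floating-point number.

```r
# One timestamp with a millisecond component (UTC chosen only for the example)
x <- as.POSIXct("2019-07-15 21:32:43.565", tz = "UTC")

# Default formatting hides the fractional seconds
plain <- format(x)

# Ask for three fractional digits in this one call ...
with_ms <- format(x, "%Y-%m-%d %H:%M:%OS3")

# ... or for everything printed afterwards in the session
old <- options(digits.secs = 3)
print(x)
options(old)  # restore the previous setting
```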

How to separate overlapping time periods into overlapping and non-overlapping periods in R

My first answer assumes that at most two periods overlap. This means it can use a single join for each comparison. Attempting to repeat this process for more than two time periods results in an increasing number of joins, leading to an inefficient mess.

To handle an arbitrary (or unknown) number of overlaps we need a very different method. Hence I am providing this as a separate answer.

Step 1: Obtain a list of all possible start and end dates

all_start = df %>%
  select(id, start)
all_end = df %>%
  select(id, start = end)
all_start_and_end = rbind(all_start, all_end) %>%
  distinct()

Step 2: Create a list of all possible periods

all_periods = all_start_and_end  %>%
  group_by(id) %>%
  mutate(end = lead(start, 1, order_by = start))

Step 3: Overlap original data with all periods and summarise

# "part" exists only in df, so the join does not add a suffix to it
overlapped = all_periods %>%
  left_join(df, by = "id", suffix = c("_1","_2")) %>%
  filter(start_1 <= end_2,
         start_2 <= end_1) %>%
  select(id, part, start = start_1, end = end_1) %>%
  group_by(id, start, end) %>%
  summarise(part = toString(part))

Depending on your exact data and situation:

You may want to change "<=" to "<" or subtract 1 day from end dates to ensure periods do not overlap. This depends on how you are handling the boundary conditions of your time periods.
You may want to remove the distinct in step 1 to allow for periods that are only a single day long.
In step 1 you can add a very early date (e.g. 0000-01-01) and a very late date (e.g. 9999-12-31) if you want the output to include all the time periods with part = NA.
Once step three completes you may want to filter out any periods with part = NA.
Depending on your input data you may observe adjacent output periods with the same part. E.g. in row 1: part A has end date 2020-01-01 and in row 2: part A has start date 2020-01-02. Take a look at the gaps-and-islands tag for ways to solve this problem.
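
To make the three steps concrete, here is a small base R sketch on made-up data for a single id. It is not the dplyr code above, only an illustration of the same idea: collect every boundary date, pair each one with the next, and label each resulting period with the parts that cover it. It uses strict inequalities, which is one of the boundary choices mentioned in the notes above.

```r
# Made-up input: two overlapping periods for one id
df <- data.frame(
  id    = c(1, 1),
  part  = c("A", "B"),
  start = as.Date(c("2020-01-01", "2020-01-10")),
  end   = as.Date(c("2020-01-20", "2020-01-31"))
)

# Steps 1 and 2: every distinct boundary date, paired with the next one
bounds  <- sort(unique(c(df$start, df$end)))
periods <- data.frame(id    = 1,
                      start = head(bounds, -1),
                      end   = tail(bounds, -1))

# Step 3: label each elementary period with the parts that cover it.
# Strict comparisons mean a part that only touches a period at its
# boundary is not counted.
periods$part <- vapply(seq_len(nrow(periods)), function(i) {
  covers <- df$start < periods$end[i] & df$end > periods$start[i]
  toString(df$part[covers])
}, character(1))

periods
```

Here the three elementary periods are labelled "A", "A, B" and "B" respectively.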

Collapse and merge overlapping time intervals

my_time_intervals %>% 
  group_by(group) %>% arrange(start_time, .by_group = TRUE) %>% 
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()])) %>%
  group_by(group, indx) %>%
  summarise(start_time = min(start_time), 
            end_time = max(end_time)) %>%
  select(-indx)


# # A tibble: 5 x 3
# # Groups:   group [3]
# group start_time          end_time           
# <int> <dttm>              <dttm>             
# 1     1 2018-04-12 11:15:03 2018-05-23 08:13:06
# 2     1 2018-07-04 02:53:20 2018-07-14 18:09:01
# 3     2 2018-02-28 17:43:29 2018-08-12 12:56:37
# 4     2 2018-10-02 14:08:03 2018-11-08 00:01:23
# 5     3 2018-03-11 22:30:51 2018-10-20 21:01:42

Explanation per OP's request:

I am creating another dataset with more overlapping times within each group, so that the solution is exercised more thoroughly and is easier to follow:

library(tibble)    # tribble()
library(lubridate) # ymd_hms()

my_time_intervals <- tribble(
  ~id, ~group, ~start_time, ~end_time,
  1L, 1L, ymd_hms("2018-04-12 11:15:03"), ymd_hms("2018-05-14 02:32:10"),
  2L, 1L, ymd_hms("2018-07-04 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
  3L, 1L, ymd_hms("2018-07-05 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
  4L, 1L, ymd_hms("2018-07-15 02:53:20"), ymd_hms("2018-07-16 18:09:01"),
  5L, 1L, ymd_hms("2018-07-15 01:53:20"), ymd_hms("2018-07-19 18:09:01"),
  6L, 1L, ymd_hms("2018-07-20 02:53:20"), ymd_hms("2018-07-22 18:09:01"),
  7L, 1L, ymd_hms("2018-05-07 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
  8L, 1L, ymd_hms("2018-05-10 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
  9L, 2L, ymd_hms("2018-02-28 17:43:29"), ymd_hms("2018-04-20 03:48:40"),
  10L, 2L, ymd_hms("2018-04-20 01:19:52"), ymd_hms("2018-08-12 12:56:37"),
  11L, 2L, ymd_hms("2018-04-18 20:47:22"), ymd_hms("2018-04-19 16:07:29"),
  12L, 2L, ymd_hms("2018-10-02 14:08:03"), ymd_hms("2018-11-08 00:01:23"),
  13L, 3L, ymd_hms("2018-03-11 22:30:51"), ymd_hms("2018-10-20 21:01:42")
)

So let's look at the indx column for this dataset. I am also arranging by the group column so that rows of the same group are displayed together; because of group_by(group), the computation itself does not need it.

my_time_intervals %>% 
  group_by(group) %>% arrange(group,start_time) %>% 
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()]))


  # # A tibble: 13 x 5
  # # Groups:   group [3]
  # id group start_time          end_time             indx
  # <int> <int> <dttm>              <dttm>              <dbl>
  # 1     1      1 2018-04-12 11:15:03 2018-05-14 02:32:10     0
  # 2     7      1 2018-05-07 13:02:04 2018-05-23 08:13:06     0
  # 3     8      1 2018-05-10 13:02:04 2018-05-23 08:13:06     0
  # 4     2      1 2018-07-04 02:53:20 2018-07-14 18:09:01     1
  # 5     3      1 2018-07-05 02:53:20 2018-07-14 18:09:01     1
  # 6     5      1 2018-07-15 01:53:20 2018-07-19 18:09:01     2
  # 7     4      1 2018-07-15 02:53:20 2018-07-16 18:09:01     2
  # 8     6      1 2018-07-20 02:53:20 2018-07-22 18:09:01     3
  # 9     9      2 2018-02-28 17:43:29 2018-04-20 03:48:40     0
  # 10    11     2 2018-04-18 20:47:22 2018-04-19 16:07:29     0
  # 11    10     2 2018-04-20 01:19:52 2018-08-12 12:56:37     0
  # 12    12     2 2018-10-02 14:08:03 2018-11-08 00:01:23     1
  # 13    13     3 2018-03-11 22:30:51 2018-10-20 21:01:42     0

As you can see, group 1 contains three distinct periods made up of overlapping data points, plus one data point that overlaps nothing else in the group. The indx column divides these data points into 4 groups (0, 1, 2, 3). Later in the solution, group_by(group, indx) brings each set of overlapping rows together, and we take the earliest start time and the latest end time to produce the desired output.

To make the solution more robust (in case an interval starts earlier but ends later than the other intervals in the same group and indx, as in rows 6 and 7 above, which have ids 5 and 4), I changed first() and last() to min() and max().

So...

my_time_intervals %>% 
  group_by(group) %>% arrange(group,start_time) %>% 
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()])) %>%
  group_by(group, indx) %>%
  summarise(start_time = min(start_time), end_time = max(end_time)) 


# # A tibble: 7 x 4
# # Groups:   group [?]
# group  indx start_time          end_time           
# <int> <dbl> <dttm>              <dttm>             
# 1     1     0 2018-04-12 11:15:03 2018-05-23 08:13:06
# 2     1     1 2018-07-04 02:53:20 2018-07-14 18:09:01
# 3     1     2 2018-07-15 01:53:20 2018-07-19 18:09:01
# 4     1     3 2018-07-20 02:53:20 2018-07-22 18:09:01
# 5     2     0 2018-02-28 17:43:29 2018-08-12 12:56:37
# 6     2     1 2018-10-02 14:08:03 2018-11-08 00:01:23
# 7     3     0 2018-03-11 22:30:51 2018-10-20 21:01:42

We used the unique index of each set of overlapping intervals to get the period (start and end) for each of them.

Beyond this point, read up on cumsum and cummax and look at their output for this specific problem to see why the comparison I made gives a unique identifier for each set of overlapping intervals.
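
As a small illustration of those intermediate values, here is a base R sketch on made-up numbers that stand in for date-times (already sorted by start). The variable names are chosen for the example only.

```r
# Toy intervals, sorted by start; plain numbers stand in for date-times
start <- c(1, 2, 6, 10, 11, 20)
end   <- c(8, 4, 9, 12, 15, 25)
n     <- length(start)

# Latest end seen so far: 8 8 9 12 15 25
# (row 2 ends at 4, but row 1 still covers everything up to 8)
running_end <- cummax(end)

# What lead(start) returns: 2 6 10 11 20 NA
next_start <- c(start[-1], NA)

# TRUE where the next interval begins after everything seen so far has ended
gap_before_next <- next_start > running_end

# cumsum counts the gaps passed so far; dropping the last element (the NA)
# and prepending 0 shifts the counter onto the row that follows each gap
indx <- c(0, cumsum(gap_before_next)[-n])
indx
#> [1] 0 0 0 1 1 2
```

Comparing against end instead of cummax(end) would wrongly report a gap before the third interval, because 6 is greater than the previous row's end of 4 even though the first interval runs until 8.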

Hope this helps, as it is my best.

union/merge overlapping time-ranges

If you arrange on group and start (in that order) and unselect the indx column, this solution posted by David Arenburg works perfectly: How to flatten/merge overlapping time periods in R

library(dplyr)

df1 %>% 
  group_by(group) %>%
  arrange(group, start) %>% 
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                              cummax(as.numeric(end)))[-n()])) %>%
  group_by(group, indx) %>%
  summarise(start = first(start), end = last(end)) %>% 
  select(-indx)

 group start               end                
  <chr> <dttm>              <dttm>             
1 A     2018-01-01 08:00:00 2018-01-01 08:20:00
2 A     2018-01-01 08:30:00 2018-01-01 09:00:00
3 A     2018-01-01 09:15:00 2018-01-01 09:30:00
4 B     2018-01-01 14:00:00 2018-01-01 15:30:00

Find overlapping intervals in groups and retain largest non-overlapping periods

Having searched for related problems on Stack Overflow, I found that the approaches in "Collapse and merge overlapping time intervals" and "How to flatten / merge overlapping time periods" could be adapted to my issue.

# Solution adapted from:
# here https://stackoverflow.com/questions/53213418/collapse-and-merge-overlapping-time-intervals
# and here: https://stackoverflow.com/questions/28938147/how-to-flatten-merge-overlapping-time-periods/28938694#28938694 

# Note: df and df1 created in the initial reprex (above)

df2 <- df %>%
  group_by(group) %>%
  arrange(group, start) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start))  >            # find overlaps
                              cummax(as.numeric(end)))[-n()])) %>%
  ungroup() %>%
  group_by(group, indx) %>%
  arrange(desc(intval_length)) %>%                                # retain largest interval
  filter(row_number() == 1) %>%
  ungroup() %>%
  select(-indx) %>%
  arrange(group, start)

# Desired output?
identical(df1, df2)
#> [1] TRUE

Merge overlapping time intervals, how?

You may also try this query (one more solution besides those given by PM 77-1 in the comment above):

WITH RECURSIVE cte( id, date_start, date_end ) AS
(
  SELECT id, date_start, date_end
  FROM evento
  UNION 
  SELECT e.id,
         least( c.date_start, e.date_start ),
         greatest( c.date_end, e.date_end )
  FROM cte c
  JOIN evento e
  ON e.date_start between c.date_start and c.date_end
     OR 
     e.date_end between c.date_start and c.date_end
)
SELECT distinct date_start, date_end
FROM (
  SELECT id, 
         min( date_start) date_start, 
         max( date_end ) date_end
  FROM cte
  GROUP BY id
) xx
ORDER BY date_start;

Demo ---> http://www.sqlfiddle.com/#!12/bdf7e/9

However, for a huge table this query could be horribly slow, and a procedural approach might perform better.
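
The answer does not spell out what a procedural approach would look like. One common form is a sort-then-sweep pass, sketched here in R (the language of the other answers on this page) rather than in a database procedural language. The function name sweep_merge and the sample dates are made up, and the sketch assumes at least one interval. Like the BETWEEN conditions in the query, it treats intervals that merely touch as overlapping.

```r
# Hypothetical helper: merge intervals in a single pass after sorting by start
sweep_merge <- function(start, end) {
  ord   <- order(start)
  start <- start[ord]
  end   <- end[ord]

  out_start <- start[1]
  out_end   <- end[1]
  k <- 1  # index of the merged interval currently being built

  for (i in seq_along(start)[-1]) {
    if (start[i] <= out_end[k]) {
      # Overlaps (or touches) the current merged interval: extend it if needed
      if (end[i] > out_end[k]) out_end[k] <- end[i]
    } else {
      # Gap found: open a new merged interval
      k <- k + 1
      out_start[k] <- start[i]
      out_end[k]   <- end[i]
    }
  }
  data.frame(start = out_start, end = out_end)
}

# Made-up example: the first two intervals overlap, the third stands alone
swept <- sweep_merge(
  as.Date(c("2020-01-05", "2020-01-01", "2020-02-01")),
  as.Date(c("2020-01-10", "2020-01-06", "2020-02-03"))
)
swept
```

After the sort, each interval is visited once, which is the property that makes this style attractive for large tables.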
