How to Merge Overlapping Time Intervals

How to flatten / merge overlapping time periods

Here's a possible solution. The basic idea is to compare the lagged start date with the maximum end date "until now" (computed with cummax) and create an index that separates the data into groups:

data %>%
  arrange(ID, start) %>% # as suggested by @Jonno in case the data is unsorted
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                            cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = first(start), end = last(end))

# Source: local data frame [3 x 4]
# Groups: ID
#
#   ID indx      start        end
# 1  A    0 2013-01-01 2013-01-06
# 2  A    1 2013-01-07 2013-01-11
# 3  A    2 2013-01-12 2013-01-15

Merge overlapping time intervals, how?

You may also try this query (one more solution besides those given by PM 77-1 in the comments above):

WITH RECURSIVE cte( id, date_start, date_end ) AS
(
  SELECT id, date_start, date_end
  FROM evento
  UNION
  SELECT e.id,
         least( c.date_start, e.date_start ),
         greatest( c.date_end, e.date_end )
  FROM cte c
  JOIN evento e
    ON e.date_start BETWEEN c.date_start AND c.date_end
    OR e.date_end   BETWEEN c.date_start AND c.date_end
)
SELECT DISTINCT date_start, date_end
FROM (
  SELECT id,
         min( date_start ) date_start,
         max( date_end )   date_end
  FROM cte
  GROUP BY id
) xx
ORDER BY date_start;

Demo ---> http://www.sqlfiddle.com/#!12/bdf7e/9

However, for a huge table the performance of this query could be horribly slow, and a procedural approach might perform better.
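
For illustration, here is a minimal sketch of that kind of procedural approach, written in Python rather than in the database (the date values below are made up, not taken from the question): sort the intervals by start, sweep through them once, and either extend the current merged interval or open a new one.

from datetime import date

def merge_intervals(intervals):
    """Merge overlapping (start, end) pairs; the input need not be sorted."""
    merged = []
    for start, end in sorted(intervals):             # sort by start, then end
        if merged and start <= merged[-1][1]:        # overlaps (or touches) the previous interval
            merged[-1][1] = max(merged[-1][1], end)  # extend its end if needed
        else:
            merged.append([start, end])              # open a new merged interval
    return [tuple(iv) for iv in merged]

# Example with made-up dates: the first two intervals overlap and collapse into one.
print(merge_intervals([
    (date(2013, 1, 1), date(2013, 1, 6)),
    (date(2013, 1, 5), date(2013, 1, 8)),
    (date(2013, 1, 12), date(2013, 1, 15)),
]))

A single sorted pass like this is O(n log n), which is why it tends to scale better than the recursive self-join above.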

Merge Overlapping Intervals in PySpark

You can use a window function to compare previous rows with the current row, building a column that determines whether the current row starts a new interval, and then sum over this column to build an interval id. Finally, you group by this interval id to get your final dataframe.

Assuming your input dataframe is called input_df, the code is as follows:

from pyspark.sql import Window
from pyspark.sql import functions as F

all_previous_rows_window = Window \
    .orderBy('start') \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

result = input_df \
    .withColumn('max_previous_end', F.max('end').over(all_previous_rows_window)) \
    .withColumn('interval_change', F.when(
        F.col('start') > F.lag('max_previous_end').over(Window.orderBy('start')),
        F.lit(1)
    ).otherwise(F.lit(0))) \
    .withColumn('interval_id', F.sum('interval_change').over(all_previous_rows_window)) \
    .drop('interval_change', 'max_previous_end') \
    .groupBy('interval_id') \
    .agg(
        F.collect_list('id').alias('ids'),
        F.min('start').alias('start'),
        F.max('end').alias('end')
    ).drop('interval_id')

So you can merge your intervals without any user-defined function. However, every time we use a window here, the code is executed on only one executor, because our windows have no partitions.
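
If your data also carried a grouping key (say a hypothetical group column, which is not in the question's data), you could partition both windows by it so that each group is processed independently and the work is spread across executors. A sketch under that assumption:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Hypothetical variant: 'group' is an assumed partitioning column.
per_group_running = Window \
    .partitionBy('group') \
    .orderBy('start') \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

per_group_ordered = Window.partitionBy('group').orderBy('start')

result = input_df \
    .withColumn('max_previous_end', F.max('end').over(per_group_running)) \
    .withColumn('interval_change', F.when(
        F.col('start') > F.lag('max_previous_end').over(per_group_ordered),
        F.lit(1)
    ).otherwise(F.lit(0))) \
    .withColumn('interval_id', F.sum('interval_change').over(per_group_running)) \
    .groupBy('group', 'interval_id') \
    .agg(
        F.collect_list('id').alias('ids'),
        F.min('start').alias('start'),
        F.max('end').alias('end')
    ).drop('interval_id')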

Merge overlapping time periods with milliseconds in R

I've found it is quite easy to preserve milliseconds if you use POSIXlt format. Although there are faster ways to calculate the overlap, it's fast enough for most purposes to just loop through the data frame.

Here's a reproducible example.

start <- c("2019-07-15 21:32:43.565",
           "2019-07-15 21:32:43.634",
           "2019-07-15 21:32:54.301",
           "2019-07-15 21:34:08.506",
           "2019-07-15 21:34:09.957")

end <- c("2019-07-15 21:32:48.445",
         "2019-07-15 21:32:49.045",
         "2019-07-15 21:32:54.801",
         "2019-07-15 21:34:10.111",
         "2019-07-15 21:34:10.236")

df <- data.frame(start = as.POSIXlt(start), end = as.POSIXlt(end))

i <- 1

while(i < nrow(df))
{
  overlaps <- which(df$start < df$end[i] & df$end > df$start[i])
  if(length(overlaps) > 1)
  {
    df$end[i] <- max(df$end[overlaps])
    df <- df[-overlaps[-which(overlaps == i)], ]
    i <- i - 1
  }
  i <- i + 1
}

So now our data frame doesn't have overlaps:

df
#>                 start                 end
#> 1 2019-07-15 21:32:43 2019-07-15 21:32:49
#> 3 2019-07-15 21:32:54 2019-07-15 21:32:54
#> 4 2019-07-15 21:34:08 2019-07-15 21:34:10

Although it appears we have lost the milliseconds, this is just a display issue, as we can show by doing this:

df$end - df$start
#> Time differences in secs
#> [1] 5.48 0.50 1.73

as.numeric(df$end - df$start)
#> [1] 5.48 0.50 1.73

Created on 2020-02-20 by the reprex package (v0.3.0)

Collapse and merge overlapping time intervals

my_time_intervals %>% 
  group_by(group) %>% arrange(start_time, .by_group = TRUE) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                            cummax(as.numeric(end_time)))[-n()])) %>%
  group_by(group, indx) %>%
  summarise(start_time = min(start_time),
            end_time = max(end_time)) %>%
  select(-indx)

# # A tibble: 5 x 3
# # Groups:   group [3]
#   group start_time          end_time
#   <int> <dttm>              <dttm>
# 1     1 2018-04-12 11:15:03 2018-05-23 08:13:06
# 2     1 2018-07-04 02:53:20 2018-07-14 18:09:01
# 3     2 2018-02-28 17:43:29 2018-08-12 12:56:37
# 4     2 2018-10-02 14:08:03 2018-11-08 00:01:23
# 5     3 2018-03-11 22:30:51 2018-10-20 21:01:42

Explanation per OP's request:

I am making another dataset with more overlapping times within each group, so the solution gets more exposure and is hopefully easier to grasp:

library(dplyr)
library(tibble)    # tribble()
library(lubridate) # ymd_hms()

my_time_intervals <- tribble(
  ~id, ~group, ~start_time,                    ~end_time,
   1L,     1L, ymd_hms("2018-04-12 11:15:03"), ymd_hms("2018-05-14 02:32:10"),
   2L,     1L, ymd_hms("2018-07-04 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
   3L,     1L, ymd_hms("2018-07-05 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
   4L,     1L, ymd_hms("2018-07-15 02:53:20"), ymd_hms("2018-07-16 18:09:01"),
   5L,     1L, ymd_hms("2018-07-15 01:53:20"), ymd_hms("2018-07-19 18:09:01"),
   6L,     1L, ymd_hms("2018-07-20 02:53:20"), ymd_hms("2018-07-22 18:09:01"),
   7L,     1L, ymd_hms("2018-05-07 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
   8L,     1L, ymd_hms("2018-05-10 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
   9L,     2L, ymd_hms("2018-02-28 17:43:29"), ymd_hms("2018-04-20 03:48:40"),
  10L,     2L, ymd_hms("2018-04-20 01:19:52"), ymd_hms("2018-08-12 12:56:37"),
  11L,     2L, ymd_hms("2018-04-18 20:47:22"), ymd_hms("2018-04-19 16:07:29"),
  12L,     2L, ymd_hms("2018-10-02 14:08:03"), ymd_hms("2018-11-08 00:01:23"),
  13L,     3L, ymd_hms("2018-03-11 22:30:51"), ymd_hms("2018-10-20 21:01:42")
)

So let's look at the indx column for this dataset. I am adding group to arrange() so that all rows of the same group appear together; but, as you know, because we have group_by(group) we do not actually need that.

my_time_intervals %>% 
  group_by(group) %>% arrange(group, start_time) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                            cummax(as.numeric(end_time)))[-n()]))

# # A tibble: 13 x 5
# # Groups:   group [3]
#       id group start_time          end_time             indx
#    <int> <int> <dttm>              <dttm>              <dbl>
#  1     1     1 2018-04-12 11:15:03 2018-05-14 02:32:10     0
#  2     7     1 2018-05-07 13:02:04 2018-05-23 08:13:06     0
#  3     8     1 2018-05-10 13:02:04 2018-05-23 08:13:06     0
#  4     2     1 2018-07-04 02:53:20 2018-07-14 18:09:01     1
#  5     3     1 2018-07-05 02:53:20 2018-07-14 18:09:01     1
#  6     5     1 2018-07-15 01:53:20 2018-07-19 18:09:01     2
#  7     4     1 2018-07-15 02:53:20 2018-07-16 18:09:01     2
#  8     6     1 2018-07-20 02:53:20 2018-07-22 18:09:01     3
#  9     9     2 2018-02-28 17:43:29 2018-04-20 03:48:40     0
# 10    11     2 2018-04-18 20:47:22 2018-04-19 16:07:29     0
# 11    10     2 2018-04-20 01:19:52 2018-08-12 12:56:37     0
# 12    12     2 2018-10-02 14:08:03 2018-11-08 00:01:23     1
# 13    13     3 2018-03-11 22:30:51 2018-10-20 21:01:42     0

As you can see, in group one we have three distinct periods of time with overlapping data points, plus one data point with no overlapping entry within that group. The indx column divides those data points into four groups (i.e. 0, 1, 2, 3). Later in the solution, when we group_by(group, indx), we collect each set of overlapping rows together and take the earliest start time and latest end time to make the desired output.

Just to make the solution more robust (in case one data point starts sooner but ends later than all the other ones in a group of overlaps (group and indx), like the data points with ids 6 and 7), I changed first() and last() to min() and max().

So...

my_time_intervals %>% 
  group_by(group) %>% arrange(group, start_time) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                            cummax(as.numeric(end_time)))[-n()])) %>%
  group_by(group, indx) %>%
  summarise(start_time = min(start_time), end_time = max(end_time))

# # A tibble: 7 x 4
# # Groups:   group [?]
#   group  indx start_time          end_time
#   <int> <dbl> <dttm>              <dttm>
# 1     1     0 2018-04-12 11:15:03 2018-05-23 08:13:06
# 2     1     1 2018-07-04 02:53:20 2018-07-14 18:09:01
# 3     1     2 2018-07-15 01:53:20 2018-07-19 18:09:01
# 4     1     3 2018-07-20 02:53:20 2018-07-22 18:09:01
# 5     2     0 2018-02-28 17:43:29 2018-08-12 12:56:37
# 6     2     1 2018-10-02 14:08:03 2018-11-08 00:01:23
# 7     3     0 2018-03-11 22:30:51 2018-10-20 21:01:42

We used the unique index of each set of overlapping times and dates to get the period (start and end) for each of them.

Beyond this point, you need to read about cumsum and cummax, and look at the output of these two functions for this specific problem, to understand why the comparison I made ends up giving us unique identifiers for each set of overlapping times and dates.
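
If it helps, the same comparison can be spelled out with plain numbers (a small illustration in Python rather than R, with made-up interval bounds): the running maximum of the end times plays the role of cummax, and the cumulative sum of the "does the next interval start after everything seen so far has ended?" flags is the indx column.

from itertools import accumulate

# Made-up interval bounds: (1,5) and (3,8) overlap, (10,12) and (11,13) overlap, (20,25) stands alone.
starts = [1, 3, 10, 11, 20]
ends   = [5, 8, 12, 13, 25]

# Running maximum of end times "until now" -- the cummax part.
running_max_end = list(accumulate(ends, max))           # [5, 8, 12, 13, 25]

# Does the next interval start after every interval seen so far has ended?
new_group = [int(starts[i + 1] > running_max_end[i]) for i in range(len(starts) - 1)]

# Cumulative sum of those flags, with a leading 0 -- the cumsum part -- gives the group index.
indx = [0] + list(accumulate(new_group))
print(indx)                                             # [0, 0, 1, 1, 2]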

Hope this helps; it is my best attempt.

Merge Overlapping Time Intervals based on Hierarchy in SQL

Below is for BigQuery Standard SQL

#standardSQL
WITH check_times AS (
  SELECT id, start_time AS time FROM `project.dataset.table` UNION DISTINCT
  SELECT id, stop_time AS time FROM `project.dataset.table`
), distinct_intervals AS (
  SELECT id, time AS start_time, LEAD(time) OVER(PARTITION BY id ORDER BY time) stop_time
  FROM check_times
), deduped_intervals AS (
  SELECT a.id, a.start_time, a.stop_time, MIN(priority) priority
  FROM distinct_intervals a
  JOIN `project.dataset.table` b
  ON a.id = b.id
    AND a.start_time BETWEEN b.start_time AND b.stop_time
    AND a.stop_time BETWEEN b.start_time AND b.stop_time
  GROUP BY a.id, a.start_time, a.stop_time
), combined_intervals AS (
  SELECT id, MIN(start_time) start_time, MAX(stop_time) stop_time, ANY_VALUE(priority) priority
  FROM (
    SELECT id, start_time, stop_time, priority, COUNTIF(flag) OVER(PARTITION BY id ORDER BY start_time) grp
    FROM (
      SELECT id, start_time, stop_time, priority,
        start_time != IFNULL(LAG(stop_time) OVER(PARTITION BY id ORDER BY start_time), start_time) OR
        priority != IFNULL(LAG(priority) OVER(PARTITION BY id ORDER BY start_time), -1) flag
      FROM deduped_intervals
    )
  )
  GROUP BY id, grp
)
SELECT *
FROM combined_intervals
-- ORDER BY id, start_time

Applied to the sample data from your question, the result is:

(screenshot of the result table in the original answer)

Can you also share a solution where we merge intervals based on just id and no priority column?

I simply adjusted the above query slightly to ignore priority:

#standardSQL
WITH check_times AS (
  SELECT id, start_time AS time FROM `project.dataset.table` UNION DISTINCT
  SELECT id, stop_time AS time FROM `project.dataset.table`
), distinct_intervals AS (
  SELECT id, time AS start_time, LEAD(time) OVER(PARTITION BY id ORDER BY time) stop_time
  FROM check_times
), deduped_intervals AS (
  SELECT a.id, a.start_time, a.stop_time
  FROM distinct_intervals a
  JOIN `project.dataset.table` b
  ON a.id = b.id
    AND a.start_time BETWEEN b.start_time AND b.stop_time
    AND a.stop_time BETWEEN b.start_time AND b.stop_time
  GROUP BY a.id, a.start_time, a.stop_time
), combined_intervals AS (
  SELECT id, MIN(start_time) start_time, MAX(stop_time) stop_time
  FROM (
    SELECT id, start_time, stop_time, COUNTIF(flag) OVER(PARTITION BY id ORDER BY start_time) grp
    FROM (
      SELECT id, start_time, stop_time,
        start_time != IFNULL(LAG(stop_time) OVER(PARTITION BY id ORDER BY start_time), start_time) flag
      FROM deduped_intervals
    )
  )
  GROUP BY id, grp
)
SELECT *
FROM combined_intervals
-- ORDER BY id, start_time

with the result:

Row  id  start_time  stop_time
1    1   0           36
2    1   41          47

Java merge overlapping date intervals

private Set<AppointmentAvailabilityBlock> mergeUnavailabilitiesBlocks(
        Set<AppointmentAvailabilityBlock> appointmentComponentUnavailabilitiesBlock) {

    // Transform Set to List
    List<AppointmentAvailabilityBlock> intervals = new LinkedList<AppointmentAvailabilityBlock>();
    intervals.addAll(appointmentComponentUnavailabilitiesBlock);

    // Sort by blockStartTime
    intervals.sort(Comparator.comparing(AppointmentAvailabilityBlock::getBlockStartTime));

    // Merge
    LinkedList<AppointmentAvailabilityBlock> merged = new LinkedList<>();
    for (AppointmentAvailabilityBlock interval : intervals) {
        // No overlap with the previous interval, append it.
        if (merged.isEmpty() || merged.getLast().getBlockEndTime().isBefore(interval.getBlockStartTime())) {
            merged.add(interval);
        } else { // There is overlap: keep the later of the two end times.
            OffsetDateTime maxOffsetDateTime = merged.getLast().getBlockEndTime().isAfter(
                    interval.getBlockEndTime()) ? merged.getLast().getBlockEndTime() : interval.getBlockEndTime();

            merged.getLast().setBlockEndTime(maxOffsetDateTime);
        }
    }
    return new HashSet<AppointmentAvailabilityBlock>(merged);
}

