How to Merge Overlapping Time Intervals

How to flatten / merge overlapping time periods

Here's a possible solution. The basic idea is to compare the lagged start date with the maximum end date "until now" (computed with cummax) and create an index that separates the data into groups:

data %>%
  arrange(ID, start) %>% # as suggested by @Jonno in case the data is unsorted
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                            cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = first(start), end = last(end))

# Source: local data frame [3 x 4]
# Groups: ID
#
#   ID indx      start        end
# 1  A    0 2013-01-01 2013-01-06
# 2  A    1 2013-01-07 2013-01-11
# 3  A    2 2013-01-12 2013-01-15

Merge overlapping time intervals, how?

You may also try this query (one more solution besides those given by PM 77-1 in the comments above):

WITH RECURSIVE cte( id, date_start, date_end ) AS
(
  SELECT id, date_start, date_end
  FROM evento
  UNION
  SELECT e.id,
         least( c.date_start, e.date_start ),
         greatest( c.date_end, e.date_end )
  FROM cte c
  JOIN evento e
    ON e.date_start BETWEEN c.date_start AND c.date_end
    OR e.date_end   BETWEEN c.date_start AND c.date_end
)
SELECT DISTINCT date_start, date_end
FROM (
  SELECT id,
         min( date_start ) date_start,
         max( date_end )   date_end
  FROM cte
  GROUP BY id
) xx
ORDER BY date_start;

Demo ---> http://www.sqlfiddle.com/#!12/bdf7e/9

However, for a huge table the performance of this query could be horribly slow, and a procedural approach might perform better.
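
For illustration, here is a minimal sketch of that kind of procedural approach, written in Python rather than in the database (the date values below are made up, not taken from the question): sort the intervals by start, sweep through them once, and either extend the current merged interval or open a new one.

from datetime import date

def merge_intervals(intervals):
    """Merge overlapping (start, end) pairs; the input need not be sorted."""
    merged = []
    for start, end in sorted(intervals):             # sort by start, then end
        if merged and start <= merged[-1][1]:        # overlaps (or touches) the previous interval
            merged[-1][1] = max(merged[-1][1], end)  # extend its end if needed
        else:
            merged.append([start, end])              # open a new merged interval
    return [tuple(iv) for iv in merged]

# Example with made-up dates: the first two intervals overlap and collapse into one.
print(merge_intervals([
    (date(2013, 1, 1), date(2013, 1, 6)),
    (date(2013, 1, 5), date(2013, 1, 8)),
    (date(2013, 1, 12), date(2013, 1, 15)),
]))

A single sorted pass like this is O(n log n), which is why it tends to scale better than the recursive self-join above.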

Merge Overlapping Intervals in PySpark

You can use a window function to compare previous rows with the current row, building a column that determines whether the current row starts a new interval, and then sum over this column to build an interval id. Finally, you group by this interval id to get your final dataframe.

Assuming your input dataframe is called input_df, the code is as follows:

from pyspark.sql import Window
from pyspark.sql import functions as F

all_previous_rows_window = Window \
    .orderBy('start') \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

result = input_df \
    .withColumn('max_previous_end', F.max('end').over(all_previous_rows_window)) \
    .withColumn('interval_change', F.when(
        F.col('start') > F.lag('max_previous_end').over(Window.orderBy('start')),
        F.lit(1)
    ).otherwise(F.lit(0))) \
    .withColumn('interval_id', F.sum('interval_change').over(all_previous_rows_window)) \
    .drop('interval_change', 'max_previous_end') \
    .groupBy('interval_id') \
    .agg(
        F.collect_list('id').alias('ids'),
        F.min('start').alias('start'),
        F.max('end').alias('end')
    ).drop('interval_id')

So you can merge your intervals without any user-defined function. However, every time we use a window here, the code is executed on only one executor, because our windows have no partitions.
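
If your data also carried a grouping key (say a hypothetical group column, which is not in the question's data), you could partition both windows by it so that each group is processed independently and the work is spread across executors. A sketch under that assumption:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Hypothetical variant: 'group' is an assumed partitioning column.
per_group_running = Window \
    .partitionBy('group') \
    .orderBy('start') \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

per_group_ordered = Window.partitionBy('group').orderBy('start')

result = input_df \
    .withColumn('max_previous_end', F.max('end').over(per_group_running)) \
    .withColumn('interval_change', F.when(
        F.col('start') > F.lag('max_previous_end').over(per_group_ordered),
        F.lit(1)
    ).otherwise(F.lit(0))) \
    .withColumn('interval_id', F.sum('interval_change').over(per_group_running)) \
    .groupBy('group', 'interval_id') \
    .agg(
        F.collect_list('id').alias('ids'),
        F.min('start').alias('start'),
        F.max('end').alias('end')
    ).drop('interval_id')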

Merge overlapping time periods with milliseconds in R

I've found it is quite easy to preserve milliseconds if you use POSIXlt format. Although there are faster ways to calculate the overlap, it's fast enough for most purposes to just loop through the data frame.

Here's a reproducible example.

start <- c("2019-07-15 21:32:43.565",
           "2019-07-15 21:32:43.634",
           "2019-07-15 21:32:54.301",
           "2019-07-15 21:34:08.506",
           "2019-07-15 21:34:09.957")

end <- c("2019-07-15 21:32:48.445",
         "2019-07-15 21:32:49.045",
         "2019-07-15 21:32:54.801",
         "2019-07-15 21:34:10.111",
         "2019-07-15 21:34:10.236")

df <- data.frame(start = as.POSIXlt(start), end = as.POSIXlt(end))

i <- 1

while(i < nrow(df))
{
  overlaps <- which(df$start < df$end[i] & df$end > df$start[i])
  if(length(overlaps) > 1)
  {
    df$end[i] <- max(df$end[overlaps])
    df <- df[-overlaps[-which(overlaps == i)], ]
    i <- i - 1
  }
  i <- i + 1
}

So now our data frame doesn't have overlaps:

df
#>                 start                 end
#> 1 2019-07-15 21:32:43 2019-07-15 21:32:49
#> 3 2019-07-15 21:32:54 2019-07-15 21:32:54
#> 4 2019-07-15 21:34:08 2019-07-15 21:34:10

Although it appears we have lost the milliseconds, this is just a display issue, as we can show by doing this:

df$end - df$start
#> Time differences in secs
#> [1] 5.48 0.50 1.73

as.numeric(df$end - df$start)
#> [1] 5.48 0.50 1.73

Created on 2020-02-20 by the reprex package (v0.3.0)

Collapse and merge overlapping time intervals

my_time_intervals %>% 
  group_by(group) %>% arrange(start_time, .by_group = TRUE) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                            cummax(as.numeric(end_time)))[-n()])) %>%
  group_by(group, indx) %>%
  summarise(start_time = min(start_time),
            end_time = max(end_time)) %>%
  select(-indx)

# # A tibble: 5 x 3
# # Groups:   group [3]
#   group start_time          end_time
#   <int> <dttm>              <dttm>
# 1     1 2018-04-12 11:15:03 2018-05-23 08:13:06
# 2     1 2018-07-04 02:53:20 2018-07-14 18:09:01
# 3     2 2018-02-28 17:43:29 2018-08-12 12:56:37
# 4     2 2018-10-02 14:08:03 2018-11-08 00:01:23
# 5     3 2018-03-11 22:30:51 2018-10-20 21:01:42

Explanation per OP's request:

I am making another dataset with more overlapping times within each group, so the solution gets more exposure and is hopefully easier to grasp:

library(dplyr)
library(tibble)    # tribble()
library(lubridate) # ymd_hms()

my_time_intervals <- tribble(
  ~id, ~group, ~start_time,                    ~end_time,
   1L,     1L, ymd_hms("2018-04-12 11:15:03"), ymd_hms("2018-05-14 02:32:10"),
   2L,     1L, ymd_hms("2018-07-04 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
   3L,     1L, ymd_hms("2018-07-05 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
   4L,     1L, ymd_hms("2018-07-15 02:53:20"), ymd_hms("2018-07-16 18:09:01"),
   5L,     1L, ymd_hms("2018-07-15 01:53:20"), ymd_hms("2018-07-19 18:09:01"),
   6L,     1L, ymd_hms("2018-07-20 02:53:20"), ymd_hms("2018-07-22 18:09:01"),
   7L,     1L, ymd_hms("2018-05-07 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
   8L,     1L, ymd_hms("2018-05-10 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
   9L,     2L, ymd_hms("2018-02-28 17:43:29"), ymd_hms("2018-04-20 03:48:40"),
  10L,     2L, ymd_hms("2018-04-20 01:19:52"), ymd_hms("2018-08-12 12:56:37"),
  11L,     2L, ymd_hms("2018-04-18 20:47:22"), ymd_hms("2018-04-19 16:07:29"),
  12L,     2L, ymd_hms("2018-10-02 14:08:03"), ymd_hms("2018-11-08 00:01:23"),
  13L,     3L, ymd_hms("2018-03-11 22:30:51"), ymd_hms("2018-10-20 21:01:42")
)

So let's look at the indx column for this dataset. I am adding group to arrange() so that all rows of the same group appear together; but, as you know, because we have group_by(group) we do not actually need that.

my_time_intervals %>% 
  group_by(group) %>% arrange(group, start_time) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                            cummax(as.numeric(end_time)))[-n()]))

# # A tibble: 13 x 5
# # Groups:   group [3]
#       id group start_time          end_time             indx
#    <int> <int> <dttm>              <dttm>              <dbl>
#  1     1     1 2018-04-12 11:15:03 2018-05-14 02:32:10     0
#  2     7     1 2018-05-07 13:02:04 2018-05-23 08:13:06     0
#  3     8     1 2018-05-10 13:02:04 2018-05-23 08:13:06     0
#  4     2     1 2018-07-04 02:53:20 2018-07-14 18:09:01     1
#  5     3     1 2018-07-05 02:53:20 2018-07-14 18:09:01     1
#  6     5     1 2018-07-15 01:53:20 2018-07-19 18:09:01     2
#  7     4     1 2018-07-15 02:53:20 2018-07-16 18:09:01     2
#  8     6     1 2018-07-20 02:53:20 2018-07-22 18:09:01     3
#  9     9     2 2018-02-28 17:43:29 2018-04-20 03:48:40     0
# 10    11     2 2018-04-18 20:47:22 2018-04-19 16:07:29     0
# 11    10     2 2018-04-20 01:19:52 2018-08-12 12:56:37     0
# 12    12     2 2018-10-02 14:08:03 2018-11-08 00:01:23     1
# 13    13     3 2018-03-11 22:30:51 2018-10-20 21:01:42     0

As you can see, in group one we have three distinct periods of time with overlapping data points, plus one data point with no overlapping entry within that group. The indx column divides those data points into four groups (i.e. 0, 1, 2, 3). Later in the solution, when we group_by(group, indx), we collect each set of overlapping rows together and take the earliest start time and latest end time to make the desired output.

Just to make the solution more robust (in case one data point starts sooner but ends later than all the other ones in a group of overlaps (group and indx), like the data points with ids 6 and 7), I changed first() and last() to min() and max().

So...

my_time_intervals %>% 
  group_by(group) %>% arrange(group, start_time) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                            cummax(as.numeric(end_time)))[-n()])) %>%
  group_by(group, indx) %>%
  summarise(start_time = min(start_time), end_time = max(end_time))

# # A tibble: 7 x 4
# # Groups:   group [?]
#   group  indx start_time          end_time
#   <int> <dbl> <dttm>              <dttm>
# 1     1     0 2018-04-12 11:15:03 2018-05-23 08:13:06
# 2     1     1 2018-07-04 02:53:20 2018-07-14 18:09:01
# 3     1     2 2018-07-15 01:53:20 2018-07-19 18:09:01
# 4     1     3 2018-07-20 02:53:20 2018-07-22 18:09:01
# 5     2     0 2018-02-28 17:43:29 2018-08-12 12:56:37
# 6     2     1 2018-10-02 14:08:03 2018-11-08 00:01:23
# 7     3     0 2018-03-11 22:30:51 2018-10-20 21:01:42

We used the unique index of each set of overlapping times and dates to get the period (start and end) for each of them.

Beyond this point, you need to read about cumsum and cummax, and look at the output of these two functions for this specific problem, to understand why the comparison I made ends up giving us unique identifiers for each set of overlapping times and dates.
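
If it helps, the same comparison can be spelled out with plain numbers (a small illustration in Python rather than R, with made-up interval bounds): the running maximum of the end times plays the role of cummax, and the cumulative sum of the "does the next interval start after everything seen so far has ended?" flags is the indx column.

from itertools import accumulate

# Made-up interval bounds: (1,5) and (3,8) overlap, (10,12) and (11,13) overlap, (20,25) stands alone.
starts = [1, 3, 10, 11, 20]
ends   = [5, 8, 12, 13, 25]

# Running maximum of end times "until now" -- the cummax part.
running_max_end = list(accumulate(ends, max))           # [5, 8, 12, 13, 25]

# Does the next interval start after every interval seen so far has ended?
new_group = [int(starts[i + 1] > running_max_end[i]) for i in range(len(starts) - 1)]

# Cumulative sum of those flags, with a leading 0 -- the cumsum part -- gives the group index.
indx = [0] + list(accumulate(new_group))
print(indx)                                             # [0, 0, 1, 1, 2]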

Hope this helps; it is my best attempt.

Merge Overlapping Time Intervals based on Hierarchy in SQL

Below is for BigQuery Standard SQL

#standardSQL
WITH check_times AS (
  SELECT id, start_time AS time FROM `project.dataset.table` UNION DISTINCT
  SELECT id, stop_time AS time FROM `project.dataset.table`
), distinct_intervals AS (
  SELECT id, time AS start_time, LEAD(time) OVER(PARTITION BY id ORDER BY time) stop_time
  FROM check_times
), deduped_intervals AS (
  SELECT a.id, a.start_time, a.stop_time, MIN(priority) priority
  FROM distinct_intervals a
  JOIN `project.dataset.table` b
  ON a.id = b.id
    AND a.start_time BETWEEN b.start_time AND b.stop_time
    AND a.stop_time BETWEEN b.start_time AND b.stop_time
  GROUP BY a.id, a.start_time, a.stop_time
), combined_intervals AS (
  SELECT id, MIN(start_time) start_time, MAX(stop_time) stop_time, ANY_VALUE(priority) priority
  FROM (
    SELECT id, start_time, stop_time, priority, COUNTIF(flag) OVER(PARTITION BY id ORDER BY start_time) grp
    FROM (
      SELECT id, start_time, stop_time, priority,
        start_time != IFNULL(LAG(stop_time) OVER(PARTITION BY id ORDER BY start_time), start_time) OR
        priority != IFNULL(LAG(priority) OVER(PARTITION BY id ORDER BY start_time), -1) flag
      FROM deduped_intervals
    )
  )
  GROUP BY id, grp
)
SELECT *
FROM combined_intervals
-- ORDER BY id, start_time

Applied to the sample data from your question, the result is:

(screenshot of the result table in the original answer)

Can you also share a solution where we merge intervals based on just id and no priority column?

I simply adjusted the above query slightly to ignore priority:

#standardSQL
WITH check_times AS (
  SELECT id, start_time AS time FROM `project.dataset.table` UNION DISTINCT
  SELECT id, stop_time AS time FROM `project.dataset.table`
), distinct_intervals AS (
  SELECT id, time AS start_time, LEAD(time) OVER(PARTITION BY id ORDER BY time) stop_time
  FROM check_times
), deduped_intervals AS (
  SELECT a.id, a.start_time, a.stop_time
  FROM distinct_intervals a
  JOIN `project.dataset.table` b
  ON a.id = b.id
    AND a.start_time BETWEEN b.start_time AND b.stop_time
    AND a.stop_time BETWEEN b.start_time AND b.stop_time
  GROUP BY a.id, a.start_time, a.stop_time
), combined_intervals AS (
  SELECT id, MIN(start_time) start_time, MAX(stop_time) stop_time
  FROM (
    SELECT id, start_time, stop_time, COUNTIF(flag) OVER(PARTITION BY id ORDER BY start_time) grp
    FROM (
      SELECT id, start_time, stop_time,
        start_time != IFNULL(LAG(stop_time) OVER(PARTITION BY id ORDER BY start_time), start_time) flag
      FROM deduped_intervals
    )
  )
  GROUP BY id, grp
)
SELECT *
FROM combined_intervals
-- ORDER BY id, start_time

with the result:

Row  id  start_time  stop_time
1    1   0           36
2    1   41          47

Java merge overlapping date intervals

private Set<AppointmentAvailabilityBlock> mergeUnavailabilitiesBlocks(
        Set<AppointmentAvailabilityBlock> appointmentComponentUnavailabilitiesBlock) {

    // Transform Set to List
    List<AppointmentAvailabilityBlock> intervals = new LinkedList<AppointmentAvailabilityBlock>();
    intervals.addAll(appointmentComponentUnavailabilitiesBlock);

    // Sort by blockStartTime
    intervals.sort(Comparator.comparing(AppointmentAvailabilityBlock::getBlockStartTime));

    // Merge
    LinkedList<AppointmentAvailabilityBlock> merged = new LinkedList<>();
    for (AppointmentAvailabilityBlock interval : intervals) {
        // No overlap with the previous interval, append it.
        if (merged.isEmpty() || merged.getLast().getBlockEndTime().isBefore(interval.getBlockStartTime())) {
            merged.add(interval);
        } else { // There is overlap: keep the later of the two end times.
            OffsetDateTime maxOffsetDateTime = merged.getLast().getBlockEndTime().isAfter(
                    interval.getBlockEndTime()) ? merged.getLast().getBlockEndTime() : interval.getBlockEndTime();

            merged.getLast().setBlockEndTime(maxOffsetDateTime);
        }
    }
    return new HashSet<AppointmentAvailabilityBlock>(merged);
}

