Collapse rows with overlapping ranges
You can try this:
library(dplyr)
ranges %>%
  arrange(start) %>%
  group_by(g = cumsum(cummax(lag(stop, default = first(stop))) < start)) %>%
  summarise(start = first(start), stop = max(stop))
# A tibble: 2 × 3
# g start stop
# <int> <dbl> <dbl>
#1 0 65.72000 87.75625
#2 1 89.61625 104.94062
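As a cross-check of the idea (not the author's code), the same sort-then-sweep collapse can be sketched in plain Python; the input ranges below are invented to mimic the two collapsed rows above:

```python
def merge_ranges(ranges):
    """Merge overlapping [start, stop] pairs; input need not be sorted."""
    merged = []
    for start, stop in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps the last merged range: extend its right edge if needed
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            merged.append([start, stop])
    return merged

# Hypothetical inputs chosen to reproduce the shape of the output above
print(merge_ranges([(65.72, 80.0), (70.1, 87.75625),
                    (89.61625, 100.0), (95.5, 104.94062)]))
# -> [[65.72, 87.75625], [89.61625, 104.94062]]
```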
Collapsing rows with consecutive ranges in two separate columns
There are several ways to achieve this; here is one:
library(tidyverse)
genomic_ranges %>%
  group_by(sample_ID) %>%
  summarize(start = min(start),
            end = max(end),
            feature = feature[1])
which gives:
# A tibble: 3 x 4
sample_ID start end feature
<chr> <dbl> <dbl> <chr>
1 A 1 5 normal
2 B 20 70 DUP
3 C 250 400 DUP
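If you ever need the same per-sample collapse in pandas, it is a plain groupby aggregation; the frame and column names below are assumptions mirroring the R example:

```python
import pandas as pd

# Hypothetical frame shaped like genomic_ranges in the R example
genomic_ranges = pd.DataFrame({
    "sample_ID": ["A", "A", "B", "B", "C"],
    "start":     [1,   3,   20,  40,  250],
    "end":       [4,   5,   50,  70,  400],
    "feature":   ["normal", "normal", "DUP", "DUP", "DUP"],
})

# min start, max end, and first feature per sample
out = genomic_ranges.groupby("sample_ID", as_index=False).agg(
    start=("start", "min"),
    end=("end", "max"),
    feature=("feature", "first"),
)
```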
Pandas: collapse overlapping intervals [start-end] and keep the smaller
It can be done like below:
df.groupby(((df.shift()["end"] - df["start"]) < 0).cumsum()).agg({"start": "min", "end": "max"})
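One caveat: the shift() comparison only looks at the immediately preceding row, so an interval fully contained in an earlier one can break the grouping. A sketch of a more robust variant (hypothetical data) replaces the lagged end with its running maximum:

```python
import pandas as pd

# Hypothetical data: the first interval contains every later one,
# so a plain shift()-based comparison would wrongly split the group
df = pd.DataFrame({"start": [1, 3, 10, 12, 20],
                   "end":   [25, 7, 11, 15, 18]})

df = df.sort_values("start").reset_index(drop=True)
# Group changes only when a start exceeds the running max of all previous ends
grp = (df["start"] > df["end"].shift().cummax()).cumsum()
merged = df.groupby(grp).agg({"start": "min", "end": "max"})
```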
Collapse and merge overlapping time intervals
my_time_intervals %>%
  group_by(group) %>%
  arrange(start_time, .by_group = TRUE) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()])) %>%
  group_by(group, indx) %>%
  summarise(start_time = min(start_time),
            end_time = max(end_time)) %>%
  select(-indx)
# # A tibble: 5 x 3
# # Groups: group [3]
# group start_time end_time
# <int> <dttm> <dttm>
# 1 1 2018-04-12 11:15:03 2018-05-23 08:13:06
# 2 1 2018-07-04 02:53:20 2018-07-14 18:09:01
# 3 2 2018-02-28 17:43:29 2018-08-12 12:56:37
# 4 2 2018-10-02 14:08:03 2018-11-08 00:01:23
# 5 3 2018-03-11 22:30:51 2018-10-20 21:01:42
Explanation per OP's request:
I am making another dataset with more overlapping times within each group, so the solution gets more exposure and is hopefully easier to grasp:
my_time_intervals <- tribble(
~id, ~group, ~start_time, ~end_time,
1L, 1L, ymd_hms("2018-04-12 11:15:03"), ymd_hms("2018-05-14 02:32:10"),
2L, 1L, ymd_hms("2018-07-04 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
3L, 1L, ymd_hms("2018-07-05 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
4L, 1L, ymd_hms("2018-07-15 02:53:20"), ymd_hms("2018-07-16 18:09:01"),
5L, 1L, ymd_hms("2018-07-15 01:53:20"), ymd_hms("2018-07-19 18:09:01"),
6L, 1L, ymd_hms("2018-07-20 02:53:20"), ymd_hms("2018-07-22 18:09:01"),
7L, 1L, ymd_hms("2018-05-07 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
8L, 1L, ymd_hms("2018-05-10 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
9L, 2L, ymd_hms("2018-02-28 17:43:29"), ymd_hms("2018-04-20 03:48:40"),
10L, 2L, ymd_hms("2018-04-20 01:19:52"), ymd_hms("2018-08-12 12:56:37"),
11L, 2L, ymd_hms("2018-04-18 20:47:22"), ymd_hms("2018-04-19 16:07:29"),
12L, 2L, ymd_hms("2018-10-02 14:08:03"), ymd_hms("2018-11-08 00:01:23"),
13L, 3L, ymd_hms("2018-03-11 22:30:51"), ymd_hms("2018-10-20 21:01:42")
)
So let's look at the indx column for this dataset. I added the group column to arrange() so that all rows of the same group appear together; but, as you know, because we have group_by(group), we do not actually need that.
my_time_intervals %>%
  group_by(group) %>%
  arrange(group, start_time) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()]))
# # A tibble: 13 x 5
# # Groups: group [3]
# id group start_time end_time indx
# <int> <int> <dttm> <dttm> <dbl>
# 1 1 1 2018-04-12 11:15:03 2018-05-14 02:32:10 0
# 2 7 1 2018-05-07 13:02:04 2018-05-23 08:13:06 0
# 3 8 1 2018-05-10 13:02:04 2018-05-23 08:13:06 0
# 4 2 1 2018-07-04 02:53:20 2018-07-14 18:09:01 1
# 5 3 1 2018-07-05 02:53:20 2018-07-14 18:09:01 1
# 6 5 1 2018-07-15 01:53:20 2018-07-19 18:09:01 2
# 7 4 1 2018-07-15 02:53:20 2018-07-16 18:09:01 2
# 8 6 1 2018-07-20 02:53:20 2018-07-22 18:09:01 3
# 9 9 2 2018-02-28 17:43:29 2018-04-20 03:48:40 0
# 10 11 2 2018-04-18 20:47:22 2018-04-19 16:07:29 0
# 11 10 2 2018-04-20 01:19:52 2018-08-12 12:56:37 0
# 12 12 2 2018-10-02 14:08:03 2018-11-08 00:01:23 1
# 13 13 3 2018-03-11 22:30:51 2018-10-20 21:01:42 0
As you can see, in group one we have 3 distinct periods of time with overlapping data points, and one data point with no overlapping entry within that group. The indx column divides those data points into 4 groups (i.e. 0, 1, 2, 3). Later in the solution, when we group_by(group, indx), we get each set of overlapping rows together, and take the first starting time and last ending time to make the desired output.
Just to make the solution less error-prone (in case a data point starts sooner but also ends later than all the others in the same (group, indx) set, like the data points with ids 4 and 5), I changed first() and last() to min() and max().
So...
my_time_intervals %>%
  group_by(group) %>%
  arrange(group, start_time) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()])) %>%
  group_by(group, indx) %>%
  summarise(start_time = min(start_time), end_time = max(end_time))
# # A tibble: 7 x 4
# # Groups: group [?]
# group indx start_time end_time
# <int> <dbl> <dttm> <dttm>
# 1 1 0 2018-04-12 11:15:03 2018-05-23 08:13:06
# 2 1 1 2018-07-04 02:53:20 2018-07-14 18:09:01
# 3 1 2 2018-07-15 01:53:20 2018-07-19 18:09:01
# 4 1 3 2018-07-20 02:53:20 2018-07-22 18:09:01
# 5 2 0 2018-02-28 17:43:29 2018-08-12 12:56:37
# 6 2 1 2018-10-02 14:08:03 2018-11-08 00:01:23
# 7 3 0 2018-03-11 22:30:51 2018-10-20 21:01:42
We used the unique index of each overlapping time and date to get the period (start and end) for each of them.
Beyond this point, you need to read about cumsum and cummax, and look at the output of these two functions for this specific problem, to understand why the comparison I made ends up giving us a unique identifier for each set of overlapping times and dates.
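To see what cumsum and cummax each contribute, here is a small pure-Python illustration (toy numbers, not the tibble above) of how the indx identifiers arise:

```python
from itertools import accumulate

starts = [1, 3, 10, 12, 20]   # already sorted by start
ends   = [5, 7, 11, 15, 25]

# cummax(end): the furthest right edge seen so far
run_max_end = list(accumulate(ends, max))          # [5, 7, 11, 15, 25]

# lead(start) > cummax(end): does the NEXT interval start past everything so far?
new_group = [s > m for s, m in zip(starts[1:], run_max_end)]

# c(0, cumsum(...)): prepend 0 so the first row opens group 0
indx = [0] + list(accumulate(int(b) for b in new_group))
print(indx)  # -> [0, 0, 1, 2, 3]
```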
Hope this helps; it is my best.
Merging rows with overlapping values
Here is a data.table solution:
library(data.table)
setDT(testDT)
testDT[order(AgeMin)
       ][, .(AgeMin = min(AgeMin), AgeMax = max(AgeMax)),
         by = .(group = cumsum(c(1, tail(AgeMin, -1) > head(AgeMax, -1))))]
#> group AgeMin AgeMax
#> 1: 1 13273 13540
#> 2: 2 13794 14087
#> 3: 3 14095 14343
The key to this solution is getting the group of overlapping periods.
Let's say we have two ranges p1 and p2, with starts and ends named start1, end1, start2, end2.
There are only two conditions under which p1 and p2 are not overlapping:
start1 > end2 OR end1 < start2
Since we already ordered AgeMin ascending, we only need to consider condition 1.
Then we can use cumsum to get the group identifier.
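The non-overlap test can be made concrete with a small hypothetical helper:

```python
def not_overlapping(start1, end1, start2, end2):
    # The only two ways closed intervals p1 and p2 can miss each other
    return start1 > end2 or end1 < start2

# With rows sorted ascending by start, the earlier row's start can never
# exceed the later row's end, so only one of the two checks remains live.
print(not_overlapping(13273, 13540, 13794, 14087))  # -> True (separate groups)
print(not_overlapping(13794, 14087, 14000, 14343))  # -> False (same group)
```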
Dropping rows with overlapping date ranges
You can use:
library(data.table)
setDT(final_arrange)[, .SD[Event_start - shift(Event_end) > 0 | seq_len(.N) == 1], ticker]
# ticker Event_start Event_end
# 1: AAP 2018-11-23 2018-12-03
# 2: AAP 2019-02-14 2019-02-24
# 3: AAP 2019-03-07 2019-03-17
# 4: AAP 2019-05-17 2019-05-27
# 5: AAP 2019-08-22 2019-09-01
# 6: AAP 2019-11-07 2019-11-17
# 7: AAP 2020-02-13 2020-02-23
# 8: AAP 2020-05-14 2020-05-24
# 9: AAP 2020-06-05 2020-06-15
#10: AAPL 2018-07-04 2018-07-14
#11: AAPL 2018-08-01 2018-08-11
#12: EFSC 2020-04-15 2020-04-25
#13: EFSC 2020-07-15 2020-07-25
#14: EFX 2018-07-06 2018-07-16
#15: EFX 2018-07-20 2018-07-30
#16: EFX 2018-08-03 2018-08-13
Or with dplyr:
library(dplyr)
final_arrange %>%
  arrange(ticker, Event_start) %>%
  group_by(ticker) %>%
  filter(Event_start - lag(Event_end) > 0 | row_number() == 1)
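The same keep/drop rule can be sketched in pandas (invented tickers and dates; column names assumed to match the R frame):

```python
import pandas as pd

# Hypothetical data: the middle AAP row overlaps the first and should be dropped
final_arrange = pd.DataFrame({
    "ticker": ["AAP", "AAP", "AAP"],
    "Event_start": pd.to_datetime(["2018-11-23", "2018-11-30", "2019-02-14"]),
    "Event_end":   pd.to_datetime(["2018-12-03", "2018-12-10", "2019-02-24"]),
}).sort_values(["ticker", "Event_start"])

# Keep the first row per ticker, plus any row starting after the previous row's end
prev_end = final_arrange.groupby("ticker")["Event_end"].shift()
out = final_arrange[(final_arrange["Event_start"] > prev_end) | prev_end.isna()]
```

Like the R versions, this compares each row to the physically previous row, so it assumes the data is sorted by ticker and start date first.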
Merge overlapping ranges per group
I used the Bioconductor GenomicRanges package, which seems highly appropriate to your domain.
> ## install.packages("BiocManager")
> ## BiocManager::install("GenomicRanges")
> library(GenomicRanges)
> my.df |> as("GRanges") |> reduce()
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 4F 2500-3401 +
[2] 4F 19116-20730 +
[3] 4F 1420-2527 -
[4] 0F 1405-1700 -
[5] 0F 1727-2038 -
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
which differs from your expectation because there are two non-overlapping 0F ranges?
How to flatten / merge overlapping time periods
Here's a possible solution. The basic idea here is to compare the lagged start date with the maximum end date "until now" using the cummax function, and create an index that will separate the data into groups:
data %>%
  arrange(ID, start) %>% # as suggested by @Jonno in case the data is unsorted
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                              cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = first(start), end = last(end))
# Source: local data frame [3 x 4]
# Groups: ID
#
# ID indx start end
# 1 A 0 2013-01-01 2013-01-06
# 2 A 1 2013-01-07 2013-01-11
# 3 A 2 2013-01-12 2013-01-15
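For comparison only, a pandas sketch of the same lead/cummax collapse on hypothetical dates shaped like this answer's output (using min/max rather than first/last, as an earlier answer recommends for robustness):

```python
import pandas as pd

# Hypothetical per-ID date ranges, sorted by ID and start
data = pd.DataFrame({
    "ID":    ["A"] * 4,
    "start": pd.to_datetime(["2013-01-01", "2013-01-04", "2013-01-07", "2013-01-12"]),
    "end":   pd.to_datetime(["2013-01-06", "2013-01-05", "2013-01-11", "2013-01-15"]),
}).sort_values(["ID", "start"])

# Running max of the previous end dates within each ID (the cummax step)
prev_max_end = data.groupby("ID")["end"].transform(lambda s: s.shift().cummax())

# Index ticks up whenever a start lies past everything seen so far
indx = (data["start"] > prev_max_end).cumsum()

out = data.groupby(["ID", indx]).agg(start=("start", "min"), end=("end", "max"))
```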