Collapse rows with overlapping ranges
You can try this:
library(dplyr)
ranges %>%
  arrange(start) %>%
  group_by(g = cumsum(cummax(lag(stop, default = first(stop))) < start)) %>%
  summarise(start = first(start), stop = max(stop))
# A tibble: 2 × 3
# g start stop
# <int> <dbl> <dbl>
#1 0 65.72000 87.75625
#2 1 89.61625 104.94062
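As a cross-check of the idea (not the author's code), the same sort-then-sweep collapse can be sketched in plain Python; the input ranges below are invented to mimic the two collapsed rows above:

```python
def merge_ranges(ranges):
    """Merge overlapping [start, stop] pairs; input need not be sorted."""
    merged = []
    for start, stop in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps the last merged range: extend its right edge if needed
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            merged.append([start, stop])
    return merged

# Hypothetical inputs chosen to reproduce the shape of the output above
print(merge_ranges([(65.72, 80.0), (70.1, 87.75625),
                    (89.61625, 100.0), (95.5, 104.94062)]))
# -> [[65.72, 87.75625], [89.61625, 104.94062]]
```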
Collapsing rows with consecutive ranges in two separate columns
There are several ways to achieve this; here is one:
library(tidyverse)
genomic_ranges %>%
  group_by(sample_ID) %>%
  summarize(start = min(start),
            end = max(end),
            feature = feature[1])
which gives:
# A tibble: 3 x 4
sample_ID start end feature
<chr> <dbl> <dbl> <chr>
1 A 1 5 normal
2 B 20 70 DUP
3 C 250 400 DUP
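If you ever need the same per-sample collapse in pandas, it is a plain groupby aggregation; the frame and column names below are assumptions mirroring the R example:

```python
import pandas as pd

# Hypothetical frame shaped like genomic_ranges in the R example
genomic_ranges = pd.DataFrame({
    "sample_ID": ["A", "A", "B", "B", "C"],
    "start":     [1,   3,   20,  40,  250],
    "end":       [4,   5,   50,  70,  400],
    "feature":   ["normal", "normal", "DUP", "DUP", "DUP"],
})

# min start, max end, and first feature per sample
out = genomic_ranges.groupby("sample_ID", as_index=False).agg(
    start=("start", "min"),
    end=("end", "max"),
    feature=("feature", "first"),
)
```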
Pandas: collapse overlapping intervals [start-end] and keep the smaller
It can be done like below:
df.groupby(((df.shift()["end"] - df["start"]) < 0).cumsum()).agg({"start": "min", "end": "max"})
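One caveat: the shift() comparison only looks at the immediately preceding row, so an interval fully contained in an earlier one can break the grouping. A sketch of a more robust variant (hypothetical data) replaces the lagged end with its running maximum:

```python
import pandas as pd

# Hypothetical data: the first interval contains every later one,
# so a plain shift()-based comparison would wrongly split the group
df = pd.DataFrame({"start": [1, 3, 10, 12, 20],
                   "end":   [25, 7, 11, 15, 18]})

df = df.sort_values("start").reset_index(drop=True)
# Group changes only when a start exceeds the running max of all previous ends
grp = (df["start"] > df["end"].shift().cummax()).cumsum()
merged = df.groupby(grp).agg({"start": "min", "end": "max"})
```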
Collapse and merge overlapping time intervals
my_time_intervals %>%
  group_by(group) %>%
  arrange(start_time, .by_group = TRUE) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()])) %>%
  group_by(group, indx) %>%
  summarise(start_time = min(start_time),
            end_time = max(end_time)) %>%
  select(-indx)
# # A tibble: 5 x 3
# # Groups: group [3]
# group start_time end_time
# <int> <dttm> <dttm>
# 1 1 2018-04-12 11:15:03 2018-05-23 08:13:06
# 2 1 2018-07-04 02:53:20 2018-07-14 18:09:01
# 3 2 2018-02-28 17:43:29 2018-08-12 12:56:37
# 4 2 2018-10-02 14:08:03 2018-11-08 00:01:23
# 5 3 2018-03-11 22:30:51 2018-10-20 21:01:42
Explanation per OP's request:
I am making another dataset with more overlapping times within each group, so the solution gets more exposure and is hopefully easier to grasp:
my_time_intervals <- tribble(
~id, ~group, ~start_time, ~end_time,
1L, 1L, ymd_hms("2018-04-12 11:15:03"), ymd_hms("2018-05-14 02:32:10"),
2L, 1L, ymd_hms("2018-07-04 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
3L, 1L, ymd_hms("2018-07-05 02:53:20"), ymd_hms("2018-07-14 18:09:01"),
4L, 1L, ymd_hms("2018-07-15 02:53:20"), ymd_hms("2018-07-16 18:09:01"),
5L, 1L, ymd_hms("2018-07-15 01:53:20"), ymd_hms("2018-07-19 18:09:01"),
6L, 1L, ymd_hms("2018-07-20 02:53:20"), ymd_hms("2018-07-22 18:09:01"),
7L, 1L, ymd_hms("2018-05-07 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
8L, 1L, ymd_hms("2018-05-10 13:02:04"), ymd_hms("2018-05-23 08:13:06"),
9L, 2L, ymd_hms("2018-02-28 17:43:29"), ymd_hms("2018-04-20 03:48:40"),
10L, 2L, ymd_hms("2018-04-20 01:19:52"), ymd_hms("2018-08-12 12:56:37"),
11L, 2L, ymd_hms("2018-04-18 20:47:22"), ymd_hms("2018-04-19 16:07:29"),
12L, 2L, ymd_hms("2018-10-02 14:08:03"), ymd_hms("2018-11-08 00:01:23"),
13L, 3L, ymd_hms("2018-03-11 22:30:51"), ymd_hms("2018-10-20 21:01:42")
)
So let's look at the indx column for this dataset. I added the group column to arrange() so that all rows of the same group appear together; but, as you know, because we have group_by(group), we do not actually need that.
my_time_intervals %>%
  group_by(group) %>%
  arrange(group, start_time) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()]))
# # A tibble: 13 x 5
# # Groups: group [3]
# id group start_time end_time indx
# <int> <int> <dttm> <dttm> <dbl>
# 1 1 1 2018-04-12 11:15:03 2018-05-14 02:32:10 0
# 2 7 1 2018-05-07 13:02:04 2018-05-23 08:13:06 0
# 3 8 1 2018-05-10 13:02:04 2018-05-23 08:13:06 0
# 4 2 1 2018-07-04 02:53:20 2018-07-14 18:09:01 1
# 5 3 1 2018-07-05 02:53:20 2018-07-14 18:09:01 1
# 6 5 1 2018-07-15 01:53:20 2018-07-19 18:09:01 2
# 7 4 1 2018-07-15 02:53:20 2018-07-16 18:09:01 2
# 8 6 1 2018-07-20 02:53:20 2018-07-22 18:09:01 3
# 9 9 2 2018-02-28 17:43:29 2018-04-20 03:48:40 0
# 10 11 2 2018-04-18 20:47:22 2018-04-19 16:07:29 0
# 11 10 2 2018-04-20 01:19:52 2018-08-12 12:56:37 0
# 12 12 2 2018-10-02 14:08:03 2018-11-08 00:01:23 1
# 13 13 3 2018-03-11 22:30:51 2018-10-20 21:01:42 0
As you can see, in group one we have 3 distinct periods of time with overlapping data points, and one data point with no overlapping entry within that group. The indx column divides those data points into 4 groups (i.e. 0, 1, 2, 3). Later in the solution, when we group_by(group, indx), we get each set of overlapping rows together, and take the first starting time and last ending time to make the desired output.
Just to make the solution less error-prone (in case a data point starts sooner but also ends later than all the others in the same (group, indx) set, like the data points with ids 4 and 5), I changed first() and last() to min() and max().
So...
my_time_intervals %>%
  group_by(group) %>%
  arrange(group, start_time) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start_time)) >
                              cummax(as.numeric(end_time)))[-n()])) %>%
  group_by(group, indx) %>%
  summarise(start_time = min(start_time), end_time = max(end_time))
# # A tibble: 7 x 4
# # Groups: group [?]
# group indx start_time end_time
# <int> <dbl> <dttm> <dttm>
# 1 1 0 2018-04-12 11:15:03 2018-05-23 08:13:06
# 2 1 1 2018-07-04 02:53:20 2018-07-14 18:09:01
# 3 1 2 2018-07-15 01:53:20 2018-07-19 18:09:01
# 4 1 3 2018-07-20 02:53:20 2018-07-22 18:09:01
# 5 2 0 2018-02-28 17:43:29 2018-08-12 12:56:37
# 6 2 1 2018-10-02 14:08:03 2018-11-08 00:01:23
# 7 3 0 2018-03-11 22:30:51 2018-10-20 21:01:42
We used the unique index of each overlapping time and date to get the period (start and end) for each of them.
Beyond this point, you need to read about cumsum and cummax, and look at the output of these two functions for this specific problem, to understand why the comparison I made ends up giving us a unique identifier for each set of overlapping times and dates.
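To see what cumsum and cummax each contribute, here is a small pure-Python illustration (toy numbers, not the tibble above) of how the indx identifiers arise:

```python
from itertools import accumulate

starts = [1, 3, 10, 12, 20]   # already sorted by start
ends   = [5, 7, 11, 15, 25]

# cummax(end): the furthest right edge seen so far
run_max_end = list(accumulate(ends, max))          # [5, 7, 11, 15, 25]

# lead(start) > cummax(end): does the NEXT interval start past everything so far?
new_group = [s > m for s, m in zip(starts[1:], run_max_end)]

# c(0, cumsum(...)): prepend 0 so the first row opens group 0
indx = [0] + list(accumulate(int(b) for b in new_group))
print(indx)  # -> [0, 0, 1, 2, 3]
```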
Hope this helps; it is my best.
Merging rows with overlapping values
Here is a data.table solution:
library(data.table)
setDT(testDT)
testDT[order(AgeMin)
       ][, .(AgeMin = min(AgeMin), AgeMax = max(AgeMax)),
         by = .(group = cumsum(c(1, tail(AgeMin, -1) > head(AgeMax, -1))))]
#> group AgeMin AgeMax
#> 1: 1 13273 13540
#> 2: 2 13794 14087
#> 3: 3 14095 14343
The key to this solution is getting the group of overlapping periods.
Let's say we have two ranges p1 and p2, with starts and ends named start1, end1, start2, end2.
There are only two conditions under which p1 and p2 are not overlapping:
start1 > end2 OR end1 < start2
Since we already ordered AgeMin ascending, we only need to consider condition 1.
Then we can use cumsum to get the group identifier.
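The non-overlap test can be made concrete with a small hypothetical helper:

```python
def not_overlapping(start1, end1, start2, end2):
    # The only two ways closed intervals p1 and p2 can miss each other
    return start1 > end2 or end1 < start2

# With rows sorted ascending by start, the earlier row's start can never
# exceed the later row's end, so only one of the two checks remains live.
print(not_overlapping(13273, 13540, 13794, 14087))  # -> True (separate groups)
print(not_overlapping(13794, 14087, 14000, 14343))  # -> False (same group)
```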
Dropping rows with overlapping date ranges
You can use:
library(data.table)
setDT(final_arrange)[, .SD[Event_start - shift(Event_end) > 0 | seq_len(.N) == 1], ticker]
# ticker Event_start Event_end
# 1: AAP 2018-11-23 2018-12-03
# 2: AAP 2019-02-14 2019-02-24
# 3: AAP 2019-03-07 2019-03-17
# 4: AAP 2019-05-17 2019-05-27
# 5: AAP 2019-08-22 2019-09-01
# 6: AAP 2019-11-07 2019-11-17
# 7: AAP 2020-02-13 2020-02-23
# 8: AAP 2020-05-14 2020-05-24
# 9: AAP 2020-06-05 2020-06-15
#10: AAPL 2018-07-04 2018-07-14
#11: AAPL 2018-08-01 2018-08-11
#12: EFSC 2020-04-15 2020-04-25
#13: EFSC 2020-07-15 2020-07-25
#14: EFX 2018-07-06 2018-07-16
#15: EFX 2018-07-20 2018-07-30
#16: EFX 2018-08-03 2018-08-13
Or with dplyr:
library(dplyr)
final_arrange %>%
  arrange(ticker, Event_start) %>%
  group_by(ticker) %>%
  filter(Event_start - lag(Event_end) > 0 | row_number() == 1)
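The same keep/drop rule can be sketched in pandas (invented tickers and dates; column names assumed to match the R frame):

```python
import pandas as pd

# Hypothetical data: the middle AAP row overlaps the first and should be dropped
final_arrange = pd.DataFrame({
    "ticker": ["AAP", "AAP", "AAP"],
    "Event_start": pd.to_datetime(["2018-11-23", "2018-11-30", "2019-02-14"]),
    "Event_end":   pd.to_datetime(["2018-12-03", "2018-12-10", "2019-02-24"]),
}).sort_values(["ticker", "Event_start"])

# Keep the first row per ticker, plus any row starting after the previous row's end
prev_end = final_arrange.groupby("ticker")["Event_end"].shift()
out = final_arrange[(final_arrange["Event_start"] > prev_end) | prev_end.isna()]
```

Like the R versions, this compares each row to the physically previous row, so it assumes the data is sorted by ticker and start date first.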
Merge overlapping ranges per group
I used the Bioconductor GenomicRanges package, which seems highly appropriate to your domain.
> ## install.packages("BiocManager")
> ## BiocManager::install("GenomicRanges")
> library(GenomicRanges)
> my.df |> as("GRanges") |> reduce()
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 4F 2500-3401 +
[2] 4F 19116-20730 +
[3] 4F 1420-2527 -
[4] 0F 1405-1700 -
[5] 0F 1727-2038 -
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
which differs from your expectation because there are two non-overlapping 0F ranges?
How to flatten / merge overlapping time periods
Here's a possible solution. The basic idea here is to compare the lagged start date with the maximum end date "until now" using the cummax function, and create an index that will separate the data into groups:
data %>%
  arrange(ID, start) %>% # as suggested by @Jonno in case the data is unsorted
  group_by(ID) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
                              cummax(as.numeric(end)))[-n()])) %>%
  group_by(ID, indx) %>%
  summarise(start = first(start), end = last(end))
# Source: local data frame [3 x 4]
# Groups: ID
#
# ID indx start end
# 1 A 0 2013-01-01 2013-01-06
# 2 A 1 2013-01-07 2013-01-11
# 3 A 2 2013-01-12 2013-01-15
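For comparison only, a pandas sketch of the same lead/cummax collapse on hypothetical dates shaped like this answer's output (using min/max rather than first/last, as an earlier answer recommends for robustness):

```python
import pandas as pd

# Hypothetical per-ID date ranges, sorted by ID and start
data = pd.DataFrame({
    "ID":    ["A"] * 4,
    "start": pd.to_datetime(["2013-01-01", "2013-01-04", "2013-01-07", "2013-01-12"]),
    "end":   pd.to_datetime(["2013-01-06", "2013-01-05", "2013-01-11", "2013-01-15"]),
}).sort_values(["ID", "start"])

# Running max of the previous end dates within each ID (the cummax step)
prev_max_end = data.groupby("ID")["end"].transform(lambda s: s.shift().cummax())

# Index ticks up whenever a start lies past everything seen so far
indx = (data["start"] > prev_max_end).cumsum()

out = data.groupby(["ID", indx]).agg(start=("start", "min"), end=("end", "max"))
```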