Finding overlapping ranges between two interval data
In general, it's very appropriate to use the bioconductor package IRanges
to deal with problems related to intervals. It does so efficiently by implementing interval tree. GenomicRanges
is another package that builds on top of IRanges
, specifically for handling, well, "Genomic Ranges".
require(GenomicRanges)
gr1 = with(dtFrags, GRanges(Rle(factor(chr,
levels=c("1", "2", "X", "Y"))), IRanges(start, end)))
gr2 = with(dtCoords, GRanges(Rle(factor(chr,
levels=c("1", "2", "X", "Y"))), IRanges(coord, coord)))
olaps = findOverlaps(gr2, gr1)
dtCoords[, grp := seq_len(nrow(dtCoords))]
dtFrags[subjectHits(olaps), grp := queryHits(olaps)]
setkey(dtCoords, grp)
setkey(dtFrags, grp)
dtFrags[, list(grp, id, type)][dtCoords]
grp id type id.1 chr coord
1: 1 1 exon 10 1 150
2: 2 2 intron 20 2 300
3: 2 4 exon 20 2 300
4: 3 NA NA 30 Y 500
Determine Whether Two Date Ranges Overlap
(StartA <= EndB) and (EndA >= StartB)
Proof:
Let ConditionA Mean that DateRange A Completely After DateRange B
_ |---- DateRange A ------|
|---Date Range B -----| _
(True if StartA > EndB
)
Let ConditionB Mean that DateRange A is Completely Before DateRange B
|---- DateRange A -----| _
_ |---Date Range B ----|
(True if EndA < StartB
)
Then Overlap exists if Neither A Nor B is true -
(If one range is neither completely after the other,
nor completely before the other,
then they must overlap.)
Now one of De Morgan's laws says that:
Not (A Or B)
<=> Not A And Not B
Which translates to: (StartA <= EndB) and (EndA >= StartB)
NOTE: This includes conditions where the edges overlap exactly. If you wish to exclude that,
change the >=
operators to >
, and <=
to <
NOTE2. Thanks to @Baodad, see this blog, the actual overlap is least of:
{ endA-startA
, endA - startB
, endB-startA
, endB - startB
}
(StartA <= EndB) and (EndA >= StartB)
(StartA <= EndB) and (StartB <= EndA)
NOTE3. Thanks to @tomosius, a shorter version reads:DateRangesOverlap = max(start1, start2) < min(end1, end2)
This is actually a syntactical shortcut for what is a longer implementation, which includes extra checks to verify that the start dates are on or before the endDates. Deriving this from above:
If start and end dates can be out of order, i.e., if it is possible that startA > endA
or startB > endB
, then you also have to check that they are in order, so that means you have to add two additional validity rules:(StartA <= EndB) and (StartB <= EndA) and (StartA <= EndA) and (StartB <= EndB)
or:(StartA <= EndB) and (StartA <= EndA) and (StartB <= EndA) and (StartB <= EndB)
or,(StartA <= Min(EndA, EndB) and (StartB <= Min(EndA, EndB))
or:(Max(StartA, StartB) <= Min(EndA, EndB)
But to implement Min()
and Max()
, you have to code, (using C ternary for terseness),:(StartA > StartB? Start A: StartB) <= (EndA < EndB? EndA: EndB)
What's the most efficient way to test if two ranges overlap?
What does it mean for the ranges to overlap? It means there exists some number C which is in both ranges, i.e.
x1 <= C <= x2
and
y1 <= C <= y2
To avoid confusion, considering the ranges are:
[x1:x2] and [y1:y2]
Now, if we are allowed to assume that the ranges are well-formed (so that x1 <= x2 and y1 <= y2) then it is sufficient to test
x1 <= y2 && y1 <= x2
OR
(StartA <= EndB) and (EndA >= StartB)
Efficiently find the overlap between two time intervals in R
Maybe try data.table
foverlaps
function:
library(data.table)
setDT(dat)
setkey(dat, start, end)
foverlaps(dat, dat)[id != i.id]
Finding overlaps between millions of ranges/intervals
Here's an algorithm.
- Prepare a set of intervals sorted by starting point. Initially the set is empty.
- Sort all starting and ending points.
- Traverse the points. If a starting point is encountered, add the corresponding interval to the set. If an ending point is encountered, remove the corresponding interval from the set.
- When removing an interval, look at other intervals in the set. They all overlap the interval being removed, and they are sorted by the length of the overlap, longest first. Traverse the set until the length is too short, and report each overlap.
Efficiently finding overlapping intervals from a list of intervals
Create a single, sorted array of transitions. Each transition has a position, and a cumulative number based on how many intervals you're joining or leaving. As you pass through the list, keep track of how many intervals you are in. When you're in as many intervals as series, that's when you're in a common interval.
For your example the transitions would be:
[2, 1], [6, -1], [7, 1], [11, -1],
[1, 1], [3, -1], [5, 1], [10, -1], [11, 1], [13, -1]
[2, 1], [5, -1], [6, 1], [8, -1]
which after sorting by position and merging collapses to:
[1, 1], [2, 2], [3, -1], [5, 0], [6, 0], [7, 1], [8, -1], [10, -1], [11, 0], [13, -1]
which gives you transitions for running totals of:
[1, 1], [2, 3], [3, 2], [7, 3], [8, 2], [10, 2], [13, 1]
And then we can read off the intervals where we're at 3 as one starting at 2
and going to 3
, and another starting at 7
and going to 8
. Which is the answer.
The idea of creating one long list and sorting is admittedly extra work. You can instead create those lists and merge them on the fly. The savings is a factor of the log of the number of series rather than the log of the number of events.
Possible Interview Question: How to Find All Overlapping Intervals
Throw the endpoints of the intervals into an array, marking them as either start- or end-points. Sort them by breaking ties by placing end-points before start-points if the intervals are closed, or the other way around if they're half-open.
1S, 2S, 3E, 4E, 5S, 10E, 12S, 13S, 14E, 15E
Then iterate through the list, keeping track of how many intervals we're in (this equates to number of start-points processed minus number of end-points). Whenever we hit a start-point while we are already in an interval, this means we must have overlapping intervals.
1S, 2S, 3E, 4E, 5S, 10E, 12S, 13S, 14E, 15E
^ ^
overlap overlap
We can find which intervals overlap with which by storing data alongside the end-points, and keeping track of which intervals we're in.
This is an O(N logN) solution, with sorting being the main factor.
search for interval overlap in list of intervals?
[a, b] overlaps with [x, y] iff b > x and a < y. Sorting intervals by their first elements gives you intervals matching the first condition in log time. Sorting intervals by their last elements gives you intervals matching the second condition in log time. Take the intersections of the resulting sets.
Finding intervals of a set that are overlapping
Build a list of values (a, -1), (b, 1) for every interval [a, b]. Now sorting these lets you run through the array, counting how many intervals are currently open at each endpoint, just by aggregating the +1 and -1s.
It's important that (b, 1) comes after (b, -1) in the sorted list; because we want to count the intersection even if the intersection is at a single point.
Complete code here.
def overlaps(intervals):
es = []
for a, b in intervals:
es.append((a, -1))
es.append((b, 1))
es.sort()
result = 0
n = 0
for a, s in es:
if s == -1: result += n
n -= s
return result
Note, this is trivially O(n log n), and doesn't need any fancy data-structures.
Related Topics
Dealing with Spaces and "Weird" Characters in Column Names with Dplyr::Rename()
How to Calculate Mean of All Columns, by Group
Ggplot2: Using Gtable to Move Strip Labels to Top of Panel for Facet_Grid
R Shiny, How to Make Datatable React to Checkboxes in Datatable
How to Change Angle of Line in Customized Legend in Ggplot2
Add Row in Each Group Using Dplyr and Add_Row()
How to Turn Gpclibpermit() to True
Ggplot: How to Set Default Color for All Geoms
Check If String Contains Only Numbers or Only Characters (R)
How to Assign from a Function with Multiple Outputs
Set a Functions Environment to That of the Calling Environment (Parent.Frame) from Within Function
Cbind: How to Have Missing Values Set to Na
Programmatically Insert Header and Plot in Same Code Chunk with R Markdown Using Results='Asis'
Different Robust Standard Errors of Logit Regression in Stata and R
Count the Number of Non-Zero Elements of Each Column