Finding Overlapping Ranges Between Two Interval Data

Finding overlapping ranges between two interval data

In general, it's very appropriate to use the bioconductor package IRanges to deal with problems related to intervals. It does so efficiently by implementing interval tree. GenomicRanges is another package that builds on top of IRanges, specifically for handling, well, "Genomic Ranges".

require(GenomicRanges)
gr1 = with(dtFrags, GRanges(Rle(factor(chr,
levels=c("1", "2", "X", "Y"))), IRanges(start, end)))
gr2 = with(dtCoords, GRanges(Rle(factor(chr,
levels=c("1", "2", "X", "Y"))), IRanges(coord, coord)))
olaps = findOverlaps(gr2, gr1)
dtCoords[, grp := seq_len(nrow(dtCoords))]
dtFrags[subjectHits(olaps), grp := queryHits(olaps)]
setkey(dtCoords, grp)
setkey(dtFrags, grp)
dtFrags[, list(grp, id, type)][dtCoords]

grp id type id.1 chr coord
1: 1 1 exon 10 1 150
2: 2 2 intron 20 2 300
3: 2 4 exon 20 2 300
4: 3 NA NA 30 Y 500

Determine Whether Two Date Ranges Overlap

(StartA <= EndB) and (EndA >= StartB)

Proof:

Let ConditionA Mean that DateRange A Completely After DateRange B

_                        |---- DateRange A ------|
|---Date Range B -----| _

(True if StartA > EndB)

Let ConditionB Mean that DateRange A is Completely Before DateRange B

|---- DateRange A -----|                        _ 
_ |---Date Range B ----|

(True if EndA < StartB)

Then Overlap exists if Neither A Nor B is true -

(If one range is neither completely after the other,

nor completely before the other,
then they must overlap.)

Now one of De Morgan's laws says that:

Not (A Or B) <=> Not A And Not B

Which translates to: (StartA <= EndB) and (EndA >= StartB)


NOTE: This includes conditions where the edges overlap exactly. If you wish to exclude that,

change the >= operators to >, and <= to <


NOTE2. Thanks to @Baodad, see this blog, the actual overlap is least of:

{ endA-startA, endA - startB, endB-startA, endB - startB }

(StartA <= EndB) and (EndA >= StartB)
(StartA <= EndB) and (StartB <= EndA)


NOTE3. Thanks to @tomosius, a shorter version reads:

DateRangesOverlap = max(start1, start2) < min(end1, end2)

This is actually a syntactical shortcut for what is a longer implementation, which includes extra checks to verify that the start dates are on or before the endDates. Deriving this from above:

If start and end dates can be out of order, i.e., if it is possible that startA > endA or startB > endB, then you also have to check that they are in order, so that means you have to add two additional validity rules:

(StartA <= EndB) and (StartB <= EndA) and (StartA <= EndA) and (StartB <= EndB)
or:

(StartA <= EndB) and (StartA <= EndA) and (StartB <= EndA) and (StartB <= EndB)
or,

(StartA <= Min(EndA, EndB) and (StartB <= Min(EndA, EndB))
or:

(Max(StartA, StartB) <= Min(EndA, EndB)

But to implement Min() and Max(), you have to code, (using C ternary for terseness),:

(StartA > StartB? Start A: StartB) <= (EndA < EndB? EndA: EndB)

What's the most efficient way to test if two ranges overlap?

What does it mean for the ranges to overlap? It means there exists some number C which is in both ranges, i.e.

x1 <= C <= x2

and

y1 <= C <= y2

To avoid confusion, considering the ranges are:
[x1:x2] and [y1:y2]

Now, if we are allowed to assume that the ranges are well-formed (so that x1 <= x2 and y1 <= y2) then it is sufficient to test

x1 <= y2 && y1 <= x2

OR

(StartA <= EndB) and (EndA >= StartB)

Efficiently find the overlap between two time intervals in R

Maybe try data.table foverlaps function:

library(data.table)
setDT(dat)
setkey(dat, start, end)
foverlaps(dat, dat)[id != i.id]

Finding overlaps between millions of ranges/intervals

Here's an algorithm.

  1. Prepare a set of intervals sorted by starting point. Initially the set is empty.
  2. Sort all starting and ending points.
  3. Traverse the points. If a starting point is encountered, add the corresponding interval to the set. If an ending point is encountered, remove the corresponding interval from the set.
  4. When removing an interval, look at other intervals in the set. They all overlap the interval being removed, and they are sorted by the length of the overlap, longest first. Traverse the set until the length is too short, and report each overlap.

Efficiently finding overlapping intervals from a list of intervals

Create a single, sorted array of transitions. Each transition has a position, and a cumulative number based on how many intervals you're joining or leaving. As you pass through the list, keep track of how many intervals you are in. When you're in as many intervals as series, that's when you're in a common interval.

For your example the transitions would be:

[2, 1], [6, -1], [7, 1], [11, -1],
[1, 1], [3, -1], [5, 1], [10, -1], [11, 1], [13, -1]
[2, 1], [5, -1], [6, 1], [8, -1]

which after sorting by position and merging collapses to:

[1, 1], [2, 2], [3, -1], [5, 0], [6, 0], [7, 1], [8, -1], [10, -1], [11, 0], [13, -1]

which gives you transitions for running totals of:

[1, 1], [2, 3], [3, 2], [7, 3], [8, 2], [10, 2], [13, 1]

And then we can read off the intervals where we're at 3 as one starting at 2 and going to 3, and another starting at 7 and going to 8. Which is the answer.

The idea of creating one long list and sorting is admittedly extra work. You can instead create those lists and merge them on the fly. The savings is a factor of the log of the number of series rather than the log of the number of events.

Possible Interview Question: How to Find All Overlapping Intervals

Throw the endpoints of the intervals into an array, marking them as either start- or end-points. Sort them by breaking ties by placing end-points before start-points if the intervals are closed, or the other way around if they're half-open.

1S, 2S, 3E, 4E, 5S, 10E, 12S, 13S, 14E, 15E

Then iterate through the list, keeping track of how many intervals we're in (this equates to number of start-points processed minus number of end-points). Whenever we hit a start-point while we are already in an interval, this means we must have overlapping intervals.

1S, 2S, 3E, 4E, 5S, 10E, 12S, 13S, 14E, 15E
^ ^
overlap overlap

We can find which intervals overlap with which by storing data alongside the end-points, and keeping track of which intervals we're in.

This is an O(N logN) solution, with sorting being the main factor.

search for interval overlap in list of intervals?

[a, b] overlaps with [x, y] iff b > x and a < y. Sorting intervals by their first elements gives you intervals matching the first condition in log time. Sorting intervals by their last elements gives you intervals matching the second condition in log time. Take the intersections of the resulting sets.

Finding intervals of a set that are overlapping

Build a list of values (a, -1), (b, 1) for every interval [a, b]. Now sorting these lets you run through the array, counting how many intervals are currently open at each endpoint, just by aggregating the +1 and -1s.

It's important that (b, 1) comes after (b, -1) in the sorted list; because we want to count the intersection even if the intersection is at a single point.

Complete code here.

def overlaps(intervals):
es = []
for a, b in intervals:
es.append((a, -1))
es.append((b, 1))
es.sort()
result = 0
n = 0
for a, s in es:
if s == -1: result += n
n -= s
return result

Note, this is trivially O(n log n), and doesn't need any fancy data-structures.



Related Topics



Leave a reply



Submit