Finding Overlaps Between Interval Sets/Efficient Overlap Joins

Finding Overlaps between interval sets / Efficient Overlap Joins

Ha, nice timing :). Just a few days back, overlap joins (or interval joins) was implemented. in data.table The function is foverlaps() and is available from the github project page. Make sure to have a look at ?foverlaps.

setkey(ref, space, t1, t2)
foverlaps(map, ref, type="within", nomatch=0L)

I think this is what you're after. This'll result in the join result only where there's a match, and it'll check for t1,t2 overlaps between ref and map within space identifier.. If you don't want that, just remove space from the key column. And if you want all matches, remove nomatch=0L - the default is nomatch=NA which returns all.

The function is new (but has been rigorously tested) and is therefore not feature complete. If you've any suggestions for improvement or come across any issues, please feel free to file an issue.

Finding overlapping ranges between two interval data

In general, it's very appropriate to use the bioconductor package IRanges to deal with problems related to intervals. It does so efficiently by implementing interval tree. GenomicRanges is another package that builds on top of IRanges, specifically for handling, well, "Genomic Ranges".

require(GenomicRanges)
gr1 = with(dtFrags, GRanges(Rle(factor(chr, 
          levels=c("1", "2", "X", "Y"))), IRanges(start, end)))
gr2 = with(dtCoords, GRanges(Rle(factor(chr, 
          levels=c("1", "2", "X", "Y"))), IRanges(coord, coord)))
olaps = findOverlaps(gr2, gr1)
dtCoords[, grp := seq_len(nrow(dtCoords))]
dtFrags[subjectHits(olaps), grp := queryHits(olaps)]
setkey(dtCoords, grp)
setkey(dtFrags, grp)
dtFrags[, list(grp, id, type)][dtCoords]

   grp id   type id.1 chr coord
1:   1  1   exon   10   1   150
2:   2  2 intron   20   2   300
3:   2  4   exon   20   2   300
4:   3 NA     NA   30   Y   500

Finding an efficient way to count the number of overlaps between interval sets in two tables?

You can use which=TRUE argument to get the position of overlaps, and then use to get the counts by doing a simple aggregation:

ans = foverlaps(map, ref, type="within", nomatch=0L, which=TRUE)[, .N, by=yid]
#    yid  N
# 1:   1 10
# 2:   2 20
# 3:   3 20

And then get this back in ref. But we should provide a more direct way to accomplish this.

Efficiently finding overlapping intervals from a list of intervals

Create a single, sorted array of transitions. Each transition has a position, and a cumulative number based on how many intervals you're joining or leaving. As you pass through the list, keep track of how many intervals you are in. When you're in as many intervals as series, that's when you're in a common interval.

For your example the transitions would be:

[2, 1], [6, -1], [7, 1], [11, -1],
[1, 1], [3, -1], [5, 1], [10, -1], [11, 1], [13, -1]
[2, 1], [5, -1], [6, 1], [8, -1]

which after sorting by position and merging collapses to:

[1, 1], [2, 2], [3, -1], [5, 0], [6, 0], [7, 1], [8, -1], [10, -1], [11, 0], [13, -1]

which gives you transitions for running totals of:

[1, 1], [2, 3], [3, 2], [7, 3], [8, 2], [10, 2], [13, 1]

And then we can read off the intervals where we're at 3 as one starting at 2 and going to 3, and another starting at 7 and going to 8. Which is the answer.

The idea of creating one long list and sorting is admittedly extra work. You can instead create those lists and merge them on the fly. The savings is a factor of the log of the number of series rather than the log of the number of events.

What's the most efficient way to test if two ranges overlap?

What does it mean for the ranges to overlap? It means there exists some number C which is in both ranges, i.e.

x1 <= C <= x2

and

y1 <= C <= y2

To avoid confusion, considering the ranges are:
[x1:x2] and [y1:y2]

Now, if we are allowed to assume that the ranges are well-formed (so that x1 <= x2 and y1 <= y2) then it is sufficient to test

x1 <= y2 && y1 <= x2

(StartA <= EndB) and (EndA >= StartB)

Finding overlaps between 2 ranges and their overlapped region lengths?

Should point you in the right direction:

Load libraries

# install.packages("BiocManager")
# BiocManager::install("GenomicRanges")
library(GenomicRanges)
library(IRanges)

Generate data

gp1 <- read.table(text = 
"
chr   start   end   id1 
chr1  580     600    1
chr1  900     970    2
chr3  400     600    3
chr2  100     700    4
", header = TRUE)

gp2 <- read.table(text = 
                      "
chr   start   end   id2
chr1  590     864   1
chr3  550     670   2
chr2  897     1987  3
", header = TRUE)

Calculate ranges

gr1 <- GenomicRanges::makeGRangesFromDataFrame(
    gp1, 
    seqnames.field = "chr", 
    start.field = "start", 
    end.field = "end"
)
gr2 <- GenomicRanges::makeGRangesFromDataFrame(
    gp2, 
    seqnames.field = "chr", 
    start.field = "start", 
    end.field = "end"
)

Calculate overlaps

hits <- findOverlaps(gr1, gr2)
p <- Pairs(gr1, gr2, hits = hits)
i <- pintersect(p)

Result

> as.data.frame(i)
  seqnames start end width strand  hit
1     chr1   590 600    11      * TRUE
2     chr3   550 600    51      * TRUE

Match Value and Interval via overlaps

You can try fuzzyjoin :

fuzzyjoin::fuzzy_left_join(test, test_2, 
                           by = c('amount' = 'top', 'amount' = 'bottom'), 
                           match_fun = list(`<=`, `>=`))

#   id amount interval bottom top
#1   1     19        1      0  24
#2   1     21        1      0  24
#3   1     39        2     25  49
#4   1     45        2     25  49
#5   1     62        3     50  74
#6   1     71        3     50  74
#7   1    100        5    100 124
#8   1    121        5    100 124
#9   1    130        6    125 149
#10  1    160        7    150 174
#11  1    180        8    175 199
#12  1    210        9    200 224
#13  1    240       NA     NA  NA

How to efficiently find overlapping intervals?

If I understand correctly, you want to separate your current df into data frames where the initial interval is set by the first row, and the second interval is defined by the first row that does not intersect, etc. The below method will do that and should be pretty efficient if the number of groups isn't too large:

df = df.sort_values("f_low").reset_index(drop=True)
idx = 0
dfs = []
while True:
    low = df.f_low[idx]
    high = df.f_high[idx]
    sub_df = df[(df.f_low <= high) & (low <= df.f_low)]
    dfs.append(sub_df)
    idx = sub_df.index.max() + 1
    if idx > df.index.max():
        break

output:

[      f_low    f_high
 0  0.476201  0.481915
 1  0.479161  0.484977,
       f_low    f_high
 2  0.485997  0.491911,
       f_low    f_high
 3  0.503259  0.508679
 4  0.504687  0.510075
 5  0.504687  0.670075,
       f_low    f_high
 6  0.666093  0.670438,
       f_low    f_high
 7  0.765602  0.770028
 8  0.766884  0.771307,
       f_low    f_high
 9  0.775986  0.780398,
       f_low    f_high
 10  0.79459  0.798965]

Finding Overlaps Between Interval Sets/Efficient Overlap Joins