Finding Overlaps Between Interval Sets/Efficient Overlap Joins

Finding Overlaps between interval sets / Efficient Overlap Joins

Ha, nice timing :). Just a few days back, overlap joins (or interval joins) was implemented. in data.table The function is foverlaps() and is available from the github project page. Make sure to have a look at ?foverlaps.

setkey(ref, space, t1, t2)
foverlaps(map, ref, type="within", nomatch=0L)

I think this is what you're after. This'll result in the join result only where there's a match, and it'll check for t1,t2 overlaps between ref and map within space identifier.. If you don't want that, just remove space from the key column. And if you want all matches, remove nomatch=0L - the default is nomatch=NA which returns all.

The function is new (but has been rigorously tested) and is therefore not feature complete. If you've any suggestions for improvement or come across any issues, please feel free to file an issue.

Finding overlapping ranges between two interval data

In general, it's very appropriate to use the bioconductor package IRanges to deal with problems related to intervals. It does so efficiently by implementing interval tree. GenomicRanges is another package that builds on top of IRanges, specifically for handling, well, "Genomic Ranges".

require(GenomicRanges)
gr1 = with(dtFrags, GRanges(Rle(factor(chr,
levels=c("1", "2", "X", "Y"))), IRanges(start, end)))
gr2 = with(dtCoords, GRanges(Rle(factor(chr,
levels=c("1", "2", "X", "Y"))), IRanges(coord, coord)))
olaps = findOverlaps(gr2, gr1)
dtCoords[, grp := seq_len(nrow(dtCoords))]
dtFrags[subjectHits(olaps), grp := queryHits(olaps)]
setkey(dtCoords, grp)
setkey(dtFrags, grp)
dtFrags[, list(grp, id, type)][dtCoords]

grp id type id.1 chr coord
1: 1 1 exon 10 1 150
2: 2 2 intron 20 2 300
3: 2 4 exon 20 2 300
4: 3 NA NA 30 Y 500

Finding an efficient way to count the number of overlaps between interval sets in two tables?

You can use which=TRUE argument to get the position of overlaps, and then use to get the counts by doing a simple aggregation:

ans = foverlaps(map, ref, type="within", nomatch=0L, which=TRUE)[, .N, by=yid]
# yid N
# 1: 1 10
# 2: 2 20
# 3: 3 20

And then get this back in ref. But we should provide a more direct way to accomplish this.

Efficiently finding overlapping intervals from a list of intervals

Create a single, sorted array of transitions. Each transition has a position, and a cumulative number based on how many intervals you're joining or leaving. As you pass through the list, keep track of how many intervals you are in. When you're in as many intervals as series, that's when you're in a common interval.

For your example the transitions would be:

[2, 1], [6, -1], [7, 1], [11, -1],
[1, 1], [3, -1], [5, 1], [10, -1], [11, 1], [13, -1]
[2, 1], [5, -1], [6, 1], [8, -1]

which after sorting by position and merging collapses to:

[1, 1], [2, 2], [3, -1], [5, 0], [6, 0], [7, 1], [8, -1], [10, -1], [11, 0], [13, -1]

which gives you transitions for running totals of:

[1, 1], [2, 3], [3, 2], [7, 3], [8, 2], [10, 2], [13, 1]

And then we can read off the intervals where we're at 3 as one starting at 2 and going to 3, and another starting at 7 and going to 8. Which is the answer.

The idea of creating one long list and sorting is admittedly extra work. You can instead create those lists and merge them on the fly. The savings is a factor of the log of the number of series rather than the log of the number of events.

What's the most efficient way to test if two ranges overlap?

What does it mean for the ranges to overlap? It means there exists some number C which is in both ranges, i.e.

x1 <= C <= x2

and

y1 <= C <= y2

To avoid confusion, considering the ranges are:
[x1:x2] and [y1:y2]

Now, if we are allowed to assume that the ranges are well-formed (so that x1 <= x2 and y1 <= y2) then it is sufficient to test

x1 <= y2 && y1 <= x2

OR

(StartA <= EndB) and (EndA >= StartB)

Finding overlaps between 2 ranges and their overlapped region lengths?

Should point you in the right direction:

Load libraries

# install.packages("BiocManager")
# BiocManager::install("GenomicRanges")
library(GenomicRanges)
library(IRanges)

Generate data

gp1 <- read.table(text = 
"
chr start end id1
chr1 580 600 1
chr1 900 970 2
chr3 400 600 3
chr2 100 700 4
", header = TRUE)

gp2 <- read.table(text =
"
chr start end id2
chr1 590 864 1
chr3 550 670 2
chr2 897 1987 3
", header = TRUE)

Calculate ranges

gr1 <- GenomicRanges::makeGRangesFromDataFrame(
gp1,
seqnames.field = "chr",
start.field = "start",
end.field = "end"
)
gr2 <- GenomicRanges::makeGRangesFromDataFrame(
gp2,
seqnames.field = "chr",
start.field = "start",
end.field = "end"
)

Calculate overlaps

hits <- findOverlaps(gr1, gr2)
p <- Pairs(gr1, gr2, hits = hits)
i <- pintersect(p)

Result

> as.data.frame(i)
seqnames start end width strand hit
1 chr1 590 600 11 * TRUE
2 chr3 550 600 51 * TRUE

Match Value and Interval via overlaps

You can try fuzzyjoin :

fuzzyjoin::fuzzy_left_join(test, test_2, 
by = c('amount' = 'top', 'amount' = 'bottom'),
match_fun = list(`<=`, `>=`))

# id amount interval bottom top
#1 1 19 1 0 24
#2 1 21 1 0 24
#3 1 39 2 25 49
#4 1 45 2 25 49
#5 1 62 3 50 74
#6 1 71 3 50 74
#7 1 100 5 100 124
#8 1 121 5 100 124
#9 1 130 6 125 149
#10 1 160 7 150 174
#11 1 180 8 175 199
#12 1 210 9 200 224
#13 1 240 NA NA NA

How to efficiently find overlapping intervals?

If I understand correctly, you want to separate your current df into data frames where the initial interval is set by the first row, and the second interval is defined by the first row that does not intersect, etc. The below method will do that and should be pretty efficient if the number of groups isn't too large:

df = df.sort_values("f_low").reset_index(drop=True)
idx = 0
dfs = []
while True:
low = df.f_low[idx]
high = df.f_high[idx]
sub_df = df[(df.f_low <= high) & (low <= df.f_low)]
dfs.append(sub_df)
idx = sub_df.index.max() + 1
if idx > df.index.max():
break

output:

[      f_low    f_high
0 0.476201 0.481915
1 0.479161 0.484977,
f_low f_high
2 0.485997 0.491911,
f_low f_high
3 0.503259 0.508679
4 0.504687 0.510075
5 0.504687 0.670075,
f_low f_high
6 0.666093 0.670438,
f_low f_high
7 0.765602 0.770028
8 0.766884 0.771307,
f_low f_high
9 0.775986 0.780398,
f_low f_high
10 0.79459 0.798965]


Related Topics



Leave a reply



Submit