Finding Overlap in Ranges with R

Finding overlap in ranges with R

This is easier and faster if you merge the two objects first. Note that the filter below keeps the rows where range A lies entirely within range B.

ranges <- merge(rangesA,rangesB,by="chrom",suffixes=c("A","B"))
ranges[with(ranges, startB <= startA & stopB >= stopA),]
# chrom startA stopA startB stopB
#1 1 200 250 200 265
#2 5 100 105 99 106
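The filter above tests containment rather than overlap in general. As a minimal sketch of the difference, assuming small made-up rangesA and rangesB data frames (the values are hypothetical; only the column names follow the answer), the two conditions could be written as:

```r
# Hypothetical sample data; column names match the answer above.
rangesA <- data.frame(chrom = c(1, 1, 5),
                      start = c(200, 300, 100),
                      stop  = c(250, 350, 105))
rangesB <- data.frame(chrom = c(1, 5, 5),
                      start = c(240, 99, 200),
                      stop  = c(265, 106, 300))

# Pair every A range with every B range on the same chromosome.
ranges <- merge(rangesA, rangesB, by = "chrom", suffixes = c("A", "B"))

# Containment: A lies entirely inside B.
contained <- with(ranges, startB <= startA & stopB >= stopA)

# Any overlap: each range starts no later than the other one ends.
overlapping <- with(ranges, startA <= stopB & startB <= stopA)

ranges[overlapping, ]
```

Because merge() builds every A/B pair per chromosome, the intermediate table can get large when chromosomes hold many ranges.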

comparing and finding overlap range in R

You may also try foverlaps from data.table

library(data.table)
setkey(setDT(df1), start1, end1)
setkey(setDT(df2), start2, end2)
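# mult = "first" returns one match index per row of df1 (NA when nothing overlaps),
# so the result always has the same length as df1
df1[, overlap := as.integer(!is.na(foverlaps(df1, df2, which = TRUE, mult = "first")))]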
df1
# start1 end1 overlap
#1: 1 6 1
#2: 6 8 0
#3: 9 12 1
#4: 13 15 1
#5: 15 19 1
#6: 19 20 0

Finding overlap in dataframe ranges in R

Another data.table option using foverlaps:

setkeyv(BedA, names(BedA))
setkeyv(BedB, names(BedB))
ans <- foverlaps(BedB, BedA, nomatch=0L)
setnames(ans, c("start","end","i.start","i.end"), c("start.A","end.A","start.B","end.B"))

output:

   chr start.A end.A start.B end.B
1:   2     100   500      99   106
2:   2     100   500     210   265
3:   2     200   250     210   265

data:

library(data.table)
BedA <- fread("chr start end
2 100 500
2 200 250
3 275 300")

BedB <- fread("chr start end
2 210 265
2 99 106
8 275 290")

Finding the overlapping range of a set of vectors in R

Here is a solution which seems to return the expected result for the given sample datasets.

It takes the vector of all unique interval endpoints and counts the number of intervals they are intersecting (by aggregating in a non-equi join). Among the subset of points with the maximum number of intersections, the range is taken.

library(data.table)
# enhanced dataset with 2 additional intervals
dt <- fread("lb, ub
1.5, 6
3 , 5
2.1, 3.7
1 , 10.1
8.3 , 12
20 , 25")

mdt <- dt[, .(b = unique(unlist(.SD)))]
res <- dt[mdt, on = .(lb <= b, ub >= b), .N, by = .EACHI][N == max(N), range(lb)]
res
[1] 3.0 3.7
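The same endpoint-counting idea can be sketched in base R, without the non-equi join. This is only an illustration of the logic described above, using the same six intervals; the variable names pts, n_cover and best are made up for this sketch.

```r
# Interval bounds from the enhanced sample dataset above.
lb <- c(1.5, 3, 2.1, 1, 8.3, 20)
ub <- c(6, 5, 3.7, 10.1, 12, 25)

# All unique interval endpoints, in ascending order.
pts <- sort(unique(c(lb, ub)))

# For each endpoint, count the intervals that contain it (bounds inclusive).
n_cover <- vapply(pts, function(p) sum(lb <= p & ub >= p), integer(1))

# Range spanned by the endpoints with the maximum count.
best <- range(pts[n_cover == max(n_cover)])
best
```

This loops over the endpoints one by one, so it is simpler to read but will likely be slower than the join for many intervals.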

visualisation

library(ggplot2)
ggplot(dt) +
  aes(x = lb, y = seq_along(lb), xend = ub, yend = seq_along(ub)) +
  geom_segment() +
  geom_vline(xintercept = res, col = "red", lty = 2)

EDIT: Handling of no overlaps

The OP has pointed out that the case where there are no overlaps needs to be recognized and handled separately. So I have modified the code:

mdt <- dt[, .(b = unique(unlist(.SD)))]
res <- dt[mdt, on = .(lb <= b, ub >= b), .N, by = .EACHI][
  N == max(N), {
    if (max(N) > 1) {
      cat("Maximum overlaps found:", max(N), "out of", nrow(dt), "intervals\n")
      range(lb)
    } else {
      cat("No overlaps found\n")
      NULL
    }
  }]

This code will recognize the situation where there are no overlaps and will return NULL in these cases. In addition, a message is printed.

In all other cases, it will print an informative message, e.g.,

Maximum overlaps found: 4 out of 6 intervals

For OP's sample dataset without overlaps

dt <- data.table(lb = c(3, 6, 10), ub = c(5, 9, 15))

it will print

No overlaps found

Caveat

In case of multiple solutions, the code above returns the overall range, i.e., the start of the first interval and the end of the last interval, instead of a list of separate intervals.

Sample data for this use case:

dt <- fread("lb, ub
1.5, 6
3 , 5
2.1, 3.7
1 , 10.1
11.5, 16
13 , 15
12.1, 13.7
11 , 20.1
0 , 22
")

So, there is a 5-fold overlap between 3 and 3.7 and a second 5-fold overlap between 13 and 13.7.
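One possible way to report the separate regions is sketched below in base R. It is not part of the original answer: it counts coverage at every endpoint, then links neighbouring endpoints only when the gap between them is also covered the maximum number of times. All helper names (pts, n_cover, gap_cover, run) are made up for this sketch.

```r
# Interval bounds from the sample data for this use case.
lb <- c(1.5, 3, 2.1, 1, 11.5, 13, 12.1, 11, 0)
ub <- c(6, 5, 3.7, 10.1, 16, 15, 13.7, 20.1, 22)

pts <- sort(unique(c(lb, ub)))
n   <- length(pts)

# Coverage at each endpoint (bounds inclusive).
n_cover <- vapply(pts, function(p) sum(lb <= p & ub >= p), integer(1))
is_max  <- n_cover == max(n_cover)

# Coverage of the gap between each pair of neighbouring endpoints.
gap_cover <- vapply(seq_len(n - 1),
                    function(i) sum(lb <= pts[i] & ub >= pts[i + 1]),
                    integer(1))

# Neighbouring endpoints belong to the same region only if both of them
# and the gap between them reach the maximum coverage.
linked <- is_max[-n] & is_max[-1] & gap_cover == max(n_cover)
run    <- cumsum(c(TRUE, !linked))

# One row per region of maximum overlap.
res <- data.frame(
  lb = as.numeric(tapply(pts[is_max], run[is_max], min)),
  ub = as.numeric(tapply(pts[is_max], run[is_max], max))
)
res
```

For this dataset the sketch yields two rows, one per 5-fold overlap, rather than a single range from 3 to 13.7.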

Furthermore, there is another use case to consider: how should intervals be treated that overlap in only one point, i.e., where one interval ends exactly where another starts?
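The answer to that question comes down to whether the comparison is inclusive or strict. A small illustration with two hypothetical intervals that share only the point 5:

```r
# Two hypothetical intervals that touch at a single point.
a <- c(start = 1, end = 5)
b <- c(start = 5, end = 9)

# Closed intervals (<=): a shared endpoint counts as an overlap.
closed_overlap <- a[["start"]] <= b[["end"]] && b[["start"]] <= a[["end"]]

# Open intervals (<): a shared endpoint does not count.
open_overlap <- a[["start"]] < b[["end"]] && b[["start"]] < a[["end"]]

c(closed = closed_overlap, open = open_overlap)
```

The join conditions lb <= b and ub >= b used earlier are inclusive, so they correspond to the closed case.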

Trying to determine if two ranges of dates overlap using R

You can use int_overlaps() in lubridate.

library(dplyr)
library(lubridate)

my_data %>%
  mutate(overlap = int_overlaps(interval(entry_date_1, withdrawal_date_1),
                                interval(entry_date_2, withdrawal_date_2)))

# student_id entry_date_1 withdrawal_date_1 entry_date_2 withdrawal_date_2 overlap
# 1 1 2017-11-09 2018-05-24 <NA> <NA> NA
# 2 2 2017-08-14 2017-12-15 2017-12-16 2018-05-24 FALSE
# 3 3 2017-08-14 2018-06-01 2018-01-16 2018-03-20 TRUE
# 4 4 2018-01-24 2018-02-25 2018-04-03 2018-05-24 FALSE
# 5 5 2017-10-04 2017-11-11 2017-12-12 2018-05-24 FALSE
# 6 6 2017-08-14 2018-05-24 <NA> <NA> NA
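For comparison, the same check can be sketched in base R with plain Date comparisons. This is an alternative to the lubridate call, not the original answer's method; it treats both intervals as closed and reuses three rows from the output above.

```r
# Three rows taken from the output above (students 2, 3 and 1).
my_data <- data.frame(
  entry_date_1      = as.Date(c("2017-08-14", "2017-08-14", "2017-11-09")),
  withdrawal_date_1 = as.Date(c("2017-12-15", "2018-06-01", "2018-05-24")),
  entry_date_2      = as.Date(c("2017-12-16", "2018-01-16", NA)),
  withdrawal_date_2 = as.Date(c("2018-05-24", "2018-03-20", NA))
)

# Two closed intervals overlap when each starts on or before the other's end.
# Missing dates propagate as NA, as in the lubridate output.
my_data$overlap <- with(
  my_data,
  entry_date_1 <= withdrawal_date_2 & entry_date_2 <= withdrawal_date_1
)
my_data
```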

Check if two intervals overlap in R

Sort them

rng = cbind(pmin(ranges[,1], ranges[,2]), pmax(ranges[,1], ranges[,2]),
            pmin(ranges[,3], ranges[,4]), pmax(ranges[,3], ranges[,4]))

and write the condition

olap = (rng[,1] <= rng[,4]) & (rng[,2] >= rng[,3])

In one step this might be

(pmin(ranges[,1], ranges[,2]) <= pmax(ranges[,3], ranges[,4])) &
(pmax(ranges[,1], ranges[,2]) >= pmin(ranges[,3], ranges[,4]))

The foverlaps() function mentioned by others (or IRanges::findOverlaps()) would be appropriate if you were looking for overlaps between any ranges, but you're looking for 'parallel' (within-row?) overlaps.

The logic of the solution here is the same as the answer of @Julius, but is 'vectorized' (e.g., 1 call to pmin(), rather than nrow(ranges) calls to sort()) and should be much faster (though using more memory) for longer vectors of possible ranges.
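A small runnable illustration of the two steps, assuming a hypothetical four-column matrix in which columns 1 and 2 hold the first range and columns 3 and 4 the second, in either order:

```r
# Hypothetical input: each row holds two ranges, endpoints possibly unsorted.
ranges <- rbind(c(1, 5, 4, 8),    # [1, 5] and [4, 8] overlap
                c(10, 7, 1, 3),   # [7, 10] and [1, 3] do not
                c(2, 6, 9, 6))    # [2, 6] and [6, 9] touch at 6

# Sort the endpoints of each range row by row.
rng <- cbind(pmin(ranges[, 1], ranges[, 2]), pmax(ranges[, 1], ranges[, 2]),
             pmin(ranges[, 3], ranges[, 4]), pmax(ranges[, 3], ranges[, 4]))

# One logical value per row; <= and >= count touching ranges as overlapping.
olap <- (rng[, 1] <= rng[, 4]) & (rng[, 2] >= rng[, 3])
olap
```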

Find value of overlapping ranges of integers in R that are NOT times or genomes

I restructured your activities table into a long form so you can do all 4 calculations at once. Then the overlaps join is done, then you can calculate the overlap length from the results.

library(data.table)

# min_1 ... max_4 are the activity limits defined in the question
activities <- data.table(
  act = c('act_1','act_2','act_3','act_4'),
  a_min = c(min_1, min_2, min_3, min_4),
  a_max = c(max_1, max_2, max_3, max_4)
)

spp_id <- c("a", "b", "c", "d")
spp_depth_min <- c(0, 20, 30, 40)
spp_depth_max <- c(200, 500, 50, 100)

species <- data.table(spp_id, spp_depth_min, spp_depth_max)

setkey(activities,a_min,a_max)

ol <- foverlaps(species, activities,
  by.x = c('spp_depth_min','spp_depth_max'),
  by.y = c('a_min','a_max')
)
ol[, ol_length := pmin(spp_depth_max, a_max) - pmax(spp_depth_min, a_min)]
ol
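The overlap length is the smaller of the two upper bounds minus the larger of the two lower bounds. A base-R sketch of just that arithmetic, using the species depths from above and a single hypothetical activity range of 60 to 250 (the real activity limits are not shown in the answer):

```r
# Species depth ranges from the answer above.
spp_depth_min <- c(0, 20, 30, 40)
spp_depth_max <- c(200, 500, 50, 100)

# Hypothetical activity range, chosen only for illustration.
a_min <- 60
a_max <- 250

# Smaller upper bound minus larger lower bound; negative means no overlap.
raw_length <- pmin(spp_depth_max, a_max) - pmax(spp_depth_min, a_min)

# Clamp at zero so non-overlapping pairs report a length of 0.
ol_length <- pmax(raw_length, 0)
ol_length
```

In the foverlaps() version the join returns only overlapping pairs by default, so the subtraction there should not go negative and the clamp is not needed.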

Finding overlaps between 2 ranges and their overlapped region lengths?

Should point you in the right direction:

Load libraries

# install.packages("BiocManager")
# BiocManager::install("GenomicRanges")
library(GenomicRanges)
library(IRanges)

Generate data

gp1 <- read.table(text = 
"
chr start end id1
chr1 580 600 1
chr1 900 970 2
chr3 400 600 3
chr2 100 700 4
", header = TRUE)

gp2 <- read.table(text =
"
chr start end id2
chr1 590 864 1
chr3 550 670 2
chr2 897 1987 3
", header = TRUE)

Calculate ranges

gr1 <- GenomicRanges::makeGRangesFromDataFrame(
  gp1,
  seqnames.field = "chr",
  start.field = "start",
  end.field = "end"
)
gr2 <- GenomicRanges::makeGRangesFromDataFrame(
  gp2,
  seqnames.field = "chr",
  start.field = "start",
  end.field = "end"
)

Calculate overlaps

hits <- findOverlaps(gr1, gr2)
p <- Pairs(gr1, gr2, hits = hits)
i <- pintersect(p)

Result

> as.data.frame(i)
  seqnames start end width strand  hit
1     chr1   590 600    11      * TRUE
2     chr3   550 600    51      * TRUE
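As a cross-check on the widths, the same intersections can be sketched in base R. This is an illustration only, not a replacement for GenomicRanges; it assumes closed integer ranges, so the width is end - start + 1.

```r
# Same data as gp1 and gp2 above, built directly as data frames.
gp1 <- data.frame(chr   = c("chr1", "chr1", "chr3", "chr2"),
                  start = c(580, 900, 400, 100),
                  end   = c(600, 970, 600, 700))
gp2 <- data.frame(chr   = c("chr1", "chr3", "chr2"),
                  start = c(590, 550, 897),
                  end   = c(864, 670, 1987))

# Pair ranges that share a chromosome.
pairs <- merge(gp1, gp2, by = "chr", suffixes = c(".1", ".2"))

# Intersection: larger start to smaller end.
pairs$start <- pmax(pairs$start.1, pairs$start.2)
pairs$end   <- pmin(pairs$end.1, pairs$end.2)

# Keep pairs that actually intersect; both endpoints count towards the width.
hits <- pairs[pairs$start <= pairs$end, c("chr", "start", "end")]
hits$width <- hits$end - hits$start + 1
hits
```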

Finding overlapping ranges between two interval data

In general, the Bioconductor package IRanges is well suited to problems involving intervals. It handles them efficiently by implementing an interval tree. GenomicRanges is another package that builds on top of IRanges, specifically for handling, well, "genomic ranges".

library(data.table)      # dtFrags and dtCoords are data.tables
library(GenomicRanges)
gr1 = with(dtFrags, GRanges(Rle(factor(chr,
          levels=c("1", "2", "X", "Y"))), IRanges(start, end)))
gr2 = with(dtCoords, GRanges(Rle(factor(chr,
          levels=c("1", "2", "X", "Y"))), IRanges(coord, coord)))
olaps = findOverlaps(gr2, gr1)
# tag each coordinate with a group id, then copy it to the fragments it hits
dtCoords[, grp := seq_len(nrow(dtCoords))]
dtFrags[subjectHits(olaps), grp := queryHits(olaps)]
setkey(dtCoords, grp)
setkey(dtFrags, grp)
dtFrags[, list(grp, id, type)][dtCoords]

   grp id   type id.1 chr coord
1:   1  1   exon   10   1   150
2:   2  2 intron   20   2   300
3:   2  4   exon   20   2   300
4:   3 NA     NA   30   Y   500

Identify overlapping date ranges by ID R

First convert the dates to Date class. A self join on id with the intersection criteria then joins all overlapping rows. The overlap column is 1 if that row has an overlap and 0 otherwise; overlaps lists the row numbers of the rows it overlaps. The code uses the row numbers rowid, but each occurrence could be replaced with row_n if desired.

library(sqldf)

fmt <- "%m/%d/%Y"
eg2 <- transform(eg_data,
  start_dt = as.Date(start_dt, fmt),
  end_dt = as.Date(end_dt, fmt))

sqldf("select
    a.*,
    count(b.rowid) > 0 as overlap,
    coalesce(group_concat(b.rowid), '') as overlaps
  from eg2 a
  left join eg2 b on a.id = b.id and
    not a.rowid = b.rowid and
    ((a.start_dt between b.start_dt and b.end_dt) or
     (b.start_dt between a.start_dt and a.end_dt))
  group by a.rowid
  order by a.rowid")

giving:

   id   start_dt     end_dt row_n overlap overlaps
1 1 2016-01-01 2016-12-01 1 0
2 1 2016-12-02 2017-03-14 2 1 3
3 1 2017-03-12 2017-05-15 3 1 2
4 2 2016-02-01 2016-05-15 4 0
5 2 2016-08-12 2016-12-29 5 0
6 3 2016-01-01 2016-03-02 6 0
7 3 2016-03-05 2016-04-29 7 0
8 3 2016-05-07 2016-06-29 8 0
9 3 2016-07-01 2016-08-31 9 0
10 3 2016-09-04 2016-09-25 10 0
11 3 2016-10-10 2016-11-29 11 0
12 4 2016-01-01 2016-05-31 12 1 13
13 4 2016-05-28 2016-08-19 13 1 12
14 5 2016-01-01 2016-06-10 14 1 15
15 5 2016-06-05 2016-07-25 15 1 14
16 5 2016-08-25 2016-08-29 16 0
17 5 2016-11-01 2016-12-30 17 0
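The same self join can be sketched in base R with merge(), for cases where sqldf is not available. This is an alternative to the SQL above rather than the original answer's method; it reuses the first five rows of the output, and the helper column row is made up for this sketch.

```r
# First five rows of the data shown above, already converted to Date.
eg2 <- data.frame(
  id       = c(1, 1, 1, 2, 2),
  start_dt = as.Date(c("2016-01-01", "2016-12-02", "2017-03-12",
                       "2016-02-01", "2016-08-12")),
  end_dt   = as.Date(c("2016-12-01", "2017-03-14", "2017-05-15",
                       "2016-05-15", "2016-12-29"))
)
eg2$row <- seq_len(nrow(eg2))   # hypothetical row identifier

# Self join on id: every row is paired with every row of the same id.
pairs <- merge(eg2, eg2, by = "id", suffixes = c(".a", ".b"))

# Keep pairs of different rows whose closed date ranges intersect.
pairs <- pairs[pairs$row.a != pairs$row.b &
               pairs$start_dt.a <= pairs$end_dt.b &
               pairs$start_dt.b <= pairs$end_dt.a, ]

# Flag rows that take part in at least one overlapping pair.
eg2$overlap <- as.integer(eg2$row %in% pairs$row.a)
eg2
```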

