Finding overlap in ranges with R
This would be a lot easier / faster if you can merge the two objects first.
ranges <- merge(rangesA,rangesB,by="chrom",suffixes=c("A","B"))
ranges[with(ranges, startB <= startA & stopB >= stopA),]
# chrom startA stopA startB stopB
#1 1 200 250 200 265
#2 5 100 105 99 106
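Note that the filter above keeps rows where range B fully contains range A. A minimal base R sketch of the difference between that test and a general overlap test follows; the rangesA and rangesB values are made up for illustration and are not the asker's data.

```r
# Hypothetical input tables (illustrative values only)
rangesA <- data.frame(chrom = c(1, 1, 5),
                      start = c(200, 260, 100),
                      stop  = c(250, 350, 105))
rangesB <- data.frame(chrom = c(1, 5, 5),
                      start = c(200, 99, 200),
                      stop  = c(265, 106, 300))

# Pair every A range with every B range on the same chromosome
ranges <- merge(rangesA, rangesB, by = "chrom", suffixes = c("A", "B"))

# B fully contains A (the condition used in the answer)
contained <- with(ranges, startB <= startA & stopB >= stopA)

# A and B share at least one point (closed intervals assumed)
overlapping <- with(ranges, startA <= stopB & stopA >= startB)

ranges[contained, ]    # 2 rows
ranges[overlapping, ]  # 3 rows: also the partial overlap 260-350 vs 200-265
```

Which condition is right depends on whether partial overlaps should count as a match.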
comparing and finding overlap range in R
You may also try foverlaps() from data.table.
library(data.table)
setkey(setDT(df1), start1, end1)
setkey(setDT(df2), start2, end2)
df1[,overlap:=foverlaps(df1, df2, which=TRUE)[, !is.na(yid),]+0]
df1
# start1 end1 overlap
#1: 1 6 1
#2: 6 8 0
#3: 9 12 1
#4: 13 15 1
#5: 15 19 1
#6: 19 20 0
Finding overlap in dataframe ranges in R
Another data.table option using foverlaps():
setkeyv(BedA, names(BedA))
setkeyv(BedB, names(BedB))
ans <- foverlaps(BedB, BedA, nomatch=0L)
setnames(ans, c("start","end","i.start","i.end"), c("start.A","end.A","start.B","end.B"))
output:
chr start.A end.A start.B end.B
1: 2 100 500 99 106
2: 2 100 500 210 265
3: 2 200 250 210 265
data:
library(data.table)
BedA <- fread("chr start end
2 100 500
2 200 250
3 275 300")
BedB <- fread("chr start end
2 210 265
2 99 106
8 275 290")
Finding the overlapping range of a set of vectors in R
Here is a solution which seems to return the expected result for the given sample datasets.
It takes the vector of all unique interval endpoints and counts the number of intervals they are intersecting (by aggregating in a non-equi join). Among the subset of points with the maximum number of intersections, the range is taken.
library(data.table)
# enhanced dataset with 2 additional intervals
dt <- fread("lb, ub
1.5, 6
3 , 5
2.1, 3.7
1 , 10.1
8.3 , 12
20 , 25")
mdt <- dt[, .(b = unique(unlist(.SD)))]
res <- dt[mdt, on = .(lb <= b, ub >= b), .N, by = .EACHI][N == max(N), range(lb)]
res
[1] 3.0 3.7
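For readers who want to follow the counting step without the non-equi join, the same idea can be sketched in base R. This is only an illustration of the logic described above, using the same six intervals; the variable names b and n are chosen here and are not part of the data.table solution.

```r
# Same intervals as in the data.table example above
lb <- c(1.5, 3, 2.1, 1, 8.3, 20)
ub <- c(6, 5, 3.7, 10.1, 12, 25)

# All unique interval endpoints
b <- unique(c(lb, ub))

# For each endpoint, count the closed intervals [lb, ub] that contain it
n <- vapply(b, function(p) sum(lb <= p & ub >= p), integer(1))

# Among the endpoints with the maximum count, take the range
res <- range(b[n == max(n)])
res
# [1] 3.0 3.7
```

This loops over the endpoints, so it is likely slower than the join on large inputs, but it makes the counting explicit.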
visualisation
library(ggplot2)
ggplot(dt) +
aes(x = lb, y = seq_along(lb), xend = ub, yend = seq_along(ub)) +
geom_segment() +
geom_vline(xintercept = res, col = "red", lty = 2)
EDIT: Handling of no overlaps
The OP has pointed out that the case where there are no overlaps needs to be recognized and handled separately. So I have modified the code:
mdt <- dt[, .(b = unique(unlist(.SD)))]
res <- dt[mdt, on = .(lb <= b, ub >= b), .N, by = .EACHI][
N == max(N), {
if (max(N) > 1) {
cat("Maximum overlaps found:", max(N), "out of", nrow(dt), "intervals\n")
range(lb)
} else {
cat("No overlaps found\n")
NULL
}
}]
This code will recognize the situation where there are no overlaps and will return NULL in these cases. In addition, a message is printed.
In all other cases, it will print an informative message, e.g.,
Maximum overlaps found: 4 out of 6 intervals
For OP's sample dataset without overlaps
dt <- data.table(lb = c(3, 6, 10), ub = c(5, 9, 15))
it will print
No overlaps found
Caveat
In case of multiple solutions, the code above will return the overall range, i.e., the start of the first interval and the end of the last interval, instead of a list of separate intervals.
Sample data for this use case:
dt <- fread("lb, ub
1.5, 6
3 , 5
2.1, 3.7
1 , 10.1
11.5, 16
13 , 15
12.1, 13.7
11 , 20.1
0 , 22
")
So, there is a 5-fold overlap between 3 and 3.7 and a second 5-fold overlap between 13 and 13.7.
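One possible way to report the separate regions is sketched below in base R. It is not part of the original answer and assumes closed intervals: a region of maximum depth must then begin at some lower bound and end at the nearest upper bound at or to the right of it. The names starts, depth and regions are chosen here for illustration.

```r
# Sample data from the caveat: two separate 5-fold overlaps
lb <- c(1.5, 3, 2.1, 1, 11.5, 13, 12.1, 11, 0)
ub <- c(6, 5, 3.7, 10.1, 16, 15, 13.7, 20.1, 22)

# Number of closed intervals covering each candidate start point
starts <- sort(unique(lb))
depth <- vapply(starts, function(s) sum(lb <= s & ub >= s), integer(1))

# Start points of maximum depth, each paired with the nearest
# upper bound at or to the right of it
s_max <- starts[depth == max(depth)]
e_max <- vapply(s_max, function(s) min(ub[ub >= s]), numeric(1))

regions <- data.frame(from = s_max, to = e_max, depth = max(depth))
regions
#   from   to depth
# 1    3  3.7     5
# 2   13 13.7     5
```

With closed intervals, two intervals that merely touch produce a region of zero width (from equals to), which relates to the question raised next.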
Furthermore, there is another use case which needs to be considered: How shall intervals be treated which overlap only in one point, i.e. one interval ends where another starts?
Trying to determine if two ranges of dates overlap using R
You can use int_overlaps() in lubridate.
library(dplyr)
library(lubridate)
my_data %>%
mutate(overlap = int_overlaps(interval(entry_date_1, withdrawal_date_1),
interval(entry_date_2, withdrawal_date_2)))
# student_id entry_date_1 withdrawal_date_1 entry_date_2 withdrawal_date_2 overlap
# 1 1 2017-11-09 2018-05-24 <NA> <NA> NA
# 2 2 2017-08-14 2017-12-15 2017-12-16 2018-05-24 FALSE
# 3 3 2017-08-14 2018-06-01 2018-01-16 2018-03-20 TRUE
# 4 4 2018-01-24 2018-02-25 2018-04-03 2018-05-24 FALSE
# 5 5 2017-10-04 2017-11-11 2017-12-12 2018-05-24 FALSE
# 6 6 2017-08-14 2018-05-24 <NA> <NA> NA
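If adding a dependency is not an option, the same row-wise test can be sketched with plain Date comparisons. The vectors below are rebuilt by hand from the first four rows of the output above, and the intervals are assumed to be closed at both ends.

```r
# Rows 1-4 of the output above as plain Date vectors
entry_1      <- as.Date(c("2017-11-09", "2017-08-14", "2017-08-14", "2018-01-24"))
withdrawal_1 <- as.Date(c("2018-05-24", "2017-12-15", "2018-06-01", "2018-02-25"))
entry_2      <- as.Date(c(NA, "2017-12-16", "2018-01-16", "2018-04-03"))
withdrawal_2 <- as.Date(c(NA, "2018-05-24", "2018-03-20", "2018-05-24"))

# Two closed ranges overlap when each one starts before the other ends
overlap <- entry_1 <= withdrawal_2 & entry_2 <= withdrawal_1
overlap
# NA FALSE TRUE FALSE
```

Missing dates propagate as NA, matching the overlap column shown above for these rows.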
Check if two intervals overlap in R
Sort them
rng = cbind(pmin(ranges[,1], ranges[,2]), pmax(ranges[,1], ranges[,2]),
pmin(ranges[,3], ranges[,4]), pmax(ranges[,3], ranges[,4]))
and write the condition
olap = (rng[,1] <= rng[,4]) & (rng[,2] >= rng[,3])
In one step this might be
(pmin(ranges[,1], ranges[,2]) <= pmax(ranges[,3], ranges[,4])) &
(pmax(ranges[,1], ranges[,2]) >= pmin(ranges[,3], ranges[,4]))
The foverlaps() function mentioned by others (or IRanges::findOverlaps()) would be appropriate if you were looking for overlaps between any range, but you're looking for 'parallel' (within-row?) overlaps.
The logic of the solution here is the same as the answer of @Julius, but is 'vectorized' (e.g., 1 call to pmin(), rather than nrow(ranges) calls to sort()) and should be much faster (though using more memory) for longer vectors of possible ranges.
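A small worked example of the vectorized condition, using a made-up ranges matrix (the asker's data is not shown here). It also shows how strict comparisons change the result for ranges that only touch at one point.

```r
# Hypothetical matrix: columns 1-2 are the first range, columns 3-4 the second;
# endpoints within a range may be given in either order
ranges <- rbind(c(1, 5, 4, 8),   # overlap on [4, 5]
                c(5, 1, 6, 9),   # no overlap, first range given in reverse
                c(1, 3, 3, 7),   # ranges touch at the single point 3
                c(2, 9, 4, 6))   # second range lies inside the first

# Sort the endpoints of each range with one vectorized call per bound
a_lo <- pmin(ranges[, 1], ranges[, 2])
a_hi <- pmax(ranges[, 1], ranges[, 2])
b_lo <- pmin(ranges[, 3], ranges[, 4])
b_hi <- pmax(ranges[, 3], ranges[, 4])

# Closed intervals: touching counts as overlap
olap_closed <- a_lo <= b_hi & a_hi >= b_lo
# Open intervals: touching does not count
olap_open <- a_lo < b_hi & a_hi > b_lo

cbind(ranges, olap_closed, olap_open)
```

Only the third row differs between the two variants, so the choice of <= versus < matters exactly when shared endpoints are possible.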
Find value of overlapping ranges of integers in R that are NOT times or genomes
I restructured your activities table into a long form so you can do all 4 calculations at once. Then the overlaps join is done, then you can calculate the overlap length from the results.
activities <- data.table(
act = c('act_1','act_2','act_3','act_4'),
a_min = c(min_1, min_2, min_3, min_4),
a_max = c(max_1, max_2, max_3, max_4)
)
spp_id <- c("a", "b", "c", "d")
spp_depth_min <- c(0, 20, 30, 40)
spp_depth_max <- c(200, 500, 50, 100)
species <- data.table(spp_id, spp_depth_min, spp_depth_max)
setkey(activities,a_min,a_max)
ol <- foverlaps(species, activities,
by.x = c('spp_depth_min','spp_depth_max'),
by.y = c('a_min','a_max')
)
ol[,ol_length := pmin(spp_depth_max,a_max)-pmax(spp_depth_min,a_min)]
ol
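The overlap length formula can be checked on its own in base R. In the sketch below the activity range a_min and a_max is a made-up stand-in for min_1 and max_1, whose values are not given in the answer; the species depths are the ones shown above.

```r
# Species depth ranges from the answer
spp_depth_min <- c(0, 20, 30, 40)
spp_depth_max <- c(200, 500, 50, 100)

# Hypothetical activity depth range (stand-in for min_1 / max_1)
a_min <- 0
a_max <- 25

# Smaller of the two upper bounds minus larger of the two lower bounds
raw <- pmin(spp_depth_max, a_max) - pmax(spp_depth_min, a_min)

# A negative value means the ranges do not overlap, so clamp at zero
ol_length <- pmax(raw, 0)
ol_length
# [1] 25  5  0  0
```

In the foverlaps() result only matching pairs get non-missing activity columns, so the clamping is only needed when the formula is applied to arbitrary pairs like this.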
Finding overlaps between 2 ranges and their overlapped region lengths?
Should point you in the right direction:
Load libraries
# install.packages("BiocManager")
# BiocManager::install("GenomicRanges")
library(GenomicRanges)
library(IRanges)
Generate data
gp1 <- read.table(text =
"
chr start end id1
chr1 580 600 1
chr1 900 970 2
chr3 400 600 3
chr2 100 700 4
", header = TRUE)
gp2 <- read.table(text =
"
chr start end id2
chr1 590 864 1
chr3 550 670 2
chr2 897 1987 3
", header = TRUE)
Calculate ranges
gr1 <- GenomicRanges::makeGRangesFromDataFrame(
gp1,
seqnames.field = "chr",
start.field = "start",
end.field = "end"
)
gr2 <- GenomicRanges::makeGRangesFromDataFrame(
gp2,
seqnames.field = "chr",
start.field = "start",
end.field = "end"
)
Calculate overlaps
hits <- findOverlaps(gr1, gr2)
p <- Pairs(gr1, gr2, hits = hits)
i <- pintersect(p)
Result
> as.data.frame(i)
seqnames start end width strand hit
1 chr1 590 600 11 * TRUE
2 chr3 550 600 51 * TRUE
Finding overlapping ranges between two interval data
In general, it's very appropriate to use the Bioconductor package IRanges to deal with problems related to intervals. It does so efficiently by implementing an interval tree. GenomicRanges is another package that builds on top of IRanges, specifically for handling, well, "Genomic Ranges".
require(GenomicRanges)
gr1 = with(dtFrags, GRanges(Rle(factor(chr,
levels=c("1", "2", "X", "Y"))), IRanges(start, end)))
gr2 = with(dtCoords, GRanges(Rle(factor(chr,
levels=c("1", "2", "X", "Y"))), IRanges(coord, coord)))
olaps = findOverlaps(gr2, gr1)
dtCoords[, grp := seq_len(nrow(dtCoords))]
dtFrags[subjectHits(olaps), grp := queryHits(olaps)]
setkey(dtCoords, grp)
setkey(dtFrags, grp)
dtFrags[, list(grp, id, type)][dtCoords]
grp id type id.1 chr coord
1: 1 1 exon 10 1 150
2: 2 2 intron 20 2 300
3: 2 4 exon 20 2 300
4: 3 NA NA 30 Y 500
Identify overlapping date ranges by ID R
First convert the dates to Date class. Then a self join on id and the intersection criteria will join all relevant overlapping rows. overlap is 1 if that row has an overlap and 0 otherwise. overlaps lists the row numbers of the overlaps for that row. We used the row numbers rowid, but we could replace each occurrence of it in the code below with row_n if desired.
library(sqldf)
fmt <- "%m/%d/%Y"
eg2 <- transform(eg_data,
start_dt = as.Date(start_dt, fmt),
end_dt = as.Date(end_dt, fmt))
sqldf("select
a.*,
count(b.rowid) > 0 as overlap,
coalesce(group_concat(b.rowid), '') as overlaps
from eg2 a
left join eg2 b on a.id = b.id and
not a.rowid = b.rowid and
((a.start_dt between b.start_dt and b.end_dt) or
(b.start_dt between a.start_dt and a.end_dt))
group by a.rowid
order by a.rowid")
giving:
id start_dt end_dt row_n overlap overlaps
1 1 2016-01-01 2016-12-01 1 0
2 1 2016-12-02 2017-03-14 2 1 3
3 1 2017-03-12 2017-05-15 3 1 2
4 2 2016-02-01 2016-05-15 4 0
5 2 2016-08-12 2016-12-29 5 0
6 3 2016-01-01 2016-03-02 6 0
7 3 2016-03-05 2016-04-29 7 0
8 3 2016-05-07 2016-06-29 8 0
9 3 2016-07-01 2016-08-31 9 0
10 3 2016-09-04 2016-09-25 10 0
11 3 2016-10-10 2016-11-29 11 0
12 4 2016-01-01 2016-05-31 12 1 13
13 4 2016-05-28 2016-08-19 13 1 12
14 5 2016-01-01 2016-06-10 14 1 15
15 5 2016-06-05 2016-07-25 15 1 14
16 5 2016-08-25 2016-08-29 16 0
17 5 2016-11-01 2016-12-30 17 0
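The same self-join logic can be sketched in base R with merge(). The data frame below is a hand-copied subset of the rows above (ids 1 and 4) with rows renumbered from 1, so it is an illustration rather than the asker's full eg_data.

```r
# Subset of the rows shown above (ids 1 and 4), renumbered
eg2 <- data.frame(
  id = c(1, 1, 1, 4, 4),
  start_dt = as.Date(c("2016-01-01", "2016-12-02", "2017-03-12",
                       "2016-01-01", "2016-05-28")),
  end_dt = as.Date(c("2016-12-01", "2017-03-14", "2017-05-15",
                     "2016-05-31", "2016-08-19"))
)
eg2$row_n <- seq_len(nrow(eg2))

# Self join on id: every pair of rows belonging to the same id
pairs <- merge(eg2, eg2, by = "id", suffixes = c(".a", ".b"))

# Keep pairs of different rows whose closed date ranges intersect
pairs <- pairs[pairs$row_n.a != pairs$row_n.b &
               pairs$start_dt.a <= pairs$end_dt.b &
               pairs$start_dt.b <= pairs$end_dt.a, ]

# 1 if the row overlaps any other row with the same id, else 0
eg2$overlap <- as.integer(eg2$row_n %in% pairs$row_n.a)
eg2$overlap
# [1] 0 1 1 1 1
```

The self join grows quadratically with the number of rows per id, which is fine for small groups like these.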