Overlapping Genomic Ranges
Use the Bioconductor GenomicRanges package.
source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")
Suppose there are two data frames x0
and x1
, like encode
and encode.total
in the original example. We'd like to make these into a GRanges instance. I did this with
library(GenomicRanges)
gr0 = with(x0, GRanges(chr, IRanges(start=start, end=stop)))
gr1 = with(x1, GRanges(chr, IRanges(start=start, end=stop)))
It will often be possible to simply say makeGRangesFromDataFrame(x0)
, or use standard R commands to create a GRanges instance 'by hand'. Here we use with()
, so that we can write GRanges(chr, IRanges(start=start, end=stop))
instead of GRanges(x0$chr, IRanges(start=x0$start, end=x0$stop))
.
The next step is to find overlaps between query (gr0) and subject (gr1)
hits = findOverlaps(gr0, gr1)
which leads to
> hits
Hits of length 3
queryLength: 3
subjectLength: 16
queryHits subjectHits
<integer> <integer>
1 1 4
2 2 8
3 3 10
Then updated the relevant start / end coordinates
ranges(gr0)[queryHits(hits)] = ranges(gr1)[subjectHits(hits)]
giving
> gr0
GRanges with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [ 98766902, 98770338] *
[2] chr1 [152011746, 152013185] *
[3] chr1 [168930561, 168933076] *
---
seqlengths:
chr1
NA
This will be fast up into the millions of ranges.
Finding overlaps between 2 ranges and their overlapped region lengths?
Should point you in the right direction:
Load libraries
# install.packages("BiocManager")
# BiocManager::install("GenomicRanges")
library(GenomicRanges)
library(IRanges)
Generate data
gp1 <- read.table(text =
"
chr start end id1
chr1 580 600 1
chr1 900 970 2
chr3 400 600 3
chr2 100 700 4
", header = TRUE)
gp2 <- read.table(text =
"
chr start end id2
chr1 590 864 1
chr3 550 670 2
chr2 897 1987 3
", header = TRUE)
Calculate ranges
gr1 <- GenomicRanges::makeGRangesFromDataFrame(
gp1,
seqnames.field = "chr",
start.field = "start",
end.field = "end"
)
gr2 <- GenomicRanges::makeGRangesFromDataFrame(
gp2,
seqnames.field = "chr",
start.field = "start",
end.field = "end"
)
Calculate overlaps
hits <- findOverlaps(gr1, gr2)
p <- Pairs(gr1, gr2, hits = hits)
i <- pintersect(p)
Result
> as.data.frame(i)
seqnames start end width strand hit
1 chr1 590 600 11 * TRUE
2 chr3 550 600 51 * TRUE
R, GenomicRanges: find the width of overlapping genomic ranges
The GenomicRanges package has a specific function for this it seems, intersect()
. So the solution is rather simple:
width(intersect(gr1, gr2))
[1] 6 6
(which is correct)
Create a matrix to show overlaps among multiple GRanges
First off, here is a partial solution that shows only the overlapping regions between the first and any additional GRanges
(this should generate results similar to those from bedtools intersect
which allows one to "identify overlaps between a single query (-a) file and multiple database files (-b) at once"); this should be a good starting point for further refinement.
We can define a function that takes any number of GRanges
and identifies overlapping ranges between the first GRanges
and any additional GRanges
using findOverlaps
; the intersecting regions are then obtained from pintersect
.
Please note that I make use of the common tidyverse
syntax; while this is not strictly necessary (for every purrr::map
/purrr::map2
function one can use their base R lapply
/mapply
equivalents), I prefer the tidyverse
approach for code readability.
multiOverlap <- function(...) {
require(GenomicRanges)
require(tidyverse)
# Store GRanges in list
lst <- list(...)
names(lst) <- paste0("gr", 1:length(lst))
# Calculate mutual overlaps
lst.matches <- map(lst[-1L], ~ findOverlaps(lst[[1L]], .x))
# List of intersecting regions
lst.gr <- map2(
lst[-1L], lst.matches,
~pintersect(lst[[1]][queryHits(.y)], .x[subjectHits(.y)]))
names(lst.gr) <- paste0("gr1-gr", 2:length(lst))
# Convert GRanges to data.frame and reshape data
map(lst.gr, ~.x %>%
as.data.frame() %>%
unite(locus, seqnames, start, sep = "-") %>%
select(locus)) %>%
bind_rows(.id = "id") %>%
separate(id, into = c("grx", "gry")) %>%
gather(gr, no, -locus) %>%
transmute(
locus,
no,
val = 1) %>%
spread(no, val, fill = 0)
}
When we apply this function to the three sample GRanges
we get the following result
multiOverlap(gr.1, gr.2, gr.3)
# locus gr1 gr2 gr3
#1 chr1-1 1 0 1
#2 chr1-2 1 1 0
#3 chr10-4 1 1 0
Update
Another (fast) option might be to use data.table
; especially when working with genomic data data.table
s pass-by-reference properties, avoiding deep copies, makes it very attractive (and fast).
Here is a solution that exactly reproduces your expected output
# Load the library
library(data.table)
# Convert GRanges to data.table and row-bind entries
dt <- rbindlist(
lapply(list(gr.1 = gr.1, gr.2 = gr.2, gr.3 = gr.3), as.data.table),
idcol = "id")
# Remove width and strand
dt[, c("width", "strand") := NULL]
# Expand rows by range using start and end
dt <- dt[, .(pos = seq(start, end, by = 1L)), by = .(id, seqnames, grp = 1:nrow(dt))]
# Remove helper group label
dt[, grp := NULL]
# Unite seqnames and pos into one column
dt <- dt[, .(locus = do.call(paste, c(.SD, sep = "-")), id, pos), .SDcols = seqnames:pos]
# Add count variable
dt[, ct := 1]
# Convert from long to wide
dcast(dt, locus ~ id, value.var = "ct", fill = 0)
# locus gr.1 gr.2 gr.3
#1: chr1-1 1 0 1
#2: chr1-2 1 1 0
#3: chr10-3 0 1 0
#4: chr10-4 1 1 0
And we're done:-) It's easy to wrap above lines in a convenience function if necessary.
Width of the overlapped segment in GenomicRanges package
You can apply the ranges
function over hits class( results of findOverlaps
) . ranges returns a Ranges holding the intersection of the ranges in the Ranges objects query and subject.
You don't supply a reproducible example , so here an example :
query <- IRanges(c(1, 4, 9), c(5, 7, 10))
subject <- IRanges(c(2, 2, 10), c(2, 3, 12))
mm <- findOverlaps(query,subject)
ranges(mm,query,subject)
Ranges of length 3
start end width
[1] 2 2 1
[2] 2 3 2
[3] 10 10 1
Related Topics
How to Find Useful R Tutorials with Various Implementations
Multiple Strings with Str_Detect R
Getting Frequency Values from Histogram in R
Multiple Lines for Text Per Legend Label in Ggplot2
R Shiny - Disable/Able Shinyui Elements
Reversed Order After Coord_Flip in R
Anti-Aliasing in R Graphics Under Windows (As Per MAC)
How to Extract Elements from a List with Mixed Elements
Why Does Rendering a PDF from Rmarkdown Require Closing Rstudio Between Renders
Interactively Change the Selectinput Choices
How to Select Columns in Data.Table Using a Character Vector of Certain Column Names
Emacs Ess Mode - Tabbing for Comment Region
Embedding an R HTMLwidget into Existing Webpage
How to Adjust Facet Size Manually
Plotting Cumulative Counts in Ggplot2
How to Fill Nas with Locf by Factors in Data Frame, Split by Country