Overlapping Genomic Ranges

Overlapping Genomic Ranges

Use the Bioconductor GenomicRanges package.

source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")

Suppose there are two data frames x0 and x1, like encode and encode.total in the original example. We'd like to make these into a GRanges instance. I did this with

library(GenomicRanges)
gr0 = with(x0, GRanges(chr, IRanges(start=start, end=stop)))
gr1 = with(x1, GRanges(chr, IRanges(start=start, end=stop)))

It will often be possible to simply say makeGRangesFromDataFrame(x0), or use standard R commands to create a GRanges instance 'by hand'. Here we use with(), so that we can write GRanges(chr, IRanges(start=start, end=stop)) instead of GRanges(x0$chr, IRanges(start=x0$start, end=x0$stop)).

The next step is to find overlaps between query (gr0) and subject (gr1)

hits = findOverlaps(gr0, gr1)

which leads to

> hits
Hits of length 3
queryLength: 3
subjectLength: 16
queryHits subjectHits
<integer> <integer>
1 1 4
2 2 8
3 3 10

Then updated the relevant start / end coordinates

ranges(gr0)[queryHits(hits)] = ranges(gr1)[subjectHits(hits)]

giving

> gr0
GRanges with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 [ 98766902, 98770338] *
[2] chr1 [152011746, 152013185] *
[3] chr1 [168930561, 168933076] *
---
seqlengths:
chr1
NA

This will be fast up into the millions of ranges.

Finding overlaps between 2 ranges and their overlapped region lengths?

Should point you in the right direction:

Load libraries

# install.packages("BiocManager")
# BiocManager::install("GenomicRanges")
library(GenomicRanges)
library(IRanges)

Generate data

gp1 <- read.table(text = 
"
chr start end id1
chr1 580 600 1
chr1 900 970 2
chr3 400 600 3
chr2 100 700 4
", header = TRUE)

gp2 <- read.table(text =
"
chr start end id2
chr1 590 864 1
chr3 550 670 2
chr2 897 1987 3
", header = TRUE)

Calculate ranges

gr1 <- GenomicRanges::makeGRangesFromDataFrame(
gp1,
seqnames.field = "chr",
start.field = "start",
end.field = "end"
)
gr2 <- GenomicRanges::makeGRangesFromDataFrame(
gp2,
seqnames.field = "chr",
start.field = "start",
end.field = "end"
)

Calculate overlaps

hits <- findOverlaps(gr1, gr2)
p <- Pairs(gr1, gr2, hits = hits)
i <- pintersect(p)

Result

> as.data.frame(i)
seqnames start end width strand hit
1 chr1 590 600 11 * TRUE
2 chr3 550 600 51 * TRUE

R, GenomicRanges: find the width of overlapping genomic ranges

The GenomicRanges package has a specific function for this it seems, intersect(). So the solution is rather simple:

width(intersect(gr1, gr2))

[1] 6 6

(which is correct)

Create a matrix to show overlaps among multiple GRanges

First off, here is a partial solution that shows only the overlapping regions between the first and any additional GRanges (this should generate results similar to those from bedtools intersect which allows one to "identify overlaps between a single query (-a) file and multiple database files (-b) at once"); this should be a good starting point for further refinement.

We can define a function that takes any number of GRanges and identifies overlapping ranges between the first GRanges and any additional GRanges using findOverlaps; the intersecting regions are then obtained from pintersect.

Please note that I make use of the common tidyverse syntax; while this is not strictly necessary (for every purrr::map/purrr::map2 function one can use their base R lapply/mapply equivalents), I prefer the tidyverse approach for code readability.

multiOverlap <- function(...) {
require(GenomicRanges)
require(tidyverse)

# Store GRanges in list
lst <- list(...)
names(lst) <- paste0("gr", 1:length(lst))

# Calculate mutual overlaps
lst.matches <- map(lst[-1L], ~ findOverlaps(lst[[1L]], .x))

# List of intersecting regions
lst.gr <- map2(
lst[-1L], lst.matches,
~pintersect(lst[[1]][queryHits(.y)], .x[subjectHits(.y)]))
names(lst.gr) <- paste0("gr1-gr", 2:length(lst))

# Convert GRanges to data.frame and reshape data
map(lst.gr, ~.x %>%
as.data.frame() %>%
unite(locus, seqnames, start, sep = "-") %>%
select(locus)) %>%
bind_rows(.id = "id") %>%
separate(id, into = c("grx", "gry")) %>%
gather(gr, no, -locus) %>%
transmute(
locus,
no,
val = 1) %>%
spread(no, val, fill = 0)
}

When we apply this function to the three sample GRanges we get the following result

multiOverlap(gr.1, gr.2, gr.3)
# locus gr1 gr2 gr3
#1 chr1-1 1 0 1
#2 chr1-2 1 1 0
#3 chr10-4 1 1 0

Update

Another (fast) option might be to use data.table; especially when working with genomic data data.tables pass-by-reference properties, avoiding deep copies, makes it very attractive (and fast).

Here is a solution that exactly reproduces your expected output

# Load the library
library(data.table)

# Convert GRanges to data.table and row-bind entries
dt <- rbindlist(
lapply(list(gr.1 = gr.1, gr.2 = gr.2, gr.3 = gr.3), as.data.table),
idcol = "id")

# Remove width and strand
dt[, c("width", "strand") := NULL]

# Expand rows by range using start and end
dt <- dt[, .(pos = seq(start, end, by = 1L)), by = .(id, seqnames, grp = 1:nrow(dt))]

# Remove helper group label
dt[, grp := NULL]

# Unite seqnames and pos into one column
dt <- dt[, .(locus = do.call(paste, c(.SD, sep = "-")), id, pos), .SDcols = seqnames:pos]

# Add count variable
dt[, ct := 1]

# Convert from long to wide
dcast(dt, locus ~ id, value.var = "ct", fill = 0)
# locus gr.1 gr.2 gr.3
#1: chr1-1 1 0 1
#2: chr1-2 1 1 0
#3: chr10-3 0 1 0
#4: chr10-4 1 1 0

And we're done:-) It's easy to wrap above lines in a convenience function if necessary.

Width of the overlapped segment in GenomicRanges package

You can apply the ranges function over hits class( results of findOverlaps) . ranges returns a Ranges holding the intersection of the ranges in the Ranges objects query and subject.

You don't supply a reproducible example , so here an example :

query <- IRanges(c(1, 4, 9), c(5, 7, 10))
subject <- IRanges(c(2, 2, 10), c(2, 3, 12))
mm <- findOverlaps(query,subject)
ranges(mm,query,subject)
Ranges of length 3
start end width
[1] 2 2 1
[2] 2 3 2
[3] 10 10 1


Related Topics



Leave a reply



Submit