Overlapping Genomic Ranges

Use the Bioconductor GenomicRanges package.

source("http://bioconductor.org/biocLite.R")
biocLite("GenomicRanges")

Suppose there are two data frames x0 and x1, like encode and encode.total in the original example. We'd like to make these into a GRanges instance. I did this with

library(GenomicRanges)
gr0 = with(x0, GRanges(chr, IRanges(start=start, end=stop)))
gr1 = with(x1, GRanges(chr, IRanges(start=start, end=stop)))

It will often be possible to simply say makeGRangesFromDataFrame(x0), or use standard R commands to create a GRanges instance 'by hand'. Here we use with(), so that we can write GRanges(chr, IRanges(start=start, end=stop)) instead of GRanges(x0$chr, IRanges(start=x0$start, end=x0$stop)).

The next step is to find overlaps between query (gr0) and subject (gr1)

hits = findOverlaps(gr0, gr1)

which leads to

> hits
Hits of length 3
queryLength: 3
subjectLength: 16
  queryHits subjectHits 
   <integer>   <integer> 
 1         1           4 
 2         2           8 
 3         3          10

Then updated the relevant start / end coordinates

ranges(gr0)[queryHits(hits)] = ranges(gr1)[subjectHits(hits)]

giving

> gr0
GRanges with 3 ranges and 0 metadata columns:
      seqnames                 ranges strand
         <Rle>              <IRanges>  <Rle>
  [1]     chr1 [ 98766902,  98770338]      *
  [2]     chr1 [152011746, 152013185]      *
  [3]     chr1 [168930561, 168933076]      *
  ---
  seqlengths:
   chr1
     NA

This will be fast up into the millions of ranges.

Finding overlaps between 2 ranges and their overlapped region lengths?

Should point you in the right direction:

Load libraries

# install.packages("BiocManager")
# BiocManager::install("GenomicRanges")
library(GenomicRanges)
library(IRanges)

Generate data

gp1 <- read.table(text = 
"
chr   start   end   id1 
chr1  580     600    1
chr1  900     970    2
chr3  400     600    3
chr2  100     700    4
", header = TRUE)

gp2 <- read.table(text = 
                      "
chr   start   end   id2
chr1  590     864   1
chr3  550     670   2
chr2  897     1987  3
", header = TRUE)

Calculate ranges

gr1 <- GenomicRanges::makeGRangesFromDataFrame(
    gp1, 
    seqnames.field = "chr", 
    start.field = "start", 
    end.field = "end"
)
gr2 <- GenomicRanges::makeGRangesFromDataFrame(
    gp2, 
    seqnames.field = "chr", 
    start.field = "start", 
    end.field = "end"
)

Calculate overlaps

hits <- findOverlaps(gr1, gr2)
p <- Pairs(gr1, gr2, hits = hits)
i <- pintersect(p)

Result

> as.data.frame(i)
  seqnames start end width strand  hit
1     chr1   590 600    11      * TRUE
2     chr3   550 600    51      * TRUE

R, GenomicRanges: find the width of overlapping genomic ranges

The GenomicRanges package has a specific function for this it seems, intersect(). So the solution is rather simple:

width(intersect(gr1, gr2))

[1] 6 6

(which is correct)

Create a matrix to show overlaps among multiple GRanges

First off, here is a partial solution that shows only the overlapping regions between the first and any additional GRanges (this should generate results similar to those from bedtools intersect which allows one to "identify overlaps between a single query (-a) file and multiple database files (-b) at once"); this should be a good starting point for further refinement.

We can define a function that takes any number of GRanges and identifies overlapping ranges between the first GRanges and any additional GRanges using findOverlaps; the intersecting regions are then obtained from pintersect.

Please note that I make use of the common tidyverse syntax; while this is not strictly necessary (for every purrr::map/purrr::map2 function one can use their base R lapply/mapply equivalents), I prefer the tidyverse approach for code readability.

multiOverlap <- function(...) {
    require(GenomicRanges)
    require(tidyverse)

    # Store GRanges in list
    lst <- list(...)
    names(lst) <- paste0("gr", 1:length(lst))

    # Calculate mutual overlaps
    lst.matches <- map(lst[-1L], ~ findOverlaps(lst[[1L]], .x))

    # List of intersecting regions
    lst.gr <- map2(
        lst[-1L], lst.matches,
        ~pintersect(lst[[1]][queryHits(.y)], .x[subjectHits(.y)]))
    names(lst.gr) <- paste0("gr1-gr", 2:length(lst))

    # Convert GRanges to data.frame and reshape data
    map(lst.gr, ~.x %>%
        as.data.frame() %>%
        unite(locus, seqnames, start, sep = "-") %>%
        select(locus)) %>%
        bind_rows(.id = "id") %>%
        separate(id, into = c("grx", "gry")) %>%
        gather(gr, no, -locus) %>%
        transmute(
            locus,
            no,
            val = 1) %>%
            spread(no, val, fill = 0)
}

When we apply this function to the three sample GRanges we get the following result

multiOverlap(gr.1, gr.2, gr.3)
#    locus gr1 gr2 gr3
#1  chr1-1   1   0   1
#2  chr1-2   1   1   0
#3 chr10-4   1   1   0

Update

Another (fast) option might be to use data.table; especially when working with genomic data data.tables pass-by-reference properties, avoiding deep copies, makes it very attractive (and fast).

Here is a solution that exactly reproduces your expected output

# Load the library
library(data.table)

# Convert GRanges to data.table and row-bind entries
dt <- rbindlist(
    lapply(list(gr.1 = gr.1, gr.2 = gr.2, gr.3 = gr.3), as.data.table),
    idcol = "id")

# Remove width and strand
dt[, c("width", "strand") := NULL]

# Expand rows by range using start and end
dt <- dt[, .(pos = seq(start, end, by = 1L)), by = .(id, seqnames, grp = 1:nrow(dt))]

# Remove helper group label
dt[, grp := NULL]

# Unite seqnames and pos into one column
dt <- dt[, .(locus = do.call(paste, c(.SD, sep = "-")), id, pos), .SDcols = seqnames:pos]

# Add count variable
dt[, ct := 1]

# Convert from long to wide
dcast(dt, locus ~ id, value.var = "ct", fill = 0)
#     locus gr.1 gr.2 gr.3
#1:  chr1-1    1    0    1
#2:  chr1-2    1    1    0
#3: chr10-3    0    1    0
#4: chr10-4    1    1    0

And we're done:-) It's easy to wrap above lines in a convenience function if necessary.

Width of the overlapped segment in GenomicRanges package

You can apply the ranges function over hits class( results of findOverlaps) . ranges returns a Ranges holding the intersection of the ranges in the Ranges objects query and subject.

You don't supply a reproducible example , so here an example :

query <- IRanges(c(1, 4, 9), c(5, 7, 10))
subject <- IRanges(c(2, 2, 10), c(2, 3, 12))
mm <- findOverlaps(query,subject)
ranges(mm,query,subject)
Ranges of length 3
    start end width
[1]     2   2     1
[2]     2   3     2
[3]    10  10     1

Overlapping Genomic Ranges