Collapse rows with overlapping ranges
You can try this:
library(dplyr)
ranges %>%
arrange(start) %>%
group_by(g = cumsum(cummax(lag(stop, default = first(stop))) < start)) %>%
summarise(start = first(start), stop = max(stop))
# A tibble: 2 × 3
# g start stop
# <int> <dbl> <dbl>
#1 0 65.72000 87.75625
#2 1 89.61625 104.94062
Merge overlapping ranges per group
I used the Bioconductor GenomicRanges package, which seems highly appropriate to your domain.
> ## install.packages("BiocManager")
> ## BiocManager::install("GenomicRanges")
> library(GenomicRanges)
> my.df |> as("GRanges") |> reduce()
GRanges object with 5 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] 4F 2500-3401 +
[2] 4F 19116-20730 +
[3] 4F 1420-2527 -
[4] 0F 1405-1700 -
[5] 0F 1727-2038 -
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
which differs from your expectation because there are two OF
non-overlapping ranges?
Find the intersection of overlapping ranges in two tables using data.table function foverlaps
@Seth provided the fastest way to solve the problem of intersection overlaps using the data.table foverlaps function. However, this solution did not take into account the fact that the input bed files may have overlapping ranges that needed to be reduced into single regions. @Martin Morgan solved that with his solution using the GenomicRanges package, that did both the intersecting and range reducing. However, Martin's solution didn't use the foverlaps function. @Arun pointed out that the overlapping ranges in different rows within a table was not currently possible using foverlaps. Thanks to the answers provided, and some additional research on stackoverflow, I came up with this hybrid solution.
Create example BED files without overlapping regions within each file.
chr <- c(1:22,"X","Y","MT")
#bedA contains 5 million rows
bedA <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,200000))),
START = rep(as.integer(seq(1,200000000,1000)),25),
STOP = rep(as.integer(seq(500,200000000,1000)),25),
key = c("CHROM","START","STOP")
)
#bedB contains 500 thousand rows
bedB <- data.table(
CHROM = as.vector(sapply(chr, function(x) rep(x,20000))),
START = rep(as.integer(seq(200,200000000,10000)),25),
STOP = rep(as.integer(seq(600,200000000,10000)),25),
key = c("CHROM","START","STOP")
)
Now create a new bed file containing the intersecting regions in bedA and bedB.
#This solution uses foverlaps
system.time(tmpA <- intersectBedFiles.foverlaps(bedA,bedB))
user system elapsed
1.25 0.02 1.37
#This solution uses GenomicRanges
system.time(tmpB <- intersectBedFiles.GR(bedA,bedB))
user system elapsed
12.95 0.06 13.04
identical(tmpA,tmpB)
[1] TRUE
Now, modify bedA and bedB such that they contain overlapping regions:
#Create overlapping ranges
makeOverlaps <- as.integer(c(0,0,600,0,0,0,600,0,0,0))
bedC <- bedA[, STOP := STOP + makeOverlaps, by=CHROM]
bedD <- bedB[, STOP := STOP + makeOverlaps, by=CHROM]
Test time to intersect bed files with overlapping ranges using either the foverlaps or GenomicRanges fucntions.
#This solution uses foverlaps to find the intersection and then run GenomicRanges on the result
system.time(tmpC <- intersectBedFiles.foverlaps(bedC,bedD))
user system elapsed
1.83 0.05 1.89
#This solution uses GenomicRanges
system.time(tmpD <- intersectBedFiles.GR(bedC,bedD))
user system elapsed
12.95 0.04 12.99
identical(tmpC,tmpD)
[1] TRUE
The winner: foverlaps!
FUNCTIONS USED
This is the function based upon foverlaps, and will only call the GenomicRanges function (reduceBed.GenomicRanges) if there are overlapping ranges (which are checked for using the rowShift function).
intersectBedFiles.foverlaps <- function(bed1,bed2) {
require(data.table)
bedKey <- c("CHROM","START","STOP")
if(nrow(bed1)>nrow(bed2)) {
bed <- foverlaps(bed1, bed2, nomatch = 0)
} else {
bed <- foverlaps(bed2, bed1, nomatch = 0)
}
bed[, START := pmax(START, i.START)]
bed[, STOP := pmin(STOP, i.STOP)]
bed[, `:=`(i.START = NULL, i.STOP = NULL)]
if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
if(any(bed[, STOP+1 >= rowShift(START), by=CHROM][,V1], na.rm = T)) {
bed <- reduceBed.GenomicRanges(bed)
}
return(bed)
}
rowShift <- function(x, shiftLen = 1L) {
#Note this function was described in this thread:
#http://stackoverflow.com/questions/14689424/use-a-value-from-the-previous-row-in-an-r-data-table-calculation
r <- (1L + shiftLen):(length(x) + shiftLen)
r[r<1] <- NA
return(x[r])
}
reduceBed.GenomicRanges <- function(bed) {
setnames(bed,colnames(bed),bedKey)
if(!identical(key(bed),bedKey)) setkeyv(bed,bedKey)
grBed <- makeGRangesFromDataFrame(bed,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grBed <- reduce(grBed)
grBed <- data.table(
CHROM=as.character(seqnames(grBed)),
START=start(grBed),
STOP=end(grBed),
key = c("CHROM","START","STOP"))
return(grBed)
}
This function strictly used the GenomicRanges package, produces the same result, but is about 10 fold slower that the foverlaps funciton.
intersectBedFiles.GR <- function(bed1,bed2) {
require(data.table)
require(GenomicRanges)
bed1 <- makeGRangesFromDataFrame(bed1,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
bed2 <- makeGRangesFromDataFrame(bed2,
seqnames.field = "CHROM",start.field="START",end.field="STOP")
grMerge <- suppressWarnings(intersect(bed1,bed2))
resultTable <- data.table(
CHROM=as.character(seqnames(grMerge)),
START=start(grMerge),
STOP=end(grMerge),
key = c("CHROM","START","STOP"))
return(resultTable)
}
An additional comparison using IRanges
I found a solution to collapse overlapping regions using IRanges but it is more than 10 fold slower than GenomicRanges.
reduceBed.IRanges <- function(bed) {
bed.tmp <- bed
bed.tmp[,group := {
ir <- IRanges(START, STOP);
subjectHits(findOverlaps(ir, reduce(ir)))
}, by=CHROM]
bed.tmp <- bed.tmp[, list(CHROM=unique(CHROM),
START=min(START),
STOP=max(STOP)),
by=list(group,CHROM)]
setkeyv(bed.tmp,bedKey)
bed[,group := NULL]
return(bed.tmp[, -(1:2)])
}
system.time(bedC.reduced <- reduceBed.GenomicRanges(bedC))
user system elapsed
10.86 0.01 10.89
system.time(bedD.reduced <- reduceBed.IRanges(bedC))
user system elapsed
137.12 0.14 137.58
identical(bedC.reduced,bedD.reduced)
[1] TRUE
What's a good, generic algorithm for collapsing a set of potentially-overlapping ranges?
This seems to works and is easy to understand.
public static IEnumerable<Range<T>> Collapse<T>(this IEnumerable<Range<T>> me, IComparer<T> comparer)
{
List<Range<T>> orderdList = me.OrderBy(r => r.Start).ToList();
List<Range<T>> newList = new List<Range<T>>();
T max = orderdList[0].End;
T min = orderdList[0].Start;
foreach (var item in orderdList.Skip(1))
{
if (comparer.Compare(item.End, max) > 0 && comparer.Compare(item.Start, max) > 0)
{
newList.Add(new Range<T> { Start = min, End = max });
min = item.Start;
}
max = comparer.Compare(max, item.End) > 0 ? max : item.End;
}
newList.Add(new Range<T>{Start=min,End=max});
return newList;
}
Here is the variation which I mentioned in the comments. It's basically the same thing, but with some checking and yielding of the results instead of collecting in a list before returning.
public static IEnumerable<Range<T>> Collapse<T>(this IEnumerable<Range<T>> ranges, IComparer<T> comparer)
{
if(ranges == null || !ranges.Any())
yield break;
if (comparer == null)
comparer = Comparer<T>.Default;
var orderdList = ranges.OrderBy(r => r.Start);
var firstRange = orderdList.First();
T min = firstRange.Start;
T max = firstRange.End;
foreach (var current in orderdList.Skip(1))
{
if (comparer.Compare(current.End, max) > 0 && comparer.Compare(current.Start, max) > 0)
{
yield return Create(min, max);
min = current.Start;
}
max = comparer.Compare(max, current.End) > 0 ? max : current.End;
}
yield return Create(min, max);
}
Pandas: Finding overlapping regions based on start- and stop coordinates
I think there is a clever way to do this in one pass, but a brute force solution is to just loop over the rows in the dataframe.
For example:
import pandas as pd
# Sample input
df = pd.DataFrame({
"id": [1, 2, 3, 4],
"type": ["AP", "AP", "ES", "ES"],
"start": [0, 3, 5, 12],
"stop": [10, 7, 15, 18]
})
df['count'] = 0
for row in df.itertuples():
mask = (row.start <= df.stop) & (row.stop >= df.start)
df.loc[row.Index, 'count'] = sum(mask) - 1
And we get
id start stop type count
0 1 0 10 AP 2
1 2 3 7 AP 2
2 3 5 15 ES 3
3 4 12 18 ES 1
Efficiently finding intersecting regions in two huge dictionaries
I really recommend you to use PANDAS to cope with this kind of problem.
for proof that can be simply done with pandas:
import pandas as pd #install this, and read de docs
from StringIO import StringIO #You dont need this
#simulating a reading the file
first_file = """contig17 GRMZM2G052619_P03 x
contig33 AT2G41790.1 x
contig98 GRMZM5G888620_P01 x
contig102 GRMZM5G886789_P02 x
contig123 AT3G57470.1 x"""
#simulating reading the second file
second_file = """y GRMZM2G052619_P03 y
y GRMZM5G888620_P01 y
y GRMZM5G886789_P02 y"""
#here is how you open the files. Instead using StringIO
#you will simply the file path. Give the correct separator
#sep="\t" (for tabular data). Here im using a space.
#In name, put some relevant names for your columns
f_df = pd.read_table(StringIO(first_file),
header=None,
sep=" ",
names=['a', 'b', 'c'])
s_df = pd.read_table(StringIO(second_file),
header=None,
sep=" ",
names=['d', 'e', 'f'])
#this is the hard bit. Here I am using a bit of my experience with pandas
#Basicly it select the rows in the second data frame, which "isin"
#in the second columns for each data frames.
my_df = s_df[s_df.e.isin(f_df.b)]
Output:
Out[180]:
d e f
0 y GRMZM2G052619_P03 y
1 y GRMZM5G888620_P01 y
2 y GRMZM5G886789_P02 y
#you can save this with:
my_df.to_csv("result.txt", sep="\t")
chers!
Related Topics
What Does the Function Invisible() Do
How to Use Dplyr's Summarize and Which() to Lookup Min/Max Values
Convert Factor to Date/Time in R
Mutate Multiple Variable to Create Multiple New Variables
Why I Get This Error Writing Data to a File
Save Imported CSV Data in Vector - R
Create an R Package That Depends on Another R Package Located on Github
Non-Numeric Argument to Binary Operator Error in R
Saving and Loading a Model in R
The Simplest Way to Convert a List with Various Length Vectors to a Data.Frame in R