Merge by Range in R - Applying Loops

The GenomicRanges package in Bioconductor is designed for this type of operation. Read your data in with, e.g., read.delim so that

con <- textConnection("SNP     BP
rs064 12292
rs319 345367
rs285 700042")
snps <- read.delim(con, header=TRUE, sep="")

con <- textConnection("Gene BP_start BP_end
E613 345344 363401
E92 694501 705408
E49 362370 368340") ## missing trailing digit on BP_end??
genes <- read.delim(con, header=TRUE, sep="")

then create 'IRanges' out of each

library(IRanges)
isnps <- with(snps, IRanges(BP, width=1, names=SNP))
igenes <- with(genes, IRanges(BP_start, BP_end, names=Gene))

(Pay attention to coordinate systems: IRanges expects both start and end to be included in the range, and end >= start except for 0-width ranges, where end = start - 1.) Then find the SNPs ('query' in IRanges terminology) that overlap the genes ('subject')

olaps <- findOverlaps(isnps, igenes)

two of the SNPs overlap

> queryHits(olaps)
[1] 2 3

and they overlap the first and second genes

> subjectHits(olaps)
[1] 1 2

If a query overlapped multiple genes, it would have been repeated in queryHits (and vice versa). You could then merge your data frames as

> cbind(snps[queryHits(olaps),], genes[subjectHits(olaps),])
    SNP     BP Gene BP_start BP_end
2 rs319 345367 E613   345344 363401
3 rs285 700042  E92   694501 705408
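As a cross-check that needs no Bioconductor install, the same overlap can be sketched in base R. This is only a minimal sketch for small inputs (it builds a full SNP-by-gene matrix, so it will not scale the way IRanges does); the data frames are retyped from the example above, and the names inside, hits and merged are made up for illustration.

```r
snps <- data.frame(SNP = c("rs064", "rs319", "rs285"),
                   BP = c(12292, 345367, 700042))
genes <- data.frame(Gene = c("E613", "E92", "E49"),
                    BP_start = c(345344, 694501, 362370),
                    BP_end = c(363401, 705408, 368340))

# inside[i, j] is TRUE when SNP i lies within gene j (both ends inclusive)
inside <- outer(snps$BP, genes$BP_start, ">=") &
    outer(snps$BP, genes$BP_end, "<=")

# one row per (SNP, gene) hit; a SNP inside several genes appears several times
hits <- which(inside, arr.ind = TRUE)
hits <- hits[order(hits[, "row"]), , drop = FALSE]

merged <- cbind(snps[hits[, "row"], ], genes[hits[, "col"], ])
merged
```

The "row" and "col" columns of hits play the same role as queryHits() and subjectHits().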

Usually genes and SNPs have chromosome and strand ('+', '-', or '*' to indicate that strand isn't important) information, and you'd want to do overlaps in the context of these; instead of creating 'IRanges' instances, you'd create 'GRanges' (genomic ranges) and the subsequent book-keeping would be taken care of for you

library(GenomicRanges)
isnps <-
    with(snps, GRanges("chrA", IRanges(BP, width=1, names=SNP), "*"))
igenes <-
    with(genes, GRanges("chrA", IRanges(BP_start, BP_end, names=Gene), "+"))

Matched Range Merge in R

UPDATE: This question was more complicated than indicated here. The solution can be found here: Merge by Range in R - Applying Loops. It uses the GenomicRanges package in Bioconductor. Very useful package!

Merge two data frames considering a range match between key columns

You want to merge two data frames considering a range match between key columns. Here are two solutions.

using sqldf

library(sqldf)

output <- sqldf("select * from FD left join shpxt
on (FD.X >= shpxt.Xmin and FD.X <= shpxt.Xmax and
FD.Y >= shpxt.Ymin and FD.Y <= shpxt.Ymax ) ")

using data.table

library(data.table)

# convert your data frames to data.tables (by reference)
setDT(FD)
setDT(shpxt)

output <- FD[shpxt,
             on = .(X >= Xmin, X <= Xmax,   # indicate x range
                    Y >= Ymin, Y <= Ymax),  # indicate y range
             nomatch = NA,
             .(Survival, X, Y, Xmin, Xmax, Ymin, Ymax, Sites)]  # indicate columns in the output

There are other ways to solve this problem, as you will find in other SO questions here and here.

P.S. Keep in mind that a for loop is not necessarily the best solution.
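For small inputs, one loop-free alternative is a base R cross join followed by a filter. The sketch below is only an illustration: the original FD and shpxt are not shown, so the toy values here are invented, and only the column names follow the queries above.

```r
# Hypothetical toy data; only the column names come from the queries above
FD <- data.frame(Survival = c(0.9, 0.4, 0.7),
                 X = c(1.5, 6.0, 20.0),
                 Y = c(2.5, 7.0, 20.0))
shpxt <- data.frame(Sites = c("A", "B"),
                    Xmin = c(0, 5), Xmax = c(3, 8),
                    Ymin = c(0, 5), Ymax = c(4, 9))

# by = NULL makes merge() return the Cartesian product of the two data frames
all_pairs <- merge(FD, shpxt, by = NULL)

# keep only the pairs where the point falls inside the rectangle
matched <- subset(all_pairs,
                  X >= Xmin & X <= Xmax & Y >= Ymin & Y <= Ymax)
matched
```

Unlike the left join in the sqldf query, this drops points that match no rectangle (the third row of the toy FD), and the intermediate table has nrow(FD) * nrow(shpxt) rows, so it is only practical when both tables are small.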

Using apply() for nested for loop in R

Non-equi joins are supported natively in SQL and, within R, in data.table. Base R does not support them directly, and tidyverse functions did not until dplyr 1.1.0 added join_by()[1].

library(data.table)
setDT(dfA)
setDT(dfB)
dfB[dfA, on = .(common == Common, cDate <= Date, bDate >= Date)]
#         cDate      bDate      common a
# 1: 2005-01-01 2005-01-01 20141331123 1
# 2: 2005-01-02 2005-01-02 20141331123 2
# 3: 2005-01-03 2005-01-03 20141331123 3
# 4: 2005-01-04 2005-01-04 20141331123 4
# 5: 2005-01-05 2005-01-05 20141331123 5
# 6: 2005-01-06 2005-01-06 20141331123 6

The sample data is a little uninteresting in that everything fits in the single interval, but perhaps this will work with your more varied data.

[1]: Since SQL supports it, it's supported in dbplyr using sql_on.


Data:

dfA <- structure(list(Common = c("20141331123", "20141331123", "20141331123", "20141331123", "20141331123", "20141331123"), a = 1:6, Date = structure(c(12784, 12785, 12786, 12787, 12788, 12789), class = "Date")), row.names = c(NA, -6L), class = "data.frame")
dfB <- structure(list(cDate = structure(12784, class = "Date"), bDate = structure(12947, class = "Date"), common = "20141331123"), row.names = c(NA, -1L), class = "data.frame")
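For comparison, the same result can be approximated in base R by doing an ordinary equality merge on the key and then filtering on the date range. This is a sketch, not a true non-equi join: it materialises every key match before filtering, so it can get large on real data. The data is a compact retyping of dfA and dfB above, and joined is a name chosen for illustration.

```r
dfA <- data.frame(Common = rep("20141331123", 6), a = 1:6,
                  Date = as.Date(12784:12789, origin = "1970-01-01"))
dfB <- data.frame(cDate = as.Date(12784, origin = "1970-01-01"),
                  bDate = as.Date(12947, origin = "1970-01-01"),
                  common = "20141331123")

# step 1: equality join on the key column
joined <- merge(dfA, dfB, by.x = "Common", by.y = "common")

# step 2: keep rows whose Date lies within [cDate, bDate]
joined <- joined[joined$Date >= joined$cDate & joined$Date <= joined$bDate, ]
joined
```

Here cDate and bDate keep their own values from dfB, whereas in the data.table output above those columns show the matched Date.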

how to merge columns in a for loop?

You can do this without a for loop using tidyr::unite():

library(tidyr)

res %>% unite(col = 'name_for_new_col', sep = ' ')

This will paste together all columns in res and put a space between each element. To restrict it to particular columns, name them after col:

res %>% unite(col = 'name_for_new_col', x1:x3, sep = ' ')

If you want to keep the old columns as well you can specify remove = FALSE within unite().


Alternatively, you can do it in base R using apply and paste:

res2 <- as.data.frame(apply(res, 1, paste, collapse = ' '))
names(res2)[1] <- 'nicer.name' #rename the single column

Or with just paste and a for loop:

df <- data.frame(matrix(nrow = nrow(res), ncol = 1))
for (i in seq_len(nrow(res))) {
    # paste all values in row i into a single string
    df[i, 1] <- paste(res[i, ], collapse = ' ')
}
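A vectorised base R variant avoids both apply and the loop by handing all columns to paste() at once. This is a minimal sketch: the original res is not shown, so the small data frame and the column name combined are made up for illustration.

```r
# Hypothetical stand-in for res
res <- data.frame(x1 = c("a", "b"), x2 = c("c", "d"), x3 = c("e", "f"))

# do.call() passes every column of res to paste() as a separate argument,
# so the values are joined row by row
res$combined <- do.call(paste, c(res, sep = " "))
res
```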

Merge overlapping ranges per group

I used the Bioconductor GenomicRanges package, which seems highly appropriate to your domain.


> ## install.packages("BiocManager")
> ## BiocManager::install("GenomicRanges")
> library(GenomicRanges)
> my.df |> as("GRanges") |> reduce()
GRanges object with 5 ranges and 0 metadata columns:
      seqnames      ranges strand
         <Rle>   <IRanges>  <Rle>
  [1]       4F   2500-3401      +
  [2]       4F 19116-20730      +
  [3]       4F   1420-2527      -
  [4]       0F   1405-1700      -
  [5]       0F   1727-2038      -
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

which differs from your expectation because there are two non-overlapping 0F ranges?
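To see what this kind of merging does, here is a rough base R sketch of the idea: sort by start, track the furthest end seen so far, and open a new block whenever a start lies beyond it. It is not GenomicRanges' implementation, it ignores strand, it leaves ranges that merely abut unmerged, and it assumes non-empty input. The function merge_ranges and the toy data are hypothetical.

```r
# Hypothetical helper: merge overlapping [start, end] ranges (non-empty input)
merge_ranges <- function(start, end) {
    ord <- order(start)
    start <- start[ord]
    end <- end[ord]
    # furthest end reached by any range seen so far
    reach <- cummax(end)
    # a new block begins where a start lies beyond everything seen before
    new_block <- c(TRUE, start[-1] > reach[-length(reach)])
    block <- cumsum(new_block)
    data.frame(start = as.vector(tapply(start, block, min)),
               end = as.vector(tapply(end, block, max)))
}

# Toy data: group "a" has two overlapping ranges and one separate range
ranges <- data.frame(group = c("a", "a", "a", "b"),
                     start = c(1, 3, 20, 3),
                     end = c(4, 8, 25, 6))

# apply the helper within each group and stack the results
merged <- do.call(rbind, lapply(split(ranges, ranges$group), function(g) {
    cbind(group = g$group[1], merge_ranges(g$start, g$end))
}))
merged
```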

Join one data frame to another by membership in a range

You can use findInterval to match each time with the corresponding onset (df2$onset must be sorted in increasing order), then merge your two data.frames:

df1$onset <- df2$onset[findInterval(df1$time, df2$onset)]
df3 <- merge(df1, df2, by = "onset")

head(df3)
#   onset  time offset
# 1     0 0.000  0.799
# 2     0 0.003  0.799
# 3     0 0.006  0.799
# 4     0 0.009  0.799
# 5     0 0.012  0.799
# 6     0 0.015  0.799

tail(df3)
#      onset  time offset
# 995    2.4 2.982      3
# 996    2.4 2.985      3
# 997    2.4 2.988      3
# 998    2.4 2.991      3
# 999    2.4 2.994      3
# 1000   2.4 2.997      3
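One edge case worth guarding against: findInterval returns 0 for a time that precedes the first onset, and indexing with 0 silently drops that element, so the assignment would fail on a length mismatch. The sketch below turns those zeros into NA first. The data is a toy stand-in: the first and last intervals follow the output above, while the middle interval and the negative time are invented.

```r
# Toy stand-ins; the middle interval (0.8 to 2.399) is an assumption
df2 <- data.frame(onset = c(0, 0.8, 2.4), offset = c(0.799, 2.399, 3))
df1 <- data.frame(time = c(-0.5, 0.003, 0.9, 2.5))

# index of the last onset that is <= each time; 0 means "before the first onset"
idx <- findInterval(df1$time, df2$onset)
idx[idx == 0] <- NA

# look the interval up directly instead of merging afterwards
df1$onset <- df2$onset[idx]
df1$offset <- df2$offset[idx]
df1
```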

R: How to create a loop for, for a range of data in a function?

You are trying to put a vector of 10 elements into a single matrix cell. You want to assign it to a matrix row instead (you can access the ith row with A[i,]).

But a for loop is unnecessary here, and it is quite straightforward to use one of the "apply" functions. The most general of them, lapply, returns a list (the most versatile container, since there is basically no constraint on what its elements hold).

Here sapply is an apply function which tries to Simplify its result to a convenient data structure. In this case, since all results have the same length (10), sapply will simplify the result to a matrix with 10 rows and one column per value of L_inf.

Note that I modified your function to make it explicitly depend on L_inf. Otherwise it will not do what you think it should do (see keyword "closures" if you want more info).

L_inf_range <- seq(17,20,by=0.1)
B <- 1

fun <- function(x, L_inf) {
    L_inf * (1 - exp(-B * (x - 0)))
}

sapply(L_inf_range, function(L) fun(1:10, L_inf=L))
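If you want the shape of the result to be guaranteed rather than guessed, vapply is a stricter sibling of sapply that takes a template for each result. The sketch below reuses the definitions above; A and A2 are names chosen for illustration, and the outer() line is an alternative that works only because fun is vectorised in both arguments.

```r
L_inf_range <- seq(17, 20, by = 0.1)
B <- 1

fun <- function(x, L_inf) {
    L_inf * (1 - exp(-B * (x - 0)))
}

# vapply errors out unless every result is a numeric vector of length 10
A <- vapply(L_inf_range, function(L) fun(1:10, L_inf = L), numeric(10))
dim(A)

# same matrix: rows follow x, columns follow L_inf
A2 <- outer(1:10, L_inf_range, fun)
all.equal(A, A2)
```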

merge.xts on specific columns over a for loop

You're on the right track by using an environment. Good job!

You want to extract the adjusted close column from each symbol and merge them into one xts object. Your for loop would work, but there's an easier way: loop over the objects in the environment with lapply(), using quantmod's Ad() function to extract the adjusted close column from each. Then use do.call(merge, ...) to call merge() on the output of lapply().

merged_prices <- do.call(merge, lapply(e, Ad))

You can merge the adjusted close for GSPC at this point (merge(Ad(GSPC), merged_prices)), or you could include it in your list of tickers.


