Efficiently merging two data frames on a non-trivial criteria
A data table solution: a rolling join to fulfill the first inequality, followed by a vector scan to satisfy the second inequality. The join-on-first-inequality will have more rows than the final result (and therefore may run into memory issues), but it will be smaller than a straight-up merge in this answer.
require(data.table)
genes_start <- as.data.table(genes)
## create the start bound as a separate column to join to
genes_start[,`:=`(start_bound = start - 10)]
setkey(genes_start, chromosome, start_bound)
markers <- as.data.table(markers)
setkey(markers, chromosome, position)
new <- genes_start[
##join genes to markers
markers,
##rolling the last key column of genes_start (start_bound) forward
##to match the last key column of markers (position)
roll = Inf,
##inner join
nomatch = 0
##rolling join leaves positions column from markers
##with the column name from genes_start (start_bound)
##now vector scan to fulfill the other criterion
][start_bound <= end + 10]
##change names and column order to match desired result in question
setnames(new,"start_bound","position")
setcolorder(new,c("chromosome","gene","start","end","marker","position"))
# chromosome gene start end marker position
# 1: 1 b 100 200 1 105
# 2: 1 b 100 200 9 120
# 3: 1 b 100 200 5 150
# 4: 2 a 100 200 3 96
# 5: 2 a 100 200 4 206
# 6: 3 e 321 567 6 400
One could do a double join, but as it involves re-keying the data table before the second join, I don't think that it will be faster than the vector scan solution above.
##makes a copy of the genes object and keys it by end
genes_end <- as.data.table(genes)
genes_end[,`:=`(end_bound = end + 10, start = NULL, end = NULL)]
setkey(genes_end, chromosome, gene, end_bound)
## as before, wrapped in a similar join (but rolling backwards this time)
new_2 <- genes_end[
setkey(
genes_start[
markers,
roll = Inf,
nomatch = 0
], chromosome, gene, start_bound),
roll = -Inf,
nomatch = 0
]
setnames(new2, "end_bound", "position")
Merge pandas dataframes where one value is between two others
As you say, this is pretty easy in SQL, so why not do it in SQL?
import pandas as pd
import sqlite3
#We'll use firelynx's tables:
presidents = pd.DataFrame({"name": ["Bush", "Obama", "Trump"],
"president_id":[43, 44, 45]})
terms = pd.DataFrame({'start_date': pd.date_range('2001-01-20', periods=5, freq='48M'),
'end_date': pd.date_range('2005-01-21', periods=5, freq='48M'),
'president_id': [43, 43, 44, 44, 45]})
war_declarations = pd.DataFrame({"date": [datetime(2001, 9, 14), datetime(2003, 3, 3)],
"name": ["War in Afghanistan", "Iraq War"]})
#Make the db in memory
conn = sqlite3.connect(':memory:')
#write the tables
terms.to_sql('terms', conn, index=False)
presidents.to_sql('presidents', conn, index=False)
war_declarations.to_sql('wars', conn, index=False)
qry = '''
select
start_date PresTermStart,
end_date PresTermEnd,
wars.date WarStart,
presidents.name Pres
from
terms join wars on
date between start_date and end_date join presidents on
terms.president_id = presidents.president_id
'''
df = pd.read_sql_query(qry, conn)
df:
PresTermStart PresTermEnd WarStart Pres
0 2001-01-31 00:00:00 2005-01-31 00:00:00 2001-09-14 00:00:00 Bush
1 2001-01-31 00:00:00 2005-01-31 00:00:00 2003-03-03 00:00:00 Bush
dplyr left_join by less than, greater than condition
Use a filter
. (But note that this answer does not produce a correct LEFT JOIN
; but the MWE gives the right result with an INNER JOIN
instead.)
The dplyr
package isn't happy if asked merge two tables without something to merge on, so in the following, I make a dummy variable in both tables for this purpose, then filter, then drop dummy
:
fdata %>%
mutate(dummy=TRUE) %>%
left_join(sdata %>% mutate(dummy=TRUE)) %>%
filter(fyear >= byear, fyear < eyear) %>%
select(-dummy)
And note that if you do this in PostgreSQL (for example), the query optimizer sees through the dummy
variable as evidenced by the following two query explanations:
> fdata %>%
+ mutate(dummy=TRUE) %>%
+ left_join(sdata %>% mutate(dummy=TRUE)) %>%
+ filter(fyear >= byear, fyear < eyear) %>%
+ select(-dummy) %>%
+ explain()
Joining by: "dummy"
<SQL>
SELECT "id" AS "id", "fyear" AS "fyear", "byear" AS "byear", "eyear" AS "eyear", "val" AS "val"
FROM (SELECT * FROM (SELECT "id", "fyear", TRUE AS "dummy"
FROM "fdata") AS "zzz136"
LEFT JOIN
(SELECT "byear", "eyear", "val", TRUE AS "dummy"
FROM "sdata") AS "zzz137"
USING ("dummy")) AS "zzz138"
WHERE "fyear" >= "byear" AND "fyear" < "eyear"
<PLAN>
Nested Loop (cost=0.00..50886.88 rows=322722 width=40)
Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
-> Seq Scan on fdata (cost=0.00..28.50 rows=1850 width=16)
-> Materialize (cost=0.00..33.55 rows=1570 width=24)
-> Seq Scan on sdata (cost=0.00..25.70 rows=1570 width=24)
and doing it more cleanly with SQL gives exactly the same result:
> tbl(pg, sql("
+ SELECT *
+ FROM fdata
+ LEFT JOIN sdata
+ ON fyear >= byear AND fyear < eyear")) %>%
+ explain()
<SQL>
SELECT "id", "fyear", "byear", "eyear", "val"
FROM (
SELECT *
FROM fdata
LEFT JOIN sdata
ON fyear >= byear AND fyear < eyear) AS "zzz140"
<PLAN>
Nested Loop Left Join (cost=0.00..50886.88 rows=322722 width=40)
Join Filter: ((fdata.fyear >= sdata.byear) AND (fdata.fyear < sdata.eyear))
-> Seq Scan on fdata (cost=0.00..28.50 rows=1850 width=16)
-> Materialize (cost=0.00..33.55 rows=1570 width=24)
-> Seq Scan on sdata (cost=0.00..25.70 rows=1570 width=24)
Data selection error
Your code seems to be completely backwards to what you're trying to achieve:
"For each gene (in d2) which SNPs (from d1) are within 10kb of that gene?"
First of all, your code for d1$matched
is backwards. All your p
's and d2
s should be the other way round (currently it doesn't make much sense?), giving you a list of SNPs whom are in cis with each gene (+/- 10kb).
I would approach it the way i've phrased your question:
cisWindow <- 10000 # size of your +/- window, in this case 10kb.
d3 <- data.frame()
# For each gene, locate the cis-SNPs
for (i in 1:nrow(d2)) {
# Broken down into steps for readability.
inCis <- d1[which(d1[,"CHR"] == d2[i, "chromosome"]),]
inCis <- inCis[which(inCis[,"POS"] >= (d2[i, "start"] - cisWindow)),]
inCis <- inCis[which(inCis[,"POS"] <= (d2[i, "end"] + cisWindow)),]
# Now we have the cis-SNPs, so lets build the data.frame for this gene,
# and grow our data.frame d3:
if (nrow(inCis) > 0) {
d3 <- rbind(d3, cbind(d2[i,], inCis))
}
}
I tried to find a solution which didn't involve growing d3
in the loop, but because you're attaching each row of d2
to 0 or more rows from d1
I wasn't able to come up with a solution that's not horribly inefficient.
Inner_join with two conditions and interval within interval condition
Using data.table, you can perform operations while joining. Here is an example
library(data.table)
df2[df1, # left join
.(time_condition = sam1 > time1 & sam2 < time2), # condition while joining
on = .(key1, key2), # keys
by = .EACHI, # check condition per join
nomatch = 0L] # make it an inner join
# key1 key2 time_condition
# 1: a 1 TRUE
# 2: b 2 FALSE
# your data generated using data.table
df1 <- data.table(key1 = c("a", "b", "c", "d", "e"),
key2 = c(1:5),
time1 = as.ITime(c("00:00:15", "00:15:15", "00:30:15", "00:40:15", "01:10:15")),
time2 = as.ITime(c("00:05:15", "00:20:15", "00:35:15", "00:45:15", "01:15:15")))
df2 <- data.table(key1 = c("b", "c", "a", "e", "d"),
key2 = c(2, 6, 1, 8, 5),
sam1 = as.ITime(c("00:21:15", "00:31:15", "00:03:15", "01:20:15", "00:43:15")),
sam2 = as.ITime(c("00:23:15", "00:34:15", "00:04:15", "01:25:15", "00:44:15")))
How to join (merge) data frames (inner, outer, left, right)
By using the merge
function and its optional parameters:
Inner join: merge(df1, df2)
will work for these examples because R automatically joins the frames by common variable names, but you would most likely want to specify merge(df1, df2, by = "CustomerId")
to make sure that you were matching on only the fields you desired. You can also use the by.x
and by.y
parameters if the matching variables have different names in the different data frames.
Outer join: merge(x = df1, y = df2, by = "CustomerId", all = TRUE)
Left outer: merge(x = df1, y = df2, by = "CustomerId", all.x = TRUE)
Right outer: merge(x = df1, y = df2, by = "CustomerId", all.y = TRUE)
Cross join: merge(x = df1, y = df2, by = NULL)
Just as with the inner join, you would probably want to explicitly pass "CustomerId" to R as the matching variable. I think it's almost always best to explicitly state the identifiers on which you want to merge; it's safer if the input data.frames change unexpectedly and easier to read later on.
You can merge on multiple columns by giving by
a vector, e.g., by = c("CustomerId", "OrderId")
.
If the column names to merge on are not the same, you can specify, e.g., by.x = "CustomerId_in_df1", by.y = "CustomerId_in_df2"
where CustomerId_in_df1
is the name of the column in the first data frame and CustomerId_in_df2
is the name of the column in the second data frame. (These can also be vectors if you need to merge on multiple columns.)
Related Topics
Make Sequential Numeric Column Names Prefixed with a Letter
What Evaluates to True/False in R
How to Deal with Spaces in Column Names
Downloading Png from Shiny (R)
R: Extracting "Clean" Utf-8 Text from a Web Page Scraped with Rcurl
How to Facet a Plot_Ly() Chart
Plot Data Over Background Image with Ggplot
Convert Factor to Date/Time in R
R Shiny Table Not Rendering HTML
How to Detect Free Variable Names in R Functions
How to Generalize Outer to N Dimensions
R Sum a Variable by Two Groups
Force Ggplot Legend to Show All Categories When No Values Are Present
How to Resolve the "No Font Name" Issue When Importing Fonts into R Using Extrafont