Can't reproduce R data.table::foverlaps in Python
To do interval overlaps in Python, simply use pyranges:
import pyranges as pr
c1 = """Chromosome Start End Gene
1 10 20 blo
1 45 46 bla"""
c2 = """Chromosome Start End Gene
1 10 35 bip
1 25 50 P53
1 40 10000 boop"""
gr1, gr2 = pr.from_string(c1), pr.from_string(c2)
j = gr1.join(gr2)
# +--------------+-----------+-----------+------------+-----------+-----------+------------+
# | Chromosome | Start | End | Gene | Start_b | End_b | Gene_b |
# | (category) | (int32) | (int32) | (object) | (int32) | (int32) | (object) |
# |--------------+-----------+-----------+------------+-----------+-----------+------------|
# | 1 | 10 | 20 | blo | 10 | 35 | bip |
# | 1 | 45 | 46 | bla | 25 | 50 | P53 |
# | 1 | 45 | 46 | bla | 40 | 10000 | boop |
# +--------------+-----------+-----------+------------+-----------+-----------+------------+
# Unstranded PyRanges object has 3 rows and 7 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.
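If installing pyranges is not an option, the same overlap join can be sketched in plain pandas (a sketch, assuming the usual half-open convention where Start is inclusive and End exclusive): merge on Chromosome to pair every interval in one frame with every interval in the other, then filter on the overlap condition. This reproduces the three matches shown above.

```python
import pandas as pd

df1 = pd.DataFrame({"Chromosome": [1, 1], "Start": [10, 45],
                    "End": [20, 46], "Gene": ["blo", "bla"]})
df2 = pd.DataFrame({"Chromosome": [1, 1, 1], "Start": [10, 25, 40],
                    "End": [35, 50, 10000], "Gene": ["bip", "P53", "boop"]})

# Pair every interval with every interval on the same chromosome,
# then keep only the pairs whose intervals actually overlap
j = df1.merge(df2, on="Chromosome", suffixes=("", "_b"))
j = j[(j["Start"] < j["End_b"]) & (j["End"] > j["Start_b"])].reset_index(drop=True)
```

Note the quadratic blow-up of the cross-pairing: fine for small frames, but pyranges uses interval trees and scales far better on genomic-sized data.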
How to join two dataframes for which column values are within a certain range?
One simple solution is to create an IntervalIndex from the start and end columns with closed='both', then use get_loc to look up the event for each timestamp (this assumes all the date/time columns are of datetime dtype).
df_2.index = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
df_1['event'] = df_1['timestamp'].apply(lambda x: df_2.iloc[df_2.index.get_loc(x)]['event'])
Output:
timestamp A B event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
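Since df_1 and df_2 are not shown above, here is a self-contained version of the same IntervalIndex lookup; the event intervals are invented to be consistent with the output table (an assumption, not the asker's actual data):

```python
import pandas as pd

# Hypothetical event intervals, chosen to match the output above
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:34",
                             "2016-05-14 10:54:37"]),
    "end":   pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:36",
                             "2016-05-14 10:54:40"]),
    "event": ["E1", "E2", "E3"],
})
df_1 = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2016-05-14 10:54:33", "2016-05-14 10:54:34", "2016-05-14 10:54:35",
     "2016-05-14 10:54:36", "2016-05-14 10:54:39"])})

# Index df_2 by closed intervals, then find each timestamp's interval
df_2.index = pd.IntervalIndex.from_arrays(df_2["start"], df_2["end"], closed="both")
df_1["event"] = df_1["timestamp"].apply(
    lambda x: df_2.iloc[df_2.index.get_loc(x)]["event"])
```

Note that get_loc raises a KeyError for timestamps that fall in no interval, and returns multiple positions if intervals overlap, so this approach assumes non-overlapping intervals that cover every timestamp.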
Merge pandas dataframes where one value is between two others
As you say, this is pretty easy in SQL, so why not do it in SQL?
import pandas as pd
import sqlite3
from datetime import datetime

# We'll use firelynx's tables:
presidents = pd.DataFrame({"name": ["Bush", "Obama", "Trump"],
"president_id":[43, 44, 45]})
terms = pd.DataFrame({'start_date': pd.date_range('2001-01-20', periods=5, freq='48M'),
'end_date': pd.date_range('2005-01-21', periods=5, freq='48M'),
'president_id': [43, 43, 44, 44, 45]})
war_declarations = pd.DataFrame({"date": [datetime(2001, 9, 14), datetime(2003, 3, 3)],
"name": ["War in Afghanistan", "Iraq War"]})
#Make the db in memory
conn = sqlite3.connect(':memory:')
#write the tables
terms.to_sql('terms', conn, index=False)
presidents.to_sql('presidents', conn, index=False)
war_declarations.to_sql('wars', conn, index=False)
qry = '''
select
start_date PresTermStart,
end_date PresTermEnd,
wars.date WarStart,
presidents.name Pres
from
terms join wars on
date between start_date and end_date join presidents on
terms.president_id = presidents.president_id
'''
df = pd.read_sql_query(qry, conn)
df:
PresTermStart PresTermEnd WarStart Pres
0 2001-01-31 00:00:00 2005-01-31 00:00:00 2001-09-14 00:00:00 Bush
1 2001-01-31 00:00:00 2005-01-31 00:00:00 2003-03-03 00:00:00 Bush
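The same between-join can also be done without leaving pandas, using a cross merge followed by a between filter (a sketch with the term dates written out explicitly; how="cross" requires pandas >= 1.2):

```python
import pandas as pd
from datetime import datetime

presidents = pd.DataFrame({"name": ["Bush", "Obama", "Trump"],
                           "president_id": [43, 44, 45]})
terms = pd.DataFrame({"start_date": pd.to_datetime(["2001-01-31", "2005-01-31", "2009-01-31"]),
                      "end_date":   pd.to_datetime(["2005-01-31", "2009-01-31", "2013-01-31"]),
                      "president_id": [43, 43, 44]})
wars = pd.DataFrame({"date": [datetime(2001, 9, 14), datetime(2003, 3, 3)],
                     "name": ["War in Afghanistan", "Iraq War"]})

# Pair every term with every war, keep wars that fall inside the term,
# then attach the president's name
df = terms.merge(wars, how="cross").merge(presidents, on="president_id",
                                          suffixes=("_war", "_pres"))
df = df[df["date"].between(df["start_date"], df["end_date"])].reset_index(drop=True)
```

As with any cross join, this materializes every term/war pair before filtering, so the SQL route may be preferable for large inputs.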
Matching values between data frames based on overlapping dates
You may consider data.table, which allows for "non-equi joins", i.e. joins based on >=, >, <= and <. In the same call, aggregate operations may be performed on the groups in the LHS data set that each row in the RHS data set (i) matches (by = .EACHI).
d1[d2, on = .(id = id, end >= begin),
.(i.begin, i.end, val_str = toString(val)), by = .EACHI]
# id end i.begin i.end val_str
# 1: 5 2017-03-03 2017-03-03 2017-03-05 Cat3, Cat1
# 2: 6 2017-05-03 2017-05-03 2017-05-05 NA
# 3: 8 2017-03-03 2017-03-03 2017-03-05 NA
# 4: 10 2017-12-05 2017-12-05 2017-12-06 Cat4
Data preparation:
d1 <- data.frame(id = c(5, 10, 5, 10),
begin = as.Date(c('1990-3-1','1993-12-1','1991-3-1','1995-12-5')),
end = as.Date(c('2017-3-10','2017-12-2','2017-3-3','2017-12-10')),
val = c("Cat1", "Cat2", "Cat3", "Cat4"))
d2 <- data.frame(id = c(5, 6, 8, 10),
begin = as.Date(c('2017-3-3','2017-5-3','2017-3-3','2017-12-5')),
end = as.Date(c('2017-3-5','2017-5-5','2017-3-5','2017-12-6')))
library(data.table)
setDT(d1)
setDT(d2)
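For comparison, a rough pandas translation of the non-equi join plus per-match aggregation above (a sketch only: pandas has no native non-equi join, so this merges on id and filters afterwards, and the order of values inside val_str may differ from data.table's output):

```python
import pandas as pd

d1 = pd.DataFrame({"id": [5, 10, 5, 10],
                   "begin": pd.to_datetime(["1990-03-01", "1993-12-01",
                                            "1991-03-01", "1995-12-05"]),
                   "end": pd.to_datetime(["2017-03-10", "2017-12-02",
                                          "2017-03-03", "2017-12-10"]),
                   "val": ["Cat1", "Cat2", "Cat3", "Cat4"]})
d2 = pd.DataFrame({"id": [5, 6, 8, 10],
                   "begin": pd.to_datetime(["2017-03-03", "2017-05-03",
                                            "2017-03-03", "2017-12-05"]),
                   "end": pd.to_datetime(["2017-03-05", "2017-05-05",
                                          "2017-03-05", "2017-12-06"])})

# Left-join d1 onto each d2 row by id, blank out vals failing end >= begin,
# then collect the surviving vals per d2 row
m = d2.reset_index().merge(d1, on="id", how="left", suffixes=("", "_d1"))
matched = m["val"].where(m["end_d1"] >= m["begin"])
d2["val_str"] = matched.groupby(m["index"]).apply(
    lambda s: ", ".join(s.dropna()) if s.notna().any() else None).values
```

Like the cross-join trick earlier, the intermediate merge can get large when ids repeat many times, which is exactly where data.table's non-equi join shines.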
Coding Matrix with overlap counts in R
You can use the foverlaps function in the data.table package to create an edgelist and then turn this into a weighted adjacency matrix. (See here.)
Using a combination of data.table, dplyr, and igraph, I think this gets you what you want (can't verify without data, though).
First, you set your data frame as a data table and set the key for index1 and index2. Then, foverlaps identifies entries where index1 and index2 have any overlap. After eliminating self-overlaps, replace the ids generated by foverlaps with the corresponding codenames from the data set. This creates an edgelist. Pass this edgelist to igraph to create an igraph object and return it as an adjacency matrix.
require(igraph); require(data.table); require(dplyr)
el <- setkey(setDT(coding_table), filename, index1, index2) %>%
foverlaps(., ., type="any", which=TRUE) %>%
.[coding_table$codename[xid] != coding_table$codename[yid]] %>%
.[, `:=`(xid = coding_table$codename[xid], yid = coding_table$codename[yid])]
m <- as.matrix(get.adjacency(graph.data.frame(el)))
Of course, dplyr is totally optional; the piping just makes it a bit neater and avoids creating more objects in the environment.
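In Python, the same overlap-counting idea can be sketched without foverlaps by comparing interval pairs directly (hypothetical toy data, since the original coding_table isn't shown; fine for small tables, though foverlaps' keyed binary search scales much better):

```python
from collections import Counter
from itertools import combinations

# Hypothetical coding table: (filename, codename, index1, index2) rows
rows = [("f1", "A", 1, 5), ("f1", "B", 3, 8),
        ("f1", "C", 7, 10), ("f1", "A", 9, 12)]

# Count overlaps between different codenames within the same file
counts = Counter()
for (f1, n1, s1, e1), (f2, n2, s2, e2) in combinations(rows, 2):
    if f1 == f2 and n1 != n2 and s1 <= e2 and s2 <= e1:
        counts[frozenset((n1, n2))] += 1

# Symmetric adjacency matrix of overlap counts, rows/cols in codename order
names = sorted({r[1] for r in rows})
matrix = [[0 if a == b else counts.get(frozenset((a, b)), 0)
           for b in names] for a in names]
```

This is the O(n^2) brute-force counterpart of the foverlaps/igraph pipeline; the frozenset keys make the counts direction-free, matching the symmetric adjacency matrix.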