R Foverlaps Equivalent in Python

Can't reproduce R data.table::foverlaps in Python

To do interval overlaps in Python, simply use pyranges:

import pyranges as pr

c1 = """Chromosome Start End Gene
1 10 20 blo
1 45 46 bla"""

c2 = """Chromosome Start End Gene
1 10 35 bip
1 25 50 P53
1 40 10000 boop"""

gr1, gr2 = pr.from_string(c1), pr.from_string(c2)

j = gr1.join(gr2)
# +--------------+-----------+-----------+------------+-----------+-----------+------------+
# | Chromosome   |     Start |       End | Gene       |   Start_b |     End_b | Gene_b     |
# | (category)   |   (int32) |   (int32) | (object)   |   (int32) |   (int32) | (object)   |
# |--------------+-----------+-----------+------------+-----------+-----------+------------|
# | 1            |        10 |        20 | blo        |        10 |        35 | bip        |
# | 1            |        45 |        46 | bla        |        25 |        50 | P53        |
# | 1            |        45 |        46 | bla        |        40 |     10000 | boop       |
# +--------------+-----------+-----------+------------+-----------+-----------+------------+
# Unstranded PyRanges object has 3 rows and 7 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.
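
If you would rather not pull in an extra dependency, here is a plain-pandas sketch of the same overlap join. It reuses the c1/c2 strings above, merges on Chromosome, and keeps pairs whose intervals overlap; the half-open [Start, End) convention in the filter is an assumption chosen to mirror the pyranges result.

import pandas as pd
from io import StringIO

# Parse the whitespace-separated strings defined above.
df1 = pd.read_csv(StringIO(c1), sep=r"\s+")
df2 = pd.read_csv(StringIO(c2), sep=r"\s+")

# Cross join per chromosome, then keep pairs whose intervals overlap
# (half-open [Start, End) coordinates assumed, as in BED-style data).
m = df1.merge(df2, on="Chromosome", suffixes=("", "_b"))
overlaps = m[(m["Start"] < m["End_b"]) & (m["End"] > m["Start_b"])]
print(overlaps)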

How to join two dataframes for which column values are within a certain range?

One simple solution is to create an IntervalIndex from start and end with closed='both', then use get_loc to look up the event (this assumes all the datetimes have a timestamp dtype):

import pandas as pd

df_2.index = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
df_1['event'] = df_1['timestamp'].apply(lambda x: df_2.iloc[df_2.index.get_loc(x)]['event'])

Output:

            timestamp         A         B  event
0 2016-05-14 10:54:33  0.020228  0.026572     E1
1 2016-05-14 10:54:34  0.057780  0.175499     E2
2 2016-05-14 10:54:35  0.098808  0.620986     E2
3 2016-05-14 10:54:36  0.158789  1.014819     E2
4 2016-05-14 10:54:39  0.038129  2.384590     E3
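
For a self-contained version, here is a minimal sketch with made-up df_1/df_2 stand-ins (not the question's data) showing the same IntervalIndex lookup end to end:

import pandas as pd

# Made-up stand-ins for the question's frames, for illustration only.
df_1 = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:34",
                                 "2016-05-14 10:54:39"]),
    "A": [0.02, 0.06, 0.04],
})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:34",
                             "2016-05-14 10:54:37"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:36",
                           "2016-05-14 10:54:39"]),
    "event": ["E1", "E2", "E3"],
})

# Index df_2 by the closed intervals [start, end], then look up each timestamp.
df_2.index = pd.IntervalIndex.from_arrays(df_2["start"], df_2["end"], closed="both")
df_1["event"] = df_1["timestamp"].apply(lambda x: df_2.iloc[df_2.index.get_loc(x)]["event"])
print(df_1)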

Merge pandas dataframes where one value is between two others

As you say, this is pretty easy in SQL, so why not do it in SQL?

import pandas as pd
import sqlite3
from datetime import datetime

# We'll use firelynx's tables:
presidents = pd.DataFrame({"name": ["Bush", "Obama", "Trump"],
                           "president_id": [43, 44, 45]})
terms = pd.DataFrame({'start_date': pd.date_range('2001-01-20', periods=5, freq='48M'),
                      'end_date': pd.date_range('2005-01-21', periods=5, freq='48M'),
                      'president_id': [43, 43, 44, 44, 45]})
war_declarations = pd.DataFrame({"date": [datetime(2001, 9, 14), datetime(2003, 3, 3)],
                                 "name": ["War in Afghanistan", "Iraq War"]})

# Make the db in memory
conn = sqlite3.connect(':memory:')

# Write the tables
terms.to_sql('terms', conn, index=False)
presidents.to_sql('presidents', conn, index=False)
war_declarations.to_sql('wars', conn, index=False)

qry = '''
    select
        start_date PresTermStart,
        end_date PresTermEnd,
        wars.date WarStart,
        presidents.name Pres
    from
        terms join wars on
            date between start_date and end_date
        join presidents on
            terms.president_id = presidents.president_id
    '''
df = pd.read_sql_query(qry, conn)

df:

        PresTermStart          PresTermEnd             WarStart  Pres
0 2001-01-31 00:00:00  2005-01-31 00:00:00  2001-09-14 00:00:00  Bush
1 2001-01-31 00:00:00  2005-01-31 00:00:00  2003-03-03 00:00:00  Bush
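
If you would rather skip SQLite, a pandas-only sketch of the same between-join is below: merge_asof attaches the latest term that started on or before each war date, a filter enforces the end_date bound, and a regular merge adds the president's name. This assumes the terms don't overlap, since merge_asof keeps only the closest match.

import pandas as pd

# Sketch only: both inputs must be sorted on their asof keys.
wars_terms = pd.merge_asof(
    war_declarations.sort_values("date"),
    terms.sort_values("start_date"),
    left_on="date", right_on="start_date", direction="backward")

# merge_asof only enforces start_date <= date, so drop rows past end_date,
# then attach the president's name.
result = (wars_terms[wars_terms["date"] <= wars_terms["end_date"]]
          .merge(presidents, on="president_id", suffixes=("_war", "_pres")))
print(result)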

Matching values between data frames based on overlapping dates

You may consider data.table, which allows "non-equi joins", i.e. joins based on >=, >, <= and <. In the same call, aggregate operations may be performed on the groups in the LHS data set that each row in the RHS data set (i) matches (by = .EACHI).

d1[d2, on = .(id = id, end >= begin),
   .(i.begin, i.end, val_str = toString(val)), by = .EACHI]

#    id        end    i.begin      i.end    val_str
# 1:  5 2017-03-03 2017-03-03 2017-03-05 Cat3, Cat1
# 2:  6 2017-05-03 2017-05-03 2017-05-05         NA
# 3:  8 2017-03-03 2017-03-03 2017-03-05         NA
# 4: 10 2017-12-05 2017-12-05 2017-12-06       Cat4

Data preparation:

d1 <- data.frame(id = c(5, 10, 5, 10),
                 begin = as.Date(c('1990-3-1','1993-12-1','1991-3-1','1995-12-5')),
                 end = as.Date(c('2017-3-10','2017-12-2','2017-3-3','2017-12-10')),
                 val = c("Cat1", "Cat2", "Cat3", "Cat4"))

d2 <- data.frame(id = c(5, 6, 8, 10),
                 begin = as.Date(c('2017-3-3','2017-5-3','2017-3-3','2017-12-5')),
                 end = as.Date(c('2017-3-5','2017-5-5','2017-3-5','2017-12-6')))

library(data.table)
setDT(d1)
setDT(d2)
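
Since this thread started from Python, roughly the same result can be sketched in pandas: merge on id, keep d1 rows whose end falls on or after d2's begin, then collapse the matching val values per d2 row. The column names below mirror the R frames; the data is re-typed here purely for illustration.

import pandas as pd

# pandas sketch of the non-equi join above; column names mirror the R frames.
d1 = pd.DataFrame({
    "id": [5, 10, 5, 10],
    "begin": pd.to_datetime(["1990-03-01", "1993-12-01", "1991-03-01", "1995-12-05"]),
    "end": pd.to_datetime(["2017-03-10", "2017-12-02", "2017-03-03", "2017-12-10"]),
    "val": ["Cat1", "Cat2", "Cat3", "Cat4"],
})
d2 = pd.DataFrame({
    "id": [5, 6, 8, 10],
    "begin": pd.to_datetime(["2017-03-03", "2017-05-03", "2017-03-03", "2017-12-05"]),
    "end": pd.to_datetime(["2017-03-05", "2017-05-05", "2017-03-05", "2017-12-06"]),
})

# Join on id, keep d1 rows whose end is on or after d2's begin,
# then concatenate the matching vals per d2 row (like toString(val)).
# Caveat: a d2 row whose d1 partners all fail the condition is dropped here,
# whereas data.table's join would keep it with NA.
m = d2.merge(d1, on="id", how="left", suffixes=("", "_d1"))
m = m[m["end_d1"].isna() | (m["end_d1"] >= m["begin"])]
out = (m.groupby(["id", "begin", "end"], as_index=False)
         .agg(val_str=("val", lambda s: ", ".join(s.dropna()))))
print(out)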

Coding Matrix with overlap counts in R

You can use the foverlaps function in the data.table package to create an edgelist and then turn this into a weighted adjacency matrix.

Using a combination of data.table, dplyr, and igraph, I think this gets you what you want (can't verify without data, though).

First, you set your data frame as a data.table and set the key for index1 and index2. Then, foverlaps identifies entries where index1 and index2 have any overlap. After eliminating self-overlaps, replace the ids generated by foverlaps with the corresponding codenames from the data set. This creates an edgelist. Pass this edgelist to igraph to create an igraph object and return it as an adjacency matrix.

require(igraph); require(data.table); require(dplyr)

el <- setkey(setDT(coding_table), filename, index1, index2) %>%
  foverlaps(., ., type="any", which=TRUE) %>%
  .[coding_table$codename[xid] != coding_table$codename[yid]] %>%
  .[, `:=`(xid = coding_table$codename[xid], yid = coding_table$codename[yid])]

m <- as.matrix(get.adjacency(graph.data.frame(el)))

Of course, dplyr is totally optional; the piping just makes it a bit neater and avoids creating more objects in the environment.
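
If you end up doing this last step in Python instead, a rough pandas analogue of turning the edgelist into a weighted adjacency matrix is a crosstab over the xid/yid columns. The el frame below is a hypothetical stand-in, since the original coding_table isn't shown.

import pandas as pd

# Hypothetical edgelist with one row per overlapping pair of codenames,
# standing in for the `el` object built with foverlaps above.
el = pd.DataFrame({"xid": ["A", "A", "B", "C", "C"],
                   "yid": ["B", "C", "A", "A", "B"]})

# Count how often each (xid, yid) pair overlaps -> weighted adjacency matrix.
nodes = sorted(set(el["xid"]) | set(el["yid"]))
adj = (pd.crosstab(el["xid"], el["yid"])
         .reindex(index=nodes, columns=nodes, fill_value=0))
print(adj)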


