Overlap Join With Start and End Positions

Overlap join with start and end positions

Overlap joins was implemented with commit 1375 in data.table v1.9.3, and is available in the current stable release, v1.9.4. The function is called foverlaps. From NEWS:

29) Overlap joins #528 is now here, finally!! Except for type="equal" and maxgap and minoverlap arguments, everything else is implemented. Check out ?foverlaps and the examples there on its usage. This is a major feature addition to data.table.

Let's consider x, an interval defined as [a, b], where a <= b, and y, another interval defined as [c, d], where c <= d. The interval y is said to overlap x at all, iff d >= a and c <= b 1. And y is entirely contained within x, iff a <= c,d <= b 2. For the different types of overlaps implemented, please have a look at ?foverlaps.

Your question is a special case of an overlap join: in d1 you have true physical intervals with start and end positions. In d2 on the other hand, there are only positions (pos), not intervals. To be able to do an overlap join, we need to create intervals also in d2. This is achieved by creating an additional variable pos2, which is identical to pos (d2[, pos2 := pos]). Thus, we now have an interval in d2, albeit with identical start and end coordinates. This 'virtual, zero-width interval' in d2 can then be used in foverlap to do an overlap join with d1:

require(data.table) ## 1.9.3
setkey(d1)
d2[, pos2 := pos]
foverlaps(d2, d1, by.x = names(d2), type = "within", mult = "all", nomatch = 0L)
#    x start end pos pos2
# 1: a     1   3   2    2
# 2: a     1   3   3    3
# 3: c    19  22  20   20
# 4: e     7  25  10   10

by.y by default is key(y), so we skipped it. by.x by default takes key(x) if it exists, and if not takes key(y). But a key doesn't exist for d2, and we can't set the columns from y, because they don't have the same names. So, we set by.x explicitly.

The type of overlap is within, and we'd like to have all matches, only if there is a match.

NB: foverlaps uses data.table's binary search feature (along with roll where necessary) under the hood, but some function arguments (types of overlaps, maxgap, minoverlap etc..) are inspired by the function findOverlaps() from the Bioconductor package IRanges, an excellent package (and so is GenomicRanges, which extends IRanges for Genomics).

So what's the advantage?

A benchmark on the code above on your data results in foverlaps() slower than Gabor's answer (Timings: Gabor's data.table solution = 0.004 vs foverlaps = 0.021 seconds). But does it really matter at this granularity?

What would be really interesting is to see how well it scales - in terms of both speed and memory. In Gabor's answer, we join based on the key column x. And then filter the results.

What if d1 has about 40K rows and d2 has a 100K rows (or more)? For each row in d2 that matches x in d1, all those rows will be matched and returned, only to be filtered later. Here's an example of your Q scaled only slightly:

Generate data:

require(data.table)
set.seed(1L)
n = 20e3L; k = 100e3L
idx1 = sample(100, n, TRUE)
idx2 = sample(100, n, TRUE)
d1 = data.table(x = sample(letters[1:5], n, TRUE), 
                start = pmin(idx1, idx2), 
                end = pmax(idx1, idx2))

d2 = data.table(x = sample(letters[1:15], k, TRUE), 
                pos1 = sample(60:150, k, TRUE))

foverlaps:

system.time({
    setkey(d1)
    d2[, pos2 := pos1]
    ans1 = foverlaps(d2, d1, by.x=1:3, type="within", nomatch=0L)
})
# user  system elapsed 
#   3.028   0.635   3.745

This took ~ 1GB of memory in total, out of which ans1 is 420MB. Most of the time spent here is on subset really. You can check it by setting the argument verbose=TRUE.

Gabor's solutions:

## new session - data.table solution
system.time({
    setkey(d1, x)
    ans2 <- d1[d2, allow.cartesian=TRUE, nomatch=0L][between(pos1, start, end)]
})
#   user  system elapsed 
# 15.714   4.424  20.324

And this took a total of ~3.5GB.

I just noted that Gabor already mentions the memory required for intermediate results. So, trying out sqldf:

# new session - sqldf solution
system.time(ans3 <- sqldf("select * from d1 join 
            d2 using (x) where pos1 between start and end"))
#   user  system elapsed 
# 73.955   1.605  77.049

Took a total of ~1.4GB. So, it definitely uses less memory than the one shown above.

[The answers were verified to be identical after removing pos2 from ans1 and setting key on both answers.]

Note that this overlap join is designed with problems where d2 doesn't necessarily have identical start and end coordinates (ex: genomics, the field where I come from, where d2 is usually about 30-150 million or more rows).

foverlaps() is stable, but is still under development, meaning some arguments and names might get changed.

NB: Since I mentioned GenomicRanges above, it is also perfectly capable of solving this problem. It uses interval trees under the hood, and is quite memory efficient as well. In my benchmarks on genomics data, foverlaps() is faster. But that's for another (blog) post, some other time.

How to match rows within in a range of another dataset

For a data.table solution, you should have looked at the second answer by Arun on non-equi joins in the link provided by @Henrik.
Overlap join with start and end positions

Based on that, we have

library(data.table)

df1 <- data.table(Chromosome=1:3,Position=c(101,101,600),
                  Start=c(101,101,600),End=c(101,101,600))

df2 <- data.table(Chromosome=c(1,1,4),
                  Start=c(50,300,100),End=c(200,400,200),CpG=c(10,2,5))

df1[df2,.(Chromosome,Position=x.Position,Start,End,CpG),
    on=.(Chromosome,Position>=Start,Position<=End),nomatch=0L]

giving

       Chromosome Position Start End CpG
1:              1      101   101 101  10

That's not quite right because it takes Start and End from df1 rather than df2. Why do you even have Start and End in df1?

One way to deal with that is to not include them in the join statement:

df1[,.(Chromosome,Position)][df2,
    .(Chromosome,Position=x.Position,Start,End,CpG),
   on=.(Chromosome,Position>=Start,Position<=End),nomatch=0L]

giving

   Chromosome Position Start End CpG
1:          1      101    50 200  10

[EDIT to note that @Carles Sans Fuentes identified the same issue in his dplyr answer.]

As a check on cases with more matches, I added some more data:

 df1 <- data.table(Chromosome=c(1,1:4),Position=c(350,101,101,600,200),
                       Start=c(350,101,101,600,200),End=c(350,101,101,600,200))
    
    df1
       Chromosome Position Start End
    1:          1      350   350 350
    2:          1      101   101 101
    3:          2      101   101 101
    4:          3      600   600 600
    5:          4      200   200 200
    
    
    
        df1[,.(Chromosome,Position)][df2,
            .(Chromosome,Position=x.Position,Start,End,CpG),
           on=.(Chromosome,Position>=Start,Position<=End),nomatch=0L]
    
       Chromosome Position Start End CpG
    1:          1      101    50 200  10
    2:          1      350   300 400   2
    3:          4      200   100 200   5

Which I guess to be what you'd want.

Join within date range

Your logic is not as simple as "between", since it appears that you want any kind of overlap, whether a superset or otherwise. For that, we need a slightly different query (and should include ID on the left-join as well, I'm inferring).

sqldf::sqldf("
  select h.*, o.DT_START_OUT, o.DT_END_OUT
  from HOSP h
    left join OUT o on h.ID = o.ID
      and h.DT_START_HOSP < o.DT_END_OUT
      and h.DT_END_HOSP > o.DT_START_OUT")
#    ID DT_START_HOSP DT_END_HOSP DT_START_OUT DT_END_OUT
# 1 111    2021-01-07  2021-01-10         <NA>       <NA>
# 2 222    2021-01-11  2021-01-20   2021-01-15 2021-01-15
# 3 333    2021-01-21  2021-01-25         <NA>       <NA>
# 4 444    2021-01-21  2021-02-01         <NA>       <NA>
# 5 555    2021-01-21  2021-01-29         <NA>       <NA>
# 6 666    2021-01-22  2021-02-02   2021-01-25 2021-01-25
# 7 666    2021-01-22  2021-02-02   2021-01-28 2021-01-30

(Thank you for fixing your data from the previous question and the first draft of this one. For the record, you may want this, some handy code that deals well with vectors of dates in inconsistent/different formats.)

In R how to merge two dataframe according same variable and the restriction of date period

You can use dplyr::left_join

library(dplyr)
table_a %>%
  left_join(table_b, by='category')%>%
  filter(date>=start_date & date<=end_date) %>% 
    select(date, category, start_date, seller)

R - more effective left_join

Another data.table solution using non-equijoins:

library(data.table)

setDT(df_measurements)
setDT(df_limits) 

df_limits[df_measurements, .(station_id, measurement_id, value, Title),
          on=.(station_id = station_id, limit_from < value, limit_to >= value)]

   station_id measurement_id value        Title
 1:          1       12121534   172 Level_3_High
 2:          1       12121618    87  Level_2_Low
 3:          1       12121703     9  Level_3_Low
 4:          2       12121709    80  Level_2_Low
 5:          2       12121760    80  Level_2_Low
 6:          2       12121813   115 Level_1_High
 7:          3       12121881    67  Level_3_Low
 8:          3       12121907   100  Level_1_Low
 9:          3       12121920   108      Optimal
10:          1       12121979   102      Optimal
11:          1       12121995    53  Level_3_Low
12:          1       12122022    77  Level_2_Low
13:          2       12122065   158 Level_3_High
14:          2       12122107   144 Level_2_High
15:          2       12122113     5  Level_3_Low
16:          3       12122135   100  Level_1_Low
17:          3       12122187   136 Level_2_High
18:          3       12122267   130 Level_1_High
19:          1       12122359   105      Optimal
20:          1       12122366   126 Level_1_High
21:          1       12122398   143 Level_2_High

any recommend for finding the intersection of two continuous variable in r

Something like the following determines if one segment in the real line defined by the endpoints Start and End overlaps another segment. The row combinations are created with combn and an anonymous function is applied to each combination.

`%overlaps%` <- function(X, Y){
  f <- function(x, y){
    a1 <- x[1] <= y[1] && y[1] <= x[2]
    a2 <- x[1] <= y[2] && y[2] <= x[2]
    a1 || a2
  }
  f(X, Y) || f(Y, X)
}

combn(1:nrow(d1), 2, function(x) {
  d1[x[1], ] %overlaps% d1[x[2], ]
})
#[1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE

Overlap Join With Start and End Positions