## Overlap join with start and end positions

Overlap joins was implemented with commit 1375 in data.table v1.9.3, and is available in the current stable release, v1.9.4. The function is called `foverlaps`

. From NEWS:

29)

`Overlap joins`

#528 is now here, finally!! Except for`type="equal"`

and`maxgap`

and`minoverlap`

arguments, everything else is implemented. Check out`?foverlaps`

and the examples there on its usage. This is a major feature addition to`data.table`

.

Let's consider x, an interval defined as `[a, b]`

, where `a <= b`

, and y, another interval defined as `[c, d]`

, where `c <= d`

. The interval y is said to *overlap* x at all, iff `d >= a`

**and** `c <= b`

1. And y is entirely contained *within* x, iff `a <= c,d <= b`

2. For the different types of overlaps implemented, please have a look at `?foverlaps`

.

Your question is a special case of an overlap join: in `d1`

you have true physical intervals with `start`

and `end`

positions. In `d2`

on the other hand, there are only positions (`pos`

), not intervals. To be able to do an overlap join, we need to create intervals also in `d2`

. This is achieved by creating an additional variable `pos2`

, which is identical to `pos`

(`d2[, pos2 := pos]`

). Thus, we now have an interval in `d2`

, albeit with identical *start* and *end* coordinates. This 'virtual, zero-width interval' in `d2`

can then be used in `foverlap`

to do an overlap join with `d1`

:

`require(data.table) ## 1.9.3`

setkey(d1)

d2[, pos2 := pos]

foverlaps(d2, d1, by.x = names(d2), type = "within", mult = "all", nomatch = 0L)

# x start end pos pos2

# 1: a 1 3 2 2

# 2: a 1 3 3 3

# 3: c 19 22 20 20

# 4: e 7 25 10 10

`by.y`

by default is `key(y)`

, so we skipped it. `by.x`

by default takes `key(x)`

if it exists, and if not takes `key(y)`

. But a key doesn't exist for `d2`

, and we can't set the columns from `y`

, because they don't have the same names. So, we set `by.x`

explicitly.

The *type of overlap* is *within*, and we'd like to have *all* matches, only if there is a match.

NB: `foverlaps`

uses data.table's binary search feature (along with `roll`

where necessary) under the hood, but some function arguments (types of overlaps, maxgap, minoverlap etc..) are inspired by the function `findOverlaps()`

from the Bioconductor package `IRanges`

, an excellent package (and so is `GenomicRanges`

, which extends `IRanges`

for Genomics).

So what's the advantage?

A benchmark on the code above on your data results in `foverlaps()`

slower than Gabor's answer (Timings: Gabor's data.table solution = 0.004 vs foverlaps = 0.021 seconds). But does it really matter at this granularity?

What would be really interesting is to see how well it scales - in terms of both *speed* and *memory*. In Gabor's answer, we join based on the key column `x`

. *And then* filter the results.

What if `d1`

has about 40K rows and `d2`

has a 100K rows (or more)? For *each row* in `d2`

that matches `x`

in `d1`

, *all* those rows will be matched and returned, only to be filtered later. Here's an example of your Q scaled only slightly:

### Generate data:

`require(data.table)`

set.seed(1L)

n = 20e3L; k = 100e3L

idx1 = sample(100, n, TRUE)

idx2 = sample(100, n, TRUE)

d1 = data.table(x = sample(letters[1:5], n, TRUE),

start = pmin(idx1, idx2),

end = pmax(idx1, idx2))

d2 = data.table(x = sample(letters[1:15], k, TRUE),

pos1 = sample(60:150, k, TRUE))

### foverlaps:

`system.time({`

setkey(d1)

d2[, pos2 := pos1]

ans1 = foverlaps(d2, d1, by.x=1:3, type="within", nomatch=0L)

})

# user system elapsed

# 3.028 0.635 3.745

This took ~ 1GB of memory in total, out of which `ans1`

is 420MB. Most of the time spent here is on subset really. You can check it by setting the argument `verbose=TRUE`

.

### Gabor's solutions:

`## new session - data.table solution`

system.time({

setkey(d1, x)

ans2 <- d1[d2, allow.cartesian=TRUE, nomatch=0L][between(pos1, start, end)]

})

# user system elapsed

# 15.714 4.424 20.324

And this took a total of ~3.5GB.

I just noted that Gabor already mentions the memory required for intermediate results. So, trying out `sqldf`

:

`# new session - sqldf solution`

system.time(ans3 <- sqldf("select * from d1 join

d2 using (x) where pos1 between start and end"))

# user system elapsed

# 73.955 1.605 77.049

Took a total of ~1.4GB. So, it definitely uses less memory than the one shown above.

[The answers were verified to be identical after removing `pos2`

from `ans1`

and setting key on both answers.]

Note that this overlap join is designed with problems where `d2`

doesn't necessarily have identical start and end coordinates (ex: genomics, the field where I come from, where `d2`

is usually about 30-150 million or more rows).

`foverlaps()`

is stable, but is still under development, meaning some arguments and names might get changed.

NB: Since I mentioned `GenomicRanges`

above, it is also perfectly capable of solving this problem. It uses interval trees under the hood, and is quite memory efficient as well. In my benchmarks on genomics data, `foverlaps()`

is faster. But that's for another (blog) post, some other time.

## How to match rows within in a range of another dataset

For a `data.table`

solution, you should have looked at the second answer by Arun on non-equi joins in the link provided by @Henrik.

Overlap join with start and end positions

Based on that, we have

`library(data.table)`

df1 <- data.table(Chromosome=1:3,Position=c(101,101,600),

Start=c(101,101,600),End=c(101,101,600))

df2 <- data.table(Chromosome=c(1,1,4),

Start=c(50,300,100),End=c(200,400,200),CpG=c(10,2,5))

df1[df2,.(Chromosome,Position=x.Position,Start,End,CpG),

on=.(Chromosome,Position>=Start,Position<=End),nomatch=0L]

giving

` Chromosome Position Start End CpG`

1: 1 101 101 101 10

That's not quite right because it takes `Start`

and `End`

from `df1`

rather than `df2`

. Why do you even have `Start`

and `End`

in `df1`

?

One way to deal with that is to not include them in the join statement:

`df1[,.(Chromosome,Position)][df2,`

.(Chromosome,Position=x.Position,Start,End,CpG),

on=.(Chromosome,Position>=Start,Position<=End),nomatch=0L]

giving

` Chromosome Position Start End CpG`

1: 1 101 50 200 10

[EDIT to note that @Carles Sans Fuentes identified the same issue in his `dplyr`

answer.]

As a check on cases with more matches, I added some more data:

` df1 <- data.table(Chromosome=c(1,1:4),Position=c(350,101,101,600,200),`

Start=c(350,101,101,600,200),End=c(350,101,101,600,200))

df1

Chromosome Position Start End

1: 1 350 350 350

2: 1 101 101 101

3: 2 101 101 101

4: 3 600 600 600

5: 4 200 200 200

df1[,.(Chromosome,Position)][df2,

.(Chromosome,Position=x.Position,Start,End,CpG),

on=.(Chromosome,Position>=Start,Position<=End),nomatch=0L]

Chromosome Position Start End CpG

1: 1 101 50 200 10

2: 1 350 300 400 2

3: 4 200 100 200 5

Which I guess to be what you'd want.

## Join within date range

Your logic is not as simple as "between", since it appears that you want any kind of overlap, whether a superset or otherwise. For that, we need a slightly different query (and should include `ID`

on the left-join as well, I'm inferring).

`sqldf::sqldf("`

select h.*, o.DT_START_OUT, o.DT_END_OUT

from HOSP h

left join OUT o on h.ID = o.ID

and h.DT_START_HOSP < o.DT_END_OUT

and h.DT_END_HOSP > o.DT_START_OUT")

# ID DT_START_HOSP DT_END_HOSP DT_START_OUT DT_END_OUT

# 1 111 2021-01-07 2021-01-10 <NA> <NA>

# 2 222 2021-01-11 2021-01-20 2021-01-15 2021-01-15

# 3 333 2021-01-21 2021-01-25 <NA> <NA>

# 4 444 2021-01-21 2021-02-01 <NA> <NA>

# 5 555 2021-01-21 2021-01-29 <NA> <NA>

# 6 666 2021-01-22 2021-02-02 2021-01-25 2021-01-25

# 7 666 2021-01-22 2021-02-02 2021-01-28 2021-01-30

(Thank you for fixing your data from the previous question and the first draft of this one. For the record, you may want this, some handy code that deals well with vectors of dates in inconsistent/different formats.)

## In R how to merge two dataframe according same variable and the restriction of date period

You can use `dplyr::left_join`

`library(dplyr)`

table_a %>%

left_join(table_b, by='category')%>%

filter(date>=start_date & date<=end_date) %>%

select(date, category, start_date, seller)

## R - more effective left_join

Another `data.table`

solution using non-equijoins:

`library(data.table)`

setDT(df_measurements)

setDT(df_limits)

df_limits[df_measurements, .(station_id, measurement_id, value, Title),

on=.(station_id = station_id, limit_from < value, limit_to >= value)]

station_id measurement_id value Title

1: 1 12121534 172 Level_3_High

2: 1 12121618 87 Level_2_Low

3: 1 12121703 9 Level_3_Low

4: 2 12121709 80 Level_2_Low

5: 2 12121760 80 Level_2_Low

6: 2 12121813 115 Level_1_High

7: 3 12121881 67 Level_3_Low

8: 3 12121907 100 Level_1_Low

9: 3 12121920 108 Optimal

10: 1 12121979 102 Optimal

11: 1 12121995 53 Level_3_Low

12: 1 12122022 77 Level_2_Low

13: 2 12122065 158 Level_3_High

14: 2 12122107 144 Level_2_High

15: 2 12122113 5 Level_3_Low

16: 3 12122135 100 Level_1_Low

17: 3 12122187 136 Level_2_High

18: 3 12122267 130 Level_1_High

19: 1 12122359 105 Optimal

20: 1 12122366 126 Level_1_High

21: 1 12122398 143 Level_2_High

## any recommend for finding the intersection of two continuous variable in r

Something like the following determines if one segment in the real line defined by the endpoints `Start`

and `End`

overlaps another segment. The row combinations are created with `combn`

and an anonymous function is applied to each combination.

``%overlaps%` <- function(X, Y){`

f <- function(x, y){

a1 <- x[1] <= y[1] && y[1] <= x[2]

a2 <- x[1] <= y[2] && y[2] <= x[2]

a1 || a2

}

f(X, Y) || f(Y, X)

}

combn(1:nrow(d1), 2, function(x) {

d1[x[1], ] %overlaps% d1[x[2], ]

})

#[1] TRUE FALSE TRUE TRUE TRUE TRUE

### Related Topics

Rstudio Suddenly Stopped Showing Plots in the Plot Pane

Break Dataframe into Smaller Dataframe'S and Save Them

How to Remove the Negative Values from a Data Frame in R

Extract Rows for the First Occurrence of a Variable in a Data Frame

Reshaping Multiple Sets of Measurement Columns (Wide Format) into Single Columns (Long Format)

Reshape Three Column Data Frame to Matrix ("Long" to "Wide" Format)

How to Read Data When Some Numbers Contain Commas as Thousand Separator

Ggplot2 - Annotate Outside of Plot

Replace Values in a Dataframe Based on Lookup Table

Replace Na With Previous or Next Value, by Group, Using Dplyr

Adding Value from One Data.Frame to Another Data.Frame by Matching a Variable

How to Change the Default Colors in Plotly Chart

R - Getting Characters After Symbol

Choose the Top Five Values from Each Group in R

Order Bars in Ggplot2 Bar Graph

Finding All Duplicate Rows, Including "Elements With Smaller Subscripts"

What Specifically Are the Dangers of Eval(Parse(...))

How to Use R'S Ellipsis Feature When Writing Your Own Function