Non-Equi Join Using Data.Table: Column Missing from the Output

Non-equi join using data.table: column missing from the output

In data.table, joins of the form x[i] traditionally uses values from i but uses column names from x. Even though this is different from SQL which returns both, this default makes a lot of sense for equi joins since we are interested in all rows from i and if they match then both data.tables have equal values anyway, and if they don't we need to keep those unmatched values from i in result.

But for non-equi joins, since the values might not match exactly, i.e., can fall within a range, there might be cases where we will have to return outputs similar to SQL (or identify such cases and return the result user expects, similar to the case of equi joins). This hasn't been done yet, but I've placed a provision for it at the moment, which is to refer to the columns explicitly with a x. prefix. It is not convenient, I agree. Hopefully this'll be taken care of automatically soon.

Here's how to get your result using x. prefix.

ans <- DT2[DT1, .(CERT_NUMBER, FORENAME, SURNAME, x.START_DATE, x.EXPIRY_DATE, ID, MONTH), 
         on=.(FORENAME, SURNAME, START_DATE <= MONTH, EXPIRY_DATE >= MONTH)]

IIRC there's an issue filed on the GitHub project page about this as well.

unexpected behavior non-equi join data.table R

I think the behavior is the expected, except for the variable being named quite surprisingly. I made a little tweak to your sample data to show that everything is alright:

dt1 <- data.table(foo = 1:4, bar = letters[1:4])
dt2 <- data.table(foo1 = 2:4, bar1 = letters[2:4]) # small change here

dt1[dt2, on = .(foo < foo1), allow.cartesian = TRUE][dt1, on = "bar"]
    foo bar bar1 i.foo
 1:   2   a    b     1
 2:   3   a    c     1
 3:   4   a    d     1
 4:   5   a    e     1
 5:   3   b    c     2
 6:   4   b    d     2
 7:   5   b    e     2
 8:   4   c    d     3
 9:   5   c    e     3
10:   5   d    e     4

For me the behavior is the expected, it's just that the first column is named foo instead of foo1. That's why you see fool1 as an erroneous copy of "erroneous foo: it's actually a copy offoo1`.

EDIT: a possible workaround:

This isn't really elegant, but is a workaround:

dt1[dt2, .(foo = x.foo, foo1, bar, bar1), on = .(foo < foo1), allow.cartesian = TRUE]
    foo foo1 bar bar1
 1:   1    2   a    b
 2:   1    3   a    c
 3:   2    3   b    c
 4:   1    4   a    d
 5:   2    4   b    d
 6:   3    4   c    d
 7:   1    5   a    e
 8:   2    5   b    e
 9:   3    5   c    e
10:   4    5   d    e

x.foo retains the original, true foo. foo1 still is what it is, so you can return both variables.

non-equi joins adding all columns of range table in data.table in one step

Since you want results for every row of a, you should do a join like b[a, ...]:

b[a, on=.(LB <= salary, UB > salary), nomatch=0, 
  .(Company_ID, salary, cat, LB = x.LB, UB = x.UB, rep)]

   Company_ID salary cat   LB   UB   rep
1:          1   2000   1    0 3000   Bob
2:          1   3000   2 3000 5000 Alice
3:          1   4000   2 3000 5000 Alice

nomatch=0 means we'll drop rows of a that are unmatched in b.
We need to explicitly refer to the UB and LB columns from b using the x.* prefix (coming from the ?data.table docs, where the arguments are named like x[i]).

Regarding the strange default cols, there is an open issue to change that behavior: #1615.

(Issue #1989, referenced below, is fixed now -- See Uwe's answer.)

Alternately... One way that should work and avoids explicitly listing all columns: add a's columns to b, then subset b:

b[a, on=.(LB <= salary, UB > salary), names(a) := mget(paste0("i.", names(a)))] 
b[b[a, on=.(LB <= salary, UB > salary), which=TRUE, nomatch=0]]

There are two problems with this. First, there's a bug causing non-equi join to break when confronted with mget (#1989). The temporary workaround is to enumerate a's columns:

b[a, on=.(LB <= salary, UB > salary), `:=`(Company_ID = i.Company_ID, salary = i.salary)] 
b[b[a, on=.(LB <= salary, UB > salary), which=TRUE, nomatch=0]]

Second, it's inefficient to do this join twice (once for := and a second time for which), but I can't see any way around that... maybe justifying a feature request to allow both j and which?

How to keep join column unchanged in data.table non-equi join?

You could also use %inrange%:

df.cov[posn %inrange% df.exons]

which results in:

   posn att
1:    1   a
2:    2   b
3:    3   c
4:  165   d

As you can see this leaves the values of the posn-column unchanged.

Another possiblity with a non-equi join:

df.exons[df.cov
         , on = .(start <= posn, end >= posn)
         , mult = "first"
         , nomatch = 0
         , .(posn = i.posn, att)][]

Column name labelling in data.table joins

I ended up answering my own question.

data_table_tidy_join <- function(x,y, join_by){

    x <- data.table(x)
    y <- data.table(y)

    # Determine single join names
    single_join_names <- purrr::keep((stringr::str_split(join_by, "==|>=|<=")), ~length(.) == 1) %>% unlist()

    # cols from x that won't require as matching in i
    remove_from_x_names <- c(trimws(na.omit(stringr::str_extract(join_by, ".*(?=[=]{2})"))), single_join_names)

    # names need to keep
    x_names_keep_raw <- names(x)[!names(x) %in% remove_from_x_names]
    y_names_keep_raw <- names(y)

    # cols that exist in both x and y, but not being equi joined on
    cols_rename_index <- x_names_keep_raw[x_names_keep_raw %in% y_names_keep_raw]

    #rename so indexing works
    x_names_keep <- x_names_keep_raw
    y_names_keep <- y_names_keep_raw

    # give prefix to necessary vars
    x_names_keep[x_names_keep %in% cols_rename_index] <- paste("x.",cols_rename_index, sep ="")
    y_names_keep[y_names_keep %in% cols_rename_index] <- paste("i.",cols_rename_index, sep ="")

    # implement data.table call, keeping required cols
    joined_data <-
        x[y, on = join_by,
          mget(c(paste0("i.", y_names_keep_raw),paste0("x.", x_names_keep_raw))) %>% set_names(c(y_names_keep,x_names_keep)),
          mult = "all", allow.cartesian = TRUE, nomatch = NA] %>%
        as_tibble()

    return(joined_data)

}

> x <- data.table(Id  = c("A", "B", "C", "C"),
+                  X1  = c(1L, 3L, 5L, 7L),
+                  X2 = c(8L,12L,9L,18L),
+                  XY  = c("x2", "x4", "x6", "x8"))
>  
> z <- data.table(ID = "C", Z1 = 5:9, Z2 = paste0("z", 5:9))
>   
> data_table_tidy_join(x, z, join_by = c("Id == ID","X1 <= Z1", "X2 >= Z1"))
# A tibble: 8 x 6
  ID       Z1 Z2       X1    X2 XY   
  <chr> <int> <chr> <int> <int> <chr>
1 C         5 z5        5     9 x6   
2 C         6 z6        5     9 x6   
3 C         7 z7        5     9 x6   
4 C         7 z7        7    18 x8   
5 C         8 z8        5     9 x6   
6 C         8 z8        7    18 x8   
7 C         9 z9        5     9 x6   
8 C         9 z9        7    18 x8

Non-Equi Join Using Data.Table: Column Missing from the Output