Non-equi join using data.table: column missing from the output
In data.table, joins of the form x[i]
traditionally uses values from i
but uses column names from x
. Even though this is different from SQL which returns both, this default makes a lot of sense for equi joins since we are interested in all rows from i
and if they match then both data.tables have equal values anyway, and if they don't we need to keep those unmatched values from i
in result.
But for non-equi joins, since the values might not match exactly, i.e., can fall within a range, there might be cases where we will have to return outputs similar to SQL (or identify such cases and return the result user expects, similar to the case of equi joins). This hasn't been done yet, but I've placed a provision for it at the moment, which is to refer to the columns explicitly with a x.
prefix. It is not convenient, I agree. Hopefully this'll be taken care of automatically soon.
Here's how to get your result using x.
prefix.
ans <- DT2[DT1, .(CERT_NUMBER, FORENAME, SURNAME, x.START_DATE, x.EXPIRY_DATE, ID, MONTH),
on=.(FORENAME, SURNAME, START_DATE <= MONTH, EXPIRY_DATE >= MONTH)]
IIRC there's an issue filed on the GitHub project page about this as well.
unexpected behavior non-equi join data.table R
I think the behavior is the expected, except for the variable being named quite surprisingly. I made a little tweak to your sample data to show that everything is alright:
dt1 <- data.table(foo = 1:4, bar = letters[1:4])
dt2 <- data.table(foo1 = 2:4, bar1 = letters[2:4]) # small change here
dt1[dt2, on = .(foo < foo1), allow.cartesian = TRUE][dt1, on = "bar"]
foo bar bar1 i.foo
1: 2 a b 1
2: 3 a c 1
3: 4 a d 1
4: 5 a e 1
5: 3 b c 2
6: 4 b d 2
7: 5 b e 2
8: 4 c d 3
9: 5 c e 3
10: 5 d e 4
For me the behavior is the expected, it's just that the first column is named foo
instead of foo1
. That's why you see fool1
as an erroneous copy of "erroneous foo: it's actually a copy of
foo1`.
EDIT: a possible workaround:
This isn't really elegant, but is a workaround:
dt1[dt2, .(foo = x.foo, foo1, bar, bar1), on = .(foo < foo1), allow.cartesian = TRUE]
foo foo1 bar bar1
1: 1 2 a b
2: 1 3 a c
3: 2 3 b c
4: 1 4 a d
5: 2 4 b d
6: 3 4 c d
7: 1 5 a e
8: 2 5 b e
9: 3 5 c e
10: 4 5 d e
x.foo
retains the original, true foo
. foo1
still is what it is, so you can return both variables.
non-equi joins adding all columns of range table in data.table in one step
Since you want results for every row of a
, you should do a join like b[a, ...]
:
b[a, on=.(LB <= salary, UB > salary), nomatch=0,
.(Company_ID, salary, cat, LB = x.LB, UB = x.UB, rep)]
Company_ID salary cat LB UB rep
1: 1 2000 1 0 3000 Bob
2: 1 3000 2 3000 5000 Alice
3: 1 4000 2 3000 5000 Alice
nomatch=0
means we'll drop rows ofa
that are unmatched inb
.- We need to explicitly refer to the
UB
andLB
columns fromb
using thex.*
prefix (coming from the?data.table
docs, where the arguments are named likex[i]
).
Regarding the strange default cols, there is an open issue to change that behavior: #1615.
(Issue #1989, referenced below, is fixed now -- See Uwe's answer.)
Alternately... One way that should work and avoids explicitly listing all columns: add a
's columns to b
, then subset b
:
b[a, on=.(LB <= salary, UB > salary), names(a) := mget(paste0("i.", names(a)))]
b[b[a, on=.(LB <= salary, UB > salary), which=TRUE, nomatch=0]]
There are two problems with this. First, there's a bug causing non-equi join to break when confronted with mget
(#1989). The temporary workaround is to enumerate a
's columns:
b[a, on=.(LB <= salary, UB > salary), `:=`(Company_ID = i.Company_ID, salary = i.salary)]
b[b[a, on=.(LB <= salary, UB > salary), which=TRUE, nomatch=0]]
Second, it's inefficient to do this join twice (once for :=
and a second time for which
), but I can't see any way around that... maybe justifying a feature request to allow both j
and which
?
How to keep join column unchanged in data.table non-equi join?
You could also use %inrange%
:
df.cov[posn %inrange% df.exons]
which results in:
posn att
1: 1 a
2: 2 b
3: 3 c
4: 165 d
As you can see this leaves the values of the posn
-column unchanged.
Another possiblity with a non-equi join:
df.exons[df.cov
, on = .(start <= posn, end >= posn)
, mult = "first"
, nomatch = 0
, .(posn = i.posn, att)][]
Column name labelling in data.table joins
I ended up answering my own question.
data_table_tidy_join <- function(x,y, join_by){
x <- data.table(x)
y <- data.table(y)
# Determine single join names
single_join_names <- purrr::keep((stringr::str_split(join_by, "==|>=|<=")), ~length(.) == 1) %>% unlist()
# cols from x that won't require as matching in i
remove_from_x_names <- c(trimws(na.omit(stringr::str_extract(join_by, ".*(?=[=]{2})"))), single_join_names)
# names need to keep
x_names_keep_raw <- names(x)[!names(x) %in% remove_from_x_names]
y_names_keep_raw <- names(y)
# cols that exist in both x and y, but not being equi joined on
cols_rename_index <- x_names_keep_raw[x_names_keep_raw %in% y_names_keep_raw]
#rename so indexing works
x_names_keep <- x_names_keep_raw
y_names_keep <- y_names_keep_raw
# give prefix to necessary vars
x_names_keep[x_names_keep %in% cols_rename_index] <- paste("x.",cols_rename_index, sep ="")
y_names_keep[y_names_keep %in% cols_rename_index] <- paste("i.",cols_rename_index, sep ="")
# implement data.table call, keeping required cols
joined_data <-
x[y, on = join_by,
mget(c(paste0("i.", y_names_keep_raw),paste0("x.", x_names_keep_raw))) %>% set_names(c(y_names_keep,x_names_keep)),
mult = "all", allow.cartesian = TRUE, nomatch = NA] %>%
as_tibble()
return(joined_data)
}
> x <- data.table(Id = c("A", "B", "C", "C"),
+ X1 = c(1L, 3L, 5L, 7L),
+ X2 = c(8L,12L,9L,18L),
+ XY = c("x2", "x4", "x6", "x8"))
>
> z <- data.table(ID = "C", Z1 = 5:9, Z2 = paste0("z", 5:9))
>
> data_table_tidy_join(x, z, join_by = c("Id == ID","X1 <= Z1", "X2 >= Z1"))
# A tibble: 8 x 6
ID Z1 Z2 X1 X2 XY
<chr> <int> <chr> <int> <int> <chr>
1 C 5 z5 5 9 x6
2 C 6 z6 5 9 x6
3 C 7 z7 5 9 x6
4 C 7 z7 7 18 x8
5 C 8 z8 5 9 x6
6 C 8 z8 7 18 x8
7 C 9 z9 5 9 x6
8 C 9 z9 7 18 x8
Related Topics
"Correct" Way to Specifiy Optional Arguments in R Functions
Adding Space Between Bars in Ggplot2
Specification of First and Last Tick Marks with Scale_X_Date
Function to Calculate R2 (R-Squared) in R
Filter Data Frame Rows Based on Values in Vector
Return Index from a Vector of the Value Closest to a Given Element
How to Calculate Combination and Permutation in R
Why Is Allow.Cartesian Required at Times When When Joining Data.Tables with Duplicate Keys
Differencebetween Assign() and <<- in R
Ggplot2 Pie and Donut Chart on Same Plot
Load Multiple Packages at Once
Using a Pre-Defined Color Palette in Ggplot
Performing Dplyr Mutate on Subset of Columns
How to Read Only Lines That Fulfil a Condition from a CSV into R
Making a Stacked Bar Plot for Multiple Variables - Ggplot2 in R
How to Count the Frequency of a String for Each Row in R
Avoid Ggplot Sorting the X-Axis While Plotting Geom_Bar()
How to Connect Two Coordinates with a Line Using Leaflet in R