Use 'J' to Select the Join Column of 'X' and All Its Non-Join Columns

Use `j` to select the join column of `x` and all its non-join columns

PS I have considered switching the x and i data sets, and the conditions in on. Although that produces the desired join values, it still requires post-processing (deleting, renaming and reordering of columns).

The amount of post processing is limited by how many on= cols there are:

d2[d1, on=.(grp, from <= val, to >= val), nomatch=0][, 
`:=`(val = from, from = NULL, to = NULL)][]

That doesn't seem too bad.


Following @Jaap's comment, here's another way, adding columns to d1 with an update join:

nm2 = setdiff(names(d2), c("from","to","grp"))
d1[d2, on=.(grp, val >= from, val <= to), (nm2) := mget(sprintf("i.%s", nm2))]

This makes sense here because the desired output is essentially d1 plus some columns from d2 (since each row of d1 matches at most one row of d2).

What does stand for in data.table joins with on=

When doing a non-equi join like X[Y, on = .(A < A)] data.table returns the A-column from Y (the i-data.table).

To get the desired result, you could do:

X[Y, on = .(A < A), .(A = x.A, B)]

which gives:

   A B
1: 1 1
2: 2 1
3: 3 1

In the next release, data.table will return both A columns. See here for the discussion.

How to use specify and extract variables from X and Y in the non-equi join X[Y, ...]?

To prevent confusion i would suggest to rename columns that have the same name in both data.tables and create copies of the columns of non-equi join columns.

setnames(tmp_dt2, "time", "time_dt2")
tmp_dt2[, c("time_from_join", "time_to_join"):=list(time_from, time_to)]
tmp_dt1[ , time_join := time]

Then we can join and then throw away all the temporary columns that data.table messes with in the non-equi join.

tmp_dt1[tmp_dt2, on=.(grp==grp, time_join >= time_from_join, time_join <= time_to_join)][ , c("grp", "time", "time_from", "time_to", "time_dt2")]

grp time time_from time_to time_dt2
1: 1 0.1079436 0.1 0.3 0.1
2: 1 0.1216919 0.1 0.3 0.1
3: 1 0.1255551 0.1 0.3 0.1
4: 1 0.1433044 0.1 0.3 0.1
...

unexpected behavior non-equi join data.table R

I think the behavior is the expected, except for the variable being named quite surprisingly. I made a little tweak to your sample data to show that everything is alright:

dt1 <- data.table(foo = 1:4, bar = letters[1:4])
dt2 <- data.table(foo1 = 2:4, bar1 = letters[2:4]) # small change here

dt1[dt2, on = .(foo < foo1), allow.cartesian = TRUE][dt1, on = "bar"]
foo bar bar1 i.foo
1: 2 a b 1
2: 3 a c 1
3: 4 a d 1
4: 5 a e 1
5: 3 b c 2
6: 4 b d 2
7: 5 b e 2
8: 4 c d 3
9: 5 c e 3
10: 5 d e 4

For me the behavior is the expected, it's just that the first column is named foo instead of foo1. That's why you see fool1 as an erroneous copy of "erroneous foo: it's actually a copy offoo1`.

EDIT: a possible workaround:

This isn't really elegant, but is a workaround:

dt1[dt2, .(foo = x.foo, foo1, bar, bar1), on = .(foo < foo1), allow.cartesian = TRUE]
foo foo1 bar bar1
1: 1 2 a b
2: 1 3 a c
3: 2 3 b c
4: 1 4 a d
5: 2 4 b d
6: 3 4 c d
7: 1 5 a e
8: 2 5 b e
9: 3 5 c e
10: 4 5 d e

x.foo retains the original, true foo. foo1 still is what it is, so you can return both variables.

How to keep join column unchanged in data.table non-equi join?

You could also use %inrange%:

df.cov[posn %inrange% df.exons]

which results in:

   posn att
1: 1 a
2: 2 b
3: 3 c
4: 165 d

As you can see this leaves the values of the posn-column unchanged.


Another possiblity with a non-equi join:

df.exons[df.cov
, on = .(start <= posn, end >= posn)
, mult = "first"
, nomatch = 0
, .(posn = i.posn, att)][]

Column name labelling in data.table joins

I ended up answering my own question.

data_table_tidy_join <- function(x,y, join_by){

x <- data.table(x)
y <- data.table(y)

# Determine single join names
single_join_names <- purrr::keep((stringr::str_split(join_by, "==|>=|<=")), ~length(.) == 1) %>% unlist()

# cols from x that won't require as matching in i
remove_from_x_names <- c(trimws(na.omit(stringr::str_extract(join_by, ".*(?=[=]{2})"))), single_join_names)

# names need to keep
x_names_keep_raw <- names(x)[!names(x) %in% remove_from_x_names]
y_names_keep_raw <- names(y)

# cols that exist in both x and y, but not being equi joined on
cols_rename_index <- x_names_keep_raw[x_names_keep_raw %in% y_names_keep_raw]

#rename so indexing works
x_names_keep <- x_names_keep_raw
y_names_keep <- y_names_keep_raw

# give prefix to necessary vars
x_names_keep[x_names_keep %in% cols_rename_index] <- paste("x.",cols_rename_index, sep ="")
y_names_keep[y_names_keep %in% cols_rename_index] <- paste("i.",cols_rename_index, sep ="")

# implement data.table call, keeping required cols
joined_data <-
x[y, on = join_by,
mget(c(paste0("i.", y_names_keep_raw),paste0("x.", x_names_keep_raw))) %>% set_names(c(y_names_keep,x_names_keep)),
mult = "all", allow.cartesian = TRUE, nomatch = NA] %>%
as_tibble()

return(joined_data)

}

> x <- data.table(Id = c("A", "B", "C", "C"),
+ X1 = c(1L, 3L, 5L, 7L),
+ X2 = c(8L,12L,9L,18L),
+ XY = c("x2", "x4", "x6", "x8"))
>
> z <- data.table(ID = "C", Z1 = 5:9, Z2 = paste0("z", 5:9))
>
> data_table_tidy_join(x, z, join_by = c("Id == ID","X1 <= Z1", "X2 >= Z1"))
# A tibble: 8 x 6
ID Z1 Z2 X1 X2 XY
<chr> <int> <chr> <int> <int> <chr>
1 C 5 z5 5 9 x6
2 C 6 z6 5 9 x6
3 C 7 z7 5 9 x6
4 C 7 z7 7 18 x8
5 C 8 z8 5 9 x6
6 C 8 z8 7 18 x8
7 C 9 z9 5 9 x6
8 C 9 z9 7 18 x8

Join within date range

Your logic is not as simple as "between", since it appears that you want any kind of overlap, whether a superset or otherwise. For that, we need a slightly different query (and should include ID on the left-join as well, I'm inferring).

sqldf::sqldf("
select h.*, o.DT_START_OUT, o.DT_END_OUT
from HOSP h
left join OUT o on h.ID = o.ID
and h.DT_START_HOSP < o.DT_END_OUT
and h.DT_END_HOSP > o.DT_START_OUT")
# ID DT_START_HOSP DT_END_HOSP DT_START_OUT DT_END_OUT
# 1 111 2021-01-07 2021-01-10 <NA> <NA>
# 2 222 2021-01-11 2021-01-20 2021-01-15 2021-01-15
# 3 333 2021-01-21 2021-01-25 <NA> <NA>
# 4 444 2021-01-21 2021-02-01 <NA> <NA>
# 5 555 2021-01-21 2021-01-29 <NA> <NA>
# 6 666 2021-01-22 2021-02-02 2021-01-25 2021-01-25
# 7 666 2021-01-22 2021-02-02 2021-01-28 2021-01-30

(Thank you for fixing your data from the previous question and the first draft of this one. For the record, you may want this, some handy code that deals well with vectors of dates in inconsistent/different formats.)

R data.table - simple way to join on all columns without specifying column names

I think that the way to do this in the data table is the following:

require(data.table)
dt1 <- data.table(A1 = c(1,2,3), A2 = c("A", "B", "D"))
dt2 <- data.table(A1 = c(3,2,3), A2 = c("A", "B", "C"))

setkeyv(dt1, names(dt1))
setkeyv(dt2, names(dt2))

and the inner join on all common columns is:

dt1[dt2, nomatch = 0]

Other options include the following (credits to Frank in the comments):

dt1[dt2, on=names(dt2), nomatch = 0]

This has the benefit of not requiring to key the data table. (More info can be found here: What is the purpose of setting a key in data.table? )


Another option using the data sets operations (available in version 1.9.7 or later):

fintersect(dt1, dt2)

Select which column I receive the results from in data.table join - R

We need the by argument

tab2[tab1, val := get(column), on = .(z), by = .EACHI]

-output

> tab2
z column val
1: 1 x A
2: 2 x B
3: 3 y G
4: 4 y H


Related Topics



Leave a reply



Submit