Use `j` to select the join column of `x` and all its non-join columns
PS I have considered switching the x and i data sets, and the conditions in on. Although that produces the desired join values, it still requires post-processing (deleting, renaming and reordering of columns).
The amount of post processing is limited by how many on=
cols there are:
d2[d1, on=.(grp, from <= val, to >= val), nomatch=0][,
`:=`(val = from, from = NULL, to = NULL)][]
That doesn't seem too bad.
Following @Jaap's comment, here's another way, adding columns to d1
with an update join:
nm2 = setdiff(names(d2), c("from","to","grp"))
d1[d2, on=.(grp, val >= from, val <= to), (nm2) := mget(sprintf("i.%s", nm2))]
This makes sense here because the desired output is essentially d1
plus some columns from d2
(since each row of d1
matches at most one row of d2
).
What does stand for in data.table joins with on=
When doing a non-equi join like X[Y, on = .(A < A)]
data.table returns the A
-column from Y
(the i
-data.table).
To get the desired result, you could do:
X[Y, on = .(A < A), .(A = x.A, B)]
which gives:
A B
1: 1 1
2: 2 1
3: 3 1
In the next release, data.table will return both A
columns. See here for the discussion.
How to use specify and extract variables from X and Y in the non-equi join X[Y, ...]?
To prevent confusion i would suggest to rename columns that have the same name in both data.tables and create copies of the columns of non-equi join columns.
setnames(tmp_dt2, "time", "time_dt2")
tmp_dt2[, c("time_from_join", "time_to_join"):=list(time_from, time_to)]
tmp_dt1[ , time_join := time]
Then we can join and then throw away all the temporary columns that data.table messes with in the non-equi join.
tmp_dt1[tmp_dt2, on=.(grp==grp, time_join >= time_from_join, time_join <= time_to_join)][ , c("grp", "time", "time_from", "time_to", "time_dt2")]
grp time time_from time_to time_dt2
1: 1 0.1079436 0.1 0.3 0.1
2: 1 0.1216919 0.1 0.3 0.1
3: 1 0.1255551 0.1 0.3 0.1
4: 1 0.1433044 0.1 0.3 0.1
...
unexpected behavior non-equi join data.table R
I think the behavior is the expected, except for the variable being named quite surprisingly. I made a little tweak to your sample data to show that everything is alright:
dt1 <- data.table(foo = 1:4, bar = letters[1:4])
dt2 <- data.table(foo1 = 2:4, bar1 = letters[2:4]) # small change here
dt1[dt2, on = .(foo < foo1), allow.cartesian = TRUE][dt1, on = "bar"]
foo bar bar1 i.foo
1: 2 a b 1
2: 3 a c 1
3: 4 a d 1
4: 5 a e 1
5: 3 b c 2
6: 4 b d 2
7: 5 b e 2
8: 4 c d 3
9: 5 c e 3
10: 5 d e 4
For me the behavior is the expected, it's just that the first column is named foo
instead of foo1
. That's why you see fool1
as an erroneous copy of "erroneous foo: it's actually a copy of
foo1`.
EDIT: a possible workaround:
This isn't really elegant, but is a workaround:
dt1[dt2, .(foo = x.foo, foo1, bar, bar1), on = .(foo < foo1), allow.cartesian = TRUE]
foo foo1 bar bar1
1: 1 2 a b
2: 1 3 a c
3: 2 3 b c
4: 1 4 a d
5: 2 4 b d
6: 3 4 c d
7: 1 5 a e
8: 2 5 b e
9: 3 5 c e
10: 4 5 d e
x.foo
retains the original, true foo
. foo1
still is what it is, so you can return both variables.
How to keep join column unchanged in data.table non-equi join?
You could also use %inrange%
:
df.cov[posn %inrange% df.exons]
which results in:
posn att
1: 1 a
2: 2 b
3: 3 c
4: 165 d
As you can see this leaves the values of the posn
-column unchanged.
Another possiblity with a non-equi join:
df.exons[df.cov
, on = .(start <= posn, end >= posn)
, mult = "first"
, nomatch = 0
, .(posn = i.posn, att)][]
Column name labelling in data.table joins
I ended up answering my own question.
data_table_tidy_join <- function(x,y, join_by){
x <- data.table(x)
y <- data.table(y)
# Determine single join names
single_join_names <- purrr::keep((stringr::str_split(join_by, "==|>=|<=")), ~length(.) == 1) %>% unlist()
# cols from x that won't require as matching in i
remove_from_x_names <- c(trimws(na.omit(stringr::str_extract(join_by, ".*(?=[=]{2})"))), single_join_names)
# names need to keep
x_names_keep_raw <- names(x)[!names(x) %in% remove_from_x_names]
y_names_keep_raw <- names(y)
# cols that exist in both x and y, but not being equi joined on
cols_rename_index <- x_names_keep_raw[x_names_keep_raw %in% y_names_keep_raw]
#rename so indexing works
x_names_keep <- x_names_keep_raw
y_names_keep <- y_names_keep_raw
# give prefix to necessary vars
x_names_keep[x_names_keep %in% cols_rename_index] <- paste("x.",cols_rename_index, sep ="")
y_names_keep[y_names_keep %in% cols_rename_index] <- paste("i.",cols_rename_index, sep ="")
# implement data.table call, keeping required cols
joined_data <-
x[y, on = join_by,
mget(c(paste0("i.", y_names_keep_raw),paste0("x.", x_names_keep_raw))) %>% set_names(c(y_names_keep,x_names_keep)),
mult = "all", allow.cartesian = TRUE, nomatch = NA] %>%
as_tibble()
return(joined_data)
}
> x <- data.table(Id = c("A", "B", "C", "C"),
+ X1 = c(1L, 3L, 5L, 7L),
+ X2 = c(8L,12L,9L,18L),
+ XY = c("x2", "x4", "x6", "x8"))
>
> z <- data.table(ID = "C", Z1 = 5:9, Z2 = paste0("z", 5:9))
>
> data_table_tidy_join(x, z, join_by = c("Id == ID","X1 <= Z1", "X2 >= Z1"))
# A tibble: 8 x 6
ID Z1 Z2 X1 X2 XY
<chr> <int> <chr> <int> <int> <chr>
1 C 5 z5 5 9 x6
2 C 6 z6 5 9 x6
3 C 7 z7 5 9 x6
4 C 7 z7 7 18 x8
5 C 8 z8 5 9 x6
6 C 8 z8 7 18 x8
7 C 9 z9 5 9 x6
8 C 9 z9 7 18 x8
Join within date range
Your logic is not as simple as "between", since it appears that you want any kind of overlap, whether a superset or otherwise. For that, we need a slightly different query (and should include ID
on the left-join as well, I'm inferring).
sqldf::sqldf("
select h.*, o.DT_START_OUT, o.DT_END_OUT
from HOSP h
left join OUT o on h.ID = o.ID
and h.DT_START_HOSP < o.DT_END_OUT
and h.DT_END_HOSP > o.DT_START_OUT")
# ID DT_START_HOSP DT_END_HOSP DT_START_OUT DT_END_OUT
# 1 111 2021-01-07 2021-01-10 <NA> <NA>
# 2 222 2021-01-11 2021-01-20 2021-01-15 2021-01-15
# 3 333 2021-01-21 2021-01-25 <NA> <NA>
# 4 444 2021-01-21 2021-02-01 <NA> <NA>
# 5 555 2021-01-21 2021-01-29 <NA> <NA>
# 6 666 2021-01-22 2021-02-02 2021-01-25 2021-01-25
# 7 666 2021-01-22 2021-02-02 2021-01-28 2021-01-30
(Thank you for fixing your data from the previous question and the first draft of this one. For the record, you may want this, some handy code that deals well with vectors of dates in inconsistent/different formats.)
R data.table - simple way to join on all columns without specifying column names
I think that the way to do this in the data table is the following:
require(data.table)
dt1 <- data.table(A1 = c(1,2,3), A2 = c("A", "B", "D"))
dt2 <- data.table(A1 = c(3,2,3), A2 = c("A", "B", "C"))
setkeyv(dt1, names(dt1))
setkeyv(dt2, names(dt2))
and the inner join on all common columns is:
dt1[dt2, nomatch = 0]
Other options include the following (credits to Frank in the comments):
dt1[dt2, on=names(dt2), nomatch = 0
]
This has the benefit of not requiring to key the data table. (More info can be found here: What is the purpose of setting a key in data.table? )
Another option using the data sets operations (available in version 1.9.7 or later):
fintersect(dt1, dt2)
Select which column I receive the results from in data.table join - R
We need the by
argument
tab2[tab1, val := get(column), on = .(z), by = .EACHI]
-output
> tab2
z column val
1: 1 x A
2: 2 x B
3: 3 y G
4: 4 y H
Related Topics
How to Display Line Numbers for Code Chunks in Rmarkdown HTML and PDF
List Elements to Dataframes in R
Map Array of Strings to an Array of Integers
R: Ggplot2 Setting the Last Plot in the Midle with Facet_Wrap
My Group by Doesn't Appear to Be Working in Disk Frames
Download .Rdata and .CSV Files from Ftp Using Rcurl (Or Any Other Method)
Is There a Package or Technique Availabe for Calculating Large Factorials in R
Filter Group of Rows Based on Sum of Values from Different Column
Web Scraping Data Table with R Rvest
How to Pop Up the Graphics Window from Rscript
Calculate Difference Between Dates by Group in R
How to Color Bar Plots When Using ..Prop.. in Ggplot
Uri Routing for Shinydashboard Using Shiny.Router