non-equi-joins in R with data.table - backticked column name trouble

Specifying on= with strings is another option:

a[b, on = c("test name1==test name4", "test name2>test name3", "test name2<=V2")]

I think this works only if there is no whitespace around the equality/inequality operators and == is used instead of =.

I'm not sure if there's a way to write the on= along the lines of the OP's code, though it seems like there should be.
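
For reference, here is a minimal sketch with hypothetical data (the actual tables from the question are not shown here) illustrating the string form with backticked column names containing spaces:

library(data.table)

# hypothetical tables; only the column names matter for the illustration
a <- data.table(`test name1` = c("u", "v"), `test name2` = c(5L, 10L))
b <- data.table(`test name4` = c("u", "v"), `test name3` = c(1L, 8L), V2 = c(7L, 12L))

# no whitespace around the operators, and "==" rather than "="
a[b, on = c("test name1==test name4", "test name2>test name3", "test name2<=V2")]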

non-equi joins adding all columns of range table in data.table in one step

Since you want results for every row of a, you should do a join like b[a, ...]:

b[a, on = .(LB <= salary, UB > salary), nomatch = 0,
  .(Company_ID, salary, cat, LB = x.LB, UB = x.UB, rep)]

   Company_ID salary cat   LB   UB   rep
1:          1   2000   1    0 3000   Bob
2:          1   3000   2 3000 5000 Alice
3:          1   4000   2 3000 5000 Alice
  • nomatch=0 means we'll drop rows of a that are unmatched in b.
  • We need to refer explicitly to the UB and LB columns from b using the x.* prefix (the naming follows ?data.table, where the general form of a join is x[i]).

Regarding the strange default columns (the join columns LB and UB take the values of salary from a), there is an open issue to change that behavior: #1615.
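
The a and b tables aren't reproduced in this excerpt; a hypothetical reconstruction consistent with the output above (the third range and its rep are made up) lets the snippets here and below be run:

library(data.table)

a <- data.table(Company_ID = 1L, salary = c(2000, 3000, 4000))
b <- data.table(cat = 1:3,
                LB  = c(0, 3000, 5000),
                UB  = c(3000, 5000, 10000),
                rep = c("Bob", "Alice", "Carol"))

b[a, on = .(LB <= salary, UB > salary), nomatch = 0,
  .(Company_ID, salary, cat, LB = x.LB, UB = x.UB, rep)]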


(Issue #1989, referenced below, is fixed now -- See Uwe's answer.)

Alternatively, one way that should work and avoids explicitly listing all columns: add a's columns to b, then subset b:

b[a, on=.(LB <= salary, UB > salary), names(a) := mget(paste0("i.", names(a)))] 
b[b[a, on=.(LB <= salary, UB > salary), which=TRUE, nomatch=0]]

There are two problems with this. First, there's a bug causing non-equi join to break when confronted with mget (#1989). The temporary workaround is to enumerate a's columns:

b[a, on=.(LB <= salary, UB > salary), `:=`(Company_ID = i.Company_ID, salary = i.salary)] 
b[b[a, on=.(LB <= salary, UB > salary), which=TRUE, nomatch=0]]

Second, it's inefficient to do this join twice (once for := and a second time for which), but I can't see any way around that... maybe justifying a feature request to allow both j and which?

Column name labelling in data.table joins

I ended up answering my own question.

# Requires data.table, stringr, purrr, magrittr (for %>%) and tibble
data_table_tidy_join <- function(x, y, join_by) {

  x <- data.table(x)
  y <- data.table(y)

  # join conditions that consist of a single name (same column in x and y)
  single_join_names <- purrr::keep(stringr::str_split(join_by, "==|>=|<="), ~ length(.) == 1) %>% unlist()

  # cols from x that are equi-join keys; their values are duplicated by y's columns in the result
  remove_from_x_names <- c(trimws(na.omit(stringr::str_extract(join_by, ".*(?=[=]{2})"))), single_join_names)

  # names to keep
  x_names_keep_raw <- names(x)[!names(x) %in% remove_from_x_names]
  y_names_keep_raw <- names(y)

  # cols that exist in both x and y but are not being equi-joined on
  cols_rename_index <- x_names_keep_raw[x_names_keep_raw %in% y_names_keep_raw]

  # rename so indexing works
  x_names_keep <- x_names_keep_raw
  y_names_keep <- y_names_keep_raw

  # give a prefix to the ambiguous names
  x_names_keep[x_names_keep %in% cols_rename_index] <- paste0("x.", cols_rename_index)
  y_names_keep[y_names_keep %in% cols_rename_index] <- paste0("i.", cols_rename_index)

  # data.table join, keeping the required cols under tidy names
  joined_data <-
    x[y, on = join_by,
      mget(c(paste0("i.", y_names_keep_raw), paste0("x.", x_names_keep_raw))) %>%
        purrr::set_names(c(y_names_keep, x_names_keep)),
      mult = "all", allow.cartesian = TRUE, nomatch = NA] %>%
    tibble::as_tibble()

  return(joined_data)
}

> x <- data.table(Id = c("A", "B", "C", "C"),
+ X1 = c(1L, 3L, 5L, 7L),
+ X2 = c(8L,12L,9L,18L),
+ XY = c("x2", "x4", "x6", "x8"))
>
> z <- data.table(ID = "C", Z1 = 5:9, Z2 = paste0("z", 5:9))
>
> data_table_tidy_join(x, z, join_by = c("Id == ID","X1 <= Z1", "X2 >= Z1"))
# A tibble: 8 x 6
  ID       Z1 Z2       X1    X2 XY
  <chr> <int> <chr> <int> <int> <chr>
1 C         5 z5        5     9 x6
2 C         6 z6        5     9 x6
3 C         7 z7        5     9 x6
4 C         7 z7        7    18 x8
5 C         8 z8        5     9 x6
6 C         8 z8        7    18 x8
7 C         9 z9        5     9 x6
8 C         9 z9        7    18 x8

Non-equi join of dates using data table

It looks like you are trying to join edits and events so that a probability value from the edits data table is associated with the correct observation from the events data table.

It looks like the error is occurring because the time intervals used to create the edits data table are not mutually exclusive (they overlap). When I modify the time intervals to what I think you intended, your code gives the result that you were looking for.

library(data.table)

edits <- data.table(
  proposal  = c('A', 'A', 'A'),
  editField = c('probability', 'probability', 'probability'),
  startDate = as.POSIXct(c('2017-04-14 00:00:00', '2018-10-10 15:47:00', '2019-09-06 12:12:00')),
  endDate   = as.POSIXct(c('2018-10-10 15:47:00', '2019-09-06 12:12:00', '9999-12-31 05:00:00')),
  value     = c(.1, .3, .1))

events <- data.table(
  proposal = 'A',
  editDate = as.POSIXct(c('2017-04-14 00:00:00', '2019-09-06 12:12:00', '2019-09-06 12:12:00',
                          '2019-09-06 12:12:00', '2018-07-04 15:33:59', '2018-07-27 08:01:00',
                          '2018-10-10 15:47:00', '2018-10-10 15:47:00', '2018-10-10 15:47:00',
                          '2018-11-26 11:10:00', '2019-02-05 13:06:59')),
  editField = c('Created', 'stage', 'probability', 'estOrder', 'estOrder', 'estOrder',
                'stage', 'probability', 'estOrder', 'estOrder', 'estOrder'))

edits[editField == 'probability'][events, on = .(proposal, startDate <= editDate, endDate > editDate)]

Or you can do the join without chaining it:

  edits[events, on=.(proposal, startDate<=editDate, endDate>editDate)]

Or you could do as Jonny Phelps suggested and use foverlaps(), but this also requires mutually exclusive (non-overlapping) time intervals in the edits data table:

# foverlaps() needs an interval in both tables, so give events a zero-width interval
events[, startDate := editDate]

# the last two key columns define each table's interval
setkey(events, startDate, editDate)
setkey(edits, startDate, endDate)

foverlaps(events, edits, type = "any", mult = "first")

Need help optimizing cumsum like code - sqldf, data.table, non-equi joins

The OP has asked to optimize a number of cascading sqldf statements (before the OP's edit). Unfortunately, the OP has not explained in words which aggregations were implemented, so a substantial amount of reverse engineering was required.

Anyhow, here is what I would do using data.table to achieve the same results. Execution time is down from 16 sec for OP's sqldf code to less than 0.2 sec for the data.table versions.

data.table versions of edited example

The OP has edited the question to reduce the number of sqldf statements. Now, only one aggregate is computed.

The new column winner_overall_wins in data2 is the count of all matches the winner has won before the current tourney started. This number is attached to all matches of the current tourney that were won by that player. (Note this is a different aggregation than the count of matches won before the current match.)
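
As a toy illustration of this aggregation (hypothetical data, not part of the original answer), the join-then-aggregate pattern used below can be run on a minimal table:

library(data.table)

# three matches won by one player across two tourneys
toy <- data.table(tourney_date = c(20170101L, 20170101L, 20170201L),
                  match_num    = c(1L, 2L, 1L),
                  winner_id    = 1L)

# wins before the current tourney started: both matches of the first tourney
# get 0, the single match of the second tourney gets 2
toy[unique(toy[toy, on = .(winner_id, tourney_date < tourney_date),
               .(winner_overall_wins = .N), by = .EACHI]),
    on = .(winner_id, tourney_date)]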

Since version 1.9.8 (on CRAN 25 Nov 2016), data.table has been capable of non-equi joins. In addition, fread() can be told to read only selected columns, which further speeds up I/O.

library(data.table)  # v1.11.2

urls <- sprintf(
  "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_%i.csv",
  2000:2018)
selected_cols <- c("tourney_name", "tourney_date", "match_num",
                   "winner_id", "winner_name",
                   "loser_id", "loser_name")

# read only selected columns from files & combine into one data object
matches <- rbindlist(lapply(urls, fread, select = selected_cols))

# non-equi join to compute aggregate, second join to append, order result
system.time({
  result_nej <- matches[
    unique(matches[matches, on = .(winner_id, tourney_date < tourney_date),
                   .(winner_overall_wins = .N), by = .EACHI]),
    on = .(winner_id, tourney_date)][
      order(-tourney_date, tourney_name, -match_num)]
})

The two data.table joins and the subsequent ordering took an elapsed time of around 0.15 sec on my system vs 16 to 19 sec for various runs of OP's sqldf code.

The history of a particular player can be retrieved by

p_name <- "Federer"; result_nej[winner_name %like% p_name | loser_id %like% p_name]
                     tourney_name tourney_date match_num winner_id   winner_name loser_id         loser_name winner_overall_wins
   1:             Australian Open     20180115       701    103819 Roger Federer   105227        Marin Cilic                1128
   2:             Australian Open     20180115       602    103819 Roger Federer   111202        Hyeon Chung                1128
   3:             Australian Open     20180115       504    103819 Roger Federer   104607      Tomas Berdych                1128
   4:             Australian Open     20180115       408    103819 Roger Federer   105916   Marton Fucsovics                1128
   5:             Australian Open     20180115       316    103819 Roger Federer   104755    Richard Gasquet                1128
  ---
1131:                   Marseille     20000207         3    103819 Roger Federer   102179      Antony Dupuis                   4
1132: Davis Cup WG R1: SUI vs AUS     20000204         2    103819 Roger Federer   102882 Mark Philippoussis                   3
1133:             Australian Open     20000117        90    103819 Roger Federer   102466        Jan Kroslak                   1
1134:             Australian Open     20000117        52    103819 Roger Federer   102021      Michael Chang                   1
1135:                    Adelaide     20000103         2    103819 Roger Federer   102533   Jens Knippschild                   0

There is an alternative and faster solution using cumsum() and shift():

system.time({
  # cumulative operations require ordered data
  setorder(matches, tourney_date, tourney_name, match_num)
  # add a tourney id for convenience and conciseness
  matches[, t_id := rleid(tourney_date, tourney_name)]
  # aggregate wins by player and tourney
  p_t_hist <- matches[, .(winner_won = .N), by = .(winner_id, t_id)]
  # compute the cumulative sum for each player and
  # lag it so only matches of previous tourneys are counted
  tmp <- p_t_hist[order(t_id),
                  .(t_id, winner_overall_wins = shift(cumsum(winner_won))),
                  by = winner_id]
  # append the new column & order the result
  result_css <- matches[tmp, on = .(t_id, winner_id)][order(-t_id)]
})
p_name <- "Federer"; result_css[winner_name %like% p_name | loser_id %like% p_name]

On my system, elapsed time is about 0.05 sec, which is 3 times faster than the non-equi join variant and orders of magnitude faster than the OP's approach.


