non-equi-joins in R with data.table - backticked column name trouble

Specifying on= with strings is another option:

a[b, on = c("test name1==test name4", "test name2>test name3", "test name2<=V2")]

I think this works only if there is no whitespace around the equality/inequality operators and == is used instead of =.

I'm not sure if there's a way to write the on= along the lines of the OP's code, though it seems like there should be.
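
For reference, here is a minimal sketch with hypothetical data (the actual tables from the question are not shown here) illustrating the string form with backticked column names containing spaces:

library(data.table)

# hypothetical tables; only the column names matter for the illustration
a <- data.table(`test name1` = c("u", "v"), `test name2` = c(5L, 10L))
b <- data.table(`test name4` = c("u", "v"), `test name3` = c(1L, 8L), V2 = c(7L, 12L))

# no whitespace around the operators, and "==" rather than "="
a[b, on = c("test name1==test name4", "test name2>test name3", "test name2<=V2")]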

non-equi joins adding all columns of range table in data.table in one step

Since you want results for every row of a, you should do a join like b[a, ...]:

b[a, on = .(LB <= salary, UB > salary), nomatch = 0,
  .(Company_ID, salary, cat, LB = x.LB, UB = x.UB, rep)]

   Company_ID salary cat   LB   UB   rep
1:          1   2000   1    0 3000   Bob
2:          1   3000   2 3000 5000 Alice
3:          1   4000   2 3000 5000 Alice
  • nomatch=0 means we'll drop rows of a that are unmatched in b.
  • We need to refer explicitly to the UB and LB columns from b using the x.* prefix (the naming follows ?data.table, where the general form of a join is x[i]).

Regarding the strange default columns (the join columns LB and UB take the values of salary from a), there is an open issue to change that behavior: #1615.
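
The a and b tables aren't reproduced in this excerpt; a hypothetical reconstruction consistent with the output above (the third range and its rep are made up) lets the snippets here and below be run:

library(data.table)

a <- data.table(Company_ID = 1L, salary = c(2000, 3000, 4000))
b <- data.table(cat = 1:3,
                LB  = c(0, 3000, 5000),
                UB  = c(3000, 5000, 10000),
                rep = c("Bob", "Alice", "Carol"))

b[a, on = .(LB <= salary, UB > salary), nomatch = 0,
  .(Company_ID, salary, cat, LB = x.LB, UB = x.UB, rep)]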


(Issue #1989, referenced below, is fixed now -- See Uwe's answer.)

Alternatively, one way that should work and avoids explicitly listing all columns: add a's columns to b, then subset b:

b[a, on=.(LB <= salary, UB > salary), names(a) := mget(paste0("i.", names(a)))] 
b[b[a, on=.(LB <= salary, UB > salary), which=TRUE, nomatch=0]]

There are two problems with this. First, there's a bug causing non-equi join to break when confronted with mget (#1989). The temporary workaround is to enumerate a's columns:

b[a, on=.(LB <= salary, UB > salary), `:=`(Company_ID = i.Company_ID, salary = i.salary)] 
b[b[a, on=.(LB <= salary, UB > salary), which=TRUE, nomatch=0]]

Second, it's inefficient to do this join twice (once for := and a second time for which), but I can't see any way around that... maybe justifying a feature request to allow both j and which?

Column name labelling in data.table joins

I ended up answering my own question.

# Requires data.table, stringr, purrr, magrittr (for %>%) and tibble
data_table_tidy_join <- function(x, y, join_by) {

  x <- data.table(x)
  y <- data.table(y)

  # join conditions that consist of a single name (same column in x and y)
  single_join_names <- purrr::keep(stringr::str_split(join_by, "==|>=|<="), ~ length(.) == 1) %>% unlist()

  # cols from x that are equi-join keys; their values are duplicated by y's columns in the result
  remove_from_x_names <- c(trimws(na.omit(stringr::str_extract(join_by, ".*(?=[=]{2})"))), single_join_names)

  # names to keep
  x_names_keep_raw <- names(x)[!names(x) %in% remove_from_x_names]
  y_names_keep_raw <- names(y)

  # cols that exist in both x and y but are not being equi-joined on
  cols_rename_index <- x_names_keep_raw[x_names_keep_raw %in% y_names_keep_raw]

  # rename so indexing works
  x_names_keep <- x_names_keep_raw
  y_names_keep <- y_names_keep_raw

  # give a prefix to the ambiguous names
  x_names_keep[x_names_keep %in% cols_rename_index] <- paste0("x.", cols_rename_index)
  y_names_keep[y_names_keep %in% cols_rename_index] <- paste0("i.", cols_rename_index)

  # data.table join, keeping the required cols under tidy names
  joined_data <-
    x[y, on = join_by,
      mget(c(paste0("i.", y_names_keep_raw), paste0("x.", x_names_keep_raw))) %>%
        purrr::set_names(c(y_names_keep, x_names_keep)),
      mult = "all", allow.cartesian = TRUE, nomatch = NA] %>%
    tibble::as_tibble()

  return(joined_data)
}

> x <- data.table(Id = c("A", "B", "C", "C"),
+ X1 = c(1L, 3L, 5L, 7L),
+ X2 = c(8L,12L,9L,18L),
+ XY = c("x2", "x4", "x6", "x8"))
>
> z <- data.table(ID = "C", Z1 = 5:9, Z2 = paste0("z", 5:9))
>
> data_table_tidy_join(x, z, join_by = c("Id == ID","X1 <= Z1", "X2 >= Z1"))
# A tibble: 8 x 6
  ID       Z1 Z2       X1    X2 XY
  <chr> <int> <chr> <int> <int> <chr>
1 C         5 z5        5     9 x6
2 C         6 z6        5     9 x6
3 C         7 z7        5     9 x6
4 C         7 z7        7    18 x8
5 C         8 z8        5     9 x6
6 C         8 z8        7    18 x8
7 C         9 z9        5     9 x6
8 C         9 z9        7    18 x8

Non-equi join of dates using data table

It looks like you are trying to join edits and events so that a probability value from the edits data table is associated with the correct observation from the events data table.

It looks like the error is occurring because the time intervals used to create the edits data table are not mutually exclusive (they overlap). When I modify the time intervals to what I think you intended, your code gives the result that you were looking for.

library(data.table)

edits <- data.table(
  proposal  = c('A', 'A', 'A'),
  editField = c('probability', 'probability', 'probability'),
  startDate = as.POSIXct(c('2017-04-14 00:00:00', '2018-10-10 15:47:00', '2019-09-06 12:12:00')),
  endDate   = as.POSIXct(c('2018-10-10 15:47:00', '2019-09-06 12:12:00', '9999-12-31 05:00:00')),
  value     = c(.1, .3, .1))

events <- data.table(
  proposal = 'A',
  editDate = as.POSIXct(c('2017-04-14 00:00:00', '2019-09-06 12:12:00', '2019-09-06 12:12:00',
                          '2019-09-06 12:12:00', '2018-07-04 15:33:59', '2018-07-27 08:01:00',
                          '2018-10-10 15:47:00', '2018-10-10 15:47:00', '2018-10-10 15:47:00',
                          '2018-11-26 11:10:00', '2019-02-05 13:06:59')),
  editField = c('Created', 'stage', 'probability', 'estOrder', 'estOrder', 'estOrder',
                'stage', 'probability', 'estOrder', 'estOrder', 'estOrder'))

edits[editField == 'probability'][events, on = .(proposal, startDate <= editDate, endDate > editDate)]

Or you can do the join without chaining it:

  edits[events, on=.(proposal, startDate<=editDate, endDate>editDate)]

Or you could do as Jonny Phelps suggested and use foverlaps(), but this also requires mutually exclusive (non-overlapping) time intervals in the edits data table:

# foverlaps() needs an interval in both tables, so give events a zero-width interval
events[, startDate := editDate]

# the last two key columns define each table's interval
setkey(events, startDate, editDate)
setkey(edits, startDate, endDate)

foverlaps(events, edits, type = "any", mult = "first")

Need help optimizing cumsum like code - sqldf, data.table, non-equi joins

The OP has asked to optimize a number of cascading sqldf statements (before the OP's edit). Unfortunately, the OP has not explained in words which aggregations were implemented, so a substantial amount of reverse engineering was required.

Anyhow, here is what I would do using data.table to achieve the same results. Execution time is down from 16 sec for OP's sqldf code to less than 0.2 sec for the data.table versions.

data.table versions of edited example

The OP has edited the question to reduce the number of sqldf statements. Now, only one aggregate is computed.

The new column winner_overall_wins in data2 is the count of all matches the winner has won before the current tourney started. This number is attached to all matches of the current tourney that were won by that player. (Note this is a different aggregation than the count of matches won before the current match.)
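
As a toy illustration of this aggregation (hypothetical data, not part of the original answer), the join-then-aggregate pattern used below can be run on a minimal table:

library(data.table)

# three matches won by one player across two tourneys
toy <- data.table(tourney_date = c(20170101L, 20170101L, 20170201L),
                  match_num    = c(1L, 2L, 1L),
                  winner_id    = 1L)

# wins before the current tourney started: both matches of the first tourney
# get 0, the single match of the second tourney gets 2
toy[unique(toy[toy, on = .(winner_id, tourney_date < tourney_date),
               .(winner_overall_wins = .N), by = .EACHI]),
    on = .(winner_id, tourney_date)]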

Since version 1.9.8 (on CRAN 25 Nov 2016), data.table has been capable of non-equi joins. In addition, fread() can be told to read only selected columns, which further speeds up I/O.

library(data.table)  # v1.11.2

urls <- sprintf(
  "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_%i.csv",
  2000:2018)
selected_cols <- c("tourney_name", "tourney_date", "match_num",
                   "winner_id", "winner_name",
                   "loser_id", "loser_name")

# read only selected columns from files & combine into one data object
matches <- rbindlist(lapply(urls, fread, select = selected_cols))

# non-equi join to compute aggregate, second join to append, order result
system.time({
  result_nej <- matches[
    unique(matches[matches, on = .(winner_id, tourney_date < tourney_date),
                   .(winner_overall_wins = .N), by = .EACHI]),
    on = .(winner_id, tourney_date)][
      order(-tourney_date, tourney_name, -match_num)]
})

The two data.table joins and the subsequent ordering took an elapsed time of around 0.15 sec on my system vs 16 to 19 sec for various runs of OP's sqldf code.

The history of a particular player can be retrieved by

p_name <- "Federer"; result_nej[winner_name %like% p_name | loser_id %like% p_name]
                     tourney_name tourney_date match_num winner_id   winner_name loser_id         loser_name winner_overall_wins
   1:             Australian Open     20180115       701    103819 Roger Federer   105227        Marin Cilic                1128
   2:             Australian Open     20180115       602    103819 Roger Federer   111202        Hyeon Chung                1128
   3:             Australian Open     20180115       504    103819 Roger Federer   104607      Tomas Berdych                1128
   4:             Australian Open     20180115       408    103819 Roger Federer   105916   Marton Fucsovics                1128
   5:             Australian Open     20180115       316    103819 Roger Federer   104755    Richard Gasquet                1128
  ---
1131:                   Marseille     20000207         3    103819 Roger Federer   102179      Antony Dupuis                   4
1132: Davis Cup WG R1: SUI vs AUS     20000204         2    103819 Roger Federer   102882 Mark Philippoussis                   3
1133:             Australian Open     20000117        90    103819 Roger Federer   102466        Jan Kroslak                   1
1134:             Australian Open     20000117        52    103819 Roger Federer   102021      Michael Chang                   1
1135:                    Adelaide     20000103         2    103819 Roger Federer   102533   Jens Knippschild                   0

There is an alternative and faster solution using cumsum() and shift():

system.time({
  # cumulative operations require ordered data
  setorder(matches, tourney_date, tourney_name, match_num)
  # add a tourney id for convenience and conciseness
  matches[, t_id := rleid(tourney_date, tourney_name)]
  # aggregate wins by player and tourney
  p_t_hist <- matches[, .(winner_won = .N), by = .(winner_id, t_id)]
  # compute the cumulative sum for each player and
  # lag it so only matches of previous tourneys are counted
  tmp <- p_t_hist[order(t_id),
                  .(t_id, winner_overall_wins = shift(cumsum(winner_won))),
                  by = winner_id]
  # append the new column & order the result
  result_css <- matches[tmp, on = .(t_id, winner_id)][order(-t_id)]
})
p_name <- "Federer"; result_css[winner_name %like% p_name | loser_id %like% p_name]

On my system, elapsed time is about 0.05 sec, which is 3 times faster than the non-equi join variant and orders of magnitude faster than the OP's approach.


