Non-Joins with Data.Tables

As far as I know, this is a part of base R.

# This works
(1:4)[c(-2,-3)]

# But this gives you the same error you described above
(1:4)[c(-2, -3, NA)]
# Error in (1:4)[c(-2, -3, NA)] :
# only 0's may be mixed with negative subscripts

The fact that there is a specific error message for this case suggests it is intended behavior.

Here's my best guess as to why that is the intended behavior:

From the way R's designers treat NAs elsewhere (e.g., typically defaulting to na.rm=FALSE), they seem to view NAs as carrying important information and are loath to drop it without some explicit instruction to do so. (Fortunately, setting nomatch=0 gives you a clean way to pass that instruction along!)

In this context, the designers' preference probably explains why NAs are accepted for positive indexing but not for negative indexing:

# Positive indexing: works, because the return value retains info about NA's
(1:4)[c(2,3,NA)]

# Negative indexing: doesn't work, because it can't easily retain such info
(1:4)[c(-2,-3,NA)]
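The nomatch=0 escape hatch mentioned above can be sketched in data.table itself (toy table; the column names here are made up for illustration):

```r
library(data.table)

X <- data.table(id = 1:4, v = letters[1:4], key = "id")

# Default (nomatch = NA): the unmatched key 5 comes back as an NA row
X[J(c(2L, 5L))]

# nomatch = 0 (or nomatch = NULL in recent versions) drops unmatched rows instead
X[J(c(2L, 5L)), nomatch = 0]
```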

data.table non-join on one column

In data.table there is a ! operator for exclusions.

setkey(tmp, mgr, year, stocks)
# every (mgr, year) paired with every stock seen anywhere in the data
all.stocks <- tmp[, .(stocks = unique(tmp$stocks)), by = .(mgr, year)]
setkey(all.stocks, mgr, year, stocks)
missing <- all.stocks[!tmp]  # anti-join: combinations absent from tmp

This should be pretty fast.
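Here's a self-contained worked version; the mgr/year/stocks values are invented for illustration:

```r
library(data.table)

# Invented data: manager m2 never held MSFT in 2020
tmp <- data.table(mgr    = c("m1", "m1", "m2"),
                  year   = c(2020, 2020, 2020),
                  stocks = c("AAPL", "MSFT", "AAPL"))

setkey(tmp, mgr, year, stocks)
# every (mgr, year) paired with every stock seen anywhere
all.stocks <- tmp[, .(stocks = unique(tmp$stocks)), by = .(mgr, year)]
setkey(all.stocks, mgr, year, stocks)
missing <- all.stocks[!tmp]  # anti-join leaves only (m2, 2020, MSFT)
```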

Non-equi join using data.table: column missing from the output

In data.table, joins of the form x[i] traditionally use values from i but column names from x. This differs from SQL, which returns both, but the default makes a lot of sense for equi-joins: we want all rows from i; where a row matches, both data.tables hold equal values anyway, and where it doesn't, we need to keep the unmatched values from i in the result.

But for non-equi joins the values might not match exactly (they can fall anywhere within a range), so there are cases where we have to return output similar to SQL's (or identify such cases and return the result the user expects, as with equi-joins). This hasn't been done yet, but I've placed a provision for it in the meantime: refer to the x columns explicitly with an x. prefix. It is not convenient, I agree. Hopefully this will be taken care of automatically soon.

Here's how to get your result using the x. prefix:

ans <- DT2[DT1, .(CERT_NUMBER, FORENAME, SURNAME, x.START_DATE, x.EXPIRY_DATE, ID, MONTH),
           on = .(FORENAME, SURNAME, START_DATE <= MONTH, EXPIRY_DATE >= MONTH)]

IIRC there's an issue filed on the GitHub project page about this as well.
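A self-contained sketch of the same pattern; DT1/DT2 here are invented single-row stand-ins for the question's data:

```r
library(data.table)

DT1 <- data.table(FORENAME = "ann", SURNAME = "lee",
                  ID = 1L, MONTH = as.IDate("2020-06-15"))
DT2 <- data.table(FORENAME = "ann", SURNAME = "lee", CERT_NUMBER = 42L,
                  START_DATE  = as.IDate("2020-01-01"),
                  EXPIRY_DATE = as.IDate("2020-12-31"))

# Without x., the START_DATE/EXPIRY_DATE columns in the result would hold
# MONTH's values; x.START_DATE and x.EXPIRY_DATE pull the originals from DT2
ans <- DT2[DT1, .(CERT_NUMBER, FORENAME, SURNAME, x.START_DATE, x.EXPIRY_DATE, ID, MONTH),
           on = .(FORENAME, SURNAME, START_DATE <= MONTH, EXPIRY_DATE >= MONTH)]
```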

non-equi-joins in R with data.table - backticked column name trouble

Specifying on= with strings is another option:

a[b, on = c("test name1==test name4", "test name2>test name3", "test name2<=V2")]

I think this works only if there is no whitespace around the equality/inequality operators and == is used instead of =.

I'm not sure if there's a way to write the on= along the lines of the OP's code, though it seems like there should be.
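For instance, with invented tables whose column names contain spaces:

```r
library(data.table)

a <- data.table(`test name1` = 1:3, `test name2` = c(5, 7, 9))
b <- data.table(`test name4` = 1:3, `test name3` = c(4, 6, 8), V2 = c(6, 8, 10))

# string form of on=: no spaces around the operators, and == rather than =
res <- a[b, on = c("test name1==test name4", "test name2>test name3", "test name2<=V2")]
```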

Why does X[Y] join of data.tables not allow a full outer join, or a left join?

To quote from data.table FAQ 1.11, "What is the difference between X[Y] and merge(X, Y)?":

X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) as an index.

Y[X] is a join, looking up Y's rows using X (or X's key if it has one)

merge(X,Y) does both ways at the same time. The number of rows of X[Y] and Y[X] usually differ, whereas the number of rows returned by merge(X,Y) and merge(Y,X) is the same.

BUT that misses the main point. Most tasks require something to be done on the
data after a join or merge. Why merge all the columns of data, only to
use a small subset of them afterwards? You may suggest
merge(X[,ColsNeeded1],Y[,ColsNeeded2]), but that requires the programmer to work out which columns are needed. X[Y,j] in data.table does all that in one step for
you. When you write X[Y, sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses. It subsets only those columns; the others are ignored. Memory is created only for the columns j uses, and Y columns enjoy standard R recycling rules within the context of each group. Let's say foo is in X, and bar is in Y (along with 20 other columns in Y). Isn't X[Y, sum(foo*bar)] quicker to program and quicker to run than a merge of everything, wastefully followed by a subset?
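A toy illustration of that j-inspection (the tables and names are invented):

```r
library(data.table)

X <- data.table(id = 1:3, foo = c(1, 2, 3), key = "id")
Y <- data.table(id = 1:3, bar = c(10, 20, 30), junk = letters[1:3], key = "id")

# Only foo and bar are touched; Y's junk column plays no part in the computation
X[Y, sum(foo * bar)]
# [1] 140
```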


If you want a left outer join of X[Y]
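Both outer-join recipes in this answer appear to assume keyed tables along these lines; this setup is reconstructed from the printed output, so treat it as a guess:

```r
library(data.table)

X <- data.table(t = 1:4, a = (1:4)^2)
setkey(X, t)
Y <- data.table(t = 3:6, b = (3:6)^2)
setkey(Y, t)
```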

le <- Y[X]
mallx <- merge(X, Y, all.x = TRUE)
# the column order differs, so change it to match `merge`'s
setcolorder(le, names(mallx))
identical(le, mallx)
# [1] TRUE

If you want a full outer join

# the unique values for the keys over both data sets
unique_keys <- unique(c(X[,t], Y[,t]))
Y[X[J(unique_keys)]]
## t b a
## 1: 1 NA 1
## 2: 2 NA 4
## 3: 3 9 9
## 4: 4 16 16
## 5: 5 25 NA
## 6: 6 36 NA

# The following will give the same with the column order X,Y
X[Y[J(unique_keys)]]

Non-equi join of data table operation

Another option is to use by=.EACHI:

library(data.table)
setDT(proposal_df)
setDT(proposal_log)
proposal_df[, c("countWon", "countLost", "wonValueMean", "pctWon") :=
  proposal_log[.SD, on = .(owner, editDate <= editDate), by = .EACHI, {
    cw <- sum(outcome == 1L)
    .(cw, sum(outcome == 0L), mean(x.totalAtEdit[outcome == 1L]), cw / .N)
  }][, (1L:2L) := NULL]  # drop the two join columns, leaving the four computed ones
]
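To see the by=.EACHI mechanics in isolation, here is a stripped-down version with invented data (a simpler column set than the snippet above):

```r
library(data.table)

log <- data.table(owner    = "o1",
                  editDate = as.IDate(c("2020-01-01", "2020-02-01", "2020-03-01")),
                  outcome  = c(1L, 0L, 1L))
df  <- data.table(owner = "o1", editDate = as.IDate("2020-02-15"))

# by = .EACHI evaluates j once per row of df, over that row's matches in log;
# here the single df row matches the two log rows on or before 2020-02-15
res <- log[df, on = .(owner, editDate <= editDate), by = .EACHI,
           .(countWon = sum(outcome == 1L), countLost = sum(outcome == 0L))]
```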

What does x. stand for in data.table joins with on=

When doing a non-equi join like X[Y, on = .(A < A)] data.table returns the A-column from Y (the i-data.table).

To get the desired result, you could do:

X[Y, on = .(A < A), .(A = x.A, B)]

which gives:

   A B
1: 1 1
2: 2 1
3: 3 1

In the next release, data.table will return both A columns. See here for the discussion.
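A minimal pair of tables (invented) that reproduces the output above:

```r
library(data.table)

X <- data.table(A = 1:3, B = 1L)
Y <- data.table(A = 4L)

# on = .(A < A) matches X rows whose A is below Y's A; x.A recovers X's values,
# which the join would otherwise overwrite with Y's A
res <- X[Y, on = .(A < A), .(A = x.A, B)]
```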

Non-equi join of dates using data table

It looks like you are trying to join edits and events so that a probability value from the edits data table is associated with the correct observation from the events data table.

It looks like the error occurs because the time intervals used to create the edits data table are not mutually exclusive. When I modify the time intervals to what I think you intended, your code gives the result you were looking for.

library(data.table)

edits <- data.table(
  proposal  = c('A', 'A', 'A'),
  editField = c('probability', 'probability', 'probability'),
  startDate = as.POSIXct(c('2017-04-14 00:00:00', '2018-10-10 15:47:00', '2019-09-06 12:12:00')),
  endDate   = as.POSIXct(c('2018-10-10 15:47:00', '2019-09-06 12:12:00', '9999-12-31 05:00:00')),
  value     = c(.1, .3, .1))

events <- data.table(
  proposal = 'A',
  editDate = as.POSIXct(c('2017-04-14 00:00:00', '2019-09-06 12:12:00', '2019-09-06 12:12:00',
                          '2019-09-06 12:12:00', '2018-07-04 15:33:59', '2018-07-27 08:01:00',
                          '2018-10-10 15:47:00', '2018-10-10 15:47:00', '2018-10-10 15:47:00',
                          '2018-11-26 11:10:00', '2019-02-05 13:06:59')),
  editField = c('Created', 'stage', 'probability', 'estOrder', 'estOrder', 'estOrder',
                'stage', 'probability', 'estOrder', 'estOrder', 'estOrder'))

edits[editField=='probability'][events, on=.(proposal, startDate<=editDate, endDate>editDate)]

or you can do the join without chaining it:

edits[events, on=.(proposal, startDate<=editDate, endDate>editDate)]

or you could do as Jonny Phelps suggested and use foverlaps(), but this also requires mutually exclusive time intervals in the edits data table:

events[, startDate := editDate]
setkey(events, startDate, editDate)
setkey(edits, startDate, endDate)
foverlaps(events, edits, type = "any", mult = "first")

