non-joins with data.tables
As far as I know, this is a part of base R.
# This works
(1:4)[c(-2,-3)]
# But this gives you the same error you described above
(1:4)[c(-2, -3, NA)]
# Error in (1:4)[c(-2, -3, NA)] :
# only 0's may be mixed with negative subscripts
The textual error message indicates that it is intended behavior.
Here's my best guess as to why that is the intended behavior:
From the way they treat NA's elsewhere (e.g. typically defaulting to na.rm=FALSE), it seems that R's designers view NA's as carrying important information, and are loath to drop that without some explicit instruction to do so. (Fortunately, setting nomatch=0 gives you a clean way to pass that instruction along!)
In this context, the designers' preference probably explains why NA's are accepted for positive indexing, but not for negative indexing:
# Positive indexing: works, because the return value retains info about NA's
(1:4)[c(2,3,NA)]
# Negative indexing: doesn't work, because it can't easily retain such info
(1:4)[c(-2,-3,NA)]
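The nomatch escape hatch mentioned above applies to data.table joins. A minimal sketch, with made-up tables, showing how nomatch=0 drops the rows that would otherwise carry NA's:

```r
library(data.table)

# Hypothetical tables: Y asks for an id (5) that X does not have
X <- data.table(id = 1:3, val = c(10, 20, 30))
Y <- data.table(id = c(2L, 5L))

X[Y, on = "id"]               # unmatched id 5 appears with val = NA
X[Y, on = "id", nomatch = 0]  # unmatched rows are dropped instead
```

This is the explicit instruction to discard the NA information that the default join preserves.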
data.table non-join on one column
In data.table there is a ! operator for exclusions.
setkey(tmp, mgr, year, stocks)
all.stocks <- tmp[, .(stocks=unique(tmp$stocks)), by=.(mgr, year)]
setkey(all.stocks, mgr, year, stocks)
missing <- all.stocks[!tmp]
This should be pretty fast.
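The pattern above can be run end-to-end on a small hypothetical tmp (column names as in the answer; the data is made up):

```r
library(data.table)

# Hypothetical data: not every manager holds every stock in every year
tmp <- data.table(mgr    = c("A", "A", "B"),
                  year   = c(2020L, 2021L, 2020L),
                  stocks = c("X", "Y", "X"))
setkey(tmp, mgr, year, stocks)

# Every (mgr, year) pair crossed with every stock seen anywhere
all.stocks <- tmp[, .(stocks = unique(tmp$stocks)), by = .(mgr, year)]
setkey(all.stocks, mgr, year, stocks)

# Anti-join: combinations present in all.stocks but absent from tmp
missing <- all.stocks[!tmp]
missing
```

Here missing contains the three (mgr, year, stock) combinations that never appear in tmp.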
Non-equi join using data.table: column missing from the output
In data.table, joins of the form x[i] traditionally use values from i but column names from x. Even though this differs from SQL, which returns both, the default makes a lot of sense for equi joins: we are interested in all rows from i, and where they match, both data.tables have equal values anyway; where they don't, we need to keep the unmatched values from i in the result.
But for non-equi joins, the values might not match exactly, i.e., they can fall within a range, so there might be cases where we will have to return output similar to SQL's (or identify such cases and return the result the user expects, as in the equi-join case). This hasn't been done yet, but I've placed a provision for it at the moment, which is to refer to the columns explicitly with an x. prefix. It is not convenient, I agree. Hopefully this will be taken care of automatically soon.
Here's how to get your result using the x. prefix:
ans <- DT2[DT1, .(CERT_NUMBER, FORENAME, SURNAME, x.START_DATE, x.EXPIRY_DATE, ID, MONTH),
on=.(FORENAME, SURNAME, START_DATE <= MONTH, EXPIRY_DATE >= MONTH)]
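Since DT1 and DT2 are not shown in full, here is a minimal self-contained sketch of the same x.-prefix idea, with hypothetical tables: a table of validity periods (DT2) joined against point-in-time events (DT1):

```r
library(data.table)

# Hypothetical data: DT2 holds date ranges, DT1 holds single dates
DT2 <- data.table(name        = "a",
                  START_DATE  = as.Date("2020-01-01"),
                  EXPIRY_DATE = as.Date("2020-12-31"))
DT1 <- data.table(name = "a", MONTH = as.Date("2020-06-01"))

# Without x., START_DATE/EXPIRY_DATE in j would show DT1's MONTH values
res <- DT2[DT1, .(name, x.START_DATE, x.EXPIRY_DATE, MONTH),
           on = .(name, START_DATE <= MONTH, EXPIRY_DATE >= MONTH)]
res
```

The x.START_DATE and x.EXPIRY_DATE columns carry DT2's original range bounds rather than the MONTH values used in the join conditions.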
IIRC there's an issue filed on the GitHub project page about this as well.
non-equi-joins in R with data.table - backticked column name trouble
Specifying on= with strings is another option:
a[b, on = c("test name1==test name4", "test name2>test name3", "test name2<=V2")]
I think this works only if there is no whitespace around the equality/inequality operators and == is used instead of =.
I'm not sure if there's a way to write the on= along the lines of the OP's code, though it seems like there should be.
Why does X[Y] join of data.tables not allow a full outer join, or a left join?
To quote from the data.table FAQ 1.11, "What is the difference between X[Y] and merge(X, Y)?":
X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) as an index.
Y[X] is a join, looking up Y's rows using X (or X's key if it has one) as an index.
merge(X, Y) does both ways at the same time. The number of rows of X[Y] and Y[X] usually differ, whereas the number of rows returned by merge(X, Y) and merge(Y, X) is the same.
BUT that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards? You may suggest merge(X[, ColsNeeded1], Y[, ColsNeeded2]), but that requires the programmer to work out which columns are needed. X[Y, j] in data.table does all that in one step for you. When you write X[Y, sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses. It will subset only those columns; the others are ignored. Memory is only created for the columns j uses, and Y columns enjoy standard R recycling rules within the context of each group. Let's say foo is in X, and bar is in Y (along with 20 other columns in Y). Isn't X[Y, sum(foo*bar)] quicker to program and quicker to run than a merge of everything wastefully followed by a subset?
If you want a left outer join of X[Y]:
le <- Y[X]
mallx <- merge(X, Y, all.x = TRUE)
# the column order is different, so change it to match `merge`
setcolorder(le, names(mallx))
identical(le, mallx)
# [1] TRUE
If you want a full outer join
# the unique values for the keys over both data sets
unique_keys <- unique(c(X[,t], Y[,t]))
Y[X[J(unique_keys)]]
## t b a
## 1: 1 NA 1
## 2: 2 NA 4
## 3: 3 9 9
## 4: 4 16 16
## 5: 5 25 NA
## 6: 6 36 NA
# The following will give the same with the column order X,Y
X[Y[J(unique_keys)]]
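X and Y are not defined above; the following reconstruction is inferred from the printed output (t is the key, a and b are squares of t), so treat the exact values as an assumption:

```r
library(data.table)

# Hypothetical reconstruction consistent with the output shown above
X <- data.table(t = 1:4, a = (1:4)^2)
setkey(X, t)
Y <- data.table(t = 3:6, b = (3:6)^2)
setkey(Y, t)

# the unique values for the keys over both data sets
unique_keys <- unique(c(X[, t], Y[, t]))
res <- Y[X[J(unique_keys)]]
res
```

X[J(unique_keys)] pads X out to all six key values (with NA in a where X has no row), and the outer Y[...] then looks up b for each, giving the full outer join shown above.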
Non-equi join of data table operation
Another option is to use by=.EACHI:
library(data.table)
setDT(proposal_df)
setDT(proposal_log)
proposal_df[, c("countWon", "countLost", "wonValueMean", "pctWon") :=
  proposal_log[.SD, on = .(owner, editDate <= editDate), by = .EACHI, {
    cw <- sum(outcome == 1L)
    .(cw, sum(outcome == 0L), mean(x.totalAtEdit[outcome == 1L]), cw / .N)
  }][, (1L:2L) := NULL]  # drop the join columns, keep the four results
]
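Since proposal_df and proposal_log are not shown, here is the same by=.EACHI pattern on small made-up tables with the answer's column names, so the grouping behavior can be checked end-to-end:

```r
library(data.table)

# Hypothetical data mirroring the answer's column names
proposal_df  <- data.table(owner = c("a", "a"),
                           editDate = as.Date(c("2020-01-10", "2020-01-20")))
proposal_log <- data.table(owner = "a",
                           editDate = as.Date(c("2020-01-05", "2020-01-15")),
                           outcome = c(1L, 0L),
                           totalAtEdit = c(100, 50))

# For each proposal_df row (.SD), aggregate the log rows dated on or before it
proposal_df[, c("countWon", "countLost", "wonValueMean", "pctWon") :=
  proposal_log[.SD, on = .(owner, editDate <= editDate), by = .EACHI, {
    cw <- sum(outcome == 1L)
    .(cw, sum(outcome == 0L), mean(x.totalAtEdit[outcome == 1L]), cw / .N)
  }][, (1L:2L) := NULL]
]
proposal_df
```

by=.EACHI evaluates the j block once per row of .SD, over that row's matching proposal_log rows, so the second proposal_df row (which sees both log entries) gets countWon = 1, countLost = 1, pctWon = 0.5.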
What does x. stand for in data.table joins with on=
When doing a non-equi join like X[Y, on = .(A < A)], data.table returns the A column from Y (the i-data.table).
To get the desired result, you could do:
X[Y, on = .(A < A), .(A = x.A, B)]
which gives:
A B
1: 1 1
2: 2 1
3: 3 1
In the next release, data.table will return both A columns. See here for the discussion.
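X and Y are not shown above; a hypothetical pair of tables consistent with the printed result (so the values are an inference, not the OP's data):

```r
library(data.table)

# Hypothetical X and Y consistent with the output shown above
X <- data.table(A = 1:3)
Y <- data.table(A = 4L, B = 1L)

# Plain A in j would give Y's value (4) three times; x.A recovers X's values
res <- X[Y, on = .(A < A), .(A = x.A, B)]
res
```

For Y's single row (A = 4), every X row with A < 4 matches, and x.A pulls back X's 1, 2, 3 instead of repeating Y's 4.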
Non-equi join of dates using data table
It looks like you are trying to join edits and events so that a probability value from the edits data table is associated with the correct observation from the events data table.
The error seems to occur because the time intervals used to create the edits data table are not mutually exclusive. When I modify the time intervals to what I think you intended, your code gives the result you were looking for.
library(data.table)
edits <- data.table(proposal=c('A','A','A'),
editField=c('probability','probability','probability'),
startDate=as.POSIXct(c('2017-04-14 00:00:00','2018-10-10 15:47:00','2019-09-06 12:12:00')),
endDate=as.POSIXct(c('2018-10-10 15:47:00','2019-09-06 12:12:00','9999-12-31 05:00:00')),
value=c(.1,.3,.1))
events <- data.table(proposal='A',
editDate=as.POSIXct(c('2017-04-14 00:00:00','2019-09-06 12:12:00','2019-09-06 12:12:00','2019-09-06 12:12:00','2018-07-04 15:33:59','2018-07-27 08:01:00','2018-10-10 15:47:00','2018-10-10 15:47:00','2018-10-10 15:47:00','2018-11-26 11:10:00','2019-02-05 13:06:59')),
editField=c('Created','stage','probability','estOrder','estOrder','estOrder','stage','probability','estOrder','estOrder','estOrder'))
edits[editField=='probability'][events, on=.(proposal, startDate<=editDate, endDate>editDate)]
Or you can do the join without chaining:
edits[events, on=.(proposal, startDate<=editDate, endDate>editDate)]
Or you could do as Jonny Phelps suggested and use foverlaps, but this also requires mutually exclusive time intervals in the edits data table:
events[,startDate:= editDate]
setkey(events, startDate, editDate)
setkey(edits, startDate, endDate)
foverlaps(events, edits, type="any", mult="first")