DT[!(x == .)] and DT[x != .] treat NA in x inconsistently
As of version 1.8.11, the ! does not trigger a not-join for logical expressions, and the results for the two expressions are the same:
DT <- data.table(x=c(1,0,NA), y=1:3)
DT[x != 0]
# x y
#1: 1 1
DT[!(x == 0)]
# x y
#1: 1 1
A couple of other expressions mentioned in @mnel's answer also behave in a more predictable fashion now:
DT[!(x != 0)]
# x y
#1: 0 2
DT[!!(x == 0)]
# x y
#1: 0 2
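A minimal sketch of what this means in practice, assuming a data.table version at or after the one mentioned above: both spellings drop the NA row, so a row with NA in x has to be requested explicitly. The names ne, not_eq and keep_na are only illustrative.

```r
library(data.table)

DT <- data.table(x = c(1, 0, NA), y = 1:3)

# Both forms treat the NA comparison result as FALSE and drop that row.
ne     <- DT[x != 0]
not_eq <- DT[!(x == 0)]
identical(ne$y, not_eq$y)

# To keep rows where x is NA, say so explicitly.
keep_na <- DT[is.na(x) | x != 0]
keep_na$y
```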
R data.table - row subsetting behavior - NA values
From the help file, ?data.table, under the discussion of i:
integer and logical vectors work the same way they do in [.data.frame except logical NAs are treated as FALSE.
In data.frame, NAs in a logical index stay NA and come back as all-NA rows.
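The difference is easy to see side by side. This is a small sketch with made-up data; df and dt are illustrative names.

```r
library(data.table)

df <- data.frame(x = c(1, 0, NA), y = 1:3)
dt <- as.data.table(df)

# data.frame: the NA in the logical index yields an extra all-NA row.
df_rows <- df[df$x != 0, ]
nrow(df_rows)

# data.table: the NA in the logical index is treated as FALSE.
dt_rows <- dt[x != 0]
nrow(dt_rows)
```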
NA in `i` expression of data.table (possible bug)
As @flodel points out, the question can be simplified to: why is this not TRUE?
identical(x[as.logical(a)], x[!!as.logical(a)]) # note the double bangs
The answer lies in how data.table handles NA in i and how it handles ! in i, both of which receive special treatment. The problem really arises in the combination of the two.
- NAs in i are treated as FALSE.
- ! in i is treated as a negation.
This is well documented in ?data.table (as G. Grothendieck points out in another answer).
The relevant portions being:
integer and logical vectors work the same way they do in [.data.frame. Other than NAs in logical i are treated as FALSE and a single NA logical is not recycled to match the number of rows, as it is in [.data.frame.
...
All types of 'i' may be prefixed with !. This signals a not-join or not-select should be performed. Throughout data.table documentation, where we refer to the type of 'i', we mean the type of 'i' after the '!', if present.
If you look at the code for [.data.table, the way a leading ! is handled, if present, is by:
1. Removing the preceding !
2. Interpreting the remaining i
3. Negating that interpretation
The way NAs are handled is by setting those values to FALSE.
However -- and very importantly -- this happens within step 2 above.
Thus, what is really happening is that when i contains NA AND i is prefixed by !, the NAs are effectively interpreted as TRUE. While technically this is as documented, I am not sure if this is as intended.
Of course, there is the final question of @flodel's point: why is x[as.logical(a)] not the same as x[!!as.logical(a)]? The reason is that only the first bang gets special treatment. The second bang is interpreted as normal by R.
Since !NA is still NA, the sequence of modifications for the interpretation of !!(NA) is:
!!(NA)
!( !(NA) )
!( NA )
!( FALSE )
TRUE
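The sequence can be replayed in plain R. This is only an emulation of the steps as described in this answer, not data.table's actual code; old_bang_rows is a hypothetical helper and the vector a is made up.

```r
# Hypothetical helper: `inner` is the logical vector that remains after the
# first bang has been stripped from i.
old_bang_rows <- function(inner) {
  inner[is.na(inner)] <- FALSE   # NAs become FALSE while i is interpreted
  hit <- which(inner)            # rows the inner expression selects
  # Negate the interpretation: return every row that was not selected.
  if (length(hit)) seq_along(inner)[-hit] else seq_along(inner)
}

a <- c(TRUE, FALSE, NA)

# x[a]: no bang, so NA is simply FALSE and only row 1 is selected.
rows_plain <- which(!is.na(a) & a)

# x[!!a]: the first bang is stripped, the second is ordinary R, so the
# inner vector is !a = c(FALSE, TRUE, NA).
rows_double_bang <- old_bang_rows(!a)

rows_plain
rows_double_bang
```

Under these assumed steps the NA row is absent from the first result and present in the second, which is why the two expressions were not identical.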
Subset data.table by logical column
From ?data.table:
Advanced: When i is a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope.
So dt[x] will try to evaluate x in the calling scope (in this case the global environment).
You can get around this by using (, { or force:
dt[(x)]
dt[{x}]
dt[force(x)]
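As a concrete sketch with made-up data (the column name flag is an assumption, not from the question), all three wrappers turn the bare name into an expression, so it is looked up among the columns first:

```r
library(data.table)

dt <- data.table(id = 1:4, flag = c(TRUE, FALSE, NA, TRUE))

# Each wrapper makes i a call rather than a single variable name.
by_paren <- dt[(flag)]
by_brace <- dt[{flag}]
by_force <- dt[force(flag)]

# The NA in flag is treated as FALSE, so rows 1 and 4 are returned.
by_paren$id
```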
data.table := does not support logical data types when adding new column?
Until this bug is fixed (see Matthew Dowle's comment above), you can get around it by directly specifying the type of NA that you want in the new column (except of course for "logical", which is the type that doesn't work at the moment):
DT <- data.table(a=LETTERS[c(1,1:3)],b=4:7,key="a")
DT[ ,newcol := NA_real_] ## Other options are NA_integer_ and NA_character_
# a b newcol
# 1: A 4 NA
# 2: A 5 NA
# 3: B 6 NA
# 4: C 7 NA
## Plain old NA has type and class "logical", partly explaining the
## error message returned by DT[,newcol:=NA]
c(typeof(NA), class(NA))
# [1] "logical" "logical"
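To confirm that the typed NA constants produce columns of the intended type, the classes can be checked after assignment. A short sketch; the column names num, int and chr are made up for illustration.

```r
library(data.table)

DT <- data.table(a = c("A", "A", "B", "C"), b = 4:7)

# Each typed NA constant fixes the type of the new column.
DT[, num := NA_real_]
DT[, int := NA_integer_]
DT[, chr := NA_character_]

col_classes <- sapply(DT[, .(num, int, chr)], class)
col_classes
```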
Need to identify all rows meeting a complex logical condition in R
You could use melt and rle:
melt(dt, id.vars = "colref")[
  , .(detect = with(rle(value),
                    which.max(!values & lengths >= 3) > 1 & sum(values) > 1)),
  by = colref]
colref detect
<int> <lgcl>
1: 1 TRUE
2: 2 TRUE
3: 3 FALSE
4: 4 FALSE
5: 5 FALSE
6: 6 FALSE
7: 7 FALSE
8: 8 TRUE
9: 9 TRUE
10: 10 FALSE
11: 11 FALSE
12: 12 FALSE
13: 13 FALSE
14: 14 FALSE
15: 15 FALSE
16: 16 TRUE
17: 17 FALSE
18: 18 FALSE
19: 19 FALSE
20: 20 FALSE
colref detected
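Since the original dt is not shown here, the following base R sketch only illustrates how the rle expression behaves on single made-up logical vectors. detect_run is a hypothetical wrapper around the same expression.

```r
# Hypothetical wrapper around the expression used in the answer.
detect_run <- function(value) {
  with(rle(value),
       which.max(!values & lengths >= 3) > 1 & sum(values) > 1)
}

# Runs: TRUE x2, FALSE x3, TRUE x1. The first FALSE run of length >= 3 is
# the second run, and there are two TRUE runs.
v1 <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)

# Runs: FALSE x3, TRUE x2. The long FALSE run comes first, so which.max()
# returns 1 and the test fails.
v2 <- c(FALSE, FALSE, FALSE, TRUE, TRUE)

detect_run(v1)
detect_run(v2)
```

Note that which.max() also returns 1 when no run qualifies at all, so "no long FALSE run" and "long FALSE run at the start" both give FALSE.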
subsetting a data.table using != some non-NA excludes NA too
To provide a solution to your question: you should use %in%. It gives you back a logical vector.
a %in% ""
# [1] FALSE TRUE FALSE
x[!a %in% ""]
# a
# 1: 1
# 2: NA
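A self-contained version of the comparison, assuming the column holds the values "1", "" and NA as the output above suggests:

```r
library(data.table)

x <- data.table(a = c("1", "", NA))

# != yields NA for the NA element, and data.table treats that as FALSE.
with_neq <- x[a != ""]

# %in% never returns NA, so negating it keeps the NA row.
with_in <- x[!a %in% ""]

# An explicit is.na() test is another way to keep the NA row.
with_isna <- x[is.na(a) | a != ""]

with_neq$a
with_in$a
```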
To find out why this is happening in data.table (as opposed to data.frame):
If you look at the data.table source code in the file data.table.R, under the function "[.data.table", there's a set of if-statements that check the i argument. One of them is:
if (!missing(i)) {
# Part (1)
isub = substitute(i)
# Part (2)
if (is.call(isub) && isub[[1L]] == as.name("!")) {
notjoin = TRUE
if (!missingnomatch) stop("not-join '!' prefix is present on i but nomatch is provided. Please remove nomatch.");
nomatch = 0L
isub = isub[[2L]]
}
.....
# "isub" is being evaluated using "eval" to result in a logical vector
# Part 3
if (is.logical(i)) {
# see DT[NA] thread re recycling of NA logical
if (identical(i,NA)) i = NA_integer_
# avoids DT[!is.na(ColA) & !is.na(ColB) & ColA==ColB], just DT[ColA==ColB]
else i[is.na(i)] = FALSE
}
....
}
To explain the discrepancy, I've pasted the important piece of code here and marked it in three parts.
First, why doesn't dt[a != ""] work as expected (by the OP)?
First, part 1 evaluates to an object of class call. The second part of the if statement in part 2 returns FALSE, because the function being called is != rather than !. Following that, the call is "evaluated" to give c(TRUE, FALSE, NA). Then part 3 is executed, so NA is replaced with FALSE (the last line of the is.logical block).
Why does x[!(a == "")] work as expected (by the OP)?
Part 1 returns a call once again. But part 2 evaluates to TRUE and therefore sets:
1) notjoin = TRUE
2) isub <- isub[[2L]] # which is equal to (a == "") without the ! (exclamation)
That is where the magic happens. The negation has been removed for now. And remember, this is still an object of class call. So this gets evaluated (using eval) to a logical vector again. So, (a == "") evaluates to c(FALSE, TRUE, NA).
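The check in part 2 can be tried on its own in base R. This is a sketch, not the package code: head_is_bang is a hypothetical helper, and the values of a are assumed from the example.

```r
# Capture the two i expressions unevaluated, as substitute(i) would.
neq_call <- quote(a != "")
not_call <- quote(!(a == ""))

# Hypothetical helper mirroring the test in part 2.
head_is_bang <- function(e) is.call(e) && identical(e[[1L]], as.name("!"))

head_is_bang(neq_call)   # the function being called is `!=`, not `!`
head_is_bang(not_call)   # the outermost call is `!`

# Dropping the bang leaves the parenthesised comparison, still unevaluated.
inner <- not_call[[2L]]

a <- c("1", "", NA)      # assumed values
inner_value <- eval(inner)
inner_value
```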
Now, this is checked for is.logical in part 3. So, here, NA gets replaced with FALSE. It therefore becomes c(FALSE, TRUE, FALSE). At some point later, a which(c(F,T,F)) is executed, which results in 2 here. Because notjoin = TRUE (from part 2), seq_len(nrow(x))[-2] = c(1,3) is returned. So, x[!(a == "")] basically returns x[c(1,3)], which is the desired result. Here's the relevant code snippet:
if (notjoin) {
if (bywithoutby || !is.integer(irows) || is.na(nomatch)) stop("Internal error: notjoin but bywithoutby or !integer or nomatch==NA")
irows = irows[irows!=0L]
# WHERE MAGIC HAPPENS (returns c(1,3))
i = irows = if (length(irows)) seq_len(nrow(x))[-irows] else NULL # NULL meaning all rows i.e. seq_len(nrow(x))
# Doing this once here, helps speed later when repeatedly subsetting each column. R's [irows] would do this for each
# column when irows contains negatives.
}
Given that, I think there are some inconsistencies with the syntax. If I manage to get time to formulate the problem, I'll write a post soon.
get repeated rows by vector of values in data table
Here is a nice solution without any complex technique.
df[match(v,ID)]
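A small sketch of how this behaves, with made-up data: match() returns, for each element of v, the position of its first match in ID, so repeated values in v repeat the corresponding row.

```r
library(data.table)

df <- data.table(ID = c(10L, 20L, 30L), val = c("a", "b", "c"))
v  <- c(20L, 20L, 10L)

# match(v, ID) gives the row positions 2, 2, 1.
res <- df[match(v, ID)]
res$val
```

A value in v that does not occur in ID makes match() return NA, which comes back as a row of NAs.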