DT[!(x == .)] and DT[x != .] treat NA in x inconsistently

As of version 1.8.11, ! no longer triggers a not-join for logical expressions, so the two expressions return the same result:

DT <- data.table(x=c(1,0,NA), y=1:3)
DT[x != 0]
#    x y
# 1: 1 1
DT[!(x == 0)]
#    x y
# 1: 1 1

A couple of other expressions mentioned in @mnel's answer also behave more predictably now:

DT[!(x != 0)]
#    x y
# 1: 0 2
DT[!!(x == 0)]
#    x y
# 1: 0 2

R data.table - row subsetting behavior - NA values

From the help file, ?data.table, under the discussion of i:

integer and logical vectors work the same way they do in [.data.frame except logical NAs are treated as FALSE.

In data.frame, a logical NA in i stays NA, so it returns a row filled with NA instead of being dropped.
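A minimal sketch of the difference, using a made-up three-row table (the names df, DT and i and the data are illustrative, not taken from the original question):

```r
library(data.table)

df <- data.frame(x = c(1, 0, NA), y = 1:3)
DT <- as.data.table(df)

i <- c(TRUE, FALSE, NA)

# data.frame: the NA in i produces an extra row filled with NA
n_df <- nrow(df[i, ])   # 2

# data.table: the NA in i is treated as FALSE, so only row 1 comes back
n_dt <- nrow(DT[i])     # 1
```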

NA in `i` expression of data.table (possible bug)

As @flodel points out, the question can be simplified to, Why is this not TRUE:

identical(x[as.logical(a)], x[!!as.logical(a)])   # note the double bangs

The answer lies in how data.table handles NA in i and how it handles ! in i, both of which receive special treatment. The problem really arises in the combination of the two.

  • NAs in i are treated as FALSE.
  • A leading ! in i is treated as a negation (a not-join or not-select).

This is well documented in ?data.table (as G. Grothendieck points out in another answer).
The relevant portions are:

integer and logical vectors work the same way they do in [.data.frame. Other than NAs in logical i are treated as FALSE and a single NA logical is not recycled to match the number of rows, as it is in [.data.frame.

...

All types of 'i' may be prefixed with !. This signals a not-join or not-select should be performed. Throughout data.table documentation, where we refer to the type of 'i', we mean the type of 'i' after the '!', if present.

If you look at the code for [.data.table, a leading !, if present, is handled by

  1. Removing the leading !
  2. Interpreting the remaining i
  3. Negating that interpretation

The way NAs are handled is by setting those values to FALSE.

However -- and very importantly -- this happens within step 2 above.

Thus, what really happens is that when i contains NA and is prefixed by !, the NAs are effectively interpreted as TRUE. While this is technically as documented, I am not sure it is as intended.


Of course, there is the final question of @flodel's point: Why is x[as.logical(a)] not the same as x[!!as.logical(a)]? The reason for this is that only the first bang gets special treatment. The second bang is interpreted as normal by R.

Since !NA is still NA, !!(NA) is interpreted in the following sequence:

!!(NA)  
!( !(NA) )
!( NA )
!( FALSE )
TRUE
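The sequence can be reproduced in base R with a small stand-in for the three steps. This is only a sketch of the behaviour described in this answer: old_style_select is a hypothetical helper written for illustration, not data.table code.

```r
# Hypothetical helper that mimics the three steps in base R.
old_style_select <- function(expr, env = parent.frame()) {
  isub <- substitute(expr)
  negate <- is.call(isub) && identical(isub[[1L]], as.name("!"))
  if (negate) isub <- isub[[2L]]   # step 1: strip the leading !
  i <- eval(isub, env)             # step 2: interpret the rest ...
  i[is.na(i)] <- FALSE             #         ... and turn NA into FALSE
  if (negate) !i else i            # step 3: negate that interpretation
}

a <- c(TRUE, FALSE, NA)

r_plain  <- old_style_select(a)     # TRUE FALSE FALSE
r_single <- old_style_select(!a)    # FALSE TRUE TRUE  (the NA ends up selected)
r_double <- old_style_select(!!a)   # TRUE FALSE TRUE  (differs from r_plain)
```

Only the outermost ! is stripped in step 1. The inner ! is evaluated by R as usual, which is why r_plain and r_double disagree on the NA element.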

Subset data.table by logical column

From ?data.table

Advanced: When i is a single variable name, it is not considered an
expression of column names and is instead evaluated in calling scope.

So dt[x] will try to evaluate x in the calling scope (in this case, the global environment).

You can get around this by using ( or { or force:

dt[(x)]
dt[{x}]
dt[force(x)]
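As a small illustration with made-up data (the table dt and its columns id and keep are hypothetical), wrapping the column name in parentheses turns it into an expression, so it is looked up among the columns:

```r
library(data.table)

dt <- data.table(id = 1:4, keep = c(TRUE, FALSE, NA, TRUE))

# (keep) is an expression, so keep is resolved as a column of dt;
# the NA in keep is treated as FALSE and that row is dropped
kept <- dt[(keep)]
kept$id   # 1 4
```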

data.table := does not support logical data types when adding new column?

Until this bug is fixed (see Matthew Dowle's comment above), you can get around it by directly specifying the type of NA that you want in the new column (except of course for "logical", which is the type that doesn't work at the moment):

DT <- data.table(a=LETTERS[c(1,1:3)],b=4:7,key="a")
DT[ ,newcol := NA_real_] ## Other options are NA_integer_ and NA_character_
#    a b newcol
# 1: A 4     NA
# 2: A 5     NA
# 3: B 6     NA
# 4: C 7     NA

## Plain old NA has type and class "logical", partly explaining the
## error message returned by DT[,newcol:=NA]
c(typeof(NA), class(NA))
# [1] "logical" "logical"
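For illustration, the typed NA constants each give the new column a definite class. This is a sketch with made-up column names (n, i, s):

```r
library(data.table)

DT <- data.table(a = c("A", "A", "B", "C"), b = 4:7)

# One new column per typed NA constant
DT[, `:=`(n = NA_real_, i = NA_integer_, s = NA_character_)]

cls <- sapply(DT, class)
cls[c("n", "i", "s")]   # "numeric" "integer" "character"
```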

Need to identify all rows meeting a complex logical condition in R

You could use melt and rle:

melt(dt, id.vars = "colref")[
  , .(detect = with(rle(value),
                    which.max(!values & lengths >= 3) > 1 & sum(values) > 1)),
  by = colref]

colref detect
<int> <lgcl>
1: 1 TRUE
2: 2 TRUE
3: 3 FALSE
4: 4 FALSE
5: 5 FALSE
6: 6 FALSE
7: 7 FALSE
8: 8 TRUE
9: 9 TRUE
10: 10 FALSE
11: 11 FALSE
12: 12 FALSE
13: 13 FALSE
14: 14 FALSE
15: 15 FALSE
16: 16 TRUE
17: 17 FALSE
18: 18 FALSE
19: 19 FALSE
20: 20 FALSE
colref detected
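The original dt is not shown, so here is a sketch of what the rle expression computes for a single made-up row, assuming value holds 0/1 indicators:

```r
value <- c(1, 1, 0, 0, 0, 1, 1)   # hypothetical values for one colref

r <- rle(value)
r$lengths   # 2 3 2
r$values    # 1 0 1

# TRUE for each run of zeros that is at least three long
long_zero_run <- !r$values & r$lengths >= 3   # FALSE TRUE FALSE

# which.max() gives the position of the first TRUE (or 1 if there is none),
# so "> 1" means such a run exists and is not the very first run
first_long <- which.max(long_zero_run)        # 2

# with 0/1 values, sum(r$values) counts the runs of ones
detect <- first_long > 1 & sum(r$values) > 1  # TRUE
```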

subsetting a data.table using != some non-NA excludes NA too

To provide a solution to your question:

You should use %in%. It returns a logical vector that never contains NA.

a %in% ""
# [1] FALSE TRUE FALSE

x[!a %in% ""]
# a
# 1: 1
# 2: NA
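The difference between the two comparisons is visible in base R alone. The vector a below is assumed to be character, reconstructed from the output shown:

```r
a <- c("1", "", NA)

ne  <- a != ""        # TRUE FALSE NA    (comparison with NA gives NA)
nin <- !(a %in% "")   # TRUE FALSE TRUE  (%in% gives FALSE for NA, never NA)
```

Because %in% binds more tightly than !, the expression !a %in% "" is parsed as !(a %in% "").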

To find out why this is happening in data.table (as opposed to data.frame):

If you look at the data.table source code in the file data.table.R, under the function "[.data.table", there's a set of if-statements that check the i argument. One of them is:

if (!missing(i)) {
    # Part (1)
    isub = substitute(i)

    # Part (2)
    if (is.call(isub) && isub[[1L]] == as.name("!")) {
        notjoin = TRUE
        if (!missingnomatch) stop("not-join '!' prefix is present on i but nomatch is provided. Please remove nomatch.");
        nomatch = 0L
        isub = isub[[2L]]
    }

    .....
    # "isub" is being evaluated using "eval" to result in a logical vector

    # Part (3)
    if (is.logical(i)) {
        # see DT[NA] thread re recycling of NA logical
        if (identical(i,NA)) i = NA_integer_
        # avoids DT[!is.na(ColA) & !is.na(ColB) & ColA==ColB], just DT[ColA==ColB]
        else i[is.na(i)] = FALSE
    }
    ....
}

To explain the discrepancy, I've pasted the important pieces of code above and marked them as three parts.

First, why doesn't x[a != ""] work as the OP expected?

Part 1 evaluates to an object of class call. The second condition of the if statement in part 2 returns FALSE, because the function of the call is != rather than !. Following that, the call is evaluated to give c(TRUE, FALSE, NA). Then part 3 is executed, so NA is replaced by FALSE (the last line of the is.logical block).

Why does x[!(a == "")] work as the OP expected?

Part 1 returns a call once again. But the condition in part 2 evaluates to TRUE and therefore sets:

1) notjoin = TRUE
2) isub = isub[[2L]] # which is (a == ""), without the leading !

That is where the magic happens. The negation has been removed for now, and isub is still an object of class call. It is then evaluated (using eval) to a logical vector, so (a == "") evaluates to c(FALSE, TRUE, NA).

Now this is checked with is.logical in part 3, where NA is replaced by FALSE, giving c(FALSE, TRUE, FALSE). At some point later, which(c(F,T,F)) is executed, which results in 2 here. Because notjoin = TRUE (from part 2), seq_len(nrow(x))[-2] = c(1,3) is returned. So x[!(a == "")] basically returns x[c(1,3)], which is the desired result. Here's the relevant code snippet:

if (notjoin) {
    if (bywithoutby || !is.integer(irows) || is.na(nomatch)) stop("Internal error: notjoin but bywithoutby or !integer or nomatch==NA")
    irows = irows[irows!=0L]
    # WHERE MAGIC HAPPENS (returns c(1,3))
    i = irows = if (length(irows)) seq_len(nrow(x))[-irows] else NULL # NULL meaning all rows i.e. seq_len(nrow(x))
    # Doing this once here, helps speed later when repeatedly subsetting each column. R's [irows] would do this for each
    # column when irows contains negatives.
}
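The row indices from both walk-throughs can be traced in base R. This is a sketch that assumes a three-row x with a = c("1", "", NA); the intermediate variable names are illustrative:

```r
a <- c("1", "", NA)

# x[a != ""]: no leading !, so the result is used directly
ne <- a != ""                           # TRUE FALSE NA
ne[is.na(ne)] <- FALSE                  # part 3: NA becomes FALSE
rows_ne <- which(ne)                    # 1

# x[!(a == "")]: the leading ! is stripped, then applied to the row numbers
eq <- a == ""                           # FALSE TRUE NA
eq[is.na(eq)] <- FALSE                  # part 3: NA becomes FALSE
irows <- which(eq)                      # 2
rows_not <- seq_len(length(a))[-irows]  # 1 3
```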

Given that, I think there are some inconsistencies in the syntax. If I manage to find time to formulate the problem, I'll write a post soon.

get repeated rows by vector of values in data table

Here is a nice solution without any complex technique.

df[match(v,ID)]
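A small sketch with made-up data (the column val and the contents of v are illustrative): match returns one row position per element of v, so repeated values in v give repeated rows, in the order of v.

```r
library(data.table)

df <- data.table(ID = c("a", "b", "c"), val = 1:3)
v  <- c("b", "b", "a")

pos <- match(v, df$ID)   # 2 2 1
res <- df[match(v, ID)]  # ID is a column of df; v comes from the calling scope
res$val                  # 2 2 1
```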

