Subsetting a Data.Table Using !=<Some Non-Na> Excludes Na Too

subsetting a data.table using != some non-NA excludes NA too

To provide a solution to your question:

You should use %in%. It gives you back a logical vector.

a %in% ""
# [1] FALSE  TRUE FALSE

x[!a %in% ""]
#     a
# 1:  1
# 2: NA

To find out why this is happening in `data.table`:

(as opposted to data.frame)

If you look at the data.table source code on the file data.table.R under the function "[.data.table", there's a set of if-statements that check for i argument. One of them is:

if (!missing(i)) {
    # Part (1)
    isub = substitute(i)

    # Part (2)
    if (is.call(isub) && isub[[1L]] == as.name("!")) {
        notjoin = TRUE
        if (!missingnomatch) stop("not-join '!' prefix is present on i but nomatch is provided. Please remove nomatch.");
        nomatch = 0L
        isub = isub[[2L]]
    }

    .....
    # "isub" is being evaluated using "eval" to result in a logical vector

    # Part 3
    if (is.logical(i)) {
        # see DT[NA] thread re recycling of NA logical
        if (identical(i,NA)) i = NA_integer_  
        # avoids DT[!is.na(ColA) & !is.na(ColB) & ColA==ColB], just DT[ColA==ColB]
        else i[is.na(i)] = FALSE  
    }
    ....
}

To explain the discrepancy, I've pasted the important piece of code here. And I've also marked them into 3 parts.

First, why `dt[a != ""]` doesn't work as expected (by the OP)?

First, part 1 evaluates to an object of class call. The second part of the if statement in part 2 returns FALSE. Following that, the call is "evaluated" to give c(TRUE, FALSE, NA) . Then part 3 is executed. So, NA is replaced to FALSE (the last line of the logical loop).

why does `x[!(a== "")]` work as expected (by the OP)?

part 1 returns a call once again. But, part 2 evaluates to TRUE and therefore sets:

1) `notjoin = TRUE`
2) isub <- isub[[2L]] # which is equal to (a == "") without the ! (exclamation)

That is where the magic happened. The negation has been removed for now. And remember, this is still an object of class call. So this gets evaluated (using eval) to logical again. So, (a=="") evaluates to c(FALSE, TRUE, NA).

Now, this is checked for is.logical in part 3. So, here, NA gets replaced to FALSE. It therefore becomes, c(FALSE, TRUE, FALSE). At some point later, a which(c(F,T,F)) is executed, which results in 2 here. Because notjoin = TRUE (from part 2) seq_len(nrow(x))[-2] = c(1,3) is returned. so, x[!(a=="")] basically returns x[c(1,3)] which is the desired result. Here's the relevant code snippet:

if (notjoin) {
    if (bywithoutby || !is.integer(irows) || is.na(nomatch)) stop("Internal error: notjoin but bywithoutby or !integer or nomatch==NA")
    irows = irows[irows!=0L]
    # WHERE MAGIC HAPPENS (returns c(1,3))
    i = irows = if (length(irows)) seq_len(nrow(x))[-irows] else NULL  # NULL meaning all rows i.e. seq_len(nrow(x))
    # Doing this once here, helps speed later when repeatedly subsetting each column. R's [irows] would do this for each
    # column when irows contains negatives.
}

Given that, I think there are some inconsistencies with the syntax.. And if I manage to get time to formulate the problem, then I'll write a post soon.

R data.table - row subsetting behavior - NA values

From the help file, ?data.table, under the discussion of i:

integer and logical vectors work the same way they do in [.data.frame except logical NAs are treated as FALSE.

In data.frame, NAs are treated as NA.

Selecting non `NA` values from duplicate rows with `data.table` -- when having more than one grouping variable

Here some data.table-based solutions.

setDT(df_id_year_and_type)

method 1

na.omit(df_id_year_and_type, cols="type") drops NA rows based on column type.
unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE) finds all the groups.
And by joining them (using the last match: mult="last"), we obtain the desired output.

na.omit(df_id_year_and_type, cols="type"
        )[unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE), 
          on=c('id', 'year'), 
          mult="last"]

#       id  year   type
#    <num> <num> <char>
# 1:     1  2002      A
# 2:     2  2008      B
# 3:     3  2010      D
# 4:     3  2013   <NA>
# 5:     4  2020      C
# 6:     5  2009      A
# 7:     6  2010      B
# 8:     6  2012   <NA>

method 2

df_id_year_and_type[df_id_year_and_type[, .I[which.max(cumsum(!is.na(type)))], .(id, year)]$V1,]

method 3

(likely slower because of [ overhead)

df_id_year_and_type[, .SD[which.max(cumsum(!is.na(type)))], .(id, year)]

How to subset data in R without losing NA rows?

If we decide to use subset function, then we need to watch out:

For ordinary vectors, the result is simply ‘x[subset & !is.na(subset)]’.

So only non-NA values will be retained.

If you want to keep NA cases, use logical or condition to tell R not to drop NA cases:

subset(df1, Height < 40 | is.na(Height))
# or `df1[df1$Height < 40 | is.na(df1$Height), ]`

Don't use directly (to be explained soon):

df2 <- df1[df1$Height < 40, ]

Example

df1 <- data.frame(Height = c(NA, 2, 4, NA, 50, 60), y = 1:6)

subset(df1, Height < 40 | is.na(Height))

#  Height y
#1     NA 1
#2      2 2
#3      4 3
#4     NA 4

df1[df1$Height < 40, ]

#  Height  y
#1     NA NA
#2      2  2
#3      4  3
#4     NA NA

The reason that the latter fails, is that indexing by NA gives NA. Consider this simple example with a vector:

x <- 1:4
ind <- c(NA, TRUE, NA, FALSE)
x[ind]
# [1] NA  2 NA

We need to somehow replace those NA with TRUE. The most straightforward way is to add another "or" condition is.na(ind):

x[ind | is.na(ind)]
# [1] 1 2 3

This is exactly what will happen in your situation. If your Height contains NA, then logical operation Height < 40 ends up a mix of TRUE / FALSE / NA, so we need replace NA by TRUE as above.

r data.table excluding rows with certain value in a column removes NAs too

This isn't a data.table issue.

In the first case you don't select NAs:

NA != 4
[1] NA

In the second case you do:

!NA %in% 4
[1] TRUE

NA in `i` expression of data.table (possible bug)

As @flodel points out, the question can be simplified to, Why is this not TRUE:

identical(x[as.logical(a)], x[!!as.logical(a)])   # note the double bangs

The answer lies in how data.table handles NA in i and how it handles ! in i. Both of which receive special treatment. The problem really arises in the combination of the two.

NA's in i are treated as FALSE.
! in i are treated as a negation.

This is well documented in ?.data.table (as G. Grothendieck points out in another answer).
The relevant portions being:

integer and logical vectors work the same way they do in [.data.frame. Other than NAs in logical i are treated as FALSE and a single NA logical is not recycled to match the number of rows, as it is in [.data.frame.

...

All types of 'i' may be prefixed with !. This signals a not-join or not-select should be performed. Throughout data.table documentation, where we refer to the type of 'i', we mean the type of 'i' after the '!', if present.

If you look at the code for [.data.table, the way ! are handled, if present, is by

removing the preceding !
Interpreting the remaining i
Negating that interpretation

The way NAs are handled is by setting those values to FALSE.

However -- and very importantly -- this happens within step 2 above.

Thus, what is really happening is that when i contains NA AND i is prefixed by !, then the NA's are effectively interpreted as TRUE. While technically, this is as documented, I am not sure if this is as intended.

Of course, there is the final question of @flodel's point: Why is x[as.logical(a)] not the same as x[!!as.logical(a)]? The reason for this is that only the first bang gets special treatment. The second bang is interpreted as normal by R.

Since !NA is still NA, the sequence of modification for the interpretation of !!(NA) is:

!!(NA)  
!( !(NA) )  
!(  NA   )
!( FALSE )
TRUE

data.table subsetting by NaN doesn't work

Update: This has been fixed a while back, in v1.9.2. From NEWS:

NA, NaN, +Inf and -Inf are now considered distinct values, may be in keys, can be joined to and can be grouped. data.table defines: NA < NaN < -Inf. Thanks to Martin Liberts for the suggestions, #4684, #4815 and #4883.

require(data.table) ## 1.9.2+
my.dt[J(NaN)]
#      x  y
# 1: NaN  2
# 2: NaN  4
# 3: NaN  6

This issue is part design choice, part bug. There are several questions on SO and a few emails on the listserv exploring NA's in data.table key.

The main idea is outlined in the FAQ in that NA's are treated as FALSE

Please feel free chime in on the conversation in the mailing list. There was a conversation started by @Arun,

http://r.789695.n4.nabble.com/Follow-up-on-subsetting-data-table-with-NAs-td4669097.html

Also, you can read more in the answers and comments to any of the following questions on SO:

subsetting a data.table using !=<some non-NA> excludes NA too

NA in `i` expression of data.table (possible bug)

DT[!(x == .)] and DT[x != .] treat NA in x inconsistently

In the meantime, your best bet is to use is.na.

While it is slower than a radix search, it is still faster than most vector searches in R, and certainly much, much faster than any fancy workarounds

library(microbenchmark)
microbenchmark(my.dt[.(1)], my.dt[is.na(ID)], my.dt[ID==1], my.dt[!!!(ID)])
# Unit: milliseconds
               expr    median 
        my.dt[.(1)]  1.309948 
   my.dt[is.na(ID)]  3.444689   <~~ Not bad
     my.dt[ID == 1]  4.005093 
 my.dt[!(!(!(ID)))] 10.038134 

### using the following for my.dt
my.dt <- as.data.table(replicate(20, sample(100, 1e5, TRUE)))
setnames(my.dt, 1, "ID")
my.dt[sample(1e5, 1e3), ID := NA]
setkey(my.dt, ID)

Replace NAs in a Single Column of a Data Table in R

Your code isn't off unless the data in the column is not a character in which case you would have to set -999 as inter/numeric without ""

data <- read.table(header=TRUE, text='
 id weight   size
 1     20  small
 2     27  large
 3     24 medium
 ')

data <- data.table(data)

> data[size == 'small', weight := NA]
> data
     size id weight
1:  small  1     NA
2:  large  2     27
3: medium  3     24
> is.na(data)
      size    id weight
[1,] FALSE FALSE   TRUE
[2,] FALSE FALSE  FALSE
[3,] FALSE FALSE  FALSE
> data[is.na(weight), weight := -999]
> data
     size id weight
1:  small  1   -999
2:  large  2     27
3: medium  3     24
> data[size == 'small', weight := NA]
> data[is.na(weight), weight := "-999"]
Warning message:
In `[.data.table`(data, is.na(weight), `:=`(weight, "-999")) :
  Coerced 'character' RHS to 'integer' to match the column's type.

EDIT: This is, I just saw, what @dracodoc suggested in comment

Subsetting a Data.Table Using !=<Some Non-Na> Excludes Na Too