subsetting a data.table using != some non-NA excludes NA too
To provide a solution to your question:
You should use %in%
. It gives you back a logical vector.
a %in% ""
# [1] FALSE TRUE FALSE
x[!a %in% ""]
# a
# 1: 1
# 2: NA
To find out why this is happening in data.table
:
(as opposted to data.frame
)
If you look at the data.table
source code on the file data.table.R
under the function "[.data.table"
, there's a set of if-statements
that check for i
argument. One of them is:
if (!missing(i)) {
# Part (1)
isub = substitute(i)
# Part (2)
if (is.call(isub) && isub[[1L]] == as.name("!")) {
notjoin = TRUE
if (!missingnomatch) stop("not-join '!' prefix is present on i but nomatch is provided. Please remove nomatch.");
nomatch = 0L
isub = isub[[2L]]
}
.....
# "isub" is being evaluated using "eval" to result in a logical vector
# Part 3
if (is.logical(i)) {
# see DT[NA] thread re recycling of NA logical
if (identical(i,NA)) i = NA_integer_
# avoids DT[!is.na(ColA) & !is.na(ColB) & ColA==ColB], just DT[ColA==ColB]
else i[is.na(i)] = FALSE
}
....
}
To explain the discrepancy, I've pasted the important piece of code here. And I've also marked them into 3 parts.
First, why dt[a != ""]
doesn't work as expected (by the OP)?
First, part 1
evaluates to an object of class call
. The second part of the if statement in part 2
returns FALSE. Following that, the call
is "evaluated" to give c(TRUE, FALSE, NA)
. Then part 3
is executed. So, NA
is replaced to FALSE
(the last line of the logical loop).
why does x[!(a== "")]
work as expected (by the OP)?
part 1
returns a call once again. But, part 2
evaluates to TRUE and therefore sets:
1) `notjoin = TRUE`
2) isub <- isub[[2L]] # which is equal to (a == "") without the ! (exclamation)
That is where the magic happened. The negation has been removed for now. And remember, this is still an object of class call. So this gets evaluated (using eval
) to logical again. So, (a=="")
evaluates to c(FALSE, TRUE, NA)
.
Now, this is checked for is.logical
in part 3
. So, here, NA
gets replaced to FALSE
. It therefore becomes, c(FALSE, TRUE, FALSE)
. At some point later, a which(c(F,T,F))
is executed, which results in 2 here. Because notjoin = TRUE
(from part 2
) seq_len(nrow(x))[-2]
= c(1,3) is returned. so, x[!(a=="")]
basically returns x[c(1,3)]
which is the desired result. Here's the relevant code snippet:
if (notjoin) {
if (bywithoutby || !is.integer(irows) || is.na(nomatch)) stop("Internal error: notjoin but bywithoutby or !integer or nomatch==NA")
irows = irows[irows!=0L]
# WHERE MAGIC HAPPENS (returns c(1,3))
i = irows = if (length(irows)) seq_len(nrow(x))[-irows] else NULL # NULL meaning all rows i.e. seq_len(nrow(x))
# Doing this once here, helps speed later when repeatedly subsetting each column. R's [irows] would do this for each
# column when irows contains negatives.
}
Given that, I think there are some inconsistencies with the syntax.. And if I manage to get time to formulate the problem, then I'll write a post soon.
R data.table - row subsetting behavior - NA values
From the help file, ?data.table
, under the discussion of i
:
integer and logical vectors work the same way they do in [.data.frame except logical NAs are treated as FALSE.
In data.frame
, NAs are treated as NA.
Selecting non `NA` values from duplicate rows with `data.table` -- when having more than one grouping variable
Here some data.table-based solutions.
setDT(df_id_year_and_type)
method 1
na.omit(df_id_year_and_type, cols="type")
drops NA
rows based on column type
.unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE)
finds all the groups.
And by joining them (using the last match: mult="last"
), we obtain the desired output.
na.omit(df_id_year_and_type, cols="type"
)[unique(df_id_year_and_type[, .(id, year)], fromLast=TRUE),
on=c('id', 'year'),
mult="last"]
# id year type
# <num> <num> <char>
# 1: 1 2002 A
# 2: 2 2008 B
# 3: 3 2010 D
# 4: 3 2013 <NA>
# 5: 4 2020 C
# 6: 5 2009 A
# 7: 6 2010 B
# 8: 6 2012 <NA>
method 2
df_id_year_and_type[df_id_year_and_type[, .I[which.max(cumsum(!is.na(type)))], .(id, year)]$V1,]
method 3
(likely slower because of [
overhead)
df_id_year_and_type[, .SD[which.max(cumsum(!is.na(type)))], .(id, year)]
How to subset data in R without losing NA rows?
If we decide to use subset
function, then we need to watch out:
For ordinary vectors, the result is simply ‘x[subset & !is.na(subset)]’.
So only non-NA values will be retained.
If you want to keep NA
cases, use logical or condition to tell R not to drop NA
cases:
subset(df1, Height < 40 | is.na(Height))
# or `df1[df1$Height < 40 | is.na(df1$Height), ]`
Don't use directly (to be explained soon):
df2 <- df1[df1$Height < 40, ]
Example
df1 <- data.frame(Height = c(NA, 2, 4, NA, 50, 60), y = 1:6)
subset(df1, Height < 40 | is.na(Height))
# Height y
#1 NA 1
#2 2 2
#3 4 3
#4 NA 4
df1[df1$Height < 40, ]
# Height y
#1 NA NA
#2 2 2
#3 4 3
#4 NA NA
The reason that the latter fails, is that indexing by NA
gives NA
. Consider this simple example with a vector:
x <- 1:4
ind <- c(NA, TRUE, NA, FALSE)
x[ind]
# [1] NA 2 NA
We need to somehow replace those NA
with TRUE
. The most straightforward way is to add another "or" condition is.na(ind)
:
x[ind | is.na(ind)]
# [1] 1 2 3
This is exactly what will happen in your situation. If your Height
contains NA
, then logical operation Height < 40
ends up a mix of TRUE
/ FALSE
/ NA
, so we need replace NA
by TRUE
as above.
r data.table excluding rows with certain value in a column removes NAs too
This isn't a data.table
issue.
In the first case you don't select NA
s:
NA != 4
[1] NA
In the second case you do:
!NA %in% 4
[1] TRUE
NA in `i` expression of data.table (possible bug)
As @flodel points out, the question can be simplified to, Why is this not TRUE
:
identical(x[as.logical(a)], x[!!as.logical(a)]) # note the double bangs
The answer lies in how data.table handles NA
in i
and how it handles !
in i
. Both of which receive special treatment. The problem really arises in the combination of the two.
NA
's ini
are treated asFALSE
.!
ini
are treated as a negation.
This is well documented in ?.data.table
(as G. Grothendieck points out in another answer).
The relevant portions being:
integer and logical vectors work the same way they do in [.data.frame. Other than NAs in logical i are treated as FALSE and a single NA logical is not recycled to match the number of rows, as it is in [.data.frame.
...
All types of 'i' may be prefixed with !. This signals a not-join or not-select should be performed. Throughout data.table documentation, where we refer to the type of 'i', we mean the type of 'i' after the '!', if present.
If you look at the code for [.data.table
, the way !
are handled, if present, is by
- removing the preceding
!
- Interpreting the remaining
i
- Negating that interpretation
The way NA
s are handled is by setting those values to FALSE
.
However -- and very importantly -- this happens within step 2 above.
Thus, what is really happening is that when i
contains NA
AND i
is prefixed by !
, then the NA's are effectively interpreted as TRUE
. While technically, this is as documented, I am not sure if this is as intended.
Of course, there is the final question of @flodel's point: Why is x[as.logical(a)]
not the same as x[!!as.logical(a)]
? The reason for this is that only the first bang gets special treatment. The second bang is interpreted as normal by R
.
Since !NA
is still NA
, the sequence of modification for the interpretation of !!(NA) is:
!!(NA)
!( !(NA) )
!( NA )
!( FALSE )
TRUE
data.table subsetting by NaN doesn't work
Update: This has been fixed a while back, in v1.9.2. From NEWS:
NA
,NaN
,+Inf
and-Inf
are now considered distinct values, may be in keys, can be joined to and can be grouped. data.table defines: NA < NaN < -Inf. Thanks to Martin Liberts for the suggestions, #4684, #4815 and #4883.
require(data.table) ## 1.9.2+
my.dt[J(NaN)]
# x y
# 1: NaN 2
# 2: NaN 4
# 3: NaN 6
This issue is part design choice, part bug. There are several questions on SO and a few emails on the listserv exploring NA's in data.table
key.
The main idea is outlined in the FAQ in that NA
's are treated as FALSE
Please feel free chime in on the conversation in the mailing list. There was a conversation started by @Arun,
http://r.789695.n4.nabble.com/Follow-up-on-subsetting-data-table-with-NAs-td4669097.html
Also, you can read more in the answers and comments to any of the following questions on SO:
subsetting a data.table using !=<some non-NA> excludes NA too
NA in `i` expression of data.table (possible bug)
DT[!(x == .)] and DT[x != .] treat NA in x inconsistently
In the meantime, your best bet is to use is.na
.
While it is slower than a radix search, it is still faster than most vector searches in R
, and certainly much, much faster than any fancy workarounds
library(microbenchmark)
microbenchmark(my.dt[.(1)], my.dt[is.na(ID)], my.dt[ID==1], my.dt[!!!(ID)])
# Unit: milliseconds
expr median
my.dt[.(1)] 1.309948
my.dt[is.na(ID)] 3.444689 <~~ Not bad
my.dt[ID == 1] 4.005093
my.dt[!(!(!(ID)))] 10.038134
### using the following for my.dt
my.dt <- as.data.table(replicate(20, sample(100, 1e5, TRUE)))
setnames(my.dt, 1, "ID")
my.dt[sample(1e5, 1e3), ID := NA]
setkey(my.dt, ID)
Replace NAs in a Single Column of a Data Table in R
Your code isn't off unless the data in the column is not a character in which case you would have to set -999 as inter/numeric without ""
data <- read.table(header=TRUE, text='
id weight size
1 20 small
2 27 large
3 24 medium
')
data <- data.table(data)
> data[size == 'small', weight := NA]
> data
size id weight
1: small 1 NA
2: large 2 27
3: medium 3 24
> is.na(data)
size id weight
[1,] FALSE FALSE TRUE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE FALSE
> data[is.na(weight), weight := -999]
> data
size id weight
1: small 1 -999
2: large 2 27
3: medium 3 24
> data[size == 'small', weight := NA]
> data[is.na(weight), weight := "-999"]
Warning message:
In `[.data.table`(data, is.na(weight), `:=`(weight, "-999")) :
Coerced 'character' RHS to 'integer' to match the column's type.
EDIT: This is, I just saw, what @dracodoc suggested in comment
Related Topics
Function to Calculate Geospatial Distance Between Two Points (Lat,Long) Using R
How to Install Development Version of R Packages Github Repository
How to Run R on a Server Without X11, and Avoid Broken Dependencies
How to Replace Nan Value with Zero in a Huge Data Frame
Display a Time Clock in the R Command Line
Examples of the Perils of Globals in R and Stata
Dealing with True, False, Na and Nan
Split Up '...' Arguments and Distribute to Multiple Functions
Extracting Unique Numbers from String in R
Finding Overlaps Between Interval Sets/Efficient Overlap Joins
Linear Regression Loop for Each Independent Variable Individually Against Dependent
Display a Time Clock in the R Command Line
Extract Matrix Column Values by Matrix Column Name
How to Use a String Variable to Select a Data Frame Column Using $ Notation
Force Character Vector Encoding from "Unknown" to "Utf-8" in R