In R, Does NA == NA?

in R does NA == NA?

NA is identical to NA, but doesn't equal it. If you run NA == NA, the result is NA, because the == operator treats missing values as non-comparable. From the documentation for identical():

A call to identical is the way to test exact equality in if and while
statements, as well as in logical expressions that use && or ||. In
all these applications you need to be assured of getting a single
logical value.

Users often use the comparison operators, such as == or !=, in these
situations. It looks natural, but it is not what these operators are
designed to do in R. They return an object like the arguments. If you
expected x and y to be of length 1, but it happened that one of them
was not, you will not get a single FALSE. Similarly, if one of the
arguments is NA, the result is also NA. In either case, the expression
if(x == y).... won't work as expected.

And from the documentation for ==:

Missing values (NA) and NaN values are regarded as non-comparable even
to themselves, so comparisons involving them will always result in NA.
Missing values can also result when character strings are compared and
one is not valid in the current collation locale.

The rationale is that missing values, at a conceptual level, are not the same as one another. They could potentially represent very different values, but we just don't know what those values are.

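An alternative in this situation, when a condition on a column such as birth_year should also keep the rows where that column is missing, is to add | is.na(birth_year) to the condition.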
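A short sketch of the points above. The data frame people and its values are made up for illustration; only the column name birth_year comes from the text, and subset() is used here as a stand-in for whatever filtering function the original question used.

```r
# == propagates the missing value; identical() always gives a single TRUE or FALSE.
eq   <- NA == NA            # NA
same <- identical(NA, NA)   # TRUE

# isTRUE() turns a possibly-NA comparison into a safe condition for if().
safe <- isTRUE(NA == NA)    # FALSE

# Hypothetical data, for illustration only.
people <- data.frame(name = c("Ann", "Bob", "Cy"),
                     birth_year = c(1980, NA, 1995))

# Rows where the comparison evaluates to NA are dropped by subset() ...
known_only <- subset(people, birth_year > 1990)

# ... unless the missing ones are explicitly let through.
with_missing <- subset(people, birth_year > 1990 | is.na(birth_year))
```

Here known_only keeps only Cy, while with_missing keeps Bob (missing birth year) and Cy.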

NA matches NA, but is not equal to NA. Why?

It's a matter of convention. There are good reasons for the way == works. NA is a special value in R that is supposed to represent data that is missing and should be treated differently from the rest of data. There are innumerable very subtle bugs that could come up if we started comparing missing values as if they were known or as if two missing values were equal to each other.

Think of NA as meaning "I don't know what's there". The correct answer to 3 > NA is obviously NA because we don't know if the missing value is larger than 3 or not. Well, it's the same for NA == NA. They are both missing values but the true values could be quite different, so the correct answer is "I don't know."

R doesn't know what you are doing in your analysis, so instead of potentially introducing bugs that would later end up being published and embarrassing you, it doesn't allow comparison operators to think NA is a value.

match() was written with a more specific purpose in mind: finding the indexes of matching values. If you ask the question "Should I match 3 with NA", a reasonable answer is "no." Different (and very useful) convention, and justified because R pretty much knows what you are trying to do when you invoke match(). Now, should we match NA with NA for this purpose? It could be argued.

Come to think of it, I suppose it is a little odd that the authors of match() chose to allow NA to match itself by default. You can imagine cases where you might use match() to find NA entries in table along with other values, but it's dangerous. You have to be careful to know whether x contains any NA values, and to let them match only if you really want that. You can change this behavior by specifying incomparables=NA when calling match().
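The difference between the two conventions can be seen in a small sketch. The vectors x and lookup are invented for illustration.

```r
x      <- c(3, NA)
lookup <- c(NA, 3, 5)

# == refuses to compare against a missing value.
eq <- x == NA                                        # NA NA

# match() lets NA match NA by default ...
pos <- match(x, lookup)                              # 2 1

# ... unless NA is declared incomparable.
pos_strict <- match(x, lookup, incomparables = NA)   # 2 NA

# %in% is built on match(), so it also treats NA as matching NA.
found <- x %in% lookup                               # TRUE TRUE
```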

Why is NA | FALSE = NA?

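Essentially, | asks whether at least one side is TRUE. In NA | FALSE, the known side is FALSE, so the answer depends on the missing value and the result is NA. When at least one side is known to be TRUE, the result is TRUE, as in this numerical example: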

1 > 0 | 0 > 2
[1] TRUE

Conversely, & asks whether all sides are TRUE:

TRUE & FALSE
[1] FALSE

As with the numerical example:

1 > 0 & 0 > 2
[1] FALSE
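The same rule covers every combination with NA: the result is NA only when the known side cannot settle the answer by itself. A minimal sketch (the variable names are only for illustration):

```r
or_true   <- NA | TRUE    # TRUE:  one side is TRUE, the other no longer matters
or_false  <- NA | FALSE   # NA:    depends on the missing value
and_false <- NA & FALSE   # FALSE: one side is FALSE, the other no longer matters
and_true  <- NA & TRUE    # NA:    depends on the missing value
```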

What is the difference between <NA> and NA?

When you are dealing with factors, an NA wrapped in angle brackets ( <NA> ) indicates that it is in fact a missing value.

When it is NA without brackets, it is not a missing value, but rather a proper factor level whose label is "NA".

# Note a 'real' NA and a string with the word "NA"
x <- factor(c("hello", NA, "world", "NA"))

x
[1] hello <NA> world NA
Levels: hello NA world <~~ The string appears as a level, the actual NA does not.

as.numeric(x)
[1] 1 NA 3 2 <~~ The string has a numeric value (here, 2, alphabetically)
The NA's numeric value is just NA

Edit to answer @Arun's question:

R is simply trying to distinguish between a string whose value is the two letters "NA" and an actual missing value, NA.
Thus the difference you see when displaying df versus df$y. Example:

df <- data.frame(x=1:4, y=c("a", NA_character_, "c", "NA"), stringsAsFactors=FALSE)

Note the two different styles of NA:

> df
  x    y
1 1    a
2 2 <NA>
3 3    c
4 4   NA

However, if we look at just 'df$y'

[1] "a"  NA   "c"  "NA"

But, if we remove the quotation marks (similar to what we see when printing a data.frame to the console):

print(df$y, quote=FALSE)
[1] a <NA> c NA

And thus, we once again have the distinction of NA via the angled brackets.
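Rather than relying on how the values are printed, the two cases can be told apart programmatically. A small sketch, reusing the factor x from the example above:

```r
x <- factor(c("hello", NA, "world", "NA"))

missing_flags <- is.na(x)                  # FALSE TRUE FALSE FALSE
string_flags  <- x == "NA"                 # FALSE NA FALSE TRUE

# The string "NA" is a level; the real missing value is not.
has_na_level  <- "NA" %in% levels(x)       # TRUE
n_levels      <- nlevels(x)                # 3
```

Only the second element is a real missing value; the fourth is an ordinary level that happens to be spelled "NA".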

NA == 1 check returns NA

The NA simply means that a value is missing/unknown. Therefore NA == 1 yields NA. The outcome of the comparison with == is unknown, since we don't know if the missing value is 1 or something else.

The same reasoning can be applied to the other tests, which is why they all return NA.


As pointed out by @akrun in a comment, the proper way to check whether a value x is missing is to use the function is.na(x). A comparison of the type x == NA would not give the desired result.
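A quick sketch of the difference, using a made-up vector x:

```r
x <- c(1, NA, 3)

# Comparing with NA yields NA for every element, missing or not.
wrong <- x == NA             # NA NA NA

# is.na() answers the question that was actually meant.
right <- is.na(x)            # FALSE TRUE FALSE

n_missing   <- sum(is.na(x)) # 1
any_missing <- anyNA(x)      # TRUE
```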

Dealing with TRUE, FALSE, NA and NaN

To answer your questions in order:

1) The == operator does indeed not treat NAs as you might expect. A very useful function is this compareNA function from the Cookbook for R (cookbook-r.com):

compareNA <- function(v1, v2) {
  # This function returns TRUE wherever elements are the same, including NA's,
  # and FALSE everywhere else.
  same <- (v1 == v2) | (is.na(v1) & is.na(v2))
  same[is.na(same)] <- FALSE
  return(same)
}

2) NA stands for "Not Available", and is not the same as NaN ("Not a Number"). NA is generally used as a placeholder for missing data of any type; NaNs are normally generated by a numerical issue (taking the log of -1 or similar).

3) I'm not really sure what you mean by "logical things"--many different data types, including numeric vectors, can be used as input to logical operators. You might want to try reading the R logical operators page: http://stat.ethz.ch/R-manual/R-patched/library/base/html/Logic.html.
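To illustrate points 1) and 2), here is a small sketch. compareNA is repeated from above so the snippet runs by itself; the vectors a, b and vals are made up for illustration.

```r
compareNA <- function(v1, v2) {
  same <- (v1 == v2) | (is.na(v1) & is.na(v2))
  same[is.na(same)] <- FALSE
  same
}

a <- c(1, NA, 3, NA)
b <- c(1, NA, 4, 2)

plain <- a == b             # TRUE NA FALSE NA
cmp   <- compareNA(a, b)    # TRUE TRUE FALSE FALSE

# NA versus NaN: is.na() is TRUE for both, is.nan() only for NaN.
vals     <- c(NA, NaN, 0 / 0)
na_flag  <- is.na(vals)     # TRUE TRUE TRUE
nan_flag <- is.nan(vals)    # FALSE TRUE TRUE
```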

Hope this helps!

NA values are not recognized properly using dplyr

Welcome to SO! Use this to convert the "NA" strings into real NA values and then drop those rows:

library(dplyr)

data <- data %>%
  mutate(ID = ifelse(ID == "NA", NA, ID)) %>%  # turn the string "NA" into a real NA
  filter(!is.na(ID))                           # then drop the rows where ID is missing
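A self-contained variant of the same idea, as a sketch. It assumes ID is a character column (not a factor) and uses dplyr's na_if() in place of ifelse(). The example data frame and the name cleaned are made up for illustration.

```r
library(dplyr)

# Hypothetical data: the ID column stores missing values as the string "NA".
data <- data.frame(ID = c("a1", "NA", "b2"),
                   value = c(10, 20, 30),
                   stringsAsFactors = FALSE)

cleaned <- data %>%
  mutate(ID = na_if(ID, "NA")) %>%   # "NA" strings become real NA values
  filter(!is.na(ID))                 # drop the rows that are now missing
```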

R changes my list of character strings with na into the words as missing values (ex : BDNA3 -- NA) - How to deal with this?

To set your NA values, you should use the code df[df == "NA"] <- NA. I used this with your test dataset and it produced the desired results. You can then use the na.omit() function on your df to remove the rows that are now NA. You didn't post working code, so here is an outline of what your code should look like:

df <- data.frame(lapply(df, as.character), stringsAsFactors = FALSE)
df
   X1       X2
1   1    SCYL3
2   2 C1orf112
3   3      FGR
4   4      CFH
5   5    STPG1
6   6   NIPAL3
7   7      AK2
8   8    KDM1A
9   9    TTC22
10 10     ST7L
11 11  DNAJC11
12 12     FMO3
13 13     E2F2
14 14   CDK11A
15 15     NADK
16 16    CSDE1
17 17    MASP2
df[df == "NA"] <- NA

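For this test data, is.na(df) returns FALSE for every cell, because names such as DNAJC11 and NADK only contain the letters NA and are not equal to the string "NA". If your data does include entries that are exactly "NA", they are now real missing values, and you can drop those rows with na.omit(df).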
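A minimal sketch of this behavior on a made-up data frame (genes and its contents are hypothetical, chosen so that one entry is exactly "NA" and two others merely contain those letters):

```r
genes <- data.frame(id   = c("1", "2", "3"),
                    gene = c("DNAJC11", "NA", "NADK"),
                    stringsAsFactors = FALSE)

# Only cells equal to the whole string "NA" are replaced.
genes[genes == "NA"] <- NA

missing_flags <- is.na(genes$gene)   # FALSE TRUE FALSE

# Drop the rows that now contain a real missing value.
complete <- na.omit(genes)
```

DNAJC11 and NADK survive untouched; only the second row is removed.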

Logical Indexing with NA in R - How to set to FALSE or exclude rather than return NA?

Whenever you ask whether a Not Available (NA) value is equal to a number or to anything else, you get the only possible answer: Not Available (NA).

NA might be equal to 6, or to John the Baptist, or to ⛄, or to any other object. It is simply impossible to say whether it is, since the value is not available.

To get the answer you want, you can use na.omit() or na.exclude() on the results. Or you can apply yet another logical condition during subsetting:

with(df, B[A == 6 & !is.na(A)])
# [1] 60
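Two other common ways to get the same result are which() and %in%, both of which treat an NA comparison as "no match". A sketch on a made-up data frame (the original df is not shown, so the values here are assumptions chosen to reproduce the 60 above):

```r
# Hypothetical data, for illustration only.
df <- data.frame(A = c(6, NA, 5), B = c(60, 70, 80))

# An NA in a logical index produces an NA in the result.
raw <- with(df, B[A == 6])                  # 60 NA

# Three ways to exclude the NA positions instead.
explicit <- with(df, B[A == 6 & !is.na(A)]) # 60
by_which <- with(df, B[which(A == 6)])      # 60
by_in    <- with(df, B[A %in% 6])           # 60
```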

