NA Matches NA, But Is Not Equal to NA. Why?

NA matches NA, but is not equal to NA. Why?

It's a matter of convention. There are good reasons for the way == works. NA is a special value in R that represents missing data and should be treated differently from the rest of the data. Innumerable subtle bugs could come up if we started comparing missing values as if they were known, or as if two missing values were equal to each other.

Think of NA as meaning "I don't know what's there". The correct answer to 3 > NA is obviously NA because we don't know if the missing value is larger than 3 or not. Well, it's the same for NA == NA. They are both missing values but the true values could be quite different, so the correct answer is "I don't know."
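A small illustration of the point above, using only base R. The vector x is made up for the example; the results in the comments follow from the rule that any comparison with an unknown value is itself unknown.

```r
# Comparisons involving NA propagate the "I don't know" status.
gt <- 3 > NA      # NA
eq <- NA == NA    # NA

# To ask "is this value missing?", use is.na() instead of ==.
x <- c(1, NA, 3)          # hypothetical example vector
cmp <- x == NA            # NA NA NA
missing <- is.na(x)       # FALSE TRUE FALSE
```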

R doesn't know what you are doing in your analysis, so instead of potentially introducing bugs that would later end up being published and embarrassing you, it doesn't allow comparison operators to think NA is a value.

match() was written with a more specific purpose in mind: finding the indexes of matching values. If you ask the question "Should I match 3 with NA", a reasonable answer is "no." Different (and very useful) convention, and justified because R pretty much knows what you are trying to do when you invoke match(). Now, should we match NA with NA for this purpose? It could be argued.

Come to think of it, I suppose it is a little odd that the authors of match() chose to allow NA to match itself by default. You can imagine cases where you might use match() to find NA rows in a table along with other values, but it's dangerous. You have to be careful about knowing whether x contains any NA values, and permit them to match only if you really want that. You can change this behavior by specifying incomparables=NA when calling match().
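As a quick sketch of the difference, here is match() with and without incomparables = NA. The vectors x and tbl are invented for the example.

```r
x   <- c(3, NA)      # values to look up (hypothetical)
tbl <- c(NA, 3, 5)   # table to look them up in (hypothetical)

# Default: NA in x matches the NA in tbl.
default_match <- match(x, tbl)                       # 2 1

# With incomparables = NA, NA is never matched and gets nomatch (NA).
strict_match <- match(x, tbl, incomparables = NA)    # 2 NA
```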

Replace NA when last and next non-NA values are equal

Perform an na.locf0 both forwards and backwards; if the two results are the same, use the common value, otherwise use NA. The grouping is done with ave.

library(zoo)

filler <- function(x) {
  forward <- na.locf0(x)
  backward <- na.locf0(x, fromLast = TRUE)
  ifelse(forward == backward, forward, NA)
}
transform(dat, message = ave(message, id, FUN = filler))

giving:

   id message index
1   1    <NA>     1
2   1     foo     2
3   1     foo     3
4   1     foo     4
5   1     foo     5
6   1    <NA>     6
7   2    <NA>     1
8   2     baz     2
9   2     baz     3
10  2     baz     4
11  2     baz     5
12  2     baz     6
13  3     bar     1
14  3     bar     2
15  3     bar     3
16  3     bar     4
17  3     bar     5
18  3     bar     6
19  3    <NA>     7
20  3     qux     8
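To see the forward/backward idea on a single vector without any packages, here is a rough base-R sketch. locf0 and filler_base are hypothetical helpers written for this illustration; they are a stand-in for zoo::na.locf0, not its actual implementation, and the input vector is made up.

```r
# Hypothetical stand-in for zoo::na.locf0: carry the last non-NA value
# forward (or backward when fromLast = TRUE), keeping leading NAs.
locf0 <- function(x, fromLast = FALSE) {
  if (fromLast) return(rev(locf0(rev(x))))
  idx <- cumsum(!is.na(x))        # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[idx + 1]    # position 1 is NA for "none seen yet"
}

# Same logic as filler above: keep a filled value only when both
# directions agree.
filler_base <- function(x) {
  forward  <- locf0(x)
  backward <- locf0(x, fromLast = TRUE)
  ifelse(forward == backward, forward, NA)
}

msg <- c(NA, "foo", NA, "foo", NA, "bar")   # hypothetical input
filled <- filler_base(msg)
# NA "foo" "foo" "foo" NA "bar"
```

The gap between the two "foo" values is filled because both directions agree; the gap between "foo" and "bar" stays NA, and so does the leading NA, since nothing precedes it.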

Why do I get NA when indexing a vector (or data frame) with a condition that nothing matches?

Your assumption is roughly correct: you get NA values when there are NAs in the data.

The comparison yields NA values

iris_test$Sepal.Length < 0
#[1]    NA FALSE FALSE FALSE.....

When you subset a vector with an NA index, the result is NA at that position. See for example,

iris$Sepal.Length[c(1, NA)]
#[1] 5.1  NA

This is what the second case returns. In the first case, all the values are FALSE, so you get numeric(0)

iris$Sepal.Length[FALSE]
#numeric(0)
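The same behaviour can be reproduced on a tiny vector. The values below are made up; the point is that an NA in the logical index yields an NA in the result, and that adding !is.na() to the condition avoids it.

```r
x <- c(NA, 5.1, 4.9)   # hypothetical data with one missing value

idx <- x < 0           # NA FALSE FALSE
leaky <- x[idx]        # NA: the NA index produces an NA element

# Excluding missing values in the condition gives an empty result.
clean <- x[!is.na(x) & x < 0]   # numeric(0)
```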

NA values are not recognized properly using dplyr

Welcome to SO! Convert the string "NA" to a real NA with mutate, then filter out the NA rows:

library(dplyr)

data <- data %>% 
  mutate(ID = ifelse(ID == "NA", NA, ID)) %>%
  filter(!is.na(ID))
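For comparison, the same two steps can be sketched in base R. The data frame below is invented for the example and assumes ID is a character column, not a factor.

```r
# Hypothetical data where a missing ID was read in as the string "NA".
data <- data.frame(ID = c("a", "NA", "b"), value = 1:3,
                   stringsAsFactors = FALSE)

# is.na() does not recognise the string "NA" as missing.
n_missing_before <- sum(is.na(data$ID))   # 0

data$ID[data$ID == "NA"] <- NA            # turn the string into a real NA
data <- data[!is.na(data$ID), ]           # drop the rows with missing ID
```

If the data comes from a file, it may be simpler to handle this at import time, for instance with the na.strings argument of read.csv.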

Dplyr join: NA match to any

Here's a tidyverse solution :

tbl2 %>%
  split(seq(nrow(.))) %>%               # split into one row data frames
  map_dfr(~modify_if(.,is.na,~NULL) %>% # remove na columns
        inner_join(tbl1,.))             # inner join to table1

# # A tibble: 2 x 4
#    subj   run session outcomedata
#   <dbl> <dbl>   <dbl> <list>     
# 1     1     1       1 <list [2]> 
# 2     1     1       1 <list [1]>

I use inner_join(tbl1,.) instead of inner_join(tbl1) to preserve column order.

And a base R translation :

df_list <- split(tbl2,seq(nrow(tbl2)))
df_list <- lapply(df_list,function(dfi){
  merge(tbl1, dfi[!sapply(dfi,is.na)])
})
do.call(rbind,df_list)
#   subj run session outcomedata
# 1    1   1       1     155, 80
# 2    1   1       1          30

Bonus

Two 100% tidyverse approaches that use group_by instead of split: one with do, one with nest and map. Note that do is being soft-deprecated, but here it offers a more compact and readable syntax:

tbl2 %>%
  group_by(n=seq(n())) %>%
  do(modify_if(.,is.na,~NULL) %>% # remove na columns
            inner_join(tbl1,.)) %>%
  ungroup %>%
  select(-n)

tbl2 %>%
  rowid_to_column("n") %>%
  group_by(n) %>%
  nest(.key="dfi") %>%
  mutate_at("dfi",~map(.,
                       ~ modify_if(.,is.na,~NULL) %>% # remove na columns
                         inner_join(tbl1,.))) %>%
  unnest %>%
  select(-n)

Don't Select For NA Values

Do you need which?

> x <- c(1, 2, 1, NA)
> x[which(x==1)]
[1] 1 1

To explain, which(x==1) gives you the positions in your vector x that match the test x==1; positions where the comparison evaluates to NA are dropped. You use this result to subset x, giving the output.

> which(x==1)
[1] 1 3
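For context, here is what happens without which, plus two other base-R spellings that also skip the NA. This is only a sketch on the same x as above.

```r
x <- c(1, 2, 1, NA)

plain  <- x[x == 1]          # 1 1 NA: the NA comparison leaks through
via_in <- x[x %in% 1]        # 1 1: %in% returns FALSE for the NA
via_ss <- subset(x, x == 1)  # 1 1: subset() drops NA conditions
```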

Replace NA values if last and next non-NA value are the same

You can fill forwards and backwards, then set the rows where they don't match to NA.

library(zoo)
library(dplyr)

df %>% 
  mutate_if(is.factor, as.character) %>% 
  group_by(ID) %>%
  mutate(result = na.locf(with_missing, fromLast = TRUE, na.rm = FALSE),
         result = ifelse(result == na.locf(with_missing, na.rm = FALSE), result, NA))

#    ID with_missing desired_result result
# 1   1            a              a      a
# 2   1            a              a      a
# 3   1         <NA>              a      a
# 4   1         <NA>              a      a
# 5   1            a              a      a
# 6   1            a              a      a
# 7   2            a              a      a
# 8   2            a              a      a
# 9   2         <NA>           <NA>   <NA>
# 10  2            b              b      b
# 11  2            b              b      b
# 12  2            b              b      b
# 13  3            a              a      a
# 14  3         <NA>           <NA>   <NA>
# 15  3         <NA>           <NA>   <NA>
# 16  3         <NA>           <NA>   <NA>
# 17  3            c              c      c
# 18  3            c              c      c
# 19  4            b              b      b
# 20  4         <NA>           <NA>   <NA>
# 21  4            a              a      a
# 22  4            a              a      a
# 23  4            a              a      a
# 24  4            a              a      a
# 25  5            a              a      a
# 26  5         <NA>              a      a
# 27  5         <NA>              a      a
# 28  5         <NA>              a      a
# 29  5         <NA>              a      a
# 30  5            a              a      a
# 31  6            a              a      a
# 32  6            a              b      a
# 33  6         <NA>              b   <NA>
# 34  6            b              b      b
# 35  6            a              a      a
# 36  6            a              a      a
# 37  7            a              a      a
# 38  7            a              a      a
# 39  7         <NA>              a      a
# 40  7         <NA>              a      a
# 41  7            a              a      a
# 42  7            a              a      a
# 43  8            a              a      a
# 44  8            a              a      a
# 45  8         <NA>           <NA>   <NA>
# 46  8            b              b      b
# 47  8            b              b      b
# 48  8            b              b      b
# 49  9            a              a      a
# 50  9         <NA>           <NA>   <NA>
# 51  9         <NA>           <NA>   <NA>
# 52  9         <NA>           <NA>   <NA>
# 53  9            c              c      c
# 54  9            c              c      c
# 55 10            b              b      b
# 56 10         <NA>           <NA>   <NA>
# 57 10            a              a      a
# 58 10            a              a      a
# 59 10            a              a      a
# 60 10            a              a      a

Why do is.na and != filter out the info differently?

TYN != 'Yes' is not equivalent to is.na(TYN).

In both cases, the second condition is the one we are checking.

In the first case, TYN != 'Yes' returns all NAs

df12$TYN != 'Yes'
#[1] NA NA NA NA

Hence, the code goes on to check the third condition, which is Test %in% c("Fail", "NA")

df12$Test %in% c("Fail", "NA")
#[1] FALSE  TRUE  TRUE  TRUE

Hence, you got 'Fail' as output in first case.

In the second case, is.na works:

df12$Test %in% "NA" & is.na(df12$TYN)
#[1] FALSE  TRUE  TRUE  TRUE

Hence, in this case you get the output from the second condition, which is "NA".
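The difference between the two tests is easy to see on a small vector. The values below are made up and are not the df12 data from the question; they just show how each test treats a missing TYN.

```r
TYN <- c(NA, NA, "Yes", "No")   # hypothetical values

not_yes <- TYN != "Yes"         # NA NA FALSE TRUE
is_miss <- is.na(TYN)           # TRUE TRUE FALSE FALSE

# To treat "missing" as "not Yes", combine both tests explicitly.
not_yes_or_missing <- is.na(TYN) | TYN != "Yes"   # TRUE TRUE FALSE TRUE
```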

Why do conditions with %in% ignore missing values?

%in% never returns NA: for a missing value, it checks whether NA itself is in the list. Consider these two scenarios

NA %in% 1:3
# [1] FALSE
NA %in% c(1:3, NA)
# [1] TRUE

This allows you to check whether NA is in the vector or not.

If you want to preserve NA values, you could write your own alternative

`%nain%` <- function(val, list) {
  ifelse(is.na(val), NA, val %in% list)
}

And then you can use

dt$is_warm3 <- dt$colour %nain% c("red", "orange")
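A short side-by-side on a made-up vector shows what the custom operator changes. The operator is redefined here so the snippet runs on its own; colour stands in for the dt$colour column.

```r
`%nain%` <- function(val, list) {
  ifelse(is.na(val), NA, val %in% list)
}

colour <- c("red", "blue", NA)   # hypothetical stand-in for dt$colour

plain_in <- colour %in% c("red", "orange")     # TRUE FALSE FALSE
keep_na  <- colour %nain% c("red", "orange")   # TRUE FALSE NA
```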
