NA matches NA, but is not equal to NA. Why?
It's a matter of convention, and there are good reasons for the way == works. NA is a special value in R that represents missing data and should be treated differently from the rest of the data. Innumerable subtle bugs could arise if we compared missing values as if they were known, or as if two missing values were equal to each other.
Think of NA as meaning "I don't know what's there". The correct answer to 3 > NA is obviously NA, because we don't know whether the missing value is larger than 3 or not. It's the same for NA == NA. Both are missing values, but the true values could be quite different, so the correct answer is "I don't know."
R doesn't know what you are doing in your analysis, so instead of potentially introducing bugs that would later end up being published and embarrassing you, it doesn't allow comparison operators to treat NA as a value.
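A quick way to see this convention in action is to run a few comparisons at the console. The vector x below is made up purely for illustration.

```r
# Comparisons involving NA propagate the "unknown" status.
x <- c(1, NA, 3)   # hypothetical data with one missing value

3 > NA             # NA: we cannot know whether the missing value exceeds 3
NA == NA           # NA: two unknowns are not known to be equal
x == NA            # NA NA NA, so this never locates the missing entries
is.na(x)           # FALSE TRUE FALSE, the supported test for missingness
```

In practice this means is.na() is the tool for finding missing values, not ==.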
match() was written with a more specific purpose in mind: finding the indexes of matching values. If you ask "Should I match 3 with NA?", a reasonable answer is "no." This is a different (and very useful) convention, justified because R pretty much knows what you are trying to do when you invoke match(). Now, should NA match NA for this purpose? It could be argued either way.
Come to think of it, I suppose it is a little odd that the authors of match() chose to allow NA to match itself by default. You can imagine cases where you might use match() to find NA rows in table along with other values, but it's dangerous. You just have to be careful to know whether x contains any NA values, and permit them to match only if you really want that. You can change this behavior by specifying incomparables=NA when calling match().
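As a small illustration of the default and of incomparables, here is a sketch with a made-up lookup vector (tbl is a hypothetical name, not from the question):

```r
tbl <- c(10, NA, 30)   # hypothetical lookup table containing a missing value

# Default: NA is allowed to match NA, so the position of the NA is returned.
default_hit <- match(NA, tbl)

# With incomparables = NA, missing values are never matched.
strict_hit <- match(NA, tbl, incomparables = NA)

# Ordinary values are unaffected by incomparables.
mixed <- match(c(30, NA), tbl, incomparables = NA)

default_hit   # 2
strict_hit    # NA
mixed         # 3 NA
```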
Replace NA when last and next non-NA values are equal
Perform an na.locf0 both forwards and backwards; where the two results agree, use the common value, and otherwise use NA. The grouping is done with ave.
library(zoo)
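filler <- function(x) {
  # carry the last non-NA value forward, and the next non-NA value backward
  forward <- na.locf0(x)
  backward <- na.locf0(x, fromLast = TRUE)
  # keep a value only where both fills agree; an NA comparison stays NA
  ifelse(forward == backward, forward, NA)
}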
transform(dat, message = ave(message, id, FUN = filler))
giving:
id message index
1 1 <NA> 1
2 1 foo 2
3 1 foo 3
4 1 foo 4
5 1 foo 5
6 1 <NA> 6
7 2 <NA> 1
8 2 baz 2
9 2 baz 3
10 2 baz 4
11 2 baz 5
12 2 baz 6
13 3 bar 1
14 3 bar 2
15 3 bar 3
16 3 bar 4
17 3 bar 5
18 3 bar 6
19 3 <NA> 7
20 3 qux 8
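To see why the leading and trailing gaps in the output stay NA, it can help to trace the last step of filler by hand. The sketch below uses a made-up vector and writes out the forward and backward fills manually (as na.locf0 would produce them), so it runs without zoo.

```r
# Hypothetical input: c(NA, "foo", NA, "foo", NA, "bar")
forward  <- c(NA,    "foo", "foo", "foo", "foo", "bar")  # last value carried forward
backward <- c("foo", "foo", "foo", "foo", "bar", "bar")  # next value carried backward

# Where either fill is NA the comparison is NA, and ifelse() returns NA there;
# where the fills disagree the comparison is FALSE, which also gives NA.
filled <- ifelse(forward == backward, forward, NA)
filled   # NA "foo" "foo" "foo" NA "bar"
```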
Why do I get NA when I index a vector (or data frame) with a condition that nothing matches?
Your assumption is essentially correct: you get NA values when there is an NA in the data.
The comparison itself yields NA for that element:
iris_test$Sepal.Length < 0
#[1] NA FALSE FALSE FALSE.....
When you subset a vector with an NA index, the result contains NA at that position. For example:
iris$Sepal.Length[c(1, NA)]
#[1] 5.1 NA
This is what the second case returns. In the first case all the values are FALSE, so you get numeric(0):
iris$Sepal.Length[FALSE]
#numeric(0)
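Both behaviours can be reproduced on a tiny made-up vector, along with two common ways to avoid the stray NA. The vector x here is hypothetical and stands in for a column with one missing value.

```r
x <- c(NA, 5.1, 4.9)            # hypothetical data with one missing value

cond <- x < 0                   # NA FALSE FALSE
with_na <- x[cond]              # NA: the NA in the index yields an NA element

# which() keeps only positions where the condition is TRUE, dropping NA.
via_which <- x[which(cond)]     # numeric(0)

# Alternatively, rule out missing values explicitly in the condition.
via_isna <- x[!is.na(x) & cond] # numeric(0)
```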
NA values are not recognized properly using dplyr
Welcome to SO! Your NAs are stored as the string "NA", so convert them to real missing values and then drop those rows:
data <- data %>%
  mutate(ID = ifelse(ID == "NA", NA, ID)) %>%
  filter(!is.na(ID))
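The underlying issue is that the string "NA" is an ordinary value, not a missing one. A minimal base R sketch of the same idea, using a made-up ID vector, looks like this:

```r
ID <- c("a1", "NA", "b2", NA)      # hypothetical IDs: one string "NA", one real NA

before <- is.na(ID)                # FALSE FALSE FALSE TRUE: the string is not missing

# Turn the literal string into a real missing value; which() skips the NA
# that the comparison produces for the element that is already missing.
ID[which(ID == "NA")] <- NA

after <- is.na(ID)                 # FALSE TRUE FALSE TRUE
kept <- ID[!is.na(ID)]             # "a1" "b2"
```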
Dplyr join: NA match to any
Here's a tidyverse solution:
tbl2 %>%
split(seq(nrow(.))) %>% # split into one row data frames
map_dfr(~modify_if(.,is.na,~NULL) %>% # remove na columns
inner_join(tbl1,.)) # inner join to table1
# # A tibble: 2 x 4
# subj run session outcomedata
# <dbl> <dbl> <dbl> <list>
# 1 1 1 1 <list [2]>
# 2 1 1 1 <list [1]>
I use inner_join(tbl1,.) instead of inner_join(tbl1) to preserve column order.
And a base R translation:
df_list <- split(tbl2,seq(nrow(tbl2)))
df_list <- lapply(df_list,function(dfi){
merge(tbl1, dfi[!sapply(dfi,is.na)])
})
do.call(rbind,df_list)
# subj run session outcomedata
# 1 1 1 1 155, 80
# 2 1 1 1 30
Bonus
Two 100% tidyverse approaches that use group_by instead of split: one with do, and one with nest and map. Note that do is being soft-deprecated, but here it offers a more compact and readable syntax:
tbl2 %>%
group_by(n=seq(n())) %>%
do(modify_if(.,is.na,~NULL) %>% # remove na columns
inner_join(tbl1,.)) %>%
ungroup %>%
select(-n)
tbl2 %>%
rowid_to_column("n") %>%
group_by(n) %>%
nest(.key="dfi") %>%
mutate_at("dfi",~map(.,
~ modify_if(.,is.na,~NULL) %>% # remove na columns
inner_join(tbl1,.))) %>%
unnest %>%
select(-n)
Don't Select For NA Values
Do you need which?
> x <- c(1, 2, 1, NA)
> x[which(x==1)]
[1] 1 1
To explain: which(x==1) gives the positions in your vector x that match the test x==1, skipping any position where the test is NA. You use this result to subset x, giving the output above.
> which(x==1)
[1] 1 3
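For contrast, here is a short sketch of what happens on the same vector without which: the NA produced by the comparison is carried into the result.

```r
x <- c(1, 2, 1, NA)

plain <- x[x == 1]            # 1 1 NA: the NA comparison yields an NA element
clean <- x[which(x == 1)]     # 1 1: which() drops positions where the test is NA
```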
Replace NA values if last and next non-NA value are the same
You can fill forwards and backwards, then set the rows where the two fills don't match to NA.
library(zoo)
library(dplyr)
df %>%
mutate_if(is.factor, as.character) %>%
group_by(ID) %>%
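  # na.rm = FALSE keeps leading/trailing NAs so the result has the same length as the group
  mutate(result = na.locf(with_missing, fromLast = TRUE, na.rm = FALSE),
         result = ifelse(result == na.locf(with_missing, na.rm = FALSE), result, NA))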
# ID with_missing desired_result result
# 1 1 a a a
# 2 1 a a a
# 3 1 <NA> a a
# 4 1 <NA> a a
# 5 1 a a a
# 6 1 a a a
# 7 2 a a a
# 8 2 a a a
# 9 2 <NA> <NA> <NA>
# 10 2 b b b
# 11 2 b b b
# 12 2 b b b
# 13 3 a a a
# 14 3 <NA> <NA> <NA>
# 15 3 <NA> <NA> <NA>
# 16 3 <NA> <NA> <NA>
# 17 3 c c c
# 18 3 c c c
# 19 4 b b b
# 20 4 <NA> <NA> <NA>
# 21 4 a a a
# 22 4 a a a
# 23 4 a a a
# 24 4 a a a
# 25 5 a a a
# 26 5 <NA> a a
# 27 5 <NA> a a
# 28 5 <NA> a a
# 29 5 <NA> a a
# 30 5 a a a
# 31 6 a a a
# 32 6 a b a
# 33 6 <NA> b <NA>
# 34 6 b b b
# 35 6 a a a
# 36 6 a a a
# 37 7 a a a
# 38 7 a a a
# 39 7 <NA> a a
# 40 7 <NA> a a
# 41 7 a a a
# 42 7 a a a
# 43 8 a a a
# 44 8 a a a
# 45 8 <NA> <NA> <NA>
# 46 8 b b b
# 47 8 b b b
# 48 8 b b b
# 49 9 a a a
# 50 9 <NA> <NA> <NA>
# 51 9 <NA> <NA> <NA>
# 52 9 <NA> <NA> <NA>
# 53 9 c c c
# 54 9 c c c
# 55 10 b b b
# 56 10 <NA> <NA> <NA>
# 57 10 a a a
# 58 10 a a a
# 59 10 a a a
# 60 10 a a a
Why do is.na and != filter out the info differently?
TYN != 'Yes' is not equivalent to is.na(TYN).
In both cases, the second condition is the one to check.
In the first case, TYN != 'Yes' returns NA for every row:
df12$TYN != 'Yes'
#[1] NA NA NA NA
Hence the code goes on to check the third condition, which is Test %in% c("Fail", "NA"):
df12$Test %in% c("Fail", "NA")
#[1] FALSE TRUE TRUE TRUE
Hence you got 'Fail' as output in the first case.
In the second case, is.na works:
df12$Test %in% "NA" & is.na(df12$TYN)
#[1] FALSE TRUE TRUE TRUE
Hence, in this case you get the output from the second condition, which is "NA".
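The difference can be reproduced without the original data frame. The sketch below uses made-up TYN and Test vectors chosen to be consistent with the outputs shown above; they are assumptions, not the asker's actual df12.

```r
TYN  <- rep(NA_character_, 4)          # hypothetical: every TYN is missing
Test <- c("Pass", "NA", "NA", "NA")    # hypothetical: the string "NA", not a missing value

# Comparing a missing value with != gives NA, and TRUE & NA is NA,
# so this condition is never TRUE.
cond_neq <- Test %in% "NA" & TYN != "Yes"    # FALSE NA NA NA

# is.na() returns TRUE for missing values, so this condition can succeed.
cond_isna <- Test %in% "NA" & is.na(TYN)     # FALSE TRUE TRUE TRUE
```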
Why do conditions with %in% ignore missing values?
%in% checks whether NA is in the list. Consider these two scenarios:
NA %in% 1:3
# [1] FALSE
NA %in% c(1:3, NA)
# [1] TRUE
This allows you to check whether NA is in the vector or not.
If you want to preserve NA values, you could write your own alternative:
`%nain%` <- function(val, list) {
  ifelse(is.na(val), NA, val %in% list)
}
And then you can use
dt$is_warm3 <- dt$colour %nain% c("red", "orange")
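To compare the two operators side by side, here is a self-contained sketch on a made-up colour vector (the data is hypothetical; the operator is repeated so the block runs on its own):

```r
# NA-preserving variant of %in%, as defined above
`%nain%` <- function(val, list) {
  ifelse(is.na(val), NA, val %in% list)
}

colour <- c("red", "blue", NA, "orange")   # hypothetical data

plain_in <- colour %in% c("red", "orange")      # TRUE FALSE FALSE TRUE
na_aware <- colour %nain% c("red", "orange")    # TRUE FALSE NA TRUE
```

Which one is appropriate depends on whether a missing colour should count as "not warm" or as "unknown".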