An Na in Subsetting a Data.Frame Does Something Unexpected

Unexpected row(s) of NAs when selecting subset of dataframe

That is because there is a missing observation

> sum(is.na(example.df$census_tract))
[1] 1
> example.df[which(is.na(example.df$census_tract)), ]
census_tract number_households_est
64 NA NA

When == evaluates the 64th row it gives NA because by default we can't know wheter 27702 is equal to the missing value. Therefore the result is missing (aka NA). So a NA is putted in the logical vector used for indexing purposes. And this gives, by default, a full-of-NA row, because we are asking for a row but "we don't know which one".

The proper way is

> example.df[example.df$census_tract %in% 27702, ]
census_tract number_households_est
23611 27702 2864

HTH, Luca

subsetting !is.na for multiple conditions unexpected results

I don't know why the initial approach didn't work, but I guess there is some fault in the chaining that I can not see. Taking the opposite approach (removing those that fulfills the condition) seems to produce the desired output.

tmp <- data.frame(state = c(1,  1, 2,  2, 3, 3, 4, 5),
reg = c(NA, 3, 6, NA, 9, 1, NA, 7),
gas = c(NA, 5, NA, 9, 1, 3, NA, 1),
other = c(1, 2, 4, 2, 6, 8, 1, 1) )

res = tmp[-which(is.na(tmp$reg) & is.na(tmp$gas)),]

res
#> state reg gas other
#> 2 1 3 5 2
#> 3 2 6 NA 4
#> 4 2 NA 9 2
#> 5 3 9 1 6
#> 6 3 1 3 8
#> 8 5 7 1 1

Created on 2020-12-24 by the reprex package (v0.3.0)

unexpected error when subsetting data.frame by range of date in R?

I can see couple of problem in OP's code.

Prob#1: The default format expected by as.Date is "%Y-%m-%d" or "%Y/%m/%d". But the formats of the characters (Begin, End columns) used in code is %d.%m.%Y or %s-%m-%Y. Hence default format in function as.Date() will not work. The format argument should specifically provided to as.Date() function.

The correct code to create DATE1 and DATE2 should be:

DATE1 <- as.Date("01-01-1981", format = "%d-%m-%Y")
DATE2 <- as.Date("31-12-2014", , format = "%d-%m-%Y")

Prob#2: The Begin and End columns of dataframe should be changed to as.Date format as well before attempting filter operations.

The format of those 2 columns can be changed as:

joinedData$Begin = as.Date(joinedData$Begin, format = "%d.%m.%Y")
joinedData$End= as.Date(joinedData$End, format = "%d.%m.%Y")

Now, the OP'2 initial approach should work.

Note: Personally I prefer using as.POSIXlt over as.Date

Reason for unexpected output in subsetting data frame - R

Working through an example shows where it is going wrong:

a <- data.frame(VAL=c(1,1,1,23,24))
a
# VAL
#1 1
#2 1
#3 1
#4 23
#5 24

These work:

a$VAL %in% c(23,24)
#[1] FALSE FALSE FALSE TRUE TRUE
a$VAL==23 | a$VAL==24
#[1] FALSE FALSE FALSE TRUE TRUE

The following doesn't work due to vector recycling when comparing - take note of the warning message below E.g.:

a$VAL ==c(23,24)
#[1] FALSE FALSE FALSE FALSE FALSE
#Warning message:
#In a$VAL == c(23, 24) :
# longer object length is not a multiple of shorter object length

This last bit of code recycles what you are testing against and is basically comparing:

c( 1,  1,  1, 23, 24) #to
c(23, 24, 23, 24, 23)

...so you don't get any rows returned. Changing the order will give you

c( 1,  1,  1, 23, 24) #to
c(24, 23, 24, 23, 24)

...and you will get two rows returned (which gives the intended result by pure luck, but it is not appropriate to use).

NA values in indexing data frame

Well, if you want to omit the NA rows, one quick and hackish solution is to wrap it in which:

> with(df, df[a < 30 & b == "moon" & c == "a",])
a b c
NA NA <NA> <NA>
10 10 moon a
12 12 moon a
NA.1 NA <NA> <NA>
NA.2 NA <NA> <NA>
29 29 moon a
> with(df, df[which(a < 30 & b == "moon" & c == "a"),])
a b c
10 10 moon a
12 12 moon a
29 29 moon a

On edit: another option in circumstances like this, which might be frowned upon by some, but which I personally find very useful, is to define a local variable within the brackets:

> with(df, df[{i<-a < 30 & b == "moon" & c == "a"; i | is.na(i)},])
a b c
6 6 moon <NA>
10 10 moon a
12 12 moon a
15 15 moon <NA>
18 18 moon <NA>
29 29 moon a
> with(df, df[{i<-a < 30 & b == "moon" & c == "a"; i & !is.na(i)},])
a b c
10 10 moon a
12 12 moon a
29 29 moon a

This is more concise than either writing a special function or defining the indices on a separate line, and is applicable in many situations where there isn't an R function that does exactly what you want.

Why does is.na() change its argument?

The actual function being used here is not is.na() but the assignment function `is.na<-`, for which the default method is `is.na<-.default`. Printing that function to console we see:

function (x, value) 
{
x[value] <- NA
x
}

So clearly, value is supposed to be an index here. If you index a data.frame like df["0"], it will try to select the column named "0". If you assign something to df["0"], the column will be created and filled with (in this case) NA.

To clarify, `is.na<-` sets values to NA, it does not replace NA values with something else.

How to subset a data.frame if the column contains NAs

While I'm trying to understand the behaviour, the right/better way to do character filter in R is to use %in% operator.

foo <- data.frame(bar = c('a',NA,'b','a'),
baz = 1:4,
stringsAsFactors = FALSE)

foo[!(foo$bar %in% 'a'), ]

Output:

> foo[!(foo$bar %in% 'a'), ]
bar baz
2 <NA> 2
3 b 3

Update:

The behaviour isn't because of character filter. It's actually because NA is used to index the dataframe.

> foo[c(F,NA,T,F),]
bar baz
NA <NA> NA
3 b 3

Passing NA as index value replaces any value in that position with just NA

> foo[NA,]
bar baz
NA <NA> NA
NA.1 <NA> NA
NA.2 <NA> NA
NA.3 <NA> NA
> foo[c(T,NA),]
bar baz
1 a 1
NA <NA> NA
3 b 3
NA.1 <NA> NA

Filtering rows in R unexpectedly removes NAs when using subset or dplyr::filter

Your example of the "expected" behavior doesn't actually return what you display in your question. I get:

> df[df$y != 'a',]
x y
NA NA <NA>
3 3 c

This is arguably more wrong than what subset and dplyr::filter return. Remember that in R, NA really is intended to mean "unknown", so df$y != 'a' returns,

> df$y != 'a'
[1] FALSE NA TRUE

So R is being told you definitely don't want the first row, you do want the last row, but whether you want the second row is literally "unknown". As a result, it includes a row of all NAs.

Many people dislike this behavior, but it is what it is.

subset and dplyr::filter make a different default choice which is to simply drop the NA rows, which arguably is accurate-ish.

But really, the lesson here is that if your data has NAs, that just means you need to code defensively around that at all points, either by using conditions like is.na(df$y) | df$y != 'a', or as mentioned in the other answer by using %in% which is based on match.


From base::Extract:

When extracting, a numerical, logical or character NA index picks an unknown element and so returns NA

From ?base::subset:

missing values are taken as false [...] For ordinary vectors, the result is simply x[subset & !is.na(subset)]

From ?dplyr::filter

Unlike base subsetting with [, rows where the condition evaluates to NA are dropped



Related Topics



Leave a reply



Submit