Unexpected row(s) of NAs when selecting subset of dataframe
That is because there is a missing observation
> sum(is.na(example.df$census_tract))
[1] 1
> example.df[which(is.na(example.df$census_tract)), ]
census_tract number_households_est
64 NA NA
When ==
evaluates the 64th row it gives NA
because by default we can't know wheter 27702 is equal to the missing value. Therefore the result is missing (aka NA
). So a NA
is putted in the logical vector used for indexing purposes. And this gives, by default, a full-of-NA
row, because we are asking for a row but "we don't know which one".
The proper way is
> example.df[example.df$census_tract %in% 27702, ]
census_tract number_households_est
23611 27702 2864
HTH, Luca
subsetting !is.na for multiple conditions unexpected results
I don't know why the initial approach didn't work, but I guess there is some fault in the chaining that I can not see. Taking the opposite approach (removing those that fulfills the condition) seems to produce the desired output.
tmp <- data.frame(state = c(1, 1, 2, 2, 3, 3, 4, 5),
reg = c(NA, 3, 6, NA, 9, 1, NA, 7),
gas = c(NA, 5, NA, 9, 1, 3, NA, 1),
other = c(1, 2, 4, 2, 6, 8, 1, 1) )
res = tmp[-which(is.na(tmp$reg) & is.na(tmp$gas)),]
res
#> state reg gas other
#> 2 1 3 5 2
#> 3 2 6 NA 4
#> 4 2 NA 9 2
#> 5 3 9 1 6
#> 6 3 1 3 8
#> 8 5 7 1 1
Created on 2020-12-24 by the reprex package (v0.3.0)
unexpected error when subsetting data.frame by range of date in R?
I can see couple of problem in OP's code.
Prob#1: The default format expected by as.Date
is "%Y-%m-%d" or "%Y/%m/%d"
. But the formats of the characters (Begin
, End
columns) used in code is %d.%m.%Y
or %s-%m-%Y
. Hence default format
in function as.Date()
will not work. The format
argument should specifically provided to as.Date()
function.
The correct code to create DATE1
and DATE2
should be:
DATE1 <- as.Date("01-01-1981", format = "%d-%m-%Y")
DATE2 <- as.Date("31-12-2014", , format = "%d-%m-%Y")
Prob#2: The Begin
and End
columns of dataframe should be changed to as.Date
format as well before attempting filter operations.
The format of those 2 columns can be changed as:
joinedData$Begin = as.Date(joinedData$Begin, format = "%d.%m.%Y")
joinedData$End= as.Date(joinedData$End, format = "%d.%m.%Y")
Now, the OP'2 initial approach should work.
Note: Personally I prefer using as.POSIXlt
over as.Date
Reason for unexpected output in subsetting data frame - R
Working through an example shows where it is going wrong:
a <- data.frame(VAL=c(1,1,1,23,24))
a
# VAL
#1 1
#2 1
#3 1
#4 23
#5 24
These work:
a$VAL %in% c(23,24)
#[1] FALSE FALSE FALSE TRUE TRUE
a$VAL==23 | a$VAL==24
#[1] FALSE FALSE FALSE TRUE TRUE
The following doesn't work due to vector recycling when comparing - take note of the warning message below E.g.:
a$VAL ==c(23,24)
#[1] FALSE FALSE FALSE FALSE FALSE
#Warning message:
#In a$VAL == c(23, 24) :
# longer object length is not a multiple of shorter object length
This last bit of code recycles what you are testing against and is basically comparing:
c( 1, 1, 1, 23, 24) #to
c(23, 24, 23, 24, 23)
...so you don't get any rows returned. Changing the order will give you
c( 1, 1, 1, 23, 24) #to
c(24, 23, 24, 23, 24)
...and you will get two rows returned (which gives the intended result by pure luck, but it is not appropriate to use).
NA values in indexing data frame
Well, if you want to omit the NA
rows, one quick and hackish solution is to wrap it in which
:
> with(df, df[a < 30 & b == "moon" & c == "a",])
a b c
NA NA <NA> <NA>
10 10 moon a
12 12 moon a
NA.1 NA <NA> <NA>
NA.2 NA <NA> <NA>
29 29 moon a
> with(df, df[which(a < 30 & b == "moon" & c == "a"),])
a b c
10 10 moon a
12 12 moon a
29 29 moon a
On edit: another option in circumstances like this, which might be frowned upon by some, but which I personally find very useful, is to define a local variable within the brackets:
> with(df, df[{i<-a < 30 & b == "moon" & c == "a"; i | is.na(i)},])
a b c
6 6 moon <NA>
10 10 moon a
12 12 moon a
15 15 moon <NA>
18 18 moon <NA>
29 29 moon a
> with(df, df[{i<-a < 30 & b == "moon" & c == "a"; i & !is.na(i)},])
a b c
10 10 moon a
12 12 moon a
29 29 moon a
This is more concise than either writing a special function or defining the indices on a separate line, and is applicable in many situations where there isn't an R function that does exactly what you want.
Why does is.na() change its argument?
The actual function being used here is not is.na()
but the assignment function `is.na<-`
, for which the default method is `is.na<-.default`
. Printing that function to console we see:
function (x, value)
{
x[value] <- NA
x
}
So clearly, value
is supposed to be an index here. If you index a data.frame
like df["0"]
, it will try to select the column named "0"
. If you assign something to df["0"]
, the column will be created and filled with (in this case) NA
.
To clarify, `is.na<-`
sets values to NA
, it does not replace NA
values with something else.
How to subset a data.frame if the column contains NAs
While I'm trying to understand the behaviour, the right/better way to do character filter in R is to use %in%
operator.
foo <- data.frame(bar = c('a',NA,'b','a'),
baz = 1:4,
stringsAsFactors = FALSE)
foo[!(foo$bar %in% 'a'), ]
Output:
> foo[!(foo$bar %in% 'a'), ]
bar baz
2 <NA> 2
3 b 3
Update:
The behaviour isn't because of character filter. It's actually because NA
is used to index the dataframe.
> foo[c(F,NA,T,F),]
bar baz
NA <NA> NA
3 b 3
Passing NA
as index value replaces any value in that position with just NA
> foo[NA,]
bar baz
NA <NA> NA
NA.1 <NA> NA
NA.2 <NA> NA
NA.3 <NA> NA
> foo[c(T,NA),]
bar baz
1 a 1
NA <NA> NA
3 b 3
NA.1 <NA> NA
Filtering rows in R unexpectedly removes NAs when using subset or dplyr::filter
Your example of the "expected" behavior doesn't actually return what you display in your question. I get:
> df[df$y != 'a',]
x y
NA NA <NA>
3 3 c
This is arguably more wrong than what subset
and dplyr::filter
return. Remember that in R, NA
really is intended to mean "unknown", so df$y != 'a'
returns,
> df$y != 'a'
[1] FALSE NA TRUE
So R is being told you definitely don't want the first row, you do want the last row, but whether you want the second row is literally "unknown". As a result, it includes a row of all NA
s.
Many people dislike this behavior, but it is what it is.
subset
and dplyr::filter
make a different default choice which is to simply drop the NA
rows, which arguably is accurate-ish.
But really, the lesson here is that if your data has NA
s, that just means you need to code defensively around that at all points, either by using conditions like is.na(df$y) | df$y != 'a'
, or as mentioned in the other answer by using %in%
which is based on match
.
From base::Extract
:
When extracting, a numerical, logical or character
NA
index picks an unknown element and so returnsNA
From ?base::subset
:
missing values are taken as false [...] For ordinary vectors, the result is simply
x[subset & !is.na(subset)]
From ?dplyr::filter
Unlike base subsetting with
[
, rows where the condition evaluates toNA
are dropped
Related Topics
How to Retry a Statement on Error
Row Operations in Data.Table Using 'By = .I'
Relocating Alaska and Hawaii on Thematic Map of the Usa with Ggplot2
What Evaluates to True/False in R
Error Creating R Data.Table with Date-Time Posixlt
Plot Background Colour in Gradient
Reading Psv (Pipe-Separated) File or String
Appending a List to a List of Lists in R
Error: Vector Memory Exhausted (Limit Reached) R 3.5.0 MACos
Optimized Rolling Functions on Irregular Time Series with Time-Based Window
Adding Column If It Does Not Exist
Modifying Ggplot Objects After Creation
R Dplyr Rowwise Mean or Min and Other Methods
Efficiently Merging Two Data Frames on a Non-Trivial Criteria