How to Subset Data in R Without Losing Na Rows

How to subset data in R without losing NA rows?

If we decide to use subset function, then we need to watch out:

For ordinary vectors, the result is simply ‘x[subset & !is.na(subset)]’.

So only non-NA values will be retained.

If you want to keep NA cases, use logical or condition to tell R not to drop NA cases:

subset(df1, Height < 40 | is.na(Height))
# or `df1[df1$Height < 40 | is.na(df1$Height), ]`

Don't use directly (to be explained soon):

df2 <- df1[df1$Height < 40, ]

Example

df1 <- data.frame(Height = c(NA, 2, 4, NA, 50, 60), y = 1:6)

subset(df1, Height < 40 | is.na(Height))

# Height y
#1 NA 1
#2 2 2
#3 4 3
#4 NA 4

df1[df1$Height < 40, ]

# Height y
#1 NA NA
#2 2 2
#3 4 3
#4 NA NA

The reason that the latter fails, is that indexing by NA gives NA. Consider this simple example with a vector:

x <- 1:4
ind <- c(NA, TRUE, NA, FALSE)
x[ind]
# [1] NA 2 NA

We need to somehow replace those NA with TRUE. The most straightforward way is to add another "or" condition is.na(ind):

x[ind | is.na(ind)]
# [1] 1 2 3

This is exactly what will happen in your situation. If your Height contains NA, then logical operation Height < 40 ends up a mix of TRUE / FALSE / NA, so we need replace NA by TRUE as above.

How to filter data without losing NA rows using dplyr

The documentation for dplyr::filter says... "Unlike base subsetting, rows where the condition evaluates to NA are dropped."

NA != "str" evaluates to NA so is dropped by filter.

!grepl("str", NA) returns TRUE, so is kept.

If you want filter to keep NA, you could do filter(is.na(col)|col!="str")

Subsetting Rows in R producing NAs, but there are no NAs in Data Frame

We can specify the row with the logical expression, subset the columns with the column names as strings, get the unique and extract the distance

unique(x[x$component ==1, c("ObjectID", "distance")])$distance
#[1] 2 4

If the intention is only to get the 'distance' based on the 'unique' values of 'ObjectID', we can use duplicated

with(subset(x, component == 1, select = c(ObjectID, distance)), 
distance[!duplicated(ObjectID)])
#[1] 2 4

Or more compactly, join two conditions with &

subset(x, !duplicated(ObjectID) & component == 1)$distance
#[1] 2 4

The issue in OP's code is using the unique value of 'ObjectID' as row index, which fails as the index can be either logical or numeric index

unique(x[x$component==1,]$ObjectID)
#[1] "11AD1234" "11DA354"

If we have to convert this to logical, we can use %in%

Remove rows with all or some NAs (missing values) in data.frame

Also check complete.cases :

> final[complete.cases(final), ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2

na.omit is nicer for just removing all NA's. complete.cases allows partial selection by including only certain columns of the dataframe:

> final[complete.cases(final[ , 5:6]),]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2

Your solution can't work. If you insist on using is.na, then you have to do something like:

> final[rowSums(is.na(final[ , 5:6])) == 0, ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2

but using complete.cases is quite a lot more clear, and faster.

Subset dataframe in R, dplyr filter row values of column A not NA in row of column B

Maybe this will achieve your goal. If all participants have all StudyDay timepoints, and you just want to see if not missing in days 2 or 4, you can just check the Ab values at those time points in your filter. In this case, an ID will be omitted if is NA in both days 2 and 4 (in this example, "D").

Alternatively, if you want to require that both values are available for days 2 and 4, you can use & (AND) instead of | (OR).

library(dplyr)

fakedat %>%
group_by(ID) %>%
filter(!is.na(Ab[StudyDay == 2]) | !is.na(Ab[StudyDay == 4]))

If you have multiple days to check are not missing, you can use all and check values for NA where the StudyDay is %in% a vector of required days as follows:

required_vals <- c(2, 4)

fakedat %>%
group_by(ID) %>%
filter(all(!is.na(Ab[StudyDay %in% required_vals])))

Output

   ID    StudyDay    Ab
<chr> <dbl> <dbl>
1 A 1 10
2 A 2 NA
3 A 3 15
4 A 4 10
5 B 1 10
6 B 2 20
7 B 3 10
8 B 4 NA
9 C 1 10
10 C 2 10
11 C 3 NA
12 C 4 30
13 E 1 10
14 E 2 20
15 E 3 10
16 E 4 30
17 F 1 NA
18 F 2 10
19 F 3 NA
20 F 4 20


Related Topics



Leave a reply



Submit