How to subset data in R without losing NA rows?
If we decide to use subset
function, then we need to watch out:
For ordinary vectors, the result is simply ‘x[subset & !is.na(subset)]’.
So only non-NA values will be retained.
If you want to keep NA
cases, use logical or condition to tell R not to drop NA
cases:
subset(df1, Height < 40 | is.na(Height))
# or `df1[df1$Height < 40 | is.na(df1$Height), ]`
Don't use directly (to be explained soon):
df2 <- df1[df1$Height < 40, ]
Example
df1 <- data.frame(Height = c(NA, 2, 4, NA, 50, 60), y = 1:6)
subset(df1, Height < 40 | is.na(Height))
# Height y
#1 NA 1
#2 2 2
#3 4 3
#4 NA 4
df1[df1$Height < 40, ]
# Height y
#1 NA NA
#2 2 2
#3 4 3
#4 NA NA
The reason that the latter fails, is that indexing by NA
gives NA
. Consider this simple example with a vector:
x <- 1:4
ind <- c(NA, TRUE, NA, FALSE)
x[ind]
# [1] NA 2 NA
We need to somehow replace those NA
with TRUE
. The most straightforward way is to add another "or" condition is.na(ind)
:
x[ind | is.na(ind)]
# [1] 1 2 3
This is exactly what will happen in your situation. If your Height
contains NA
, then logical operation Height < 40
ends up a mix of TRUE
/ FALSE
/ NA
, so we need replace NA
by TRUE
as above.
How to filter data without losing NA rows using dplyr
The documentation for dplyr::filter
says... "Unlike base subsetting, rows where the condition evaluates to NA are dropped."
NA != "str"
evaluates to NA
so is dropped by filter
.
!grepl("str", NA)
returns TRUE
, so is kept.
If you want filter
to keep NA
, you could do filter(is.na(col)|col!="str")
Subsetting Rows in R producing NAs, but there are no NAs in Data Frame
We can specify the row with the logical expression, subset the columns with the column names as strings, get the unique
and extract the distance
unique(x[x$component ==1, c("ObjectID", "distance")])$distance
#[1] 2 4
If the intention is only to get the 'distance' based on the 'unique' values of 'ObjectID', we can use duplicated
with(subset(x, component == 1, select = c(ObjectID, distance)),
distance[!duplicated(ObjectID)])
#[1] 2 4
Or more compactly, join two conditions with &
subset(x, !duplicated(ObjectID) & component == 1)$distance
#[1] 2 4
The issue in OP's code is using the unique
value of 'ObjectID' as row index, which fails as the index can be either logical or numeric index
unique(x[x$component==1,]$ObjectID)
#[1] "11AD1234" "11DA354"
If we have to convert this to logical, we can use %in%
Remove rows with all or some NAs (missing values) in data.frame
Also check complete.cases
:
> final[complete.cases(final), ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
na.omit
is nicer for just removing all NA
's. complete.cases
allows partial selection by including only certain columns of the dataframe:
> final[complete.cases(final[ , 5:6]),]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
Your solution can't work. If you insist on using is.na
, then you have to do something like:
> final[rowSums(is.na(final[ , 5:6])) == 0, ]
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
but using complete.cases
is quite a lot more clear, and faster.
Subset dataframe in R, dplyr filter row values of column A not NA in row of column B
Maybe this will achieve your goal. If all participants have all StudyDay
timepoints, and you just want to see if not missing in days 2 or 4, you can just check the Ab
values at those time points in your filter
. In this case, an ID
will be omitted if is NA
in both days 2 and 4 (in this example, "D").
Alternatively, if you want to require that both values are available for days 2 and 4, you can use &
(AND) instead of |
(OR).
library(dplyr)
fakedat %>%
group_by(ID) %>%
filter(!is.na(Ab[StudyDay == 2]) | !is.na(Ab[StudyDay == 4]))
If you have multiple days to check are not missing, you can use all
and check values for NA
where the StudyDay
is %in%
a vector of required days as follows:
required_vals <- c(2, 4)
fakedat %>%
group_by(ID) %>%
filter(all(!is.na(Ab[StudyDay %in% required_vals])))
Output
ID StudyDay Ab
<chr> <dbl> <dbl>
1 A 1 10
2 A 2 NA
3 A 3 15
4 A 4 10
5 B 1 10
6 B 2 20
7 B 3 10
8 B 4 NA
9 C 1 10
10 C 2 10
11 C 3 NA
12 C 4 30
13 E 1 10
14 E 2 20
15 E 3 10
16 E 4 30
17 F 1 NA
18 F 2 10
19 F 3 NA
20 F 4 20
Related Topics
Avoid String Printed to Console Getting Truncated (In Rstudio)
Plotting a 3D Surface Plot with Contour Map Overlay, Using R
Inst and Extdata Folders in R Packaging
Why Does As.Factor Return a Character When Used Inside Apply
Rscript Does Not Load Methods Package, R Does -- Why, and What Are the Consequences
R Package Lattice Won't Plot If Run Using Source()
How to Reorder Data.Table Columns (Without Copying)
Write List of Data.Frames to Separate CSV Files with Lapply
Colorbar from Custom Colorramppalette
Calculate Cumsum() While Ignoring Na Values
Dplyr on Data.Table, am I Really Using Data.Table
R: Assign Variable Labels of Data Frame Columns
How Subset a Data Frame by a Factor and Repeat a Plot for Each Subset
R Gotcha: Logical-And Operator for Combining Conditions Is & Not &&
Remove Duplicate Column Pairs, Sort Rows Based on 2 Columns
What's the Difference Between Integer Class and Numeric Class in R