Fastest way to detect if vector has at least 1 NA?
As of R 3.1.0, anyNA() is the way to do this. On atomic vectors it stops after the first NA instead of scanning the entire vector, as any(is.na()) would. It also avoids allocating an intermediate logical vector with is.na() that is immediately discarded. Borrowing Joran's example:
x <- y <- runif(1e7)
x[1e4] <- NA
y[1e7] <- NA
microbenchmark::microbenchmark(any(is.na(x)), anyNA(x), any(is.na(y)), anyNA(y), times=10)
# Unit: microseconds
# expr min lq mean median uq
# any(is.na(x)) 13444.674 13509.454 21191.9025 13639.3065 13917.592
# anyNA(x) 6.840 13.187 13.5283 14.1705 14.774
# any(is.na(y)) 165030.942 168258.159 178954.6499 169966.1440 197591.168
# anyNA(y) 7193.784 7285.107 7694.1785 7497.9265 7865.064
Notice that it is substantially faster even when we modify the last value of the vector; this is in part because it avoids the intermediate logical vector.
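One caveat worth noting: the early exit applies to atomic vectors. For lists, anyNA() has a recursive argument (default FALSE) that controls whether the elements themselves are searched. A quick sketch (the list values here are made up for illustration):

lst <- list(a = 1:3, b = c(4, NA))
anyNA(lst)                    # FALSE: no top-level element is itself NA
anyNA(lst, recursive = TRUE)  # TRUE: descends into each element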
How to check if vector is a single NA value without length warning and without suppression
The length warning comes from the use of if, which expects a length-1 vector, together with is.na(), which is vectorised. You could wrap the is.na() in any() or all() to compress it to a length-1 vector, but there may be edge cases where that doesn't work as you expect, so I would use short-circuit evaluation to check the length is 1 before the is.na() check:
so_function <- function(x = NA) {
  if (!((length(x) == 1 && is.na(x)) || is.character(x))) {
    stop("This was just an example for you SO!")
  }
}
so_function(NA_character_) # should pass
so_function(NA_integer_) # should pass
so_function(c(NA, NA)) # should fail
Error in so_function(c(NA, NA)) : This was just an example for you SO!
so_function(c("A", "B")) # should pass
so_function(c(1, 2, 3)) # should fail
Error in so_function(c(1, 2, 3)) : This was just an example for you SO!
Another option is to use NULL as the default value instead.
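A minimal sketch of that NULL-default variant (the function name here is hypothetical, not from the original question):

so_function2 <- function(x = NULL) {
  if (!(is.null(x) || is.character(x))) {
    stop("This was just an example for you SO!")
  }
}

so_function2()     # passes: default NULL
so_function2("A")  # passes: character input
# so_function2(1)  # would fail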
Detect at least one match between each data frame row and values in vector
Here's one way to do this:
df$valueFound <- apply(df, 1, function(x) {
  if (any(x %in% vec)) {
    1
  } else {
    0
  }
})
##
> df
x1 x2 x3 x4 valueFound
1 a b b a 1
2 c c d e 0
3 f g h i 1
4 j k <NA> <NA> 0
Thanks to @David Arenburg and @CathG, a couple of more concise approaches:
apply(df, 1, function(x) any(x %in% vec) + 0)
apply(df, 1, function(x) as.numeric(any(x %in% vec)))
Just for fun, a couple of other interesting variants:
apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
apply(df, 1, function(x) cumprod(any(x %in% vec)))
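If the apply() overhead matters, a column-wise variant avoids coercing each row to a character vector; this Reduce() version is my own suggestion, not from the original answers:

# OR together a "column value is in vec" mask per column, then coerce to 0/1
df$valueFound <- +Reduce(`|`, lapply(df, `%in%`, vec))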
Fastest way to count the number of rows in a data frame that have at least one NA
Before I begin, note that none of the code here is mine. I was merely fascinated by the code in the comments and wondered which one really performed the best.
I suspected some of the time was being absorbed in converting a data frame to a matrix for apply() and rowSums(), so I've also run most of the solutions on matrices to illustrate the penalty of applying these solutions to a data frame.
# Make a data frame of 10,000 rows and set random values to NA
library(dplyr)
set.seed(13)
MT <- mtcars[sample(1:nrow(mtcars), size = 10000, replace = TRUE), ]
MT <- lapply(MT,
             function(x) { x[sample(1:length(x), size = 100)] <- NA; x }) %>%
  bind_cols()
MT_mat <- as.matrix(MT)

library(microbenchmark)
microbenchmark(
  apply(MT, 1, anyNA),
  apply(MT_mat, 1, anyNA),                  # apply on a matrix
  row_sum = rowSums(is.na(MT)) > 0,
  row_sum_mat = rowSums(is.na(MT_mat)) > 0, # rowSums on a matrix
  reduce = Reduce('|', lapply(MT, is.na)),
  complete_case = !complete.cases(MT),
  complete_case_mat = !complete.cases(MT_mat) # complete.cases on a matrix
)
Unit: microseconds
expr min lq mean median uq max neval cld
apply(MT, 1, anyNA) 12126.013 13422.747 14930.6022 13927.5695 14589.1320 60958.791 100 d
apply(MT_mat, 1, anyNA) 11662.390 12546.674 14758.1266 13336.6785 14083.7225 66075.346 100 d
row_sum 1541.594 1581.768 2233.1150 1617.3985 1647.8955 49114.588 100 bc
row_sum_mat 579.161 589.131 707.3710 618.7490 627.5465 3235.089 100 a c
reduce 2028.969 2051.696 2252.8679 2084.8320 2102.8670 4271.127 100 c
complete_case 321.984 330.195 346.8692 342.5115 351.3090 436.057 100 a
complete_case_mat 348.083 358.640 384.1671 379.0205 406.8790 503.503 100 ab
# Verify that they all return the same result
MT$apply <- apply(MT, 1, anyNA)
MT$apply_mat <- apply(MT_mat, 1, anyNA)
MT$row_sum <- rowSums(is.na(MT)) > 0
MT$row_sum_mat <- rowSums(is.na(MT_mat)) > 0
MT$reduce <- Reduce('|', lapply(MT, is.na))
MT$complete_case <- !complete.cases(MT)
MT$complete_case_mat <- !complete.cases(MT_mat)
all(MT$apply == MT$apply_mat)
all(MT$apply == MT$row_sum)
all(MT$apply == MT$row_sum_mat)
all(MT$apply == MT$reduce)
all(MT$apply == MT$complete_case)
all(MT$apply == MT$complete_case_mat)
complete.cases() seems to be the clear winner, and it works well for both data frames and matrices. As it turns out, complete.cases() calls a C routine, which may account for much of its speed, while looking at rowSums(), apply(), and Reduce() shows R code.
Why apply() is slower than rowSums() probably has to do with rowSums() being optimized for a specific task: rowSums() knows it will be returning a numeric, while apply() has no such guarantee. I doubt that accounts for all of the difference; I'm mostly speculating.
I couldn't begin to tell you how Reduce() is working.
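For what it's worth, Reduce('|', lapply(MT, is.na)) folds a pairwise OR over the per-column is.na() masks; a hand-rolled equivalent (my sketch, not from the original answer) would be:

has_na <- rep(FALSE, nrow(MT))
for (col in MT) {
  has_na <- has_na | is.na(col)  # OR in each column's NA mask
}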
Detect if a NumPy array contains at least one non-numeric value?
This should be faster than iterating and will work regardless of shape.
numpy.isnan(myarray).any()
Edit: 30x faster:
import timeit
s = 'import numpy;a = numpy.arange(10000.).reshape((100,100));a[10,10]=numpy.nan'
ms = [
    'numpy.isnan(a).any()',
    'any(numpy.isnan(x) for x in a.flatten())']
for m in ms:
    print(" %.2f s %s" % (timeit.Timer(m, s).timeit(1000), m))
Results:
0.11 s numpy.isnan(a).any()
3.75 s any(numpy.isnan(x) for x in a.flatten())
Bonus: it works fine for non-array NumPy types:
>>> a = numpy.float64(42.)
>>> numpy.isnan(a).any()
False
>>> a = numpy.float64(numpy.nan)
>>> numpy.isnan(a).any()
True
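Note that isnan() only catches NaN; if "non-numeric" should also cover infinities, numpy.isfinite() flags both (this extension is my addition, not part of the original answer):

```python
import numpy

a = numpy.array([1.0, numpy.nan, numpy.inf])
print(numpy.isnan(a).any())        # True, but a lone inf would be missed
print((~numpy.isfinite(a)).any())  # True: flags both NaN and inf
```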
Select set of columns so that each row has at least one non-NA entry
Using a while loop, this should work to get the minimal set of variables with at least one non-NA per row.
best <- function(df){
  best <- which.max(colSums(sapply(df, complete.cases)))
  while(any(rowSums(sapply(df[best], complete.cases)) == 0)){
    best <- c(best, which.max(sapply(df[is.na(df[best]), ], \(x) sum(complete.cases(x)))))
  }
  best
}
Testing:
best(df)
#d c
#4 3
df[best(df)]
# d c
#1 1 1
#2 1 NA
#3 1 NA
#4 1 NA
#5 NA 1
First, select the column with the fewest NAs (stored in best). Then repeatedly add the column with the most non-NA entries among the remaining rows (rows where the columns in best are still all NA), until every row has a complete case.
Fastest way to find second (third...) highest/lowest value in vector or column
Rfast has a function called nth() (built on C++'s nth_element) that does exactly what you ask.
Further, the partial-sort-based methods discussed above don't support finding the k smallest values.
Update (28 Feb 2021): the kit package offers an even faster implementation (topn); see https://stackoverflow.com/a/66367996/4729755 and https://stackoverflow.com/a/53146559/4729755.
Disclaimer: an issue appears to occur with integer input, which can be bypassed by using as.numeric (e.g. Rfast::nth(as.numeric(1:10), 2)); it will be addressed in the next update of Rfast.
Rfast::nth(x, 5, descending = TRUE)
returns the 5th largest element of x, while
Rfast::nth(x, 5, descending = FALSE)
returns the 5th smallest element of x.
Benchmarks below against most popular answers.
For 10 thousand numbers:
N <- 10000
x <- rnorm(N)

maxN <- function(x, N = 2){
  len <- length(x)
  if(N > len){
    warning('N greater than length(x). Setting N=length(x)')
    N <- length(x)
  }
  sort(x, partial = len - N + 1)[len - N + 1]
}

microbenchmark::microbenchmark(
  Rfast = Rfast::nth(x, 5, descending = TRUE),
  maxN = maxN(x, 5),
  order = x[order(x, decreasing = TRUE)[5]])
Unit: microseconds
expr min lq mean median uq max neval
Rfast 160.364 179.607 202.8024 194.575 210.1830 351.517 100
maxN 396.419 423.360 559.2707 446.452 487.0775 4949.452 100
order 1288.466 1343.417 1746.7627 1433.221 1500.7865 13768.148 100
For 1 million numbers:
N <- 1e6
x <- rnorm(N)
microbenchmark::microbenchmark(
  Rfast = Rfast::nth(x, 5, descending = TRUE),
  maxN = maxN(x, 5),
  order = x[order(x, decreasing = TRUE)[5]])
Unit: milliseconds
expr min lq mean median uq max neval
Rfast 89.7722 93.63674 114.9893 104.6325 120.5767 204.8839 100
maxN 150.2822 207.03922 235.3037 241.7604 259.7476 336.7051 100
order 930.8924 968.54785 1005.5487 991.7995 1031.0290 1164.9129 100
How to check if entire vector has no values other than NA (or NAN) in R?
The function all(), when passed a Boolean vector, will tell you whether all of the values in it are TRUE:
> all(is.na(c(NA, NaN)))
[1] TRUE
> all(is.na(c(NA, NaN, 1)))
[1] FALSE
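One edge case worth knowing (my addition, not part of the original answer): all() is vacuously TRUE on a zero-length input, so an empty vector also "passes" this test.

all(is.na(numeric(0)))                  # TRUE: vacuously true for an empty vector
x <- numeric(0)
length(x) > 0 && all(is.na(x))          # FALSE: use this form if empty should fail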
Remove NA values from a vector
Trying ?max, you'll see that it actually has an na.rm = argument, set by default to FALSE. (That's the common default for many other R functions, including sum(), mean(), etc.)
Setting na.rm = TRUE does just what you're asking for:
d <- c(1, 100, NA, 10)
max(d, na.rm=TRUE)
If you do want to remove all of the NAs, use this idiom instead:
d <- d[!is.na(d)]
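Base R's na.omit() is an equivalent one-liner; it also attaches an na.action attribute recording which positions were dropped:

d <- c(1, 100, NA, 10)
na.omit(d)             # 1 100 10, plus an "na.action" attribute noting position 3
as.vector(na.omit(d))  # strip the attribute if you just want the values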
A final note: other functions (e.g. table(), lm(), and sort()) have NA-related arguments that use different names (and offer different options). So if NAs cause you problems in a function call, it's worth checking for a built-in solution among the function's arguments. I've found there's usually one already there.