Fastest Way to Detect If a Vector Has at Least 1 NA

Fastest way to detect if vector has at least 1 NA?

As of R 3.1.0, anyNA() is the way to do this. On atomic vectors it stops after the first NA instead of scanning the entire vector, as any(is.na()) would, and it avoids creating an intermediate logical vector with is.na() that is immediately discarded. Borrowing Joran's example:

x <- y <- runif(1e7)
x[1e4] <- NA
y[1e7] <- NA
microbenchmark::microbenchmark(any(is.na(x)), anyNA(x), any(is.na(y)), anyNA(y), times=10)
# Unit: microseconds
#          expr        min         lq        mean      median         uq
# any(is.na(x))  13444.674  13509.454  21191.9025  13639.3065  13917.592
#      anyNA(x)      6.840     13.187     13.5283     14.1705     14.774
# any(is.na(y)) 165030.942 168258.159 178954.6499 169966.1440 197591.168
#      anyNA(y)   7193.784   7285.107   7694.1785   7497.9265   7865.064

Notice how it is substantially faster even when we modify the last value of the vector; this is in part because of the avoidance of the intermediate logical vector.
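
As a quick aside, the same check extends naturally to the columns of a data frame; a minimal sketch (the data frame here is just an illustration, not from the original answer):

df <- data.frame(a = c(1, NA, 3), b = c("x", "y", "z"))
vapply(df, anyNA, logical(1))
#     a     b
#  TRUE FALSE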

How to check if vector is a single NA value without length warning and without suppression

The length warning comes from the combination of if, which expects a length-1 condition, and is.na(), which is vectorised.

You could wrap the is.na() in any() or all() to compress it to a length-1 value, but there may be edge cases where that doesn't behave as you expect, so I would use short-circuit evaluation to check that the object has length 1 before the is.na() check:

so_function <- function(x = NA) {
  if (!((length(x) == 1 && is.na(x)) || is.character(x))) {
    stop("This was just an example for you SO!")
  }
}

so_function(NA_character_) # should pass

so_function(NA_integer_) # should pass

so_function(c(NA, NA)) # should fail
Error in so_function(c(NA, NA)) : This was just an example for you SO!

so_function(c("A", "B")) # should pass

so_function(c(1, 2, 3)) # should fail
Error in so_function(c(1, 2, 3)) : This was just an example for you SO!

Another option is to use NULL as the default value instead.
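
For example, a minimal sketch of that variant (an illustration, not code from the original answer):

so_function <- function(x = NULL) {
  # NULL has length 0, so is.null() sidesteps the is.na() length issue entirely
  if (!(is.null(x) || is.character(x))) {
    stop("This was just an example for you SO!")
  }
}

so_function()           # passes: default NULL
so_function(c(1, 2, 3)) # fails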

Detect at least one match between each data frame row and values in vector

Here's one way to do this:
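
(The question's df and vec aren't reproduced in this excerpt; the setup below is an assumed reconstruction, chosen only so that the snippet matches the printed output.)

# Assumed data: the data frame shown in the output below, plus a
# hypothetical vec containing "a" and "h" so that rows 1 and 3 match
df <- data.frame(x1 = c("a", "c", "f", "j"),
                 x2 = c("b", "c", "g", "k"),
                 x3 = c("b", "d", "h", NA),
                 x4 = c("a", "e", "i", NA))
vec <- c("a", "h")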

df$valueFound <- apply(df, 1, function(x) {
  if (any(x %in% vec)) {
    1
  } else {
    0
  }
})
##
> df
  x1 x2   x3   x4 valueFound
1  a  b    b    a          1
2  c  c    d    e          0
3  f  g    h    i          1
4  j  k <NA> <NA>          0

Thanks to @David Arenburg and @CathG, a couple of more concise approaches:

  • apply(df, 1, function(x) any(x %in% vec) + 0)
  • apply(df, 1, function(x) as.numeric(any(x %in% vec)))

Just for fun, a couple of other interesting variants:

  • apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
  • apply(df, 1, function(x) cumprod(any(x %in% vec)))

Fastest way to count the number of rows in a data frame that have at least one NA

Before I begin, note that none of the code here is mine. I was merely fascinated by the code in the comments and wondered which one really performed the best.

I suspected some of the time was being absorbed in transforming a data frame to a matrix for apply and rowSums, so I've also run most of the solutions on matrices to illustrate the penalty incurred by applying them to a data frame.

# Make a data frame of 10,000 rows and set random values to NA

library(dplyr)
set.seed(13)
MT <- mtcars[sample(1:nrow(mtcars), size = 10000, replace = TRUE), ]

MT <- lapply(MT,
             function(x) { x[sample(1:length(x), size = 100)] <- NA; x }) %>%
  bind_cols()

MT_mat <- as.matrix(MT)

library(microbenchmark)
microbenchmark(
  apply(MT, 1, anyNA),
  apply(MT_mat, 1, anyNA),                    # apply on a matrix
  row_sum = rowSums(is.na(MT)) > 0,
  row_sum_mat = rowSums(is.na(MT_mat)),       # rowSums on a matrix
  reduce = Reduce('|', lapply(MT, is.na)),
  complete_case = !complete.cases(MT),
  complete_case_mat = !complete.cases(MT_mat) # complete.cases on a matrix
)

Unit: microseconds
                    expr       min        lq       mean     median         uq       max neval cld
     apply(MT, 1, anyNA) 12126.013 13422.747 14930.6022 13927.5695 14589.1320 60958.791   100   d
 apply(MT_mat, 1, anyNA) 11662.390 12546.674 14758.1266 13336.6785 14083.7225 66075.346   100   d
                 row_sum  1541.594  1581.768  2233.1150  1617.3985  1647.8955 49114.588   100  bc
             row_sum_mat   579.161   589.131   707.3710   618.7490   627.5465  3235.089   100 a c
                  reduce  2028.969  2051.696  2252.8679  2084.8320  2102.8670  4271.127   100   c
           complete_case   321.984   330.195   346.8692   342.5115   351.3090   436.057   100   a
       complete_case_mat   348.083   358.640   384.1671   379.0205   406.8790   503.503   100  ab

#* Verify that they all return the same result
MT$apply <- apply(MT, 1, anyNA)
MT$apply_mat <- apply(MT_mat, 1, anyNA)
MT$row_sum <- rowSums(is.na(MT)) > 0
MT$row_sum_mat <- rowSums(is.na(MT_mat)) > 0
MT$reduce <- Reduce('|', lapply(MT, is.na))
MT$complete_case <- !complete.cases(MT)
MT$complete_case_mat <- !complete.cases(MT_mat)

all(MT$apply == MT$apply_mat)
all(MT$apply == MT$row_sum)
all(MT$apply == MT$row_sum_mat)
all(MT$apply == MT$reduce)
all(MT$apply == MT$complete_case)
all(MT$apply == MT$complete_case_mat)

complete.cases seems to be the clear winner, and it works well for both data frames and matrices. As it turns out, complete.cases calls a C routine, which may account for much of its speed; looking at rowSums, apply, and Reduce, by contrast, shows R code.
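
Note that all of these expressions return a logical vector flagging the rows; to get the actual count asked about in the question, wrap the winner in sum():

sum(!complete.cases(MT))  # number of rows containing at least one NA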

Why apply is slower than rowSums probably has to do with rowSums being optimized for a specific task: rowSums knows it will be returning a numeric, while apply has no such guarantee. I doubt that accounts for all of the difference--I'm mostly speculating.

I couldn't begin to tell you how Reduce is working.
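
For what it's worth, here is a rough sketch of what that Reduce call does (a toy illustration, not the benchmark data): lapply(MT, is.na) produces one logical vector per column, and Reduce('|', ...) ORs them together element-wise, so the result is TRUE for any row with an NA in at least one column.

toy <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))
# is.na() per column: a -> FALSE TRUE FALSE, b -> TRUE FALSE FALSE
Reduce('|', lapply(toy, is.na))
# [1]  TRUE  TRUE FALSE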

Detect if a NumPy array contains at least one non-numeric value?

This should be faster than iterating and will work regardless of shape.

numpy.isnan(myarray).any()

Edit: 30x faster:

import timeit
s = 'import numpy; a = numpy.arange(10000.).reshape((100, 100)); a[10, 10] = numpy.nan'
ms = [
    'numpy.isnan(a).any()',
    'any(numpy.isnan(x) for x in a.flatten())']
for m in ms:
    print(" %.2f s" % timeit.Timer(m, s).timeit(1000), m)

Results:

 0.11 s numpy.isnan(a).any()
 3.75 s any(numpy.isnan(x) for x in a.flatten())

Bonus: it works fine for non-array NumPy types:

>>> a = numpy.float64(42.)
>>> numpy.isnan(a).any()
False
>>> a = numpy.float64(numpy.nan)
>>> numpy.isnan(a).any()
True

Select set of columns so that each row has at least one non-NA entry

Using a while loop, this should work to get the minimum set of variables with at least one non-NA per row.

best <- function(df){
  best <- which.max(colSums(sapply(df, complete.cases)))
  while (any(rowSums(sapply(df[best], complete.cases)) == 0)) {
    best <- c(best, which.max(sapply(df[is.na(df[best]), ], \(x) sum(complete.cases(x)))))
  }
  best
}

Testing:

best(df)
#d c
#4 3

df[best(df)]
# d c
#1 1 1
#2 1 NA
#3 1 NA
#4 1 NA
#5 NA 1

First, select the column with the fewest NAs (stored in best). Then repeatedly add the column with the largest number of non-NA values on the remaining rows (the rows where the columns in best still contain only NAs), until every row has at least one complete case.

Fastest way to find second (third...) highest/lowest value in vector or column

Rfast has a function called nth that does exactly what you ask.

Further, the methods discussed above that are based on partial sorting don't support finding the k smallest values.

Update (28/Feb/2021): the kit package offers a faster implementation (topn); see https://stackoverflow.com/a/66367996/4729755 and https://stackoverflow.com/a/53146559/4729755.

Disclaimer: An issue appears to occur when dealing with integers, which can be bypassed by using as.numeric (e.g. Rfast::nth(as.numeric(1:10), 2)), and will be addressed in the next update of Rfast.

Rfast::nth(x, 5, descending = T)

Will return the 5th largest element of x, while

Rfast::nth(x, 5, descending = F)

Will return the 5th smallest element of x

Benchmarks below against the most popular answers.

For 10 thousand numbers:

N = 10000
x = rnorm(N)

maxN <- function(x, N = 2){
  len <- length(x)
  if (N > len) {
    warning('N greater than length(x). Setting N=length(x)')
    N <- length(x)
  }
  sort(x, partial = len - N + 1)[len - N + 1]
}

microbenchmark::microbenchmark(
  Rfast = Rfast::nth(x, 5, descending = T),
  maxN = maxN(x, 5),
  order = x[order(x, decreasing = T)[5]])

Unit: microseconds
  expr      min       lq      mean   median        uq       max neval
 Rfast  160.364  179.607  202.8024  194.575  210.1830   351.517   100
  maxN  396.419  423.360  559.2707  446.452  487.0775  4949.452   100
 order 1288.466 1343.417 1746.7627 1433.221 1500.7865 13768.148   100

For 1 million numbers:

N = 1e6
x = rnorm(N)

microbenchmark::microbenchmark(
  Rfast = Rfast::nth(x, 5, descending = T),
  maxN = maxN(x, 5),
  order = x[order(x, decreasing = T)[5]])

Unit: milliseconds
  expr      min        lq      mean   median        uq       max neval
 Rfast  89.7722  93.63674  114.9893 104.6325  120.5767  204.8839   100
  maxN 150.2822 207.03922  235.3037 241.7604  259.7476  336.7051   100
 order 930.8924 968.54785 1005.5487 991.7995 1031.0290 1164.9129   100
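
Regarding the kit update mentioned above, a minimal sketch (assuming kit::topn(x, n) returns the indices of the n largest values; see the linked answer for exact semantics and benchmarks):

idx <- kit::topn(x, 5L)             # assumed: indices of the 5 largest values
sort(x[idx], decreasing = TRUE)[5]  # the 5th largest value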

How to check if an entire vector has no values other than NA (or NaN) in R?

The function all(), when passed a logical vector, will tell you whether all of the values in it are TRUE:

> all(is.na(c(NA, NaN)))
[1] TRUE
> all(is.na(c(NA, NaN, 1)))
[1] FALSE
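
One edge case worth keeping in mind: all() returns TRUE on a zero-length input, so an empty vector also passes this check.

all(is.na(numeric(0)))
# [1] TRUE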

Remove NA values from a vector

If you look at ?max, you'll see that it has an na.rm argument, set by default to FALSE. (That's the common default for many other R functions, including sum(), mean(), etc.)

Setting na.rm=TRUE does just what you're asking for:

d <- c(1, 100, NA, 10)
max(d, na.rm=TRUE)

If you do want to remove all of the NAs, use this idiom instead:

d <- d[!is.na(d)]

A final note: other functions (e.g. table(), lm(), and sort()) have NA-related arguments that use different names (and offer different options), so if NAs cause you problems in a function call, it's worth checking for a built-in solution among the function's arguments. I've found there's usually one already there.
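
For instance, a few of those differently named arguments (a quick sketch; check each function's help page for the full set of options):

x <- c(1, NA, 2)
table(x, useNA = "ifany")   # table() counts NAs only when asked
sort(x, na.last = TRUE)     # sort() drops NAs unless na.last is set
mean(x, na.rm = TRUE)       # the common na.rm convention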


