Fastest way to detect if vector has at least 1 NA?
As of R 3.1.0, anyNA() is the way to do this. On atomic vectors it stops after the first NA instead of scanning the entire vector, as any(is.na()) would. It also avoids allocating an intermediate logical vector with is.na() that is immediately discarded. Borrowing Joran's example:
x <- y <- runif(1e7)
x[1e4] <- NA
y[1e7] <- NA
microbenchmark::microbenchmark(any(is.na(x)), anyNA(x), any(is.na(y)), anyNA(y), times=10)
# Unit: microseconds
# expr min lq mean median uq
# any(is.na(x)) 13444.674 13509.454 21191.9025 13639.3065 13917.592
# anyNA(x) 6.840 13.187 13.5283 14.1705 14.774
# any(is.na(y)) 165030.942 168258.159 178954.6499 169966.1440 197591.168
# anyNA(y) 7193.784 7285.107 7694.1785 7497.9265 7865.064
Notice that it is substantially faster even when we modify the last value of the vector; this is in part because it avoids the intermediate logical vector.
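One caveat worth noting: the early exit applies to atomic vectors. For lists, anyNA() has a recursive argument (default FALSE) that controls whether the elements themselves are searched. A quick sketch (the list values here are made up for illustration):

lst <- list(a = 1:3, b = c(4, NA))
anyNA(lst)                    # FALSE: no top-level element is itself NA
anyNA(lst, recursive = TRUE)  # TRUE: descends into each element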
How to check if vector is a single NA value without length warning and without suppression
The length warning comes from the use of if, which expects a length-1 vector, together with is.na(), which is vectorised. You could wrap the is.na() in any() or all() to compress it to a length-1 vector, but there may be edge cases where that doesn't work as you expect, so I would use short-circuit evaluation to check the length is 1 before the is.na() check:
so_function <- function(x = NA) {
  if (!((length(x) == 1 && is.na(x)) || is.character(x))) {
    stop("This was just an example for you SO!")
  }
}
so_function(NA_character_) # should pass
so_function(NA_integer_) # should pass
so_function(c(NA, NA)) # should fail
Error in so_function(c(NA, NA)) : This was just an example for you SO!
so_function(c("A", "B")) # should pass
so_function(c(1, 2, 3)) # should fail
Error in so_function(c(1, 2, 3)) : This was just an example for you SO!
Another option is to use NULL as the default value instead.
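A minimal sketch of that NULL-default variant (the function name here is hypothetical, not from the original question):

so_function2 <- function(x = NULL) {
  if (!(is.null(x) || is.character(x))) {
    stop("This was just an example for you SO!")
  }
}

so_function2()     # passes: default NULL
so_function2("A")  # passes: character input
# so_function2(1)  # would fail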
Detect at least one match between each data frame row and values in vector
Here's one way to do this:
df$valueFound <- apply(df, 1, function(x) {
  if (any(x %in% vec)) {
    1
  } else {
    0
  }
})
##
> df
x1 x2 x3 x4 valueFound
1 a b b a 1
2 c c d e 0
3 f g h i 1
4 j k <NA> <NA> 0
Thanks to @David Arenburg and @CathG, a couple of more concise approaches:
apply(df, 1, function(x) any(x %in% vec) + 0)
apply(df, 1, function(x) as.numeric(any(x %in% vec)))
Just for fun, a couple of other interesting variants:
apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
apply(df, 1, function(x) cumprod(any(x %in% vec)))
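If the apply() overhead matters, a column-wise variant avoids coercing each row to a character vector; this Reduce() version is my own suggestion, not from the original answers:

# OR together a "column value is in vec" mask per column, then coerce to 0/1
df$valueFound <- +Reduce(`|`, lapply(df, `%in%`, vec))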
Fastest way to count the number of rows in a data frame that have at least one NA
Before I begin, note that none of the code here is mine. I was merely fascinated by the code in the comments and wondered which one really performed the best.
I suspected some of the time was being absorbed in converting a data frame to a matrix for apply() and rowSums(), so I've also run most of the solutions on matrices to illustrate the penalty of applying these solutions to a data frame.
# Make a data frame of 10,000 rows and set random values to NA
library(dplyr)
set.seed(13)
MT <- mtcars[sample(1:nrow(mtcars), size = 10000, replace = TRUE), ]
MT <- lapply(MT,
             function(x) { x[sample(1:length(x), size = 100)] <- NA; x }) %>%
  bind_cols()
MT_mat <- as.matrix(MT)

library(microbenchmark)
microbenchmark(
  apply(MT, 1, anyNA),
  apply(MT_mat, 1, anyNA),                  # apply on a matrix
  row_sum = rowSums(is.na(MT)) > 0,
  row_sum_mat = rowSums(is.na(MT_mat)) > 0, # rowSums on a matrix
  reduce = Reduce('|', lapply(MT, is.na)),
  complete_case = !complete.cases(MT),
  complete_case_mat = !complete.cases(MT_mat) # complete.cases on a matrix
)
Unit: microseconds
expr min lq mean median uq max neval cld
apply(MT, 1, anyNA) 12126.013 13422.747 14930.6022 13927.5695 14589.1320 60958.791 100 d
apply(MT_mat, 1, anyNA) 11662.390 12546.674 14758.1266 13336.6785 14083.7225 66075.346 100 d
row_sum 1541.594 1581.768 2233.1150 1617.3985 1647.8955 49114.588 100 bc
row_sum_mat 579.161 589.131 707.3710 618.7490 627.5465 3235.089 100 a c
reduce 2028.969 2051.696 2252.8679 2084.8320 2102.8670 4271.127 100 c
complete_case 321.984 330.195 346.8692 342.5115 351.3090 436.057 100 a
complete_case_mat 348.083 358.640 384.1671 379.0205 406.8790 503.503 100 ab
# Verify that they all return the same result
MT$apply <- apply(MT, 1, anyNA)
MT$apply_mat <- apply(MT_mat, 1, anyNA)
MT$row_sum <- rowSums(is.na(MT)) > 0
MT$row_sum_mat <- rowSums(is.na(MT_mat)) > 0
MT$reduce <- Reduce('|', lapply(MT, is.na))
MT$complete_case <- !complete.cases(MT)
MT$complete_case_mat <- !complete.cases(MT_mat)
all(MT$apply == MT$apply_mat)
all(MT$apply == MT$row_sum)
all(MT$apply == MT$row_sum_mat)
all(MT$apply == MT$reduce)
all(MT$apply == MT$complete_case)
all(MT$apply == MT$complete_case_mat)
complete.cases() seems to be the clear winner, and it works well for both data frames and matrices. As it turns out, complete.cases() calls a C routine, which may account for much of its speed, while looking at rowSums(), apply(), and Reduce() shows R code.
Why apply() is slower than rowSums() probably has to do with rowSums() being optimized for a specific task: rowSums() knows it will be returning a numeric, while apply() has no such guarantee. I doubt that accounts for all of the difference; I'm mostly speculating.
I couldn't begin to tell you how Reduce() is working.
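For what it's worth, Reduce('|', lapply(MT, is.na)) folds a pairwise OR over the per-column is.na() masks; a hand-rolled equivalent (my sketch, not from the original answer) would be:

has_na <- rep(FALSE, nrow(MT))
for (col in MT) {
  has_na <- has_na | is.na(col)  # OR in each column's NA mask
}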
Detect if a NumPy array contains at least one non-numeric value?
This should be faster than iterating and will work regardless of shape.
numpy.isnan(myarray).any()
Edit: 30x faster:
import timeit
s = 'import numpy;a = numpy.arange(10000.).reshape((100,100));a[10,10]=numpy.nan'
ms = [
    'numpy.isnan(a).any()',
    'any(numpy.isnan(x) for x in a.flatten())']
for m in ms:
    print(" %.2f s %s" % (timeit.Timer(m, s).timeit(1000), m))
Results:
0.11 s numpy.isnan(a).any()
3.75 s any(numpy.isnan(x) for x in a.flatten())
Bonus: it works fine for non-array NumPy types:
>>> a = numpy.float64(42.)
>>> numpy.isnan(a).any()
False
>>> a = numpy.float64(numpy.nan)
>>> numpy.isnan(a).any()
True
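Note that isnan() only catches NaN; if "non-numeric" should also cover infinities, numpy.isfinite() flags both (this extension is my addition, not part of the original answer):

```python
import numpy

a = numpy.array([1.0, numpy.nan, numpy.inf])
print(numpy.isnan(a).any())        # True, but a lone inf would be missed
print((~numpy.isfinite(a)).any())  # True: flags both NaN and inf
```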
Select set of columns so that each row has at least one non-NA entry
Using a while loop, this should work to get the minimal set of variables with at least one non-NA per row.
best <- function(df){
  best <- which.max(colSums(sapply(df, complete.cases)))
  while(any(rowSums(sapply(df[best], complete.cases)) == 0)){
    best <- c(best, which.max(sapply(df[is.na(df[best]), ], \(x) sum(complete.cases(x)))))
  }
  best
}
Testing:
best(df)
#d c
#4 3
df[best(df)]
# d c
#1 1 1
#2 1 NA
#3 1 NA
#4 1 NA
#5 NA 1
First, select the column with the fewest NAs (stored in best). Then repeatedly add the column with the most non-NA entries among the remaining rows (rows where the columns in best are still all NA), until every row has a complete case.
Fastest way to find second (third...) highest/lowest value in vector or column
Rfast has a function called nth() (built on C++'s nth_element) that does exactly what you ask.
Further, the partial-sort-based methods discussed above don't support finding the k smallest values.
Update (28 Feb 2021): the kit package offers an even faster implementation (topn); see https://stackoverflow.com/a/66367996/4729755 and https://stackoverflow.com/a/53146559/4729755.
Disclaimer: an issue appears to occur with integer input, which can be bypassed by using as.numeric (e.g. Rfast::nth(as.numeric(1:10), 2)); it will be addressed in the next update of Rfast.
Rfast::nth(x, 5, descending = TRUE)
returns the 5th largest element of x, while
Rfast::nth(x, 5, descending = FALSE)
returns the 5th smallest element of x.
Benchmarks below against most popular answers.
For 10 thousand numbers:
N <- 10000
x <- rnorm(N)

maxN <- function(x, N = 2){
  len <- length(x)
  if(N > len){
    warning('N greater than length(x). Setting N=length(x)')
    N <- length(x)
  }
  sort(x, partial = len - N + 1)[len - N + 1]
}

microbenchmark::microbenchmark(
  Rfast = Rfast::nth(x, 5, descending = TRUE),
  maxN = maxN(x, 5),
  order = x[order(x, decreasing = TRUE)[5]])
Unit: microseconds
expr min lq mean median uq max neval
Rfast 160.364 179.607 202.8024 194.575 210.1830 351.517 100
maxN 396.419 423.360 559.2707 446.452 487.0775 4949.452 100
order 1288.466 1343.417 1746.7627 1433.221 1500.7865 13768.148 100
For 1 million numbers:
N <- 1e6
x <- rnorm(N)
microbenchmark::microbenchmark(
  Rfast = Rfast::nth(x, 5, descending = TRUE),
  maxN = maxN(x, 5),
  order = x[order(x, decreasing = TRUE)[5]])
Unit: milliseconds
expr min lq mean median uq max neval
Rfast 89.7722 93.63674 114.9893 104.6325 120.5767 204.8839 100
maxN 150.2822 207.03922 235.3037 241.7604 259.7476 336.7051 100
order 930.8924 968.54785 1005.5487 991.7995 1031.0290 1164.9129 100
How to check if entire vector has no values other than NA (or NAN) in R?
The function all(), when passed a Boolean vector, will tell you whether all of the values in it are TRUE:
> all(is.na(c(NA, NaN)))
[1] TRUE
> all(is.na(c(NA, NaN, 1)))
[1] FALSE
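One edge case worth knowing (my addition, not part of the original answer): all() is vacuously TRUE on a zero-length input, so an empty vector also "passes" this test.

all(is.na(numeric(0)))                  # TRUE: vacuously true for an empty vector
x <- numeric(0)
length(x) > 0 && all(is.na(x))          # FALSE: use this form if empty should fail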
Remove NA values from a vector
Trying ?max, you'll see that it actually has an na.rm = argument, set by default to FALSE. (That's the common default for many other R functions, including sum(), mean(), etc.)
Setting na.rm = TRUE does just what you're asking for:
d <- c(1, 100, NA, 10)
max(d, na.rm=TRUE)
If you do want to remove all of the NAs, use this idiom instead:
d <- d[!is.na(d)]
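Base R's na.omit() is an equivalent one-liner; it also attaches an na.action attribute recording which positions were dropped:

d <- c(1, 100, NA, 10)
na.omit(d)             # 1 100 10, plus an "na.action" attribute noting position 3
as.vector(na.omit(d))  # strip the attribute if you just want the values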
A final note: other functions (e.g. table(), lm(), and sort()) have NA-related arguments that use different names (and offer different options). So if NAs cause you problems in a function call, it's worth checking for a built-in solution among the function's arguments. I've found there's usually one already there.