Rcpp Function Check If Missing Value

Rcpp function check if missing value

Both Rcpp and RcppArmadillo have predicates to test for NA, NaN (an R extension) and Inf.

Here is a short RcppArmadillo example:

#include <RcppArmadillo.h>

// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
arma::mat foo(int n, double threshold=NA_REAL) {
arma::mat M = arma::zeros<arma::mat>(n,n);
if (arma::is_finite(threshold)) M = M + threshold;
return M;
}

/*** R
foo(2)
foo(2, 3.1415)
***/

We initialize a matrix of zeros, and test for the argument. If it is finite (ie not NA or Inf or NaN), then we add that value. If you wanted to, you could test for the possibilities individually too.

This produces the desired result: without a second argument the default value of NA applies, and we get a matrix of zeros.

R> Rcpp::sourceCpp("/tmp/giorgio.cpp")

R> foo(2)
[,1] [,2]
[1,] 0 0
[2,] 0 0

R> foo(2, 3.1415)
[,1] [,2]
[1,] 3.1415 3.1415
[2,] 3.1415 3.1415
R>

Fast checking of missing values in Rcpp

Checking for missing value or any NaN specific variant is always going to be more expensive than checking for a specific value. That's just floating point arithmetic.

However there's still room for improvement in your code. I would encourage you to use NumericVector::is_na instead of R_IsNA but this is mostly cosmetic.

Then branching can be expensive, i.e. I'd replace if (R_IsNA(x[i])) c++; by c += NumericVector::is_na(x[i]). This gives this version:

// [[Rcpp::export]]
int nb_na4(const NumericVector& x) {
int n = x.size();
int c = 0;
for (int i = 0; i < n; i++) c += NumericVector::is_na(x[i]) ;
return c;
}

Then iterating on an int and accessing x[i] can be replaced by using the std::count_if algorithm. This is it's raison d'être. Leading to this version:

// [[Rcpp::export]]
int nb_na5(const NumericVector& x) {
return std::count_if(x.begin(), x.end(), NumericVector::is_na ) ;
}

Now if the performance is still not good enough, you might want to try parallelization, for this I typically use the tbb library from the RcppParallel package.

// [[Rcpp::export]]
int nb_na6(const NumericVector& x) {
return tbb::parallel_reduce(
tbb::blocked_range<const double*>(x.begin(), x.end()),
0,
[](const tbb::blocked_range<const double*>& r, int init) -> int {
return init + std::count_if( r.begin(), r.end(), NumericVector::is_na );
},
[]( int x, int y){ return x+y; }
) ;
}

Benchmarking with this function:

library(microbenchmark)

bench <- function(n){
x <- rep(c(1, 2, NA), n)
microbenchmark(
nb_na = nb_na(x),
nb_na4 = nb_na4(x),
nb_na5 = nb_na5(x),
nb_na6 = nb_na6(x)
)
}
bench(1e5)

On my machine I get:

> bench(1e4)
Unit: microseconds
expr min lq mean median uq max neval cld
nb_na 84.358 94.6500 107.41957 110.482 118.9580 137.393 100 d
nb_na4 59.984 69.4925 79.42195 82.442 85.9175 106.567 100 b
nb_na5 65.047 75.2625 85.17134 87.501 93.0315 116.993 100 c
nb_na6 39.205 51.0785 59.20582 54.457 68.9625 97.225 100 a

> bench(1e5)
Unit: microseconds
expr min lq mean median uq max neval cld
nb_na 730.416 732.2660 829.8440 797.4350 872.3335 1410.467 100 d
nb_na4 520.800 521.6215 598.8783 562.7200 657.1755 1059.991 100 b
nb_na5 578.527 579.3805 664.8795 626.5530 710.5925 1166.365 100 c
nb_na6 294.486 345.2050 368.6664 353.6945 372.6205 897.552 100 a

Another way is to totally circumvent floating point arithmetic and pretend the vector is a vector of long long, aka 64 bit integers and compare the values to the bit pattern of NA_REAL:

  > devtools::install_github( "ThinkR-open/seven31" )
> seven31::reveal(NA, NaN, +Inf, -Inf )
0 11111111111 ( NaN ) 0000000000000000000000000000000000000000011110100010 : NA
0 11111111111 ( NaN ) 1000000000000000000000000000000000000000000000000000 : NaN
0 11111111111 ( NaN ) 0000000000000000000000000000000000000000000000000000 : +Inf
1 11111111111 ( NaN ) 0000000000000000000000000000000000000000000000000000 : -Inf

A serial solution using this hack:

// [[Rcpp::export]]
int nb_na7( const NumericVector& x){
const long long* p = reinterpret_cast<const long long*>(x.begin()) ;
long long na = *reinterpret_cast<long long*>(&NA_REAL) ;

return std::count(p, p + x.size(), na ) ;

}

And then a parallel version:

// [[Rcpp::export]]
int nb_na8( const NumericVector& x){
const long long* p = reinterpret_cast<const long long*>(x.begin()) ;
long long na = *reinterpret_cast<long long*>(&NA_REAL) ;

auto count_chunk = [=](const tbb::blocked_range<const long long*>& r, int init) -> int {
return init + std::count( r.begin(), r.end(), na);
} ;

return tbb::parallel_reduce(
tbb::blocked_range<const long long*>(p, p + x.size()),
0,
count_chunk,
[]( int x, int y){ return x+y; }
) ;

}

> bench(1e5)
Unit: microseconds
expr min lq mean median uq max neval cld
nb_na 730.346 762.5720 839.9479 857.5865 881.8635 1045.048 100 f
nb_na4 520.946 521.6850 589.0911 578.2825 653.4950 832.449 100 d
nb_na5 578.621 579.3245 640.9772 616.8645 701.8125 890.736 100 e
nb_na6 291.115 307.4300 340.1626 344.7955 360.7030 484.261 100 c
nb_na7 122.156 123.4990 141.1954 132.6385 149.7895 253.988 100 b
nb_na8 69.356 86.9980 109.6427 115.2865 126.2775 182.184 100 a

> bench(1e6)
Unit: microseconds
expr min lq mean median uq max neval cld
nb_na 7342.984 7956.3375 10261.583 9227.7450 10869.605 79757.09 100 d
nb_na4 5286.970 5721.9150 7659.009 6660.2390 9234.646 31141.47 100 c
nb_na5 5840.946 6272.7050 7307.055 6883.2430 8205.117 10420.48 100 c
nb_na6 2833.378 2895.7160 3891.745 3049.4160 4054.022 18242.26 100 b
nb_na7 1661.421 1791.1085 2708.992 1916.6055 2232.720 60827.63 100 ab
nb_na8 650.639 869.6685 1289.373 939.0045 1291.025 10223.29 100 a

This assumes there's only one bit pattern to represent NA.

Here's my entire file for reference:

#include <Rcpp.h>
#include <RcppParallel.h>

// [[Rcpp::depends(RcppParallel)]]
// [[Rcpp::plugins(cpp11)]]
using namespace Rcpp;

// [[Rcpp::export]]
int nb_na(const NumericVector& x) {
int n = x.size();
int c = 0;
for (int i = 0; i < n; i++) if (R_IsNA(x[i])) c++;
return c;
}

// [[Rcpp::export]]
int nb_na4(const NumericVector& x) {
int n = x.size();
int c = 0;
for (int i = 0; i < n; i++) c += NumericVector::is_na(x[i]) ;
return c;
}

// [[Rcpp::export]]
int nb_na5(const NumericVector& x) {
return std::count_if(x.begin(), x.end(), NumericVector::is_na ) ;
}

// [[Rcpp::export]]
int nb_na6(const NumericVector& x) {
return tbb::parallel_reduce(
tbb::blocked_range<const double*>(x.begin(), x.end()),
0,
[](const tbb::blocked_range<const double*>& r, int init) -> int {
return init + std::count_if( r.begin(), r.end(), NumericVector::is_na );
},
[]( int x, int y){ return x+y; }
) ;
}

// [[Rcpp::export]]
int nb_na7( const NumericVector& x){
const long long* p = reinterpret_cast<const long long*>(x.begin()) ;
long long na = *reinterpret_cast<long long*>(&NA_REAL) ;

return std::count(p, p + x.size(), na ) ;

}

// [[Rcpp::export]]
int nb_na8( const NumericVector& x){
const long long* p = reinterpret_cast<const long long*>(x.begin()) ;
long long na = *reinterpret_cast<long long*>(&NA_REAL) ;

auto count_chunk = [=](const tbb::blocked_range<const long long*>& r, int init) -> int {
return init + std::count( r.begin(), r.end(), na);
} ;

return tbb::parallel_reduce(
tbb::blocked_range<const long long*>(p, p + x.size()),
0,
count_chunk,
[]( int x, int y){ return x+y; }
) ;

}

/*** R
library(microbenchmark)

bench <- function(n){
x <- rep(c(1, 2, NA), n)
microbenchmark(
nb_na = nb_na(x),
nb_na4 = nb_na4(x),
nb_na5 = nb_na5(x),
nb_na6 = nb_na6(x),
nb_na7 = nb_na7(x),
nb_na8 = nb_na8(x)
)
}
bench(1e5)
bench(1e6)
*/

Write an Rcpp function to detect if a NumericMatrix has any NA values

Rcpp sugar can replicate the operation by combining is_na() and any(). is_na() will detect missing values and any() verifies a single value is TRUE. Note, to retrieve a boolean value, any() must be used with is_true().

#include<Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
bool contains_na(NumericMatrix M){
return is_true(any(is_na(M)));
}

Test case:

A = matrix(1:4, nrow = 2)
contains_na(A)
# [1] FALSE

M = matrix(c(1, 2, NA, 4), nrow = 2)
contains_na(M)
# [1] TRUE

NA values in Rcpp conditional

You might want to use the R C level function R_IsNA.

require(Rcpp)
require(inline)
z <- seq(from=1, to=10, by=0.1)
z[c(5, 10, 15, 20, 40, 50, 80)] <- NA

src <- '
Rcpp::NumericVector vecz(z);
for (int i=0; i< vecz.size(); i++) {
if (R_IsNA(vecz[i])) {
Rcpp::Rcout << "missing value at position " << i + 1 << std::endl;
}
}
'

func <- cxxfunction(signature(z="numeric"), src, plugin="Rcpp")

## results
func(z)

missing value at position 5
missing value at position 10
missing value at position 15
missing value at position 20
missing value at position 40
missing value at position 50
missing value at position 80
NULL

Checking Null and NA of a vector in Rcpp

Three suggestions:

  • Use Rcpp instead of R's C API.
  • Return early when r is NULL.
  • Create a LogicalVector before looping through the input vector.
#include <Rcpp.h>

// [[Rcpp::export]]
double foo(Rcpp::NumericVector y, Rcpp::Nullable<Rcpp::IntegerVector> r = R_NilValue) {
if (r.isNull())
return Rcpp::sum(y);

Rcpp::LogicalVector mask = Rcpp::is_na(r.as());
if (Rcpp::is_true(Rcpp::all(mask)))
return NA_REAL;

double output = 0.0;
int y_count = y.size();
for (int i = 0; i < y_count; ++i) {
if (!mask[i]) {
output += y[i];
}
}
return output;
}

/***R
foo(1:4, NULL) # <- This should return 10 = 1 + 2 + 3 + 4
foo(1:4, c(1, 1, 1, 1)) # <- This should return 10 = 1 + 2 + 3 + 4
foo(1:4, c(1, 1, NA, 1)) # <- This should return 7 = 1 + 2 + 4
foo(1:4, c(NA, NA, NA, NA)) # <- This should return NA
*/

Result:

> Rcpp::sourceCpp('60569482.cpp')

> foo(1:4, NULL) # <- This should return 10 = 1 + 2 + 3 + 4
[1] 10

> foo(1:4, c(1, 1, 1, 1)) # <- This should return 10 = 1 + 2 + 3 + 4
[1] 10

> foo(1:4, c(1, 1, NA, 1)) # <- This should return 7 = 1 + 2 + 4
[1] 7

> foo(1:4, c(NA, NA, NA, NA)) # <- This should return NA
[1] NA

Further suggestion:

  • Use the mask for sub-setting y.
#include <Rcpp.h>

// [[Rcpp::export]]
double foo(Rcpp::NumericVector y, Rcpp::Nullable<Rcpp::IntegerVector> r = R_NilValue) {
if (r.isNull())
return Rcpp::sum(y);

Rcpp::LogicalVector mask = Rcpp::is_na(r.as());
if (Rcpp::is_true(Rcpp::all(mask)))
return NA_REAL;

Rcpp::NumericVector tmp = y[!mask];
return Rcpp::sum(tmp);
}

is_NA() in Rcpp when combined with all()

some_bool == NA is always returning NA, in R or Rcpp, because you don't know what's behind the input, so you don't know the output.

Yet, R is clever enough to know that NA || TRUE is TRUE, for example.

Ignore missing values while finding min in Rcpp

What follows below is a slightly non-sensical answer (as it only works with vectors not containing NA) but it has all the component your code has, and it compiles.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector foo(int time_length, double BXadd,
NumericVector vn_complete, NumericVector vn1_complete) {

// Empty vectors
NumericVector BX (time_length);
vn_complete = na_omit(vn_complete);
vn1_complete = na_omit(vn1_complete);
for(int t = 0; t < time_length; t++) {
double a = vn_complete[t];
double b = vn1_complete[t];
BX[t] = BXadd * std::sqrt(std::min(a,b));
}
return BX;
}

// Edited version with new function added
// [[Rcpp::export]]
NumericVector foo2(double BXadd, NumericVector vn_complete,
NumericVector vn1_complete) {
return BXadd * sqrt(pmin(vn_complete, vn1_complete));
}

/*** R
foo(5, 2, 5:9, 1:5)
foo2(5, 2, 5:9, 1:5)
*/

For a real solution you will have to think harder about what the removal of NA is supposed to do as it will alter the lenth of your vectors too. So this still needs work.

Lastly, your whole function can be written in R as

2 * sqrt(pmin(5:9, 1:5))

and I think you could write that same expression using Rcpp sugar too as we have pmin() and sqrt():

// [[Rcpp::export]]
NumericVector foo2(double BXadd, NumericVector vn_complete,
NumericVector vn1_complete) {
return BXadd * sqrt(pmin(vn_complete, vn1_complete));
}

Check if element of list of matrix are na in rcpp

Using try-catch like in this example seems to work:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
bool checkNa(int i, List elemInCluster){

try {
return R_IsNA(elemInCluster[i]);
} catch(...) {
return false;
}
}


> checkNa(0, listToCheck)
[1] TRUE

> checkNa(1, listToCheck)
[1] FALSE

Check if there are missing arguments in enclosing call using C/Rcpp

You need if ( unevaluatedArg == R_MissingArg){ ... }. This slightly modified code works for me:

#include <Rcpp.h>
using namespace Rcpp ;

// [[Rcpp::export]]
void check_missing(SEXP e) {
Rcpp::Environment env(e);
Rcpp::CharacterVector names = env.ls(false);
SEXP unevaluatedArg;
for(int i=0; i<names.length(); i++){
std::string name = as<std::string>(names[i]);
Rcpp::Rcout << "argument - " << name << std::endl;
SEXP nameSym = Rf_install(name.c_str());
unevaluatedArg = Rf_findVarInFrame(e, nameSym);
if ( unevaluatedArg == R_MissingArg) {
Rcpp::Rcout << "missing" << std::endl;
}
}
}

Giving:

> f()
argument - x
missing
argument - y
missing

> f(1)
argument - x
argument - y
missing

The most efficient way to check if all values in a row are the same or are missing

Another update on this, Now I am able to reduce the time to under a second (in fact under 0.5 second) using InMemoryDatasets, thanks to this discussion in Julia lang discourse.

julia> using InMemoryDatasets
julia> df = Dataset(df)
julia> f(x,y) = ismissing(x) || ismissing(y) ? true : x == y
julia> byrow(df, isless, [:v1, :v2, :v3], with = byrow(df, coalesce, :), lt = f)

Update

with the new update from the discourse discussion I am able to almost half the previous insane record :D, i.e. 0.27 (Julia community really loves SPEED)

julia> byrow(df, issorted, [:v1, :v2, :v3], lt = !f)


Related Topics



Leave a reply



Submit