Calling an R Function Using Inline and Rcpp Is Still Just as Slow as Original R Code

Rcpp Function slower than Rf_eval

Rcpp is great because it makes things look absurdly clean to the programmer. The cleanliness has a cost in the form of templated responses and a set of assumptions that weigh down the execution time. But, such is the case with a generalized vs. specific code setup.

Take for instance the call route for an Rcpp::Function. The initial construction and then outside call to a modified version of Rf_reval requires a special Rcpp specific eval function given in Rcpp_eval.h. In turn, this function is wrapped in protections to protect against a function error when calling into R via a Shield associated with it. And so on...

In comparison, Rf_eval has neither. If it fails, you will be up the creek without a paddle. (Unless, of course, you implement error catching via R_tryEval for it.)

With this being said, the best way to speed up the calculation is to simply write everything necessary for the computation in C++.

calling a user-defined R function from C++ using Rcpp

You declare that the function should return an int, but use wrap which indicates the object returned should be a SEXP. Moreover, calling an R function from Rcpp (through Function) also returns a SEXP.

You want something like:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
SEXP mySuminC(){
  Environment myEnv = Environment::global_env();
  Function mySum = myEnv["mySum"];
  int x = myEnv["x"];
  int y = myEnv["y"];
  return mySum(Rcpp::Named("x", x), Rcpp::Named("y", y));
}

(or, leave function return as int and use as<int> in place of wrap).

That said, this is kind of non-idiomatic Rcpp code. Remember that calling R functions from C++ is still going to be slow.

Can I speedup my R code with Rcpp?

The answer by 李哲源 correctly identifies what should be done in your case. As for your original question the answer is two-fold: Yes, you can move the loop to C++ with Rcpp. No, you won't gain performance:

#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::NumericMatrix fillMatrix(Rcpp::NumericMatrix X,
                   Rcpp::NumericVector y,
                   Rcpp::NumericVector a,
                   Rcpp::Function f) {
  Rcpp::NumericMatrix result = Rcpp::no_init(X.cols(), a.length());
  for (int i = 0; i < a.length(); ++i) {
    result(Rcpp::_, i) = Rcpp::as<Rcpp::NumericVector>(f(X, y, a[i]));
  }
  return result;
}

/*** R
f <- function(X,y,a){
  p = ncol(X)
  res = (crossprod(X) + a*diag(1,p))%*%crossprod(X,y)
  }

X <- matrix(rnorm(500*50),500,50)
y <- rnorm(500)
a <- seq(0,1,0.01)

system.time(fillMatrix(X, y, a, f))
#   user  system elapsed 
#  0.052   0.077   0.075 
system.time({result <- matrix(NA,ncol(X),length(a))

for(i in 1:length(a)){
  result[,i] <- f(X,y,a[i])
  }
})
#   user  system elapsed 
#  0.060   0.037   0.049 
*/

So the Rcpp solution is actually slower than the R solution in this case. Why? Because the real work is done within the function f. This is the same for both solutions, but the Rcpp solution has the additional overhead of calling back to R from C++. Note that for loops in R are not necessarily slow. BTW, here some benchmark data:

          expr      min       lq     mean   median       uq      max neval
 fillMatrixR() 41.22305 41.86880 46.16806 45.20537 49.11250 65.03886   100
 fillMatrixC() 44.57131 44.90617 49.76092 50.99102 52.89444 66.82156   100

Calling R function from C++, using Rcpp

Because you are calling sample() from R, both integer and numeric works as they do in R itself:

R> set.seed(42); sample(seq(1L, 5L), 5, replace=TRUE)
[1] 5 5 2 5 4
R> set.seed(42); sample(seq(1.0, 5.0), 5, replace=TRUE)
[1] 5 5 2 5 4
R>

Speeding up Rcpp `anyNA` equivalent

This is an interesting question, but the answer is pretty simple: there are two versions of STRING_ELT one used internally by R or if you set the USE_RINTERNALS macro in Rinlinedfuns.h and one for plebs in memory.c.

Comparing the two versions, you can see that the pleb version has more checks, which fully accounts for the difference in speed.

If you really want speed and don't care about safety, you can usually beat R by at least a little bit.

// [[Rcpp::export(rng = false)]]
bool any_na_unsafe(SEXP x) {
  SEXP* ptr = STRING_PTR(x);
  R_xlen_t n = Rf_xlength(x);
  for(R_xlen_t i=0; i<n; ++i) {
    if(ptr[i] == NA_STRING) return true;
  }
  return false;
}

Bench:

> microbenchmark(
+   R = anyNA(vec),
+   R_C_api = any_na2(vec),
+   unsafe = any_na_unsafe(vec),
+   unit = "ms"
+ )
Unit: milliseconds
    expr    min      lq     mean  median      uq     max neval
       R 0.5058 0.52830 0.553696 0.54000 0.55465  0.7758   100
 R_C_api 1.9990 2.05170 2.214136 2.06695 2.10220 12.2183   100
  unsafe 0.3170 0.33135 0.369585 0.35270 0.37730  1.2856   100

Although as written this is unsafe, if you add a few checks before the loop in the beginning it'd be fine.

Speed of native R function vs C++ equivalent

Look at the code of gamma() (for example by typing gamma<Return>): it just calls a primitive.

Your Rcpp function is set up to be convenient. All it took was a two-liner. But it has some overhead is saving the state of the random-number generator, setting up the try/catch block for exception handling and more. We have documented how you can skip some of these steps, but in many cases that is ill-advised.

In a nutshell, my view is that you have a poorly chosen comparison: too little happens in the functions you benchmark. Also note the unit: nanoseconds. You have a lot of measurement error here.

But the morale is that with a 'naive' (and convenient) C++ function as the one you wrote as a one-liner, you will not beat somewhat optimised and tuned and already compiled code from inside R. That is actually a good thing because if you did, you would have to now rewrite large chunks of R .

Edit: For kicks, here is a third variant 'in the middle' where we use Rcpp to call the same C API of R.

Code

#include <Rcpp.h>

// [[Rcpp::export]]
double gammaStd(double x) { return (std::tgamma(x)); }

// [[Rcpp::export]]
Rcpp::NumericVector gamma_R(Rcpp::NumericVector x) { return (Rcpp::gamma(x)); }

/*** R
set.seed(123)
x <- max(0, rnorm(1, 50, 25))
res <- microbenchmark::microbenchmark(R = gamma(x), Cpp = gammaStd(x), Sugar = gamma_R(x) )
res
*/

Output

> Rcpp::sourceCpp("~/git/stackoverflow/72383007/answer.cpp")
> set.seed(123)
> x <- max(0, rnorm(1, 50, 25))
> res <- microbenchmark::microbenchmark(R = gamma(x), Cpp = gammaStd(x), Sugar = gamma_R(x) )
> 
res
Unit: nanoseconds
  expr  min     lq     mean median     uq     max neval cld
     R  102  112.0   136.95    124  134.5    1068   100   a
   Cpp 1111 1155.5 11930.02   1186 1260.0 1054813   100   a
 Sugar 1142 1201.0  6355.92   1246 1301.5  506628   100   a
>

Performance drop when including code exported from other Rcpp package

As mentioned by others, allowing other packages to call your C++ code from C++ requires the use of header files in inst/include/. Rcpp::interfaces allows you to automate the creation of such files. However, as I demonstrate below, creating your own headers manually can result in faster execution time. I believe it is because relying on Rcpp::interfaces to create your headers for you may result in more complicated header code.

Before I go further and demonstrate a "simpler" approach that results in faster execution time, I need to note that while this works for me (and I have used the approach I will demonstrate below several times without issue), the more "complex" approach taken by Rcpp::interfaces is used in part to comport with statements in Section 5.4.3. of the Writing R Extensions manual. (Specifically, the bits having to do with R_GetCCallable you'll see below). So, improve your execution time with the code I offer below at your own peril.^1,2

A simple header to share the code for col_sums might look like this:

#ifndef RCPP_package3
#define RCPP_package3

#include <RcppArmadillo.h>

namespace package3 {
    inline arma::vec col_sums(const arma::mat& test){
      return arma::sum(test,0).t();
    }
}

#endif

However, the header created by Rcpp::interfaces looks like this:

// Generated by using Rcpp::compileAttributes() -> do not edit by hand
// Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393

#ifndef RCPP_package1_RCPPEXPORTS_H_GEN_
#define RCPP_package1_RCPPEXPORTS_H_GEN_

#include <RcppArmadillo.h>
#include <Rcpp.h>

namespace package1 {

    using namespace Rcpp;

    namespace {
        void validateSignature(const char* sig) {
            Rcpp::Function require = Rcpp::Environment::base_env()["require"];
            require("package1", Rcpp::Named("quietly") = true);
            typedef int(*Ptr_validate)(const char*);
            static Ptr_validate p_validate = (Ptr_validate)
                R_GetCCallable("package1", "_package1_RcppExport_validate");
            if (!p_validate(sig)) {
                throw Rcpp::function_not_exported(
                    "C++ function with signature '" + std::string(sig) + "' not found in package1");
            }
        }
    }

    inline arma::vec col_sums(const arma::mat& matty) {
        typedef SEXP(*Ptr_col_sums)(SEXP);
        static Ptr_col_sums p_col_sums = NULL;
        if (p_col_sums == NULL) {
            validateSignature("arma::vec(*col_sums)(const arma::mat&)");
            p_col_sums = (Ptr_col_sums)R_GetCCallable("package1", "_package1_col_sums");
        }
        RObject rcpp_result_gen;
        {
            RNGScope RCPP_rngScope_gen;
            rcpp_result_gen = p_col_sums(Shield<SEXP>(Rcpp::wrap(matty)));
        }
        if (rcpp_result_gen.inherits("interrupted-error"))
            throw Rcpp::internal::InterruptedException();
        if (Rcpp::internal::isLongjumpSentinel(rcpp_result_gen))
            throw Rcpp::LongjumpException(rcpp_result_gen);
        if (rcpp_result_gen.inherits("try-error"))
            throw Rcpp::exception(Rcpp::as<std::string>(rcpp_result_gen).c_str());
        return Rcpp::as<arma::vec >(rcpp_result_gen);
    }

}

#endif // RCPP_package1_RCPPEXPORTS_H_GEN_

So, I created two additional packages via

library(RcppArmadillo)
RcppArmadillo.package.skeleton(name = "package3", example_code = FALSE)
RcppArmadillo.package.skeleton(name = "package4", example_code = FALSE)

Then in package3/inst/include, I added package3.h containing the "simple header" code above (I also added a throwaway "Hello World" cpp file in src/). In package4/src/ I added the following:

#include <package3.h>

// [[Rcpp::export]]
arma::vec col_sums(const arma::mat& test){
  return arma::sum(test,0).t();
}

// [[Rcpp::export]]
arma::vec simple_header_import(const arma::mat& test){
  return package3::col_sums(test);
}

as well as adding package3 to LinkingTo in the DESCRIPTION file.

Then, after installing the new packages, I benchmarked all the functions against each other:

library(rbenchmark)

set.seed(1)
nr <- 100
p <- 800
testmat <- matrix(rnorm(nr * p), ncol = p)

benchmark(original = package1::col_sums(testmat),
          first_copy = package2::col_sums(testmat),
          complicated_import = package2::col_sums_imported(testmat),
          second_copy = package4::col_sums(testmat),
          simple_import = package4::simple_header_import(testmat),
          replications = 1e3,
          columns = c("test", "relative", "elapsed", "user.self", "sys.self"),
          order = "relative")


                test relative elapsed user.self sys.self
2         first_copy    1.000   0.174     0.174    0.000
4        second_copy    1.000   0.174     0.173    0.000
5      simple_import    1.000   0.174     0.174    0.000
1           original    1.126   0.196     0.197    0.000
3 complicated_import    6.690   1.164     0.544    0.613

While the more "complicated" header function was 6x slower, the "simpler" one was not.

_{1. However, the automated code generated by Rcpp::interfaces does in fact include some features that may be superfluous for you beside the R_GetCCallable issue, though they may be desirable and in some other contexts necessary.}

_{2. Registering functions is always portable, and package authors are instructed to do so by the Writing R Extensions manual, but for internal/organizational/etc. use I believe the approach featured here should work if all packages involved are built from source. See this section of Hadley Wickham's R Packages for some discussion as well as the section of the Writing R Extensions manual linked above.}

Comparing RcppArmadillo and R running speed for non-contiguous submatrix selection

There are a couple of things worth addressing here. First, you made a subtle mistake in the design of your benchmark that had a significant impact on the performance on your Armadillo function, subselect. Observe:

set.seed(432)
rows <- rbinom(1000, 1, 0.1)
cols <- rbinom(1000, 1, 0.1)

imat <- matrix(1:1e6, 1000, 1000)
nmat <- imat + 0.0

storage.mode(imat)
# [1] "integer"

storage.mode(nmat)
# [1] "double"

microbenchmark(
    "imat" = subselect(imat, rows, cols),
    "nmat" = subselect(nmat, rows, cols)
)
# Unit: microseconds
#  expr      min       lq      mean    median        uq       max neval
#  imat 3088.140 3218.013 4355.2956 3404.4685 4585.1095 21662.540   100
#  nmat  139.298  167.116  223.2271  209.4585  238.6875   533.035   100

Although R often treats integer literals (e.g. 1, 2, 3, ...) as floating point values, the sequence operator : is one of the few exceptions to this,

storage.mode(c(1, 2, 3, 4, 5))
# [1] "double"

storage.mode(1:5)
# [1] "integer"

which is why the expression matrix(1:1e6, 1000, 1000) returns an integer matrix, not a numeric matrix. This is problematic because subselect is expecting a NumericMatrix, not an IntegerMatrix, and passing it the latter type triggers a deep copy, hence the difference of more than an order of magnitude in the above benchmark.

Second, there is a noticeable difference between the relative performance of the R function submat and the C++ function subselect over the distribution of your binary indexing vectors, which is presumably due to a difference in underlying algorithms. For more sparse indexing (a greater proportion of 0s than 1s), the R function wins out; and for more dense indexing, the opposite is true. This also seems to be a function of matrix size (or possibly just dimension), as shown in the plots below, where row and column index vectors are generated using rbinom with success parameters 0.0, 0.05, 0.10, ..., 0.95, 1.0 -- first with a 1e3 x 1e3 matrix, and then with a 1e3 x 1e4 matrix. The code for this is included at the end.

Sample Image

Benchmark code:

library(data.table)
library(microbenchmark)
library(ggplot2)

test_data <- function(nr, nc, p, seed = 123) {
    set.seed(seed)
    list(
        x = matrix(rnorm(nr * nc), nr, nc),
        rv = rbinom(nr, 1, p),
        cv = rbinom(nc, 1, p)
    )
}

tests <- lapply(seq(0, 1, 0.05), function(p) {
    lst <- test_data(1e3, 1e3, p)
    list(
        p = p,
        benchmark = microbenchmark::microbenchmark(
            R = submat(lst[[1]], lst[[2]], lst[[3]]),
            Arma = subselect(lst[[1]], lst[[2]], lst[[3]])
        )
    )
})

gt <- rbindlist(
    Map(function(g) {
        data.table(g[[2]])[
            ,.(Median.us = median(time / 1000)), 
            by = .(Expr = expr)
        ][order(Median.us)][
            ,Relative := Median.us / min(Median.us)
        ][,pSuccess := sprintf("%3.2f", g[[1]])]
    }, tests)
)

ggplot(gt) +
    geom_point(
        aes(
            x = pSuccess, 
            y = Relative, 
            color = Expr
        ),
        size = 2,
        alpha = 0.75
    ) +
    theme_bw() +
    ggtitle("1e3 x 1e3 Matrix")

## change `test_data(1e3, 1e3, p)` to
## `test_data(1e3, 1e4, p)` inside of 
## `tests <- lapply(...) ...` to generate 
## the second plot