Passing a 'data.table' to C++ Functions Using 'Rcpp' and/or 'RcppArmadillo'

Passing a `data.table` to C++ functions using `Rcpp` and/or `RcppArmadillo`

Building on top of other answers, here is some example code:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double do_stuff_with_a_data_table(DataFrame df){
    CharacterVector x = df["x"];
    NumericVector y = df["y"];
    IntegerVector z = df["v"];

    /* do whatever with x, y, z */
    double res = sum(y);
    return res;
}

So, as Matthew says, this treats the data.table as a data.frame (a.k.a. an Rcpp::DataFrame in Rcpp).
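Assuming the C++ snippet above is saved in a file, say dt_example.cpp (a hypothetical file name), it can be compiled and the function exported to R with Rcpp::sourceCpp() before running the example below:

library(Rcpp)
sourceCpp("dt_example.cpp")  # compiles and exports do_stuff_with_a_data_table()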

require(data.table)
DT <- data.table(
  x = rep(c("a","b","c"), each = 3),
  y = c(1, 3, 6),
  v = 1:9)
do_stuff_with_a_data_table(DT)
# [1] 30

This completely ignores the internals of the data.table.

Rcpp function to construct a function

OK, as far as I understand, you want a function that returns a function with a closure, i.e. "the function defined in the closure 'remembers' the environment in which it was created."

In C++11 and later it is quite possible to define such a function, along the lines of:

std::function<double(double)> createax2Rcpp(int a) {
    auto ax2 = [a](double x) { return(double(a) * pow(x, 2)); };
    return ax2;
}

What happens is that an anonymous class with an overloaded operator() is created; an object of that class captures the value of a and is moved out of the creating function. The return value is then wrapped in an instance of std::function, with type erasure, etc.
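As a minimal, self-contained sketch in plain C++ (independent of R; the name createax2 and the sample values are only for illustration), the returned callable does indeed remember a:

#include <cmath>
#include <functional>
#include <iostream>

std::function<double(double)> createax2(int a) {
    // the lambda captures a by value, so the returned callable "remembers" it
    return [a](double x) { return double(a) * std::pow(x, 2); };
}

int main() {
    auto f = createax2(3);        // f remembers a = 3
    std::cout << f(2.0) << "\n";  // prints 12
    return 0;
}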

But! A C/C++ function registered with R must be of a certain type, which is narrower (as opposed to wider: you can capture a narrow object in a wide one, but not vice versa).

Thus, I don't know how to turn a std::function into a proper R function; it looks like it is impossible.

Perhaps emulating the closure as below might help (note that with a single static variable only one such "closure" can exist at a time):

static int __a;

double ax2(double x) {
    return(__a * pow(x, 2));
}

Rcpp::Function createax2Rcpp(int a) {
    __a = a;
    return (ax2);
}

Rcpp/RcppArmadillo C++/R balance for Performance

The RcppArmadillo code is running ~3x slower than native R. A hybrid RcppArmadillo & R variant of the code is running ~1.10 times faster than R.

That is simply not plausible; you must be copying a lot somewhere. Maybe it is time to profile your code?

Take any posted example: maybe from the Rcpp Gallery, which has multiple posts, or from one of the by now 250 (!!) CRAN packages using RcppArmadillo. Start with something simple, and time it.

You should see the ~50x factor for loops converted from R to C++. You should see something close to maybe 1.5 times faster for purely vectorized code (as R tends to do more error checking than code we write in small extensions).

Edit: Here is a trivial benchmark showing how easy it is to set one up. You really can (and should!!) time all portions you suspect are inefficient. We all make errors, both in design and in execution. The good thing is that you have both the tools and plenty of posted examples to guide you.

R> Rfunc <- function(N) { s <- 0; for (i in 1:N) for (j in 1:N) s <- s+1; s }
R> Rfunc(10)
[1] 100
R>
R> library(Rcpp)
R> cppFunction("double Cppfunc(int N) { double s=0; for (int i=0; i<N; i++) for (int j=0; j<N; j++) s++; return(s); }")
R> Cppfunc(10)
[1] 100
R>
R> library(rbenchmark)
R> N <- 1000
R> benchmark(Rfunc(N), Cppfunc(N), order="relative")[,1:4]
        test replications elapsed relative
2 Cppfunc(N)          100   0.073    1.000
1   Rfunc(N)          100  12.596  172.548
R>

Rewriting R's cummin() function using Rcpp and allowing for NAs

I would use this:

#include <Rcpp.h>
using namespace Rcpp;

template<typename T, int RTYPE>
Vector<RTYPE> cummin_cpp2(Vector<RTYPE> x, bool narm){

  Vector<RTYPE> res = clone(x);
  int i = 1, n = res.size();
  T na;

  if(narm){
    // Ignore NAs
    for(; i < n; i++){
      if(ISNAN(res[i]) || (res[i-1] < res[i])) res[i] = res[i-1];
    }
  } else{
    // Do not ignore NAs
    for(; i < n; i++){
      if(ISNAN(res[i-1])) {
        na = res[i-1];
        break;
      } else if(res[i-1] < res[i]){
        res[i] = res[i-1];
      }
    }
    // Fill the remainder with the NA/NaN value that was found
    for(; i < n; i++){
      res[i] = na;
    }
  }

  return res;
}

// [[Rcpp::export]]
SEXP cummin_cpp2(SEXP x, bool narm = false) {
  switch (TYPEOF(x)) {
    case INTSXP:  return cummin_cpp2<int,    INTSXP>(x, narm);
    case REALSXP: return cummin_cpp2<double, REALSXP>(x, narm);
    default: Rcpp::stop("SEXP Type Not Supported.");
  }
}

Try this on:

x <- c(NA, 7, 5, 4, NA, 2, 4)
x2 <- as.integer(x)

cummin_cpp(x, narm = TRUE)
x

cummin_cpp(x2)
x2

x <- c(NA, 7, 5, 4, NA, 2, 4)
x2 <- as.integer(x)
x3 <- replace(x, is.na(x), NaN)

cummin_cpp2(x, narm = TRUE)
x

cummin_cpp2(x2)
x2

cummin_cpp2(x3)
x3

Explanation:

  1. Joran's advice is good: just wrap this in an R function (see the sketch after this list).
  2. I use a dispatcher, as Joseph Wood suggested.
  3. Beware that x is passed by reference and is modified if it is of the same type as what you declared (see these 2 slides).
  4. You need to handle NA as well as NaN.
  5. You can use || instead of | so that the second condition is not evaluated when the first one is true (short-circuit evaluation).
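A minimal R-level wrapper along the lines of point 1, as a sketch (the name cummin_na and its na.rm argument are made up for illustration, not taken from the answer):

cummin_na <- function(x, na.rm = FALSE) {
  stopifnot(is.numeric(x))  # the C++ dispatcher handles only integer and double vectors
  cummin_cpp2(x, narm = na.rm)
}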

Returning a bunch of matrices using Rcpp in C++ in an efficient way using a list

In my other answer I looked at the efficiency of returning data and at simple optimizations. Here I want to look at something different: Optimization of the algorithm.

You want to compute hhat(i, t) for 0 <= i, t < n and i < t. Looking at your formula, hhat(i, t) = sum_{u=i..t} exp(g * zc.row(i).t()) * (zc.row(i).t() - S.col(u)) * dl(u), we see that the dependency of hhat on i and t is very different: hhat(i, t + 1) is simply hhat(i, t) plus the single u = t + 1 term. Right now your outer loop is over t and you are recomputing all these intermediate values. By switching the loop order it is easy to do each such computation only once, bringing the algorithm down to two nested loops. This means you have to generate the resulting matrices separately and fill them one column at a time. And since you cannot keep an arma::mat inside an Rcpp::List and update it in place, I need an additional std::vector for storage:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
Rcpp::List hello(const arma::rowvec& g,
                 const int& n,
                 const int& p,
                 const arma::mat& S,
                 const arma::mat& zc,
                 const arma::rowvec& dl){

  // one p x n matrix of zeros per time point
  std::vector<arma::mat> foo(n);
  for(int t = 0; t < n; ++t){
    arma::mat hhat(p, n);
    hhat.zeros();
    foo[t] = hhat;
  }

  // outer loop over i: keep a running sum h = hhat(i, t) and reuse it for t + 1
  for(int i = 0; i < n; ++i){
    arma::mat h = exp(arma::as_scalar(g*zc.row(i).t())) * (zc.row(i).t() - S.col(i))*dl(i);
    for(int t = i + 1; t < n; ++t){
      h += exp(arma::as_scalar(g*zc.row(i).t())) * (zc.row(i).t() - S.col(t))*dl(t);
      foo[t].col(i) = h;
    }
  }

  Rcpp::List ht(n);
  for(int t = 0; t < n; ++t){
    ht[t] = foo[t];
  }

  return(ht);
}

// [[Rcpp::export]]
Rcpp::List hello_orig(const arma::rowvec& g,
                      const int& n,
                      const int& p,
                      const arma::mat& S,
                      const arma::mat& zc,
                      const arma::rowvec& dl){
  Rcpp::List ht(n);

  for(int t = 0; t < n; ++t){

    arma::mat hhat(p, n);
    hhat.zeros();
    for(int i = 0; i < n; ++i){
      arma::mat h(p, 1);
      h.zeros();
      if (t > i){
        for(int u = i; u <= t; ++u){
          h += exp(arma::as_scalar(g*zc.row(i).t())) * (zc.row(i).t() - S.col(u))*dl(u);
        }
      }
      hhat.col(i) = h;
    }
    ht[t] = hhat;
  }

  return(ht);
}

/***R
g=c(1,2.1,3.1)
n=1600
p=3
S = matrix(rnorm(p*n),nrow=p,ncol=n)
dl=runif(n)
z=matrix(runif(p*n),nrow=n,ncol=p)
bench::mark(hello_orig(g=g,n=n,p=p,S = S,zc=z,dl = dl),
hello(g=g,n=n,p=p,S = S,zc=z,dl = dl))
*/

Result:

# A tibble: 2 x 13
  expression                                                  min  median `itr/sec` mem_alloc
  <bch:expr>                                               <bch:>  <bch:>     <dbl> <bch:byt>
1 hello_orig(g = g, n = n, p = p, S = S, zc = z, dl = dl)  14.2s   14.2s     0.0703    58.7MB
2 hello(g = g, n = n, p = p, S = S, zc = z, dl = dl)       53.9ms  85.9ms   11.1       58.7MB
# … with 8 more variables: `gc/sec` <dbl>, n_itr <int>, n_gc <dbl>, total_time <bch:tm>,
#   result <list>, memory <list>, time <list>, gc <list>

More than a factor of 100 faster!

You can get cleaner (and maybe even a bit faster) code by following @coatless' suggestion in the comments to use an arma::cube. The most compact form will give you a different return structure, though: instead of a list of p x n matrices you will get a p x n x n array:

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::cube coatless(const arma::rowvec& g,
                    const int& n,
                    const int& p,
                    const arma::mat& S,
                    const arma::mat& zc,
                    const arma::rowvec& dl){

  arma::cube ht(p, n, n);
  ht.zeros();

  for(int i = 0; i < n; ++i){
    arma::mat h = exp(arma::as_scalar(g*zc.row(i).t())) * (zc.row(i).t() - S.col(i))*dl(i);
    for(int t = i + 1; t < n; ++t){
      h += exp(arma::as_scalar(g*zc.row(i).t())) * (zc.row(i).t() - S.col(t))*dl(t);
      ht.slice(t).col(i) = h;
    }
  }

  return(ht);
}
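On the R side the cube arrives as a plain p x n x n array. If downstream code still expects the original list-of-matrices layout, a sketch of the conversion (variable names are illustrative):

ht_cube <- coatless(g = g, n = n, p = p, S = S, zc = z, dl = dl)
ht_list <- lapply(seq_len(n), function(t) ht_cube[, , t])  # slice t corresponds to ht[[t]]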

optim_lbfgs function in RcppNumerical and RcppEigen with RcppArmadillo

The rest of the error message that I am getting when compiling your code is of interest:

optim_num.cpp: In function ‘Rcpp::NumericVector aftsrr_bfgs(arma::vec, arma::vec, arma::mat, arma::vec)’:
optim_num.cpp:86:16: error: cannot declare variable ‘obj’ to be of abstract type ‘socreftn_mns’
socreftn_mns obj(TIME, DELTA, COVARI, TARGETVEC);
^~~
optim_num.cpp:11:7: note: because the following virtual functions are pure within ‘socreftn_mns’:
class socreftn_mns: public MFuncGrad
^~~~~~~~~~~~
In file included from /usr/local/lib/R/site-library/RcppNumerical/include/integration/wrapper.h:13:0,
from /usr/local/lib/R/site-library/RcppNumerical/include/RcppNumerical.h:16,
from optim_num.cpp:6:
/usr/local/lib/R/site-library/RcppNumerical/include/integration/../Func.h:52:20: note: virtual double Numer::MFuncGrad::f_grad(Numer::Constvec&, Numer::Refvec)
virtual double f_grad(Constvec& x, Refvec grad) = 0;
^~~~~~

In essence this means that your class socreftn_mns is derived from the abstract class MFuncGrad. That class is abstract since it does not contain a definition for the method f_grad(Constvec& x, Refvec grad). You try to provide one by defining the method f_grad(arma::vec& b_s, arma::vec grad), but due to the different function signature the pure virtual function is not overridden. Hence your class is also abstract.

If you use the same signature, things should work out. The required types are defined in terms of Eigen objects:

// Reference to a vector
typedef Eigen::Ref<Eigen::VectorXd> Refvec;
typedef const Eigen::Ref<const Eigen::VectorXd> Constvec;

So you will have to convert back and forth between Armadillo and Eigen constructs.
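As a minimal sketch of what an override with the required signature could look like (this is not the class from the question: the stand-in objective 0.5 * ||b - target||^2, the constructor, and the member target_ are made up purely to illustrate the Eigen/Armadillo conversions):

// [[Rcpp::depends(RcppArmadillo, RcppEigen, RcppNumerical)]]
#include <RcppArmadillo.h>
#include <RcppNumerical.h>

class socreftn_mns : public Numer::MFuncGrad {
public:
  socreftn_mns(const arma::vec& target) : target_(target) {}

  // Same signature as the pure virtual in MFuncGrad, so the class is no longer abstract
  double f_grad(Numer::Constvec& x, Numer::Refvec grad) {
    arma::vec b(x.data(), x.size());    // Eigen -> Armadillo (copies the data)
    arma::vec g_arma = b - target_;     // gradient of the stand-in objective
    grad = Eigen::Map<const Eigen::VectorXd>(g_arma.memptr(), g_arma.n_elem);  // Armadillo -> Eigen
    return 0.5 * arma::dot(g_arma, g_arma);
  }

private:
  arma::vec target_;
};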


