Applying R's optim Function from C++ with Rcpp

Calling R's optim function from within C++ using Rcpp

In this case, using a raw function pointer is problematic. Rcpp provides a special wrapper, Rcpp::InternalFunction, for C++ functions that are not exported to R. With it, you can easily pass your C++ function to R's optim directly from C++!

Also, I modified the element indices in the fr function. You were hitting an out-of-bounds error because C++ indexing is zero-based, not one-based like R (e.g., x(0) in C++ corresponds to x[1] in R).

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// Rosenbrock function; note the zero-based indexing.
double fr(arma::vec x){
  double result = 100 * (x(1) - x(0) * x(0)) * (x(1) - x(0) * x(0)) +
                  (1 - x(0)) * (1 - x(0));
  return result;
}

// [[Rcpp::export]]
arma::vec optim_rcpp(const arma::vec& init_val){

  Rcpp::Environment stats("package:stats");
  Rcpp::Function optim = stats["optim"];

  Rcpp::List opt_results = optim(Rcpp::_["par"]    = init_val,
                                 Rcpp::_["fn"]     = Rcpp::InternalFunction(&fr),
                                 Rcpp::_["method"] = "BFGS");

  // Extract and coerce the parameter estimates ("par" is the first list element).
  arma::vec out = Rcpp::as<arma::vec>(opt_results[0]);

  return out;
}
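
For completeness, a quick usage sketch from R; the file name optim_rcpp.cpp is just an assumption about where the code above is saved:

Rcpp::sourceCpp("optim_rcpp.cpp")
optim_rcpp(c(-1.2, 1))
#> Converges to (approximately) c(1, 1), the minimum of the Rosenbrock function fr().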

Optimizing R objective function with Rcpp slower, why?

In general, if you are able to use vectorized functions, you will find them to be (almost) as fast as running your code directly in Rcpp. This is because many vectorized functions in R (almost all vectorized functions in base R) are written in C, C++, or Fortran, and as such there is often little to gain.
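
A quick toy illustration of that point (my example, not from the original answer): summing a large vector with the compiled, vectorized sum() versus an explicit R loop.

x <- runif(1e6)
microbenchmark::microbenchmark(
  vectorized = sum(x),                               # dispatches straight to C
  r_loop     = {s <- 0; for (xi in x) s <- s + xi},  # interpreted R loop
  times = 10
)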

That said, there are improvements to be gained in both your R and Rcpp code. Optimization comes from carefully studying the code and removing unnecessary steps (memory allocations, sums, etc.).

Let's start with the Rcpp code optimization.

In your case the main optimization is to remove unnecessary matrix and vector calculations. The code is, in essence:

  1. Shift beta
  2. Calculate the log of the sum of exp(shifted beta) [log-sum-exp]
  3. Use Obs as an index into the shifted beta and sum over all the probabilities
  4. Subtract the log-sum-exp

Using this observation, we can reduce your code to two for-loops. Note that sum is simply another for-loop (more or less: for(int i = 0; i < max; i++){ sum += x[i]; }), so avoiding the sums can speed up one's code further (in most situations this is unnecessary optimization!). In addition, your input Obs is an integer vector, and we can optimize the code further by using the IntegerVector type to avoid casting the double elements to integer values (credit to Ralf Stubner's answer).

cppFunction('double llmnl_int_C_v2(NumericVector beta, IntegerVector Obs, int n_cat)
{
    int n_Obs = Obs.size();

    // 1: Shift beta (betas[0] stays 0):
    NumericVector betas(beta.size() + 1);
    for (int i = 1; i < n_cat; i++) {
        betas[i] = beta[i - 1];
    }

    // 2: Calculate the log-sum-exp only once:
    double expBetas_log_sum = log(sum(exp(betas)));

    // Pre-allocate the sum:
    double ll_sum = 0;

    // 3: Use n_Obs to avoid calling Obs.size() every iteration:
    for (int i = 0; i < n_Obs; i++) {
        ll_sum += betas[Obs[i] - 1];
    }

    // 4: Use the fact that the denominator is the same for all i:
    ll_sum = ll_sum - expBetas_log_sum * n_Obs;
    return ll_sum;
}')

Note that I have removed quite a few memory allocations and unnecessary calculations in the for-loop. I have also used the fact that the denominator is the same for all iterations and simply multiplied it by n_Obs for the final result.
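
To spell out that last point (my restatement, not part of the original answer): the log-likelihood factors as

$$\ell(\beta) = \sum_{i=1}^{n_{\mathrm{Obs}}} \Bigl(\beta_{\mathrm{Obs}_i} - \log \sum_{j=1}^{n_{\mathrm{cat}}} e^{\beta_j}\Bigr) = \sum_{i=1}^{n_{\mathrm{Obs}}} \beta_{\mathrm{Obs}_i} - n_{\mathrm{Obs}} \log \sum_{j=1}^{n_{\mathrm{cat}}} e^{\beta_j},$$

so the log-sum-exp denominator is identical for every observation and only needs to be computed (and multiplied by n_Obs) once.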

We can perform similar optimizations in your R code, which results in the function below:

llmnl_int_R_v2 <- function(beta, Obs, n_cat) {
  n_Obs <- length(Obs)
  betas <- c(0, beta)
  # note: denom = log(sum(exp(betas)))
  sum(betas[Obs]) - log(sum(exp(betas))) * n_Obs
}

Note that the complexity of the function has been drastically reduced, making it simpler for others to read. Just to be sure that I haven't messed up somewhere, let's check that they all return the same results:

set.seed(2020)
mnl_sample <- t(rmultinom(n = 1000, size = 1, prob = c(0.3, 0.4, 0.2, 0.1)))
mnl_sample <- apply(mnl_sample, 1, function(r) which(r == 1))

beta <- c(4, 2, 1)
Obs <- mnl_sample
n_cat <- 4
xr <- llmnl_int(beta = beta, Obs = mnl_sample, n_cat = n_cat)
xr2 <- llmnl_int_R_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat)
xc <- llmnl_int_C(beta = beta, Obs = mnl_sample, n_cat = n_cat)
xc2 <- llmnl_int_C_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat)
all.equal(c(xr, xr2), c(xc, xc2))
#> [1] TRUE

Well, that's a relief.

Performance:

I'll use microbenchmark to illustrate the performance. The optimized functions are fast, so I'll run each function 1e5 times to reduce the effect of the garbage collector.

microbenchmark("llmml_int_R" = llmnl_int(beta = beta, Obs = mnl_sample, n_cat = n_cat),
"llmml_int_C" = llmnl_int_C(beta = beta, Obs = mnl_sample, n_cat = n_cat),
"llmnl_int_R_v2" = llmnl_int_R_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat),
"llmml_int_C_v2" = llmnl_int_C_v2(beta = beta, Obs = mnl_sample, n_cat = n_cat),
times = 1e5)
#Output:
#Unit: microseconds
# expr min lq mean median uq max neval
# llmml_int_R 202.701 206.801 288.219673 227.601 334.301 57368.902 1e+05
# llmml_int_C 250.101 252.802 342.190342 272.001 399.251 112459.601 1e+05
# llmnl_int_R_v2 4.800 5.601 8.930027 6.401 9.702 5232.001 1e+05
# llmml_int_C_v2 5.100 5.801 8.834646 6.700 10.101 7154.901 1e+05

Here we see the same result as before. The new functions are roughly 35x faster (R) and 40x faster (Cpp) compared to their original counterparts. Interestingly enough, the optimized R function is still very slightly (0.3 microseconds, or about 4 %) faster than my optimized Cpp function. My best guess is that there is some overhead from the Rcpp package, and if that were removed the two would be identical (or the R version slightly slower).

Similarly, we can check the performance using optim:

microbenchmark("llmnl_int" = optim(beta, llmnl_int, Obs = mnl_sample, 
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
"llmnl_int_C" = optim(beta, llmnl_int_C, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
"llmnl_int_R_v2" = optim(beta, llmnl_int_R_v2, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
"llmnl_int_C_v2" = optim(beta, llmnl_int_C_v2, Obs = mnl_sample,
n_cat = n_cat, method = "BFGS", hessian = F,
control = list(fnscale = -1)),
times = 1e3)
#Output:
#Unit: microseconds
# expr min lq mean median uq max neval
# llmnl_int 29541.301 53156.801 70304.446 76753.851 83528.101 196415.5 1000
# llmnl_int_C 36879.501 59981.901 83134.218 92419.551 100208.451 190099.1 1000
# llmnl_int_R_v2 667.802 1253.452 1962.875 1585.101 1984.151 22718.3 1000
# llmnl_int_C_v2 704.401 1248.200 1983.247 1671.151 2033.401 11540.3 1000

Once again the result is the same.

Conclusion:

As a short conclusion, it is worth noting that this is one example where converting your code to Rcpp is not really worth the trouble. This is not always the case, but it is often worth taking a second look at your function to see whether there are areas where unnecessary calculations are performed. Especially in situations where one uses built-in vectorized functions, it is often not worth the time to convert the code to Rcpp. Greater improvements are usually seen when converting for-loops containing code that can't easily be vectorized.

Calling Rcpp functions from optim

There is a difference in your objective functions:

ll_cpp <- cpp_posterior_density(THETAi = 2 * as.vector(THETA0_LF[i, ]),
                                Yi = as.vector(Y[i, ]),
                                MUi = as.vector(MU_LF[i, ]),
                                invS = invS, TAU = TAU, LAMBDA = LAMBDA, J = J, K = K)

ll_R <- lf_posterior_density(THETAi = 2 * as.vector(THETA0_LF[i, ]),
                             Yi = as.vector(Y[i, ]),
                             MUi = as.vector(MU_LF[i, ]),
                             invS = invS, TAU = TAU, LAMBDA = LAMBDA, J = J, K = K)

print(paste0(c("R: log posterior: ", ll_R)))
#> [1] "R: log posterior: " "22.495400131601"
print(paste0(c("cpp: log posterior: ", ll_cpp)))
#> [1] "cpp: log posterior: " "16.7463952181814"

I have not debugged your source code to find the error.

In such cases it is useful to add REPORT = 1 to the control list (for "BFGS", REPORT only takes effect when control$trace is positive).
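
A minimal, self-contained sketch of how to set this (the quadratic objective f here is just a placeholder for the question's lf_posterior_density):

f <- function(x) sum((x - 1)^2)
optim(par = c(0, 0), fn = f, method = "BFGS",
      control = list(trace = 1, REPORT = 1))

For the R objective from the question, this gives: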

initial  value 45.707620 
iter 2 value 28.881100
iter 3 value 22.426070
iter 4 value 20.145499
iter 5 value 19.922129
iter 6 value 19.805083
iter 7 value 19.684769
iter 8 value 19.684366
iter 9 value 19.684345
iter 10 value 19.684343
iter 10 value 19.684343
final value 19.684343
converged

For Rcpp:

initial  value 45.707620 
iter 2 value 23.059207
iter 3 value -33.279972
iter 4 value -77.878965
iter 4 value -77.878965
iter 5 value -93.872445
iter 5 value -93.872445
iter 6 value -2830.594586
iter 6 value -2830.594586
iter 6 value -2830.594586
final value -2830.594586
converged

How calling stats::optim from Rcpp causes memory leak?

Using macOS + the Instruments leak detector, it seems like Rcpp::InternalFunction is leaking:

[Screenshot: Instruments leak detector]

And this points us at the code here:

https://github.com/RcppCore/Rcpp/blob/97222bb753c4cd76a4cef7d33c2bfc2e305d1158/inst/include/Rcpp/generated/InternalFunction__ctors.h#L35-L38

Note that we're allocating an object with new, but we're also telling XPtr not to register a delete finalizer:

https://github.com/RcppCore/Rcpp/blob/97222bb753c4cd76a4cef7d33c2bfc2e305d1158/inst/include/Rcpp/XPtr.h#L88-L106

It's not immediately clear to me why we do this, but I think it explains the behavior. It could be worth filing an issue.
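
In the meantime, one possible workaround (my suggestion, not an official fix) is to avoid Rcpp::InternalFunction altogether: export the objective with Rcpp and drive stats::optim from the R side, so no unfinalized XPtr is ever created. A sketch reusing the Rosenbrock function from the first answer (fr_exported is a hypothetical name):

Rcpp::cppFunction(depends = "RcppArmadillo", code = '
double fr_exported(const arma::vec& x) {
    // Rosenbrock function, same as fr() above.
    return 100 * (x(1) - x(0) * x(0)) * (x(1) - x(0) * x(0)) +
           (1 - x(0)) * (1 - x(0));
}')

# optim runs in R here, so no Rcpp::InternalFunction (and no leaked XPtr) is involved.
optim(par = c(-1.2, 1), fn = fr_exported, method = "BFGS")$par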

Calling the agrep .Internal C function from Rcpp

Great question. The long and short of it is "you can't" (in many cases), unless the function is visible in one of the header files in src/include/. At least not that easily.

Not long ago I had a similar fun challenge, where I tried to get access to the do_docall function (called by do.call), and it is not a simple task. First of all, it is not directly possible to just #include <agrep.c> (or something similar). That file simply isn't available for inclusion, as it is not part of src/include; it is compiled, and the uncompiled file is removed (not to mention that one should never #include a .c file).

If one is willing to go the extra mile, the next step is "copying" and "altering" the source code: basically, find the function in src/main/agrep.c, copy it into your package, and then fix any errors you find.

Problems with this approach:

  1. As documented in R-exts, the internal structure sexprec_info is not made public (this is the base structure for all objects in R). Many internal functions use the fields within this structure, so one has to "copy" the structure into your source code to make it visible to your code specifically.
  2. If you ever #include <Rcpp.h> prior to this file, you will need to go through each and every call to internal functions and likely add the R_ or Rf_ prefix.
  3. The function may contain calls to other "internal" functions, which in turn need to be copied and altered for it to work.
  4. You will also need to get a clear understanding of what CDR, CAR, and similar macros do. The internal functions have a documented structure, where the first argument contains the full call passed to the function, and macros like those two are used to access parts of the call.
    I did myself a solid and rewrote do_docall, changing the input format to avoid having to consider this. But this takes time. The alternative is to create a pairlist according to the documentation, set its type to a call SEXP (the exact name escapes me at the moment), and pass the appropriate arguments for op, args, and env.
  5. And lastly, if you go through the steps and find that it is necessary to copy the internal structures of sexprec_info (as described in point 1), then you will need to be very careful about when you include Rinternals.h and Rcpp.h, as either of them will cause your code to crash and burn in the most beautiful and silent way if you include your header and these in the wrong order! Note that this even goes for [[Rcpp::export]], which may indeed turn out to include them in the wrong order!

If you are willing to go this far down the rabbit hole, I would suggest carefully reading the "R's C interface" chapter of Advanced R and Chapters 2, 5, and 6 of R-exts, and maybe even the R-internals manual. Once that is done, take a look at do_docall from src/main/coerce.c and compare it to the implementation in my repository cmdline.arguments/src/utils/{cmd_coerce.h, cmd_coerce.c}. In this version I have:

  1. Added all the internal structures that are not public, so that I can access them in their unmodified form (unmodified by the current session).
    • This includes the table used to store the currently used SEXPs, which was used as a lookup. This caused a problem, as I can't access the modified version, so my code is slightly altered, with the old code fenced off by the macro #if --- defined(CMDLINE_ARGUMENTS_MAYBE_IN_THE_FUTURE). Luckily the code causing the problem had a static answer, so I could work around it (but this might not always be the case).
  2. Added quite a few Rf_ prefixes, as the macro versions are not available (since I #include <Rcpp.h> at some point).
  3. Split the code into smaller functions to make it more readable (for my own sake).
  4. Gave the function one additional argument (name), which is not used in the internal function, along with some added error checks (for my specific needs).

This implementation will stay frozen "for all time to come", as I've moved on to another branch (it is kept for my own future benefit, should I ever want to walk down this path again).

I spent a few days scouring the internet for information on this and found two different posts talking about how this could be achieved; my approach basically copies theirs. Whether this is actually allowed in a CRAN package is a whole other question (and not one that I will be testing out).

The same approach applies if you want to use non-public code from other packages, although there it is often as simple as copy-pasting their files into your repository.

As a final side note, you mention that the intent is to "speed up" your code when you have to perform millions upon millions of calls to agrep. This seems like a case where one should consider performing the task in parallel. Even after going through the steps outlined above, creating N parallel sessions to take care of K evaluations each (say 100,000) would be the first step to reducing computing time. Of course, each session should be given a whole batch and not a single call to agrep, as sketched below.
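
A minimal sketch of that batching idea using the parallel package (the data and all object names here are hypothetical placeholders, not from the question):

library(parallel)

# Hypothetical workload: many fuzzy-matching patterns against a fixed target vector.
targets  <- rep(c("apple", "appel", "banana", "bananna"), 2500)
patterns <- rep(c("aple", "banan"), 500)  # stand-in for millions of patterns

# Split the calls into batches of K evaluations each (K = 250 here; for a real
# workload something like 100,000 per batch, as suggested above).
K <- 250
batches <- split(patterns, (seq_along(patterns) - 1L) %/% K)

n_workers <- max(1L, detectCores() - 1L)

# Each worker processes a whole batch of agrep() calls, not a single call.
res <- mclapply(batches, function(batch) {
  lapply(batch, function(p) agrep(p, targets, max.distance = 0.1))
}, mc.cores = n_workers)

Note that mclapply() forks and is therefore unavailable on Windows; parLapply() over a PSOCK cluster is the portable alternative.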


