Changing R's Seed from Rcpp to Guarantee Reproducibility

Get the same sample of integers from Rcpp as base R

Compiling comments into an answer. Akrun noted that by setting RNGkind or RNGversion we can replicate the results. From DirkEddelbuettel: there was a "change in R's RNG that came about because someone noticed a bias in, IIRC, use of sampling (at very large N). So that's why you have to turn an option on in R to get the old (matching) behaviour." And RalfStubner indicates that this is a known issue: https://github.com/RcppCore/RcppArmadillo/issues/250 and https://github.com/RcppCore/Rcpp/issues/945

Presently, R uses a different default sampler, which leads to different results:

RNGkind(sample.kind = "Rejection")
set.seed(1)
sample(10)
# [1] 9 4 7 1 2 5 3 10 6 8
set.seed(1)
mysamp1(10)
# [1] 3 4 5 7 2 8 9 6 10 1

However, the earlier sampling behaviour can be restored with:

RNGkind(sample.kind = "Rounding")
#Warning message:
# In RNGkind("Mersenne-Twister", "Inversion", "Rounding") : non-uniform 'Rounding' sampler used

set.seed(1)
sample(10)
# [1] 3 4 5 7 2 8 9 6 10 1
set.seed(1)
mysamp1(10)
# [1] 3 4 5 7 2 8 9 6 10 1

Using R's which function in Rcpp, not returning values

When you do <long vector> == <short vector> in R, the short vector gets recycled to match the long vector's length. This does not happen in Rcpp! In your case, you want <vector> == <single-element vector>, which can be done in Rcpp as <vector> == <double/int/...>. This means you have to extract element 0 from the single-element vector. In your code:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
SEXP mywhich(SEXP x, SEXP y) {
  // For each supported type, turn it into the 'real' type and
  // perform the operation. We can use TYPEOF to check the type.
  switch (TYPEOF(x)) {
  case REALSXP: {
    Environment base("package:base");
    Function f("which");
    NumericVector answer = f(as<NumericVector>(y)(0) == as<NumericVector>(x));
    //                                         ^^^
    return wrap(answer);
  }
  case INTSXP: {
    Environment base("package:base");
    Function f("which");
    IntegerVector answer = f(as<IntegerVector>(y)(0) == as<IntegerVector>(x));
    //                                         ^^^
    return wrap(answer);
  }
  default: {
    stop("Only integer and numeric vectors are supported");
  }
  }
}

BTW, I am not convinced that you need which from R to find the indices in a LogicalVector that are true.

Fixing set.seed for an entire session

There are several options, depending on your exact needs. I suspect the first and simplest option is not sufficient, but my second and third options may be more appropriate, with the third the most automatable.

Option 1

If you know in advance that the function using/creating random numbers will always draw the same number of random values, and you don't reorder the function calls or insert a new call in between existing ones, then all you need to do is set the seed once. Indeed, you probably don't want to keep resetting the seed, as you'll just keep getting the same set of random numbers for each function call.

For example:

> set.seed(1)
> sample(10)
[1] 3 4 5 7 2 8 9 6 10 1
> sample(10)
[1] 3 2 6 10 5 7 8 4 1 9
>
> ## second time round
> set.seed(1)
> sample(10)
[1] 3 4 5 7 2 8 9 6 10 1
> sample(10)
[1] 3 2 6 10 5 7 8 4 1 9

Option 2

If you really want to make sure that a function uses the same seed and you only want to set it once, pass the seed as an argument:

foo <- function(...., seed) {
    ## set the seed
    if (!missing(seed))
        set.seed(seed)
    ## do other stuff
    ....
}

my.seed <- 42
bar <- foo(...., seed = my.seed)
fbar <- foo(...., seed = my.seed)

(where .... means other args to your function; this is pseudo code).

Option 3

If you want to automate this even more, then you could abuse the options mechanism, which is fine if you are just doing this in a script (for a package you should use your own options object). Then your function can look for this option. E.g.

foo <- function() {
    if (!is.null(seed <- getOption("myseed")))
        set.seed(seed)
    sample(10)
}

Then in use we have:

> getOption("myseed")
NULL
> foo()
[1] 1 2 9 4 8 7 10 6 3 5
> foo()
[1] 6 2 3 5 7 8 1 4 10 9
> options(myseed = 42)
> foo()
[1] 10 9 3 6 4 8 5 1 2 7
> foo()
[1] 10 9 3 6 4 8 5 1 2 7
> foo()
[1] 10 9 3 6 4 8 5 1 2 7
> foo()
[1] 10 9 3 6 4 8 5 1 2 7

parallel computing with foreach Rcpp and RNG much slower than expected

Here you want to create one large matrix. Distributing this to several processes is possible in principle, but bears the cost of combining the results in the end. I suggest using "shared memory parallelism" here. I am using the OpenMP code from here as a starting point for a parallel version of your algorithm:

// [[Rcpp::depends(RcppEigen)]]
#include <RcppEigen.h>
// [[Rcpp::depends(dqrng)]]
#include <xoshiro.h>
// [[Rcpp::plugins(openmp)]]
#include <omp.h>
// [[Rcpp::plugins(cpp11)]]
#include <random>

// [[Rcpp::export]]
Eigen::MatrixXd funD(const int n_bootstrap,
                     const int n_obs_censusdata,
                     const Eigen::Map<Eigen::VectorXd> locationeffects,
                     const Eigen::Map<Eigen::VectorXd> residuals,
                     const Eigen::Map<Eigen::MatrixXd> X,
                     const Eigen::Map<Eigen::MatrixXd> beta_sample,
                     int ncores) {

  // --------- create random sample of locations and of residuals --------- //

  // initialise random seeds
  std::random_device rd; // used to obtain a seed for the number engine
  dqrng::xoshiro256plus gen(rd());

  // initialize distributions for random locations and residuals
  const int upperlocation = locationeffects.size();
  const int upperresiduals = residuals.size();

  // subtract 1 because in C++ indices start with 0
  std::uniform_int_distribution<> distrloc(0, upperlocation - 1);
  std::uniform_int_distribution<> distrres(0, upperresiduals - 1);

  // initialize and fill matrix for random locations and residuals
  Eigen::MatrixXd LocationEffectResiduals(n_obs_censusdata, n_bootstrap);

  #pragma omp parallel num_threads(ncores)
  {
    dqrng::xoshiro256plus lgen(gen);      // make thread-local copy of rng
    lgen.jump(omp_get_thread_num() + 1);  // advance rng by 1 ... ncores jumps

    #pragma omp for
    for (int i = 0; i < n_obs_censusdata; ++i)
      for (int j = 0; j < n_bootstrap; ++j)
        LocationEffectResiduals(i, j) =
          locationeffects[distrloc(lgen)] + residuals[distrres(lgen)];
  }

  // ----- create Xbeta ------- //
  Eigen::MatrixXd Xbeta = X * beta_sample;

  // ----- combine results ------- //
  Eigen::MatrixXd returnmatrix = Xbeta + LocationEffectResiduals;

  return returnmatrix;
}

On my dual-core Linux system, my funD with ncores = 1 is slightly faster than your funC, probably because the RNG used is faster. With ncores = 2 it gains another 30-40%. Not bad, given that not all of the code is executed in parallel. I don't know how good OpenMP performance is on Windows these days; it might make sense to use RcppParallel instead, but that would require more changes to your code.

The above code is meant to be sourced with Rcpp::sourceCpp(). When you put this into a package, you should use

CXX_STD = CXX11
PKG_CXXFLAGS = $(SHLIB_OPENMP_CXXFLAGS)
PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS)

in Makevars(.win). Note that, according to WRE, this might not work as expected if a different compiler is used for C++11 than for C++98. IIRC, Solaris is the only platform where this is the case in the default configuration, so for an internal package you should be fine.

How to speed up this Rcpp function?

Maybe you're looking for (the oddly named) rowsum:

library(microbenchmark)
use.rowsum <- rowsum  # use.for() is the looping implementation from the question

and

> all.equal(use.for(X, g), use.rowsum(X, g), check.attributes=FALSE)
[1] TRUE
> microbenchmark(use.for(X, g), use.rowsum(X, g), times=5)
Unit: milliseconds
             expr       min        lq    median        uq       max neval
    use.for(X, g) 126.92876 127.19027 127.51403 127.64082 128.06579     5
 use.rowsum(X, g)  10.56727  10.93942  11.01106  11.38697  11.38918     5

