How to Handle Vectors Without Knowing the Type in Rcpp

How can I handle vectors without knowing the type in Rcpp?

Don't use push_back on Rcpp types. With the way Rcpp vectors are currently implemented, each call copies all of the data, which is a very expensive operation.
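To make the cost concrete, here is a small cost model in plain C++ (not Rcpp). The function names are hypothetical, and the model simply assumes what is described above: every append allocates a new buffer and copies all existing elements into it.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical cost model: every append allocates a buffer one element
// larger and copies the existing elements over (the behaviour described
// above for push_back on Rcpp vectors). Returns the number of element copies.
std::size_t copies_when_growing_by_one(std::size_t n) {
    std::vector<int> data;
    std::size_t copies = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::vector<int> bigger(data.size() + 1);   // fresh allocation
        for (std::size_t j = 0; j < data.size(); ++j) {
            bigger[j] = data[j];                     // copy old contents
            ++copies;
        }
        bigger[data.size()] = static_cast<int>(i);   // the appended value
        data.swap(bigger);
    }
    return copies;
}

// Allocating once and filling by index touches each element exactly once.
std::size_t writes_when_preallocated(std::size_t n) {
    std::vector<int> data(n);
    std::size_t writes = 0;
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = static_cast<int>(i);
        ++writes;
    }
    return writes;
}
```

Under this model, appending n elements one by one costs n(n-1)/2 copies, against n writes when the result is allocated at its final size up front.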

We have RCPP_RETURN_VECTOR for dispatching. It requires that you write a template function taking a Vector as input.

#include <Rcpp.h>
using namespace Rcpp ;

template <int RTYPE>
Vector<RTYPE> first_two_impl( Vector<RTYPE> xin){
    Vector<RTYPE> xout(2) ;
    for( int i=0; i<2; i++ ){
        xout[i] = xin[i] ;    
    }
    return xout ;
}

// [[Rcpp::export]]
SEXP first_two( SEXP xin ){
  RCPP_RETURN_VECTOR(first_two_impl, xin) ;
}

/*** R
    first_two( 1:3 )
    first_two( letters )
*/

Just sourceCpp this file; this will also run the R code, which calls the function twice. Actually, the template could be simpler. This would work too:

template <typename T>
T first_two_impl( T xin){
    T xout(2) ;
    for( int i=0; i<2; i++ ){
        xout[i] = xin[i] ;    
    }
    return xout ;
}

The template parameter T only needs:

A constructor taking an int
An operator[](int)
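To illustrate how little the simpler template asks of T, here is a sketch that uses standard library containers as stand-ins for Rcpp vectors. The name first_two_generic is hypothetical, and the function assumes the input holds at least two elements.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Same shape as the simpler template above: T only needs a constructor
// taking a size and an operator[].
template <typename T>
T first_two_generic(const T& xin) {
    T xout(2);
    for (std::size_t i = 0; i < 2; ++i) {
        xout[i] = xin[i];   // caller must pass at least two elements
    }
    return xout;
}
```

The same template compiles for std::vector<int> and std::deque<std::string>, because both provide the two operations listed above.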

Alternatively, this might be a job for dplyr vector visitors.

#include <dplyr.h>
// [[Rcpp::depends(dplyr,BH)]]

using namespace dplyr ;
using namespace Rcpp ;

// [[Rcpp::export]]
SEXP first_two( SEXP data ){
    VectorVisitor* v = visitor(data) ;
    IntegerVector idx = seq( 0, 1 ) ;
    Shield<SEXP> out( v->subset(idx) ) ;
    delete v ;
    return out ;
}

Visitors let you do a set of things on a vector regardless of the type of data it holds.

> first_two(letters)
[1] "a" "b"

> first_two(1:10)
[1] 1 2

> first_two(rnorm(10))
[1] 0.4647190 0.9790888
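The idea behind a visitor can be sketched in plain C++: an abstract interface lists the type-independent operations, and one template implements them per element type. This is only an illustration of the pattern, not dplyr's actual API; ColumnVisitor, TypedColumn and format are hypothetical names. It also uses std::unique_ptr, so no manual delete is needed.

```cpp
#include <cstddef>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interface: the operations every column supports,
// independent of the element type it stores.
class ColumnVisitor {
public:
    virtual ~ColumnVisitor() {}
    virtual std::size_t size() const = 0;
    virtual std::unique_ptr<ColumnVisitor> subset(const std::vector<std::size_t>& idx) const = 0;
    virtual std::string format(std::size_t i) const = 0;
};

// One template supplies the implementation for each element type.
template <typename T>
class TypedColumn : public ColumnVisitor {
public:
    explicit TypedColumn(std::vector<T> data) : data_(std::move(data)) {}

    std::size_t size() const override { return data_.size(); }

    std::unique_ptr<ColumnVisitor> subset(const std::vector<std::size_t>& idx) const override {
        std::vector<T> out;
        out.reserve(idx.size());
        for (std::size_t i : idx) out.push_back(data_.at(i));  // at() checks bounds
        return std::unique_ptr<ColumnVisitor>(new TypedColumn<T>(std::move(out)));
    }

    std::string format(std::size_t i) const override {
        std::ostringstream os;
        os << data_.at(i);
        return os.str();
    }

private:
    std::vector<T> data_;
};

// Type-agnostic caller: works through the interface only.
std::unique_ptr<ColumnVisitor> first_two(const ColumnVisitor& col) {
    return col.subset({0, 1});
}
```

The caller never names the element type; the virtual call picks the right instantiation at run time.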

Rcpp fast statistical mode function with vector input of any type

In order to make the function work for any vector input, you could implement @JosephWood's algorithm for any data type you want to support and call it from a switch(TYPEOF(x)). But that would be lots of code duplication. Instead, it is better to make a generic function that can work on any Vector<RTYPE> argument. If we follow R's paradigm that everything is a vector and let the function also return a Vector<RTYPE>, then we can make use of RCPP_RETURN_VECTOR. Note that we need C++11 to be able to pass additional arguments to the function called by RCPP_RETURN_VECTOR. One tricky thing is that you need the storage type for Vector<RTYPE> in order to create a suitable std::unordered_map. Here Rcpp::traits::storage_type<RTYPE>::type comes to the rescue. However, std::unordered_map does not know how to deal with complex numbers from R. For simplicity, I am disabling this special case.

Putting it all together:

#include <Rcpp.h>
using namespace Rcpp ;

// [[Rcpp::plugins(cpp11)]]
#include <unordered_map>

template <int RTYPE>
Vector<RTYPE> fastModeImpl(Vector<RTYPE> x, bool narm){
  if (narm) x = x[!is_na(x)];
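  int myMax = 1;
  Vector<RTYPE> myMode(1);
  // if no value occurs more than once, report the first element
  // instead of a default-initialised value
  if (x.size() > 0) myMode[0] = x[0];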
  // special case for factors == INTSXP with "class" and "levels" attribute
  if (x.hasAttribute("levels")){
    myMode.attr("class") = x.attr("class");
    myMode.attr("levels") = x.attr("levels");
  }
  std::unordered_map<typename Rcpp::traits::storage_type<RTYPE>::type, int> modeMap;
  modeMap.reserve(x.size());

  for (std::size_t i = 0, len = x.size(); i < len; ++i) {
    auto it = modeMap.find(x[i]);

    if (it != modeMap.end()) {
      ++(it->second);
      if (it->second > myMax) {
        myMax = it->second;
        myMode[0] = x[i];
      }
    } else {
      modeMap.insert({x[i], 1});
    }
  }

  return myMode;
}

template <>
Vector<CPLXSXP> fastModeImpl(Vector<CPLXSXP> x, bool narm) {
  stop("Not supported SEXP type!");
}

// [[Rcpp::export]]
SEXP fastMode( SEXP x, bool narm = false ){
  RCPP_RETURN_VECTOR(fastModeImpl, x, narm);
}

/*** R
set.seed(1234)
s <- sample(1e5, replace = TRUE)
fastMode(s)
fastMode(s + 0.1)
l <- sample(c(TRUE, FALSE), 11, replace = TRUE) 
fastMode(l)
c <- sample(letters, 1e5, replace = TRUE)
fastMode(c)
f <- as.factor(c)
fastMode(f) 
*/

Output:

> set.seed(1234)

> s <- sample(1e5, replace = TRUE)

> fastMode(s)
[1] 85433

> fastMode(s + 0.1)
[1] 85433.1

> l <- sample(c(TRUE, FALSE), 11, replace = TRUE) 

> fastMode(l)
[1] TRUE

> c <- sample(letters, 1e5, replace = TRUE)

> fastMode(c)
[1] "z"

> f <- as.factor(c)

> fastMode(f) 
[1] z
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
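The counting step can also be looked at in isolation. The following is a minimal sketch in plain C++, with std::vector standing in for an Rcpp vector and a hypothetical name mode_of; it does no NA handling. Like the Rcpp version, it needs a hash for the element type, and std::complex has no std::hash specialisation either.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Returns the most frequent element; ties go to the value that reaches
// the highest count first. Throws on empty input.
template <typename T>
T mode_of(const std::vector<T>& x) {
    if (x.empty()) throw std::invalid_argument("empty input");
    std::unordered_map<T, int> counts;
    counts.reserve(x.size());
    T best = x.front();
    int best_count = 0;
    for (const T& value : x) {
        int count = ++counts[value];   // inserts 0 first if the key is new
        if (count > best_count) {
            best_count = count;
            best = value;
        }
    }
    return best;
}
```

If every value occurs exactly once, this sketch returns the first element.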

As noted above, the algorithm used comes from Joseph Wood's answer, which has been explicitly dual-licensed under CC-BY-SA and GPL >= 2. I am following Joseph and hereby license the code in this answer under the GPL (version 2 or later) in addition to the implicit CC-BY-SA license.

Extending Rcpp function to input vector of any type

I think the main error in your examples is that you start your loop at j = 0, so you call operator[](-1). The following works for me. Put the following in func.cpp:

#include <Rcpp.h>
#include <algorithm>
using namespace Rcpp;

template <int RTYPE>
int streak_run_impl(const Vector<RTYPE>& x, int i1, int i2)
{
  int cur_streak = 1;

  if (Vector<RTYPE>::is_na(x[0])){
    cur_streak = NA_INTEGER;
  } else {
    cur_streak = 1;
  }

  for(int j = std::max(i1, 1) /* have to start at one at least */; 
      j < std::min(i2 + 1, (int)x.size()) /* check size of x */; ++j){
    if(x[j] == x[j - 1]){
      cur_streak += 1;

    } else if(Vector<RTYPE>::is_na(x[j])){
      cur_streak = NA_INTEGER;

    } else {
      cur_streak = 1;

    }
  }
  return cur_streak;
}

// [[Rcpp::export]]
int streak_run3(SEXP x, int i1, int i2) {
  switch (TYPEOF(x)) {
    case INTSXP: {
      return streak_run_impl(as<IntegerVector>(x), i1, i2);
    }
    case REALSXP: {
      return streak_run_impl(as<NumericVector>(x), i1, i2);
    }
    case STRSXP: {
      return streak_run_impl(as<CharacterVector>(x), i1, i2);
    }
    case LGLSXP: {
      return streak_run_impl(as<LogicalVector>(x), i1, i2);
    }
    case CPLXSXP: {
      return streak_run_impl(as<ComplexVector>(x), i1, i2);
    }
    default: {
      return 0;
    }
  }
}

Then run this R script with the working directory set to that of the .cpp file

Rcpp::sourceCpp("func.cpp")

streak_run3(c(1,1,1,1), i1=0, i2=3)
streak_run3(as.integer(c(1,1,1,1)), i1=0, i2=3)
streak_run3(as.character(c(1,1,1,1)), i1=0, i2=3)
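The bounds handling is the important part, so here it is on its own in plain C++. This is a simplified sketch: std::vector stands in for the Rcpp vector, streak_length is a hypothetical name, and NA handling is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Length of the run of equal values ending at position i2, scanning
// positions i1..i2 (0-based, inclusive). Simplified: no NA handling.
template <typename T>
int streak_length(const std::vector<T>& x, int i1, int i2) {
    int cur_streak = 1;
    const int n = static_cast<int>(x.size());
    // start at 1 at the earliest so that x[j - 1] is always valid,
    // and stop at the end of x even if i2 is too large
    for (int j = std::max(i1, 1); j < std::min(i2 + 1, n); ++j) {
        if (x[j] == x[j - 1]) {
            cur_streak += 1;
        } else {
            cur_streak = 1;
        }
    }
    return cur_streak;
}
```

Clamping both ends of the loop means an i1 of 0 or an i2 past the end of x can no longer read outside the vector.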

Comparing two values in Rcpp without casting to specific type

You are on the right track with using the generic SEXP input object tag. To get this to work one needs to use C++ templates in addition to TYPEOF(). The former enables the correct vector creation in the comparison function to be hooked in with Rcpp sugar, while the latter enables the correct check and dispatch to occur.

#include <Rcpp.h>
using namespace Rcpp;

template <int RTYPE>
Rcpp::LogicalVector compare_me(Rcpp::Vector<RTYPE> x, Rcpp::Vector<RTYPE> y) {
    return x == y;
}

// [[Rcpp::export]]
Rcpp::LogicalVector compare_objects(SEXP x, SEXP y) {

    if (TYPEOF(x) == TYPEOF(y)) {
        switch (TYPEOF(x)) {
            case INTSXP:
                return compare_me<INTSXP>(x, y);
            case REALSXP:
                return compare_me<REALSXP>(x, y);
            case STRSXP:
                return compare_me<STRSXP>(x, y);
            default:
                Rcpp::stop("Type not supported");
        }
    } else {
        Rcpp::stop("Objects are of different type");
    }

    // Never used, but necessary to avoid the compiler complaining
    // about a missing return statement
    return Rcpp::LogicalVector(); 
}

Example:

to_cmp <- "a"
compare_objects(to_cmp, to_cmp)

Output:

[1] TRUE

Also, the above is for use with Rcpp::sourceCpp(). I would encourage you to switch from using inline to using Rcpp::cppFunction() for function definitions as it allows you to focus on the computation and not the setup.
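The "check the tag, then dispatch to a template" pattern does not depend on Rcpp. Here is a sketch in plain C++ where a hypothetical Value struct with a Tag enum plays the role of a SEXP and its TYPEOF(); the length check is an extra assumption of this sketch.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for an R object: a runtime type tag plus storage.
enum class Tag { Integer, Real, String };

struct Value {
    Tag tag;
    std::vector<int> ints;
    std::vector<double> reals;
    std::vector<std::string> strings;
};

// Typed comparison, written once for every element type.
template <typename T>
std::vector<bool> compare_typed(const std::vector<T>& x, const std::vector<T>& y) {
    if (x.size() != y.size()) throw std::invalid_argument("Lengths differ");
    std::vector<bool> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = (x[i] == y[i]);
    return out;
}

// Runtime check of the tags, then dispatch to the matching instantiation.
std::vector<bool> compare_values(const Value& x, const Value& y) {
    if (x.tag != y.tag) throw std::invalid_argument("Objects are of different type");
    switch (x.tag) {
        case Tag::Integer: return compare_typed(x.ints, y.ints);
        case Tag::Real:    return compare_typed(x.reals, y.reals);
        case Tag::String:  return compare_typed(x.strings, y.strings);
    }
    throw std::invalid_argument("Type not supported");
}
```

The switch is the only place that knows about every type; the comparison itself is written once.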

Return subset of a given SEXP without knowing the actual internal data type

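You can use a C++ template together with the RCPP_RETURN_VECTOR macro. This macro makes sure that the template is instantiated for each of the vector types it dispatches on (as far as I know: integer, numeric, logical, character, complex, raw, list, and expression vectors):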

#include <Rcpp.h>
// [[Rcpp::plugins(cpp11)]]

template <int RTYPE>
Rcpp::Vector<RTYPE> debug_subset_impl(Rcpp::Vector<RTYPE> x,
                                      R_xlen_t index_from,
                                      R_xlen_t index_to){
    // range [index_from, index_to)
    Rcpp::Vector<RTYPE> subset(index_to - index_from);
    std::copy(x.cbegin() + index_from, x.cbegin() + index_to, subset.begin());
    // special case for factors == INTSXP with "class" and "levels" attribute
    if (x.hasAttribute("levels")){
        subset.attr("class") = x.attr("class");
        subset.attr("levels") = x.attr("levels");
    }
    return subset;
}

// [[Rcpp::export]]
SEXP dbg_subset(SEXP x, R_xlen_t index_from, R_xlen_t index_to){
    // 1-based -> 0-based
    RCPP_RETURN_VECTOR(debug_subset_impl, x, index_from - 1, index_to - 1);
}

/*** R
set.seed(42)
dbg_subset(1:100, 3, 6)
dbg_subset(runif(100), 3, 6)
dbg_subset(letters, 3, 6)
dbg_subset(as.factor(letters), 3, 6)
*/

Output:

> Rcpp::sourceCpp('58965423.cpp')

> set.seed(42)

> dbg_subset(1:100, 3, 6)
[1] 3 4 5

> dbg_subset(runif(100), 3, 6)
[1] 0.2861395 0.8304476 0.6417455

> dbg_subset(letters, 3, 6)
[1] "c" "d" "e"

> dbg_subset(as.factor(letters), 3, 6)
[1] c d e
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
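The index arithmetic (1-based arguments, half-open range) is easy to get wrong, so here is the same copy in plain C++ with explicit bounds checks. This is a sketch: subset_range is a hypothetical name, and the checks are an addition that the Rcpp version above does not perform.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Copies the 1-based range [from, to) of x, i.e. elements from, ..., to - 1.
template <typename T>
std::vector<T> subset_range(const std::vector<T>& x, std::size_t from, std::size_t to) {
    if (from < 1 || to < from || to > x.size() + 1) {
        throw std::out_of_range("invalid range");
    }
    // 1-based -> 0-based
    const std::size_t first = from - 1;
    const std::size_t last = to - 1;
    std::vector<T> out(last - first);
    std::copy(x.begin() + first, x.begin() + last, out.begin());
    return out;
}
```

With from = 3 and to = 6 this returns the third, fourth and fifth elements, matching the output shown above.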

Using R (and Rcpp), how to pass a default 'std::vector int ' array into a function

Here is a version that at least compiles and runs. I am not quite sure what you want with partial -- but what you had is simply outside the (documented, but we already know you do not have time for the documentation we provide) interface contract so of course it didn't build.

Code

// https://gallery.rcpp.org/articles/sorting/
// https://www.geeksforgeeks.org/sorting-a-vector-in-c/
#include <Rcpp.h>
#include <algorithm>   // std::sort, std::nth_element
#include <functional>  // std::greater
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector cpp_sort_numeric_works(NumericVector arr, std::string dir = "ASC" ) {
    NumericVector _arr = clone(arr);
    if(dir != "ASC") {
        std::sort(_arr.begin(), _arr.end(), std::greater<double>());
    } else {
        std::sort(_arr.begin(), _arr.end());
    }
    return _arr;
}

NumericVector _partial_sort(NumericVector arr, int p, std::string dir = "ASC") {
    NumericVector _arr = clone(arr);
    if(dir != "ASC") {
        std::nth_element(_arr.begin(), _arr.begin()+p-1, _arr.end(), std::greater<double>());
    } else {
        std::nth_element(_arr.begin(), _arr.begin()+p-1, _arr.end());
    }
    return _arr;
}

// [[Rcpp::export]]
NumericVector cpp_sort_numeric(NumericVector arr, NumericVector partial, std::string dir = "ASC") {
    NumericVector _arr = clone(arr);
    if (partial[0] == -1)  { // only positive values allowed ...
        if(dir != "ASC") {
            std::sort(_arr.begin(), _arr.end(), std::greater<double>());
        } else {
            std::sort(_arr.begin(), _arr.end());
        }
    } else {
        for (auto& p : partial) {
            _arr = _partial_sort(_arr, p, dir);
        }
    }
    return _arr;
}

/*** R
v <- c(1,2,3,2,1,0,-1,2)
cpp_sort_numeric_works(v)
cpp_sort_numeric_works(v, "DESC")
w <- v
w[1] <- -1
cpp_sort_numeric(v, w)
cpp_sort_numeric(v, w, "DESC")
*/

Output

> Rcpp::sourceCpp("~/git/stackoverflow/73222485/answer.cpp")

> v <- c(1,2,3,2,1,0,-1,2)

> cpp_sort_numeric_works(v)
[1] -1  0  1  1  2  2  2  3

> cpp_sort_numeric_works(v, "DESC")
[1]  3  2  2  2  1  1  0 -1

> w <- v

> w[1] <- -1

> cpp_sort_numeric(v, w)
[1] -1  0  1  1  2  2  2  3

> cpp_sort_numeric(v, w, "DESC")
[1]  3  2  2  2  1  1  0 -1
>
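One detail worth checking when sorting doubles in descending order is the comparator's template argument: std::greater<int> converts both operands to int before comparing, so 0.9 and 0.2 look equal. The sketch below uses plain std::vector<double> and hypothetical function names; it also shows what std::nth_element guarantees for a single position.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Descending sort of doubles; the comparator is instantiated for double.
std::vector<double> sort_descending(std::vector<double> v) {
    std::sort(v.begin(), v.end(), std::greater<double>());
    return v;
}

// Value that would sit at 1-based position p after an ascending sort,
// found with std::nth_element (a partial sort). Requires 1 <= p <= v.size().
double pth_smallest(std::vector<double> v, std::size_t p) {
    std::nth_element(v.begin(), v.begin() + (p - 1), v.end());
    return v[p - 1];
}
```

std::nth_element only guarantees the element at the requested position (with smaller-or-equal values before it and larger-or-equal values after it), so it is not a full sort.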

How to deal with factors in Rcpp

Note: Throughout, I will refer to f, not c. It is bad practice to name variables the same name as a builtin function or constant, such as c, T, or F. Therefore I change the beginning of your code as follows:

library(Rcpp)

f <- factor(c("E", "H", "E", "12", "10", "60", "80", "11", "H", "H"))

In addition to looking at class(f) and storage.mode(f), it's useful to look at str(f):

str(f)
# Factor w/ 7 levels "10","11","12",..: 6 7 6 3 1 4 5 2 7 7

In truth, a factor is an integer vector with "levels": a character vector corresponding to each unique integer value. Luckily, you can get this from C++ using the .attr() member function of Rcpp::IntegerVector:

cppFunction('CharacterVector fun(IntegerVector x){

            // creates an empty character vector the size/length of x.
            CharacterVector y = x.size() ;

            // Get the levels of x
            CharacterVector levs = x.attr("levels");

            int n = x.size() - 1 ;

            //loop
            for(int i = 0; i <= n; i = i + 1){

            if(levs[x[i]-1] == "H"){
            y[i] = "Home" ;

            }else if(levs[x[i]-1] == "E"){
            y[i] = "Elsewhere" ;
            }else{
            y[i] = "Number" ;
            } ;

            }

            return y ;

            }')

fun(f)
# [1] "Elsewhere" "Home"      "Elsewhere" "Number"    "Number"    "Number"   
# [7] "Number"    "Number"    "Home"      "Home"

So, to get what you want, you had to do three things:

Change the return type from IntegerVector to CharacterVector (though you were completely right that the input should be IntegerVector)
Get the levels of the factor using CharacterVector levs = x.attr("levels");
Compare levs[x[i]-1] to "H", etc., rather than x[i] -- x[i] will always be an integer, giving the element of the vector of levels it corresponds to. We do -1 since C++ is 0-indexed and R is 1-indexed.
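The lookup in the last point can be sketched without Rcpp by modelling a factor as a vector of 1-based integer codes plus a vector of level labels. The name recode_factor is hypothetical, and plain std::vector stands in for IntegerVector and CharacterVector.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// A factor modelled as 1-based integer codes plus a vector of level labels.
std::vector<std::string> recode_factor(const std::vector<int>& codes,
                                       const std::vector<std::string>& levels) {
    std::vector<std::string> out(codes.size());
    for (std::size_t i = 0; i < codes.size(); ++i) {
        // codes are 1-based (as in R), C++ indexing is 0-based;
        // at() throws if a code does not match any level
        const std::string& label = levels.at(codes[i] - 1);
        if (label == "H") {
            out[i] = "Home";
        } else if (label == "E") {
            out[i] = "Elsewhere";
        } else {
            out[i] = "Number";
        }
    }
    return out;
}
```

Feeding it the codes and levels reported by str(f) above reproduces the result of fun(f).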

Other notes:

It is clear, as you say, that "[you're] attempting to learn how to use Rcpp() in R." You'll definitely want to spend some time with resources such as Rcpp for Everyone (that's the chapter on factors), the Rcpp Gallery (this specific link is an article on factors), Hadley's chapter on Rcpp, and definitely the Rcpp vignettes available here.

Is there an efficient way to obtain pmax other than using the R base function?

There seem to be a few issues, which the memory allocations reported by bench::mark help uncover.

bench::mark(pmax(x,  y, z, w, na.rm = TRUE),
            Pmax2(x, y, z, w))

## # A tibble: 2 x 13
##   expression                         min  median `itr/sec` mem_alloc
##   <bch:expr>                     <bch:t> <bch:t>     <dbl> <bch:byt>
## 1 pmax(x, y, z, w, na.rm = TRUE)  5.79ms  6.28ms     157.    781.3KB
## 2 Pmax2(x, y, z, w)              39.56ms 54.48ms      19.7    9.18MB

Memory Coercion

There is about 10 times the memory allocation compared with base pmax(). Your Rcpp code is relatively straightforward, so this hints that there is some kind of coercion. Looking at your sample data, you are sending integer vectors to a numeric signature, which creates a costly coercion. Let's update the signature and code to expect IntegerVectors. I simply changed everything from NumericVector to IntegerVector for this.

  expression                         min  median `itr/sec` mem_alloc
  <bch:expr>                     <bch:t> <bch:t>     <dbl> <bch:byt>
1 pmax(x, y, z, w, na.rm = TRUE)  1.89ms  2.33ms     438.    781.3KB
2 Pmax2_int(x, y, z, w)          37.42ms 49.88ms      17.6    2.32MB

Re-Compilation

The OP code includes cppFunction within the larger function code. Unless we need to recompile it on every call, we can instead compile once and then call the compiled code from R. This is the biggest performance boost for this dataset size.

cppFunction("
  IntegerVector cpp_pmax_pre(List args) {
    IntegerVector out = args[0];
    int n_arg = args.length();
    int n_vec = out.length();
    for (int j = 0; j < n_vec; ++j) {
      if (IntegerVector::is_na(out[j])) out[j] = -1;
    }
    for (int i = 1; i < n_arg; ++i) {
      IntegerVector pa = args[i];
      for (int j = 0; j < n_vec; ++j) {
        if (IntegerVector::is_na(pa[j])) pa[j] = -1;
        out[j] = std::max(out[j], pa[j]);
      }
    }
    return out;
  }
")

Pmax2_int_pre <- function(...) {
  args_list <- list(...)
  output <- cpp_pmax_pre(args_list)
  output[output == -1] <- NA
  return(output)
}

bench::mark(pmax(x,  y, z, w, na.rm = TRUE),
            Pmax2_int_pre(x, y, z, w))

## # A tibble: 2 x 13
##   expression                        min median `itr/sec` mem_alloc
##   <bch:expr>                     <bch:> <bch:>     <dbl> <bch:byt>
## 1 pmax(x, y, z, w, na.rm = TRUE) 2.31ms 2.42ms      397.   781.3KB
## 2 Pmax2_int_pre(x, y, z, w)      2.48ms 3.55ms      270.    2.29MB

More memory and small optimizations

Finally, we still have more memory allocated. That hints we can do more: in this case we should restore the NA values (NA_INTEGER) in Rcpp instead of in R. Relatedly, we can optimize the loop assignment somewhat.

cppFunction("
  IntegerVector cpp_pmax_final(List args) {
    IntegerVector out = args[0];
    int n_arg = args.length();
    int n_vec = out.length();
    for (int j = 0; j < n_vec; ++j) {
      if (IntegerVector::is_na(out[j])) out[j] = -1;
    }
    for (int i = 1; i < n_arg; ++i) {
      IntegerVector pa = args[i];
      for (int j = 0; j < n_vec; ++j) {
// simplify logic; if the element is not na and is greater than the out, update out.
        if (!IntegerVector::is_na(pa[j]) && pa[j] > out[j]) out[j] = pa[j];
      }
    }
// update now in Rcpp instead of allocating vectors in R
    for (int i = 0; i < n_vec; i++) {
      if(out[i] == -1) out[i] = NA_INTEGER;
    }
    return out;
  }
")

Pmax2_final <- function(...) {
  cpp_pmax_final(list(...))
}

bench::mark(pmax(x,  y, z, w, na.rm = TRUE),
            Pmax2_final(x, y, z, w))

## # A tibble: 2 x 13
##   expression                        min median `itr/sec` mem_alloc
##   <bch:expr>                     <bch:> <bch:>     <dbl> <bch:byt>
## 1 pmax(x, y, z, w, na.rm = TRUE)    2ms 2.08ms      460.   781.3KB
## 2 Pmax2_final(x, y, z, w)        1.19ms 1.45ms      671.    2.49KB

We did it*! I am sure there could be small optimizations - we access pa[j] three times so it may be worthwhile to assign to a variable.

Bonus - NA_INTEGER

According to Rcpp for Everyone, NA_INTEGER is equivalent to the lowest integer value, -2147483648. Using this, we can drop the replacement of NAs: with int data, any non-NA value compares greater than NA_INTEGER, so the plain comparison already skips missing values.

During this realization, I also found an issue with the previous part: we need to clone the initial argument so that we are not accidentally changing it by reference. Even so, we're still slightly faster than base pmax().

cppFunction("
  IntegerVector cpp_pmax_last(List args) {
    IntegerVector tmp = args[0];
    IntegerVector out = clone(tmp);
    int n_arg = args.length();
    int n_vec = out.length();
    for (int i = 1; i < n_arg; ++i) {
      IntegerVector pa = args[i];
      for (int j = 0; j < n_vec; ++j) {
        if (pa[j] > out[j]) out[j] = pa[j];
      }
    }
    return out;
  }
")

Pmax2_last <- function(...) {
  cpp_pmax_last(list(...))
}

bench::mark(pmax(x,  y, z, w, na.rm = TRUE),
            Pmax2_last(x, y, z, w))

## # A tibble: 2 x 13
##   expression                        min median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>                     <bch:> <bch:>     <dbl> <bch:byt>    <dbl>
## 1 pmax(x, y, z, w, na.rm = TRUE) 5.98ms 6.36ms      154.     781KB        0
## 2 Pmax2_last(x, y, z, w)         5.09ms 5.46ms      177.     784KB        0
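The sentinel trick from this bonus section can be sketched in plain C++. Here kNaInt is a hypothetical stand-in for NA_INTEGER, assumed (as stated above) to be the smallest int value, and std::vector stands in for IntegerVector.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Stand-in for R's NA_INTEGER, assumed to be the smallest int value.
const int kNaInt = std::numeric_limits<int>::min();

// Element-wise maximum across several equal-length vectors. Because the
// sentinel is the smallest int, any real value wins the comparison, so
// missing entries are skipped without an explicit check.
std::vector<int> pmax_int(const std::vector<std::vector<int> >& args) {
    std::vector<int> out = args.at(0);   // copy, so the input is not modified
    for (std::size_t i = 1; i < args.size(); ++i) {
        const std::vector<int>& pa = args[i];
        for (std::size_t j = 0; j < out.size(); ++j) {
            const int candidate = pa.at(j);   // read once, bounds-checked
            if (candidate > out[j]) out[j] = candidate;
        }
    }
    return out;
}
```

A position stays missing only if it is missing in every input, which corresponds to the na.rm = TRUE behaviour used in the benchmarks.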

Rcpp swap function with NumericVector

Building on @r2evans' comments, here's a minimal implementation:

#include <Rcpp.h>

template <int T>
void swap_templ(Rcpp::Vector<T> x) {
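  // use the vector's own element type so integers are not routed through a double
  typename Rcpp::traits::storage_type<T>::type tmp = x[0];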
  x[0] = x[1];
  x[1] = tmp;
}
// [[Rcpp::export]]
void swap(SEXP x) {
  switch (TYPEOF(x)) {
  case INTSXP: 
    swap_templ<INTSXP>(x);
    break;
  case REALSXP:
    swap_templ<REALSXP>(x);
    break;
  default:
    Rcpp::Rcout <<
      "\nInput vector must be numeric or integer type" <<
      std::endl;
    break;
  }
}

/*** R
iv <- 1L:3L
dv <- 1:3 + 0.5

class(iv)
# [1] "integer"

class(dv)
# [1] "numeric"

swap(iv); iv
# [1] 2 1 3

swap(dv); dv
# [1] 2.5 1.5 3.5

class(iv)
# [1] "integer"

class(dv)
# [1] "numeric"
*/
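For comparison, here is the same in-place swap in plain C++, as a sketch with a hypothetical name swap_first_two. The reference parameter plays the role that the shared memory between R and the Rcpp vector plays above, and the size check is an addition.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Swaps the first two elements in place. The temporary inside std::swap has
// the element's own type, so integers are never routed through a double.
template <typename T>
void swap_first_two(std::vector<T>& x) {
    if (x.size() < 2) throw std::invalid_argument("need at least two elements");
    std::swap(x[0], x[1]);
}
```

Because the vector is passed by reference, the caller sees the change, just as iv and dv are modified in the R session above.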
