How can I handle vectors without knowing the type in Rcpp
Don't use push_back on Rcpp types. The way Rcpp vectors are currently implemented, this requires copying all of the data each time, which is a very expensive operation.
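To make the cost concrete, here is a minimal sketch in plain C++ rather than Rcpp. It assumes, as described above, that every append has to allocate a new buffer and copy the existing elements. The function name copies_when_growing_by_one is made up for illustration, and it only counts the element copies.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of a container that reallocates on every append:
// each append allocates a buffer one element larger and copies the old data.
// Returns how many element copies were needed to append n values.
std::size_t copies_when_growing_by_one(std::size_t n) {
    std::vector<int> data;
    std::size_t copied = 0;
    for (std::size_t i = 0; i < n; ++i) {
        std::vector<int> bigger(data.size() + 1);
        for (std::size_t k = 0; k < data.size(); ++k) {
            bigger[k] = data[k];  // copy every existing element
            ++copied;
        }
        bigger[data.size()] = static_cast<int>(i);  // place the new element
        data.swap(bigger);
    }
    return copied;
}
```

Under this model, appending n values costs n(n-1)/2 copies (499500 for n = 1000), whereas allocating the result once at its final size and filling it by index needs none.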
We have RCPP_RETURN_VECTOR for dispatching. It requires that you write a template function taking a Vector as input.
#include <Rcpp.h>
using namespace Rcpp ;
template <int RTYPE>
Vector<RTYPE> first_two_impl( Vector<RTYPE> xin){
Vector<RTYPE> xout(2) ;
for( int i=0; i<2; i++ ){
xout[i] = xin[i] ;
}
return xout ;
}
// [[Rcpp::export]]
SEXP first_two( SEXP xin ){
RCPP_RETURN_VECTOR(first_two_impl, xin) ;
}
/*** R
first_two( 1:3 )
first_two( letters )
*/
Just sourceCpp this file; this will also run the R code, which calls the function twice. Actually, the template could be simpler. This would work too:
template <typename T>
T first_two_impl( T xin){
T xout(2) ;
for( int i=0; i<2; i++ ){
xout[i] = xin[i] ;
}
return xout ;
}
The template parameter T only needs:
- A constructor taking an int
- An operator[](int)
Alternatively, this might be a job for dplyr vector visitors.
#include <dplyr.h>
// [[Rcpp::depends(dplyr,BH)]]
using namespace dplyr ;
using namespace Rcpp ;
// [[Rcpp::export]]
SEXP first_two( SEXP data ){
VectorVisitor* v = visitor(data) ;
IntegerVector idx = seq( 0, 1 ) ;
Shield<SEXP> out( v->subset(idx) ) ;
delete v ;
return out ;
}
Visitors let you do a set of things on a vector regardless of the type of data it holds.
> first_two(letters)
[1] "a" "b"
> first_two(1:10)
[1] 1 2
> first_two(rnorm(10))
[1] 0.4647190 0.9790888
Rcpp fast statistical mode function with vector input of any type
In order to make the function work for any vector input, you could implement @JosephWood's algorithm for each data type you want to support and call it from a switch(TYPEOF(x)). But that would be a lot of code duplication. Instead, it is better to write a generic function that can work on any Vector<RTYPE> argument. If we follow R's paradigm that everything is a vector and let the function also return a Vector<RTYPE>, then we can make use of RCPP_RETURN_VECTOR. Note that we need C++11 to be able to pass additional arguments to the function called by RCPP_RETURN_VECTOR. One tricky thing is that you need the storage type for Vector<RTYPE> in order to create a suitable std::unordered_map. Here Rcpp::traits::storage_type<RTYPE>::type comes to the rescue. However, std::unordered_map does not know how to deal with complex numbers from R. For simplicity, I am disabling this special case.
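The role of the storage-type trait can be sketched in plain C++ without Rcpp. Everything named here (Tag, storage_of, mode_of) is hypothetical and only mimics the idea of mapping a compile-time type tag to an element type that then serves as the key type of the std::unordered_map. It is not the function from this answer, and it resolves ties and empty input in its own way.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical type tags standing in for R's INTSXP, REALSXP and STRSXP.
enum Tag { kInt, kReal, kString };

// Hypothetical trait mapping a tag to its element type, in the spirit of
// Rcpp::traits::storage_type<RTYPE>::type.
template <Tag TAG> struct storage_of;
template <> struct storage_of<kInt>    { typedef int type; };
template <> struct storage_of<kReal>   { typedef double type; };
template <> struct storage_of<kString> { typedef std::string type; };

// Most frequent value; the key type of the map comes from the trait.
// Ties go to the value that reaches the highest count first.
// Returns a default-constructed value for empty input.
template <Tag TAG>
typename storage_of<TAG>::type
mode_of(const std::vector<typename storage_of<TAG>::type>& x) {
    typedef typename storage_of<TAG>::type value_type;
    std::unordered_map<value_type, int> counts;
    counts.reserve(x.size());
    value_type best = value_type();
    int best_count = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        int n = ++counts[x[i]];  // inserts 0 on first sight, then increments
        if (n > best_count) {
            best_count = n;
            best = x[i];
        }
    }
    return best;
}
```

A call such as mode_of<kInt>(values) picks int as the key type, and mode_of<kString>(values) picks std::string, without the counting loop being written twice.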
Putting it all together:
#include <Rcpp.h>
using namespace Rcpp ;
// [[Rcpp::plugins(cpp11)]]
#include <unordered_map>
template <int RTYPE>
Vector<RTYPE> fastModeImpl(Vector<RTYPE> x, bool narm){
if (narm) x = x[!is_na(x)];
int myMax = 1;
Vector<RTYPE> myMode(1);
// special case for factors == INTSXP with "class" and "levels" attribute
if (x.hasAttribute("levels")){
myMode.attr("class") = x.attr("class");
myMode.attr("levels") = x.attr("levels");
}
std::unordered_map<typename Rcpp::traits::storage_type<RTYPE>::type, int> modeMap;
modeMap.reserve(x.size());
for (std::size_t i = 0, len = x.size(); i < len; ++i) {
auto it = modeMap.find(x[i]);
if (it != modeMap.end()) {
++(it->second);
if (it->second > myMax) {
myMax = it->second;
myMode[0] = x[i];
}
} else {
modeMap.insert({x[i], 1});
}
}
return myMode;
}
template <>
Vector<CPLXSXP> fastModeImpl(Vector<CPLXSXP> x, bool narm) {
stop("Not supported SEXP type!");
}
// [[Rcpp::export]]
SEXP fastMode( SEXP x, bool narm = false ){
RCPP_RETURN_VECTOR(fastModeImpl, x, narm);
}
/*** R
set.seed(1234)
s <- sample(1e5, replace = TRUE)
fastMode(s)
fastMode(s + 0.1)
l <- sample(c(TRUE, FALSE), 11, replace = TRUE)
fastMode(l)
c <- sample(letters, 1e5, replace = TRUE)
fastMode(c)
f <- as.factor(c)
fastMode(f)
*/
Output:
> set.seed(1234)
> s <- sample(1e5, replace = TRUE)
> fastMode(s)
[1] 85433
> fastMode(s + 0.1)
[1] 85433.1
> l <- sample(c(TRUE, FALSE), 11, replace = TRUE)
> fastMode(l)
[1] TRUE
> c <- sample(letters, 1e5, replace = TRUE)
> fastMode(c)
[1] "z"
> f <- as.factor(c)
> fastMode(f)
[1] z
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
As noted above, the algorithm used comes from Joseph Wood's answer, which has been explicitly dual-licensed under CC-BY-SA and GPL >= 2. I am following Joseph and hereby license the code in this answer under the GPL (version 2 or later) in addition to the implicit CC-BY-SA license.
Extending Rcpp function to input vector of any type
I think the main error in your examples is that you start your loop at j = 0, so you call operator[](-1). The following works for me. Save the following as func.cpp:
#include <Rcpp.h>
#include <algorithm>
using namespace Rcpp;
template <int RTYPE>
int streak_run_impl(const Vector<RTYPE>& x, int i1, int i2)
{
int cur_streak = 1;
if (Vector<RTYPE>::is_na(x[0])){
cur_streak = NA_INTEGER;
} else {
cur_streak = 1;
}
for(int j = std::max(i1, 1) /* have to start at one at least */;
j < std::min(i2 + 1, (int)x.size()) /* check size of x */; ++j){
if(x[j] == x[j - 1]){
cur_streak += 1;
} else if(Vector<RTYPE>::is_na(x[j])){
cur_streak = NA_INTEGER;
} else {
cur_streak = 1;
}
}
return cur_streak;
}
// [[Rcpp::export]]
int streak_run3(SEXP x, int i1, int i2) {
switch (TYPEOF(x)) {
case INTSXP: {
return streak_run_impl(as<IntegerVector>(x), i1, i2);
}
case REALSXP: {
return streak_run_impl(as<NumericVector>(x), i1, i2);
}
case STRSXP: {
return streak_run_impl(as<CharacterVector>(x), i1, i2);
}
case LGLSXP: {
return streak_run_impl(as<LogicalVector>(x), i1, i2);
}
case CPLXSXP: {
return streak_run_impl(as<ComplexVector>(x), i1, i2);
}
default: {
return 0;
}
}
}
Then run this R script with the working directory set to that of the .cpp file:
Rcpp::sourceCpp("func.cpp")
streak_run3(c(1,1,1,1), i1=0, i2=3)
streak_run3(as.integer(c(1,1,1,1)), i1=0, i2=3)
streak_run3(as.character(c(1,1,1,1)), i1=0, i2=3)
Comparing two values in Rcpp without casting to specific type
You are on the right track with using the generic SEXP input object tag. To get this to work, one needs to use C++ templates in addition to TYPEOF(). The former enables the correct vector creation in the comparison function to be hooked in with Rcpp sugar, while the latter enables the correct check and dispatch to occur.
#include <Rcpp.h>
using namespace Rcpp;
template <int RTYPE>
Rcpp::LogicalVector compare_me(Rcpp::Vector<RTYPE> x, Rcpp::Vector<RTYPE> y) {
return x == y;
}
// [[Rcpp::export]]
Rcpp::LogicalVector compare_objects(SEXP x, SEXP y) {
if (TYPEOF(x) == TYPEOF(y)) {
switch (TYPEOF(x)) {
case INTSXP:
return compare_me<INTSXP>(x, y);
case REALSXP:
return compare_me<REALSXP>(x, y);
case STRSXP:
return compare_me<STRSXP>(x, y);
default:
Rcpp::stop("Type not supported");
}
} else {
Rcpp::stop("Objects are of different type");
}
// Never used, but necessary to avoid the compiler complaining
// about a missing return statement
return Rcpp::LogicalVector();
}
Example:
to_cmp <- "a"
compare_objects(to_cmp, to_cmp)
Output:
[1] TRUE
Also, the above is for use with Rcpp::sourceCpp(). I would encourage you to switch from using inline to using Rcpp::cppFunction() for function definitions, as it allows you to focus on the computation and not the setup.
Return subset of a given SEXP without knowing the actual internal data type
You can use a C++ template together with the RCPP_RETURN_VECTOR macro. This macro will make sure that the template is instantiated for all(?) R data types:
#include <Rcpp.h>
// [[Rcpp::plugins(cpp11)]]
template <int RTYPE>
Rcpp::Vector<RTYPE> debug_subset_impl(Rcpp::Vector<RTYPE> x,
R_xlen_t index_from,
R_xlen_t index_to){
// range [index_from, index_to)
Rcpp::Vector<RTYPE> subset(index_to - index_from);
std::copy(x.cbegin() + index_from, x.cbegin() + index_to, subset.begin());
// special case for factors == INTSXP with "class" and "levels" attribute
if (x.hasAttribute("levels")){
subset.attr("class") = x.attr("class");
subset.attr("levels") = x.attr("levels");
}
return subset;
}
// [[Rcpp::export]]
SEXP dbg_subset(SEXP x, R_xlen_t index_from, R_xlen_t index_to){
// 1-based -> 0-based
RCPP_RETURN_VECTOR(debug_subset_impl, x, index_from - 1, index_to - 1);
}
/*** R
set.seed(42)
dbg_subset(1:100, 3, 6)
dbg_subset(runif(100), 3, 6)
dbg_subset(letters, 3, 6)
dbg_subset(as.factor(letters), 3, 6)
*/
Output:
> Rcpp::sourceCpp('58965423.cpp')
> set.seed(42)
> dbg_subset(1:100, 3, 6)
[1] 3 4 5
> dbg_subset(runif(100), 3, 6)
[1] 0.2861395 0.8304476 0.6417455
> dbg_subset(letters, 3, 6)
[1] "c" "d" "e"
> dbg_subset(as.factor(letters), 3, 6)
[1] c d e
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
Using R (and Rcpp), how to pass a default 'std::vector int ' array into a function
Here is a version that at least compiles and runs. I am not quite sure what you want with partial, but what you had is simply outside the (documented, but we already know you do not have time for the documentation we provide) interface contract, so of course it didn't build.
Code
// https://gallery.rcpp.org/articles/sorting/
// https://www.geeksforgeeks.org/sorting-a-vector-in-c/
#include <Rcpp.h>
#include <algorithm>   // std::sort, std::nth_element
#include <functional>  // std::greater
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector cpp_sort_numeric_works(NumericVector arr, std::string dir = "ASC" ) {
NumericVector _arr = clone(arr);
if(dir != "ASC") {
std::sort(_arr.begin(), _arr.end(), std::greater<double>()); // compare as double, the element type
} else {
std::sort(_arr.begin(), _arr.end());
}
return _arr;
}
NumericVector _partial_sort(NumericVector arr, int p, std::string dir = "ASC") {
NumericVector _arr = clone(arr);
if(dir != "ASC") {
std::nth_element(_arr.begin(), _arr.begin()+p-1, _arr.end(), std::greater<double>()); // compare as double
} else {
std::nth_element(_arr.begin(), _arr.begin()+p-1, _arr.end());
}
return _arr;
}
// [[Rcpp::export]]
NumericVector cpp_sort_numeric(NumericVector arr, NumericVector partial, std::string dir = "ASC") {
NumericVector _arr = clone(arr);
if (partial[0] == -1) { // only positive values allowed ...
if(dir != "ASC") {
std::sort(_arr.begin(), _arr.end(), std::greater<double>()); // compare as double, the element type
} else {
std::sort(_arr.begin(), _arr.end());
}
} else {
for (auto& p : partial) {
_arr = _partial_sort(_arr, p, dir);
}
}
return _arr;
}
/*** R
v <- c(1,2,3,2,1,0,-1,2)
cpp_sort_numeric_works(v)
cpp_sort_numeric_works(v, "DESC")
w <- v
w[1] <- -1
cpp_sort_numeric(v, w)
cpp_sort_numeric(v, w, "DESC")
*/
Output
> Rcpp::sourceCpp("~/git/stackoverflow/73222485/answer.cpp")
> v <- c(1,2,3,2,1,0,-1,2)
> cpp_sort_numeric_works(v)
[1] -1 0 1 1 2 2 2 3
> cpp_sort_numeric_works(v, "DESC")
[1] 3 2 2 2 1 1 0 -1
> w <- v
> w[1] <- -1
> cpp_sort_numeric(v, w)
[1] -1 0 1 1 2 2 2 3
> cpp_sort_numeric(v, w, "DESC")
[1] 3 2 2 2 1 1 0 -1
>
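One detail worth checking when sorting a NumericVector in descending order is the type argument of the comparator. NumericVector holds doubles, and std::greater<int> converts both arguments to int before comparing, so values that differ only in their fractional part compare as equal. The example data above consists of whole numbers, so it does not expose this. A minimal sketch in plain C++, with std::vector<double> as a stand-in and hypothetical function names:

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Sort a copy in descending order. The comparator's template argument
// matches the element type, so fractional parts are taken into account.
std::vector<double> sort_descending(std::vector<double> values) {
    std::sort(values.begin(), values.end(), std::greater<double>());
    return values;
}

// What std::greater<int> sees when it is handed two doubles:
// both arguments are converted to int first, so 1.7 and 1.2 both become 1.
bool greater_as_int(double a, double b) {
    return std::greater<int>()(a, b);
}
```

Here greater_as_int(1.7, 1.2) is false even though 1.7 is larger, which is why the element type is the safer choice for the comparator.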
How to deal with factors in Rcpp
Note: Throughout, I will refer to f, not c. It is bad practice to give variables the same name as a builtin function or constant, such as c, T, or F. Therefore I change the beginning of your code as follows:
library(Rcpp)
f <- factor(c("E", "H", "E", "12", "10", "60", "80", "11", "H", "H"))
In addition to looking at class(f) and storage.mode(f), it's useful to look at str(f):
str(f)
# Factor w/ 7 levels "10","11","12",..: 6 7 6 3 1 4 5 2 7 7
In truth, a factor is an integer vector with "levels": a character vector whose elements correspond to the unique integer values. Luckily, you can get this from C++ using the .attr() member function of Rcpp::IntegerVector:
cppFunction('CharacterVector fun(IntegerVector x){
// creates an empty character vector the size/length of x.
CharacterVector y = x.size() ;
// Get the levels of x
CharacterVector levs = x.attr("levels");
int n = x.size() - 1 ;
//loop
for(int i = 0; i <= n; i = i + 1){
if(levs[x[i]-1] == "H"){
y[i] = "Home" ;
}else if(levs[x[i]-1] == "E"){
y[i] = "Elsewhere" ;
}else{
y[i] = "Number" ;
} ;
}
return y ;
}')
fun(f)
# [1] "Elsewhere" "Home" "Elsewhere" "Number" "Number" "Number"
# [7] "Number" "Number" "Home" "Home"
So, to get what you want, you had to do three things:
- Change the return type from IntegerVector to CharacterVector (though you were completely right that the input should be IntegerVector)
- Get the levels of the factor using CharacterVector levs = x.attr("levels");
- Compare levs[x[i]-1] to "H", etc., rather than x[i]. x[i] will always be an integer, giving the element of the vector of levels it corresponds to. We do -1 since C++ is 0-indexed and R is 1-indexed.
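The index arithmetic in the last point can be tried in isolation. This is a minimal sketch in plain C++, assuming the factor has already been split into its 1-based integer codes and its levels; recode_factor is a hypothetical name, and NA codes are not handled.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// codes:  1-based integer codes, as stored in an R factor (no NA handling here).
// levels: the factor's "levels" attribute.
std::vector<std::string> recode_factor(const std::vector<int>& codes,
                                       const std::vector<std::string>& levels) {
    std::vector<std::string> out(codes.size());
    for (std::size_t i = 0; i < codes.size(); ++i) {
        const std::string& label = levels[codes[i] - 1];  // 1-based -> 0-based
        if (label == "H") {
            out[i] = "Home";
        } else if (label == "E") {
            out[i] = "Elsewhere";
        } else {
            out[i] = "Number";
        }
    }
    return out;
}
```

With the codes 6 7 6 3 1 4 5 2 7 7 and the levels "10", "11", "12", "60", "80", "E", "H" from str(f), this reproduces the result of fun(f) shown above.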
Other notes:
It is clear, as you say, that "[you're] attempting to learn how to use Rcpp() in R." You'll definitely want to spend some time with resources such as Rcpp for Everyone (that's the chapter on factors), the Rcpp Gallery (this specific link is an article on factors), Hadley's chapter on Rcpp, and definitely the Rcpp vignettes available here.
Is there an efficient way to obtain pmax other than using the R base function?
There seem to be a few issues, which the memory allocations reported by bench::mark help uncover.
bench::mark(pmax(x, y, z, w, na.rm = TRUE),
Pmax2(x, y, z, w))
## # A tibble: 2 x 13
## expression min median `itr/sec` mem_alloc
## <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt>
## 1 pmax(x, y, z, w, na.rm = TRUE) 5.79ms 6.28ms 157. 781.3KB
## 2 Pmax2(x, y, z, w) 39.56ms 54.48ms 19.7 9.18MB
Memory Coercion
There is roughly 10 times the memory allocation in comparison to base pmax(). Your Rcpp code is relatively straightforward, so this hints that there is some kind of coercion. Looking at your sample data, you are sending integer vectors to a numeric signature, which creates a costly coercion. Let's update the signature and code to expect IntegerVectors. I simply changed everything from NumericVector to IntegerVector for this.
expression min median `itr/sec` mem_alloc
<bch:expr> <bch:t> <bch:t> <dbl> <bch:byt>
1 pmax(x, y, z, w, na.rm = TRUE) 1.89ms 2.33ms 438. 781.3KB
2 Pmax2_int(x, y, z, w) 37.42ms 49.88ms 17.6 2.32MB
Re-Compilation
The OP code includes cppFunction within the larger function code. Unless we need to recompile it every loop, we can instead compile once and then call the compiled code from R. This is the biggest performance boost for this dataset size.
cppFunction("
IntegerVector cpp_pmax_pre(List args) {
IntegerVector out = args[0];
int n_arg = args.length();
int n_vec = out.length();
for (int j = 0; j < n_vec; ++j) {
if (IntegerVector::is_na(out[j])) out[j] = -1;
}
for (int i = 1; i < n_arg; ++i) {
IntegerVector pa = args[i];
for (int j = 0; j < n_vec; ++j) {
if (IntegerVector::is_na(pa[j])) pa[j] = -1;
out[j] = std::max(out[j], pa[j]);
}
}
return out;
}
")
Pmax2_int_pre <- function(...) {
args_list <- list(...)
output <- cpp_pmax_pre(args_list)
output[output == -1] <- NA
return(output)
}
bench::mark(pmax(x, y, z, w, na.rm = TRUE),
Pmax2_int_pre(x, y, z, w))
## # A tibble: 2 x 13
## expression min median `itr/sec` mem_alloc
## <bch:expr> <bch:> <bch:> <dbl> <bch:byt>
## 1 pmax(x, y, z, w, na.rm = TRUE) 2.31ms 2.42ms 397. 781.3KB
## 2 Pmax2_int_pre(x, y, z, w) 2.48ms 3.55ms 270. 2.29MB
More memory and small optimizations
Finally, we still have more memory allocated than base pmax(). That hints we can do more: in this case, we should restore the NA values inside Rcpp instead of in R. Relatedly, we can optimize the loop assignment somewhat.
cppFunction("
IntegerVector cpp_pmax_final(List args) {
IntegerVector out = args[0];
int n_arg = args.length();
int n_vec = out.length();
for (int j = 0; j < n_vec; ++j) {
if (IntegerVector::is_na(out[j])) out[j] = -1;
}
for (int i = 1; i < n_arg; ++i) {
IntegerVector pa = args[i];
for (int j = 0; j < n_vec; ++j) {
// simplify logic; if the element is not na and is greater than the out, update out.
if (!IntegerVector::is_na(pa[j]) && pa[j] > out[j]) out[j] = pa[j];
}
}
// update now in Rcpp instead of allocating vectors in R
for (int i = 0; i < n_vec; i++) {
if(out[i] == -1) out[i] = NA_INTEGER;
}
return out;
}
")
Pmax2_final <- function(...) {
cpp_pmax_final(list(...))
}
bench::mark(pmax(x, y, z, w, na.rm = TRUE),
Pmax2_final(x, y, z, w))
## # A tibble: 2 x 13
## expression min median `itr/sec` mem_alloc
## <bch:expr> <bch:> <bch:> <dbl> <bch:byt>
## 1 pmax(x, y, z, w, na.rm = TRUE) 2ms 2.08ms 460. 781.3KB
## 2 Pmax2_final(x, y, z, w) 1.19ms 1.45ms 671. 2.49KB
We did it*! I am sure there could be further small optimizations: we access pa[j] three times, so it may be worthwhile to assign it to a variable.
Bonus - NA_INTEGER
According to Rcpp for Everyone, NA_INTEGER should be equivalent to the lowest integer value, -2147483648. Using this, we can remove the replacement of NAs, because we can compare directly against NA when dealing with int data types.
While realizing this, I also found an issue with the previous part: we need to clone the initial argument so that we are not accidentally changing it by reference. Even so, we are still slightly faster than base pmax().
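Why the plain comparison is enough can be seen in a minimal sketch in plain C++. It assumes only what is stated above, namely that the integer NA is the smallest int value; kNaInt and pmax_int are hypothetical names, and std::vector stands in for IntegerVector.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Stand-in for R's NA_INTEGER, which is the smallest int value.
const int kNaInt = std::numeric_limits<int>::min();

// Element-wise maximum across several equal-length int vectors, ignoring NA.
// Because the sentinel is smaller than every other int, a plain '>' comparison
// never lets NA replace a real value, and a real value always replaces NA.
std::vector<int> pmax_int(const std::vector<std::vector<int> >& args) {
    if (args.empty()) return std::vector<int>();
    std::vector<int> out = args[0];  // copy, so the first input is not modified
    for (std::size_t i = 1; i < args.size(); ++i) {
        for (std::size_t j = 0; j < out.size(); ++j) {
            if (args[i][j] > out[j]) out[j] = args[i][j];
        }
    }
    return out;
}
```

A position stays NA only when every input is NA there, which matches the na.rm = TRUE behaviour being benchmarked. The copy of the first vector plays the role of clone().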
cppFunction("
IntegerVector cpp_pmax_last(List args) {
IntegerVector tmp = args[0];
IntegerVector out = clone(tmp);
int n_arg = args.length();
int n_vec = out.length();
for (int i = 1; i < n_arg; ++i) {
IntegerVector pa = args[i];
for (int j = 0; j < n_vec; ++j) {
if (pa[j] > out[j]) out[j] = pa[j];
}
}
return out;
}
")
Pmax2_last <- function(...) {
cpp_pmax_last(list(...))
}
bench::mark(pmax(x, y, z, w, na.rm = TRUE),
Pmax2_last(x, y, z, w),
)
## # A tibble: 2 x 13
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:> <bch:> <dbl> <bch:byt> <dbl>
## 1 pmax(x, y, z, w, na.rm = TRUE) 5.98ms 6.36ms 154. 781KB 0
## 2 Pmax2_last(x, y, z, w) 5.09ms 5.46ms 177. 784KB 0
Rcpp swap function with NumericVector
Building on @r2evans' comments, here's a minimal implementation:
#include <Rcpp.h>
template <int T>
void swap_templ(Rcpp::Vector<T> x) {
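// use the vector's own element type so integers are not routed through double
typename Rcpp::traits::storage_type<T>::type tmp = x[0];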
x[0] = x[1];
x[1] = tmp;
}
// [[Rcpp::export]]
void swap(SEXP x) {
switch (TYPEOF(x)) {
case INTSXP:
swap_templ<INTSXP>(x);
break;
case REALSXP:
swap_templ<REALSXP>(x);
break;
default:
Rcpp::Rcout <<
"\nInput vector must be numeric or integer type" <<
std::endl;
break;
}
}
/*** R
iv <- 1L:3L
dv <- 1:3 + 0.5
class(iv)
class(dv)
swap(iv); iv
swap(dv); dv
class(iv)
class(dv)
*/
Output:
R> class(iv)
[1] "integer"
R> class(dv)
[1] "numeric"
R> swap(iv); iv
[1] 2 1 3
R> swap(dv); dv
[1] 2.5 1.5 3.5
R> class(iv)
[1] "integer"
R> class(dv)
[1] "numeric"