Rcpp Function to Select (And to Return) a Sub-Dataframe

You don't need Rcpp and RcppArmadillo for that; you can just use R's subset or perhaps dplyr::filter. This is likely to be more efficient than your code, which has to deep copy data from the data frame into armadillo vectors, create new armadillo vectors, and then copy these back into R vectors so that you can build the data frame. This produces lots of waste. Another source of waste is that you compute the exact same thing three times.

Anyway, to answer your question, just use DataFrame::create.

DataFrame::create( _["id"] = id_sub, _["alpha"] = alph_sub, _["mess"] = mess_sub );

Also, note that in your code, alpha will be a factor, so arma::vec alph = Rcpp::as<arma::vec>(myDF["alpha"]); is not likely to do what you want.

Subsetting a data.frame in Rcpp by id yielding 'not compatible with request type'

Okay, so what you are really trying to do here is just subset the data.frame by row ids in Rcpp.

e.g.

D[c(2,4,7,10),]

First up, in your code you define:

std::map<double, DataFrame> X;

There is no wrap() conversion to deal with an object of this type. Furthermore, an explicit wrap() call is unnecessary in this case, because the return value is converted automatically based on the return type specified by the function.

To subset a data.frame efficiently, do not use the .push_back() feature: it always requires a full copy and is therefore not very efficient.

Instead, you want to use the idx variable and Rcpp vector subsetting like so:

#include <Rcpp.h>

// Extract rows from data.frame object in Rcpp
// [[Rcpp::export]]
Rcpp::DataFrame matchRows(Rcpp::DataFrame D, Rcpp::IntegerVector idx) {

  // First, break apart each vector
  Rcpp::IntegerVector val1 = D["val1"];
  Rcpp::NumericVector val2 = D["val2"];
  Rcpp::CharacterVector val3 = D["val3"];
  Rcpp::NumericVector val4 = D["val4"];

  // We assume that the index passed in starts at 1.
  // Hence, we need to adjust the idx to start at 0 with:
  idx = idx - 1;

  // Next up, create a new DataFrame Object with selected rows subset.
  return Rcpp::DataFrame::create(Rcpp::Named("val1") = val1[idx],
                                 Rcpp::Named("val2") = val2[idx],
                                 Rcpp::Named("val3") = val3[idx],
                                 Rcpp::Named("val4") = val4[idx]
                                 );
}

/*** R
# Make some data
set.seed(1337)
D = data.frame(val1 = 1:10,
               val2 = rnorm(10),
               val3 = letters[1:10],
               val4 = sample(1:100, 10),
               stringsAsFactors = FALSE)

# Create index that starts at 1 instead of 0.
# This will be converted in the C++ function.
idx = c(2,4,7,10)

matchRows(D, idx)

*/

The devil in the details is that we reduce the index by 1, to account for C++'s indices starting at 0 versus R's starting at 1, before using it to subset. Here this is done inside the C++ function; it could just as well be done on the R side before the call.
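To make the index shift concrete, here is a minimal sketch in plain standard C++. It does not use Rcpp: std::vector stands in for an Rcpp vector, and subset_one_based is a hypothetical helper name chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Rcpp): pick elements of `column`
// using 1-based positions, the way R would supply them.
template <typename T>
std::vector<T> subset_one_based(const std::vector<T>& column,
                                const std::vector<int>& idx) {
    std::vector<T> out(idx.size());
    for (std::size_t k = 0; k < idx.size(); ++k) {
        // Precondition: every position is a valid 1-based index.
        assert(idx[k] >= 1 &&
               static_cast<std::size_t>(idx[k]) <= column.size());
        // Shift from R's 1-based position to C++'s 0-based offset.
        out[k] = column[static_cast<std::size_t>(idx[k]) - 1];
    }
    return out;
}
```

Applying the same idx to every column and collecting the results corresponds to what the DataFrame::create call does with val1[idx], val2[idx], and so on.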

Rcpp extract row of a DataFrame

There is no such thing as a data frame row; it only exists virtually. So what you have is pretty close to what you ought to do. However, you should use a NumericVector instead of a std::vector<double>, which would copy all of the data from the column for almost nothing.

Updated pseudo code:

DataFrame myFunc(DataFrame& x) {
  ...

  // Suppose I need to get the 10th row
  int nCols = x.size();
  NumericVector y(nCols);
  for (int j = 0; j < nCols; j++) {
    NumericVector column = x[j];
    y[j] = column[9];  // 0-based index 9 is the 10th row
  }

  ...
}
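The idea that a row exists only virtually can be sketched in plain standard C++, with one std::vector per column standing in for the data frame. Columns and extract_row are hypothetical names for illustration, and all columns are assumed to be numeric, as in the pseudo code above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an all-numeric data frame: one vector per column.
using Columns = std::vector<std::vector<double>>;

// Assemble row `row` (0-based) by reading the same position from every column.
std::vector<double> extract_row(const Columns& columns, std::size_t row) {
    std::vector<double> out(columns.size());
    for (std::size_t j = 0; j < columns.size(); ++j) {
        // Bind by reference so the column itself is not copied.
        const std::vector<double>& column = columns[j];
        assert(row < column.size());
        out[j] = column[row];  // the output is indexed by the column number j
    }
    return out;
}
```

The row is never stored anywhere in the input; it is built on demand from one element per column.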

Rcpp subsetting rows of DataFrame

cppFunction('LogicalVector test(DataFrame x, StringVector level_of_species) {
  using namespace std;
  StringVector sub = x["Species"];
  std::string level = Rcpp::as<std::string>(level_of_species[0]);
  Rcpp::LogicalVector ind(sub.size());
  for (int i = 0; i < sub.size(); i++){
    ind[i] = (sub[i] == level);
  }

  return(ind);
}')

xx=test(iris, "setosa")
> table(xx)
xx
FALSE TRUE
100 50

Subsetting done! (I myself learnt a lot from this question, thanks!)

cppFunction('Rcpp::DataFrame test(DataFrame x, StringVector level_of_species) {
  using namespace std;
  StringVector sub = x["Species"];
  std::string level = Rcpp::as<std::string>(level_of_species[0]);
  Rcpp::LogicalVector ind(sub.size());
  for (int i = 0; i < sub.size(); i++){
    ind[i] = (sub[i] == level);
  }

  // extracting each column into a vector
  Rcpp::NumericVector SepalLength = x["Sepal.Length"];
  Rcpp::NumericVector SepalWidth = x["Sepal.Width"];
  Rcpp::NumericVector PetalLength = x["Petal.Length"];
  Rcpp::NumericVector PetalWidth = x["Petal.Width"];

  return Rcpp::DataFrame::create(Rcpp::Named("Sepal.Length") = SepalLength[ind],
                                 Rcpp::Named("Sepal.Width") = SepalWidth[ind],
                                 Rcpp::Named("Petal.Length") = PetalLength[ind],
                                 Rcpp::Named("Petal.Width") = PetalWidth[ind]
                                 );}')

yy=test(iris, "setosa")
> str(yy)
'data.frame': 50 obs. of 4 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
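The two steps used above, building a logical mask and then keeping the matching elements of each column, can be sketched in plain standard C++. This is only an illustration with std::vector in place of the Rcpp vectors; make_mask and keep_where are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Step 1: mark the positions where `labels` equals `level`.
std::vector<bool> make_mask(const std::vector<std::string>& labels,
                            const std::string& level) {
    std::vector<bool> mask(labels.size());
    for (std::size_t i = 0; i < labels.size(); ++i) {
        mask[i] = (labels[i] == level);
    }
    return mask;
}

// Step 2: keep the elements of `column` whose mask entry is true.
template <typename T>
std::vector<T> keep_where(const std::vector<T>& column,
                          const std::vector<bool>& mask) {
    assert(column.size() == mask.size());
    std::vector<T> out;
    for (std::size_t i = 0; i < column.size(); ++i) {
        if (mask[i]) {
            out.push_back(column[i]);
        }
    }
    return out;
}
```

The mask is computed once and then reused for every column, which mirrors how ind is applied to each of the four numeric columns.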

Avoid SIGSEGV when subsetting data.frame with call to `[data.frame` in Rcpp

The problem here is that LogicalVector::create() is not doing what you expect here -- it's returning a vector of length two, with the elements TRUE and TRUE. In other words, your code:

LogicalVector filter = LogicalVector::create(n, TRUE);

generates not a logical vector of length n with values TRUE, but instead a logical vector of length two with the first element being 'truthy' and so TRUE, and the second explicitly TRUE.

You likely intended to just use the regular constructor, e.g. LogicalVector(n, TRUE).
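Standard C++ containers have an analogous pitfall, which may make the distinction easier to remember. The sketch below uses std::vector<int> rather than Rcpp's LogicalVector, and the two function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Braces select the initializer-list constructor:
// the arguments ARE the elements, so the result has length 2.
std::vector<int> from_elements(int n) {
    return std::vector<int>{n, 1};
}

// Parentheses select the fill constructor:
// the result has n elements, each equal to 1.
std::vector<int> filled(int n) {
    return std::vector<int>(static_cast<std::size_t>(n), 1);
}
```

For n = 5, from_elements returns a vector of length 2 and filled returns a vector of length 5, just as LogicalVector::create(n, TRUE) and LogicalVector(n, TRUE) differ.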

Rcpp - extracting rows from list of matrices / dataframes

Here is one way to do it:

#include <Rcpp.h>

// x[[nx]][ny,] -> y[[ny]][[nx]]

// [[Rcpp::export]]
Rcpp::List Transform(Rcpp::List x) {
  R_xlen_t nx = x.size(), ny = Rcpp::as<Rcpp::NumericMatrix>(x[0]).nrow();
  Rcpp::List y(ny);

  for (R_xlen_t iy = 0; iy < ny; iy++) {
    Rcpp::List tmp(nx);
    for (R_xlen_t ix = 0; ix < nx; ix++) {
      Rcpp::NumericMatrix mtmp = Rcpp::as<Rcpp::NumericMatrix>(x[ix]);
      tmp[ix] = mtmp.row(iy);
    }
    y[iy] = tmp;
  }

  return y;
}

/*** R

L1 <- lapply(1:10, function(x) {
matrix(rnorm(20), ncol = 5)
})

L2 <- lapply(1:nrow(L1[[1]]), function(x) {
lapply(L1, function(y) unlist(y[x,]))
})

all.equal(L2, Transform(L1))
#[1] TRUE

microbenchmark::microbenchmark(
"R" = lapply(1:nrow(L1[[1]]), function(x) {
lapply(L1, function(y) unlist(y[x,]))
}),
"Cpp" = Transform(L1),
times = 200L)

#Unit: microseconds
#expr min lq mean median uq max neval
# R 254.660 316.627 383.92739 347.547 392.7705 1909.097 200
#Cpp 18.314 26.007 71.58795 30.230 38.8650 945.167 200

*/

I'm not sure how this will scale; I think it is just an inherently inefficient transformation. As per my comment at the top of the source, it seems like you are just doing a sort of coordinate swap -- the nyth row of the nxth element of the input list becomes the nxth element of the nyth element of the output list:

x[[nx]][ny,]  ->  y[[ny]][[nx]]
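The coordinate swap itself can be sketched in plain standard C++ with nested std::vector objects standing in for the list of matrices. Row, Matrix, and swap_levels are hypothetical names for illustration, and every matrix is assumed to have the same number of rows as the first one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Row = std::vector<double>;
using Matrix = std::vector<Row>;  // a matrix stored as a vector of rows

// Input:  x[ix][iy] is row iy of matrix ix.
// Output: y[iy][ix] is that same row, with the two outer levels swapped.
std::vector<std::vector<Row>> swap_levels(const std::vector<Matrix>& x) {
    const std::size_t nx = x.size();
    const std::size_t ny = x.empty() ? 0 : x[0].size();
    std::vector<std::vector<Row>> y(ny, std::vector<Row>(nx));
    for (std::size_t iy = 0; iy < ny; ++iy) {
        for (std::size_t ix = 0; ix < nx; ++ix) {
            assert(x[ix].size() == ny);  // all matrices share the row count
            y[iy][ix] = x[ix][iy];
        }
    }
    return y;
}
```

Every row is copied exactly once, so the work grows with the total number of matrix elements regardless of how the loops are arranged.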

To address the errors you were getting, Rcpp::List is a generic object - technically an Rcpp::Vector<VECSXP> - so when you try to do, e.g.

my_list[i].row(nr)

the compiler doesn't know that my_list[i] is a NumericMatrix. Therefore, you have to make an explicit cast with Rcpp::as<>:

Rcpp::NumericMatrix mtmp = Rcpp::as<Rcpp::NumericMatrix>(x[ix]);
tmp[ix] = mtmp.row(iy);

I just used matrix elements in the example data to simplify things. In practice you are probably better off coercing data.frames to matrix objects directly in R than trying to do it in C++; it will be much simpler, and most likely, the coercion is just calling underlying C code, so there isn't really anything to be gained trying to do it otherwise.


I should also point out that if you are using a Rcpp::List of homogeneous types, you can squeeze out a little more performance with Rcpp::ListOf<type>. This will allow you to skip the Rcpp::as<type> conversions done above:

typedef Rcpp::ListOf<Rcpp::NumericMatrix> MatList;

// [[Rcpp::export]]
Rcpp::List Transform2(MatList x) {
  R_xlen_t nx = x.size(), ny = x[0].nrow();
  Rcpp::List y(ny);

  for (R_xlen_t iy = 0; iy < ny; iy++) {
    Rcpp::List tmp(nx);
    for (R_xlen_t ix = 0; ix < nx; ix++) {
      tmp[ix] = x[ix].row(iy);
    }
    y[iy] = tmp;
  }

  return y;
}

/*** R

L1 <- lapply(1:10, function(x) {
matrix(rnorm(20000), ncol = 100)
})

L2 <- lapply(1:nrow(L1[[1]]), function(x) {
lapply(L1, function(y) unlist(y[x,]))
})

microbenchmark::microbenchmark(
"R" = lapply(1:nrow(L1[[1]]), function(x) {
lapply(L1, function(y) unlist(y[x,]))
}),
"Transform" = Transform(L1),
"Transform2" = Transform2(L1),
times = 200L)

#Unit: microseconds
# expr min lq mean median uq max neval
# R 6049.594 6318.822 7604.871 6707.242 8592.510 64005.190 200
# Transform 928.468 1041.936 3130.959 1166.819 1659.745 71552.284 200
#Transform2 850.912 957.918 1694.329 1061.183 2856.724 4502.065 200

*/

Iterate over vectors from an imported dataframe row-wise

I am not convinced that you will gain performance from going to C++ here. However, if you have a set of vectors of equal length (data.frame guarantees that), then you can simply iterate with one index:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
DataFrame modifyDataFrame(DataFrame df) {

  // access the columns
  IntegerVector a = df["a"];
  IntegerVector b = df["b"];
  CharacterVector c = df["c"];
  NumericVector d = df["d"];
  CharacterVector e = df["e"];

  for (int i = 0; i < df.nrow(); ++i) {
    a(i) += 1;
    b(i) += 2;
    c(i) += "c";
    d(i) += 3;
    e(i) += "e";
  }
  // return a new data frame
  return DataFrame::create(_["a"] = a, _["b"] = b, _["c"] = c, _["d"] = d, _["e"] = e);
}
/*** R
a <- c(0, 2, 4, 6, 8, 10)
b <- c(1, 3, 5, 7, 9, 11)
c <- c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1")
d <- c(10.2, 10.2, 4.3, 4.3, 3.4, 7.9)
e <- c("a", "t", "t", "g", "c", "a")

df <- data.frame(a, b, c, d, e)
modifyDataFrame(df)
*/

Result:

> modifyDataFrame(df)
   a  b     c    d  e
1  1  3 chr1c 13.2 ae
2  3  5 chr1c 13.2 te
3  5  7 chr1c  7.3 te
4  7  9 chr1c  7.3 ge
5  9 11 chr1c  6.4 ce
6 11 13 chr1c 10.9 ae

Here I am using the nrow() method of the DataFrame class, cf. the Rcpp API. This uses R's C API, just as the length() method does. I just find it more logical to use a DataFrame method than to single out one of the vectors to retrieve the length. The result would be the same.

As for a sliding window I would look into the RcppRoll package first.

Return subset of a given SEXP without knowing the actual internal data type

You can use a C++ template together with the RCPP_RETURN_VECTOR macro. This macro will make sure that the template is instantiated for all(?) R data types:

#include <Rcpp.h>
// [[Rcpp::plugins(cpp11)]]

template <int RTYPE>
Rcpp::Vector<RTYPE> debug_subset_impl(Rcpp::Vector<RTYPE> x,
                                      R_xlen_t index_from,
                                      R_xlen_t index_to){
  // range [index_from, index_to)
  Rcpp::Vector<RTYPE> subset(index_to - index_from);
  std::copy(x.cbegin() + index_from, x.cbegin() + index_to, subset.begin());
  // special case for factors == INTSXP with "class" and "levels" attribute
  if (x.hasAttribute("levels")){
    subset.attr("class") = x.attr("class");
    subset.attr("levels") = x.attr("levels");
  }
  return subset;
}

// [[Rcpp::export]]
SEXP dbg_subset(SEXP x, R_xlen_t index_from, R_xlen_t index_to){
  // 1-based -> 0-based
  RCPP_RETURN_VECTOR(debug_subset_impl, x, index_from - 1, index_to - 1);
}

/*** R
set.seed(42)
dbg_subset(1:100, 3, 6)
dbg_subset(runif(100), 3, 6)
dbg_subset(letters, 3, 6)
dbg_subset(as.factor(letters), 3, 6)
*/

Output:

> Rcpp::sourceCpp('58965423.cpp')

> set.seed(42)

> dbg_subset(1:100, 3, 6)
[1] 3 4 5

> dbg_subset(runif(100), 3, 6)
[1] 0.2861395 0.8304476 0.6417455

> dbg_subset(letters, 3, 6)
[1] "c" "d" "e"

> dbg_subset(as.factor(letters), 3, 6)
[1] c d e
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
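The half-open range copy at the heart of debug_subset_impl can be sketched in plain standard C++. Here the element type is a compile-time template parameter of std::vector, whereas the Rcpp macro selects the instantiation from the SEXP type at run time. slice_one_based is a hypothetical name, and the sketch does not handle attributes such as factor levels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Copy the 1-based half-open range [from, to) of `x`, for any element
// type (int, double, std::string, ...). `to` itself is excluded.
template <typename T>
std::vector<T> slice_one_based(const std::vector<T>& x,
                               std::size_t from, std::size_t to) {
    assert(from >= 1 && from <= to && to - 1 <= x.size());
    // 1-based [from, to) becomes 0-based [from - 1, to - 1).
    const auto first = x.cbegin() + static_cast<std::ptrdiff_t>(from - 1);
    const auto last = x.cbegin() + static_cast<std::ptrdiff_t>(to - 1);
    std::vector<T> out(to - from);
    std::copy(first, last, out.begin());
    return out;
}
```

With from = 3 and to = 6 this returns the third, fourth, and fifth elements, matching the three-element results in the output above.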

