Rcpp Function to Select (And to Return) a Sub-Dataframe

You don't need Rcpp and RcppArmadillo for that; you can just use R's subset or perhaps dplyr::filter. This is likely to be more efficient than your code, which has to deep copy data from the data frame into armadillo vectors, create new armadillo vectors, and then copy these back into R vectors so that you can build the data frame. This produces a lot of waste. Another source of waste is that you compute the exact same thing three times.

Anyway, to answer your question, just use DataFrame::create.

DataFrame::create( _["id"] = id_sub, _["alpha"] = alph_sub, _["mess"] = mess_sub ) ;

Also, note that in your code, alpha will be a factor, so arma::vec alph = Rcpp::as<arma::vec>(myDF["alpha"]); is not likely to do what you want.

Subsetting a data.frame in Rcpp by id yielding 'not compatible with request type'

Okay, so what you are really trying to do here is just subset the data.frame by row ids in Rcpp.

e.g.

D[c(2,4,7,10),]

First up, in your code you define:

std::map<double, DataFrame> X;

There is no wrap() conversion to deal with an object of this type. Furthermore, wrap() is unnecessary here, because the return value is converted automatically based on the return type declared by the function.

To subset a data.frame efficiently, do not use the .push_back() feature: it always requires a full copy and is therefore not very efficient.

Instead, you want to use the idx variable and Rcpp vector subsetting like so:

#include <Rcpp.h>

// Extract rows from data.frame object in Rcpp
// [[Rcpp::export]]
Rcpp::DataFrame matchRows(Rcpp::DataFrame D, Rcpp::IntegerVector idx) {

  // First, break apart each vector
  Rcpp::IntegerVector   val1 = D["val1"];
  Rcpp::NumericVector   val2 = D["val2"];
  Rcpp::CharacterVector val3 = D["val3"];
  Rcpp::NumericVector   val4 = D["val4"];

  // We assume that the index passed in starts at 1. 
  // Hence, we need to adjust the idx to start at 0 with:
  idx = idx - 1;

  // Next up, create a new DataFrame Object with selected rows subset. 
  return Rcpp::DataFrame::create(Rcpp::Named("val1")  = val1[idx],
                                 Rcpp::Named("val2")  = val2[idx],
                                 Rcpp::Named("val3")  = val3[idx],
                                 Rcpp::Named("val4")  = val4[idx]
                                 );
}

/*** R
# Make some data
set.seed(1337)
D = data.frame(val1 = 1:10, 
               val2 = rnorm(10), 
               val3 = letters[1:10], 
               val4 = sample(1:100, 10),
               stringsAsFactors = FALSE)

# Create index that starts at 1 instead of 0. 
# This will be converted in the C++ function.
idx = c(2,4,7,10) 

matchRows(D, idx)

*/

The devil is in the details: we reduce the index by 1 to account for C++'s indices starting at 0 versus R's at 1 before using it to subset. Here the shift happens inside the C++ function; it could just as well be done on the R side before the call, which I'll leave as an exercise.
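To make the index shift concrete, here is a minimal sketch in plain standard C++ (no Rcpp, so it is only an analogy to the `val1[idx]` subsetting above). The helper name `select_rows` is hypothetical. It assumes the caller passes R-style 1-based positions and, unlike the Rcpp example, it rejects out-of-range positions explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of Rcpp): pick elements of `column`
// using R-style 1-based positions in `idx1`.
template <typename T>
std::vector<T> select_rows(const std::vector<T>& column,
                           const std::vector<int>& idx1) {
    std::vector<T> out;
    out.reserve(idx1.size());
    for (int i : idx1) {
        // Reject positions outside 1..n instead of reading out of bounds.
        if (i < 1 || static_cast<std::size_t>(i) > column.size()) {
            throw std::out_of_range("row index outside 1..n");
        }
        // 1-based (R) -> 0-based (C++)
        out.push_back(column[static_cast<std::size_t>(i) - 1]);
    }
    assert(out.size() == idx1.size());  // one output element per requested row
    return out;
}
```

Calling `select_rows` once per column with the same index vector mirrors what `matchRows` does with `val1[idx]`, `val2[idx]`, and so on.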

Rcpp extract row of a DataFrame

There is no such thing as a data frame row; it only exists virtually. So what you have is pretty close to what you ought to do. However, you should use a NumericVector instead of a std::vector<double>, which would copy all of the data from the column for almost nothing.

Updated pseudo code:

DataFrame myFunc(DataFrame& x) {
    ...

    // Suppose I need to get the 10th row
    int nCols=x.size();
    NumericVector y(nCols);
    for (int j=0; j<nCols;j++) {
        NumericVector column = x[j] ;
        y[j] = column[9] ;
    }

    ...
}
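The idea that a row only exists "virtually" can be sketched in plain standard C++ (no Rcpp; `Column` and `extract_row` are hypothetical names used for illustration). The table is assumed to be stored as a set of equal-length numeric columns, and a row is assembled by reading one element from each column. Passing the columns by const reference plays roughly the role that NumericVector plays above: no column is copied just to read a single value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Column = std::vector<double>;

// Hypothetical helper: assemble row `row` (0-based) from column-wise storage.
std::vector<double> extract_row(const std::vector<Column>& columns,
                                std::size_t row) {
    std::vector<double> out(columns.size());
    for (std::size_t j = 0; j < columns.size(); ++j) {
        assert(row < columns[j].size());  // every column must contain this row
        out[j] = columns[j][row];         // one element per column
    }
    return out;
}
```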

Rcpp subsetting rows of DataFrame

cppFunction('LogicalVector test(DataFrame x, StringVector level_of_species) {
  using namespace std;  
  StringVector sub = x["Species"];
  std::string level = Rcpp::as<std::string>(level_of_species[0]);
  Rcpp::LogicalVector ind(sub.size());
  for (int i = 0; i < sub.size(); i++){
      ind[i] = (sub[i] == level);
  }

  return(ind);
}')

xx=test(iris, "setosa")
> table(xx)
 xx
 FALSE  TRUE 
   100    50

Subsetting done! (I learnt a lot from this question myself, thanks!)

cppFunction('Rcpp::DataFrame test(DataFrame x, StringVector level_of_species) {
  using namespace std;  
  StringVector sub = x["Species"];
  std::string level = Rcpp::as<std::string>(level_of_species[0]);
  Rcpp::LogicalVector ind(sub.size());
  for (int i = 0; i < sub.size(); i++){
    ind[i] = (sub[i] == level);
  }

 // extracting each column into a vector
 Rcpp::NumericVector   SepalLength = x["Sepal.Length"];
 Rcpp::NumericVector   SepalWidth = x["Sepal.Width"];
 Rcpp::NumericVector PetalLength = x["Petal.Length"];
 Rcpp::NumericVector   PetalWidth = x["Petal.Width"];

 return Rcpp::DataFrame::create(Rcpp::Named("Sepal.Length")  = SepalLength[ind],
                                Rcpp::Named("Sepal.Width")  = SepalWidth[ind],
                                Rcpp::Named("Petal.Length")  = PetalLength[ind],
                                Rcpp::Named("Petal.Width")  = PetalWidth[ind]
);}')

yy=test(iris, "setosa")
> str(yy)
 'data.frame':  50 obs. of  4 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
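The two steps used above (build a logical mask, then apply it to each column) can be sketched in plain standard C++ as follows. This is only an analogy: it uses std::vector instead of Rcpp vectors, and the names `match_level` and `keep_where` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Step 1: build the mask, one flag per row.
std::vector<bool> match_level(const std::vector<std::string>& labels,
                              const std::string& level) {
    std::vector<bool> mask(labels.size());
    for (std::size_t i = 0; i < labels.size(); ++i) {
        mask[i] = (labels[i] == level);
    }
    return mask;
}

// Step 2: keep the entries of one column where the mask is true.
std::vector<double> keep_where(const std::vector<double>& column,
                               const std::vector<bool>& mask) {
    assert(column.size() == mask.size());  // mask must cover every row
    std::vector<double> out;
    for (std::size_t i = 0; i < column.size(); ++i) {
        if (mask[i]) out.push_back(column[i]);
    }
    return out;
}
```

Applying `keep_where` to every column with the same mask keeps the rows aligned, which is the property the DataFrame::create call above relies on.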

Avoid SIGSEGV when subsetting data.frame with call to `[data.frame` in Rcpp

The problem here is that LogicalVector::create() is not doing what you expect here -- it's returning a vector of length two, with the elements TRUE and TRUE. In other words, your code:

LogicalVector filter = LogicalVector::create(n, TRUE);

generates not a logical vector of length n with values TRUE, but instead a logical vector of length two with the first element being 'truthy' and so TRUE, and the second explicitly TRUE.

You likely intended to just use the regular constructor, e.g. LogicalVector(n, TRUE).
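The same distinction exists in the C++ standard library, which may make the pitfall easier to remember. The sketch below is an analogy with std::vector<int>, not Rcpp code, and the function names are hypothetical: listing elements yields a vector of length two, while the (count, value) constructor yields a vector of length n.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Brace initialisation lists the elements: the result is {n, 1}, length 2.
std::vector<int> filter_from_list(int n) {
    return std::vector<int>{n, 1};
}

// The (count, value) constructor repeats the value: n copies of 1.
std::vector<int> filter_from_count(int n) {
    assert(n >= 0);  // a negative count makes no sense
    return std::vector<int>(static_cast<std::size_t>(n), 1);
}
```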

Rcpp - extracting rows from list of matrices / dataframes

Here is one way to do it:

#include <Rcpp.h>

// x[[nx]][ny,]  ->  y[[ny]][[nx]]

// [[Rcpp::export]]
Rcpp::List Transform(Rcpp::List x) {
    R_xlen_t nx = x.size(), ny = Rcpp::as<Rcpp::NumericMatrix>(x[0]).nrow();
    Rcpp::List y(ny);

    for (R_xlen_t iy = 0; iy < ny; iy++) {
        Rcpp::List tmp(nx);
        for (R_xlen_t ix = 0; ix < nx; ix++) {
            Rcpp::NumericMatrix mtmp = Rcpp::as<Rcpp::NumericMatrix>(x[ix]);
            tmp[ix] = mtmp.row(iy);
        }
        y[iy] = tmp;
    }

    return y;
}

/*** R

L1 <- lapply(1:10, function(x) {
    matrix(rnorm(20), ncol = 5)
})

L2 <- lapply(1:nrow(L1[[1]]), function(x) {
    lapply(L1, function(y) unlist(y[x,]))
})

all.equal(L2, Transform(L1))
#[1] TRUE

microbenchmark::microbenchmark(
    "R" = lapply(1:nrow(L1[[1]]), function(x) {
        lapply(L1, function(y) unlist(y[x,]))
    }),
    "Cpp" = Transform(L1),
    times = 200L)

#Unit: microseconds
#expr    min      lq      mean  median       uq      max neval
#  R 254.660 316.627 383.92739 347.547 392.7705 1909.097   200
#Cpp  18.314  26.007  71.58795  30.230  38.8650  945.167   200

*/

I'm not sure how this will scale; I think it is just an inherently inefficient transformation. As per my comment at the top of the source, it seems like you are just doing a sort of coordinate swap -- the nyth row of the nxth element of the input list becomes the nxth element of the nyth element of the output list:

x[[nx]][ny,]  ->  y[[ny]][[nx]]
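Stripped of the R types, the coordinate swap looks like this in plain standard C++. This is a sketch under assumptions, not the Rcpp implementation: matrices are stored as row-major nested vectors, all matrices are assumed to have the same number of rows, and `swap_coordinates` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Row    = std::vector<double>;
using Matrix = std::vector<Row>;  // row-major: m[row][col]

// Hypothetical stand-in for the swap: y[iy][ix] is row iy of x[ix].
std::vector<std::vector<Row>> swap_coordinates(const std::vector<Matrix>& x) {
    if (x.empty()) return {};
    const std::size_t nx = x.size();
    const std::size_t ny = x[0].size();
    std::vector<std::vector<Row>> y(ny, std::vector<Row>(nx));
    for (std::size_t iy = 0; iy < ny; ++iy) {
        for (std::size_t ix = 0; ix < nx; ++ix) {
            assert(x[ix].size() == ny);  // same row count in every matrix
            y[iy][ix] = x[ix][iy];       // copies one row
        }
    }
    return y;
}
```

Every row is copied exactly once, so the work grows with the total number of matrix elements.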

To address the errors you were getting, Rcpp::List is a generic object - technically an Rcpp::Vector<VECSXP> - so when you try to do, e.g.

my_list[i].row(nr)

the compiler doesn't know that my_list[i] is a NumericMatrix. Therefore, you have to make an explicit cast with Rcpp::as<>,

Rcpp::NumericMatrix mtmp = Rcpp::as<Rcpp::NumericMatrix>(x[ix]);
tmp[ix] = mtmp.row(iy);

I just used matrix elements in the example data to simplify things. In practice you are probably better off coercing data.frames to matrix objects directly in R than trying to do it in C++; it will be much simpler, and most likely, the coercion is just calling underlying C code, so there isn't really anything to be gained trying to do it otherwise.

I should also point out that if you are using an Rcpp::List of homogeneous types, you can squeeze out a little more performance with Rcpp::ListOf<type>. This will allow you to skip the Rcpp::as<type> conversions done above:

typedef Rcpp::ListOf<Rcpp::NumericMatrix> MatList;

// [[Rcpp::export]]
Rcpp::List Transform2(MatList x) {
    R_xlen_t nx = x.size(), ny = x[0].nrow();
    Rcpp::List y(ny);

    for (R_xlen_t iy = 0; iy < ny; iy++) {
        Rcpp::List tmp(nx);
        for (R_xlen_t ix = 0; ix < nx; ix++) {
            tmp[ix] = x[ix].row(iy);
        }
        y[iy] = tmp;
    }

    return y;
}

/*** R

L1 <- lapply(1:10, function(x) {
    matrix(rnorm(20000), ncol = 100)
})

L2 <- lapply(1:nrow(L1[[1]]), function(x) {
    lapply(L1, function(y) unlist(y[x,]))
})

microbenchmark::microbenchmark(
    "R" = lapply(1:nrow(L1[[1]]), function(x) {
        lapply(L1, function(y) unlist(y[x,]))
    }),
    "Transform" = Transform(L1),
    "Transform2" = Transform2(L1),
    times = 200L)

#Unit: microseconds
#      expr      min       lq     mean   median       uq       max neval
#         R 6049.594 6318.822 7604.871 6707.242 8592.510 64005.190   200
# Transform  928.468 1041.936 3130.959 1166.819 1659.745 71552.284   200
#Transform2  850.912  957.918 1694.329 1061.183 2856.724  4502.065   200

*/

Iterate over vectors from an imported dataframe row-wise

I am not convinced that you will gain performance from going to C++ here. However, if you have a set of vectors with equal length (data.frame guarantees that) then you can simply iterate with one index:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
DataFrame modifyDataFrame(DataFrame df) {

  // access the columns
  IntegerVector a = df["a"];
  IntegerVector b = df["b"];
  CharacterVector c = df["c"];
  NumericVector d = df["d"];
  CharacterVector e = df["e"];

  for(int i=0; i < df.nrow(); ++i){
    a(i) += 1;
    b(i) += 2;
    c(i) += "c";
    d(i) += 3;
    e(i) += "e";
  }
  // return a new data frame
  return DataFrame::create(_["a"]= a, _["b"]= b, _["c"]= c, _["d"]= d, _["e"]=e);
}
/*** R
a <- c(0, 2, 4, 6, 8, 10)
b <- c(1, 3, 5, 7, 9, 11)
c <- c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1")
d <- c(10.2, 10.2, 4.3, 4.3, 3.4, 7.9)
e <- c("a", "t", "t", "g", "c", "a")

df <- data.frame(a, b, c, d, e)
modifyDataFrame(df)  
*/

Result:

> modifyDataFrame(df)  
   a  b     c    d  e
1  1  3 chr1c 13.2 ae
2  3  5 chr1c 13.2 te
3  5  7 chr1c  7.3 te
4  7  9 chr1c  7.3 ge
5  9 11 chr1c  6.4 ce
6 11 13 chr1c 10.9 ae

Here I am using the nrow() method of the DataFrame class, cf. the Rcpp API. This uses R's C API, just as the length() method does. I just find it more logical to use a DataFrame method than to single out one of the vectors to retrieve the length. The result would be the same.

As for a sliding window I would look into the RcppRoll package first.

Return subset of a given SEXP without knowing the actual internal data type

You can use a C++ template together with the RCPP_RETURN_VECTOR macro. This macro will make sure that the template is instantiated for all(?) R data types:

#include <Rcpp.h>
// [[Rcpp::plugins(cpp11)]]

template <int RTYPE>
Rcpp::Vector<RTYPE> debug_subset_impl(Rcpp::Vector<RTYPE> x,
                                      R_xlen_t index_from,
                                      R_xlen_t index_to){
    // range [index_from, index_to)
    Rcpp::Vector<RTYPE> subset(index_to - index_from);
    std::copy(x.cbegin() + index_from, x.cbegin() + index_to, subset.begin());
    // special case for factors == INTSXP with "class" and "levels" attribute
    if (x.hasAttribute("levels")){
        subset.attr("class") = x.attr("class");
        subset.attr("levels") = x.attr("levels");
    }
    return subset;
}

// [[Rcpp::export]]
SEXP dbg_subset(SEXP x, R_xlen_t index_from, R_xlen_t index_to){
    // 1-based -> 0-based
    RCPP_RETURN_VECTOR(debug_subset_impl, x, index_from - 1, index_to - 1);
}

/*** R
set.seed(42)
dbg_subset(1:100, 3, 6)
dbg_subset(runif(100), 3, 6)
dbg_subset(letters, 3, 6)
dbg_subset(as.factor(letters), 3, 6)
*/

Output:

> Rcpp::sourceCpp('58965423.cpp')

> set.seed(42)

> dbg_subset(1:100, 3, 6)
[1] 3 4 5

> dbg_subset(runif(100), 3, 6)
[1] 0.2861395 0.8304476 0.6417455

> dbg_subset(letters, 3, 6)
[1] "c" "d" "e"

> dbg_subset(as.factor(letters), 3, 6)
[1] c d e
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
