Passing by Reference a Data.Frame and Updating It with Rcpp

The way DataFrame::operator[] is implemented does indeed lead to a copy when you do this:

df["newCol"] = newCol;

To do what you want, you need to consider what a data frame is: a list of vectors with certain attributes. Then you can grab data from the original by copying the vectors (the pointers, not their content).

Something like this does it. It is a little more work, but not that hard.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List updateDFByRef(DataFrame& df, std::string name) {
    int nr = df.nrows(), nc = df.size();
    NumericVector newCol(nr, 1.);   // the new column, filled with 1
    List out(nc + 1);
    CharacterVector onames = df.attr("names");
    CharacterVector names(nc + 1);
    // Copy the column pointers (not the data) and their names
    for (int i = 0; i < nc; i++) {
        out[i] = df[i];
        names[i] = onames[i];
    }
    out[nc] = newCol;
    names[nc] = name;
    // Restore the attributes that make the list a data frame
    out.attr("class") = df.attr("class");
    out.attr("row.names") = df.attr("row.names");
    out.attr("names") = names;
    return out;
}

There are issues associated with this approach. Your original data frame and the one you created share the same column vectors, so modifying a column in place through one of them also changes the other. Only use this if you know what you are doing.
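As a rough illustration of that sharing hazard, here is a minimal sketch in plain C++ rather than Rcpp. The Frame struct and the with_column function are hypothetical names invented for this example, and std::shared_ptr merely stands in for the way an R list refers to its column vectors.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Plain C++ model (no Rcpp): a "frame" is a list of named columns, and each
// column is held through a shared pointer.
using Column = std::shared_ptr<std::vector<double>>;

struct Frame {
    std::vector<std::string> names;
    std::vector<Column> cols;
};

// Build a new frame with one extra column. Existing columns are shared, not
// copied: only the pointers are duplicated.
Frame with_column(const Frame& df, const std::string& name, double fill) {
    Frame out = df;  // copies the names and the pointers, not the column data
    std::size_t nrow = df.cols.empty() ? 0 : df.cols.front()->size();
    out.names.push_back(name);
    out.cols.push_back(std::make_shared<std::vector<double>>(nrow, fill));
    return out;
}
```

In this model, writing to a shared column through the new frame is visible through the original as well, which is the kind of surprise the warning above is about.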

Rcpp: Which is the best way to modify some columns of a dataframe with Rcpp

Following @Roland's suggestion, the best way is to take the data frame by reference, adapting updateDF2. The code is below:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
DataFrame updateDF(DataFrame& df, Nullable<Rcpp::CharacterVector> vars = R_NilValue) {
    std::string tmpstr;
    NumericVector tmpv;
    if (vars.isNotNull()) {
        CharacterVector selvars(vars);
        // Add 1 to every selected column
        for (int v = 0; v < selvars.size(); v++) {
            tmpstr = selvars[v];
            tmpv = df[tmpstr];
            tmpv = tmpv + 1.0;
            df[tmpstr] = tmpv;
        }
    }
    return df;
}

with the performance of:

Unit: milliseconds
                                       expr      min       lq     mean   median
 x1 <<- updateDF1(df, vars = names(df)[-1]) 573.8246 728.4211 990.8680 951.3108
 x2 <<- updateDF2(df, vars = names(df)[-1]) 595.7339 694.0645 935.4226 941.7450
 x3 <<- updateDF3(df, vars = names(df)[-1]) 197.7855 206.4767 377.4378 225.0290
  x4 <<- updateDF(df, vars = names(df)[-1]) 148.5119 149.7321 247.1329 152.3744

Rcpp: Append rows to dataframe by reference

This has been answered before as well but I don't have the reference handy. In essence:

  • a data.frame is a list of vectors
  • at the C++ level you just see a set of vectors
  • so you essentially have to insert into each vector
  • and resize as needed

Resizing is expensive because you need to reallocate and copy the content, so if you know you have, say, ten rows to insert, resize only once.
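The points above can be sketched in plain C++ (not Rcpp) as follows. Table and append_rows are hypothetical names made up for this illustration, and std::vector stands in for the column vectors; the idea is only to show each column being grown once for the whole batch instead of once per row.

```cpp
#include <cstddef>
#include <vector>

// Plain C++ model (no Rcpp): a table stored column-wise.
struct Table {
    std::vector<int> id;
    std::vector<double> value;
};

// Append several rows at once. Each column is grown a single time up front,
// so there is one reallocation per column rather than one per row.
// Assumption: ids and values have the same length.
void append_rows(Table& t, const std::vector<int>& ids,
                 const std::vector<double>& values) {
    t.id.reserve(t.id.size() + ids.size());
    t.value.reserve(t.value.size() + values.size());
    t.id.insert(t.id.end(), ids.begin(), ids.end());
    t.value.insert(t.value.end(), values.begin(), values.end());
}
```

With Rcpp vectors the same pattern would mean allocating each new column at its final length and copying the old and new values into it in one pass.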

Rcpp pass by reference vs. by value

The key is the 'proxy model' -- your xa really is the same memory location as your original object, so you end up changing your original.

If you don't want that, you should do one thing: make a (deep) copy using the clone() method, or maybe explicitly create a new object into which the altered values get written. Method two does not do that; you simply use two differently named variables which are both "pointers" (in the proxy model sense) to the original variable.

An additional complication, though, is the implicit cast and copy when you pass an int vector (from R) to a NumericVector type: that creates a copy, and then the original no longer gets altered.

Here is a more explicit example, similar to one I use in the tutorials or workshops:

library(inline)
f1 <- cxxfunction(signature(a="numeric"), plugin="Rcpp", body='
  Rcpp::NumericVector xa(a);
  int n = xa.size();
  for(int i=0; i < n; i++) {
    if(xa[i]<0) xa[i] = 0;
  }
  return xa;
')

f2 <- cxxfunction(signature(a="numeric"), plugin="Rcpp", body='
  Rcpp::NumericVector xa(a);
  int n = xa.size();
  Rcpp::NumericVector xr(a); // still points to a
  for(int i=0; i < n; i++) {
    if(xr[i]<0) xr[i] = 0;
  }
  return xr;
')

p <- seq(-2,2)
print(class(p))
print(cbind(f1(p), p))
print(cbind(f2(p), p))
p <- as.numeric(seq(-2,2))
print(class(p))
print(cbind(f1(p), p))
print(cbind(f2(p), p))

and this is what I see:

edd@max:~/svn/rcpp/pkg$ r /tmp/ari.r
Loading required package: methods
[1] "integer"
        p
[1,] 0 -2
[2,] 0 -1
[3,] 0  0
[4,] 1  1
[5,] 2  2
        p
[1,] 0 -2
[2,] 0 -1
[3,] 0  0
[4,] 1  1
[5,] 2  2
[1] "numeric"
       p
[1,] 0 0
[2,] 0 0
[3,] 0 0
[4,] 1 1
[5,] 2 2
       p
[1,] 0 0
[2,] 0 0
[3,] 0 0
[4,] 1 1
[5,] 2 2
edd@max:~/svn/rcpp/pkg$

So it really matters whether you pass int-to-float or float-to-float.
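The behaviour above can be modelled in plain C++ without Rcpp. This is only a sketch under stated assumptions: Handle, convert and clamp_negatives are hypothetical names invented here, and std::shared_ptr plays the role of the proxy that refers to R's memory.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Plain C++ model (no Rcpp) of a proxy object: a lightweight handle that
// shares the underlying buffer instead of owning a private copy.
template <typename T>
struct Handle {
    std::shared_ptr<std::vector<T>> data;

    T& operator[](std::size_t i) { return (*data)[i]; }
    std::size_t size() const { return data->size(); }

    // Deep copy: allocate a new buffer with the same contents.
    Handle clone() const {
        return Handle{std::make_shared<std::vector<T>>(*data)};
    }
};

// Converting to another element type has to allocate a new buffer, so the
// result no longer shares memory with the source.
template <typename To, typename From>
Handle<To> convert(const Handle<From>& h) {
    return Handle<To>{
        std::make_shared<std::vector<To>>(h.data->begin(), h.data->end())};
}

// Set negative entries to zero through the handle, as f1 and f2 do.
// The handle is taken by value, yet it still refers to the caller's buffer.
void clamp_negatives(Handle<double> h) {
    for (std::size_t i = 0; i < h.size(); ++i) {
        if (h[i] < 0) h[i] = 0;
    }
}
```

In this model, passing a Handle<double> directly changes the caller's data, whereas passing clone() or the result of convert<double>() on an integer handle leaves the caller's data untouched, mirroring the float-to-float and int-to-float cases.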

How can I add a new column to dataframe in RCpp?

You cannot do it by reference, because adding a column creates a copy rather than updating the caller's object. But if you return the data frame it works:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
DataFrame AddNewCol(const DataFrame& df, std::string new_var) {
    NumericVector vec_x = df["x"];
    NumericVector vec_y = df["y"];
    df[new_var] = vec_x * Rcpp::pow(vec_y, 2);
    return df;
}

/*** R
set.seed(42)
df <- data.frame(x = runif(10), y = runif(10))
AddNewCol( df ,"result")
*/

Note that I have taken the liberty to simplify the computation a bit. Result:

> set.seed(42)

> df <- data.frame(x = runif(10), y = runif(10))

> AddNewCol( df ,"result")
           x         y      result
1  0.9148060 0.4577418 0.191677054
2  0.9370754 0.7191123 0.484582715
3  0.2861395 0.9346722 0.249974991
4  0.8304476 0.2554288 0.054181629
5  0.6417455 0.4622928 0.137150421
6  0.5190959 0.9400145 0.458687354
7  0.7365883 0.9782264 0.704861206
8  0.1346666 0.1174874 0.001858841
9  0.6569923 0.4749971 0.148232064
10 0.7050648 0.5603327 0.221371155

Rcpp Update matrix passed by reference and return the update in R

Let's start by reiterating that this is probably bad practice. Rather than using void, return your changed object, which is the more common approach.

That said, you can make it work either way. For RcppArmadillo, pass by (explicit) reference. I get the desired behaviour

> sourceCpp("/tmp/so.cpp")

> M1 <- M2 <- matrix(0, 2, 2)

> bar(M1)

> M1
     [,1] [,2]
[1,]   42    0
[2,]    0    0

> foo(M2)

> M2
     [,1] [,2]
[1,]   42    0
[2,]    0    0
>

out of this short example:

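#include <RcppArmadillo.h>

// [[Rcpp::depends(RcppArmadillo)]]

// Rcpp objects are proxies, so passing by value still reaches R's memory
// [[Rcpp::export]]
void bar(Rcpp::NumericMatrix M) {
    M(0,0) = 42;
}

// Armadillo matrices need an explicit reference, as stated above
// [[Rcpp::export]]
void foo(arma::mat& M) {
    M(0,0) = 42;
}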

/*** R
M1 <- M2 <- matrix(0, 2, 2)

bar(M1)
M1

foo(M2)
M2
*/

Subsetting a data.frame in Rcpp by id yielding 'not compatible with request type'

Okay, so what you are really trying to do here is just subset the data.frame by row ids in Rcpp.

e.g.

D[c(2,4,7,10),]

First up, in your code you define:

std::map<double, DataFrame> X;

There is no wrap() conversion to deal with an object of this type. Furthermore, wrap really shouldn't be used in this case as it is auto converted due to the return type specified by the function.

To subset a data.frame efficiently, do not use the .push_back() feature, since it always requires a full copy.

Instead, you want to use the idx variable and Rcpp vector subsetting like so:

#include <Rcpp.h>

// Extract rows from data.frame object in Rcpp
// [[Rcpp::export]]
Rcpp::DataFrame matchRows(Rcpp::DataFrame D, Rcpp::IntegerVector idx) {

    // First, break apart each vector
    Rcpp::IntegerVector val1 = D["val1"];
    Rcpp::NumericVector val2 = D["val2"];
    Rcpp::CharacterVector val3 = D["val3"];
    Rcpp::NumericVector val4 = D["val4"];

    // We assume that the index passed in starts at 1.
    // Hence, we need to adjust the idx to start at 0 with:
    idx = idx - 1;

    // Next up, create a new DataFrame Object with selected rows subset.
    return Rcpp::DataFrame::create(Rcpp::Named("val1") = val1[idx],
                                   Rcpp::Named("val2") = val2[idx],
                                   Rcpp::Named("val3") = val3[idx],
                                   Rcpp::Named("val4") = val4[idx]
                                   );
}

/*** R
# Make some data
set.seed(1337)
D = data.frame(val1 = 1:10,
               val2 = rnorm(10),
               val3 = letters[1:10],
               val4 = sample(1:100, 10),
               stringsAsFactors = FALSE)

# Create index that starts at 1 instead of 0.
# This will be converted in the C++ function.
idx = c(2,4,7,10)

matchRows(D, idx)

*/

The devil in the details is that we reduce the index by 1, to account for C++ indices starting at 0 versus R's starting at 1, before using it to subset. This can be handled within the C++ code as well, though I'll leave that as an exercise.
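As a small illustration of that index shift, here is a plain C++ sketch (no Rcpp). The function name subset_one_based is hypothetical, and the bounds check is an addition of this sketch rather than something the code above does.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Plain C++ model (no Rcpp): select elements using 1-based positions, as R
// would supply them, converting to 0-based offsets internally.
template <typename T>
std::vector<T> subset_one_based(const std::vector<T>& x,
                                const std::vector<int>& idx) {
    std::vector<T> out;
    out.reserve(idx.size());
    for (int i : idx) {
        // Reject positions outside 1..n instead of reading out of bounds.
        if (i < 1 || static_cast<std::size_t>(i) > x.size()) {
            throw std::out_of_range("index outside 1..n");
        }
        out.push_back(x[static_cast<std::size_t>(i) - 1]);
    }
    return out;
}
```

Doing the subtraction in one place, next to the element access, keeps the caller free to think in R's 1-based positions.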

Passing a `data.table` to c++ functions using `Rcpp` and/or `RcppArmadillo`

Building on top of other answers, here is some example code:

#include <Rcpp.h>
using namespace Rcpp ;

// [[Rcpp::export]]
double do_stuff_with_a_data_table(DataFrame df){
    CharacterVector x = df["x"] ;
    NumericVector y = df["y"] ;
    IntegerVector z = df["v"] ;

    /* do whatever with x, y, z */
    double res = sum(y) ;
    return res ;
}

So, as Matthew says, this treats the data.table as a data.frame (i.e. an Rcpp::DataFrame in Rcpp).

require(data.table)
DT <- data.table(
    x=rep(c("a","b","c"),each=3),
    y=c(1,3,6),
    v=1:9)
do_stuff_with_a_data_table( DT )
# [1] 30

This completely ignores the internals of the data.table.


