Passing by reference a data.frame and updating it with rcpp
The way DataFrame::operator[] is implemented does indeed lead to a copy when you do this:
df["newCol"] = newCol;
To do what you want, you need to consider what a data frame is: a list of vectors with certain attributes. You can then build a new list that takes its data from the original by copying the vectors (the pointers, not their contents).
Something like this does it. It is a little more work, but not that hard.
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List updateDFByRef(DataFrame& df, std::string name) {
    int nr = df.nrows(), nc = df.size();
    NumericVector newCol(nr, 1.);   // new column, filled with 1
    List out(nc + 1);
    CharacterVector onames = df.attr("names");
    CharacterVector names(nc + 1);
    // Copy the existing columns (pointers only) and their names.
    for (int i = 0; i < nc; i++) {
        out[i] = df[i];
        names[i] = onames[i];
    }
    out[nc] = newCol;
    names[nc] = name;
    // Carry over the attributes that make the list a data frame.
    out.attr("class") = df.attr("class");
    out.attr("row.names") = df.attr("row.names");
    out.attr("names") = names;
    return out;
}
There are issues associated with this approach. Your original data frame and the one you created share the same vectors, so modifying a column in place through one of them also changes the other. Only use this if you know what you are doing.
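To make the sharing hazard concrete, here is a minimal sketch in plain C++ (no Rcpp). The Frame struct and withNewColumn function are hypothetical stand-ins, not Rcpp types; they only mimic the idea of a list of columns whose pointers get copied.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a data frame: named columns held by shared pointer.
using Column = std::shared_ptr<std::vector<double>>;

struct Frame {
    std::vector<std::string> names;
    std::vector<Column> cols;
};

// Build a new frame that reuses the existing columns (pointer copies only)
// and appends one freshly allocated column filled with 1.0.
Frame withNewColumn(const Frame& df, const std::string& name, std::size_t nrow) {
    Frame out;
    out.names = df.names;
    out.cols = df.cols;  // copies the pointers, not the column data
    out.names.push_back(name);
    out.cols.push_back(std::make_shared<std::vector<double>>(nrow, 1.0));
    return out;
}
```

In this model, writing to a shared column through the new frame is also visible through the original one, which is the kind of surprise the warning above is about.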
Rcpp: Which is the best way to modify some columns of a dataframe with Rcpp
Following the suggestion of @Roland, the best way is the by-reference method obtained by modifying updateDF2. The code is as below:
// [[Rcpp::export]]
DataFrame updateDF(DataFrame& df, Nullable<Rcpp::CharacterVector> vars = R_NilValue) {
    std::string tmpstr;
    NumericVector tmpv;
    if (vars.isNotNull()) {
        CharacterVector selvars(vars);
        for (int v = 0; v < selvars.size(); v++) {
            tmpstr = Rcpp::as<std::string>(selvars[v]);  // column name
            tmpv = df[tmpstr];
            tmpv = tmpv + 1.0;                           // add 1 to every element
            df[tmpstr] = tmpv;                           // write the column back
        }
    }
    return df;
}
with the following performance:
Unit: milliseconds
                                       expr      min       lq     mean   median
 x1 <<- updateDF1(df, vars = names(df)[-1]) 573.8246 728.4211 990.8680 951.3108
 x2 <<- updateDF2(df, vars = names(df)[-1]) 595.7339 694.0645 935.4226 941.7450
 x3 <<- updateDF3(df, vars = names(df)[-1]) 197.7855 206.4767 377.4378 225.0290
  x4 <<- updateDF(df, vars = names(df)[-1]) 148.5119 149.7321 247.1329 152.3744
Rcpp: Append rows to dataframe by reference
This has been answered before as well but I don't have the reference handy. In essence:
- a data.frame is a list of vectors
- at the C++ level you just see a set of vectors
- so you essentially have to insert into each vector
- and resize as needed
Resizing is expensive, as you need to reallocate and copy the content, so if you know you have, say, ten rows to insert, resize only once.
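As a rough illustration of "insert into each vector, and resize once", here is a sketch in plain C++ using std::vector columns rather than Rcpp vectors. The Columns alias and appendRows function are made up for this example, and it assumes every new row supplies one value per column.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical column-oriented table: one std::vector per column.
using Columns = std::vector<std::vector<double>>;

// Append several rows at once. rows[r][c] is the value for new row r, column c.
// Assumes every new row has at least one value per column.
void appendRows(Columns& cols, const std::vector<std::vector<double>>& rows) {
    for (std::size_t c = 0; c < cols.size(); ++c) {
        std::vector<double>& col = cols[c];
        // Grow the capacity once per column instead of once per inserted row.
        col.reserve(col.size() + rows.size());
        for (const std::vector<double>& row : rows) {
            col.push_back(row[c]);
        }
    }
}
```

With Rcpp vectors the same idea would mean allocating each column at its final length once and copying the old and new values in, rather than growing it row by row.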
Rcpp pass by reference vs. by value
The key is the 'proxy model': your xa really is the same memory location as your original object, so you end up changing your original.
If you don't want that, you should do one thing: make a (deep) copy using the clone() method, or explicitly create a new object into which the altered values get written. Method two does not do that; you simply use two differently named variables which are both "pointers" (in the proxy model sense) to the original variable.
An additional complication, though, is the implicit cast and copy when you pass an int vector (from R) to a NumericVector type: that creates a copy, and then the original no longer gets altered.
Here is a more explicit example, similar to one I use in the tutorials or workshops:
library(inline)
f1 <- cxxfunction(signature(a="numeric"), plugin="Rcpp", body='
Rcpp::NumericVector xa(a);
int n = xa.size();
for(int i=0; i < n; i++) {
if(xa[i]<0) xa[i] = 0;
}
return xa;
')
f2 <- cxxfunction(signature(a="numeric"), plugin="Rcpp", body='
Rcpp::NumericVector xa(a);
int n = xa.size();
Rcpp::NumericVector xr(a); // still points to a
for(int i=0; i < n; i++) {
if(xr[i]<0) xr[i] = 0;
}
return xr;
')
p <- seq(-2,2)
print(class(p))
print(cbind(f1(p), p))
print(cbind(f2(p), p))
p <- as.numeric(seq(-2,2))
print(class(p))
print(cbind(f1(p), p))
print(cbind(f2(p), p))
and this is what I see:
edd@max:~/svn/rcpp/pkg$ r /tmp/ari.r
Loading required package: methods
[1] "integer"
        p
[1,] 0 -2
[2,] 0 -1
[3,] 0  0
[4,] 1  1
[5,] 2  2
        p
[1,] 0 -2
[2,] 0 -1
[3,] 0  0
[4,] 1  1
[5,] 2  2
[1] "numeric"
       p
[1,] 0 0
[2,] 0 0
[3,] 0 0
[4,] 1 1
[5,] 2 2
       p
[1,] 0 0
[2,] 0 0
[3,] 0 0
[4,] 1 1
[5,] 2 2
edd@max:~/svn/rcpp/pkg$
So it really matters whether you pass int-to-float or float-to-float.
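The same three cases (shared handle, explicit clone, type conversion) can be mimicked in plain C++. The Proxy class below is a hypothetical toy, not how Rcpp is implemented; it only assumes that copying a handle shares storage while converting between element types allocates new storage.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical proxy: a cheap handle that shares its storage when copied.
template <typename T>
class Proxy {
public:
    explicit Proxy(std::vector<T> values)
        : data_(std::make_shared<std::vector<T>>(std::move(values))) {}

    // Converting from another element type has to allocate new storage,
    // much like passing an integer vector where a numeric one is expected.
    template <typename U>
    explicit Proxy(const Proxy<U>& other)
        : data_(std::make_shared<std::vector<T>>(other.values().begin(),
                                                 other.values().end())) {}

    // Deep copy: new storage with the same contents.
    Proxy clone() const { return Proxy(*data_); }

    T& operator[](std::size_t i) { return (*data_)[i]; }
    const std::vector<T>& values() const { return *data_; }
    std::size_t size() const { return data_->size(); }

private:
    std::shared_ptr<std::vector<T>> data_;
};

// Set negative entries to zero through whatever handle is passed in.
// The handle is taken by value, yet it still shares storage with the caller.
template <typename T>
void clampNegatives(Proxy<T> x) {
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (x[i] < 0) x[i] = 0;
    }
}
```

Calling clampNegatives(p) on a Proxy<double> changes p, calling it on p.clone() leaves p alone, and calling it on a Proxy<double> built from a Proxy<int> leaves the integer original alone, which parallels the integer versus numeric results shown above.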
How can I add a new column to dataframe in RCpp?
You cannot do it by reference. But if you return the data frame it works:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
DataFrame AddNewCol(const DataFrame& df, std::string new_var) {
NumericVector vec_x = df["x"];
NumericVector vec_y = df["y"];
df[new_var] = vec_x * Rcpp::pow(vec_y, 2);
return df;
}
/*** R
set.seed(42)
df <- data.frame(x = runif(10), y = runif(10))
AddNewCol( df ,"result")
*/
Note that I have taken the liberty to simplify the computation a bit. Result:
> set.seed(42)
> df <- data.frame(x = runif(10), y = runif(10))
> AddNewCol( df ,"result")
x y result
1 0.9148060 0.4577418 0.191677054
2 0.9370754 0.7191123 0.484582715
3 0.2861395 0.9346722 0.249974991
4 0.8304476 0.2554288 0.054181629
5 0.6417455 0.4622928 0.137150421
6 0.5190959 0.9400145 0.458687354
7 0.7365883 0.9782264 0.704861206
8 0.1346666 0.1174874 0.001858841
9 0.6569923 0.4749971 0.148232064
10 0.7050648 0.5603327 0.221371155
Rcpp Update matrix passed by reference and return the update in R
Let's start by reiterating that this is probably bad practice. Don't use void; return your changed object instead, which is the more common approach.
That said, you can make it work in either way. For RcppArmadillo, pass by (explicit) reference. I get the desired behaviour
> sourceCpp("/tmp/so.cpp")
> M1 <- M2 <- matrix(0, 2, 2)
> bar(M1)
> M1
[,1] [,2]
[1,] 42 0
[2,] 0 0
> foo(M2)
> M2
[,1] [,2]
[1,] 42 0
[2,] 0 0
>
out of this short example:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
void bar(Rcpp::NumericMatrix M) {
M(0,0) = 42;
}
// [[Rcpp::export]]
void foo(arma::mat& M) {
M(0,0) = 42;
}
/*** R
M1 <- M2 <- matrix(0, 2, 2)
bar(M1)
M1
foo(M2)
M2
*/
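For an ordinary C++ value type (as opposed to an Rcpp proxy object), the difference between the two signatures is the usual one. A tiny plain C++ sketch, using std::vector as a stand-in and two made-up function names:

```cpp
#include <vector>

// By value: the function works on its own copy, so the caller sees no change.
void setFirstByValue(std::vector<double> m) { m[0] = 42.0; }

// By reference: the function works on the caller's object.
void setFirstByReference(std::vector<double>& m) { m[0] = 42.0; }
```

This is only an analogy for why the Armadillo version needs the explicit reference, while the Rcpp::NumericMatrix version already behaves like a handle to the R object.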
Subsetting a data.frame in Rcpp by id yielding 'not compatible with request type'
Okay, so what you are really trying to do here is just subset the data.frame by row ids in Rcpp.
e.g.
D[c(2,4,7,10),]
First up, in your code you define:
std::map<double, DataFrame> X;
There is no wrap() conversion to deal with an object of this type. Furthermore, wrap() really shouldn't be needed in this case, as the result is converted automatically based on the return type specified by the function.
To subset a data.frame efficiently, do not use the .push_back() feature, since it always requires a full copy and is therefore not very efficient.
Instead, you want to use the idx variable and Rcpp vector subsetting like so:
#include <Rcpp.h>
// Extract rows from data.frame object in Rcpp
// [[Rcpp::export]]
Rcpp::DataFrame matchRows(Rcpp::DataFrame D, Rcpp::IntegerVector idx) {
// First, break apart each vector
Rcpp::IntegerVector val1 = D["val1"];
Rcpp::NumericVector val2 = D["val2"];
Rcpp::CharacterVector val3 = D["val3"];
Rcpp::NumericVector val4 = D["val4"];
// We assume that the index passed in starts at 1.
// Hence, we need to adjust the idx to start at 0 with:
idx = idx - 1;
// Next up, create a new DataFrame Object with selected rows subset.
return Rcpp::DataFrame::create(Rcpp::Named("val1") = val1[idx],
Rcpp::Named("val2") = val2[idx],
Rcpp::Named("val3") = val3[idx],
Rcpp::Named("val4") = val4[idx]
);
}
/*** R
# Make some data
set.seed(1337)
D = data.frame(val1 = 1:10,
val2 = rnorm(10),
val3 = letters[1:10],
val4 = sample(1:100, 10),
stringsAsFactors=FALSE)
# Create index that starts at 1 instead of 0.
# This will be converted in the C++ function.
idx = c(2,4,7,10)
matchRows(D, idx)
*/
The devil in the details is that we reduce the index by 1 before subsetting, to account for C++ indices starting at 0 versus R's starting at 1. This can be handled within the C++ code in other ways as well, though I'll leave that as an exercise.
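One possible way to do the index shift per element, with a bounds check, is sketched below in plain C++ on std::vector rather than on Rcpp vectors. The function name subsetOneBased is invented for this example.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Select elements using 1-based positions, as R would supply them.
template <typename T>
std::vector<T> subsetOneBased(const std::vector<T>& column,
                              const std::vector<int>& idx) {
    std::vector<T> out;
    out.reserve(idx.size());
    for (int i : idx) {
        // Reject positions outside 1..n instead of reading out of bounds.
        if (i < 1 || static_cast<std::size_t>(i) > column.size()) {
            throw std::out_of_range("index outside 1..n");
        }
        out.push_back(column[static_cast<std::size_t>(i) - 1]);  // shift to 0-based
    }
    return out;
}
```

Applied to each column in turn, this gives the same row selection as D[c(2,4,7,10),] in the plain C++ model.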
Passing a `data.table` to c++ functions using `Rcpp` and/or `RcppArmadillo`
Building on top of other answers, here is some example code:
#include <Rcpp.h>
using namespace Rcpp ;
// [[Rcpp::export]]
double do_stuff_with_a_data_table(DataFrame df){
CharacterVector x = df["x"] ;
NumericVector y = df["y"] ;
IntegerVector v = df["v"] ;
/* do whatever with x, y, v */
double res = sum(y) ;
return res ;
}
So, as Matthew says, this treats the data.table as a data.frame (aka a Rcpp::DataFrame in Rcpp).
require(data.table)
DT <- data.table(
x=rep(c("a","b","c"),each=3),
y=c(1,3,6),
v=1:9)
do_stuff_with_a_data_table( DT )
# [1] 30
This completely ignores the internals of the data.table.