na.locf and inverse.rle in Rcpp

The only thing I'd say is that you are testing for NA twice for each value when you only need to do it once. Testing for NA is not a free operation. Perhaps something like this:

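#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector naLocf(NumericVector x) {
    int n = x.size();
    if (n == 0) return x;    // nothing to fill
    double v = x[0];         // last value seen that was not NA
    for (int i = 1; i < n; i++) {
        if (NumericVector::is_na(x[i])) {   // a single NA test per element
            x[i] = v;        // note: x is modified in place
        } else {
            v = x[i];
        }
    }

    return x;
}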

This still does unnecessary work, however: it assigns v at every non-NA element, when only the last non-NA value before each run of NAs is needed. We can try something like this:

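// [[Rcpp::export]]
NumericVector naLocf3(NumericVector x) {
    if (x.size() == 0) return x;    // avoid dereferencing an empty vector
    double *p = x.begin(), *end = x.end();
    double v = *p; p++;

    while (p < end) {
        // skip the run of non-NA values
        while (p < end && !NumericVector::is_na(*p)) p++;
        v = *(p - 1);               // last value before the run of NAs
        // overwrite the run of NAs with that value
        while (p < end && NumericVector::is_na(*p)) {
            *p = v;
            p++;
        }
    }

    return x;
}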
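
To check the run-skipping logic without compiling against Rcpp, here is a minimal sketch in plain C++. It is not the code that was benchmarked: it assumes missing values are encoded as NaN, the name locf_copy is hypothetical, and unlike the functions above it works on a copy instead of modifying its input.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: last observation carried forward, on a copy of x.
// Assumption: missing values are NaN. Leading NaNs have nothing to carry
// forward, so they stay NaN.
std::vector<double> locf_copy(std::vector<double> x) {  // by value: caller's data is untouched
    const std::size_t n = x.size();
    std::size_t i = 0;
    while (i < n) {
        // Skip a run of observed values.
        while (i < n && !std::isnan(x[i])) ++i;
        if (i == n) break;
        if (i == 0) {
            // Leading run of NaN: skip it unchanged.
            while (i < n && std::isnan(x[i])) ++i;
            continue;
        }
        // Read the carried value once per gap, then fill the whole gap.
        const double v = x[i - 1];
        while (i < n && std::isnan(x[i])) x[i++] = v;
    }
    return x;
}
```

The carried value is read once per gap, which is the same saving naLocf3 aims for.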

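Now we can try some benchmarks. In the call below, naLocf1 is presumably the original function from the question (not shown here), and naLocf2 appears to be the first rewrite above, which is defined there as naLocf: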

x <- rnorm(1e6)
x[sample(1:1e6, 1000)] <- NA
library(microbenchmark)
microbenchmark(naLocf1(x), naLocf2(x), naLocf3(x))
# Unit: milliseconds
#        expr      min       lq   median       uq      max neval
#  naLocf1(x) 6.296135 6.323142 6.339132 6.354798 6.749864   100
#  naLocf2(x) 4.097829 4.123418 4.139589 4.151527 4.266292   100
#  naLocf3(x) 3.467858 3.486582 3.507802 3.521673 3.569041   100

Replace NA with last non-NA in data.table by using only data.table

Here's a data.table-only solution, but it's slightly slower than na.locf:

m1[, X := X[1], by = cumsum(!is.na(X))]
m1
#        X
#    1: NA
#    2: NA
#    3:  1
#    4:  2
#    5:  2
#   ---
#  996:  2
#  997:  2
#  998:  6
#  999:  7
# 1000:  8
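
The grouping works because cumsum(!is.na(X)) increases by one at every observed value, so each observed value shares a group id with the NAs that follow it, and X[1] is that group's observed value. A small sketch of the same bookkeeping in plain C++, assuming NaN stands in for NA; group_ids is a hypothetical helper and not part of data.table:

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper mirroring cumsum(!is.na(X)): the running count of
// observed (non-NaN) values, used as a group id for each element.
std::vector<int> group_ids(const std::vector<double>& x) {
    std::vector<int> ids;
    ids.reserve(x.size());
    int seen = 0;
    for (double value : x) {
        if (!std::isnan(value)) ++seen;  // a new group starts at each observed value
        ids.push_back(seen);
    }
    return ids;
}
```

For the input NA, NA, 1, 2, NA, NA, NA, 6 the ids are 0, 0, 1, 2, 2, 2, 2, 3. The leading NAs fall into group 0, whose first element is itself NA, which is why rows 1 and 2 stay NA in the output above.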

Speed test (f1 is not defined here; it is presumably the na.locf-based version from the question):

m1 <- data.table(X = rep(c(NA,NA,1,2,NA,NA,NA,6,7,8), 1e6))
f3 = function(x) x[, X := X[1], by = cumsum(!is.na(X))]

system.time(f1(copy(m1)))
# user system elapsed
# 3.84 0.58 4.62
system.time(f3(copy(m1)))
# user system elapsed
# 5.56 0.19 6.04

And here's a perverse way of making it faster, though I think it is considerably less readable:

f4 = function(x) {
  x[, tmp := cumsum(!is.na(X))]
  setattr(x, "sorted", "tmp")  # set the key without any checks
  x[x[!is.na(X)], X := i.X][, tmp := NULL]
}

system.time(f4(copy(m1)))
# user system elapsed
# 3.32 0.51 4.00

return NA value in NumericVector Rcpp unexpected behavior

You can initialize out with NA values:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector fill_backward(NumericVector x) {
    int n = x.size();
    NumericVector out = NumericVector(n, NumericVector::get_na());
    for (int i = 0; i < n; ++i) {
        if (R_IsNA(x[i])) {
            for (int j = i+1; j < n; ++j) {
                if (R_IsNA(x[j])) {
                    continue;
                } else {
                    out[i] = x[j];
                    break;
                }
            }
        } else { // not NA
            out[i] = x[i];
        }
    }
    return out;
}

Testing it:

fill_backward(c(NA, 1.0, NA, 2, NA, NA))
# [1]  1  1  2  2 NA NA

And I should probably mention that your line out[i] = NumericVector::get_na(); is never reached due to your use of continue.
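
The nested loop in fill_backward rescans the tail of the vector for every NA, which can become quadratic when there are long runs of NA. One alternative, which is not from the original answer, is a single right-to-left pass that remembers the next observed value. A minimal sketch in plain C++, assuming NaN encodes NA and using the hypothetical name fill_backward_linear:

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical linear-time variant: walk from the end toward the start,
// carrying the most recently seen observed value backward.
std::vector<double> fill_backward_linear(const std::vector<double>& x) {
    std::vector<double> out(x.size());
    // Until an observed value is seen, there is nothing to fill with.
    double next = std::numeric_limits<double>::quiet_NaN();
    for (std::size_t i = x.size(); i-- > 0;) {
        if (!std::isnan(x[i])) next = x[i];
        out[i] = next;  // observed value, or the next observed value to the right
    }
    return out;
}
```

On the test input above (NA, 1, NA, 2, NA, NA) this gives 1, 1, 2, 2, NA, NA, the same result as fill_backward.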

How to fill in the preceding numbers whenever there is a 0 in R?

n2 <- n1[cummax(seq_along(n1) * (n1 != 0))]
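
This works because seq_along(n1) * (n1 != 0) keeps the position of every non-zero element and is 0 elsewhere, and cummax then turns that into the position of the most recent non-zero element. Note that if n1 starts with zeros, those positions get index 0, which R drops when subsetting, so the result is shorter than n1. The same index bookkeeping can be sketched in plain C++; fill_zeros_forward is a hypothetical name, and leaving leading zeros as zeros is a choice made for this sketch, not the behavior of the R one-liner:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: replace each zero with the most recent non-zero value.
// Assumption for this sketch: leading zeros have no predecessor and stay zero.
std::vector<double> fill_zeros_forward(const std::vector<double>& x) {
    std::vector<double> out(x.size());
    bool have_last = false;  // has any non-zero value been seen yet?
    std::size_t last = 0;    // index of the most recent non-zero value
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (x[i] != 0.0) {
            last = i;
            have_last = true;
        }
        out[i] = have_last ? x[last] : 0.0;
    }
    return out;
}
```

For example, the input 0, 3, 0, 0, 5, 0 becomes 0, 3, 3, 3, 5, 5.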

Filling data frame with previous row value

Perhaps you can make use of na.locf from the "zoo" package after setting values of "0" to NA. Assuming your data.frame is called "mydf":

mydf$state <- mydf$temp
mydf$state[mydf$state == 0] <- NA

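library(zoo)
# na.rm = FALSE keeps leading NAs, so the result has the same length as the column
mydf$state <- na.locf(mydf$state, na.rm = FALSE)
mydf
#      random temp state
# 1 0.5024234  1.0   1.0
# 2 0.6875941  0.0   1.0
# 3 0.7418837  0.0   1.0
# 4 0.4453640  0.0   1.0
# 5 0.5062614  0.5   0.5
# 6 0.5163650  0.0   0.5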

If there were NA values in your original data.frame in the "temp" column, and you wanted to keep them as NA in the newly generated "state" column too, that's easy to take care of. Just add one more line to reintroduce the NA values:

mydf$state[is.na(mydf$temp)] <- NA
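
The three steps (treat 0 as missing, carry the last value forward, restore the original NAs) can also be folded into one pass. The following is a rough sketch in plain C++ rather than the zoo implementation; it assumes NaN encodes NA, state_from_temp is a hypothetical name, and leading zeros come out as NaN because there is nothing to carry forward yet.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical one-pass version of: 0 -> NA, carry forward, restore original NAs.
std::vector<double> state_from_temp(const std::vector<double>& temp) {
    std::vector<double> state(temp.size());
    double last = std::numeric_limits<double>::quiet_NaN();  // nothing observed yet
    for (std::size_t i = 0; i < temp.size(); ++i) {
        const double t = temp[i];
        if (std::isnan(t)) {
            state[i] = t;             // original NA is kept as NA
        } else {
            if (t != 0.0) last = t;   // a real observation updates the carried value
            state[i] = last;          // zeros take the last real observation
        }
    }
    return state;
}
```

With the temp column shown above (1.0, 0, 0, 0, 0.5, 0) this yields 1.0, 1.0, 1.0, 1.0, 0.5, 0.5, matching the state column.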

