na.locf and inverse.rle in Rcpp
The only thing I'd say is that you are testing for NA
twice for each value when you only need to do it once. Testing for NA
is not a free operation. Perhaps something like this:
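#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector naLocf2(NumericVector x) {
  int n = x.size();
  if (n == 0) return x;   // nothing to fill
  double v = x[0];        // last value seen that was not NA
  for (int i = 1; i < n; i++) {
    if (NumericVector::is_na(x[i])) {
      x[i] = v;           // fills in place: the caller's vector is modified
    } else {
      v = x[i];
    }
  }
  return x;
}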
This still does unnecessary work, though: it assigns v on every non-NA element, when it only needs the last non-NA value before each run of NA. We can try something like this:
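// [[Rcpp::export]]
NumericVector naLocf3(NumericVector x) {
  double *p = x.begin(), *end = x.end();
  if (p == end) return x;          // empty input: nothing to dereference
  double v = *p; p++;
  while (p < end) {
    // skip the run of non-NA values
    while (p < end && !NumericVector::is_na(*p)) p++;
    v = *(p - 1);                  // last value before the run of NA
    // overwrite the run of NA with it
    while (p < end && NumericVector::is_na(*p)) {
      *p = v;
      p++;
    }
  }
  return x;
}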
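If you want to play with the run-skipping idea outside of R, here is a minimal sketch in plain C++. It is not the Rcpp code above: it assumes NaN stands in for NA, works on a copy of a std::vector<double> instead of modifying the input, and the name locf_runs is made up for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Last observation carried forward, returning a filled copy.
// Assumption of this sketch: NaN plays the role of R's NA.
std::vector<double> locf_runs(std::vector<double> x) {
    const std::size_t n = x.size();
    std::size_t i = 0;
    // Leading NaNs have nothing to carry forward, so leave them alone.
    while (i < n && std::isnan(x[i])) ++i;
    while (i < n) {
        // Skip a run of observed values without storing anything.
        while (i < n && !std::isnan(x[i])) ++i;
        if (i == n) break;
        // x[i - 1] is the last observed value before this run of NaNs.
        const double v = x[i - 1];
        while (i < n && std::isnan(x[i])) x[i++] = v;
    }
    return x;
}
```

With the input {NaN, 1, NaN, NaN, 2, NaN} this returns {NaN, 1, 1, 1, 2, 2}, and an empty vector comes back empty.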
Now, we can try some benchmarks:
x <- rnorm(1e6)
x[sample(1:1e6, 1000)] <- NA
require(microbenchmark)
microbenchmark( naLocf1(x), naLocf2(x), naLocf3(x) )
# Unit: milliseconds
# expr min lq median uq max neval
# naLocf1(x) 6.296135 6.323142 6.339132 6.354798 6.749864 100
# naLocf2(x) 4.097829 4.123418 4.139589 4.151527 4.266292 100
# naLocf3(x) 3.467858 3.486582 3.507802 3.521673 3.569041 100
Replace NA with last non-NA in data.table by using only data.table
Here's a data.table-only solution, but it's slightly slower than na.locf:
m1[, X := X[1], by = cumsum(!is.na(X))]
m1
# X
# 1: NA
# 2: NA
# 3: 1
# 4: 2
# 5: 2
# ---
# 996: 2
# 997: 2
# 998: 6
# 999: 7
#1000: 8
Speed test:
m1 <- data.table(X = rep(c(NA,NA,1,2,NA,NA,NA,6,7,8), 1e6))
f3 = function(x) x[, X := X[1], by = cumsum(!is.na(X))]
system.time(f1(copy(m1)))
# user system elapsed
# 3.84 0.58 4.62
system.time(f3(copy(m1)))
# user system elapsed
# 5.56 0.19 6.04
And here's a perverse way of making it faster, but I think one that makes it considerably less readable:
f4 = function(x) {
  x[, tmp := cumsum(!is.na(X))]
  setattr(x, "sorted", "tmp") # set the key without any checks
  x[x[!is.na(X)], X := i.X][, tmp := NULL]
}
system.time(f4(copy(m1)))
# user system elapsed
# 3.32 0.51 4.00
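To see why grouping by cumsum(!is.na(X)) works, it may help to spell the idea out by hand. The following is only a sketch in plain C++, not what data.table does internally: NaN stands in for NA, and group_ids and fill_by_group are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Running count of observed (non-NaN) values: the analogue of cumsum(!is.na(X)).
// Each observed value starts a new group; the NaNs after it share its id.
std::vector<int> group_ids(const std::vector<double>& x) {
    std::vector<int> g(x.size());
    int count = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (!std::isnan(x[i])) ++count;
        g[i] = count;
    }
    return g;
}

// Replace every element by the first element of its group (X[1] per group).
std::vector<double> fill_by_group(const std::vector<double>& x) {
    const std::vector<int> g = group_ids(x);
    std::vector<double> out(x.size());
    std::size_t first = 0;  // index where the current group starts
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (i > 0 && g[i] != g[i - 1]) first = i;
        out[i] = x[first];
    }
    return out;
}
```

For the pattern NA, NA, 1, 2, NA, NA, NA, 6, 7, 8 the group ids are 0, 0, 1, 2, 2, 2, 2, 3, 4, 5. Group 0 holds the leading NAs, which is why they stay NA, and every other group starts with the value that gets carried forward.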
return NA value in NumericVector Rcpp unexpected behavior
You can initialize out with NA values:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector fill_backward(NumericVector x) {
  int n = x.size();
  // start with all NA, so positions with no later value stay NA
  NumericVector out = NumericVector(n, NumericVector::get_na());
  for (int i = 0; i < n; ++i) {
    if (R_IsNA(x[i])) {
      // look ahead for the next non-NA value
      for (int j = i + 1; j < n; ++j) {
        if (R_IsNA(x[j])) {
          continue;
        } else {
          out[i] = x[j];
          break;
        }
      }
    } else { // not NA
      out[i] = x[i];
    }
  }
  return out;
}
Testing it:
fill_backward(c(NA, 1.0, NA, 2, NA, NA))
# [1]  1  1  2  2 NA NA
And I should probably mention that your line out[i] = NumericVector::get_na(); is never reached due to your use of continue.
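The nested loop above rescans the remainder of the vector for every NA, which gets slow on long runs of NA. A single pass from the end avoids that. Here is a hedged sketch in plain C++ rather than Rcpp: it treats any NaN as missing (R_IsNA only matches NA proper), and fill_backward_linear is a made-up name.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Next observation carried backward in one reverse pass.
// Assumption of this sketch: NaN plays the role of R's NA.
std::vector<double> fill_backward_linear(const std::vector<double>& x) {
    std::vector<double> out(x.size());
    // Nearest observed value to the right of i; NaN until one is seen.
    double next = std::numeric_limits<double>::quiet_NaN();
    for (std::size_t i = x.size(); i-- > 0;) {
        if (!std::isnan(x[i])) next = x[i];
        out[i] = next;
    }
    return out;
}
```

On the test vector {NA, 1, NA, 2, NA, NA} this gives {1, 1, 2, 2, NA, NA}, the same as the Rcpp version.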
How to fill in the preceding numbers whenever there is a 0 in R?
n2 <- n1[cummax(seq_along(n1) * (n1 != 0))]
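The one-liner works by building, for each position, the index of the most recent nonzero element (the running maximum of index times "is nonzero") and then indexing n1 with it. Note that if n1 starts with a zero, that index is 0, and R silently drops zero indices, so n2 comes out shorter than n1. The same idea as an explicit loop, as a sketch in plain C++ with the hypothetical name fill_zeros, which keeps leading zeros as zeros instead of dropping them:

```cpp
#include <cstddef>
#include <vector>

// Replace each 0 by the closest preceding nonzero value.
std::vector<double> fill_zeros(const std::vector<double>& n1) {
    std::vector<double> n2(n1.size());
    // 1-based index of the latest nonzero element, 0 meaning "none yet".
    // This is the running value of cummax(seq_along(n1) * (n1 != 0)).
    std::size_t last = 0;
    for (std::size_t i = 0; i < n1.size(); ++i) {
        if (n1[i] != 0.0) last = i + 1;
        // Leading zeros have nothing before them, so they stay 0 here.
        n2[i] = (last > 0) ? n1[last - 1] : 0.0;
    }
    return n2;
}
```

For {1, 0, 0, 5, 0} this returns {1, 1, 1, 5, 5}, and for {0, 0, 3, 0} it returns {0, 0, 3, 3}.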
Filling data frame with previous row value
Perhaps you can make use of na.locf from the "zoo" package after setting values of "0" to NA. Assuming your data.frame is called "mydf":
mydf$state <- mydf$temp
mydf$state[mydf$state == 0] <- NA
library(zoo)
mydf$state <- na.locf(mydf$state, na.rm = FALSE)  # na.rm = FALSE keeps leading NAs, so the length still matches
# random temp state
# 1 0.5024234 1.0 1.0
# 2 0.6875941 0.0 1.0
# 3 0.7418837 0.0 1.0
# 4 0.4453640 0.0 1.0
# 5 0.5062614 0.5 0.5
# 6 0.5163650 0.0 0.5
If there were NA values in your original data.frame in the "temp" column, and you wanted to keep them as NA in the newly generated "state" column too, that's easy to take care of. Just add one more line to reintroduce the NA values:
mydf$state[is.na(mydf$temp)] <- NA
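Put together, the steps are: treat 0 as missing, carry the last real value forward, and put the original NA values back. A rough sketch of that combined logic in plain C++, assuming NaN stands in for NA and that leading zeros come out as NaN (as with na.rm = FALSE); state_from_temp is a hypothetical name:

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Build "state" from "temp": zeros take the last nonzero value,
// and positions that were NaN in temp stay NaN.
std::vector<double> state_from_temp(const std::vector<double>& temp) {
    std::vector<double> state(temp.size());
    // Last nonzero, non-NaN value seen so far; NaN until one appears.
    double last = std::numeric_limits<double>::quiet_NaN();
    for (std::size_t i = 0; i < temp.size(); ++i) {
        if (std::isnan(temp[i])) {
            state[i] = temp[i];  // reintroduce the original NaN
            continue;            // and do not let it change 'last'
        }
        if (temp[i] != 0.0) last = temp[i];
        state[i] = last;
    }
    return state;
}
```

With the temp column from the example, {1.0, 0, 0, 0, 0.5, 0}, this gives {1, 1, 1, 1, 0.5, 0.5}, matching the state column shown above.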