Is there something like a pmax index?
Very simple in base R:
z$wmax <- apply(z[, -c(1,6)],1, which.max)
Actually that gives you 1 less than what you were asking for since I excluded the first column but that can easily be remedied by adding one.
z$max_col_n <- apply(z[, -c(1,6)],1, which.max) +1
Is there an efficient way to obtain pmax other than using the R base function?
There seem to be a few issues that memory allocations that can be seen from bench::mark
uncover.
bench::mark(pmax(x, y, z, w, na.rm = TRUE),
Pmax2(x, y, z, w))
## # A tibble: 2 x 13
## expression min median `itr/sec` mem_alloc
## <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt>
## 1 pmax(x, y, z, w, na.rm = TRUE) 5.79ms 6.28ms 157. 781.3KB
## 2 Pmax2(x, y, z, w) 39.56ms 54.48ms 19.7 9.18MB
Memory Coercion
There is 10 times the memory allocation in comparison to base pmax()
. Your rcpp is relatively straight forward, so this hints that there is some kind of coercion. And when looking at your sample data, you are sending integer vectors to a numeric signature. This creates a costly coercion. Let's update the signature and code to expect IntegerVector
s. I simply changed everything from NumericVector
to IntegerVector
for this.
expression min median `itr/sec` mem_alloc
<bch:expr> <bch:t> <bch:t> <dbl> <bch:byt>
1 pmax(x, y, z, w, na.rm = TRUE) 1.89ms 2.33ms 438. 781.3KB
2 Pmax2_int(x, y, z, w) 37.42ms 49.88ms 17.6 2.32MB
Re-Compilation
The OP code includes cppFunction
within the larger function code. Unless we need to recompile it every loop, we can instead compile and then call the compiled code from R. This is the biggest performance boost for this dataset size.
cppFunction("
IntegerVector cpp_pmax_pre(List args) {
IntegerVector out = args[0];
int n_arg = args.length();
int n_vec = out.length();
for (int j = 0; j < n_vec; ++j) {
if (IntegerVector::is_na(out[j])) out[j] = -1;
}
for (int i = 1; i < n_arg; ++i) {
IntegerVector pa = args[i];
for (int j = 0; j < n_vec; ++j) {
if (IntegerVector::is_na(pa[j])) pa[j] = -1;
out[j] = std::max(out[j], pa[j]);
}
}
return out;
}
")
Pmax2_int_pre <- function(...) {
args_list <- list(...)
output <- cpp_pmax_pre(args_list)
output[output == -1] <- NA
return(output)
}
bench::mark(pmax(x, y, z, w, na.rm = TRUE),
Pmax2_int_pre(x, y, z, w))
## # A tibble: 2 x 13
## expression min median `itr/sec` mem_alloc
## <bch:expr> <bch:> <bch:> <dbl> <bch:byt>
## 1 pmax(x, y, z, w, na.rm = TRUE) 2.31ms 2.42ms 397. 781.3KB
## 2 Pmax2_int_pre(x, y, z, w) 2.48ms 3.55ms 270. 2.29MB
More memory and small optimizations
Finally, we still have more memory allocated. That hints we can do more - in this case we should update NA_REAL
in rcpp. Related, we can optimize the loop assignment some.
cppFunction("
IntegerVector cpp_pmax_final(List args) {
IntegerVector out = args[0];
int n_arg = args.length();
int n_vec = out.length();
for (int j = 0; j < n_vec; ++j) {
if (IntegerVector::is_na(out[j])) out[j] = -1;
}
for (int i = 1; i < n_arg; ++i) {
IntegerVector pa = args[i];
for (int j = 0; j < n_vec; ++j) {
// simplify logic; if the element is not na and is greater than the out, update out.
if (!IntegerVector::is_na(pa[j]) & pa[j] > out[j]) out[j] = pa[j];
}
}
// update now in Rcpp instead of allocating vectors in R
for (int i = 0; i < n_vec; i++) {
if(out[i] == -1) out[i] = NA_INTEGER;
}
return out;
}
")
Pmax2_final <- function(...) {
cpp_pmax_final(list(...))
}
bench::mark(pmax(x, y, z, w, na.rm = TRUE),
Pmax2_final(x, y, z, w))
## # A tibble: 2 x 13
## expression min median `itr/sec` mem_alloc
## <bch:expr> <bch:> <bch:> <dbl> <bch:byt>
## 1 pmax(x, y, z, w, na.rm = TRUE) 2ms 2.08ms 460. 781.3KB
## 2 Pmax2_final(x, y, z, w) 1.19ms 1.45ms 671. 2.49KB
We did it*! I am sure there could be small optimizations - we access pa[j]
three times so it may be worthwhile to assign to a variable.
Bonus - NA_INTEGER
According to Rcpp for Everyone, the NA_INTEGER
should be equivalent to the lowest integer value of -2147483648. Using this, we can remove the replacement of NA's because we can compare directly to NA when dealing with int
data types.
During this realization, I also found an issue with the previous part - we need to clone the initial argument so that we are not accidently changing it by reference. Still, we're still slightly faster than base pmax()
.
cppFunction("
IntegerVector cpp_pmax_last(List args) {
IntegerVector tmp = args[0];
IntegerVector out = clone(tmp);
int n_arg = args.length();
int n_vec = out.length();
for (int i = 1; i < n_arg; ++i) {
IntegerVector pa = args[i];
for (int j = 0; j < n_vec; ++j) {
if (pa[j] > out[j]) out[j] = pa[j];
}
}
return out;
}
")
Pmax2_last <- function(...) {
cpp_pmax_last(list(...))
}
bench::mark(pmax(x, y, z, w, na.rm = TRUE),
Pmax2_last(x, y, z, w),
)
## # A tibble: 2 x 13
## expression min median `itr/sec` mem_alloc `gc/sec`
## <bch:expr> <bch:> <bch:> <dbl> <bch:byt> <dbl>
## 1 pmax(x, y, z, w, na.rm = TRUE) 5.98ms 6.36ms 154. 781KB 0
## 2 Pmax2_last(x, y, z, w) 5.09ms 5.46ms 177. 784KB 0
Return attributes of pmax function output
One option would be using max.col
for finding the index of the maximum value per each row. For that, we need to create a matrix/data.frame by cbind
ing the vectors ('xy') and its names
('nmxy'). Create a row/column index ('ij') and subset the elements of 'xy' and set the names from 'nmxy'.
xy <- cbind(x,y)
nmxy <- cbind(names(x), names(y))
ij <- cbind(1:nrow(xy), max.col(xy))
setNames(xy[ij], nmxy[ij])
# d b f
# 2 2 4
r : Ignore NA values with pmax function
There is na.rm
in pmax
and as the values are null
, we need to replace
those null
to NA
before doing that and as "null" is a character string, the columns would be character
or factor
. So, we need to also change the type
with type.convert
before the pmax
step
df[-1] <- replace(df[-1], df[-1] == "null", NA)
df <- type.convert(df, as.is = TRUE)
df$max <- pmax(df$skyc1, df$skyc2, df$skyc3, na.rm = TRUE)
df$max
#[1] 3 1 3
If there are many columns of 'skyc',then it can be automated as well
nm1 <- grep('^skyc\\d+$', names(df), value = TRUE)
df$max <- do.call(pmax, c(df[nm1], na.rm = TRUE))
data
df <-structure(list(date = c("1995-01-01", "1995-01-02", "1995-01-03"
), skyc1 = c(0L, 1L, 1L), skyc2 = c("1", "null", "3"), skyc3 = c("3",
"null", "null")), class = "data.frame", row.names = c(NA, -3L
))
Get index of max of Spark Dataframe with Sparklyr
Let's start with your first problem:
It's straight forward to get the maximum value.
It indeed is, however spark_apply
is just not a way to go. Instead it is better to use greatest
function:
sdf %>% mutate(max = greatest(V1, V2, V3))
The same function can be used for you second problem, however due to sparklyr
limitations, you'll have to use SQL expression directly:
expr <- c("V1", "V2", "V3") %>%
paste0(
"CAST(STRUCT(`",
., "`, ", seq_along(.),
") AS struct<value: double, index: double>)", collapse=", ") %>%
paste0("greatest(", ., ").index AS max_index")
sdf %>%
spark_dataframe() %>%
invoke("selectExpr", list("*", expr)) %>%
sdf_register()
In Spark 2.4 (as for now unsupported in sparklyr
) it might be possible to
sdf %>% mutate(max
max_index = array_max(arrays_zip(array(V1, V2, V3), array(1, 2, 3))).1
)
Actual values of the maximum value in each interval
The problem is attempting to index the i
th element of the list. Doing I$'i'
is trying to get the element of the list corresponding to the string 'i'
, which doesn't exist:
> i <- 1
> I$'i'
NULL
To fix this, you should index a list using the [[..]]
notation (which indexes them in order, i.e. I[[1]]
= I$'0'
):
> i <- 1
> I[[i]]
[1] 1 2
> I$'0' # to illustrate the indexing
[1] 1 2
Assuming that f
is just meant to take a vector (rather than an index into I
), its definition should be something like:
f = function(vec, x) {
d = pmax(outer(x, vec, "-"), 0)
colSums(d - d^2/2)
}
And the loop like:
for (i in 1:length(i)) {
max.value = I[[i]][which.max(f(I[[i]], x))]
}
Note that you can iterate directly over the elements of a list, you don't need to index each one individually, so we could also do:
for (vec in I) {
max.value = vec[which.max(f(vec, x))]
}
(Also, you might want something slightly different to what you have, since in each loop max.value
is overwritten.)
Any way to achieve numpy-like indexing of multi-dimensional arrays in R?
The following works for assigning a constant in any 1-k dimensions:
x = array(0, c(5,4,3,2))
indices = c(1,2,3)
k <- 1 # Dimension to subassign
x[slice.index(x, k) %in% indices] <- 5
However, this becomes a bit inconvenient if you want to do sub-assignment parallel to indices
, especially in higher dimensions, because we have to match the order of the array. Example for the 2nd dimension:
x = array(0, c(5,4,3,2))
k <- 2
x[slice.index(x, k) %in% indices] <- rep(1:3, each = dim(x)[1])
The best option I have come up with is the following:
subassign_array <- function(x, i, k, value) {
each <- pmax(prod(dim(x)[seq_len(k - 1)]), 1)
x[slice.index(x, k) %in% i] <- rep(value, each = each)
x
}
For example replacing the 1st and 2nd element in the 3rd dimension by 1s and 2s:
subassign_array(x, 1:2, 3, 1:2)
This doesn't cover every possible case for subassignment however.
How to create an increasing index based on a certain condition?
Waldi's answer is very good, here is a slightly modified version:
library(dplyr)
df %>%
group_by(col2 =cumsum(co11 >= 5)+1)
co11 col2
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 5 2
7 6 3
8 1 3
9 1 3
10 1 3
11 2 3
12 3 3
13 4 3
14 5 4
15 8 5
16 1 5
17 1 5
18 2 5
19 2 5
20 8 6
21 10 7
Related Topics
Reshape Data from Long to Wide Format - More Than One Variable
Adding a Legend to an Rgl 3D Plot
Ggplot2: Geom_Smooth Confidence Band Does Not Extend to Edge of Graph, Even with Fullrange=True
Programmatically Create Tab and Plot in Markdown
How to Use R to Create a Word Co-Occurrence Matrix
R: Ggplot2 Setting the Last Plot in the Midle with Facet_Wrap
How to Drop Factor Levels While Scraping Data Off Us Census HTML Site
Convert Jpg to Greyscale CSV Using R
How to Add New Calculated Variables to a Data Frame
Logistic Regression: How to Try Every Combination of Predictors in R
How to Know a Dimension of Matrix or Vector in R
Why Does Withcallinghandlers Still Stops Execution
Find Closest Points (Lat/Lon) from One Data Set to a Second Data Set
Convert Byte Encoding to Unicode
Uri Routing for Shinydashboard Using Shiny.Router