Is There Something Like a Pmax Index

Is there something like a pmax index?

Very simple in base R:

z$wmax <- apply(z[, -c(1, 6)], 1, which.max)

Actually that gives you one less than what you were asking for, since I excluded the first column, but that can easily be remedied by adding one:

z$max_col_n <- apply(z[, -c(1, 6)], 1, which.max) + 1
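
For a concrete illustration, here is a hypothetical z (the OP's data is not shown), where columns 1 and 6 are the excluded non-numeric columns:

# Hypothetical stand-in for the OP's z; columns 1 and 6 are the ones excluded above
z <- data.frame(id  = 1:3,
                a   = c(1, 9, 2),
                b   = c(5, 1, 4),
                c   = c(2, 3, 8),
                d   = c(7, 2, 1),
                grp = c("x", "y", "z"))

apply(z[, -c(1, 6)], 1, which.max)
## [1] 4 1 3

z$max_col_n <- apply(z[, -c(1, 6)], 1, which.max) + 1
z$max_col_n
## [1] 5 2 4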

Is there an efficient way to obtain pmax other than using the R base function?

There seem to be a few issues, which the memory allocations uncovered by bench::mark point to.

bench::mark(pmax(x, y, z, w, na.rm = TRUE),
            Pmax2(x, y, z, w))

## # A tibble: 2 x 13
##   expression                         min  median `itr/sec` mem_alloc
##   <bch:expr>                     <bch:t> <bch:t>     <dbl> <bch:byt>
## 1 pmax(x, y, z, w, na.rm = TRUE)  5.79ms  6.28ms      157.   781.3KB
## 2 Pmax2(x, y, z, w)              39.56ms 54.48ms      19.7    9.18MB

Memory Coercion

There is 10 times the memory allocation compared to base pmax(). Your Rcpp code is relatively straightforward, so this hints at some kind of coercion. And indeed, looking at your sample data, you are sending integer vectors to a numeric signature, which forces a costly coercion. Let's update the signature and code to expect IntegerVectors; I simply changed every NumericVector to IntegerVector.
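
Since the OP's original Pmax2 is not reproduced here, this is only a sketch of what the integer-typed Pmax2_int might look like, assuming it followed the same pattern as cpp_pmax_pre below (the internal name cpp_pmax_int is just for this sketch, and the compilation still sits inside the wrapper, which the next section addresses):

library(Rcpp)

Pmax2_int <- function(...) {
  # compile inside the wrapper, as in the OP's code (fixed in the next section)
  cppFunction("
  IntegerVector cpp_pmax_int(List args) {
    IntegerVector out = args[0];
    int n_arg = args.length();
    int n_vec = out.length();
    for (int j = 0; j < n_vec; ++j) {
      if (IntegerVector::is_na(out[j])) out[j] = -1;
    }
    for (int i = 1; i < n_arg; ++i) {
      IntegerVector pa = args[i];
      for (int j = 0; j < n_vec; ++j) {
        if (IntegerVector::is_na(pa[j])) pa[j] = -1;
        out[j] = std::max(out[j], pa[j]);
      }
    }
    return out;
  }
  ")
  output <- cpp_pmax_int(list(...))
  output[output == -1] <- NA
  output
}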

  expression                         min  median `itr/sec` mem_alloc
  <bch:expr>                     <bch:t> <bch:t>     <dbl> <bch:byt>
1 pmax(x, y, z, w, na.rm = TRUE)  1.89ms  2.33ms      438.   781.3KB
2 Pmax2_int(x, y, z, w)          37.42ms 49.88ms      17.6    2.32MB

Re-Compilation

The OP's code includes the cppFunction() call inside the larger function, so the C++ gets recompiled on every call. Unless we truly need to recompile each time, we can compile once up front and then call the compiled function from R. This is the biggest performance boost for this dataset size.

cppFunction("
IntegerVector cpp_pmax_pre(List args) {
IntegerVector out = args[0];
int n_arg = args.length();
int n_vec = out.length();
for (int j = 0; j < n_vec; ++j) {
if (IntegerVector::is_na(out[j])) out[j] = -1;
}
for (int i = 1; i < n_arg; ++i) {
IntegerVector pa = args[i];
for (int j = 0; j < n_vec; ++j) {
if (IntegerVector::is_na(pa[j])) pa[j] = -1;
out[j] = std::max(out[j], pa[j]);
}
}
return out;
}
")

Pmax2_int_pre <- function(...) {
  args_list <- list(...)
  output <- cpp_pmax_pre(args_list)
  output[output == -1] <- NA
  return(output)
}

bench::mark(pmax(x, y, z, w, na.rm = TRUE),
            Pmax2_int_pre(x, y, z, w))

## # A tibble: 2 x 13
##   expression                        min median `itr/sec` mem_alloc
##   <bch:expr>                     <bch:> <bch:>     <dbl> <bch:byt>
## 1 pmax(x, y, z, w, na.rm = TRUE) 2.31ms 2.42ms      397.   781.3KB
## 2 Pmax2_int_pre(x, y, z, w)      2.48ms 3.55ms      270.    2.29MB

More memory and small optimizations

Finally, we still allocate more memory than base pmax(). That hints we can do more: in this case, we should write the NAs back as NA_INTEGER inside the Rcpp code rather than allocating a replacement vector in R. Relatedly, we can simplify the loop assignment a little.

cppFunction("
IntegerVector cpp_pmax_final(List args) {
IntegerVector out = args[0];
int n_arg = args.length();
int n_vec = out.length();
for (int j = 0; j < n_vec; ++j) {
if (IntegerVector::is_na(out[j])) out[j] = -1;
}
for (int i = 1; i < n_arg; ++i) {
IntegerVector pa = args[i];
for (int j = 0; j < n_vec; ++j) {
// simplify logic; if the element is not na and is greater than the out, update out.
if (!IntegerVector::is_na(pa[j]) & pa[j] > out[j]) out[j] = pa[j];
}
}
// update now in Rcpp instead of allocating vectors in R
for (int i = 0; i < n_vec; i++) {
if(out[i] == -1) out[i] = NA_INTEGER;
}
return out;
}
")

Pmax2_final <- function(...) {
  cpp_pmax_final(list(...))
}

bench::mark(pmax(x, y, z, w, na.rm = TRUE),
            Pmax2_final(x, y, z, w))

## # A tibble: 2 x 13
##   expression                        min median `itr/sec` mem_alloc
##   <bch:expr>                     <bch:> <bch:>     <dbl> <bch:byt>
## 1 pmax(x, y, z, w, na.rm = TRUE)    2ms 2.08ms      460.   781.3KB
## 2 Pmax2_final(x, y, z, w)        1.19ms 1.45ms      671.    2.49KB

We did it*! I am sure there are still small optimizations left; for instance, we access pa[j] three times per iteration, so it may be worthwhile to assign it to a local variable.
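
A sketch of that micro-optimization (cpp_pmax_cached is a hypothetical name and this variant is not benchmarked here), caching the element in a local int:

cppFunction("
IntegerVector cpp_pmax_cached(List args) {
  IntegerVector out = args[0];
  int n_arg = args.length();
  int n_vec = out.length();
  for (int j = 0; j < n_vec; ++j) {
    if (IntegerVector::is_na(out[j])) out[j] = -1;
  }
  for (int i = 1; i < n_arg; ++i) {
    IntegerVector pa = args[i];
    for (int j = 0; j < n_vec; ++j) {
      int v = pa[j];  // read the element once
      if (!IntegerVector::is_na(v) && v > out[j]) out[j] = v;
    }
  }
  for (int j = 0; j < n_vec; ++j) {
    if (out[j] == -1) out[j] = NA_INTEGER;
  }
  return out;
}
")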

Bonus - NA_INTEGER

According to Rcpp for Everyone, NA_INTEGER is equivalent to the lowest integer value, -2147483648. Using this, we can remove the NA replacement entirely: with int data, any real value compares greater than the NA sentinel, so taking the running maximum handles NAs the same way na.rm = TRUE does.
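
A small check of that sentinel from R (na_int_raw is just a throwaway helper for this illustration; casting to double keeps it from printing as NA):

cppFunction("double na_int_raw() { return (double) NA_INTEGER; }")
na_int_raw()
## [1] -2147483648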

While working through this, I also found an issue with the previous part: we need to clone the initial argument so that we are not accidentally modifying it by reference. Even with the extra copy, we're still slightly faster than base pmax().
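
To see why: with the earlier versions, the first argument is modified in place. A small illustration, assuming integer inputs (x_demo and y_demo are made-up names for this demo):

x_demo <- c(1L, NA, 3L)
y_demo <- c(2L, 2L, 2L)
Pmax2_final(x_demo, y_demo)
## [1] 2 2 3
x_demo  # the input now holds the computed maxima instead of c(1, NA, 3)
## [1] 2 2 3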

cppFunction("
IntegerVector cpp_pmax_last(List args) {
IntegerVector tmp = args[0];
IntegerVector out = clone(tmp);
int n_arg = args.length();
int n_vec = out.length();
for (int i = 1; i < n_arg; ++i) {
IntegerVector pa = args[i];
for (int j = 0; j < n_vec; ++j) {
if (pa[j] > out[j]) out[j] = pa[j];
}
}
return out;
}
")

Pmax2_last <- function(...) {
  cpp_pmax_last(list(...))
}

bench::mark(pmax(x, y, z, w, na.rm = TRUE),
            Pmax2_last(x, y, z, w))

## # A tibble: 2 x 13
##   expression                        min median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>                     <bch:> <bch:>     <dbl> <bch:byt>    <dbl>
## 1 pmax(x, y, z, w, na.rm = TRUE) 5.98ms 6.36ms      154.     781KB        0
## 2 Pmax2_last(x, y, z, w)         5.09ms 5.46ms      177.     784KB        0

Return attributes of pmax function output

One option would be using max.col to find the index of the maximum value in each row. For that, we create a matrix by cbinding the vectors ('xy') and their names ('nmxy'), build a row/column index ('ij'), and then subset 'xy' with it, setting the names from 'nmxy'.

xy <- cbind(x,y)
nmxy <- cbind(names(x), names(y))
ij <- cbind(1:nrow(xy), max.col(xy))
setNames(xy[ij], nmxy[ij])
# d b f
# 2 2 4
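
For reference, named vectors like these (an assumption; the original x and y are not shown) reproduce that output:

x <- c(a = 1, b = 2, c = 3)
y <- c(d = 2, e = 1, f = 4)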

R: Ignore NA values with pmax function

pmax has an na.rm argument, but the missing values here are the string "null", so we need to replace "null" with NA first. And because "null" is a character string, those columns are character (or factor), so we also need to convert their type with type.convert before the pmax step.

df[-1] <- replace(df[-1], df[-1] == "null", NA)
df <- type.convert(df, as.is = TRUE)
df$max <- pmax(df$skyc1, df$skyc2, df$skyc3, na.rm = TRUE)
df$max
#[1] 3 1 3

If there are many 'skyc' columns, this can be automated as well:

nm1 <- grep('^skyc\\d+$', names(df), value = TRUE)
df$max <- do.call(pmax, c(df[nm1], na.rm = TRUE))

data

df <- structure(list(date = c("1995-01-01", "1995-01-02", "1995-01-03"),
                     skyc1 = c(0L, 1L, 1L),
                     skyc2 = c("1", "null", "3"),
                     skyc3 = c("3", "null", "null")),
                class = "data.frame", row.names = c(NA, -3L))
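
With this data, the automated version should give the same result as the explicit call:

df[-1] <- replace(df[-1], df[-1] == "null", NA)
df <- type.convert(df, as.is = TRUE)
nm1 <- grep('^skyc\\d+$', names(df), value = TRUE)
do.call(pmax, c(df[nm1], na.rm = TRUE))
#[1] 3 1 3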

Get index of max of Spark Dataframe with Sparklyr

Let's start with your first problem:

"It's straightforward to get the maximum value."

It indeed is; however, spark_apply is just not the way to go here. It is better to use the greatest function:

sdf %>% mutate(max = greatest(V1, V2, V3))

The same function can be used for your second problem; however, due to sparklyr limitations, you'll have to use a SQL expression directly:

expr <- c("V1", "V2", "V3") %>% 
paste0(
"CAST(STRUCT(`",
., "`, ", seq_along(.),
") AS struct<value: double, index: double>)", collapse=", ") %>%
paste0("greatest(", ., ").index AS max_index")

sdf %>%
  spark_dataframe() %>%
  invoke("selectExpr", list("*", expr)) %>%
  sdf_register()

In Spark 2.4 (as of now unsupported in sparklyr) it might be possible to do:

sdf %>% mutate(
  max_index = array_max(arrays_zip(array(V1, V2, V3), array(1, 2, 3))).1
)

Actual values of the maximum value in each interval

The problem is in how you index the ith element of the list. I$'i' tries to get the element whose name is the literal string 'i', which doesn't exist:

> i <- 1
> I$'i'
NULL

To fix this, index the list with the [[..]] notation, which indexes elements by position (i.e. I[[1]] is the same as I$'0'):

> i <- 1
> I[[i]]
[1] 1 2
> I$'0' # to illustrate the indexing
[1] 1 2

Assuming that f is just meant to take a vector (rather than an index into I), its definition should be something like:

f = function(vec, x) {
  d = pmax(outer(x, vec, "-"), 0)
  colSums(d - d^2/2)
}

And the loop like:

for (i in 1:length(I)) {
  max.value = I[[i]][which.max(f(I[[i]], x))]
}

Note that you can iterate directly over the elements of a list; you don't need to index each one individually, so we could also do:

for (vec in I) {
  max.value = vec[which.max(f(vec, x))]
}

(Also, you might want something slightly different to what you have, since in each loop max.value is overwritten.)
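
If the goal is one maximum per interval, a sketch of collecting them all instead of overwriting (max.values is just an illustrative name):

max.values <- sapply(I, function(vec) vec[which.max(f(vec, x))])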

Any way to achieve numpy-like indexing of multi-dimensional arrays in R?

The following works for assigning a constant along any one of the k dimensions:

x = array(0, c(5,4,3,2))
indices = c(1,2,3)

k <- 1 # Dimension to subassign
x[slice.index(x, k) %in% indices] <- 5
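
For this 4-D x, that is the same as the explicit sub-assignment along the first dimension:

x[indices, , , ] <- 5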

However, this becomes a bit inconvenient if you want to do sub-assignment parallel to indices, especially in higher dimensions, because we have to match the order of the array. Example for the 2nd dimension:

x = array(0, c(5,4,3,2))
k <- 2
x[slice.index(x, k) %in% indices] <- rep(1:3, each = dim(x)[1])

The best option I have come up with is the following:

subassign_array <- function(x, i, k, value) {
each <- pmax(prod(dim(x)[seq_len(k - 1)]), 1)
x[slice.index(x, k) %in% i] <- rep(value, each = each)
x
}

For example, replacing the 1st and 2nd slices along the 3rd dimension with 1s and 2s:

subassign_array(x, 1:2, 3, 1:2)
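
For the 4-D x defined above, the returned array is the same as what these explicit assignments would produce:

x[, , 1, ] <- 1
x[, , 2, ] <- 2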

This doesn't cover every possible case for subassignment however.

How to create an increasing index based on a certain condition?

Waldi's answer is very good, here is a slightly modified version:

library(dplyr)

df %>%
  group_by(col2 = cumsum(co11 >= 5) + 1)

   co11 col2
1     1    1
2     1    1
3     1    1
4     1    1
5     1    1
6     5    2
7     6    3
8     1    3
9     1    3
10    1    3
11    2    3
12    3    3
13    4    3
14    5    4
15    8    5
16    1    5
17    1    5
18    2    5
19    2    5
20    8    6
21   10    7
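
If only the index column is needed, without actually grouping the data, the same idea works with mutate():

df %>%
  mutate(col2 = cumsum(co11 >= 5) + 1)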

