Arithmetic Mean on a Multidimensional Array in R and MATLAB: Drastic Difference in Performance

Arithmetic mean on a multidimensional array in R and MATLAB: drastic difference in performance

mean is particularly slow because of S3 method dispatch. This is faster:

set.seed(42)
a = array(data = runif(144*73*6*23*10), dim = c(144,73,10,6,23))

system.time({b = apply(a, c(1,2,4,5), mean.default)})
# user system elapsed
# 16.80 0.03 16.94

If you don't need to handle NAs you can use the internal function:

system.time({b1 = apply(a, c(1,2,4,5),  function(x) .Internal(mean(x)))})
# user system elapsed
# 6.80 0.04 6.86

For comparison:

system.time({b2 = apply(a, c(1,2,4,5),  function(x) sum(x)/length(x))})
# user system elapsed
# 9.05 0.01 9.08

system.time({b3 = apply(a, c(1,2,4,5), sum)
b3 = b3/dim(a)[[3]]})
# user system elapsed
# 7.44 0.03 7.47
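
A further option (a sketch that is not part of the timings above): since the mean here is taken over the third dimension only, you can permute that dimension to the end and use the primitive rowMeans() with its dims argument, avoiding the per-cell apply() call entirely.

system.time({b4 = rowMeans(aperm(a, c(1,2,4,5,3)), dims = 4)})
all.equal(b, b4)  # should be TRUE (up to floating-point error)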

(Note that all timings are only approximate. Proper benchmarking would require running this repeatedly, e.g., using one of the benchmarking packages. But I'm not patient enough for that right now.)

It might be possible to speed this up with an Rcpp implementation.
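
As a rough illustration of what such an Rcpp implementation might look like (a sketch, not benchmarked here), the loop over cells can run entirely in C++, so there is no per-cell R function call at all:

library(Rcpp)
cppFunction('
NumericVector mean_dim3(NumericVector a, IntegerVector d) {
  // d = dim(a); averages over the 3rd dimension of a 5-d array stored column-major
  int d1 = d[0], d2 = d[1], d3 = d[2], d4 = d[3], d5 = d[4];
  NumericVector out(d1 * d2 * d4 * d5);
  int o = 0;
  for (int i5 = 0; i5 < d5; ++i5)
    for (int i4 = 0; i4 < d4; ++i4)
      for (int i2 = 0; i2 < d2; ++i2)
        for (int i1 = 0; i1 < d1; ++i1) {
          double s = 0.0;
          for (int i3 = 0; i3 < d3; ++i3)
            s += a[i1 + d1 * (i2 + d2 * (i3 + d3 * (i4 + d4 * i5)))];
          out[o++] = s / d3;
        }
  out.attr("dim") = IntegerVector::create(d1, d2, d4, d5);
  return out;
}')

system.time({b5 = mean_dim3(a, dim(a))})
all.equal(b, b5)  # should be TRUE (up to floating-point error)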

Indexing a multidimensional array in R as in MATLAB

MATLAB fills the elements row-wise in A=[1:7;8:14;15:21], whereas R fills column-wise in A <- array(1:84, c(3,7,4)). This gives the desired result:

A <- array(NA, c(3,7,4))
A[,,1] <- matrix(c(1:7, 8:14, 15:21), 3, byrow=TRUE)
A[,,2] <- matrix(c(22:28, 29:35, 36:42), 3, byrow=TRUE)
A[,,3] <- matrix(c(43:49, 50:56, 57:63), 3, byrow=TRUE)
A[,,4] <- matrix(c(64:70, 71:77, 78:84), 3, byrow=TRUE)
A[1,, , drop=FALSE]
# > A[1,, , drop=FALSE]
# , , 1
#
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 1 2 3 4 5 6 7
#
# , , 2
#
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 22 23 24 25 26 27 28
#
# , , 3
#
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 43 44 45 46 47 48 49
#
# , , 4
#
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 64 65 66 67 68 69 70

The same result can be obtained with A[drop=FALSE, 1,,].
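
The same array can also be built more compactly (a sketch, not from the original answer): fill a 7 x 3 x 4 array column-wise and then swap the first two dimensions with aperm().

A2 <- aperm(array(1:84, c(7, 3, 4)), c(2, 1, 3))
identical(A2, A)  # should be TRUE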

How can I apply a function to every row/column of a matrix in MATLAB?

Many built-in operations like sum and prod are already able to operate across rows or columns, so you may be able to refactor the function you are applying to take advantage of this.

If that's not a viable option, one way to do it is to collect the rows or columns into cells using mat2cell or num2cell, then use cellfun to operate on the resulting cell array.

As an example, let's say you want to sum the columns of a matrix M. You can do this simply using sum:

M = magic(10);           %# A 10-by-10 matrix
columnSums = sum(M, 1); %# A 1-by-10 vector of sums for each column

And here is how you would do this using the more complicated num2cell/cellfun option:

M = magic(10);                  %# A 10-by-10 matrix
C = num2cell(M, 1); %# Collect the columns into cells
columnSums = cellfun(@sum, C); %# A 1-by-10 vector of sums for each cell

Why is mean() so slow?

It is due to the S3 lookup for the method, and then the argument parsing needed in mean.default (and the other code in mean as well).

sum and length are both primitive functions, so they will be fast (but how are you handling NA values?).
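
To make the NA caveat concrete (a small sketch, not part of the original benchmark):

x <- c(1, 2, NA)
mean(x, na.rm = TRUE)                  # 1.5
sum(x) / length(x)                     # NA
sum(x, na.rm = TRUE) / sum(!is.na(x))  # 1.5, the manual equivalent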

library(microbenchmark)

t1 <- rnorm(10)
microbenchmark(
  mean(t1),
  sum(t1)/length(t1),
  mean.default(t1),
  .Internal(mean(t1)),
  times = 10000)

Unit: nanoseconds
                expr   min    lq median    uq     max neval
            mean(t1) 10266 10951  11293 11635 1470714 10000
  sum(t1)/length(t1)   684  1027   1369  1711  104367 10000
    mean.default(t1)  2053  2396   2738  2739 1167195 10000
 .Internal(mean(t1))   342   343    685   685   86574 10000

The internal bit of mean is faster even than sum/length.

See http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table#method_dispatch_takes_time (mirror) for more details (and a data.table solution that avoids .Internal).

Note that if we increase the length of the vector, then the primitive approach is fastest:

t1 <- rnorm(1e7)
microbenchmark(
  mean(t1),
  sum(t1)/length(t1),
  mean.default(t1),
  .Internal(mean(t1)),
  times = 100)

Unit: milliseconds
                expr      min       lq   median       uq      max neval
            mean(t1) 25.79873 26.39242 26.56608 26.85523 33.36137   100
  sum(t1)/length(t1) 15.02399 15.22948 15.31383 15.43239 19.20824   100
    mean.default(t1) 25.69402 26.21466 26.44683 26.84257 33.62896   100
 .Internal(mean(t1)) 25.70497 26.16247 26.39396 26.63982 35.21054   100

Now method dispatch is only a small fraction of the overall time required.

Looking for faster way to implement logSumExp across multidimensional array

The otherwise great solution from @Miff was causing my code to crash with certain datasets because infinities were being produced. I eventually figured out this was due to an underflow problem, which can be avoided by using the 'logSumExp trick': https://www.xarg.org/2016/06/the-log-sum-exp-trick-in-machine-learning/
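
In one dimension the trick looks like this (a minimal sketch): subtract the maximum before exponentiating, so exp() cannot underflow everything to zero, and add it back afterwards.

logsumexp1 <- function(x) {
  m <- max(x)
  m + log(sum(exp(x - m)))
}
log(sum(exp(c(-1000, -1001))))  # -Inf: exp() underflows
logsumexp1(c(-1000, -1001))     # -999.6867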

Taking inspiration from @Miff's code and the R apply() function, I made a new function that gives faster calculations while avoiding the underflow issue. It is not quite as fast as @Miff's solution, however. Posting in case it helps others:

library(matrixStats)   # for colMaxs() and logSumExp()

apply_logSumExp <- function(X) {
  MARGIN <- c(1, 2, 3)   # margins are fixed here; other dims have not been tested
  dl <- length(dim(X))   # number of dimensions
  d <- dim(X)            # dimensions of X
  dn <- dimnames(X)      # dimnames of X
  ds <- seq_len(dl)      # sequence along the dimensions
  d.call <- d[-MARGIN]   # dimensions not included in MARGIN (summed over)
  d.ans <- d[MARGIN]     # dimensions of the answer array
  s.call <- ds[-MARGIN]  # used to define the permutation
  s.ans <- ds[MARGIN]    # used to define the permutation
  d2 <- prod(d.ans)      # length of the result object

  newX <- aperm(X, c(s.call, s.ans))   # permute X so the dims omitted from the calculation come first
  dim(newX) <- c(prod(d.call), d2)     # voodoo: keep the omitted dims as rows, collapse the MARGIN dims into columns

  maxes <- colMaxs(newX)               # column-wise maxima for the logSumExp trick
  ans <- maxes + log(colSums(exp(sweep(newX, 2, maxes, "-"))))
  ans <- array(ans, d.ans)

  return(ans)
}

> microbenchmark(
+   res1 <- apply(array4d, c(1,2,3), logSumExp),
+   res2 <- log(rowSums(exp(array4d), dims=3)),
+   res3 <- apply_logSumExp(array4d)
+ )
Unit: milliseconds
                                          expr        min         lq       mean    median        uq       max neval cld
 res1 <- apply(array4d, c(1, 2, 3), logSumExp) 176.286670 213.882443 247.420334 236.44593 267.81127 486.41072   100   c
  res2 <- log(rowSums(exp(array4d), dims = 3))   4.664907   5.821601   7.588448   5.97765   7.47814  30.58002   100   a
              res3 <- apply_logSumExp(array4d)  12.119875  14.673011  19.635265  15.20385  18.30471  90.59859   100   b
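
As a quick sanity check (assuming the same array4d as above), the three results should agree numerically:

all.equal(res1, res3)  # should be TRUE
all.equal(res1, res2)  # should be TRUE whenever exp() does not underflow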

Remove elements at a set of indices in a multidimensional array in MATLAB?

You can use logical indexing to filter the matrix, for example,

A = rand(1000,3);
A(A(:,1)>0.9, :) = [];

which removes the rows of A that have a value greater than 0.9 in the first column.

I'm not sure why your original approach didn't work though.

Pointwise multiplication and right matrix division

This operation, which samples n random points uniformly from the d-dimensional unit ball, could be stated in words as:

  1. Construct an n x d matrix with entries drawn from the standard normal distribution.
  2. Normalize each row so it has (2-norm) magnitude 1.
  3. For each row, draw a value from the uniform distribution on (0, 1), raise it to the 1/d power, and multiply all elements in the row by the result.

The following R code does these operations:

unif.samp <- function(n, d) {
  z <- matrix(rnorm(n*d), nrow=n, ncol=d)
  z * (runif(n)^(1/d) / sqrt(rowSums(z^2)))
}

Note that in the second line of code I have taken advantage of the fact that multiplying an n x d matrix in R by a vector of length n multiplies each row by the corresponding element of that vector (the vector is recycled down the columns). This saves us the work of using repmat to construct matrices of exactly the same size as our original matrix for these sorts of row-specific operations.
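
A quick check of that recycling behaviour (a small sketch): multiplying the matrix by the length-n vector gives the same result as sweeping the vector along the row margin.

z <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3)
v <- runif(5)
all.equal(z * v, sweep(z, 1, v, "*"))  # TRUE: each row i is scaled by v[i]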


