Arithmetic Mean on a Multidimensional Array in R and MATLAB: Drastic Difference in Performance

Arithmetic mean on a multidimensional array in R and MATLAB: drastic difference in performance

mean is particularly slow because of S3 method dispatch. This is faster:

set.seed(42)
a = array(data = runif(144*73*6*23*10), dim = c(144,73,10,6,23))

system.time({b = apply(a, c(1,2,4,5), mean.default)})
# user system elapsed
# 16.80 0.03 16.94

If you don't need to handle NAs you can use the internal function:

system.time({b1 = apply(a, c(1,2,4,5),  function(x) .Internal(mean(x)))})
# user system elapsed
# 6.80 0.04 6.86

For comparison:

system.time({b2 = apply(a, c(1,2,4,5),  function(x) sum(x)/length(x))})
# user system elapsed
# 9.05 0.01 9.08

system.time({b3 = apply(a, c(1,2,4,5), sum)
b3 = b3/dim(a)[[3]]})
# user system elapsed
# 7.44 0.03 7.47
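
A further option (a sketch that is not part of the timings above): since the mean here is taken over the third dimension only, you can permute that dimension to the end and use the primitive rowMeans() with its dims argument, avoiding the per-cell apply() call entirely.

system.time({b4 = rowMeans(aperm(a, c(1,2,4,5,3)), dims = 4)})
all.equal(b, b4)  # should be TRUE (up to floating-point error)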

(Note that all timings are only approximate. Proper benchmarking would require running this repeatedly, e.g., using one of the benchmarking packages. But I'm not patient enough for that right now.)

It might be possible to speed this up with an Rcpp implementation.
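
As a rough illustration of what such an Rcpp implementation might look like (a sketch, not benchmarked here), the loop over cells can run entirely in C++, so there is no per-cell R function call at all:

library(Rcpp)
cppFunction('
NumericVector mean_dim3(NumericVector a, IntegerVector d) {
  // d = dim(a); averages over the 3rd dimension of a 5-d array stored column-major
  int d1 = d[0], d2 = d[1], d3 = d[2], d4 = d[3], d5 = d[4];
  NumericVector out(d1 * d2 * d4 * d5);
  int o = 0;
  for (int i5 = 0; i5 < d5; ++i5)
    for (int i4 = 0; i4 < d4; ++i4)
      for (int i2 = 0; i2 < d2; ++i2)
        for (int i1 = 0; i1 < d1; ++i1) {
          double s = 0.0;
          for (int i3 = 0; i3 < d3; ++i3)
            s += a[i1 + d1 * (i2 + d2 * (i3 + d3 * (i4 + d4 * i5)))];
          out[o++] = s / d3;
        }
  out.attr("dim") = IntegerVector::create(d1, d2, d4, d5);
  return out;
}')

system.time({b5 = mean_dim3(a, dim(a))})
all.equal(b, b5)  # should be TRUE (up to floating-point error)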

Indexing a multidimensional array in R as in MATLAB

MATLAB fills the elements row-wise in A=[1:7;8:14;15:21], whereas R fills column-wise in A <- array(1:84, c(3,7,4)). This gives the desired result:

A <- array(NA, c(3,7,4))
A[,,1] <- matrix(c(1:7, 8:14, 15:21), 3, byrow=TRUE)
A[,,2] <- matrix(c(22:28, 29:35, 36:42), 3, byrow=TRUE)
A[,,3] <- matrix(c(43:49, 50:56, 57:63), 3, byrow=TRUE)
A[,,4] <- matrix(c(64:70, 71:77, 78:84), 3, byrow=TRUE)
A[1,, , drop=FALSE]
# > A[1,, , drop=FALSE]
# , , 1
#
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 1 2 3 4 5 6 7
#
# , , 2
#
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 22 23 24 25 26 27 28
#
# , , 3
#
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 43 44 45 46 47 48 49
#
# , , 4
#
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,] 64 65 66 67 68 69 70

The same result can be obtained with A[drop=FALSE, 1,,].
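
The same array can also be built more compactly (a sketch, not from the original answer): fill a 7 x 3 x 4 array column-wise and then swap the first two dimensions with aperm().

A2 <- aperm(array(1:84, c(7, 3, 4)), c(2, 1, 3))
identical(A2, A)  # should be TRUE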

How can I apply a function to every row/column of a matrix in MATLAB?

Many built-in operations like sum and prod are already able to operate across rows or columns, so you may be able to refactor the function you are applying to take advantage of this.

If that's not a viable option, one way to do it is to collect the rows or columns into cells using mat2cell or num2cell, then use cellfun to operate on the resulting cell array.

As an example, let's say you want to sum the columns of a matrix M. You can do this simply using sum:

M = magic(10);           %# A 10-by-10 matrix
columnSums = sum(M, 1); %# A 1-by-10 vector of sums for each column

And here is how you would do this using the more complicated num2cell/cellfun option:

M = magic(10);                  %# A 10-by-10 matrix
C = num2cell(M, 1); %# Collect the columns into cells
columnSums = cellfun(@sum, C); %# A 1-by-10 vector of sums for each cell

Why is mean() so slow?

It is due to the S3 lookup for the method, and then the argument parsing needed in mean.default (and the other code in mean as well).

sum and length are both primitive functions, so they will be fast (but how are you handling NA values?).
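
To make the NA caveat concrete (a small sketch, not part of the original benchmark):

x <- c(1, 2, NA)
mean(x, na.rm = TRUE)                  # 1.5
sum(x) / length(x)                     # NA
sum(x, na.rm = TRUE) / sum(!is.na(x))  # 1.5, the manual equivalent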

library(microbenchmark)

t1 <- rnorm(10)
microbenchmark(
  mean(t1),
  sum(t1)/length(t1),
  mean.default(t1),
  .Internal(mean(t1)),
  times = 10000)

Unit: nanoseconds
                expr   min    lq median    uq     max neval
            mean(t1) 10266 10951  11293 11635 1470714 10000
  sum(t1)/length(t1)   684  1027   1369  1711  104367 10000
    mean.default(t1)  2053  2396   2738  2739 1167195 10000
 .Internal(mean(t1))   342   343    685   685   86574 10000

The internal bit of mean is faster even than sum/length.

See http://rwiki.sciviews.org/doku.php?id=packages:cran:data.table#method_dispatch_takes_time (mirror) for more details (and a data.table solution that avoids .Internal).

Note that if we increase the length of the vector, then the primitive approach is fastest:

t1 <- rnorm(1e7)
microbenchmark(
  mean(t1),
  sum(t1)/length(t1),
  mean.default(t1),
  .Internal(mean(t1)),
  times = 100)

Unit: milliseconds
                expr      min       lq   median       uq      max neval
            mean(t1) 25.79873 26.39242 26.56608 26.85523 33.36137   100
  sum(t1)/length(t1) 15.02399 15.22948 15.31383 15.43239 19.20824   100
    mean.default(t1) 25.69402 26.21466 26.44683 26.84257 33.62896   100
 .Internal(mean(t1)) 25.70497 26.16247 26.39396 26.63982 35.21054   100

Now method dispatch is only a small fraction of the overall time required.

Looking for faster way to implement logSumExp across multidimensional array

The otherwise great solution from @Miff was causing my code to crash with certain datasets because infinities were being produced. I eventually figured out this was due to an underflow problem, which can be avoided by using the 'logSumExp trick': https://www.xarg.org/2016/06/the-log-sum-exp-trick-in-machine-learning/
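
In one dimension the trick looks like this (a minimal sketch): subtract the maximum before exponentiating, so exp() cannot underflow everything to zero, and add it back afterwards.

logsumexp1 <- function(x) {
  m <- max(x)
  m + log(sum(exp(x - m)))
}
log(sum(exp(c(-1000, -1001))))  # -Inf: exp() underflows
logsumexp1(c(-1000, -1001))     # -999.6867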

Taking inspiration from @Miff's code and the R apply() function, I made a new function that gives faster calculations while avoiding the underflow issue. It is not quite as fast as @Miff's solution, however. Posting in case it helps others:

library(matrixStats)   # for colMaxs() and logSumExp()

apply_logSumExp <- function(X) {
  MARGIN <- c(1, 2, 3)   # margins are fixed here; other dims have not been tested
  dl <- length(dim(X))   # number of dimensions
  d <- dim(X)            # dimensions of X
  dn <- dimnames(X)      # dimnames of X
  ds <- seq_len(dl)      # sequence along the dimensions
  d.call <- d[-MARGIN]   # dimensions not included in MARGIN (summed over)
  d.ans <- d[MARGIN]     # dimensions of the answer array
  s.call <- ds[-MARGIN]  # used to define the permutation
  s.ans <- ds[MARGIN]    # used to define the permutation
  d2 <- prod(d.ans)      # length of the result object

  newX <- aperm(X, c(s.call, s.ans))   # permute X so the dims omitted from the calculation come first
  dim(newX) <- c(prod(d.call), d2)     # voodoo: keep the omitted dims as rows, collapse the MARGIN dims into columns

  maxes <- colMaxs(newX)               # column-wise maxima for the logSumExp trick
  ans <- maxes + log(colSums(exp(sweep(newX, 2, maxes, "-"))))
  ans <- array(ans, d.ans)

  return(ans)
}

> microbenchmark(
+   res1 <- apply(array4d, c(1,2,3), logSumExp),
+   res2 <- log(rowSums(exp(array4d), dims=3)),
+   res3 <- apply_logSumExp(array4d)
+ )
Unit: milliseconds
                                          expr        min         lq       mean    median        uq       max neval cld
 res1 <- apply(array4d, c(1, 2, 3), logSumExp) 176.286670 213.882443 247.420334 236.44593 267.81127 486.41072   100   c
  res2 <- log(rowSums(exp(array4d), dims = 3))   4.664907   5.821601   7.588448   5.97765   7.47814  30.58002   100   a
              res3 <- apply_logSumExp(array4d)  12.119875  14.673011  19.635265  15.20385  18.30471  90.59859   100   b
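
As a quick sanity check (assuming the same array4d as above), the three results should agree numerically:

all.equal(res1, res3)  # should be TRUE
all.equal(res1, res2)  # should be TRUE whenever exp() does not underflow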

Remove elements at a set of indices in a multidimensional array in MATLAB?

You can use logical indexing to filter the matrix, for example,

A = rand(1000,3);
A(A(:,1)>0.9, :) = [];

which removes the rows of A that have a value greater than 0.9 in the first column.

I'm not sure why your original approach didn't work though.

Pointwise multiplication and right matrix division

This operation, which samples n random points uniformly from the d-dimensional unit ball, could be stated in words as:

  1. Construct an n x d matrix with entries drawn from the standard normal distribution.
  2. Normalize each row so it has (2-norm) magnitude 1.
  3. For each row, draw a value from the uniform distribution on (0, 1), raise it to the 1/d power, and multiply all elements in the row by the result.

The following R code does these operations:

unif.samp <- function(n, d) {
  z <- matrix(rnorm(n*d), nrow=n, ncol=d)
  z * (runif(n)^(1/d) / sqrt(rowSums(z^2)))
}

Note that in the second line of code I have taken advantage of the fact that multiplying an n x d matrix in R by a vector of length n multiplies each row by the corresponding element of that vector (the vector is recycled down the columns). This saves us the work of using repmat to construct matrices of exactly the same size as our original matrix for these sorts of row-specific operations.
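
A quick check of that recycling behaviour (a small sketch): multiplying the matrix by the length-n vector gives the same result as sweeping the vector along the row margin.

z <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3)
v <- runif(5)
all.equal(z * v, sweep(z, 1, v, "*"))  # TRUE: each row i is scaled by v[i]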


