Can Lapply Not Modify Variables in a Higher Scope

I discussed this issue in this related question: "Is R’s apply family more than syntactic sugar". You will notice that if you look at the function signature for for and apply, they have one critical difference: a for loop evaluates an expression, while an apply loop evaluates a function.

If you want to alter things outside the scope of an apply function, then you need to use <<- or assign. Or more to the point, use something like a for loop instead. But you really need to be careful when working with things outside of a function because it can result in unexpected behavior.
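
As a minimal sketch of that difference (the object `mat` and the values written into it are illustrative, not taken from the question), a plain assignment inside lapply leaves the outer object untouched, while <<- reaches out and changes it:

```r
# Illustrative object; the name 'mat' is only an example
mat <- matrix(0, nrow = 10, ncol = 1)

# Plain assignment: each call modifies its own local copy of 'mat'
invisible(lapply(1:10, function(i) { mat[i, ] <- i }))
unchanged <- all(mat == 0)  # the outer 'mat' is still all zeros

# Super-assignment: <<- updates 'mat' in the enclosing environment
invisible(lapply(1:10, function(i) { mat[i, ] <<- i }))
changed <- all(mat[, 1] == 1:10)  # the outer 'mat' now holds 1..10
```

The second form works, but the update is hidden inside the function body, which is exactly the kind of side effect the warning above is about.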

In my opinion, one of the primary reasons to use an apply function is explicitly because it doesn't alter things outside of it. This is a core concept in functional programming, wherein functions avoid having side effects. This is also a reason why the apply family of functions can be used in parallel processing (and similar functions exist in the various parallel packages such as snow).

Lastly, the right way to run your code example is to pass the parameters to your function and assign the output back, like so:

mat <- matrix(0, nrow=10, ncol=1)
# each call returns the new value for row i; unlist() turns the list into a numeric vector
mat <- matrix(unlist(lapply(1:10, function(i, mat) { mat[i,] <- rnorm(1, mean=i); mat[i,] }, mat=mat)), ncol=1)

It is always best to be explicit about a parameter when possible (hence the mat=mat) rather than inferring it.

Difference between using higher scope variables and using variables explicitly passed in a function

If a function uses a variable from an enclosing scope, it can cause a side effect (the function modifying the outer variable). This is considered bad practice because it makes the function impure.

Global variables are considered bad practice and should only be used if the variable is constant. A constant is fine, because the function can't modify it.
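
A small sketch of the distinction, using made-up names (`scale_factor`, `scale_impure`, `scale_pure`) purely for illustration:

```r
scale_factor <- 2  # illustrative variable in the calling scope

# Impure: reads and modifies a variable outside its own scope
scale_impure <- function(x) {
  scale_factor <<- scale_factor + 1  # side effect on the outer variable
  x * scale_factor
}

# Pure: everything it needs arrives as an argument, nothing outside changes
scale_pure <- function(x, k) {
  x * k
}

a <- scale_impure(10)    # 30, and scale_factor is now 3
b <- scale_impure(10)    # 40: same input, different result
c1 <- scale_pure(10, 2)  # 20
c2 <- scale_pure(10, 2)  # 20 every time
```

The impure version gives different answers for the same input because its result depends on how many times it has already been called.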

Can non-global variables be modified inside a function in R?

It is possible to update a global variable inside a function using the get and assign functions. The code below does this:

heatmap.matrix <- matrix(rep(0,40000), nrow=200, ncol=200)

# foo function should just update a single cell of the declared matrix
varName <- "heatmap.matrix"
foo <- function() {
  heatmap.matrix.copy <- get(varName)
  heatmap.matrix.copy[40,40] <- 100
  assign(varName, heatmap.matrix.copy, pos=1)
}

heatmap.matrix[40,40]
#[1] 0
foo()
heatmap.matrix[40,40]
# [1] 100

You should read up a bit on the concept of environments. The best place to start is http://adv-r.had.co.nz/Environments.html
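
For variables that are not global, the same idea can be sketched with a closure or an explicit environment. The names below (`make_counter`, `env`, `add_to_total`) are hypothetical and only serve the illustration:

```r
# Hypothetical helper: returns a function that keeps a private counter
make_counter <- function() {
  count <- 0  # lives in make_counter's environment, not the global one
  function() {
    count <<- count + 1  # modifies 'count' in the enclosing environment
    count
  }
}

counter <- make_counter()
counter()
counter()
third <- counter()  # 3

# The same idea with an explicit environment and assign()/get()
env <- new.env()
assign("total", 0, envir = env)
add_to_total <- function(x) {
  assign("total", get("total", envir = env) + x, envir = env)
  invisible(NULL)
}
add_to_total(5)
add_to_total(7)
total <- get("total", envir = env)  # 12
```

In both cases the modified variable lives in an environment other than the global one, so the state is at least contained in one place.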

How to define multiple variables with lapply?

General solution

Try outer:

c(outer(1:10, 2:4, Vectorize(function(x, y) x*y)))
## [1] 2 4 6 8 10 12 14 16 18 20 3 6 9 12 15 18 21 24 27 30 4 8 12 16 20
## [26] 24 28 32 36 40

If function is Vectorized already

If the function is already vectorized, as it is here, then we can omit Vectorize:

c(outer(1:10, 2:4, function(x, y) x * y))
## [1] 2 4 6 8 10 12 14 16 18 20 3 6 9 12 15 18 21 24 27 30 4 8 12 16 20
## [26] 24 28 32 36 40

Particular example shown in question

In fact, in this particular case the anonymous function shown is the default so this would work:

c(outer(1:10, 2:4))
## [1] 2 4 6 8 10 12 14 16 18 20 3 6 9 12 15 18 21 24 27 30 4 8 12 16 20
## [26] 24 28 32 36 40

Also in this particular case we could use:

c(1:10 %o% 2:4)
## [1] 2 4 6 8 10 12 14 16 18 20 3 6 9 12 15 18 21 24 27 30 4 8 12 16 20
## [26] 24 28 32 36 40

If input is list X

If your starting point is list X shown in the question then:

c(outer(X[[1]], X[[2]], Vectorize(function(x, y) x * y)))
## [1] 2 4 6 8 10 12 14 16 18 20 3 6 9 12 15 18 21 24 27 30 4 8 12 16 20
## [26] 24 28 32 36 40

or

c(do.call("outer", c(unname(X), Vectorize(function(x, y) x*y))))
## [1] 2 4 6 8 10 12 14 16 18 20 3 6 9 12 15 18 21 24 27 30 4 8 12 16 20
## [26] 24 28 32 36 40

The shortcuts from the prior sections can be used to shorten this too, where they apply.
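
The list X from the question is not reproduced here. Assuming it is a two-element list such as list(x = 1:10, y = 2:4) (an assumption chosen only because it matches the output shown), the variants can be checked against each other:

```r
# Assumed input: the original question's X is not shown here
X <- list(x = 1:10, y = 2:4)

f <- function(x, y) x * y

direct  <- c(outer(X[[1]], X[[2]], Vectorize(f)))
via_do  <- c(do.call("outer", c(unname(X), Vectorize(f))))
default <- c(outer(1:10, 2:4))

# All three forms should yield the same 30 products
same <- identical(direct, via_do) && all(direct == default)
```

unname(X) matters in the do.call form: with the names left in place, the list elements would be matched to outer's arguments by name rather than by position.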

Is R's apply family more than syntactic sugar?

The apply functions in R don't provide improved performance over other looping functions (e.g. for). One exception to this is lapply which can be a little faster because it does more work in C code than in R (see this question for an example of this).

But in general, the rule is that you should use an apply function for clarity, not for performance.
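
To illustrate the clarity argument, here is the same small computation written both ways (a sketch with an illustrative input; vapply is used so the result type is stated explicitly):

```r
x <- 1:5

# for loop: preallocate the result, then fill it element by element
squares_for <- numeric(length(x))
for (i in seq_along(x)) {
  squares_for[i] <- x[i]^2
}

# vapply: the same computation, with the result type stated up front
squares_apply <- vapply(x, function(v) v^2, numeric(1))
```

Both produce the same vector; the apply form simply says "do this to each element" without the bookkeeping.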

I would add to this that apply functions have no side effects, which is an important distinction when it comes to functional programming with R. This can be overridden by using assign or <<-, but that can be very dangerous. Side effects also make a program harder to understand since a variable's state depends on the history.

Edit:

Just to emphasize this with a trivial example that recursively calculates the Fibonacci sequence; this could be run multiple times to get an accurate measure, but the point is that none of the methods have significantly different performance:

> fibo <- function(n) {
+ if ( n < 2 ) n
+ else fibo(n-1) + fibo(n-2)
+ }
> system.time(for(i in 0:26) fibo(i))
user system elapsed
7.48 0.00 7.52
> system.time(sapply(0:26, fibo))
user system elapsed
7.50 0.00 7.54
> system.time(lapply(0:26, fibo))
user system elapsed
7.48 0.04 7.54
> library(plyr)
> system.time(ldply(0:26, fibo))
user system elapsed
7.52 0.00 7.58

Edit 2:

Regarding the usage of parallel packages for R (e.g. rpvm, rmpi, snow), these do generally provide apply family functions (even the foreach package is essentially equivalent, despite the name). Here's a simple example of the sapply function in snow:

library(snow)
cl <- makeSOCKcluster(c("localhost","localhost"))
parSapply(cl, 1:20, get("+"), 3)
stopCluster(cl)  # shut the worker processes down when finished

This example uses a socket cluster, for which no additional software needs to be installed; otherwise you will need something like PVM or MPI (see Tierney's clustering page). snow has the following apply functions:

parLapply(cl, x, fun, ...)
parSapply(cl, X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
parApply(cl, X, MARGIN, FUN, ...)
parRapply(cl, x, fun, ...)
parCapply(cl, x, fun, ...)

It makes sense that apply functions should be used for parallel execution since they have no side effects. When you change a variable's value within a for loop, the change is made in the environment where the loop runs (the global environment, if the loop is at top level). On the other hand, all apply functions can safely be used in parallel because changes are local to the function call (unless you try to use assign or <<-, in which case you can introduce side effects). Needless to say, it's critical to be careful about local vs. global variables, especially when dealing with parallel execution.

Edit:

Here's a trivial example to demonstrate the difference between for and *apply so far as side effects are concerned:

> df <- 1:10
> # *apply example
> lapply(2:3, function(i) df <- df * i)
> df
[1] 1 2 3 4 5 6 7 8 9 10
> # for loop example
> for(i in 2:3) df <- df * i
> df
[1] 6 12 18 24 30 36 42 48 54 60

Note how the df in the parent environment is altered by for but not *apply.
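
If the cumulative result of the for loop is what you want, one side-effect-free way to get it is Reduce from base R, assigning the returned value back explicitly. This is a sketch of one option, not the only approach:

```r
df <- 1:10

# Reduce threads the running value through each multiplier and returns it;
# nothing outside the call changes until the result is assigned back
result <- Reduce(function(acc, i) acc * i, 2:3, df)
still_original <- identical(df, 1:10)

df <- result  # explicit assignment makes the update visible at the call site
```

Here df ends up with the same values as in the for loop example, but the only place it changes is the final assignment.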


