Options for Caching/Memoization/Hashing in R

How to cache parallelly computed results with memoise::memoise?

Turning my comment into an answer:

The R.cache package (I'm the author) memoizes to file, which means the cache works across R sessions and in parallel R processes. Obviously, caching to file adds a bit of overhead compared to caching to memory.

Cache expensive operations in R

## load the file from disk only if it 
## hasn't already been read into a variable
if(!(exists("mytable")){
mytable=read.csv(...)
}

Edit: fixed typo - thanks Dirk.

Load current cache using memoise

One can define the following function, which takes a memoised function and returns the result of its last evaluation:

getLast <- function(fn){
stopifnot(class(fn) == c("memoised", "function"))
keys <- get("_cache", envir=environment(fn))$keys()
n <- length(keys)
get("_cache", envir=environment(fn))$get(keys[n])[[1]]
}

Example:

library(memoise)
fn <- function(x) x*10
fn_mem <- memoise(fn)
fn_mem(1)
[1] 10
getLast(fn_mem)
[1] 10

fn_mem(7)
[1] 70
getLast(fn_mem)
[1] 70

Factorial Memoization in R

First of all, if you need an efficient implementation, use R's factorial function. Don't write it yourself. Then, the factorial is a good exercise for understanding recursion:

myfactorial <- function(n) {
if (n == 1) return(1)
n * myfactorial(n-1)
}

myfactorial(10)
#[1] 3628800

With this function memoization is only useful, if you intend to use the function repeatedly. You can implement memoization in R using closures. Hadley explains these in his book.

createMemFactorial <- function() {
res <- 1
memFactorial <- function(n) {
if (n == 1) return(1)

#grow res if necessary
if (length(res) < n) res <<- `length<-`(res, n)

#return pre-calculated value
if (!is.na(res[n])) return(res[n])

#calculate new values
res[n] <<- n * factorial(n-1)
res[n]
}
memFactorial
}
memFactorial <- createMemFactorial()

memFactorial(10)
#[1] 3628800

Is it actually faster?

library(microbenchmark)
microbenchmark(factorial(10),
myfactorial(10),
memFactorial(10))
#Unit: nanoseconds
# expr min lq mean median uq max neval cld
# factorial(10) 235 264.0 348.02 304.5 378.5 2463 100 a
# myfactorial(10) 4799 5279.5 6491.94 5629.0 6044.5 15955 100 c
# memFactorial(10) 950 1025.0 1344.51 1134.5 1292.0 7942 100 b

Note that microbenchmark evaluates the functions (by default) 100 times. Since we have stored the value for n = 10 when testing the memFactorial, we time only the if conditions and the lookup here. As you can also see, R's implementation, which is mostly written in C, is faster.

A better (and easier) example implements Fibonacci numbers. Here the algorithm itself benefits from memoization.

#naive recursive implementation
fib <- function(n) {
if(n == 1 || n == 2) return(1)
fib(n-1) + fib(n-2)
}

#with memoization
fibm <- function(n) {
if(n == 1 || n == 2) return(1)

seq <- integer(n)
seq[1:2] <- 1

calc <- function(n) {
if (seq[n] != 0) return(seq[n])
seq[n] <<- calc(n-1) + calc(n-2)
seq[n]
}

calc(n)
}

#try it:
fib(20)
#[1] 6765
fibm(20)
#[1] 6765

#Is memoization faster?
microbenchmark(fib(20),
fibm(20))
#Unit: microseconds
# expr min lq mean median uq max neval cld
# fib(20) 8005.314 8804.130 9758.75325 9301.6210 9798.8500 46867.182 100 b
#fibm(20) 38.991 44.798 54.12626 53.6725 60.4035 97.089 100 a

Something like R.cache but in RAM

Perhaps you could make a RAM disk and specify that drive as the storage destination for your cache, using R.cache.

What does the term memoize imply?

I believe that memoizing a function allows you to locally cache the result of a function for a given set of arguments. It's almost like:

function f(a, b, c) {
if (a==1 && b==2 && !c) {
return 5;
} else if (a==1 && b==3 && !c) {
return 17.2;
} /* ... etc ... */

// some horribly long/complex/expensive calculation
return result;
}

but with the initial huge "if" block being handled automatically and far more efficiently, and being added to as the function is called with different arguments.

Note that you can only memoize a function that is deterministic and has no side effects. This means that the function's result can only depend on its inputs, and it cannot change anything when it is run.

So in short, memoization is function-local caching in very specific circumstances, and so it's a specialisation of regular caching.



Related Topics



Leave a reply



Submit