Options for Caching/Memoization/Hashing in R

How to cache parallelly computed results with memoise::memoise?

Turning my comment into an answer:

The R.cache package (I'm the author) memoizes to file, which means the cache works across R sessions and in parallel R processes. Obviously, caching to file adds a bit of overhead compared to caching to memory.

Cache expensive operations in R

## load the file from disk only if it 
## hasn't already been read into a variable
if(!(exists("mytable")){
  mytable=read.csv(...)
}

Edit: fixed typo - thanks Dirk.

Load current cache using memoise

One can define the following function, which takes a memoised function and returns the result of its last evaluation:

getLast <- function(fn){
    stopifnot(class(fn) == c("memoised", "function"))
    keys <- get("_cache", envir=environment(fn))$keys()
    n <- length(keys)
    get("_cache", envir=environment(fn))$get(keys[n])[[1]]
}

Example:

library(memoise)
fn <- function(x) x*10
fn_mem <- memoise(fn)
fn_mem(1)
[1] 10
getLast(fn_mem)
[1] 10

fn_mem(7)
[1] 70
getLast(fn_mem)
[1] 70

Factorial Memoization in R

First of all, if you need an efficient implementation, use R's factorial function. Don't write it yourself. Then, the factorial is a good exercise for understanding recursion:

myfactorial <- function(n) {
  if (n == 1) return(1)
  n * myfactorial(n-1)
}

myfactorial(10)
#[1] 3628800

With this function memoization is only useful, if you intend to use the function repeatedly. You can implement memoization in R using closures. Hadley explains these in his book.

createMemFactorial <- function() {
  res <- 1
  memFactorial <- function(n) {
    if (n == 1) return(1)

    #grow res if necessary
    if (length(res) < n) res <<- `length<-`(res, n)

    #return pre-calculated value
    if (!is.na(res[n])) return(res[n])

    #calculate new values
    res[n] <<- n * factorial(n-1)
    res[n]
  }
  memFactorial
}
memFactorial <- createMemFactorial()

memFactorial(10)
#[1] 3628800

Is it actually faster?

library(microbenchmark)
microbenchmark(factorial(10),
               myfactorial(10), 
               memFactorial(10))
#Unit: nanoseconds
#             expr  min     lq    mean median     uq   max neval cld
#    factorial(10)  235  264.0  348.02  304.5  378.5  2463   100 a  
#  myfactorial(10) 4799 5279.5 6491.94 5629.0 6044.5 15955   100   c
# memFactorial(10)  950 1025.0 1344.51 1134.5 1292.0  7942   100  b

Note that microbenchmark evaluates the functions (by default) 100 times. Since we have stored the value for n = 10 when testing the memFactorial, we time only the if conditions and the lookup here. As you can also see, R's implementation, which is mostly written in C, is faster.

A better (and easier) example implements Fibonacci numbers. Here the algorithm itself benefits from memoization.

#naive recursive implementation
fib <- function(n)  {
  if(n == 1 || n == 2) return(1)
  fib(n-1) + fib(n-2)
}

#with memoization
fibm <- function(n)  {
  if(n == 1 || n == 2) return(1)

  seq <- integer(n)
  seq[1:2] <- 1

  calc <- function(n) {
    if (seq[n] != 0) return(seq[n])
    seq[n] <<- calc(n-1) + calc(n-2)
    seq[n]
  }

  calc(n)
}

#try it:
fib(20)
#[1] 6765
fibm(20)
#[1] 6765

#Is memoization faster?
microbenchmark(fib(20),
               fibm(20))
#Unit: microseconds
#     expr      min       lq       mean    median        uq       max neval cld
# fib(20)  8005.314 8804.130 9758.75325 9301.6210 9798.8500 46867.182   100   b
#fibm(20)    38.991   44.798   54.12626   53.6725   60.4035    97.089   100  a

Something like R.cache but in RAM

Perhaps you could make a RAM disk and specify that drive as the storage destination for your cache, using R.cache.

What does the term memoize imply?

I believe that memoizing a function allows you to locally cache the result of a function for a given set of arguments. It's almost like:

function f(a, b, c) {
  if (a==1 && b==2 && !c) {
    return 5;
  } else if (a==1 && b==3 && !c) {
    return 17.2;
  } /* ... etc ... */

  // some horribly long/complex/expensive calculation
  return result;
}

but with the initial huge "if" block being handled automatically and far more efficiently, and being added to as the function is called with different arguments.

Note that you can only memoize a function that is deterministic and has no side effects. This means that the function's result can only depend on its inputs, and it cannot change anything when it is run.

So in short, memoization is function-local caching in very specific circumstances, and so it's a specialisation of regular caching.