Multithreading with R

Multithreading with R?

You are confused.

The R (and before it, S) internals are single-threaded, and will almost surely remain single-threaded. As I understand it, Duncan Temple Lang's PhD work was about overcoming this, and if he can't do it...

That said, there are pockets of multi-threadedness:

  • First off, whenever you make external calls, and with proper locking, you can go multi-threaded. That is what the multithreaded BLAS libraries MKL, Goto/OpenBLAS, ATLAS (if built multithreaded), ... all offer. Revo R "merely" ships with Intel's MKL, as Intel happens to be a key Revo investor. (A short sketch of the BLAS route follows after this list.)

  • If you are careful about what you do, you can use OpenMP (a compiler extension for multi-threading). This started with Luke Tierney's work on pnmath and pnmath0 (which used to be experimental / external packages) and has since been coming into R itself, slowly but surely.

  • Next, in a multicore world, and on the right operating system, you can always fork(). That is what the multicore package pioneered and what the parallel package now carries on.

  • Last but not least there is the network / RPC route with MPI used by packages like Rmpi, snow, parallel, ... and covered in HPC introductions.
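
To make the first point concrete, here is a minimal sketch, assuming R has been linked against a multithreaded BLAS; the matrix size and the RhpcBLASctl mention are illustrative assumptions, not taken from the answer above.

# A large dense crossproduct: with a multithreaded BLAS (OpenBLAS, MKL, ATLAS
# built with threading, ...) this single call already spreads over several
# threads; with the plain reference BLAS it stays on one core.
n <- 4000L
A <- matrix(rnorm(n * n), n, n)
system.time(B <- crossprod(A))

# The thread count is typically set outside R (e.g. OMP_NUM_THREADS or
# OPENBLAS_NUM_THREADS), or from within R via the add-on RhpcBLASctl package:
# RhpcBLASctl::blas_set_num_threads(4)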

Parallel computing in R: Run multiple files in same session [multi-threaded]

Running code on different files and multi-threading are different things. With lapply I can run one function on several files / data frames, but it will be done sequentially. Packages like parallel allow me to run those processes concurrently, on multiple cores at the same time.

E.g.

list_of_dfs <- c("a.csv", "b.csv")

dfs <- lapply(list_of_dfs, read.csv)     # opens and reads the CSVs from the list sequentially

vs

library(parallel)
list_of_dfs <- c("a.csv", "b.csv")

dfs <- mclapply(list_of_dfs, read.csv)   # opens and reads the CSVs from the list concurrently

Note that in both cases the end result is the same; the second case may simply be faster thanks to parallelization. So whether multi-threading helps depends on your use case.
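
One hedged addition to the example above: mc.cores is worth setting explicitly, and because mclapply() relies on fork(), it is not available on Windows except with mc.cores = 1 (i.e. it runs sequentially there); a PSOCK cluster with parLapply() is the portable alternative. The file names below are just the placeholders from the example.

library(parallel)

list_of_dfs <- c("a.csv", "b.csv")   # placeholder file names from the example above

# mc.cores controls how many forked workers read CSVs concurrently;
# on Windows only mc.cores = 1 is allowed, i.e. sequential execution.
dfs <- mclapply(list_of_dfs, read.csv, mc.cores = 2)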

Using packages with multi-threading in R

The solution came from Ben Barnes, thank you.

The following code works fine:

# makeCluster/parLapply come from the parallel package; cellStats() is from the raster package.
# Note: type = "SOCK" is the snow-style socket cluster; parallel's own equivalent is "PSOCK".
library(parallel)
library(raster)

mean_function <- function(variable) {
  cellStats(variable, stat = 'mean', na.rm = TRUE)
}

cl <- makeCluster(procs, type = "SOCK")
clusterEvalQ(cl, library(raster))          # make the raster package available on every worker
result <- parLapply(cl, a_list, mean_function)
stopCluster(cl)

Here procs is the number of worker processes you wish to use. parLapply splits the list across the workers, so it does not have to match the length of the list you are passing (here called a_list); using more workers than list elements simply leaves some of them idle.

a_list needs to be a list of raster objects on which cellStats can compute the mean. A hedged sketch of how procs and a_list might be built follows.
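
This is only an illustration: the directory name, the GeoTIFF pattern, and the detectCores() heuristic are assumptions, not part of the original solution.

library(parallel)
library(raster)

# hypothetical input: all GeoTIFFs sitting in a "rasters" directory
files  <- list.files("rasters", pattern = "\\.tif$", full.names = TRUE)
a_list <- lapply(files, raster)              # a plain list of RasterLayer objects
procs  <- max(1L, min(length(a_list), detectCores() - 1L))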

Parallelizing / Multithreading with data.table

I got answers from the data.table developers on the data.table GitHub.

Here's a summary:

  • Finding the groups of the by variable is always parallelized, but more importantly,

  • If the function in j is generic (a user-defined function), there is no parallelization.

  • Operations in j are parallelized if the expression is GForce-optimized (expressions in j that contain only the functions min, max, mean, median, var, sd, sum, prod, first, last, head, tail).

So it is advisable to parallelize manually when the function in j is generic, though that does not always guarantee a speed gain.
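
To illustrate the GForce point, here is a minimal sketch; the example data, column names, and the slow_mean helper are invented for illustration. Turning on datatable.verbose shows whether GForce kicked in.

library(data.table)

DT <- data.table(g = rep(1:4, each = 250000L), x = rnorm(1e6))

options(datatable.verbose = TRUE)      # prints whether GForce was applied

# mean() in j is recognised and replaced by internal GForce code
DT[, .(m = mean(x)), by = g]

# a user-defined function in j is evaluated once per group in ordinary R,
# with no internal parallelization
slow_mean <- function(v) sum(v) / length(v)
DT[, .(m = slow_mean(x)), by = g]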

Solution

In my case, I encountered vector memory exhaustion when I plainly used DT[, var := some_function(var2)], even though my server had 1 TB of RAM while the data took about 200 GB of memory.

I used split(DT, by = 'grouper') to split my data.table into chunks, and used doFuture with foreach's %dopar% to do the job. It was pretty fast.
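
A minimal sketch of that pattern, assuming the column names grouper and var2 from the snippets above; the example data, the some_function stand-in, and the multisession plan are assumptions for illustration.

library(data.table)
library(foreach)
library(future)
library(doFuture)

registerDoFuture()          # let %dopar% dispatch through the future framework
plan(multisession)          # separate R processes; plan(multicore) forks on Unix-alikes

# invented example data; 'grouper' and 'var2' mirror the column names above
DT <- data.table(grouper = rep(1:8, each = 1e5), var2 = rnorm(8e5))

some_function <- function(v) v * 2   # stand-in for the real transformation

chunks <- split(DT, by = "grouper")  # one data.table per group

result <- foreach(chunk = chunks, .combine = rbind) %dopar% {
  chunk[, var := some_function(var2)]   # each chunk is processed in its own worker
  chunk
}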


