multithreading with R?
You are confused.
The R (and before it, S) internals are single-threaded, and will almost surely remain single-threaded. As I understand it, Duncan Temple Lang's PhD work was about overcoming this, and if he can't do it...
That said, there are pockets of multi-threadedness:
First off, whenever you make external calls, and with proper locking, you can go multi-threaded. That is what the BLAS libraries MKL, Goto/Open BLAS, Atlas (if built multithreaded), ... all offer. Revo R "merely" ships with (Intel's) MKL as Intel happens to be a key Revo investor.
If you are careful about what you do, you can use OpenMP (a compiler extension for multi-threading). This started with Luke Tierney's work on pnmath and pnmath0 (which used to be experimental / external packages) and has since been coming into R itself, slowly but surely.
Next, in a multicore world, and on the right operating system, you can always fork(). That is what package multicore pioneered and which package parallel now carries on.
Last but not least there is the network / RPC route with MPI used by packages like Rmpi, snow, parallel, ... and covered in HPC introductions.
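The fork() route can be seen directly in a minimal sketch like the one below. It assumes a Unix-alike system; on Windows the parallel package cannot fork, so the sketch drops to one core there. The variable names are illustrative only.

```r
library(parallel)

# Forking is only available on Unix-alikes; use one core on Windows.
n_cores <- if (.Platform$OS.type == "unix") 2L else 1L

# With two cores, each element is handled in a forked child process,
# so the process IDs reported by the workers differ from the parent's.
child_pids <- unlist(mclapply(1:2, function(i) Sys.getpid(), mc.cores = n_cores))
parent_pid <- Sys.getpid()

print(parent_pid)
print(child_pids)
```

The workers here are separate processes, not threads inside the R interpreter, which is why this route sits alongside the single-threaded internals rather than replacing them.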
Parallel computing in R: Run multiple files in same session [multi-threaded]
Running code on different files and multi-threading are different things. With lapply I can run one function on several files / data frames, but it will be done sequentially. Packages like parallel allow me to run the same work concurrently on multiple cores.
E.g.
list_of_dfs <- c("a.csv","b.csv")
lapply(list_of_dfs, read.csv) #opens and reads all CSVs from the list sequentially
vs
library(parallel)
list_of_dfs <- c("a.csv","b.csv")
mclapply(list_of_dfs, read.csv) #opens and reads all CSVs from the list at the same time
Note that in both cases the end result is the same; the second case might just be faster due to parallelization. mclapply relies on forking, so on Windows it only runs sequentially. Whether it helps depends on your use case.
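To check that claim without needing a.csv and b.csv on disk, here is a small self-contained sketch. The temporary files and their contents are made up for illustration, and the core count falls back to 1 on Windows, where forking is unavailable.

```r
library(parallel)

# Write two small CSV files to a temporary directory (illustrative data).
tmp_dir <- tempfile("csv_demo_")
dir.create(tmp_dir)
csv_files <- file.path(tmp_dir, c("a.csv", "b.csv"))
write.csv(data.frame(x = 1:3, y = c(2.5, 3.5, 4.5)), csv_files[1], row.names = FALSE)
write.csv(data.frame(x = 4:6, y = c(5.5, 6.5, 7.5)), csv_files[2], row.names = FALSE)

n_cores <- if (.Platform$OS.type == "unix") 2L else 1L

sequential <- lapply(csv_files, read.csv)                    # one file after another
forked <- mclapply(csv_files, read.csv, mc.cores = n_cores)  # files read in child processes

same_result <- identical(sequential, forked)
print(same_result)

unlink(tmp_dir, recursive = TRUE)  # clean up the temporary files
```

For files this small the forked version is unlikely to be faster, since starting child processes has its own cost; the gain tends to show up only with larger or more numerous files.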
Using packages with multi-threading in R
The solution came from Ben Barnes, thank you.
The following code works fine:
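library(parallel)
library(raster)

# Mean of all cells in one raster, ignoring NA cells
mean_function <- function(variable) {
  cellStats(variable, stat = 'mean', na.rm = TRUE)
}

cl <- makeCluster(procs, type = "PSOCK")  # socket cluster ("SOCK" is the snow name for this type)
clusterEvalQ(cl, library(raster))         # load raster on every worker
result <- parLapply(cl, a_list, mean_function)
stopCluster(cl)                           # release the workers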
Here procs is the number of worker processes you wish to use. In my case it was the same as the length of the list being passed (here called a_list), but parLapply does not require that: it splits the list across however many workers the cluster has.
a_list needs to be a list of rasters on which the cellStats function can calculate the mean.
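The same cluster pattern can be tried without the raster package. The sketch below uses plain numeric matrices as stand-ins for rasters and base mean() in place of cellStats; fake_rasters and matrix_mean are hypothetical names. The list happens to have five elements while the cluster has two workers, and parLapply distributes the elements between them.

```r
library(parallel)

# Stand-ins for rasters: five numeric matrices, one containing an NA.
set.seed(1)
fake_rasters <- lapply(1:5, function(i) matrix(runif(20), nrow = 4))
fake_rasters[[2]][1, 1] <- NA

# Plays the role of mean_function; uses base mean() instead of cellStats().
matrix_mean <- function(m) mean(m, na.rm = TRUE)

cl <- makeCluster(2, type = "PSOCK")   # two workers, five list elements
cluster_means <- tryCatch(
  parLapply(cl, fake_rasters, matrix_mean),
  finally = stopCluster(cl)            # always release the workers
)

local_means <- lapply(fake_rasters, matrix_mean)
print(all.equal(cluster_means, local_means))
```

Wrapping the call in tryCatch with a finally clause is one way to make sure the workers are stopped even if the function fails on one of the elements.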
Parallelizing / Multithreading with data.table
I got answers from the data.table developers on the data.table GitHub.
Here's a summary:
Finding the groups of the by variable is itself always parallelized, but more importantly:
If the function in j is generic (a user-defined function), there is no parallelization.
Operations in j are parallelized if the function is GForce-optimized, that is, if the expression in j contains only the functions min, max, mean, median, var, sd, sum, prod, first, last, head, tail.
So it is advisable to parallelize manually if the function in j is generic, although that does not always guarantee a speed gain.
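The difference between the two kinds of j expression can be sketched as follows. The table, the column names g and x, and the wrapper my_mean are all made up for illustration; the sketch only shows that both forms return the same result and does not measure speed.

```r
library(data.table)

set.seed(42)
DT <- data.table(g = sample(letters[1:3], 1e4, replace = TRUE), x = rnorm(1e4))

# j uses only mean(), so data.table can apply its GForce optimization.
fast <- DT[, .(m = mean(x)), by = g]

# A user-defined wrapper hides the computation from the optimizer,
# so j is evaluated group by group instead.
my_mean <- function(v) sum(v) / length(v)
slow <- DT[, .(m = my_mean(x)), by = g]

print(getDTthreads())          # number of threads data.table may use
print(all.equal(fast, slow))
```

Adding verbose = TRUE to either call should make data.table report whether the GForce optimization was applied to that particular j expression.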
==Solution==
In my case, I ran into a "vector memory exhausted" error when I plainly used DT[, var := some_function(var2)], even though my server had 1 TB of RAM and the data took up only 200 GB of memory.
I used split(DT, by='grouper') to split my data.table into chunks, and ran the job with doFuture and foreach's %dopar%. It was pretty fast.
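A minimal sketch of that split-and-%dopar% approach might look like this. It assumes the data.table, foreach, future and doFuture packages are installed. The names grouper, var2 and some_function mirror the text, but the data, the function body and the choice of a multisession plan with two workers are assumptions made for illustration.

```r
library(data.table)
library(foreach)
library(future)
library(doFuture)

registerDoFuture()                 # let %dopar% run on the future framework
plan(multisession, workers = 2)    # two background R sessions

# Small illustrative table and a placeholder for the real function.
DT <- data.table(grouper = rep(c("a", "b", "c"), each = 4), var2 = 1:12)
some_function <- function(v) v * 2

chunks <- split(DT, by = "grouper")   # one data.table per group

pieces <- foreach(chunk = chunks, .packages = "data.table") %dopar% {
  chunk <- copy(chunk)                # fresh copy inside the worker
  chunk[, var := some_function(var2)]
  chunk
}
result <- rbindlist(pieces)           # stitch the chunks back together

plan(sequential)                      # shut the background workers down
print(result)
```

Each chunk is shipped to a worker and the result shipped back, so this pattern tends to pay off only when some_function is expensive relative to the cost of moving the data.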