Run a for Loop in Parallel in R

run a for loop in parallel in R

Thanks for your feedback. I did look up parallel after I posted this question.

Finally, after a few tries, I got it running. I have added the code below in case it is useful to others:

library(foreach)
library(doParallel)

# set up parallel backend to use many processors
cores <- detectCores()
cl <- makeCluster(cores[1] - 1) # not to overload your computer
registerDoParallel(cl)

finalMatrix <- foreach(i = 1:150000, .combine = cbind) %dopar% {
  tempMatrix <- functionThatDoesSomething() # calling a function
  # do other things if you want

  tempMatrix # equivalent to finalMatrix <- cbind(finalMatrix, tempMatrix)
}

# stop cluster
stopCluster(cl)

Note - if you allocate too many processes, you may get this error: Error in serialize(data, node$con) : error writing to connection

Note - if .combine in the foreach statement is rbind, the final object is built by appending the output of each iteration row-wise instead of column-wise.
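As a toy illustration (my addition, not part of the original answer; the sequential %do% operator is used so it runs without a registered backend), each iteration below returns a length-3 vector:

rows <- foreach(i = 1:3, .combine = rbind) %do% c(i, i^2, i^3)  # 3 x 3 matrix, one row per iteration
cols <- foreach(i = 1:3, .combine = cbind) %do% c(i, i^2, i^3)  # 3 x 3 matrix, one column per iteration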

Hope this is useful for folks trying out parallel processing in R for the first time like me.

References:
http://www.r-bloggers.com/parallel-r-loops-for-windows-and-linux/
https://beckmw.wordpress.com/2014/01/21/a-brief-foray-into-parallel-processing-with-r/

Parallel Computing for nested for loop in R

This is a pretty simple parallelization scenario.

Without a reproducible example I cannot guarantee this will work. However, this is how I would approach it, that is, parallelizing the outermost loop.

library(doParallel)
library(foreach)
library(fGarch)

registerDoParallel(parallel::detectCores() - 2) # or set this to whatever is reasonable for your computer/server

holder <- foreach(x = 1:1000, .combine = "rbind", .packages = "fGarch") %dopar% {
  end <- x + 99
  thedata <- dataindataframe[x:end, ]

  pred <- numeric(20L)
  for (y in 1:20) {
    m <- garchFit(~ garch(1, 1), data = thedata[, y], trace = FALSE)
    pred[y] <- predict(m, 1)[, 3]
  }
  return(pred)
}

Some other resources:

  • foreach vignette
  • doParallel vignette

Can I use the parallel version of for loop and apply family together?

I feel like it's because parSapply occupies all the cores, so foreach doesn't have additional cores to compute on. Is there any good way to fix this? Basically, I want both processes to run in their parallel versions.

Nah, that's not a good idea. You're basically trying to over-parallelize here (but that doesn't actually happen in your code, as explained below).

Another problem is: suppose we can only choose one process to do the parallel computation; which one should I choose, the for loop or the apply family?

There is no one right answer to that. I recommend that you profile your *** process *** code to figure out how much it gains from parallelization.
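For instance, a minimal timing sketch (my addition; run_process_once() is a hypothetical stand-in for your *** process *** code, and the 200 x 200 grid simply matches the loops in your example):

## Hypothetical stand-in for one unit of the *** process *** work
run_process_once <- function(i, j) sqrt(i * j)

## Rough sequential timing of the inner double loop
system.time({
  for (i in 1:200) {
    for (j in 1:200) {
      run_process_once(i, j)
    }
  }
})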

So, I found your parSapply(cl, ...) on top of a foreach() %dopar% { ... } using the same cluster cl interesting. It's the first time I've seen this asked/proposed in that way. You don't want to do this, for sure, but the question/attempt is not crazy. Your intuition that all workers would be occupied when foreach() %dopar% { ... } attempts to use them is partly correct. However, what is really happening is that the foreach() %dopar% { ... } statement is evaluated in the workers, not in the main R session where the cluster cl was defined. On the workers, there are no foreach adaptors registered, so those calls will default to sequential processing (== foreach::registerDoSEQ()). To achieve nested parallelization, you'd have to set up and register a cluster within each worker, e.g. inside the myfunction() function.
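To make that concrete, here is a minimal sketch of what such manual nesting could look like (my addition, not the approach recommended below; it assumes foreach and doParallel are loaded on the outer workers, e.g. via clusterEvalQ(cl, { library(foreach); library(doParallel) })):

myfunction <- function(data) {
  cl_inner <- parallel::makeCluster(2)          # illustrative: 2 inner workers per outer worker
  doParallel::registerDoParallel(cl_inner)
  on.exit(parallel::stopCluster(cl_inner), add = TRUE)

  df <- foreach(i = 1:200, .combine = "rbind") %:%
    foreach(j = 1:200, .combine = "rbind") %dopar% {
      ## *** process *** (placeholder from the question)
    }

  df[1, 1]
}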

As the author of the future framework, I'd like to propose you make use of that. It'll protect you against the above mistakes and it will not over-parallelize either (you can do that if you really, really want to). Here is how I would rewrite your code example:

library(foreach) ## foreach() and %dopar%

myfunction <- function(data) {
  df <- foreach(i = 1:200, .combine = "rbind") %:%
    foreach(j = 1:200, .combine = "rbind") %dopar% {
      ## *** process ***
    }

  data <- df[1, 1]
  return(data)
}


## Tell foreach to parallelize via the future framework
doFuture::registerDoFuture()

## Have the future framework parallelize using a cluster of
## local workers (similar to makeCluster(detectCores()))
library(future)
plan(multisession)

library(future.apply) ## future_sapply()

system.time({
  mat <- t(future_sapply(list, myfunction))
})

Now, what is important to understand is that the outer future_sapply() parallelization will operate on the 'multisession' cluster. When you get to the inner foreach() %dopar% { ... } parallelization, all that foreach sees is a sequential worker, so that inner layer will be processed sequentially. This is what I mean when I say the future framework automatically protects you from over-parallelization.
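A quick way to see this (a small check I'm adding here, not part of the original answer; it assumes the plan(multisession) setup above):

library(future)
plan(multisession)

nbrOfWorkers()                          # number of parallel workers at the outer level
value(future(future::nbrOfWorkers()))   # evaluated inside a worker: 1, i.e. sequential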

If you'd like to have the inner layer parallelize on a 'multisession' cluster and the outer to be sequential, you can set that up as:

plan(list(sequential, multisession))

If you really want to do nested parallelization, say, two outer-level workers and four inner-level workers, you can use:

plan(list(tweak(multisession, workers = 2), tweak(multisession, workers = 4)))

This will run 2*4 = 8 parallel R processes at the same time.

What is more useful is when you have multiple machines available; you can use those for the outer level and then a multisession cluster on each of them. Something like:

plan(list(tweak(cluster, workers = c("machine1", "machine2")), multisession))

You can read more about this in the future vignettes.

How to parallelize for loops in R using multiple cores?

You're not doing anything wrong; it's just that the operations you're running don't take enough time to make parallel execution economical. Here's a snippet from the foreach vignette:

Running many tiny tasks in parallel will usually take more time to execute than running them sequentially, and if it already runs fast, there’s no motivation to make it run faster anyway. But if the operation that we’re executing in parallel takes a minute or longer, there starts to be some motivation.

You can see the benefits of parallel execution if sqrt is run not 500,000 times, but 50,000,000 times per subject.

library(tictoc)
library(foreach)
library(doParallel)
registerDoParallel(16)

tic("no_parallel")

for (subject in 1:400) {
  for (number in 1:50000000) {
    sqrt(number)
  }
}

toc()
#> no_parallel: 271.312 sec elapsed

tic("parallel")

foreach(subject = 1:400) %dopar% {
  for (number in 1:50000000) {
    sqrt(number)
  }
}

toc()
#> parallel: 65.654 sec elapsed
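One optional follow-up (my addition, not from the original answer): on Windows, registerDoParallel(16) creates an implicit cluster behind the scenes, which can be cleaned up when you are done; on Unix-alikes it forks instead and the call below is a harmless no-op:

## Stop the implicit cluster, if one was created by registerDoParallel(16)
doParallel::stopImplicitCluster()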

How can I make this code run in parallel? For loop

My go-to is the future.apply package.

library(future.apply)
plan(multisession)

nested_inter$model <- future_Map(nb_thesis_inter,
                                 nested_nb$data,
                                 nested_nb$model)

Two things to note.

  1. plan(multisession) allows parallel execution on Windows as well. See ?plan for all options.
  2. I did not install all of the packages because the example was not reproducible. The future_Map call may need to be changed to future_Map(function(x, y) nb_thesis_inter(df = x, mdl = y), ...) depending on the default argument order of nb_thesis_inter, as sketched below.
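For reference, a hedged sketch of that wrapped call (df and mdl are assumed argument names of nb_thesis_inter, carried over from the note above):

nested_inter$model <- future_Map(
  function(x, y) nb_thesis_inter(df = x, mdl = y),  # argument names assumed
  nested_nb$data,
  nested_nb$model
)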

Parallel computation in R for saving data over loops

Because I don't have your data, I made a small dummy sample.

The packages I used:

library(tidyverse)
library(openxlsx)
library(foreach)
library(doParallel)

This part is from your code; I didn't change anything here.

TYPE1 <- rawdata %>% filter(TYPE == "A") 
TYPE2 <- rawdata %>% filter(TYPE == "B")

Split.TYPE1 <- split(TYPE1, TYPE1$Name)
Split.TYPE2 <- split(TYPE2, TYPE2$Name)

Define the parallel backend. I'm using 6 cores here.

cl <- makeCluster(6)
registerDoParallel(cl)

This is your first loop. Don't forget to add .packages = "openxlsx"; this makes sure the package also gets sent to the workers. I changed the code a little, because nm in names(Split.TYPE1) doesn't work for foreach. Maybe there is an easier solution, but I don't know it.

foreach(nm = 1:length(Split.TYPE1), .combine = cbind, .packages = "openxlsx") %dopar% {
  file <- paste0(names(Split.TYPE1)[nm], ".xlsx")
  d1 <- as.data.frame(Split.TYPE1[[names(Split.TYPE1)[nm]]])

  wb <- createWorkbook(file)
  addWorksheet(wb, "test", gridLines = TRUE)
  writeData(wb, sheet = "test", x = d1)
  saveWorkbook(wb, file, overwrite = TRUE)
}

The second loop. I have only used this approach once in the past, and it worked quite well for me. This is how you can make a nested foreach loop. More info here.

foreach(dn = 1:length(Split.TYPE2)) %:%
  foreach(fn = 1:length(unique(Split.TYPE2[[names(Split.TYPE2)[dn]]]$Surname)), .packages = "openxlsx") %dopar% {
    dnn <- paste0(names(Split.TYPE2)[dn])
    dir.create(dnn)
    sub_Split.TYPE2 <- split(Split.TYPE2[[names(Split.TYPE2)[dn]]], Split.TYPE2[[names(Split.TYPE2)[dn]]]$Surname)

    file <- file.path(dnn, paste0(names(sub_Split.TYPE2)[fn], ".xlsx"))

    d1 <- as.data.frame(sub_Split.TYPE2[[fn]])
    wb <- createWorkbook(file)
    addWorksheet(wb, "test", gridLines = TRUE)
    writeData(wb, sheet = "test", x = d1)
    saveWorkbook(wb, file, overwrite = TRUE)
  }

And stop the parallel backend.

stopCluster(cl)

Using your data I get the following folder/file structure for the nested loop:

- Alan
  - Glass.xlsx
- Heather
  - Poker.xlsx
- Rose
  - beer.xlsx
- Sam
  - Mac.xlsx
- Tara
  - tea.xlsx

How to parallelize a for loop that is looping over a vector in R

Here's a foreach example:

library(foreach)
library(doParallel)

registerDoParallel(cores = 6)
output <- foreach(x = myvec) %dopar% {floor(x)^2 + exp(x)^2/2}
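Here myvec is the vector from the original question; for a self-contained test you could substitute something like the illustrative vector below (reusing the backend registered above):

## Illustrative input only; replace with your own vector
myvec <- runif(1e5, min = 0, max = 10)
output <- foreach(x = myvec) %dopar% { floor(x)^2 + exp(x)^2 / 2 }
head(unlist(output))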

