run a for loop in parallel in R
Thanks for your feedback. I did look up parallel after I posted this question.
Finally, after a few tries, I got it running. I have added the code below in case it is useful to others.
library(foreach)
library(doParallel)
#setup parallel backend to use many processors
cores=detectCores()
cl <- makeCluster(cores[1]-1) #not to overload your computer
registerDoParallel(cl)
finalMatrix <- foreach(i=1:150000, .combine=cbind) %dopar% {
  tempMatrix = functionThatDoesSomething() #calling a function
  #do other things if you want
  tempMatrix #Equivalent to finalMatrix = cbind(finalMatrix, tempMatrix)
}
#stop cluster
stopCluster(cl)
Note - If you allocate too many processes, you may get this error: Error in serialize(data, node$con) : error writing to connection
Note - If .combine in the foreach statement is rbind, then the final object is built by appending the output of each iteration row-wise.
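To make the difference between the two .combine choices concrete, here is a minimal sketch. The two-worker cluster and the toy loop body are assumptions for illustration only; they are not part of the original code.

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)        # two workers, chosen only for illustration
registerDoParallel(cl)

# Each iteration returns a length-3 vector (toy stand-in for tempMatrix)
by_col <- foreach(i = 1:4, .combine = cbind) %dopar% { i * c(1, 10, 100) }
by_row <- foreach(i = 1:4, .combine = rbind) %dopar% { i * c(1, 10, 100) }

stopCluster(cl)

dim(by_col)   # one column per iteration
dim(by_row)   # one row per iteration
```

With cbind each iteration becomes a column, with rbind each iteration becomes a row.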
Hope this is useful for folks trying out parallel processing in R for the first time like me.
References:
http://www.r-bloggers.com/parallel-r-loops-for-windows-and-linux/
https://beckmw.wordpress.com/2014/01/21/a-brief-foray-into-parallel-processing-with-r/
Parallel Computing for nested for loop in R
This is a pretty simple parallelization scenario.
Without a reproducible example I cannot guarantee this will work. However, this is how I would approach it: by parallelizing the outermost loop.
library(doParallel)
library(foreach)
library(fGarch)
registerDoParallel(parallel::detectCores()-2) #Or set this to whatever is reasonable for your computer/server
holder <- foreach(x=1:1000, .combine = "rbind", .packages='fGarch') %dopar% {
  end = x + 99
  thedata = dataindataframe[x:end,]
  pred <- numeric(20L)
  for (y in 1:20) {
    m = garchFit(~garch(1,1), data = thedata[,y], trace = FALSE)
    pred[y] = predict(m, 1)[,3]
  }
  return(pred)
}
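Since the code above cannot be run without the original data, here is a small self-contained sketch of the same rolling-window pattern. The data frame toy, the window length, and the use of mean() as a stand-in for garchFit() plus predict() are all assumptions made so the structure can be checked quickly.

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)        # two workers, chosen only for illustration
registerDoParallel(cl)

set.seed(1)
# Hypothetical data: 120 rows, 3 series
toy <- as.data.frame(matrix(rnorm(120 * 3), ncol = 3))

window <- 100
n_windows <- nrow(toy) - window + 1

# Outer loop over windows runs in parallel, inner loop over columns is sequential
holder <- foreach(x = seq_len(n_windows), .combine = "rbind") %dopar% {
  thedata <- toy[x:(x + window - 1), ]
  pred <- numeric(ncol(thedata))
  for (y in seq_along(pred)) {
    pred[y] <- mean(thedata[, y])   # stand-in for garchFit() + predict()
  }
  pred
}

stopCluster(cl)

dim(holder)   # one row per window, one column per series
```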
Some other resources:
- foreach vignette
- doParallel vignette
Can I use the parallel version of for loop and apply family together?
I feel like it's due to the parSapply occupied the whole cores so foreach don't have additional cores to compute. Is there any good idea to fix it? Basically I want to achieve both two processes running in their parallel versions.
Nah, that's not a good idea. You're basically trying to over-parallelize here (but that does not actually happen in your code, as explained below).
Another problem is: suppose we can only choose one process to do the parallel computation, which one should I choose? The for loop or apply family?
There is no one right answer to that. I recommend that you profile your *** process *** code to figure out how much it gains from parallelization.
So, I found your parSapply(cl, ...) on top of a foreach() %dopar% { ... } using the same cluster cl interesting. It's the first time I've seen this asked or proposed in that way. You don't want to do this, for sure, but the question/attempt is not crazy. Your intuition that all workers would be occupied when foreach() %dopar% { ... } attempts to use them is partly correct. However, what is really happening is that the foreach() %dopar% { ... } statement is evaluated on the workers, not in the main R session where the cluster cl was defined. On the workers, there are no foreach adaptors registered, so those calls default to sequential processing (equivalent to foreach::registerDoSEQ()). To achieve nested parallelization, you'd have to set up and register a cluster within each worker, e.g. inside the myfunction() function.
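The point about workers having no foreach adaptor registered can be checked directly. This is a small sketch, assuming a two-worker local cluster; it only asks each session how many foreach workers it can see.

```r
library(foreach)
library(doParallel)   # also attaches 'parallel'

cl <- makeCluster(2)  # two local workers, chosen only for illustration
registerDoParallel(cl)

# In the main session the doParallel backend is registered
main_workers <- getDoParWorkers()

# On each worker nothing is registered, so foreach reports a single worker
on_workers <- parSapply(cl, 1:2, function(i) foreach::getDoParWorkers())

stopCluster(cl)

main_workers
on_workers
```

Any foreach() %dopar% { ... } evaluated on those workers therefore runs sequentially.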
As the author of the future framework, I'd like to propose that you make use of it. It'll protect you against the above mistakes, and it will not over-parallelize either (you can do that if you really, really want to). Here is how I would rewrite your code example:
library(foreach) ## foreach() and %dopar%
myfunction <- function(data) {
  df <- foreach(i = 1:200, .combine = "rbind") %:%
    foreach(j = 1:200, .combine = "rbind") %dopar% {
      *****
      process
      *****
    }
  data <- df[1,1]
  return(data)
}
## Tell foreach to parallelize via the future framework
doFuture::registerDoFuture()
## Have the future framework parallelize using a cluster of
## local workers (similar to makeCluster(detectCores()))
library(future)
plan(multisession)
library(future.apply) ## future_sapply()
system.time({
  mat <- t(future_sapply(list, myfunction))
})
Now, what is important to understand is that the outer future_sapply() parallelization will operate on the 'multisession' cluster. When you get to the inner foreach() %dopar% { ... } parallelization, all that foreach sees is a sequential worker, so that inner layer will be processed sequentially. This is what I mean when I say that the future framework automatically protects you from over-parallelization.
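A quick way to see this protection is to ask how many workers are available at each level. The sketch below assumes two outer workers, purely for illustration.

```r
library(future)
library(future.apply)

plan(multisession, workers = 2)   # two outer workers, chosen only for illustration

outer_workers <- nbrOfWorkers()   # what the main session sees

# Inside each outer worker, the remaining plan is sequential
inner_workers <- future_sapply(1:2, function(i) future::nbrOfWorkers())

plan(sequential)                  # shut the background workers down

outer_workers
inner_workers
```

The inner level reports a single worker, so anything parallelized there via the future framework runs sequentially unless a nested plan is set explicitly.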
If you'd like to have the inner layer parallelize on a 'multisession' cluster and the outer to be sequential, you can set that up as:
plan(list(sequential, multisession))
If you really want to do nested parallelization, say, two outer-level workers and 4 inner-level workers, you can use:
plan(list(tweak(multisession, workers = 2), tweak(multisession, workers = 4)))
This will run 2*4 = 8 parallel R processes at the same time.
What is more useful is when you have multiple machines available, then you can use those for the outer level and then use a multisession cluster on each of them. Something like:
plan(list(tweak(cluster, workers = c("machine1", "machine2")), multisession))
You can read more about this in future vignettes.
How to parallelize for loops in R using multiple cores?
You're not doing anything wrong, just that the operations you're running don't take enough time to make the parallel execution economical. Here's a snippet from the foreach vignette:
Running many tiny tasks in parallel will usually take more time to execute than running them sequentially, and if it already runs fast, there’s no motivation to make it run faster anyway. But if the operation that we’re executing in parallel takes a minute or longer, there starts to be some motivation.
You can see the benefits of parallel execution if we run sqrt not 500,000 times, but 50,000,000 times.
library(tictoc)
library(foreach)
library(doParallel)
registerDoParallel(16)
tic("no_parallel")
for (subject in 1:400){
  for (number in 1:50000000){
    sqrt(number)
  }
}
toc()
#> no_parallel: 271.312 sec elapsed
tic("parallel")
foreach (subject=1:400) %dopar% {
  for (number in 1:50000000){
    sqrt(number)
  }
}
toc()
#> parallel: 65.654 sec elapsed
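When the individual operations are tiny, one common way to keep the per-task overhead down is to hand each worker a whole chunk of the work instead of one element at a time. The following is only a sketch of that idea, assuming two workers and a toy sqrt workload; it is not taken from the vignette.

```r
library(foreach)
library(doParallel)   # also attaches 'parallel'

cl <- makeCluster(2)  # two workers, chosen only for illustration
registerDoParallel(cl)

n <- 1e5
# One contiguous block of indices per worker
chunks <- splitIndices(n, 2)

# Each task processes a full chunk with vectorized code
res <- foreach(idx = chunks, .combine = c) %dopar% {
  sqrt(idx)
}

stopCluster(cl)

length(res)
```

Here only two tasks are sent to the workers, rather than 100,000 separate ones.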
How can I make this code run in parallel ? For loop
My go-to is the future.apply package.
library(future.apply)
plan(multisession)
nested_inter$model = future_Map(nb_thesis_inter,
nested_nb$data,
nested_nb$model)
Two things to note.
- plan(multisession) allows Windows to be used in parallel. See ?plan for all options.
- I did not install all of the packages because the example was not reproducible. The future_Map call may need to be changed to future_Map(function (x, y) nb_thesis_inter(df = x, mdl = y), ...) depending on the default argument order of nb_thesis_inter.
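To illustrate the explicit argument mapping mentioned in the second note, here is a small sketch. The function scale_by and the two input lists are hypothetical stand-ins for nb_thesis_inter and the nested columns, which I do not have.

```r
library(future.apply)

plan(multisession, workers = 2)   # two workers, chosen only for illustration

# Hypothetical stand-in for nb_thesis_inter(df, mdl)
scale_by <- function(df, mdl) df$x * mdl

data_list <- list(data.frame(x = 1:2), data.frame(x = 3:4))
mdl_list  <- list(10, 100)

# The wrapper makes the mapping of list elements to named arguments explicit
out <- future_Map(function(x, y) scale_by(df = x, mdl = y),
                  data_list, mdl_list)

plan(sequential)                  # shut the background workers down

out
```

Naming the arguments in the wrapper avoids relying on their positional order in the real function.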
Parallel computation in R for saving data over loops
Because I don't have your data, I made a small dummy sample.
The packages I used:
library(tidyverse)
library(openxlsx)
library(foreach)
library(doParallel)
This part is from your code; I didn't change anything.
TYPE1 <- rawdata %>% filter(TYPE == "A")
TYPE2 <- rawdata %>% filter(TYPE == "B")
Split.TYPE1 <- split(TYPE1, TYPE1$Name)
Split.TYPE2 <- split(TYPE2, TYPE2$Name)
Define the parallel backend. I'm using 6 cores here.
cl <- makeCluster(6)
registerDoParallel(cl)
This is your first loop. Don't forget to add .packages = "openxlsx". This makes sure the package is also loaded on each worker. I changed the code a little, because the for-loop form nm in names(Split.TYPE1) doesn't work for foreach. Maybe there is an easier solution, but I don't know it.
foreach(nm = 1:length(Split.TYPE1), .combine = cbind, .packages = "openxlsx") %dopar% {
  file <- paste0(names(Split.TYPE1)[nm], ".xlsx")
  d1 <- as.data.frame(Split.TYPE1[[names(Split.TYPE1)[nm]]])
  wb <- createWorkbook(file)
  addWorksheet(wb, "test", gridLines = TRUE)
  writeData(wb, sheet = "test", x = d1)
  saveWorkbook(wb, file, overwrite = TRUE)
}
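As a possible simplification, foreach also accepts a character vector as the iteration variable when it is written with = instead of in, so the names can be looped over directly. The sketch below assumes a toy split of the built-in mtcars data and writes CSV files to a temporary directory as a stand-in for the openxlsx calls.

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)        # two workers, chosen only for illustration
registerDoParallel(cl)

groups  <- split(mtcars, mtcars$cyl)   # toy stand-in for Split.TYPE1
out_dir <- tempdir()

# Iterate over the names themselves rather than over an index
files <- foreach(nm = names(groups), .combine = c) %dopar% {
  f <- file.path(out_dir, paste0("cyl_", nm, ".csv"))
  write.csv(groups[[nm]], f, row.names = FALSE)   # stand-in for saveWorkbook()
  f
}

stopCluster(cl)

basename(files)
```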
The second loop. I only used it once in the past and it worked quite well for me. This is how you can make a nested foreach loop. More info here.
foreach(dn = 1:length(Split.TYPE2)) %:%
  foreach(fn = 1:length(unique(Split.TYPE2[[names(Split.TYPE2)[dn]]]$Surname)), .packages = "openxlsx") %dopar% {
    dnn <- paste0(names(Split.TYPE2)[dn])
    dir.create(dnn)
    sub_Split.TYPE2 <- split(Split.TYPE2[[names(Split.TYPE2)[dn]]], Split.TYPE2[[names(Split.TYPE2)[dn]]]$Surname)
    file <- file.path(dnn, paste0(names(sub_Split.TYPE2)[fn],".xlsx"))
    d1 <- as.data.frame(sub_Split.TYPE2[[fn]])
    wb <- createWorkbook(file)
    addWorksheet(wb, "test", gridLines = T)
    writeData(wb, sheet = "test", x = d1)
    saveWorkbook(wb, file, overwrite = TRUE)
  }
And stop the parallel backend.
stopCluster(cl)
Using your data I get the following folder/file structure for the nested loop:
- Alan
  - Glass.xlsx
- Heather
  - Poker.xlsx
- Rose
  - beer.xlsx
- Sam
  - Mac.xlsx
- Tara
  - tea.xlsx
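For anyone new to the %:% operator used in the nested loop, here is a minimal sketch of how it combines two foreach levels into one stream of tasks. The toy body i * j and the two-worker cluster are assumptions for illustration.

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)        # two workers, chosen only for illustration
registerDoParallel(cl)

# Inner results are combined into a vector, outer results are stacked as rows
m <- foreach(i = 1:3, .combine = rbind) %:%
  foreach(j = 1:4, .combine = c) %dopar% {
    i * j
  }

stopCluster(cl)

dim(m)
```

The result has one row per outer iteration and one column per inner iteration, the same values as outer(1:3, 1:4).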
How to parallelize a for loop that is looping over a vector in R
Here's a foreach example:
library(foreach)
library(doParallel)
registerDoParallel(cores = 6)
output <- foreach(x = myvec) %dopar% {floor(x)^2 + exp(x)^2/2}
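Here is a runnable variant of the same idea. The values in myvec, the two-worker cluster, and the use of .combine = c (to get a vector back instead of a list) are assumptions added for illustration.

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)        # two workers, chosen only for illustration
registerDoParallel(cl)

myvec <- c(0.5, 1.2, 2.7)   # hypothetical input vector

# .combine = c collects the results into a numeric vector
output <- foreach(x = myvec, .combine = c) %dopar% {
  floor(x)^2 + exp(x)^2 / 2
}

stopCluster(cl)

output
```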