Parallel while loop in R
You can use futures with any looping construct in R, including for() and while() loops. Futures are implemented in the future package (I'm the author). It is not clear what condition you want to use for your while loop; knowing it would have helped give a more precise answer. Here is an example where the while condition does not depend on any of the results calculated:
library("listenv")
library("future")
plan(multisession, workers = 2L)  ## 'multiprocess' is deprecated in current 'future'

res <- listenv()
ii <- 1L
while (ii <= 5) {
  ## This is evaluated in parallel and will only block
  ## if all workers are busy.
  res[[ii]] %<-% {
    list(iteration = ii, pid = Sys.getpid())
  }
  ii <- ii + 1L
}
## Resolve all futures (blocks if not already finished)
res <- as.list(res)
str(res)
List of 5
 $ :List of 2
  ..$ iteration: int 1
  ..$ pid      : int 15606
 $ :List of 2
  ..$ iteration: int 2
  ..$ pid      : int 15608
 $ :List of 2
  ..$ iteration: int 3
  ..$ pid      : int 15609
 $ :List of 2
  ..$ iteration: int 4
  ..$ pid      : int 15610
 $ :List of 2
  ..$ iteration: int 5
  ..$ pid      : int 15611
If you want your while condition to depend on the outcome of one or more of the parallel results (in res[[ii]]), then things become much more complicated, because you need to decide what should happen when a future (= parallel task) is not yet resolved: should you check back, say, five iterations later, or should you wait? That is a design decision for your parallel algorithm, not an implementation question. Anything is possible.
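As a minimal sketch of that result-dependent case, the future package's resolved() gives a non-blocking check for whether a future is done, and value() collects a finished result. The task below is a toy stand-in (it just returns its own index), and plan(sequential) is used so the sketch runs anywhere; swap in plan(multisession) for actual parallelism, in which case the polling behavior becomes meaningful.

```r
library(future)
plan(sequential)  # swap in plan(multisession, workers = 2L) for real parallelism

fs <- list()   # futures launched so far
total <- 0     # running total of results collected so far
ii <- 1L
## Keep launching tasks until the collected results pass a threshold.
while (total < 10) {
  idx <- ii
  fs[[ii]] <- future({ idx })  # toy task: just return its own index
  ## Poll: which futures are already resolved? (non-blocking check)
  done <- vapply(fs, resolved, logical(1L))
  ## value() only touches futures we already know are resolved
  total <- sum(vapply(fs[done], value, numeric(1L)))
  ii <- ii + 1L
}
total  # 10 under plan(sequential): 1 + 2 + 3 + 4
```

Under a parallel plan, some futures will still be unresolved when the condition is re-checked, which is exactly the design question above: this sketch simply ignores unfinished tasks and keeps launching new ones.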
PS. I don't understand the downvotes this question received. It may be because the question is poorly phrased / not very clear - remember to "help the helper help you" by being as precise as possible. If the downvotes are because of the while loop itself, I (obviously) beg to disagree.
How to parallelize while loops?
Actually, you don't need to parallelize the while loop. You can vectorize your operations over x like below:
iter <- 1000
myvec <- c()
while (is.null(myvec) || nrow(myvec) <= iter) {
  x <- matrix(rnorm(iter * 10, mean = 0, sd = 1), ncol = 10)
  myvec <- rbind(myvec, subset(x, rowSums(x) > 2.5))
}
myvec <- head(myvec, iter)
or
iter <- 1000
myvec <- list()
nl <- 0
while (nl < iter) {
  x <- matrix(rnorm(iter * 10, mean = 0, sd = 1), ncol = 10)
  v <- subset(x, rowSums(x) > 2.5)
  nl <- nl + nrow(v)
  myvec[[length(myvec) + 1]] <- v
}
myvec <- head(do.call(rbind, myvec), iter)
which would be much faster even if you have a large iter, I believe.
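To put a rough number on that claim, here is a self-contained comparison sketch: a row-at-a-time loop (the pattern the question used, growing the result with rbind()) against the batched version above. Exact timings vary by machine, so no figures are claimed; the point is that the batched version does far fewer R-level iterations.

```r
set.seed(42)
iter <- 1000

## Row-at-a-time: one candidate draw per iteration, rbind() to grow
t_row <- system.time({
  m1 <- NULL
  while (is.null(m1) || nrow(m1) < iter) {
    x <- matrix(rnorm(10, mean = 0, sd = 1), ncol = 10)
    if (rowSums(x) > 2.5) m1 <- rbind(m1, x)
  }
})

## Batched: draw iter candidate rows at a time, filter in bulk
t_batch <- system.time({
  chunks <- list()
  nl <- 0
  while (nl < iter) {
    x <- matrix(rnorm(iter * 10, mean = 0, sd = 1), ncol = 10)
    v <- subset(x, rowSums(x) > 2.5)
    nl <- nl + nrow(v)
    chunks[[length(chunks) + 1]] <- v
  }
  m2 <- head(do.call(rbind, chunks), iter)
})

t_row["elapsed"]
t_batch["elapsed"]
```

Both versions end with exactly iter accepted rows, all with row sums above 2.5, so the outputs are comparable.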
run a for loop in parallel in R
Thanks for your feedback. I did look up parallel after I posted this question. Finally, after a few tries, I got it running. I have added the code below in case it is useful to others.
library(foreach)
library(doParallel)

# set up parallel backend to use many processors
cores <- detectCores()
cl <- makeCluster(cores[1] - 1)  # leave one core free so as not to overload your computer
registerDoParallel(cl)

finalMatrix <- foreach(i = 1:150000, .combine = cbind) %dopar% {
  tempMatrix <- functionThatDoesSomething()  # calling a function
  # do other things if you want
  tempMatrix  # equivalent to finalMatrix <- cbind(finalMatrix, tempMatrix)
}

# stop cluster
stopCluster(cl)
Note - if the user allocates too many processes, they may get this error: Error in serialize(data, node$con) : error writing to connection.
Note - if .combine in the foreach statement is rbind, then the final object is created by appending the output of each iteration row-wise.
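A tiny illustration of how .combine shapes the result, using the sequential %do% operator so no cluster is needed to experiment with it:

```r
library(foreach)

## Each iteration returns the length-2 vector c(i, i^2)
m_cbind <- foreach(i = 1:3, .combine = cbind) %do% c(i, i^2)
m_rbind <- foreach(i = 1:3, .combine = rbind) %do% c(i, i^2)

dim(m_cbind)  # 2 3 -> each iteration becomes a column
dim(m_rbind)  # 3 2 -> each iteration becomes a row
```

With no .combine at all, foreach returns a list with one element per iteration, which is often the safest default when the per-iteration results are not uniform.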
Hope this is useful for folks trying out parallel processing in R for the first time like me.
References:
http://www.r-bloggers.com/parallel-r-loops-for-windows-and-linux/
https://beckmw.wordpress.com/2014/01/21/a-brief-foray-into-parallel-processing-with-r/
How to speed up a while-loop in R (perhaps using dopar)?
Thank you @Bas! I tested your suggestion on a Linux machine: for a file with ~239 million lines it took less than 1 min. By adding > lines.txt I could save the results. Interestingly, my first readLines R script needed "only" 29 min, which was surprisingly fast compared with my first experience (so I might have had some problem with my Windows computer at work that was not related to R).
How to parallelize for loops in R using multiple cores?
You're not doing anything wrong, just that the operations you're running don't take enough time to make the parallel execution economical. Here's a snippet from the foreach vignette:
Running many tiny tasks in parallel will usually take more time to execute than running them sequentially, and if it already runs fast, there’s no motivation to make it run faster anyway. But if the operation that we’re executing in parallel takes a minute or longer, there starts to be some motivation.
You can see the benefits of parallel execution if we run sqrt not 500,000 times, but 50,000,000 times.
library(tictoc)
library(foreach)
library(doParallel)

registerDoParallel(16)

tic("no_parallel")
for (subject in 1:400) {
  for (number in 1:50000000) {
    sqrt(number)
  }
}
toc()
#> no_parallel: 271.312 sec elapsed

tic("parallel")
foreach(subject = 1:400) %dopar% {
  for (number in 1:50000000) {
    sqrt(number)
  }
}
toc()
#> parallel: 65.654 sec elapsed
Why does this for loop not work in parallel?
Firstly: the loop does work in parallel; we just don't see the print. The returned NULL is not the result of the print function; it is the list of function return values from the parallel calls. Instead of print, collect the values and return them. Printing to an external file would also work, but I suggest starting with the ordinary way first, as parLapply collates the return values in a convenient way.
As an example how to use return values, try the following:
library(parallel)

rows <- seq(1, 9, 1)

for_test <- function(i) {
  txt <- NULL
  for (s in 1:3) {
    txt <- rbind(txt, c(i, s, i * s))
  }
  txt
}

cls <- makeCluster(length(rows))
parLapply(cls, rows, for_test)
stopCluster(cls)
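parLapply returns one list element per element of rows; a common follow-up is stacking them into one matrix with do.call(rbind, ...). The list has the same shape as with plain lapply, so this sketch shows the combining step without needing a cluster:

```r
rows <- seq(1, 9, 1)

for_test <- function(i) {
  txt <- NULL
  for (s in 1:3) {
    txt <- rbind(txt, c(i, s, i * s))
  }
  txt
}

## lapply returns the same list-of-matrices shape as parLapply above
res <- lapply(rows, for_test)
combined <- do.call(rbind, res)
dim(combined)  # 27 3: 9 inputs x 3 rows each
```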
Explanation:
In the OP's first example, print is inside a for loop, while in the second version it is the last statement of the function. print returns a value, while for() returns NULL.
Demo:
> x <- print(2)
[1] 2
> x
[1] 2
>
> x <- for (i in 1:2) print(2 * i)
[1] 2
[1] 4
> x
NULL
>
Parallel computation in R for saving data over loops
Because I don't have your data, I made a small dummy sample.
The packages I used:
library(tidyverse)
library(openxlsx)
library(foreach)
library(doParallel)
This part is from you; I didn't change anything.
TYPE1 <- rawdata %>% filter(TYPE == "A")
TYPE2 <- rawdata %>% filter(TYPE == "B")
Split.TYPE1 <- split(TYPE1, TYPE1$Name)
Split.TYPE2 <- split(TYPE2, TYPE2$Name)
Define the parallel backend. I'm using 6 cores here.
cl <- makeCluster(6)
registerDoParallel(cl)
This is your first loop. Don't forget to add .packages = "openxlsx"; this makes sure the package also gets sent to the workers. I changed the code a little, because nm in names(Split.TYPE1) doesn't work for foreach here. Maybe there is an easier solution, but I don't know it.
foreach(nm = 1:length(Split.TYPE1), .combine = cbind, .packages = "openxlsx") %dopar% {
  file <- paste0(names(Split.TYPE1)[nm], ".xlsx")
  d1 <- as.data.frame(Split.TYPE1[[names(Split.TYPE1)[nm]]])
  wb <- createWorkbook(file)
  addWorksheet(wb, "test", gridLines = TRUE)
  writeData(wb, sheet = "test", x = d1)
  saveWorkbook(wb, file, overwrite = TRUE)
}
The second loop. I have only used this construct once in the past, and it worked quite well for me. This is how you can make a nested foreach loop. More info here.
foreach(dn = 1:length(Split.TYPE2)) %:%
  foreach(fn = 1:length(unique(Split.TYPE2[[names(Split.TYPE2)[dn]]]$Surname)), .packages = "openxlsx") %dopar% {
    dnn <- paste0(names(Split.TYPE2)[dn])
    dir.create(dnn, showWarnings = FALSE)  # don't warn if the folder already exists
    sub_Split.TYPE2 <- split(Split.TYPE2[[names(Split.TYPE2)[dn]]], Split.TYPE2[[names(Split.TYPE2)[dn]]]$Surname)
    file <- file.path(dnn, paste0(names(sub_Split.TYPE2)[fn], ".xlsx"))
    d1 <- as.data.frame(sub_Split.TYPE2[[fn]])
    wb <- createWorkbook(file)
    addWorksheet(wb, "test", gridLines = TRUE)
    writeData(wb, sheet = "test", x = d1)
    saveWorkbook(wb, file, overwrite = TRUE)
  }
And stop the parallel backend.
stopCluster(cl)
Using your data I get the following folder/file structure for the nested loop:
- Alan
- Glass.xlsx
- Heather
- Poker.xlsx
- Rose
- beer.xlsx
- Sam
- Mac.xlsx
- Tara
- tea.xlsx
Parallelizing a loop with updating during each iteration
Here is a solution that uses doParallel (with the options for UNIX systems, but you can also use it on Windows; see here) and foreach, stores the results for every state separately, and afterwards reads in the single files and combines them into a list.
library(doParallel)
library(foreach)

path_results <- "my_path"
ncpus <- 8L
registerDoParallel(cores = ncpus)

states <- c("AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI",
            "ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI",
            "MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC",
            "ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT",
            "VT","VA","WA","WV","WI","WY","DC","PR")

results <- foreach(state = states) %dopar% {
  county <- census_geo_api(key = "XXX", state = state, geo = "county", age = TRUE, sex = TRUE)
  tract <- census_geo_api(key = "XXX", state = state, geo = "tract", age = TRUE, sex = TRUE)
  block <- census_geo_api(key = "XXX", state = state, geo = "block", age = TRUE, sex = TRUE)
  results <- list(state = state, age = TRUE, sex = TRUE, block = block, tract = tract, county = county)
  # store the results as rds
  saveRDS(results, file = paste0(path_results, "/", state, ".Rds"))
  # remove the results and free memory
  rm(county, tract, block, results)
  gc()
  # just return a string
  paste0("done with ", state)
}
library(purrr)

# combine the results into a named list; build the file names from the
# states vector itself so that names and files stay aligned (list.files()
# returns files in alphabetical order, which need not match the order of states)
CensusObj_block_age_sex <- set_names(states) %>%
  map(~ readRDS(file = paste0(path_results, "/", .x, ".Rds")))