Parallel While Loop in R

Parallel while loop in R

You can use futures with any looping R construct including for() and while() loops. Futures are implemented in the future package (I'm the author).

It is not clear what condition you want to use for your while loop; knowing it would have helped me give a more precise answer. Here is an example where the while condition does not depend on any of the computed results:

library("listenv")
library("future")
plan(multiprocess, workers = 2L)

res <- listenv()

ii <- 1L
while (ii <= 5) {
## This is evaluated in parallel and will only block
## if all workers are busy.
res[[ii]] %<-% {
list(iteration = ii, pid = Sys.getpid())
}
ii <- ii + 1L
}

## Resolve all futures (blocks if not already finished)
res <- as.list(res)

str(res)
List of 5
 $ :List of 2
  ..$ iteration: int 1
  ..$ pid      : int 15606
 $ :List of 2
  ..$ iteration: int 2
  ..$ pid      : int 15608
 $ :List of 2
  ..$ iteration: int 3
  ..$ pid      : int 15609
 $ :List of 2
  ..$ iteration: int 4
  ..$ pid      : int 15610
 $ :List of 2
  ..$ iteration: int 5
  ..$ pid      : int 15611

If you want your while condition to depend on the outcome of one or more of the parallel results (in res[[ii]]), then things become more complicated, because you need to decide what should happen when a future (= parallel task) is not yet resolved: should you check back, say, 5 iterations later, or should you wait? That is a design decision for your parallel algorithm, not an implementation question. Anything is possible.
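If you do need that, one common pattern is to poll the futures without blocking via resolved() and fetch the finished values with value(). A minimal sketch, with a made-up stopping rule (stop once any result exceeds 2) purely for illustration:

library("future")
plan(multisession, workers = 2L)

fs <- list()
ii <- 1L
done <- FALSE
while (!done && ii <= 100L) {
  ## Launch the next task without waiting for earlier ones
  fs[[ii]] <- future({
    Sys.sleep(runif(1))
    rnorm(1)
  })
  ii <- ii + 1L

  ## Non-blocking poll: harvest whatever has finished so far
  finished <- which(vapply(fs, resolved, logical(1)))
  vals <- vapply(fs[finished], value, numeric(1))

  ## Hypothetical stopping rule, for illustration only
  done <- any(vals > 2)
}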

PS. I don't understand the downvotes this question received. It could be that the question is poorly phrased / not very clear; remember to "help the helper help you" by being as precise as possible. If the downvotes are because of the while loop itself, I (obviously) beg to disagree.

How to parallelize while loops?

Actually, you don't need to parallelize the while loop. You can vectorize your operations over x, as below:

iter <- 1000
myvec <- c()
while (is.null(myvec) || nrow(myvec) <= iter) {
  x <- matrix(rnorm(iter * 10, mean = 0, sd = 1), ncol = 10)
  myvec <- rbind(myvec, subset(x, rowSums(x) > 2.5))
}
myvec <- head(myvec, iter)

or

iter <- 1000
myvec <- list()
nl <- 0
while (nl < iter) {
  x <- matrix(rnorm(iter * 10, mean = 0, sd = 1), ncol = 10)
  v <- subset(x, rowSums(x) > 2.5)
  nl <- nl + nrow(v)
  myvec[[length(myvec) + 1]] <- v
}
myvec <- head(do.call(rbind, myvec), iter)

which should be much faster even for large iter, I believe.
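To make the speed claim concrete, here is a rough, unscientific comparison against a naive one-row-at-a-time version; system.time() lets you check the difference on your own machine (timings will vary):

iter <- 1000

## Naive version: generate and test one candidate row at a time
system.time({
  myvec <- NULL
  while (is.null(myvec) || nrow(myvec) < iter) {
    x <- matrix(rnorm(10), ncol = 10)
    if (sum(x) > 2.5) myvec <- rbind(myvec, x)
  }
})

## Vectorized version: generate iter candidate rows per pass
system.time({
  myvec <- NULL
  while (is.null(myvec) || nrow(myvec) < iter) {
    x <- matrix(rnorm(iter * 10), ncol = 10)
    myvec <- rbind(myvec, subset(x, rowSums(x) > 2.5))
  }
  myvec <- head(myvec, iter)
})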

run a for loop in parallel in R

Thanks for your feedback. I did look up parallel after I posted this question.

Finally, after a few tries, I got it running. I have added the code below in case it is useful to others.

library(foreach)
library(doParallel)

## Set up a parallel backend using most of the available processors
cores <- detectCores()
cl <- makeCluster(cores[1] - 1)  # leave one core free so as not to overload your computer
registerDoParallel(cl)

finalMatrix <- foreach(i = 1:150000, .combine = cbind) %dopar% {
  tempMatrix <- functionThatDoesSomething()  # calling a function
  # do other things if you want

  tempMatrix  # equivalent to finalMatrix <- cbind(finalMatrix, tempMatrix)
}

## Stop the cluster
stopCluster(cl)

Note: if you allocate too many processes, you may get this error: Error in serialize(data, node$con) : error writing to connection

Note: if .combine in the foreach statement is rbind, the final object is built by appending the output of each iteration row-wise.
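For instance, a minimal toy illustration of the two .combine choices (the loop body here is just a stand-in for real work):

library(foreach)
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)

## cbind stacks results as columns; rbind stacks them as rows
by_col <- foreach(i = 1:3, .combine = cbind) %dopar% c(i, i^2)
by_row <- foreach(i = 1:3, .combine = rbind) %dopar% c(i, i^2)

stopCluster(cl)

dim(by_col)  # 2 x 3
dim(by_row)  # 3 x 2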

Hope this is useful for folks trying out parallel processing in R for the first time like me.

References:
http://www.r-bloggers.com/parallel-r-loops-for-windows-and-linux/
https://beckmw.wordpress.com/2014/01/21/a-brief-foray-into-parallel-processing-with-r/

How to speed up a while-loop in R (perhaps using dopar)?

Thank you @Bas! I tested your suggestion on a Linux machine: for a file with ~239 million lines it took less than 1 min. By adding > lines.txt I could save the results. Interestingly, my first readLines R script needed "only" 29 min, which was surprisingly fast compared with my first attempt (so I may have had some problem with my Windows computer at work that was not related to R).

How to parallelize for loops in R using multiple cores?

You're not doing anything wrong; it is just that the operations you're running don't take enough time to make parallel execution economical. Here's a snippet from the foreach vignette:

Running many tiny tasks in parallel will usually take more time to execute than running them sequentially, and if it already runs fast, there’s no motivation to make it run faster anyway. But if the operation that we’re executing in parallel takes a minute or longer, there starts to be some motivation.

You can see the benefits of parallel execution if we run sqrt not 500,000 times, but 50,000,000 times.

library(tictoc)
library(foreach)
library(doParallel)
registerDoParallel(16)

tic("no_parallel")
for (subject in 1:400) {
  for (number in 1:50000000) {
    sqrt(number)
  }
}
toc()
#> no_parallel: 271.312 sec elapsed

tic("parallel")
foreach(subject = 1:400) %dopar% {
  for (number in 1:50000000) {
    sqrt(number)
  }
}
toc()
#> parallel: 65.654 sec elapsed

Why does this for loop not work in parallel?

First of all: the loop does run in parallel; we just don't see the print output. The NULL that is returned is not the result of the print function; it is the list of return values from the parallel calls. Instead of printing, collect the values and return them. Printing to an external file would also work, but I suggest starting with the ordinary way, since parLapply collates the return values in a convenient way.

As an example how to use return values, try the following:

library(parallel)

rows <- seq(1, 9, 1)

for_test <- function(i) {
  txt <- NULL
  for (s in 1:3) {
    txt <- rbind(txt, c(i, s, i * s))
  }
  txt
}

cls <- makeCluster(length(rows))
parLapply(cls, rows, for_test)
stopCluster(cls)
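Since parLapply() returns one list element per input, you can capture that list and, for example, stack the per-task matrices into a single matrix. A small follow-up sketch, reusing rows and for_test from above:

## Capture the return values (one list element per row index)
cls <- makeCluster(length(rows))
res <- parLapply(cls, rows, for_test)
stopCluster(cls)

## Stack the nine 3 x 3 matrices into one 27 x 3 matrix
all_rows <- do.call(rbind, res)
head(all_rows)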

Explanation:

In the OP's first example, print is inside a for loop, while in the second version it is the last statement. print returns a value, while for() returns NULL.

Demo:

> x <- print(2)
[1] 2
> x
[1] 2
>
> x <- for (i in 1:2) print(2 * i)
[1] 2
[1] 4
> x
NULL
>

Parallel computation in R for saving data over loops

Because I don't have your data, I made a small dummy sample.

The packages I used:

library(tidyverse)
library(openxlsx)
library(foreach)
library(doParallel)

This part is from you; I didn't change anything:

TYPE1 <- rawdata %>% filter(TYPE == "A") 
TYPE2 <- rawdata %>% filter(TYPE == "B")

Split.TYPE1 <- split(TYPE1, TYPE1$Name)
Split.TYPE2 <- split(TYPE2, TYPE2$Name)

Define the parallel backend. I'm using 6 cores here.

cl <- makeCluster(6)
registerDoParallel(cl)

This is your first loop. Don't forget to add .packages = "openxlsx"; this makes sure the package also gets sent to the workers. I changed the code a little, because nm in names(Split.TYPE1) doesn't work for foreach. Maybe there is an easier solution, but I don't know it (see the sketch after this block for one possibility).

foreach(nm = 1:length(Split.TYPE1), .combine = cbind, .packages = "openxlsx") %dopar% {
  file <- paste0(names(Split.TYPE1)[nm], ".xlsx")
  d1 <- as.data.frame(Split.TYPE1[[names(Split.TYPE1)[nm]]])

  wb <- createWorkbook(file)
  addWorksheet(wb, "test", gridLines = TRUE)
  writeData(wb, sheet = "test", x = d1)
  saveWorkbook(wb, file, overwrite = TRUE)
}
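For what it's worth, foreach can usually iterate over a character vector directly, which would avoid the index bookkeeping; an untested sketch of the same loop written that way:

foreach(nm = names(Split.TYPE1), .packages = "openxlsx") %dopar% {
  d1 <- as.data.frame(Split.TYPE1[[nm]])
  wb <- createWorkbook()
  addWorksheet(wb, "test", gridLines = TRUE)
  writeData(wb, sheet = "test", x = d1)
  saveWorkbook(wb, paste0(nm, ".xlsx"), overwrite = TRUE)
}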

The second loop. I have only used this construct once in the past, but it worked quite well for me. This is how you make a nested foreach loop; see the foreach package's vignette on nesting for more details.

foreach(dn = 1:length(Split.TYPE2)) %:%
  foreach(fn = 1:length(unique(Split.TYPE2[[names(Split.TYPE2)[dn]]]$Surname)), .packages = "openxlsx") %dopar% {
    dnn <- paste0(names(Split.TYPE2)[dn])
    dir.create(dnn)
    sub_Split.TYPE2 <- split(Split.TYPE2[[names(Split.TYPE2)[dn]]], Split.TYPE2[[names(Split.TYPE2)[dn]]]$Surname)

    file <- file.path(dnn, paste0(names(sub_Split.TYPE2)[fn], ".xlsx"))

    d1 <- as.data.frame(sub_Split.TYPE2[[fn]])
    wb <- createWorkbook(file)
    addWorksheet(wb, "test", gridLines = TRUE)
    writeData(wb, sheet = "test", x = d1)
    saveWorkbook(wb, file, overwrite = TRUE)
  }

And stop the parallel backend.

stopCluster(cl)

Using your data I get the following folder/file structure for the nested loop:

- Alan
  - Glass.xlsx
- Heather
  - Poker.xlsx
- Rose
  - beer.xlsx
- Sam
  - Mac.xlsx
- Tara
  - tea.xlsx

Parallelizing a loop with updating during each iteration

Here is a solution that uses doParallel (configured here for UNIX systems, but it can also be used on Windows; see the doParallel documentation) and foreach. It stores the results for every state separately, and afterwards reads the single files back in and combines them into a list.

library(doParallel)
library(foreach)

path_results <- "my_path"
ncpus <- 8L
registerDoParallel(cores = ncpus)

states <- c("AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI",
            "ID","IL","IN","IA","KS","KY","LA","ME","MD","MA","MI",
            "MN","MS","MO","MT","NE","NV","NH","NJ","NM","NY","NC",
            "ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT",
            "VT","VA","WA","WV","WI","WY","DC","PR")

results <- foreach(state = states) %dopar% {
  county <- census_geo_api(key = "XXX", state = state, geo = "county", age = TRUE, sex = TRUE)
  tract  <- census_geo_api(key = "XXX", state = state, geo = "tract",  age = TRUE, sex = TRUE)
  block  <- census_geo_api(key = "XXX", state = state, geo = "block",  age = TRUE, sex = TRUE)
  results <- list(state = state, age = TRUE, sex = TRUE, block = block, tract = tract, county = county)

  # store the results as rds
  saveRDS(results, file = paste0(path_results, "/", state, ".Rds"))

  # remove the results from worker memory
  rm(county, tract, block, results)
  gc()

  # just return a string
  paste0("done with ", state)
}

library(purrr)

# combine the results into a named list
# (build the file names from `states` so names and contents stay aligned;
#  list.files() would return the files in alphabetical order, which differs
#  from the order of `states`)
CensusObj_block_age_sex <- set_names(states) %>%
  map(~ readRDS(file = paste0(path_results, "/", .x, ".Rds")))

