Parallel R on a Windows Cluster

Parallel R on a Windows cluster

Setting up snow on a Windows cluster is rather difficult. Each of the machines needs to have R and snow installed, but that's the easy part. To start a SOCK cluster, you would need an sshd daemon running on each of the worker machines, but you can still run into troubles, so I wouldn't recommend it unless you're good at debugging and Windows system administration.

I think your best option on a Windows cluster is to use MPI. I don't have any experience with MPI on Windows myself, but I've heard of people having success with the MPICH and DeinoMPI MPI distributions for Windows. Once MPI is installed on your cluster, you also need to install the Rmpi package from source on each of your worker machines. You would then create the cluster object using the makeMPIcluster function. It's a lot of work, but I think it's more likely to eventually work than trying to use a SOCK cluster due to the problems with ssh/sshd on Windows.

If you're desperate to run a parallel job once or twice on a Windows cluster, you could try using manual mode. It allows you to create a SOCK cluster without ssh:

workers <- c(rep("COMP01",32), rep("COMP02",32))
cl <- makeSOCKluster(workers, manual=TRUE)

The makeSOCKcluster function will prompt you to start each one of the workers, displaying the command to use for each. You have to manually open a command window on the specified machine and execute the specified command. It can be extremely tedious, particularly with many workers, but at least it's not complicated or tricky. It can also be very useful for debugging in combination with the outfile='' option.

How do I parallelize in r on windows - example?

Posting this because this took me bloody forever to figure out. Here's a simple example of parallelization in r that will let you test if things are working right for you and get you on the right path.

library(snow)
z=vector('list',4)
z=1:4
system.time(lapply(z,function(x) Sys.sleep(1)))
cl<-makeCluster(###YOUR NUMBER OF CORES GOES HERE ###,type="SOCK")
system.time(clusterApply(cl, z,function(x) Sys.sleep(1)))
stopCluster(cl)

You should also use library doSNOW to register foreach to the snow cluster, this will cause many packages to parallelize automatically. The command to register is registerDoSNOW(cl) (with cl being the return value from makeCluster()) , the command that undoes registration is registerDoSEQ(). Don't forget to turn off your clusters.

How to shut down an open R cluster connection using parallel

Use

autoStopCluster <- function(cl) {
  stopifnot(inherits(cl, "cluster"))
  env <- new.env()
  env$cluster <- cl
  attr(cl, "gcMe") <- env
  reg.finalizer(env, function(e) {
    message("Finalizing cluster ...")
    message(capture.output(print(e$cluster)))
    try(parallel::stopCluster(e$cluster), silent = FALSE)
    message("Finalizing cluster ... done")
  })
  cl
}

and then set up your cluster as:

cl <- autoStopCluster(makeCluster(n_c))

Old cluster objects no longer reachable will then be automatically stopped when garbage collected. You can trigger the garbage collector by calling gc(). For example, if you call:

cl <- autoStopCluster(makeCluster(n_c))
cl <- autoStopCluster(makeCluster(n_c))
cl <- autoStopCluster(makeCluster(n_c))
cl <- autoStopCluster(makeCluster(n_c))
cl <- autoStopCluster(makeCluster(n_c))
gc()

and watch your OSes process monitor, you'll see lots of workers being launched, but eventually when the garbage collector runs only the most recent set of cluster workers remain.

EDIT 2018-09-05: Added debug output messages to show when the registered finalizer runs, which happens when the garbage collector runs. Remove those message() lines and use silent = TRUE if you want it to be completely silent.

R with parallel & pls - How to handle non-terminating Rscript processes in Windows

You want to create only one "cluster" object (= one call to makeCluster(), not multiple). Something like:

cl <- makeCluster(num_cores, type = "PSOCK")
pls.options(parallel = cl)

[...]
for (i in 1:num_iterations) {
   gas1 <- plsr(octane ~ NIR, data = gasoline, validation = "LOO")
}

stopCluster(cl)

Explanation of your observations: If you use pls.options(parallel = makeCluster(...)) you end up creating another cluster in that call, which will not be explicitly stopped since you don't have a handle to it. Its underlying connections will eventually be closed when R's garbage collector finds such a "stray" cluster - this is why/when you get those warnings. If you put pls.options(parallel = makeCluster(...)) inside the loop you'll create one stray cluster per iteration and you'll get even more warnings. The garbage collector runs at "random" times, which is why the traceback of those warnings appear random/non-reproducible.

Create a cluster of co-workers' Windows 7 PCs for parallel processing in R?

Yes you can. There are a number of ways. One of the easiest is to use redis as a backend (as easy as calling sudo apt-get install redis-server on an Ubuntu machine; rumor has that you could have a redis backend on a windows machine too).

By using the doRedis package, you can very easily en-queue jobs on a task queue in redis, and then use one, two, ... idle workers to query the queue. Best of all, you can easily mix operating systems so yes, your co-workers' windows machines qualify. Moreover, you can use one, two, three, ... clients as you see fit and need and scale up or down. The queue does not know or care, it simply supplies jobs.

Bost of all, the vignette in the doRedis has working examples of a mix of Linux and Windows clients to make a bootstrapping example go faster.

Parallel R on a Windows Cluster