makeCluster function in R snow hangs indefinitely

Unfortunately, there are a lot of things that can go wrong when creating a snow (or parallel) cluster object, and the most common failure mode is to hang indefinitely. The problem is that makeSOCKcluster launches the cluster workers one by one, and each worker (if successfully started) must make a socket connection back to the master before the master proceeds to launch the next worker. If any worker fails to connect back to the master, makeSOCKcluster will hang without any error message. The worker may issue an error message, but by default any worker output is redirected to /dev/null.

In addition to ssh problems, makeSOCKcluster could hang because:

  • R not installed on a worker machine
  • snow not installed on a worker machine
  • R or snow not installed in the same location as the local machine
  • current user doesn't exist on a worker machine
  • networking problem
  • firewall problem

and there are many more possibilities.

In other words, no one can diagnose this problem without further information, so you have to do some troubleshooting in order to get that information.
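A few of the items above can be checked directly on a worker machine. The following is a minimal sketch, assuming you can open an R session on that machine; the variable names are illustrative only. It prints where Rscript lives (to compare with the path on the master) and whether snow can be loaded.

```r
# Run this in an R session on each worker machine.
# Locate the Rscript executable that belongs to this R installation.
exe <- if (.Platform$OS.type == "windows") "Rscript.exe" else "Rscript"
rscript <- file.path(R.home("bin"), exe)

checks <- c(
  rscript_found  = file.exists(rscript),
  snow_installed = requireNamespace("snow", quietly = TRUE)
)

# Compare this path with the one on the master machine.
print(rscript)
print(checks)
```

If the path differs from the one on the master, or either check is FALSE, that is a likely reason for the worker failing to start.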

In my experience, the single most useful troubleshooting technique is manual mode, which you enable by specifying manual=TRUE when creating the cluster object. It's also a good idea to set outfile="" so that error messages from the workers aren't redirected to /dev/null:

cl <- makeSOCKcluster("192.168.128.24", manual=TRUE, outfile="")

makeSOCKcluster will display an Rscript command to execute in a terminal on the specified machine, and then it will wait for you to execute that command. In other words, makeSOCKcluster will hang until you manually start the worker on host 192.168.128.24, in your case. Remember that this is a troubleshooting technique, not a solution to the problem, and the hope is to get more information about why the workers aren't starting by trying to start them manually.

Obviously, the use of manual mode bypasses any ssh issues (since you're not using ssh), so if you can create a SOCK cluster successfully in manual mode, then probably ssh is your problem. If the Rscript command isn't found, then either R isn't installed, or it's installed in a different location. But hopefully you'll get some error message that will lead you to the solution.

If makeSOCKcluster still just hangs after you've executed the specified Rscript command on the specified machine, then you probably have a networking or firewall issue.
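To see what the outfile argument captures without involving a remote machine, here is a minimal local sketch. It assumes the parallel package (whose makePSOCKcluster takes the same outfile argument) and a single worker on localhost; the log file name is a temporary file chosen for illustration.

```r
library(parallel)

# Send worker output to a file instead of /dev/null.
log_file <- tempfile(fileext = ".log")
cl <- makePSOCKcluster(1, outfile = log_file)

# Anything the worker prints, including error messages, goes to the log.
invisible(clusterEvalQ(cl, cat("hello from worker\n")))

stopCluster(cl)

# Inspect what the worker wrote.
log_lines <- readLines(log_file)
print(log_lines)
```

With a worker that fails to start, this log is where its error message would appear, provided the worker got far enough to open the file.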

For more troubleshooting advice, see my answer to "R: making cluster in doParallel / snowfall hangs".

R: making cluster in doParallel / snowfall hangs

You could start by setting the "outfile" option to an empty string when creating the cluster object:

makePSOCKcluster("192.168.1.1", user="username", outfile="")

This allows you to see error messages from the workers in your terminal, which will hopefully provide a clue to the problem. If that doesn't help, I recommend using manual mode:

makePSOCKcluster("192.168.1.1", user="username", outfile="", manual=TRUE)

This bypasses ssh, and displays commands for you to execute in order to manually start each of the workers in separate terminals. This can uncover problems such as R packages that are not installed. It also allows you to debug the workers using whatever debugging tools you choose, although that takes a bit of work.

If makePSOCKcluster doesn't respond after you execute the specified command, it means that the worker wasn't able to connect to the master process. If the worker doesn't display any error message, it may indicate a networking problem, possibly due to a firewall blocking the connection. Since makePSOCKcluster uses a random port by default in R 3.X, you should specify an explicit value for port and configure your firewall to allow connections to that port.
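As a minimal sketch of fixing the port, the following starts one local worker on an explicit port. The value 11234 is only an example and is assumed to be free and permitted by your firewall; for a remote worker you would replace the 1 with the worker's hostname.

```r
library(parallel)

# Assumption: this port is unused and open for incoming connections.
port <- 11234

# Workers connect back to the master on this port instead of a random one.
cl <- makePSOCKcluster(1, port = port)
n_workers <- length(cl)

# Confirm the worker is alive by asking for its process id.
pids <- clusterCall(cl, Sys.getpid)
print(pids)

stopCluster(cl)
```

The same port can also be supplied through the R_PARALLEL_PORT environment variable, which makePSOCKcluster consults when no port argument is given.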

To test for networking or firewall problems, you could try connecting to the master process using "netcat". Execute makePSOCKcluster in manual mode, specifying the hostname of the desired worker host and the port on the local machine that should allow incoming connections:

> library(parallel)
> makePSOCKcluster("node03", port=11234, manual=TRUE)
Manually start worker on node03 with
'/usr/lib/R/bin/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=node01
PORT=11234 OUT=/dev/null TIMEOUT=2592000 METHODS=TRUE XDR=TRUE

Now start a terminal session on "node03" and execute "nc" using the indicated values of "MASTER" and "PORT" as arguments:

node03$ nc node01 11234

The master process should immediately return with the message:

socket cluster with 1 nodes on host ‘node03’

while netcat should display no message, since it is quietly reading from the socket connection.

However, if netcat displays the message:

nc: getaddrinfo: Name or service not known

then you have a hostname resolution problem. If you can find a hostname that does work with netcat, you may be able to get makePSOCKcluster to work by specifying that name via the "master" option: makePSOCKcluster("node03", master="node01", port=11234).

If netcat returns immediately, that may indicate that it wasn't able to connect to the specified port. If it returns after a minute or two, that may indicate that it wasn't able to communicate with the specified host at all. In either case, check netcat's return value to verify that it was an error:

node03$ echo $?
1

Hopefully that will give you enough information about the problem that you can get help from a network administrator.
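If netcat isn't available on the worker machine, a similar reachability test can be sketched in R with socketConnection. The helper name can_connect is hypothetical, not part of parallel or snow. Like nc, a successful connection only shows that the port is reachable; it does not start a usable worker.

```r
# Hypothetical helper: TRUE if a TCP connection to host:port can be opened.
can_connect <- function(host, port, timeout = 5) {
  con <- tryCatch(
    suppressWarnings(
      socketConnection(host = host, port = port, server = FALSE,
                       blocking = TRUE, open = "r+", timeout = timeout)
    ),
    # socketConnection signals an error when the connection fails
    error = function(e) NULL
  )
  if (is.null(con)) {
    return(FALSE)
  }
  close(con)
  TRUE
}

# On the worker, with the MASTER and PORT values printed in manual mode:
# can_connect("node01", 11234)
```

A FALSE result points to the same hostname, firewall, or routing problems that the netcat checks reveal.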

ClusterFuture with snow blocks

In your code the actual parallel workload is not handled by future but by snow::parLapply. You can see that with the following example, where I am using parallel instead of snow, which I would treat as deprecated for simple PSOCK clusters:

RunNM2 <- function(index) {
  Sys.sleep(4)
  return(index)
}
library(tictoc)
library(parallel)
cl <- makePSOCKcluster(rep("localhost", 8))
tic("cluster")
res <- parLapply(cl, 1:8, RunNM2)
toc()
#> cluster: 4.015 sec elapsed
stopCluster(cl)
rm(cl)

Created on 2019-06-04 by the reprex package (v0.3.0)

So currently you are creating one future out of the result of your parallel computation. Instead you should create several futures, that are then evaluated in parallel:

RunNM2 <- function(index) {
  Sys.sleep(4)
  return(index)
}
library(tictoc)
library(future)
cl <- makeClusterPSOCK(rep("localhost", 8))
plan(cluster, workers = cl)
tic("cluster")
# one future per task, so the eight tasks are evaluated in parallel
res.1 <- lapply(1:8, function(index) future(RunNM2(index)))
# value() on a list of futures replaces values(), which later releases of future deprecated
res <- value(res.1)
# blocks here
toc()
#> cluster: 4.66 sec elapsed
parallel::stopCluster(cl)
rm(cl)

Note: As per ?cluster, the preferred method for creating a ClusterFuture is future() or %<-% after registering a suitable (cluster) plan for execution.

Does clusterMap in Snow support dynamic processing?

It is true that clusterMap doesn't support dynamic processing, but there is a comment in the code suggesting that it might be implemented in the future.

In the meantime, I would create a list from the data in order to call clusterApplyLB with a slightly different worker function:

ldf <- lapply(seq_len(nrow(df_t)), function(i) df_t[i,])
clusterApplyLB(cl2, ldf, function(df) {paste(df$type, df$value)})

This was common before clusterMap was added to the snow package.

Note that your use of clusterMap doesn't actually require you to export df_t since your worker function doesn't refer to it. But if you're willing to export df_t to the workers, you could also use:

clusterApplyLB(cl2, 1:nrow(df_t), function(i){paste(df_t$type[i],df_t$value[i])})

In this case, df_t must be exported to the cluster workers since the worker function references it. However, this approach is generally less efficient: each worker needs only a fraction of the data frame but receives all of it.
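Both variants can be tried in a self-contained sketch. The data frame contents and the two-worker local cluster below are assumptions made for illustration, since the question's df_t and cl2 are not shown here.

```r
library(parallel)

# Hypothetical stand-ins for the question's objects.
df_t <- data.frame(type = c("a", "b", "c"), value = c(1, 2, 3),
                   stringsAsFactors = FALSE)
cl2 <- makePSOCKcluster(2)

# Variant 1: send each worker only the row it needs.
ldf <- lapply(seq_len(nrow(df_t)), function(i) df_t[i, ])
res_rows <- clusterApplyLB(cl2, ldf, function(df) paste(df$type, df$value))

# Variant 2: export the whole data frame, then send row indices.
clusterExport(cl2, "df_t", envir = environment())
res_index <- clusterApplyLB(cl2, seq_len(nrow(df_t)), function(i) {
  paste(df_t$type[i], df_t$value[i])
})

stopCluster(cl2)

# Both variants return one string per row, in input order.
print(unlist(res_rows))
print(identical(res_rows, res_index))
```

Results come back in input order even though clusterApplyLB hands out tasks dynamically, so the two lists can be compared directly.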

can't open sockets for parallel cluster

What you're describing is the classic problem with PSOCK clusters: makeCluster hangs. It can hang for dozens of reasons because it has to create all of the processes, called "worker" processes, that will perform the actual work of the "cluster". That involves starting new R sessions with the Rscript command. Each session executes the .slaveRSOCK function, which creates a socket connection back to the master and then executes the slaveLoop function, where it eventually executes the tasks sent to it by the master. If anything goes wrong starting any of the worker processes (and trust me: a lot can go wrong), the master will hang while executing socketConnection, waiting for the worker to connect to it even though that worker may have died or never even been created successfully.

For many failure scenarios, using the outfile argument is great because it often reveals the error that causes the worker process to die and thus the master to hang. But if that reveals nothing, I go to manual mode. In manual mode, the master prints the command to start each worker instead of executing the command itself. It's more work, but it gives you complete control, and you can even debug into the workers if you need to.

Here's an example:

> library(parallel)
> cl <- makePSOCKcluster(1, manual=TRUE, outfile='log.txt')
Manually start worker on localhost with
'/usr/lib/R/bin/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=localhost
PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE

At this point, your R session is hung because it's executing socketConnection, just as you described. It's now your job to open a new terminal window (command prompt, or whatever), and paste in that Rscript command. As soon as you've executed it, makePSOCKcluster should return since we only requested one worker. Of course, if something goes wrong, it won't return, but if you're lucky, you'll get an error message in your terminal window and you'll have an important clue that will hopefully lead to a solution to your problem. If you're not so lucky, the Rscript command will also hang, and you'll have to dive in even deeper.

To debug the worker, you don't execute the displayed Rscript command because you need an interactive session. Instead, you start an R session with a command such as:

$ R --vanilla --args MASTER=localhost PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE

In that R session, you can put a breakpoint on the .slaveRSOCK function and then execute it:

> debug(parallel:::.slaveRSOCK)
> parallel:::.slaveRSOCK()

Now you can start stepping through the code, possibly setting breakpoints on the slaveLoop and makeSOCKmaster functions. In your case, I assume that it will hang trying to create the socket connection, in which case the title of your question will be appropriate.

For more information on this kind of problem, see my answer to a similar question.

UPDATE

Now that this particular problem has been resolved, I can add two tips for debugging makePSOCKcluster problems:

  • Check to see if anything in your .Rprofile only works in interactive mode
  • On Windows, use the Rterm command rather than Rgui so that you're more likely to see error messages and output from using outfile=''.
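For the first tip, one way to keep interactive-only settings from affecting workers is to guard them in .Rprofile. This is a minimal sketch; the option set inside the guard is a hypothetical example, not something taken from the question.

```r
# Hypothetical .Rprofile fragment.
# Workers are started with Rscript, where interactive() is FALSE,
# so the guarded code runs only in interactive sessions.
if (interactive()) {
  options(prompt = "R> ")
}
```

Anything in .Rprofile that opens a graphics device, asks for input, or assumes a terminal is a candidate for this kind of guard.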

