Can't Open Sockets for Parallel Cluster

can't open sockets for parallel cluster

What you're describing is the classic problem with PSOCK clusters: makeCluster hangs. It can hang for dozens of reasons, because it has to create all of the "worker" processes that will perform the actual work of the "cluster", and that involves starting new R sessions with the Rscript command. Each of those sessions executes the .slaveRSOCK function, which creates a socket connection back to the master and then runs the slaveLoop function, where it eventually executes the tasks sent to it by the master. If anything goes wrong while starting any of the worker processes (and trust me: a lot can go wrong), the master will hang in socketConnection, waiting for the worker to connect to it even though that worker may have died or was never successfully created at all.

For many failure scenarios, using the outfile argument is great because it often reveals the error that causes the worker process to die and thus the master to hang. But if that reveals nothing, I go to manual mode. In manual mode, the master prints the command to start each worker instead of executing the command itself. It's more work, but it gives you complete control, and you can even debug into the workers if you need to.

Here's an example:

> library(parallel)
> cl <- makePSOCKcluster(1, manual=TRUE, outfile='log.txt')
Manually start worker on localhost with
'/usr/lib/R/bin/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=localhost
PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE

At this point, your R session is hung because it's executing socketConnection, just as you described. It's now your job to open a new terminal window (command prompt, or whatever), and paste in that Rscript command. As soon as you've executed it, makePSOCKcluster should return since we only requested one worker. Of course, if something goes wrong, it won't return, but if you're lucky, you'll get an error message in your terminal window and you'll have an important clue that will hopefully lead to a solution to your problem. If you're not so lucky, the Rscript command will also hang, and you'll have to dive in even deeper.

To debug the worker, you don't execute the displayed Rscript command because you need an interactive session. Instead, you start an R session with a command such as:

$ R --vanilla --args MASTER=localhost PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE

In that R session, you can put a breakpoint on the .slaveRSOCK function and then execute it:

> debug(parallel:::.slaveRSOCK)
> parallel:::.slaveRSOCK()

Now you can start stepping through the code, possibly setting breakpoints on the slaveLoop and makeSOCKmaster functions. In your case, I assume that it will hang trying to create the socket connection, in which case the title of your question will be appropriate.

For more information on this kind of problem, see my answer to a similar question.

UPDATE

Now that this particular problem has been resolved, I can add two tips for debugging makePSOCKcluster problems:

  • Check whether anything in your .Rprofile only works in interactive mode (see the sketch after this list).
  • On Windows, use the Rterm command rather than Rgui so that you're more likely to see error messages and the output produced by outfile=''.
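
For the first tip, here's a minimal sketch of the kind of guard I mean (the body is purely illustrative): anything in ~/.Rprofile that only makes sense in an interactive session can be wrapped in an if (interactive()) block so that the non-interactive Rscript sessions running the workers skip it:

# In ~/.Rprofile: skip interactive-only setup in the non-interactive
# Rscript sessions that run the cluster workers.
if (interactive()) {
  # purely illustrative examples of interactive-only code
  options(prompt = "R> ")
  message("Welcome back!")
}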

R cannot makeCluster (multinode) due to cannot open the connection error

The socketConnection error is happening when a worker tries to connect to the master process, probably because at least one of the workers can't resolve the master's hostname, which is "ubuntu-r-node1" in your example. The master's hostname is determined using Sys.info()['nodename'] by default, and if any of the workers can't resolve this name, they won't be able to create the socket connection to the master, and makeCluster will hang.
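
As a quick sanity check on the master (this only tells you how the master itself resolves the name; each worker must be able to resolve it too), you can look at the name the workers will be handed and the address it maps to, e.g.:

# The hostname that the workers will be told to connect back to ...
Sys.info()['nodename']
# ... and the IP address it resolves to on the master (nsl is in the
# utils package, so it's available in an ordinary interactive session).
nsl(Sys.info()['nodename'])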

A common work-around for this problem is to use the makeCluster "master" option to specify the IP address of the machine where the master is executing. Here's a way to do that using the nsl function (which is not available on Windows) to look up the master's hostname on the master rather than the workers:

cl <- makePSOCKcluster(c(rep('192.168.42.26', 2),
                         rep('192.168.42.32', 2)),
                       master=nsl(Sys.info()['nodename']),
                       outfile='')

By specifying IP addresses for both the workers and the master, you are much less likely to run into DNS problems. In this example, the master starts the workers by ssh'ing to '192.168.42.26' and '192.168.42.32', and the workers connect back to the master using socketConnection with the value returned by nsl(Sys.info()['nodename']).

Note that the makeCluster "port" option can also be important if the master has a firewall, since by default, the port is randomly chosen in the range 11000 to 11999.
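
If nsl isn't available (it's missing on Windows), or if you need a fixed port so that the master's firewall can be opened for it, a variant along these lines should work; the master address and port number below are made-up example values:

cl <- makePSOCKcluster(c(rep('192.168.42.26', 2),
                         rep('192.168.42.32', 2)),
                       master='192.168.42.10',   # hypothetical IP of the master
                       port=11500,               # any port the firewall allows
                       outfile='')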

How to shut down an open R cluster connection using parallel

Use

autoStopCluster <- function(cl) {
  stopifnot(inherits(cl, "cluster"))
  env <- new.env()
  env$cluster <- cl
  attr(cl, "gcMe") <- env
  reg.finalizer(env, function(e) {
    message("Finalizing cluster ...")
    message(capture.output(print(e$cluster)))
    try(parallel::stopCluster(e$cluster), silent = FALSE)
    message("Finalizing cluster ... done")
  })
  cl
}

and then set up your cluster as:

cl <- autoStopCluster(makeCluster(n_c))

Cluster objects that are no longer reachable will then be stopped automatically when they are garbage collected. You can trigger the garbage collector explicitly by calling gc(). For example, if you call:

cl <- autoStopCluster(makeCluster(n_c))
cl <- autoStopCluster(makeCluster(n_c))
cl <- autoStopCluster(makeCluster(n_c))
cl <- autoStopCluster(makeCluster(n_c))
cl <- autoStopCluster(makeCluster(n_c))
gc()

and watch your OS's process monitor, you'll see lots of workers being launched, but once the garbage collector runs, only the most recent set of cluster workers remains.

EDIT 2018-09-05: Added debug output messages to show when the registered finalizer runs, which happens when the garbage collector runs. Remove those message() lines and use silent = TRUE if you want it to be completely silent.

How to check if stopCluster (R) worked

You shouldn't get an error saying that cl is not found unless you run rm(cl): stopping a cluster doesn't remove the object from your environment.

Use showConnections to see that no connections are active:

> require(parallel)
Loading required package: parallel
> cl <- makeCluster(3)
> cl
socket cluster with 3 nodes on host ‘localhost’
> showConnections()
  description         class      mode  text     isopen   can read can write
3 "<-localhost:11129" "sockconn" "a+b" "binary" "opened" "yes"    "yes"
4 "<-localhost:11129" "sockconn" "a+b" "binary" "opened" "yes"    "yes"
5 "<-localhost:11129" "sockconn" "a+b" "binary" "opened" "yes"    "yes"
> stopCluster(cl)
> showConnections()
description class mode text isopen can read can write
>

Whether or not your computer is "returned to its normal state" depends on the type of cluster you create. If it's just a simple socket or fork cluster, then gracefully stopping the parent process should cause all the child processes to terminate. If it's a more complicated cluster, it's possible terminating R will not stop all the jobs it started on the nodes.
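
If the goal is simply to make sure the cluster is always torn down, one common pattern (just a sketch, not something from the question) is to create the cluster inside a function and register the cleanup with on.exit, so stopCluster runs even if the parallel work fails:

run_parallel <- function(n_workers) {
  cl <- parallel::makeCluster(n_workers)
  on.exit(parallel::stopCluster(cl), add = TRUE)  # always stop the workers
  parallel::parLapply(cl, 1:10, function(i) i^2)  # the actual parallel work
}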

R: making cluster in doParallel / snowfall hangs

You could start by setting the "outfile" option to an empty string when creating the cluster object:

makePSOCKcluster("192.168.1.1",user="username",outfile="")

This allows you to see error messages from the workers in your terminal, which will hopefully provide a clue to the problem. If that doesn't help, I recommend using manual mode:

makePSOCKcluster("192.168.1.1",user="username",outfile="",manual=TRUE)

This bypasses ssh, and displays commands for you to execute in order to manually start each of the workers in separate terminals. This can uncover problems such as R packages that are not installed. It also allows you to debug the workers using whatever debugging tools you choose, although that takes a bit of work.

If makePSOCKcluster doesn't respond after you execute the specified command, it means that the worker wasn't able to connect to the master process. If the worker doesn't display any error message, it may indicate a networking problem, possibly due to a firewall blocking the connection. Since makePSOCKcluster uses a random port by default in R 3.X, you should specify an explicit value for port and configure your firewall to allow connections to that port.
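
For example, pinning the port in the makePSOCKcluster call (11234 below is an arbitrary value, chosen to match the netcat test that follows) means you only need a single firewall rule on the master:

cl <- makePSOCKcluster("192.168.1.1", user="username",
                       port=11234, outfile="")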

To test for networking or firewall problems, you can try connecting to the master process using "netcat". Execute makePSOCKcluster in manual mode, specifying the hostname of the desired worker host and the port on the local machine that should allow incoming connections:

> library(parallel)
> makePSOCKcluster("node03", port=11234, manual=TRUE)
Manually start worker on node03 with
'/usr/lib/R/bin/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=node01
PORT=11234 OUT=/dev/null TIMEOUT=2592000 METHODS=TRUE XDR=TRUE

Now start a terminal session on "node03" and execute "nc" using the indicated values of "MASTER" and "PORT" as arguments:

node03$ nc node01 11234

The master process should immediately return with the message:

socket cluster with 1 nodes on host ‘node03’

while netcat should display no message, since it is quietly reading from the socket connection.

However, if netcat displays the message:

nc: getaddrinfo: Name or service not known

then you have a hostname resolution problem. If you can find a hostname that does work with netcat, you may be able to get makePSOCKcluster to work by specifying that name via the "master" option: makePSOCKcluster("node03", master="node01", port=11234).

If netcat returns immediately, that may indicate that it wasn't able to connect to the specified port. If it returns after a minute or two, that may indicate that it wasn't able to communicate with the specified host at all. In either case, check netcat's exit status to verify that it was an error:

node03$ echo $?
1

Hopefully that will give you enough information about the problem that you can get help from a network administrator.

makeCluster function in R snow hangs indefinitely

Unfortunately, there are a lot of things that can go wrong when creating a snow (or parallel) cluster object, and the most common failure mode is to hang indefinitely. The problem is that makeSOCKcluster launches the cluster workers one by one, and each worker (if successfully started) must make a socket connection back to the master before the master proceeds to launch the next worker. If any of the workers fail to connect back to the master, makeSOCKcluster will hang without any error message. The worker may issue an error message, but by default any error message is redirected to /dev/null.

In addition to ssh problems, makeSOCKcluster could hang because:

  • R not installed on a worker machine
  • snow not installed on the worker machine
  • R or snow not installed in the same location as on the local machine
  • current user doesn't exist on a worker machine
  • networking problem
  • firewall problem

and there are many more possibilities.

In other words, no one can diagnose this problem without further information, so you have to do some troubleshooting in order to get that information.
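
A couple of the items in the list above can be ruled out quickly from the master with one-liners like these (a rough sketch; the address is the worker from your example, and it assumes the passwordless ssh that makeSOCKcluster's default launcher needs anyway):

worker <- '192.168.128.24'
# Can we log in non-interactively, and is Rscript on the worker's PATH?
system2('ssh', c(worker, 'which Rscript'))
# Compare with the location on the local machine.
Sys.which('Rscript')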

In my experience, the single most useful troubleshooting technique is manual mode which you enable by specifying manual=TRUE when creating the cluster object. It's also a good idea to set outfile="" so that error messages from the workers aren't redirected to /dev/null:

cl <- makeSOCKcluster("192.168.128.24", manual=TRUE, outfile="")

makeSOCKcluster will display an Rscript command to execute in a terminal on the specified machine, and then it will wait for you to execute that command. In other words, makeSOCKcluster will hang until you manually start the worker on host 192.168.128.24, in your case. Remember that this is a troubleshooting technique, not a solution to the problem, and the hope is to get more information about why the workers aren't starting by trying to start them manually.

Obviously, the use of manual mode bypasses any ssh issues (since you're not using ssh), so if you can create a SOCK cluster successfully in manual mode, then probably ssh is your problem. If the Rscript command isn't found, then either R isn't installed, or it's installed in a different location. But hopefully you'll get some error message that will lead you to the solution.

If makeSOCKcluster still just hangs after you've executed the specified Rscript command on the specified machine, then you probably have a networking or firewall issue.

For more troubleshooting advice, see my answer for making cluster in doParallel / snowfall hangs.
