can't open sockets for parallel cluster
What you're describing is the classic problem with PSOCK clusters: makeCluster hangs. It can hang for dozens of reasons, because it has to create all of the "worker" processes that will perform the actual work of the cluster. That involves starting new R sessions with the Rscript command, each of which executes the .slaveRSOCK function, which creates a socket connection back to the master and then runs the slaveLoop function, where it eventually executes the tasks sent to it by the master. If anything goes wrong starting any of the worker processes (and trust me: a lot can go wrong), the master will hang while executing socketConnection, waiting for the worker to connect to it even though that worker may have died or never even been created successfully.
For many failure scenarios, using the outfile argument is great because it often reveals the error that causes the worker process to die and thus the master to hang. But if that reveals nothing, I go to manual mode. In manual mode, the master prints the command to start each worker instead of executing the command itself. It's more work, but it gives you complete control, and you can even debug into the workers if you need to.
Here's an example:
> library(parallel)
> cl <- makePSOCKcluster(1, manual=TRUE, outfile='log.txt')
Manually start worker on localhost with
'/usr/lib/R/bin/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=localhost
PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE
At this point, your R session is hung because it's executing socketConnection, just as you described. It's now your job to open a new terminal window (command prompt, or whatever) and paste in that Rscript command. As soon as you've executed it, makePSOCKcluster should return, since we only requested one worker. Of course, if something goes wrong, it won't return, but if you're lucky, you'll get an error message in your terminal window and you'll have an important clue that will hopefully lead to a solution to your problem. If you're not so lucky, the Rscript command will also hang, and you'll have to dive in even deeper.
To debug the worker, you don't execute the displayed Rscript command because you need an interactive session. Instead, you start an R session with a command such as:
$ R --vanilla --args MASTER=localhost PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE
In that R session, you can put a breakpoint on the .slaveRSOCK function and then execute it:
> debug(parallel:::.slaveRSOCK)
> parallel:::.slaveRSOCK()
Now you can start stepping through the code, possibly setting breakpoints on the slaveLoop and makeSOCKmaster functions. In your case, I assume that it will hang trying to create the socket connection, in which case the title of your question will be appropriate.
For more information on this kind of problem, see my answer to a similar question.
UPDATE
Now that this particular problem has been resolved, I can add two tips for debugging makePSOCKcluster problems:
- Check whether anything in your .Rprofile only works in interactive mode
- On Windows, use the Rterm command rather than Rgui so that you're more likely to see error messages and output from using outfile=''.
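The first tip matters because PSOCK workers are launched non-interactively via Rscript, and they run your .Rprofile too. A minimal sketch of how to guard interactive-only startup code (the option set here is purely illustrative, not from the original answer):

```r
# In ~/.Rprofile: guard anything that assumes an interactive session.
# PSOCK workers source .Rprofile in a non-interactive Rscript session,
# so code inside this guard is skipped for them.
if (interactive()) {
  # interactive-only settings (illustrative example)
  options(prompt = "R> ")
}
```

This way, anything that would fail or hang without a terminal (prompts, menus, GUI setup) never runs on a worker.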
R cannot makeCluster (multinode) due to cannot open the connection error
The socketConnection error is happening when a worker tries to connect to the master process, probably because at least one of the workers can't resolve the master's hostname, which is "ubuntu-r-node1" in your example. The master's hostname is determined using Sys.info()['nodename'] by default, and if any of the workers can't resolve that name, they won't be able to create the socket connection back to the master, and makeCluster will hang.
A common work-around for this problem is to use the makeCluster "master" option to specify the IP address of the machine where the master is executing. Here's a way to do that using the nsl function (which is not available on Windows) to resolve the master's hostname to an IP address on the master rather than on the workers:
cl <- makePSOCKcluster(c(rep('192.168.42.26', 2),
                         rep('192.168.42.32', 2)),
                       master=nsl(Sys.info()['nodename']),
                       outfile='')
By specifying IP addresses for both the workers and the master, you are much less likely to run into DNS issues. In this example, the master will start the workers by ssh'ing to '192.168.42.26' and '192.168.42.32', and the workers will connect back to the master using socketConnection with the value returned by nsl(Sys.info()['nodename']).
Note that the makeCluster "port" option can also be important if the master is behind a firewall, since by default the port is randomly chosen in the range 11000 to 11999.
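If the firewall only allows specific ports, you can pin the port explicitly instead of letting R pick one at random. A sketch using a one-worker localhost cluster so it runs on a single machine (the port number is illustrative; in a multinode setup you would combine this with the worker addresses and "master" option above):

```r
library(parallel)

# Pin the cluster port instead of letting R choose randomly in 11000-11999,
# then open just that port in the master's firewall.
cl <- makePSOCKcluster(1, port = 11234)
print(cl)
stopCluster(cl)
```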
How to shut down an open R cluster connection using parallel
Use
autoStopCluster <- function(cl) {
stopifnot(inherits(cl, "cluster"))
env <- new.env()
env$cluster <- cl
attr(cl, "gcMe") <- env
reg.finalizer(env, function(e) {
message("Finalizing cluster ...")
message(capture.output(print(e$cluster)))
try(parallel::stopCluster(e$cluster), silent = FALSE)
message("Finalizing cluster ... done")
})
cl
}
and then set up your cluster as:
cl <- autoStopCluster(makeCluster(n_c))
Old cluster objects that are no longer reachable will then be stopped automatically when they are garbage collected. You can trigger the garbage collector by calling gc(). For example, if you call:
cl <- autoStopCluster(makeCluster(n_c))
cl <- autoStopCluster(makeCluster(n_c))
cl <- autoStopCluster(makeCluster(n_c))
cl <- autoStopCluster(makeCluster(n_c))
cl <- autoStopCluster(makeCluster(n_c))
gc()
and watch your OS's process monitor, you'll see lots of workers being launched, but eventually, when the garbage collector runs, only the most recent set of cluster workers remains.
EDIT 2018-09-05: Added debug output messages to show when the registered finalizer runs, which happens when the garbage collector runs. Remove those message() lines and use silent = TRUE if you want it to be completely silent.
How to check if stopCluster (R) worked
You shouldn't get an error that cl is not found until you run rm(cl): stopping a cluster doesn't remove the object from your environment.
Use showConnections to see that no connections are active:
> require(parallel)
Loading required package: parallel
> cl <- makeCluster(3)
> cl
socket cluster with 3 nodes on host ‘localhost’
> showConnections()
description class mode text isopen can read can write
3 "<-localhost:11129" "sockconn" "a+b" "binary" "opened" "yes" "yes"
4 "<-localhost:11129" "sockconn" "a+b" "binary" "opened" "yes" "yes"
5 "<-localhost:11129" "sockconn" "a+b" "binary" "opened" "yes" "yes"
> stopCluster(cl)
> showConnections()
description class mode text isopen can read can write
>
Whether or not your computer is "returned to its normal state" depends on the type of cluster you create. If it's just a simple socket or fork cluster, then gracefully stopping the parent process should cause all the child processes to terminate. If it's a more complicated cluster, it's possible terminating R will not stop all the jobs it started on the nodes.
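To check programmatically rather than by eyeballing showConnections, you can rely on the fact that the cl object survives stopCluster but its connections are closed, so any cluster operation now errors. A small sketch (not from the original answer):

```r
library(parallel)

cl <- makeCluster(2)
stopCluster(cl)

# The object still exists, but its socket connections are closed,
# so any attempt to use the cluster fails with an error.
exists("cl")                       # still TRUE
stopped <- tryCatch({
  clusterEvalQ(cl, Sys.getpid())   # errors on a stopped cluster
  FALSE
}, error = function(e) TRUE)
stopped                            # TRUE means the cluster is down
```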
R: making cluster in doParallel / snowfall hangs
You could start by setting the "outfile" option to an empty string when creating the cluster object:
makePSOCKcluster("192.168.1.1",user="username",outfile="")
This allows you to see error messages from the workers in your terminal, which will hopefully provide a clue to the problem. If that doesn't help, I recommend using manual mode:
makePSOCKcluster("192.168.1.1",user="username",outfile="",manual=TRUE)
This bypasses ssh, and displays commands for you to execute in order to manually start each of the workers in separate terminals. This can uncover problems such as R packages that are not installed. It also allows you to debug the workers using whatever debugging tools you choose, although that takes a bit of work.
If makePSOCKcluster doesn't respond after you execute the specified command, it means that the worker wasn't able to connect to the master process. If the worker doesn't display any error message, that may indicate a networking problem, possibly due to a firewall blocking the connection. Since makePSOCKcluster uses a random port by default in R 3.X, you should specify an explicit value for the "port" option and configure your firewall to allow connections to that port.
To test for networking or firewall problems, you can try connecting to the master process using "netcat". Execute makePSOCKcluster in manual mode, specifying the hostname of the desired worker host and a port on the local machine that allows incoming connections:
> library(parallel)
> makePSOCKcluster("node03", port=11234, manual=TRUE)
Manually start worker on node03 with
'/usr/lib/R/bin/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=node01
PORT=11234 OUT=/dev/null TIMEOUT=2592000 METHODS=TRUE XDR=TRUE
Now start a terminal session on "node03" and execute "nc" using the indicated values of "MASTER" and "PORT" as arguments:
node03$ nc node01 11234
The master process should immediately return with the message:
socket cluster with 1 nodes on host ‘node03’
while netcat should display no message, since it is quietly reading from the socket connection.
However, if netcat displays the message:
nc: getaddrinfo: Name or service not known
then you have a hostname resolution problem. If you can find a hostname that does work with netcat, you may be able to get makePSOCKcluster to work by specifying that name via the "master" option: makePSOCKcluster("node03", master="node01", port=11234).
If netcat returns immediately, that may indicate that it wasn't able to connect to the specified port. If it returns after a minute or two, that may indicate that it wasn't able to communicate with the specified host at all. In either case, check netcat's return value to verify that it was an error:
node03$ echo $?
1
Hopefully that will give you enough information about the problem that you can get help from a network administrator.
makeCluster function in R snow hangs indefinitely
Unfortunately, there are a lot of things that can go wrong when creating a snow (or parallel) cluster object, and the most common failure mode is to hang indefinitely. The problem is that makeSOCKcluster launches the cluster workers one by one, and each worker (if successfully started) must make a socket connection back to the master before the master proceeds to launch the next worker. If any of the workers fail to connect back to the master, makeSOCKcluster will hang without any error message. The worker may issue an error message, but by default any error messages are redirected to /dev/null.
In addition to ssh problems, makeSOCKcluster could hang because:
- R isn't installed on a worker machine
- snow isn't installed on a worker machine
- R or snow isn't installed in the same location as on the local machine
- the current user doesn't exist on a worker machine
- there's a networking problem
- there's a firewall problem
and there are many more possibilities.
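Several of these causes can be checked from the master before creating the cluster. A sketch using system2 over ssh (the host is illustrative, and passwordless ssh is assumed; none of this is from the original answer):

```r
# Pre-flight checks for a prospective snow/parallel worker host.
# Assumes passwordless ssh to the host is already configured.
check_worker <- function(host) {
  # Is Rscript on the remote PATH?
  rscript_path <- system2("ssh", c(host, "which", "Rscript"),
                          stdout = TRUE, stderr = TRUE)
  # Can the remote R load snow?
  status <- system2("ssh", c(host, "Rscript", "-e", shQuote("library(snow)")),
                    stdout = FALSE, stderr = FALSE)
  list(rscript = rscript_path, snow_ok = identical(status, 0L))
}

# Example (hypothetical host):
# check_worker("192.168.128.24")
```

If Rscript isn't found or snow fails to load on a host, you've found your problem before makeSOCKcluster ever hangs.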
In other words, no one can diagnose this problem without further information, so you have to do some troubleshooting in order to get that information.
In my experience, the single most useful troubleshooting technique is manual mode, which you enable by specifying manual=TRUE when creating the cluster object. It's also a good idea to set outfile="" so that error messages from the workers aren't redirected to /dev/null:
cl <- makeSOCKcluster("192.168.128.24", manual=TRUE, outfile="")
makeSOCKcluster will display an Rscript command to execute in a terminal on the specified machine, and then it will wait for you to execute that command. In other words, makeSOCKcluster will hang until you manually start the worker on host 192.168.128.24, in your case. Remember that this is a troubleshooting technique, not a solution to the problem; the hope is to learn more about why the workers aren't starting by trying to start them manually.
Obviously, the use of manual mode bypasses any ssh issues (since you're not using ssh), so if you can create a SOCK cluster successfully in manual mode, then probably ssh is your problem. If the Rscript command isn't found, then either R isn't installed, or it's installed in a different location. But hopefully you'll get some error message that will lead you to the solution.
If makeSOCKcluster still hangs after you've executed the specified Rscript command on the specified machine, then you probably have a networking or firewall issue.
For more troubleshooting advice, see my answer for making cluster in doParallel / snowfall hangs.