makeCluster function in R snow hangs indefinitely
Unfortunately, there are a lot of things that can go wrong when creating a snow (or parallel) cluster object, and the most common failure mode is to hang indefinitely. The problem is that makeSOCKcluster launches the cluster workers one by one, and each worker (if successfully started) must make a socket connection back to the master before the master proceeds to launch the next worker. If any of the workers fail to connect back to the master, makeSOCKcluster will hang without any error message. The worker may issue an error message, but by default any error message is redirected to /dev/null.
In addition to ssh problems, makeSOCKcluster could hang because:
- R not installed on a worker machine
- snow not installed on a worker machine
- R or snow not installed in the same location as on the local machine
- current user doesn't exist on a worker machine
- networking problem
- firewall problem
and there are many more possibilities.
In other words, no one can diagnose this problem without further information, so you have to do some troubleshooting in order to get that information.
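Several of the causes listed above can be checked directly on a worker machine before creating the cluster. The following is only a minimal sketch, assuming you can open an R session on that machine; the variable names are illustrative and not part of snow or parallel:

```r
# Run this in an R session on the worker machine (not on the master).

# Path of the Rscript executable, or "" if it is not on the PATH
rscript_path <- unname(Sys.which("Rscript"))

# TRUE if the snow package can be loaded on this machine
has_snow <- requireNamespace("snow", quietly = TRUE)

# Directory holding the R executables, to compare with the master's location
r_bin_dir <- R.home("bin")

# Login name of the user running this session
current_user <- Sys.info()[["user"]]

cat("Rscript on PATH:", if (nzchar(rscript_path)) rscript_path else "<not found>", "\n")
cat("snow installed: ", has_snow, "\n")
cat("R bin directory:", r_bin_dir, "\n")
cat("current user:   ", current_user, "\n")
```

If the R bin directory differs from the one on the master, the worker launch command may point at a path that does not exist on the worker.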
In my experience, the single most useful troubleshooting technique is manual mode, which you enable by specifying manual=TRUE when creating the cluster object. It's also a good idea to set outfile="" so that error messages from the workers aren't redirected to /dev/null:
cl <- makeSOCKcluster("192.168.128.24", manual=TRUE, outfile="")
makeSOCKcluster will display an Rscript command to execute in a terminal on the specified machine, and then it will wait for you to execute that command. In other words, makeSOCKcluster will hang until you manually start the worker on host 192.168.128.24, in your case. Remember that this is a troubleshooting technique, not a solution to the problem, and the hope is to get more information about why the workers aren't starting by trying to start them manually.
Obviously, the use of manual mode bypasses any ssh issues (since you're not using ssh), so if you can create a SOCK cluster successfully in manual mode, then probably ssh is your problem. If the Rscript command isn't found, then either R isn't installed, or it's installed in a different location. But hopefully you'll get some error message that will lead you to the solution.
If makeSOCKcluster still just hangs after you've executed the specified Rscript command on the specified machine, then you probably have a networking or firewall issue.
For more troubleshooting advice, see my answer for making cluster in doParallel / snowfall hangs.
R: making cluster in doParallel / snowfall hangs
You could start by setting the "outfile" option to an empty string when creating the cluster object:
makePSOCKcluster("192.168.1.1",user="username",outfile="")
This allows you to see error messages from the workers in your terminal, which will hopefully provide a clue to the problem. If that doesn't help, I recommend using manual mode:
makePSOCKcluster("192.168.1.1",user="username",outfile="",manual=TRUE)
This bypasses ssh, and displays commands for you to execute in order to manually start each of the workers in separate terminals. This can uncover problems such as R packages that are not installed. It also allows you to debug the workers using whatever debugging tools you choose, although that takes a bit of work.
If makePSOCKcluster doesn't respond after you execute the specified command, it means that the worker wasn't able to connect to the master process. If the worker doesn't display any error message, it may indicate a networking problem, possibly due to a firewall blocking the connection. Since makePSOCKcluster uses a random port by default in R 3.X, you should specify an explicit value for port and configure your firewall to allow connections to that port.
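As a rough sketch of pinning the port, the following starts a single local worker on a fixed port. The port number 11234 is an arbitrary choice for illustration, and a local worker is used only so that the snippet can be run without a second machine:

```r
library(parallel)

# Port 11234 is an arbitrary example value; choose one that your firewall allows.
# The first argument 1 means "one worker on localhost".
cl <- makePSOCKcluster(1, port = 11234, outfile = "")

# Ask the worker for its process id to confirm that it is reachable
worker_pids <- clusterEvalQ(cl, Sys.getpid())

stopCluster(cl)
```

When no port argument is given, the parallel package also consults the R_PARALLEL_PORT environment variable before falling back to a random port, so the port can be fixed without changing the code.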
To test for networking or firewall problems, you could try connecting to the master process using "netcat". Execute makePSOCKcluster in manual mode, specifying the hostname of the desired worker host and the port on the local machine that should allow incoming connections:
> library(parallel)
> makePSOCKcluster("node03", port=11234, manual=TRUE)
Manually start worker on node03 with
'/usr/lib/R/bin/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=node01
PORT=11234 OUT=/dev/null TIMEOUT=2592000 METHODS=TRUE XDR=TRUE
Now start a terminal session on "node03" and execute "nc" using the indicated values of "MASTER" and "PORT" as arguments:
node03$ nc node01 11234
The master process should immediately return with the message:
socket cluster with 1 nodes on host ‘node03’
while netcat should display no message, since it is quietly reading from the socket connection.
However, if netcat displays the message:
nc: getaddrinfo: Name or service not known
then you have a hostname resolution problem. If you can find a hostname that does work with netcat, you may be able to get makePSOCKcluster to work by specifying that name via the "master" option: makePSOCKcluster("node03", master="node01", port=11234).
If netcat returns immediately, that may indicate that it wasn't able to connect to the specified port. If it returns after a minute or two, that may indicate that it wasn't able to communicate with the specified host at all. In either case, check netcat's return value to verify that it was an error:
node03$ echo $?
1
Hopefully that will give you enough information about the problem that you can get help from a network administrator.
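If netcat isn't available on the worker host, a similar check can be approximated from R itself. This is only a sketch: can_connect is a hypothetical helper, not a function from the parallel package, and the host and port in the usage comment are the example values from above:

```r
# Hypothetical helper: TRUE if a TCP connection to host:port can be opened
# within `timeout` seconds, FALSE otherwise.
can_connect <- function(host, port, timeout = 10) {
  tryCatch({
    con <- socketConnection(host = host, port = port, open = "a+b",
                            blocking = TRUE, timeout = timeout)
    close(con)
    TRUE
  },
  # A failed connection raises a warning followed by an error; treat both as failure
  warning = function(w) FALSE,
  error = function(e) FALSE)
}

# Usage on the worker host while the master is waiting in manual mode:
# can_connect("node01", 11234)
```

As with netcat, a successful connection will cause the waiting master to return, so the cluster object it creates is not usable afterwards.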
ClusterFuture with snow blocks
In your code the actual parallel workload is not handled by future but by snow::parLapply. You can see that with the following example, where I am using parallel instead of snow, which I would treat as deprecated for simple PSOCK clusters:
RunNM2 <- function(index){
  Sys.sleep(4)
  return(index)
}
library(tictoc)
library(parallel)
cl <- makePSOCKcluster(rep("localhost",8))
tic("cluster")
res <- parLapply(cl,1:8,RunNM2)
toc()
#> cluster: 4.015 sec elapsed
stopCluster(cl)
rm(cl)
Created on 2019-06-04 by the reprex package (v0.3.0)
So currently you are creating one future out of the result of your parallel computation. Instead you should create several futures, that are then evaluated in parallel:
RunNM2 <- function(index){
  Sys.sleep(4)
  return(index)
}
library(tictoc)
library(future)
cl <- makeClusterPSOCK(rep("localhost",8))
plan(cluster, workers = cl)
tic("cluster")
res.1 <- lapply(1:8, function(index) future(RunNM2(index)))
res <- values(res.1)
# blocks here
toc()
#> cluster: 4.66 sec elapsed
parallel::stopCluster(cl)
rm(cl)
Note: As per ?cluster, the preferred method for creating a ClusterFuture is future() or %<-% after registering a suitable (cluster) plan for execution.
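For the %<-% form mentioned in the note, a minimal sketch might look like the following. It assumes the future package is installed and uses two local workers purely for illustration; the variable names are arbitrary:

```r
library(future)

# Register a cluster plan with two local workers (illustrative choice)
plan(cluster, workers = 2)

# Each %<-% creates a future; the right-hand side is evaluated on a worker
x %<-% { Sys.getpid() }
y %<-% { Sys.getpid() }

# Reading the variables blocks until the corresponding futures are resolved
pids <- c(x, y)

# Switch back to sequential processing, which shuts down the workers
plan(sequential)
```

The values collected in pids are process ids of worker sessions rather than of the main R session, which is a quick way to confirm that the work ran on the cluster.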
Does clusterMap in Snow support dynamic processing?
It is true that clusterMap doesn't support dynamic processing, but there is a comment in the code suggesting that it might be implemented in the future.
In the meantime, I would create a list from the data in order to call clusterApplyLB with a slightly different worker function:
ldf <- lapply(seq_len(nrow(df_t)), function(i) df_t[i,])
clusterApplyLB(cl2, ldf, function(df) {paste(df$type, df$value)})
This was common before clusterMap was added to the snow package.
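To try the row-list approach end to end, here is a small self-contained sketch. The contents of df_t are made up for illustration (the real data frame comes from the question), and it uses the parallel package's clusterApplyLB with two local workers:

```r
library(parallel)

# Made-up stand-in for the data frame from the question
df_t <- data.frame(type = c("a", "b", "c"),
                   value = c(1, 2, 3),
                   stringsAsFactors = FALSE)

cl2 <- makePSOCKcluster(2)

# One list element per row, so each task receives only the row it needs
ldf <- lapply(seq_len(nrow(df_t)), function(i) df_t[i, ])

# Load-balanced apply: each row goes to whichever worker becomes free next
res <- clusterApplyLB(cl2, ldf, function(df) paste(df$type, df$value))

stopCluster(cl2)

unlist(res)
```

The results come back in the order of the input list, regardless of which worker processed each row.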
Note that your use of clusterMap doesn't actually require you to export df_t since your worker function doesn't refer to it. But if you're willing to export df_t to the workers, you could also use:
clusterApplyLB(cl2, 1:nrow(df_t), function(i){paste(df_t$type[i],df_t$value[i])})
In this case, df_t must be exported to the cluster workers since the worker function references it. However, this approach is generally less efficient since each worker only needs a fraction of the entire data frame.
can't open sockets for parallel cluster
What you're describing is the classic problem with PSOCK clusters: makeCluster hangs. It can hang for dozens of reasons because it has to create all of the "worker" processes that will perform the actual work of the "cluster". That involves starting new R sessions using the Rscript command, each of which executes the .slaveRSOCK function. That function creates a socket connection back to the master and then executes the slaveLoop function, where it will eventually execute the tasks sent to it by the master. If anything goes wrong starting any of the worker processes (and trust me: a lot can go wrong), the master will hang while executing socketConnection, waiting for the worker to connect to it even though that worker may have died or never even been created successfully.
For many failure scenarios, using the outfile argument is great because it often reveals the error that causes the worker process to die and thus the master to hang. But if that reveals nothing, I go to manual mode. In manual mode, the master prints the command to start each worker instead of executing the command itself. It's more work, but it gives you complete control, and you can even debug into the workers if you need to.
Here's an example:
> library(parallel)
> cl <- makePSOCKcluster(1, manual=TRUE, outfile='log.txt')
Manually start worker on localhost with
'/usr/lib/R/bin/Rscript' -e 'parallel:::.slaveRSOCK()' MASTER=localhost
PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE
At this point, your R session is hung because it's executing socketConnection, just as you described. It's now your job to open a new terminal window (command prompt, or whatever), and paste in that Rscript command. As soon as you've executed it, makePSOCKcluster should return since we only requested one worker. Of course, if something goes wrong, it won't return, but if you're lucky, you'll get an error message in your terminal window and you'll have an important clue that will hopefully lead to a solution to your problem. If you're not so lucky, the Rscript command will also hang, and you'll have to dive in even deeper.
To debug the worker, you don't execute the displayed Rscript command because you need an interactive session. Instead, you start an R session with a command such as:
$ R --vanilla --args MASTER=localhost PORT=10187 OUT=log.txt TIMEOUT=2592000 METHODS=TRUE XDR=TRUE
In that R session, you can put a breakpoint on the .slaveRSOCK function and then execute it:
> debug(parallel:::.slaveRSOCK)
> parallel:::.slaveRSOCK()
Now you can start stepping through the code, possibly setting breakpoints on the slaveLoop and makeSOCKmaster functions. In your case, I assume that it will hang trying to create the socket connection, in which case the title of your question will be appropriate.
For more information on this kind of problem, see my answer to a similar question.
UPDATE
Now that this particular problem has been resolved, I can add two tips for debugging makePSOCKcluster problems:
- Check to see if anything in your .Rprofile only works in interactive mode
- On Windows, use the Rterm command rather than Rgui so that you're more likely to see error messages and output from using outfile=''.
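For the first tip, one common way to keep interactive-only settings from running in the workers is to guard them with interactive(). The snippet below is only a sketch of such a guard; the option it sets is an arbitrary example and is not taken from the original question:

```r
# Example fragment for ~/.Rprofile
# Workers are started with Rscript, where interactive() is FALSE,
# so anything inside this block is skipped when a worker starts up.
if (interactive()) {
  options(prompt = "R> ")  # arbitrary example of an interactive-only setting
}
```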