Parallel parLapply Setup

parallel parLapply setup

Since you're calling functions from the NLP and openNLP packages on the cluster workers, you should load those packages on each of the workers before calling parLapply. You can do that from the worker function, but I tend to use clusterCall or clusterEvalQ right after creating the cluster object:

clusterEvalQ(cl, {library(openNLP); library(NLP)})

Since as.String and Maxent_Word_Token_Annotator live in those packages, loading the packages on the workers makes them available, so they shouldn't be exported with clusterExport.

Note that while running your example on my machine, I noticed that the PTA object doesn't work after being exported to the worker machines. Presumably there is something in that object that can't be safely serialized and unserialized. After I created that object on the workers using clusterEvalQ, the example ran successfully. Here it is, using openNLP 0.2-1:

library(parallel)

tagPOS <- function(x, ...) {
  s <- as.String(x)
  word_token_annotator <- Maxent_Word_Token_Annotator()
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- annotate(s, word_token_annotator, a2)
  a3 <- annotate(s, PTA, a2)
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  list(POStagged = POStagged, POStags = POStags)
}

text.var <- c("I like it.", "This is outstanding soup!",
              "I really must get the recipe.")

cl <- makeCluster(mc <- getOption("cl.cores", detectCores()/2))
clusterEvalQ(cl, {
  library(openNLP)
  library(NLP)
  # create PTA on each worker, since it doesn't survive serialization
  PTA <- Maxent_POS_Tag_Annotator()
})
m <- parLapply(cl, text.var, tagPOS)
print(m)
stopCluster(cl)

If clusterEvalQ fails because Maxent_POS_Tag_Annotator is not found, you might be loading the wrong version of openNLP on the workers. You can check which package versions the workers actually load by executing sessionInfo via clusterEvalQ:

library(parallel)
cl <- makeCluster(2)
clusterEvalQ(cl, {library(openNLP); library(NLP)})
clusterEvalQ(cl, sessionInfo())

This will return the results of executing sessionInfo() on each of the cluster workers. Here is the version information for some of the packages that I'm using and that work for me:

other attached packages:
[1] NLP_0.1-0 openNLP_0.2-1

loaded via a namespace (and not attached):
[1] openNLPdata_1.5.3-1 rJava_0.9-4
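
If you only need to check a single package, a more targeted alternative (a minimal sketch, reusing the same cluster) is to run packageVersion on the workers:

clusterEvalQ(cl, packageVersion("openNLP"))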

parLapply on 2 sets of lists

There is no need to calculate the (potentially large) CalcList. You can pass either the list of centres or the list of destinations to parLapply. The worker function then applies your original function to each element of the other list, e.g. using lapply.
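
For example, a minimal sketch of this pattern (centres, destinations, and the pairwise function calcPair are hypothetical placeholders for your own objects):

library(parallel)

# hypothetical pairwise computation between one centre and one destination
calcPair <- function(centre, dest) sum((centre - dest)^2)

centres <- list(c(0, 0), c(1, 1))
destinations <- list(c(2, 2), c(3, 3), c(4, 4))

cl <- makeCluster(2)
# the worker function refers to calcPair and destinations, so export them
clusterExport(cl, c("calcPair", "destinations"))
# parallelise over centres; each worker loops over all destinations
res <- parLapply(cl, centres, function(centre) {
  lapply(destinations, function(dest) calcPair(centre, dest))
})
stopCluster(cl)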

R parLapply not parallel

I don't think parLapply is running sequentially. More likely, it's just running inefficiently, making it appear to run sequentially.

I have a few suggestions to improve it:

  • Don't define the worker function inside parSolver
  • Don't export all of varMatrix to each worker
  • Create the cluster outside of parSolver

The first point is important, because as your example now stands, all of the variables defined in parSolver will be serialized along with the anonymous worker function and sent to the workers by parLapply. By defining the worker function outside of any function, the serialization won't capture any unwanted variables.

The second point avoids unnecessary socket I/O and uses less memory, making the code more scalable.

Here's a fake, but self-contained example that is similar to yours that demonstrates my suggestions:

# Define the worker function outside of any function to avoid
# serialization problems (such as unexpected variable capture)
workerfn <- function(mat, var1) {
  library(raster)
  mat * var1
}

parSolver <- function(cl, varMatrix, var1) {
  # split the row indices into one chunk per worker
  parts <- splitIndices(nrow(varMatrix), length(cl))
  varMatrixParts <- lapply(parts, function(i) varMatrix[i, , drop = FALSE])
  # send each worker only its own chunk of the matrix
  rParts <- clusterApply(cl, varMatrixParts, workerfn, var1)
  do.call(rbind, rParts)
}

library(parallel)
cl <- makePSOCKcluster(3)
r <- parSolver(cl, matrix(1:20, 10, 2), 2)
print(r)

Note that this takes advantage of the clusterApply function to iterate over a list of row-chunks of varMatrix so that the entire matrix doesn't need to be sent to everyone. It also avoids calls to clusterEvalQ and clusterExport, simplifying the code, as well as making it a bit more efficient.

using parallel's parLapply: unable to access variables within parallel code

You need to export those variables to the other R processes in the cluster:

cl <- makeCluster(mc <- getOption("cl.cores", 4))
clusterExport(cl=cl, varlist=c("text.var", "ntv", "gc.rate", "pos"))
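
Once exported, the worker function can refer to those variables by name. Here is a minimal, self-contained sketch (the values and the pos function are placeholders, since the original objects aren't shown in the question):

library(parallel)

text.var <- c("I like it.", "This is outstanding soup!")
ntv <- length(text.var)
gc.rate <- 10
pos <- function(x) toupper(x)   # stand-in for the real tagging function

cl <- makeCluster(mc <- getOption("cl.cores", 4))
clusterExport(cl = cl, varlist = c("text.var", "ntv", "gc.rate", "pos"))
out <- parLapply(cl, seq_len(ntv), function(i) {
  if (i %% gc.rate == 0) gc()   # occasional garbage collection
  pos(text.var[i])
})
stopCluster(cl)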

R using variables in parLapply

You have to use clusterExport:

clusterExport(cl = NULL, varlist, envir = .GlobalEnv)

clusterExport assigns the values on the master R process of the variables named in varlist to variables of the same names in the global environment (aka ‘workspace’) of each node. The environment on the master from which variables are exported defaults to the global environment.

In your case:

library(parallel)

Druckfaktor <- 1.3

no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
clusterExport(cl, c("Druckfaktor"))
[...]
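
One caveat: envir defaults to .GlobalEnv, so calling clusterExport from inside a function won't find variables local to that function unless you pass envir = environment(). A minimal sketch of that pattern (run_parallel and workerfn are hypothetical names):

library(parallel)

workerfn <- function(x) x * Druckfaktor   # looks up Druckfaktor on the worker

run_parallel <- function(cl) {
  Druckfaktor <- 1.3   # local variable, not in the global environment
  clusterExport(cl, "Druckfaktor", envir = environment())
  parLapply(cl, 1:4, workerfn)
}

cl <- makeCluster(2)
run_parallel(cl)
stopCluster(cl)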

Parallel process list of rasters with focal function and parLapply in R

I took a different approach but ended up getting what I intended. Instead of using parLapply, I used foreach to loop through my list of rasters and execute my density function in parallel.

This blog was really helpful: http://www.gis-blog.com/increasing-the-speed-of-raster-processing-with-r-part-23-parallelisation/

library(doParallel)
library(foreach)

# Density function, 1 km circular radius
Density_Function_1000 <- function(raster_layer) {
  raster_name <- names(raster_layer)
  short_name <- substr(raster_name, 1, 4)
  weight <- focalWeight(raster_layer, 1000, type = "circle")
  half_output <- "X:/Path"
  full_output <- paste0(half_output, short_name, "_1km.tif")
  focal(raster_layer, weight, fun = sum, filename = full_output,
        na.rm = TRUE, pad = TRUE, NAonly = FALSE, overwrite = TRUE)
}

# Define how many cores you want to use
UseCores <- detectCores() - 1

# Register CoreCluster
cl <- makeCluster(UseCores)
registerDoParallel(cl)

# Create my list of rasters
raster_list <- list(roads_raster, cuts_raster, wells_raster,
                    seis_raster, pipes_raster, fires_raster)

# Use a foreach loop and the %dopar% operator to execute
# the density function in parallel
foreach(i = raster_list) %dopar% {
  library(raster)
  Density_Function_1000(i)
}

# End cluster
stopCluster(cl)
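
As a small refinement, foreach can load packages on the workers for you via its .packages argument, instead of calling library(raster) inside the loop body:

foreach(i = raster_list, .packages = "raster") %dopar% {
  Density_Function_1000(i)
}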

