How to run a job array in R using the Rscript command from the command line?
This is how I would set it up on a cluster using the SLURM scheduler.
Job submission script (submitted with sbatch):
#!/bin/bash
#SBATCH --partition=xxx ### Partition (like a queue in PBS)
#SBATCH --job-name=array_example ### Job Name
#SBATCH -o jarray.%j.%N.out ### File in which to store job output/error
#SBATCH --time=00-00:30:00 ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1 ### Node count required for the job
#SBATCH --ntasks=1 ### Number of tasks to be launched per Node
#SBATCH --cpus-per-task=2 ### Number of threads per task (OMP threads)
#SBATCH --mail-type=FAIL ### When to send mail
#SBATCH --mail-user=xxx@gmail.com
#SBATCH --get-user-env ### Import your user environment setup
#SBATCH --requeue ### On failure, requeue for another try
#SBATCH --verbose ### Increase informational messages
#SBATCH --array=1-500%50 ### Array index range | %50: max number of simultaneous tasks
echo
echo "****************************************************************************"
echo "* *"
echo "********************** sbatch script for array job *************************"
echo "* *"
echo "****************************************************************************"
echo
current_dir=${PWD##*/}
echo "Current dir: $current_dir"
echo
pwd
echo
# First we ensure a clean running environment:
module purge
# Load R
module load R/R-3.5.0
### Initialization
# Get Array ID
i=${SLURM_ARRAY_TASK_ID}
# Output file
outFile="output_parameter_${i}.txt"
# Pass line #i to a R script
Rscript --vanilla my_R_script.R ${i} ${outFile}
echo
echo '******************** FINISHED ***********************'
echo
my_R_script.R, which takes args from the sbatch script:
args <- commandArgs(trailingOnly = TRUE)
str(args)
cat(args, sep = "\n")
# test if there is at least one argument: if not, return an error
if (length(args) == 0) {
stop("At least one argument must be supplied (input file).\n", call. = FALSE)
} else if (length(args) == 1) {
# default output file
args[2] = "out.txt"
}
cat("\n")
print("Hello World !!!")
cat("\n")
print(paste0("i = ", as.numeric(args[1])))
print(paste0("outFile = ", args[2]))
### Parallel:
# https://hpc.nih.gov/apps/R.html
# https://github.com/tobigithub/R-parallel/blob/gh-pages/R/code-setups/Install-doSNOW-parallel-DeLuxe.R
# load doSnow and (parallel for CPU info) library
library(doSNOW)
library(parallel)
detectBatchCPUs <- function() {
ncores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK"))
if (is.na(ncores)) {
ncores <- as.integer(Sys.getenv("SLURM_JOB_CPUS_PER_NODE"))
}
if (is.na(ncores)) {
return(2) # default
}
return(ncores)
}
ncpus <- detectBatchCPUs()
# or ncpus <- future::availableCores()
cat(ncpus, " cores detected.\n")
cluster = makeCluster(ncpus)
# register the cluster
registerDoSNOW(cluster)
# get info
getDoParWorkers(); getDoParName();
##### insert parallel computation here #####
# stop cluster and remove clients
stopCluster(cluster); print("Cluster stopped.")
# insert serial backend, otherwise error in repetitive tasks
registerDoSEQ()
# clean up a bit.
invisible(gc()); remove(ncpus); remove(cluster)
# END
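As a sketch of what the "insert parallel computation here" placeholder might contain, here is a minimal example. Note this uses base R's parallel package (already loaded above) rather than a doSNOW foreach loop, and the squaring workload is my own toy illustration, not part of the original script:

```r
library(parallel)

# in the real script ncpus would come from detectBatchCPUs()
ncpus <- 2L
cl <- makeCluster(ncpus)

# toy workload: square each input on the worker processes
results <- parSapply(cl, 1:10, function(x) x^2)
print(results)

# always release the workers when done
stopCluster(cl)
```

The same pattern (make cluster, compute, stop cluster) is what the surrounding doSNOW code implements; only the backend differs.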
P.S.: if you want to read a parameter file line by line, include the following lines in the sbatch script, then pass the parameters to my_R_script.R:
### Parameter file to read
parameter_file="parameter_file.txt"
echo "Parameter file: ${parameter_file}"
echo
# Read line #i from the parameter file
PARAMETERS=$(sed "${i}q;d" ${parameter_file})
echo "Parameters are: ${PARAMETERS}"
echo
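Alternatively (my own variant, not part of the original answer), the line lookup can be done inside R itself, so the sbatch script only needs to pass the task id. The demo file and the hard-coded i below are stand-ins for a real parameter file and SLURM_ARRAY_TASK_ID:

```r
# demo parameter file: one parameter set per line
writeLines(c("alpha 0.1", "beta 0.2", "gamma 0.3"), "parameter_file.txt")

# i would normally come from commandArgs(trailingOnly = TRUE)[1]
i <- 2

# read up to line i and keep line i (equivalent to sed "${i}q;d")
params_line <- readLines("parameter_file.txt", n = i)[i]
params <- strsplit(params_line, "[[:space:]]+")[[1]]
print(params)  # "beta" "0.2"
</imports>
```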
Refs:
- http://tuxette.nathalievilla.org/?p=1696
- https://hpc.nih.gov/apps/R.html
- https://github.com/tobigithub/R-parallel/blob/gh-pages/R/code-setups/Install-doSNOW-parallel-DeLuxe.R
Parallelizing an Rscript using a job array in Slurm
You should avoid using "R CMD BATCH". It doesn't handle arguments the way most command-line tools do. "Rscript" has been the recommended option for a while now. By calling "R CMD BATCH" you are basically ignoring the "#!/usr/bin/env Rscript" part of your script.
So change your script file to
#!/bin/bash -l
#SBATCH --time=00:01:00
#SBATCH --array=1-10
conda activate R
cd ~/test
Rscript ~/Rscript_test.R $SLURM_ARRAY_TASK_ID
And then be careful in your script that you aren't using the same variable as both a string and a data.frame. You can't easily paste a data.frame into a file path, for example. So:
taskid <- commandArgs(trailingOnly=TRUE)
# taskid <- Sys.getenv('SLURM_ARRAY_TASK_ID') # This should also work
print(paste0("the number processed was... ", taskid))
outdata <- as.data.frame(taskid)
outfile <- paste0("~/test/", taskid, ".out")
write.table(outdata, outfile, quote=FALSE, row.names=FALSE, col.names=FALSE)
The extra files with just the array number were created because the usage of R CMD BATCH is
R CMD BATCH [options] infile [outfile]
So the $SLURM_ARRAY_TASK_ID value you were passing at the command line was treated as the outfile name. Instead, that value needed to be passed as options. But again, it's better to use Rscript, which has more standard argument conventions.
How can I pass an array as argument to an R script command line run?
First, I am not sure you should use set to declare a variable in the terminal, but you can read more about it on the man set page.
There are a few ways to pass an array to an R script; it mainly depends on how you declare my_array in the terminal.
1 - "my_array" can be a string in bash:
$ my_array="c(0,1,2,3,4)"
$ Rscript my_script ${my_array}
Then use eval(parse(text = ...)) in the R script to turn the argument into a vector in the R environment.
args <- commandArgs(trailingOnly=TRUE)
print(args)
# [1] "c(0,1,2,3,4)"
args <- eval(parse(text=args[1]))
print(args)
# [1] 0 1 2 3 4
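Note that eval(parse()) will execute whatever string is passed on the command line, so a defensive alternative (my own suggestion, not from the original answer) is to pass a plain comma-separated list and split it instead:

```r
# instead of passing "c(0,1,2,3,4)", pass "0,1,2,3,4" on the command line;
# the literal below stands in for commandArgs(trailingOnly = TRUE)[1]
arg <- "0,1,2,3,4"

# split on commas and convert to numeric -- no code is evaluated
values <- as.numeric(strsplit(arg, ",")[[1]])
print(values)  # 0 1 2 3 4
```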
2 - "my_array" can be an array in bash:
$ my_array=(0 1 2 3 4)
$ Rscript my_script ${my_array[*]}
and in R, args is already your array, but of type character:
args <- commandArgs(trailingOnly=TRUE)
print(args)
# [1] "0" "1" "2" "3" "4"
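Since commandArgs() always returns character strings, you will usually want to convert them before computing. A minimal sketch (the literal vector stands in for the real commandArgs() result):

```r
# stand-in for args <- commandArgs(trailingOnly = TRUE)
args <- c("0", "1", "2", "3", "4")

# convert character arguments to numbers before doing arithmetic
my_array <- as.numeric(args)
print(sum(my_array))  # 10
```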
How to call an R script from the command line with multiple argument types (inc. list)
@MrFlick deserves credit for this answer. The issue was I was not accounting for a situation where the number of arguments would be greater than 7 (duh).
A very quick fix:
if (length(args) < 7)
{
stop('At least seven arguments must be supplied.', call.=FALSE)
}
if (length(args)==7)
{
project = args[1]
method = args[2]
lib_path = args[3]
storage = args[4]
compo = args[5]
res_file = args[6]
integrated_object = args[7]
}
if (length(args)>7)
{
project = args[1]
method = args[2]
lib_path = args[3]
storage = args[4]
compo = args[5]
res_file = args[6]
integrated_object = args[7:length(args)]
}
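Since args[7:length(args)] already yields a single element when length(args) is exactly 7, the two branches above can be collapsed without changing behaviour. A sketch (the sample argument values are made up):

```r
# stand-in for commandArgs(trailingOnly = TRUE): 8 arguments
args <- c("proj", "m1", "/lib", "/store", "compoA", "res.txt", "obj1", "obj2")

if (length(args) < 7) {
  stop("At least seven arguments must be supplied.", call. = FALSE)
}

project           <- args[1]
method            <- args[2]
lib_path          <- args[3]
storage           <- args[4]
compo             <- args[5]
res_file          <- args[6]
# everything from position 7 onward, whether one element or many
integrated_object <- args[7:length(args)]
print(integrated_object)  # "obj1" "obj2"
```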
Thank you for your eyes @MrFlick
Slurm job array fails to run Rscript with shapefiles
3 problems identified and now solved:
Max array size refers to the entire array; the throttle just sets how many jobs get scheduled at one time. So I needed to break my 3,086-job task into 4 separate batches. This can be done in the .sh file as #SBATCH -a 1-999 for job 1, #SBATCH -a 1000-1999 for job 2, and so on.
The R script needs to catch the arguments from the command line. The script now begins:
args <- commandArgs(trailingOnly = TRUE)
shp_filename <- args[1]
lihtc_filename <- args[2]
The submission file was sending arguments with quotation marks, which was preventing paste0 from creating usable file names. Neither noquote() nor print(x, quote = FALSE) was able to remove these quotes. However, gsub('"', '', x) worked.
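For reference, this is how the quote-stripping behaves on an argument that arrives with embedded double quotes (the file name below is a made-up example, not from the original post):

```r
# simulate an argument that arrived wrapped in literal double quotes
shp_filename <- '"parcels_2020.shp"'

# gsub removes every double-quote character from the string
clean <- gsub('"', '', shp_filename)

# now paste0 produces a usable path
path <- paste0("data/", clean)
print(path)  # "data/parcels_2020.shp"
```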
An inelegant/lazy parallelization on my part, but it works. Case closed.
SGE array jobs and R
To boil down mithrado's answer to the bare essentials:
Create a job script, pop_gen.bash, that may or may not take the SGE task id argument as input, storing results in a specific file identified by the same SGE task id:
#!/bin/bash
Rscript pop_gen.R ${SGE_TASK_ID} > Results_${SGE_TASK_ID}.txt
Submit this script as a job array, e.g. 1000 jobs:
qsub -t 1-1000 pop_gen.bash
Grid Engine will execute pop_gen.bash 1000 times, each time setting SGE_TASK_ID to a value ranging from 1 to 1000.
Additionally, as mentioned above, by passing SGE_TASK_ID as a command-line argument to pop_gen.R, you can use it to build the output file name:
args <- commandArgs(trailingOnly = TRUE)
out.file <- paste("Results_", args[1], ".txt", sep="")
# d <- "some data frame"
write.table(d, file=out.file)
HTH