How to Run a Job Array in R Using the Rscript Command from the Command Line

How do you run a job array in R using the Rscript command from the command line?

This is how I would set it up on a cluster using the SLURM scheduler:

  1. SLURM sbatch job submission script

    #!/bin/bash

    #SBATCH --partition=xxx ### Partition (like a queue in PBS)
    #SBATCH --job-name=array_example ### Job Name
    #SBATCH -o jarray.%j.%N.out ### File in which to store job output/error
    #SBATCH --time=00-00:30:00 ### Wall clock time limit in Days-HH:MM:SS
    #SBATCH --nodes=1 ### Node count required for the job
    #SBATCH --ntasks=1 ### Number of tasks to be launched per Node
    #SBATCH --cpus-per-task=2 ### Number of threads per task (OMP threads)
    #SBATCH --mail-type=FAIL ### When to send mail
    #SBATCH --mail-user=xxx@gmail.com
    #SBATCH --get-user-env ### Import your user environment setup
    #SBATCH --requeue ### On failure, requeue for another try
    #SBATCH --verbose ### Increase informational messages
    #SBATCH --array=1-500%50 ### Array index range | %50: max number of simultaneously running tasks

    echo
    echo "****************************************************************************"
    echo "* *"
    echo "********************** sbatch script for array job *************************"
    echo "* *"
    echo "****************************************************************************"
    echo

    current_dir=${PWD##*/}
    echo "Current dir: $current_dir"
    echo
    pwd
    echo

    # First we ensure a clean running environment:
    module purge

    # Load R
    module load R/R-3.5.0

    ### Initialization
    # Get Array ID
    i=${SLURM_ARRAY_TASK_ID}

    # Output file
    outFile="output_parameter_${i}.txt"

    # Pass the array task ID (and output file name) to an R script
    Rscript --vanilla my_R_script.R ${i} ${outFile}

    echo
    echo '******************** FINISHED ***********************'
    echo
  2. my_R_script.R, which takes the arguments passed from the sbatch script

    args <- commandArgs(trailingOnly = TRUE)
    str(args)
    cat(args, sep = "\n")

    # test if there is at least one argument: if not, return an error
    if (length(args) == 0) {
      stop("At least one argument must be supplied (input file).\n", call. = FALSE)
    } else if (length(args) == 1) {
      # default output file
      args[2] <- "out.txt"
    }

    cat("\n")
    print("Hello World !!!")

    cat("\n")
    print(paste0("i = ", as.numeric(args[1])))
    print(paste0("outFile = ", args[2]))

    ### Parallel:
    # https://hpc.nih.gov/apps/R.html
    # https://github.com/tobigithub/R-parallel/blob/gh-pages/R/code-setups/Install-doSNOW-parallel-DeLuxe.R

    # load the doSNOW and parallel (for CPU info) libraries
    library(doSNOW)
    library(parallel)

    detectBatchCPUs <- function() {
      ncores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK"))
      if (is.na(ncores)) {
        ncores <- as.integer(Sys.getenv("SLURM_JOB_CPUS_PER_NODE"))
      }
      if (is.na(ncores)) {
        return(2) # default
      }
      return(ncores)
    }

    ncpus <- detectBatchCPUs()
    # or ncpus <- future::availableCores()
    cat(ncpus, " cores detected.")

    cluster = makeCluster(ncpus)

    # register the cluster
    registerDoSNOW(cluster)

    # get info
    getDoParWorkers(); getDoParName();

    ##### insert parallel computation here #####

    # stop cluster and remove clients
    stopCluster(cluster); print("Cluster stopped.")

    # insert serial backend, otherwise error in repetitive tasks
    registerDoSEQ()

    # clean up a bit.
    invisible(gc()); rm(ncpus, cluster)

    # END
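
To fill in the "insert parallel computation here" placeholder above, a minimal foreach sketch could look like the following. The simulated workload and the number of tasks (nSim) are illustrative assumptions, not part of the original script; %dopar% is assumed to be available from the doSNOW setup above once the cluster is registered.

    # Illustrative workload: nSim independent simulations spread over the workers
    nSim <- 10
    results <- foreach(j = seq_len(nSim), .combine = rbind) %dopar% {
      # each worker returns one row; replace with the real computation
      data.frame(task = j, value = mean(rnorm(1000)))
    }

    # write the combined results to the output file passed in by the sbatch script
    write.table(results, file = args[2], quote = FALSE, row.names = FALSE)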

P.S.: if you want to read a parameter file line by line, include the following lines in the sbatch script and then pass the parameters to my_R_script.R:

    ### Parameter file to read
    parameter_file="parameter_file.txt"
    echo "Parameter file: ${parameter_file}"
    echo

    # Read line #i from the parameter file
    PARAMETERS=$(sed "${i}q;d" ${parameter_file})
    echo "Parameters are: ${PARAMETERS}"
    echo
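
On the R side, a minimal sketch for receiving that line, assuming ${PARAMETERS} is passed to my_R_script.R as a single quoted argument (the whitespace splitting is illustrative):

    args <- commandArgs(trailingOnly = TRUE)

    # args[1] holds line #i of the parameter file; split it on whitespace
    params <- strsplit(args[1], "\\s+")[[1]]
    print(params)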

Refs:

  • http://tuxette.nathalievilla.org/?p=1696
  • https://hpc.nih.gov/apps/R.html
  • https://github.com/tobigithub/R-parallel/blob/gh-pages/R/code-setups/Install-doSNOW-parallel-DeLuxe.R

Parallelizing an Rscript using a job array in Slurm

You should avoid using "R CMD BATCH". It doesn't handle arguments the way most command-line tools do; "Rscript" has been the recommended option for a while now. By calling "R CMD BATCH" you are basically ignoring the "#!/usr/bin/env Rscript" part of your script.

So change your script file to

#!/bin/bash -l
#SBATCH --time=00:01:00
#SBATCH --array=1-10
conda activate R
cd ~/test
Rscript ~/Rscript_test.R $SLURM_ARRAY_TASK_ID

And then be careful in your script that you aren't using the same variable as both a string and a data.frame. You can't easily paste a data.frame into a file path, for example. So:

taskid <- commandArgs(trailingOnly=TRUE)
# taskid <- Sys.getenv('SLURM_ARRAY_TASK_ID') # This should also work

print(paste0("the number processed was... ", taskid))

outdata <- as.data.frame(taskid)
outfile <- paste0("~/test/", taskid, ".out")

write.table(outdata, outfile, quote=FALSE, row.names=FALSE, col.names=FALSE)
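
If the task ID is meant to index something, such as a row of a parameter grid, convert it to an integer first. A minimal sketch; the parameter grid here is an illustrative assumption:

taskid <- as.integer(commandArgs(trailingOnly = TRUE)[1])

# illustrative parameter grid; row `taskid` holds this task's settings
# (the --array range should match nrow(grid))
grid <- expand.grid(alpha = c(0.1, 0.5, 1.0), n = c(100, 1000))
params <- grid[taskid, ]
print(params)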

The extra files with just the array number were created because the usage of R CMD BATCH is

R CMD BATCH [options] infile [outfile]

So the $SLURM_ARRAY_TASK_ID value you were passing at the command line was treated as the outfile name. Instead, that value would have needed to be passed among the options. But again, it's better to use Rscript, which has more standard argument conventions.

How can I pass an array as an argument to an R script run from the command line?

First, I am not sure that you should use set to declare variables in the terminal, but you can read more about this on the man set page.

There are a few ways you can pass an array to an R script; it mainly depends on how you declare my_array in the terminal.


1 - "my_array" can be a string in bash:

$ my_array="c(0,1,2,3,4)"
$ Rscript my_script ${my_array}

Then use eval(parse(text = ...)) in the R script to turn the argument into a vector in the R environment:

args <- commandArgs(trailingOnly=TRUE)
print(args)
# [1] "c(0,1,2,3,4)"

args <- eval(parse(text=args[1]))
print(args)
# [1] 0 1 2 3 4

2 - "my_array" can be an array in bash:

$ my_array=(0,1,2,3,4)
$ Rscript my_script ${my_array[*]}

and in R the args is already your array, but of type character:

args <- commandArgs(trailingOnly=TRUE) 
print(args)
# [1] "0" "1" "2" "3" "4" "5"

How to call an R script from the command line with multiple argument types (incl. a list)

@MrFlick deserves credit for this answer. The issue was that I was not accounting for a situation where the number of arguments would be greater than 7 (duh).

A quick fix:

if (length(args) < 7) {
  stop('At least seven arguments must be supplied.', call.=FALSE)
}

if (length(args) == 7) {
  project = args[1]
  method = args[2]
  lib_path = args[3]
  storage = args[4]
  compo = args[5]
  res_file = args[6]
  integrated_object = args[7]
}

if (length(args) > 7) {
  project = args[1]
  method = args[2]
  lib_path = args[3]
  storage = args[4]
  compo = args[5]
  res_file = args[6]
  integrated_object = args[7:length(args)]
}
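
Since args[7:length(args)] is simply args[7] when exactly seven arguments are supplied, the two branches can be collapsed; a more compact sketch that behaves the same way:

if (length(args) < 7) {
  stop('At least seven arguments must be supplied.', call. = FALSE)
}

project           <- args[1]
method            <- args[2]
lib_path          <- args[3]
storage           <- args[4]
compo             <- args[5]
res_file          <- args[6]
integrated_object <- args[7:length(args)]  # a length-1 vector when exactly 7 args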

Thank you for your eyes @MrFlick

Slurm job array fails to run Rscript with shapefiles

3 problems identified and now solved:

  1. Max array size refers to the entire array; the throttle just sets how many jobs get scheduled at one time. So I needed to break my 3,086-job task into 4 separate batches. This can be done in the .sh file as #SBATCH -a 1-999 for batch 1, #SBATCH -a 1000-1999 for batch 2, and so on.

  2. The R script needs to catch the arguments from the command line. The script now begins:

    args = commandArgs(trailingOnly=TRUE)
    shp_filename <- args[1]
    lihtc_filename <- args[2]

  3. The submission file was sending arguments with quotation marks, which prevented paste0 from creating usable file names. Neither noquote() nor print(x, quotes = F) was able to remove these quotes; however, gsub('"', '', x) worked.

An inelegant/lazy parallelization on my part, but it works. Case closed.
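
For reference, a minimal sketch of the quote-stripping fix on the R side (the "data/" prefix is an illustrative assumption):

args <- commandArgs(trailingOnly = TRUE)

# strip literal double quotes that came through from the submission file
shp_filename   <- gsub('"', '', args[1])
lihtc_filename <- gsub('"', '', args[2])

# paste0 now builds a usable path
shp_path <- paste0("data/", shp_filename)
print(shp_path)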

SGE array jobs and R

To boil down mithrado's answer to the bare essentials:

Create a job script, pop_gen.bash, that may or may not take the SGE task ID as an argument, storing the results in a file identified by that same task ID:

#!/bin/bash
Rscript pop_gen.R ${SGE_TASK_ID} > Results_${SGE_TASK_ID}.txt

Submit this script as a job array, e.g. 1000 jobs:

qsub -t 1-1000 pop_gen.bash

Grid Engine will execute pop_gen.bash 1000 times, each time setting SGE_TASK_ID to a value ranging from 1 to 1000.

Additionally, as mentioned above, by passing SGE_TASK_ID as a command-line argument to pop_gen.R, you can use it to name the output file from within R:

args <- commandArgs(trailingOnly = TRUE)
out.file <- paste("Results_", args[1], ".txt", sep="")
# d <- "some data frame"
write.table(d, file=out.file)

HTH


