How to Execute 4 Shell Scripts in Parallel When You Can't Use GNU Parallel

The easiest way to do this is to background all four of the scripts. You could wrap these with another script "run_parallel.sh" that looks like this:

./dog.sh &
./bird.sh &
./cow.sh &
./fox.sh &

The ampersand backgrounds each invoked process without blocking, so all four scripts run at the same time.

As an example, here's a script called "one_two_three.sh":

echo 'One'
sleep 1
echo 'Two'
sleep 1
echo 'Three'
sleep 1
echo 'Done'

and a wrapper "wrapper.sh":

./one_two_three.sh &
./one_two_three.sh &
./one_two_three.sh &
./one_two_three.sh &
echo 'Four running at once!'

Running Multiple Bash Scripts in Parallel

You could use xargs:

echo "$HOME/path/test1.sh $1 $HOME/path/test2.sh $1" | xargs -P0 -n2 /bin/bash

-P0 says "run all in parallel"

-n2 passes two arguments to /bin/bash, in this case the script and the parameter
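A runnable sketch of the same idea, with echo commands standing in for real scripts (the command strings are placeholders, not anything from the original): -P4 caps the parallelism at four jobs, and -I{} hands each input line to its own bash invocation.

```shell
#!/bin/bash
# Each input line is a complete command; xargs runs up to four of them
# at once (-P4), substituting the line into bash -c via -I{}.
printf '%s\n' 'echo one' 'echo two' 'echo three' 'echo four' |
    xargs -P4 -I{} /bin/bash -c '{}'
```

Because the jobs run concurrently, the four lines may come out in any order.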

How do you run multiple programs in parallel from a bash script?

To run multiple programs in parallel:

prog1 &
prog2 &

If you need your script to wait for the programs to finish, you can add:

wait

at the point where you want the script to wait for them.
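A minimal self-contained sketch of this pattern, using two stand-in functions instead of real programs: capturing each background PID with $! lets the script wait on the jobs individually and collect their exit statuses.

```shell
#!/bin/bash
# prog1 and prog2 are stand-ins for the real programs.
prog1() { sleep 0.2; return 0; }
prog2() { sleep 0.1; return 3; }
prog1 & pid1=$!            # $! is the PID of the most recent background job
prog2 & pid2=$!
wait "$pid1"; echo "prog1 exited with $?"   # wait returns that job's status
wait "$pid2"; echo "prog2 exited with $?"
```

A bare `wait` with no arguments simply blocks until every background child has exited, but then you lose the individual exit statuses.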

GNU parallel and script not starting

GNU Parallel is not magic: You cannot tell it to parallelize any script.

Instead you need to tell it what to parallelize and how.

In general you need to think that you have to generate a list of commands that you want run in parallel and then give this list to GNU Parallel.

In the script you have 2 for loops and a pipe. All three can be parallelized by using GNU Parallel. It is, however, not certain that doing so will make sense: parallelizing has an overhead, and if the current implementation already utilizes the CPU and disk resources optimally, you will not see a speedup from parallelizing.

A for loop like this

for x in x-value1 x-value2 x-value3 ... x-valueN; do
    # do something to $x
done

is parallelized by:

myfunc() {
    x="$1"
    # do something to $x
}
export -f myfunc
parallel myfunc ::: x-value1 x-value2 x-value3 ... x-valueN

A pipe in the form of A | B | C where B is slow is parallelized by:

A | parallel --pipe B | C

So start by identifying the bottleneck.

For this, top is really useful: if you see a single process running at 100% in top, it is a good candidate for parallelizing.

If not, then you may be limited by how fast your disk is, and that can rarely be sped up by GNU Parallel.

You have not included test data, so I cannot run your script and identify the bottleneck for you. But I have experience with samtools and samtools view was always the bottleneck in my scripts. So let us assume that is also the case here.

samtools ... | awk ...

This does not fit the A | B | C template where B is slow, so we cannot use parallel --pipe to speed that up. If, however, awk is the bottleneck, then we can use parallel --pipe.

So let us instead look at the two for loops.

It is easy to parallelize the outer loop:

#!/bin/bash
files_chrM_ID="concat_chrM_*"

do_chrM() {
    ID_file="$1"
    bam_directory="../bam/"
    echo "$(date +%H:%M:%S) $ID_file is being treated"
    sample=${ID_file: -12}
    sample=${sample:0:8}
    echo "$(date +%H:%M:%S) $sample is being treated"
    for bam_file_target in "${bam_directory}"*"${sample}"*".bam"
    do
        echo "$bam_file_target // $sample"
        out_file=${ID_file:0:-4}_ON_${bam_file_target:8:-4}.sam
        echo "$out_file will be created"
        echo "samtools and awk starting"

        samtools view -@ 6 "$bam_file_target" |
            awk -v st="$ID_file" 'BEGIN {OFS="\t";ORS="\r\n"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "target"}}' >> "$out_file"
        echo "$out_file done."
    done
}
export -f do_chrM

parallel do_chrM ::: ${files_chrM_ID}

This is great if there are more ${files_chrM_ID} than there are CPU threads. But if that is not the case, we also need to parallelize the inner loop.

This is slightly trickier because we need to export a few variables to make them visible to do_bam which is called by parallel:

#!/bin/bash
files_chrM_ID="concat_chrM_*"

do_chrM() {
    ID_file="$1"
    bam_directory="../bam/"
    echo "$(date +%H:%M:%S) $ID_file is being treated"
    sample=${ID_file: -12}
    sample=${sample:0:8}
    # We need to export $sample and $ID_file to make them visible to do_bam()
    export sample
    export ID_file
    echo "$(date +%H:%M:%S) $sample is being treated"
    do_bam() {
        bam_file_target="$1"
        echo "$bam_file_target // $sample"
        out_file=${ID_file:0:-4}_ON_${bam_file_target:8:-4}.sam
        echo "$out_file will be created"
        echo "samtools and awk starting"

        samtools view -@ 6 "$bam_file_target" |
            awk -v st="$ID_file" 'BEGIN {OFS="\t";ORS="\r\n"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "target"}}' >> "$out_file"
        echo "$out_file done."
    }
    export -f do_bam
    parallel do_bam ::: "${bam_directory}"*"${sample}"*".bam"
}
export -f do_chrM

parallel do_chrM ::: ${files_chrM_ID}

This, however, may overload your server: The inner parallel does not communicate with the outer parallel so if you run this on a 64 core machine you risk running 64*64 jobs in parallel (but only if there are enough files matching concat_chrM_* and "${bam_directory}"*"${sample}"*".bam").

In that case it will make sense to limit the outer parallel to 1 or 2 jobs in parallel:

parallel -j2 do_chrM ::: ${files_chrM_ID}

This will at most run 2*64 jobs in parallel on a 64-core machine.

If, however, you want to run 64 jobs in parallel all the time then it becomes quite a bit trickier: It would have been fairly simple if the values of the inner loop did not depend on the outer loop, because then you could simply have done something like:

parallel do_stuff ::: chrM_1 ... chrM_100 ::: bam1.bam ... bam100.bam

which would generate all combinations of chrM_X,bamY.bam and run those in parallel - 64 at a time on a 64-core machine.

But in your case the values in the inner loop do depend on the values in the outer loop. This means you need to compute the values before starting any jobs. This also means you cannot have your script output information in the outer loop.

#!/bin/bash

sam_awk() {
    bam_file_target="$1"
    sample="$2"
    ID_file="$3"

    echo "$(date +%H:%M:%S) $ID_file is being treated"
    echo "$(date +%H:%M:%S) $sample is being treated"

    echo "$bam_file_target // $sample"
    out_file=${ID_file:0:-4}_ON_${bam_file_target:8:-4}.sam
    echo "$out_file will be created"
    echo "samtools and awk starting"

    samtools view -@ 6 "$bam_file_target" |
        awk -v st="$ID_file" 'BEGIN {OFS="\t";ORS="\r\n"; while (getline < st) {st_array[$1]=$2}} {if ($1 in st_array) {print $0, st_array[$1], "target"}}' >> "$out_file"
    echo "$out_file done."
}
export -f sam_awk

files_chrM_ID="concat_chrM_*"
bam_directory="../bam/"
for ID_file in ${files_chrM_ID}
do
    # Moved to inner
    # echo "$(date +%H:%M:%S) $ID_file is being treated"
    sample=${ID_file: -12}
    sample=${sample:0:8}
    # Moved to inner
    # echo "$(date +%H:%M:%S) $sample is being treated"
    for bam_file_target in "${bam_directory}"*"${sample}"*".bam"
    do
        echo "$bam_file_target"
        echo "$sample"
        echo "$ID_file"
    done
done | parallel -n3 sam_awk

Given that you have not given us any test data, I cannot test whether these scripts will actually run, so there may be errors in them.

If you have not already done so, read at least chapter 1+2 of "GNU Parallel 2018" (available at
http://www.lulu.com/shop/ole-tange/gnu-parallel-2018/paperback/product-23558902.html or
download it at: https://doi.org/10.5281/zenodo.1146014)

It should take you less than 20 minutes and your command line will love you for it.

How to use parallel execution in a shell script?

Convert this into a Makefile with proper dependencies. Then you can use make -j to have Make run everything possible in parallel.

Note that all the indents in a Makefile must be TABs. TAB shows Make where the commands to run are.

Also note that this Makefile is now using GNU Make extensions (the wildcard and subst functions).

It might look like this:

export PATH := .:${PATH}

FILES=$(wildcard file*)
RFILES=$(subst file,r,${FILES})

final: combine ${RFILES}
	combine ${RFILES} final
	rm ${RFILES}

ex: example.c

combine: combine.c

r%: file% ex
	ex $< $@
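To see make -j in action without the project above, here is a throwaway demo (the file names and targets are invented for illustration): a Makefile with two independent targets, which make -j2 is free to build at the same time.

```shell
#!/bin/bash
# Build a scratch Makefile whose two targets have no dependency on each
# other; note the recipe lines start with a TAB, written here as \t.
dir=$(mktemp -d) && cd "$dir"
printf 'all: a b\na:\n\ttouch a\nb:\n\ttouch b\n' > Makefile
make -j2 >/dev/null   # -j2 allows up to two recipes to run concurrently
ls a b                # both targets were built
```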

Use GNU parallel to parallelise a bash for loop

Replace echo $folders | parallel ... with echo "$folders" | parallel ....

Without the double quotes, the shell word-splits $folders and passes the pieces as separate arguments to echo, which prints them all on a single line joined by spaces. Since parallel treats each input line as the argument for one job, it then sees only one job. With the quotes, the newlines are preserved and each folder arrives on its own line.

To avoid such quoting issues altogether, it is always a good idea to pipe find to parallel directly, and use the null character as the delimiter:

find ... -print0 | parallel -0 ...

This will work even when encountering file names that contain multiple spaces or a newline character.
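The quoting difference itself can be seen without parallel at all. In this sketch (the folder names are made up), the unquoted expansion collapses the newline structure that parallel would rely on:

```shell
#!/bin/bash
# $'...' lets us embed a real newline between the two folder names.
folders=$'dir one\ndir two'
echo $folders | wc -l     # unquoted: everything on 1 line -> 1 job
echo "$folders" | wc -l   # quoted: 2 lines -> one job per folder
```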

How to run program in Bash script for as long as other program runs in parallel?

#!/bin/bash

execProgram(){
    case $1 in
        server)
            sleep 5 & # <-- change "sleep 5" to your server command.
                      # use "&" for background process
            SERVER_PID=$!
            echo "server started with pid $SERVER_PID"
            ;;
        client)
            sleep 18 & # <-- change "sleep 18" to your client command
                       # use "&" for background process
            CLIENT_PID=$!
            echo "client started with pid $CLIENT_PID"
            ;;
    esac
}

waitForServer(){
    echo "waiting for server"
    wait $SERVER_PID
    echo "server prog is done"
}

terminateClient(){
    echo "killing client pid $CLIENT_PID after 5 seconds"
    sleep 5
    kill -15 $CLIENT_PID >/dev/null 2>&1
    wait $CLIENT_PID >/dev/null 2>&1
    echo "client terminated"
}

execProgram server && execProgram client
waitForServer && terminateClient

How to send arguments to bash script in GNU parallel

bash_script.sh

parallel scp "$1" xxx@{}.com: ::: {1..5}

Usage:

bash bash_script.sh argument

