Parallel Processes: Appending Outputs to an Array in a Bash Script


GNU Parallel is good at doing stuff in parallel :-)

task (){ sleep 1;echo "hello $1"; }

# Make "task" known to sub shells
export -f task

# Do tasks in parallel
parallel -k task ::: {1..3}

Sample Output

hello 1
hello 2
hello 3

My first suggestion was the following, but Charles kindly points out that it is a known bash pitfall, because the unquoted output is word-split and glob-expanded:

array=( $(parallel -k task ::: {1..3}) )

Charles' suggested solution is:

IFS=$'\n' read -r -d '' -a array < <(parallel -k task ::: 1 2 3 && printf '\0')
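To make the pitfall concrete, here is a minimal sketch that compares the unquoted form with mapfile (bash 4.0 or newer), which is another common way to read one element per line. The produce function is a hypothetical stand-in for the parallel call, so the snippet runs without GNU Parallel installed.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for "parallel -k task ::: 1 2 3": prints three lines,
# one of which contains a glob character.
produce() {
  printf '%s\n' 'hello 1' 'hello *' 'hello 3'
}

# Unquoted command substitution: split on whitespace, then glob-expanded,
# so the element count depends on the files in the current directory.
unsafe=( $(produce) )

# mapfile reads one array element per line and performs no globbing.
mapfile -t safe < <(produce)

printf 'unsafe has %d elements\n' "${#unsafe[@]}"
printf 'safe has %d elements\n' "${#safe[@]}"
```

Unlike the read -d '' form above, mapfile does not signal a failing producer through its exit status, so pick whichever fits your error handling.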

Running each element in an array in parallel in a bash script

The convenient thing to do is to push your background code into a separate script -- or an exported function. That way xargs can create a new shell, and access the function from its parent. (Be sure to export any other variables that need to be available in the child as well).

array=( 1 2 3 4 5 6 )
max_proc_count=8
log_file=out.txt

run_for_each() {
  local each=$1
  echo "Processing: $each" >&2
  IFS=$' \t\n' read -r -d '' -a lags < <(yourcommand --arg1 "$each" && printf '\0')
  for result in "${lags[@]}"; do
    printf '%(%Y-%m-%dT%H:%M:%S)T\t%s\t%s\n' -1 "$each" "$result"
  done >>"$log_file"
}

export -f run_for_each
export log_file # make log_file visible to subprocesses

printf '%s\0' "${array[@]}" |
  xargs -P "$max_proc_count" -n 1 -0 bash -c 'run_for_each "$@"' _ # "_" fills $0 so each item arrives as $1

Some notes:

  • Using echo -e is bad form. See the APPLICATION USAGE and RATIONALE sections in the POSIX spec for echo, which explicitly advise using printf instead (the spec does not define an -e option, and explicitly states that echo must not accept any options other than -n).
  • We're including the each value in the log file so it can be extracted from there later.
  • You haven't specified whether the output of yourcommand is space-delimited, tab-delimited, line-delimited, or otherwise. I'm thus accepting all these for now; modify the value of IFS passed to the read to taste.
  • printf '%(...)T' to get a timestamp without external tools such as date requires bash 4.2 or newer. Replace with your own code if you see fit.
  • read -r -a arrayname < <(...) is much more robust than arrayname=( $(...) ). In particular, it avoids treating emitted values as globs -- replacing *s with a list of files in the current directory, or Foo[Bar] with FooB should any file by that name exist (or, if the failglob or nullglob options are set, triggering a failure or emitting no value at all in that case).
  • Redirecting stdout to your log_file once for the entire loop is somewhat more efficient than redirecting it every time you want to run printf once. Note that having multiple processes writing to the same file at the same time is only safe if all of them opened it with O_APPEND (which >> will do), and if they're writing in chunks small enough to individually complete as single syscalls (which is probably happening unless the individual lags values are quite large).
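If you then want the results back in an array in the parent shell, one option is to read the log file once all workers have finished. The following is only a sketch under assumptions: the worker body is a hypothetical placeholder for yourcommand, the log is a temporary file, and the timestamp is left out for brevity.

```shell
#!/usr/bin/env bash
log_file=$(mktemp)   # temporary log file for this demo
export log_file

# Hypothetical worker standing in for "yourcommand": two results per input.
run_for_each() {
  local each=$1 result
  for result in "in-$each" "out-$each"; do
    printf '%s\t%s\n' "$each" "$result"
  done >>"$log_file"
}
export -f run_for_each

items=( 1 2 3 4 )
# The trailing "_" becomes $0 of the child shell, so each item arrives as "$1".
printf '%s\0' "${items[@]}" |
  xargs -0 -n 1 -P 4 bash -c 'run_for_each "$@"' _

# xargs returns only after all workers exit, so the log is complete here.
mapfile -t results < "$log_file"
rm -f -- "$log_file"
printf 'collected %d lines\n' "${#results[@]}"
```

The order of the lines in results depends on scheduling, which is why each line carries its input value.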

Add data to Bash array over multiple scripts

Arrays are not shared between different shells. Each script runs as a separate process and builds its own private arrays, which are lost when the process exits. @Upasana Shukla's suggestion of running the scripts with source will work (because it runs them in the main shell process, rather than as subshells or different processes), but will not allow you to run the scripts in parallel. If you want to run them in parallel, the simplest way is probably to have them write to temporary files instead of arrays:

tmpdir=$(mktemp -d "/tmp/$(basename "$0").XXXXXX") || {
  echo "Error creating temporary directory" >&2
  exit 1
}
export tmpdir # exported separately so that export does not mask mktemp's exit status

for z in scripts/*; do # Please don't parse ls
  sh "$z" &
done
wait

echo "Validating Script Output"
cat "$tmpdir/exeSuccess"
rm -R "$tmpdir"

And in the individual scripts:

echo "$OUTPUT" >>"$tmpdir/exeSuccess"
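If the parent needs the results as an array afterwards, a variation is to give each job its own file and collect them after wait. This is a minimal sketch, not the code from the question: the job body and the job.N file names are made up for illustration.

```shell
#!/usr/bin/env bash
tmpdir=$(mktemp -d) || exit 1

# Hypothetical jobs: each background subshell writes to its own file,
# so no two writers ever share a file.
for i in 1 2 3; do
  ( echo "result $i" >"$tmpdir/job.$i" ) &
done
wait   # block until every background job has exited

# Collect in the parent; glob expansion is sorted, so the order is stable.
results=()
for f in "$tmpdir"/job.*; do
  results+=( "$(<"$f")" )
done
rm -rf -- "$tmpdir"

printf '%s\n' "${results[@]}"
```

One file per job also sidesteps any concern about concurrent appends to a shared file.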

How do you run multiple programs in parallel from a bash script?

To run multiple programs in parallel:

prog1 &
prog2 &

If you need your script to wait for the programs to finish, you can add:

wait

at the point where you want the script to wait for them.

Collecting process ids of parallel process in bash file

Don't send the append operation itself to the background. Putting an & after the content you want to background, but before the append, suffices: the sleep and echo are still backgrounded, but the append runs in the current shell.

process_ids=( )
append() { process_ids+=( "$1" ); } # POSIX-standard function declaration syntax

{ sleep 1 && echo 'one'; } & append "$!"
{ sleep 5 && echo 'two'; } & append "$!"
{ sleep 1 && echo 'three'; } & append "$!"
{ sleep 5 && echo 'four'; } & append "$!"

echo "Background processes:" # Demonstrate that our array was populated
printf ' - %s\n' "${process_ids[@]}"

wait
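Once the PIDs are in an array, you can also wait on each one individually to learn its exit status. A small sketch, using subshells with fixed exit codes as hypothetical stand-ins for real work:

```shell
#!/usr/bin/env bash
pids=()
( exit 0 ) & pids+=( "$!" )
( exit 3 ) & pids+=( "$!" )
( exit 0 ) & pids+=( "$!" )

statuses=()
for pid in "${pids[@]}"; do
  if wait "$pid"; then      # wait PID returns that job's exit status
    statuses+=( 0 )
  else
    statuses+=( "$?" )      # non-zero status of this specific job
  fi
done

printf 'exit statuses: %s\n' "${statuses[*]}"
```

Because statuses is filled in the same order as pids, index N in one array corresponds to index N in the other.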

How do I assign the output of a command into an array?

To assign the output of a command to an array, you need to use a command substitution inside an array assignment. For a generic command, this looks like:

arr=( $(command) )

In the example of the OP, this would read:

arr=($(grep -n "search term" file.txt | sed 's/:.*//'))

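The inner $() runs the command, while the outer () turns the output into an array. The problem is that the output is split on any whitespace, so a line containing spaces ends up as several elements. To handle this, you can set IFS to a newline so that only line breaks separate elements. Note that an IFS assignment written this way stays in effect for the rest of the script, and that the output is still subject to glob expansion.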

IFS=$'\n' arr=($(grep -n "search term" file.txt | sed 's/:.*//'))

You can also cut out the need for sed by performing an expansion on each element of the array:

IFS=$'\n' arr=($(grep -n "search term" file.txt)) # split on newlines only, since matching lines may contain spaces
arr=("${arr[@]%%:*}")
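For comparison, here is a sketch of the same idea with mapfile (bash 4.0 or newer), which neither changes IFS nor glob-expands the output. The sample file contents are invented for the demonstration.

```shell
#!/usr/bin/env bash
file=$(mktemp)
printf '%s\n' 'alpha' 'search term here' 'beta' 'another search term' >"$file"

# One array element per matching line, for example "2:search term here".
mapfile -t arr < <(grep -n 'search term' "$file")

# Strip everything from the first colon onward, leaving only line numbers.
arr=( "${arr[@]%%:*}" )
rm -f -- "$file"

printf '%s\n' "${arr[@]}"
```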

