Synchronizing Four Shell Scripts to Run One After Another in Unix

You are experiencing a classic race condition. To solve it, you need a shared "lock" (or something similar) among your four scripts.

There are several ways to implement this. One way to do it in bash is to use the flock command together with an agreed-upon filename to use as the lock. The flock(1) man page has some usage examples that resemble this:

(
    flock -x 200    # acquire an exclusive lock on file descriptor 200

    # Do whatever check you want. You are guaranteed to be the only one
    # holding the lock.
    if [ -f "$paramfile" ]; then
        :   # do something
    fi
) 200>/tmp/lock-life-for-all-scripts
# The lock is automatically released when the above block is exited

You can also ask flock to fail right away if the lock can't be acquired (-n), or to give up after a timeout (-w), e.g. so you can print "still trying to acquire the lock" and retry.
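For example, here is a minimal sketch of both variants, reusing the lock file from above (the 10-second timeout is just illustrative):

(
    # -n: give up immediately if another script already holds the lock
    if ! flock -n 200; then
        echo "lock is busy, skipping the critical section" >&2
        exit 1    # exits only this subshell
    fi
    # ... critical section ...
) 200>/tmp/lock-life-for-all-scripts

(
    # -w 10: keep trying for up to 10 seconds before giving up
    if ! flock -w 10 200; then
        echo "still couldn't acquire the lock after 10 seconds" >&2
        exit 1
    fi
    # ... critical section ...
) 200>/tmp/lock-life-for-all-scripts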

Depending on your use case, you could also put the lock on the 'informatica' binary (be sure to use 200< in that case, to open the file for reading instead of (over)writing it).

What is a simple mechanism for synchronous Unix pooled processes?

There's definitely no need to write this tool yourself; there are several good choices.

make

make can do this pretty easily, but it relies extensively on files to drive the process. (If you want to run some operation on every input file that produces an output file, this might be awesome.) The -j command line option will run up to the specified number of tasks in parallel, and the -l load-average option will hold off starting new tasks while the system load average is above the given limit. (Which might be nice if you want to do some work "in the background". Don't forget about the nice(1) command, which can also help here.)

So, a quick (and untested) Makefile for image conversion:

ALL=$(patsubst cimg%.jpg,thumb_cimg%.jpg,$(wildcard *.jpg))

.PHONY: all
all: $(ALL)

# Pattern rule: build thumb_cimgNNN.jpg from cimgNNN.jpg
# (the recipe line must start with a tab)
thumb_cimg%.jpg: cimg%.jpg
	convert $< -resize 100x100 $@

If you run this with make, it'll run one-at-a-time. If you run with make -j8, it'll run eight separate jobs. If you run make -j, it'll start hundreds. (When compiling source code, I find that twice-the-number-of-cores is an excellent starting point. That gives each processor something to do while waiting for disk IO requests. Different machines and different loads might work differently.)
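To combine the options mentioned above in one invocation (the job count and load limit are just illustrative):

# Up to 8 parallel jobs, but hold off starting new ones while the system
# load average is above 4, and run the whole thing at reduced priority
nice make -j8 -l 4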

xargs

xargs provides the --max-procs command line option (-P for short). This is best if the parallel processes can be divided up based on a single input stream, with the inputs separated either by ASCII NUL bytes or by newlines. (Well, the -d option lets you pick something else, but these two are common and easy.) This gives you the benefit of using find(1)'s powerful file-selection syntax rather than writing funny expressions like the Makefile example above, or lets your input be completely unrelated to files. (Consider a program for factoring large composite numbers into primes -- making that task fit into make would be awkward at best; xargs could do it easily.)

The earlier example might look something like this:

find . -name '*.jpg' -print0 | xargs -0 --max-procs 16 -I {} convert {} -resize 100x100 thumb_{}

parallel

The moreutils package (available at least on Ubuntu) provides the parallel command (not to be confused with GNU parallel, which has a different syntax). It can run in two different ways: either running a specified command on different arguments, or running different commands in parallel. The previous example could look like this:

parallel -i -j 16 convert {} -resize 100x100 thumb_{} -- *.jpg

beanstalkd

The beanstalkd program takes a completely different approach: it provides a message bus to which you submit requests, while job servers block waiting for jobs to be entered, execute them, and then go back to waiting for a new job on the queue. If you want to write data back to the specific HTTP request that initiated the job, this might not be very convenient, since you have to provide that mechanism yourself (perhaps a different 'tube' on the beanstalkd server). But if the end result is submitting data into a database, or sending email, or something similarly asynchronous, this might be the easiest to integrate into your existing application.

Pass all variables from one shell script to another?

You have basically two options:

  1. Make the variable an environment variable (export TESTVARIABLE) before executing the 2nd script.
  2. Source the 2nd script, i.e. . test2.sh, and it will run in the same shell. This lets you share more complex variables like arrays easily, but it also means the other script can modify variables in the sourcing shell.

UPDATE:

To use export to set an environment variable, you can either use an existing variable:

A=10
# ...
export A

This ought to work in both bash and sh. bash also allows it to be combined like so:

export A=10

This also works in my sh (which happens to be bash, you can use echo $SHELL to check). But I don't believe that that's guaranteed to work in all sh, so best to play it safe and separate them.

Any variable you export in this way will be visible in scripts you execute, for example:

a.sh:

#!/bin/sh

MESSAGE="hello"
export MESSAGE
./b.sh

b.sh:

#!/bin/sh

echo "The message is: $MESSAGE"

Then:

$ ./a.sh
The message is: hello

The fact that these are both shell scripts is also just incidental. Environment variables can be passed to any process you execute; for example, if we used Python instead it might look like this:

a.sh:

#!/bin/sh

MESSAGE="hello"
export MESSAGE
./b.py

b.py:

#!/usr/bin/env python3

import os

print('The message is:', os.environ['MESSAGE'])

Sourcing:

Instead we could source like this:

a.sh:

#!/bin/sh

MESSAGE="hello"

. ./b.sh

b.sh:

#!/bin/sh

echo "The message is: $MESSAGE"

Then:

$ ./a.sh
The message is: hello

This more or less "imports" the contents of b.sh directly and executes it in the same shell. Notice that we didn't have to export the variable to access it. This implicitly shares all the variables you have, and also allows the other script to add, delete, or modify variables in your shell. Of course, in this model both your scripts should be in the same language (sh or bash). To give an example of how we could pass messages back and forth:

a.sh:

#!/bin/sh

MESSAGE="hello"

. ./b.sh

echo "[A] The message is: $MESSAGE"

b.sh:

#!/bin/sh

echo "[B] The message is: $MESSAGE"

MESSAGE="goodbye"

Then:

$ ./a.sh
[B] The message is: hello
[A] The message is: goodbye

This works equally well in bash. It also makes it easy to share more complex data which you could not express as an environment variable (at least without some heavy lifting on your part), like arrays or associative arrays.

How best to include other scripts?

I tend to keep my scripts' locations relative to one another, so that I can use dirname:

#!/bin/sh

my_dir="$(dirname "$0")"

"$my_dir/other_script.sh"

Linux concurrency scripting with mutexes

I'm not a PHP programmer, but the documentation says it provides a portable version of flock that you can use. The first example snippet looks pretty close to what you want. Try this:

<?php

$fp = fopen("/tmp/lock.txt", "r+");

if (flock($fp, LOCK_EX)) {  // acquire an exclusive lock

    // Do your critical section here, while you hold the lock

    flock($fp, LOCK_UN);    // release the lock
} else {
    echo "Couldn't get the lock!";
}

fclose($fp);

?>

Note that by default flock waits until it can acquire the lock. You can use LOCK_EX | LOCK_NB if you want it to exit immediately in the case where another copy of the program is already running.

Using the name "/tmp/lock.txt" may be a security hole (I don't want to think hard enough to decide whether it truly is), so you should probably choose a directory that can only be written to by your program.

Synchronizing Current Directory Between Two Zsh Sessions

Caveat emptor: I'm doing this on Ubuntu 10.04 with gnome-terminal, but it should work on any *NIX platform running zsh.

I've also changed things slightly. Instead of mixing "pwd" and "cwd", I've stuck with "pwd" everywhere.

Recording the Present Working Directory

If you want to run a function every time you cd, the preferred way is to use the chpwd function or the more extensible chpwd_functions array. I prefer chpwd_functions since you can dynamically append and remove functions from it.

# Records $PWD to file
function +record_pwd {
    echo "$(pwd)" > ~/.pwd
}

# Removes the PWD record file
function +clean_up_pwd_record {
    rm -f ~/.pwd
}

# Adds +record_pwd to the list of functions executed when "cd" is called
# and records the present directory
function start_recording_pwd {
    if [[ -z $chpwd_functions[(r)+record_pwd] ]]; then
        chpwd_functions=(${chpwd_functions[@]} "+record_pwd")
    fi
    +record_pwd
}

# Removes +record_pwd from the list of functions executed when "cd" is called
# and cleans up the record file
function stop_recording_pwd {
    if [[ -n $chpwd_functions[(r)+record_pwd] ]]; then
        chpwd_functions=("${(@)chpwd_functions:#+record_pwd}")
        +clean_up_pwd_record
    fi
}

Adding a + to the +record_pwd and +clean_up_pwd_record function names is a hack-ish way to hide them from normal use (similarly, the vcs_info hooks do this by prefixing everything with +vi).

With the above, you would simply call start_recording_pwd to start recording the present working directory every time you change directories. Likewise, you can call stop_recording_pwd to disable that behavior. stop_recording_pwd also removes the ~/.pwd file (just to keep things clean).

By doing things this way, synchronization can easily be made opt-in (since you may not want this for every single zsh session you run).

First Attempt: Using the preexec Hook

Similar to the suggestion of @Celada, the preexec hook gets run before executing a command. This seemed like an easy way to get the functionality you want:

autoload -Uz add-zsh-hook

function my_preexec_hook {
    if [[ -r ~/.pwd ]] && [[ $(pwd) != $(cat ~/.pwd) ]]; then
        cd "$(cat ~/.pwd)"
    fi
}

add-zsh-hook preexec my_preexec_hook

This works... sort of. Since the preexec hook runs before each command, it will automatically change directories before running your next command. However, up until then, the prompt stays in the last working directory, so it tab completes for the last directory, etc. (By the way, a blank line doesn't count as a command.) So, it sort of works, but it's not intuitive.

Second Attempt: Using signals and traps

In order to get a terminal to automatically cd and re-print the prompt, things got a lot more complicated.

After some searching, I found out that $$ (the shell's process ID) does not change in subshells. Thus, a subshell (or background job) can easily send signals to its parent. Combine this with the fact that zsh allows you to trap signals, and you have a means of polling ~/.pwd periodically:

# Used to make sure USR1 signals are not taken as synchronization signals
# unless the terminal has been told to do so
local _FOLLOWING_PWD

# Traps all USR1 signals
TRAPUSR1() {
    # If following the .pwd file and we need to change
    if (($+_FOLLOWING_PWD)) && [[ -r ~/.pwd ]] && [[ "$(pwd)" != "$(cat ~/.pwd)" ]]; then
        # Change directories and redisplay the prompt
        # (Still don't fully understand this magic combination of commands)
        [[ -o zle ]] && zle -R && cd "$(cat ~/.pwd)" && precmd && zle reset-prompt 2>/dev/null
    fi
}

# Sends the shell a USR1 signal every second
function +check_recorded_pwd_loop {
    while true; do
        kill -s USR1 "$$" 2>/dev/null
        sleep 1
    done
}

# PID of the disowned +check_recorded_pwd_loop job
local _POLLING_LOOP_PID

function start_following_recorded_pwd {
    _FOLLOWING_PWD=1
    [[ -n "$_POLLING_LOOP_PID" ]] && return

    # Launch the signalling loop as a disowned process
    +check_recorded_pwd_loop &!
    # Record the signalling loop's PID
    _POLLING_LOOP_PID="$!"
}

function stop_following_recorded_pwd {
    unset _FOLLOWING_PWD
    [[ -z "$_POLLING_LOOP_PID" ]] && return

    # Kill the background loop
    kill "$_POLLING_LOOP_PID" 2>/dev/null
    unset _POLLING_LOOP_PID
}

If you call start_following_recorded_pwd, this launches +check_recorded_pwd_loop as a disowned background process. This way, you won't get an annoying "suspended jobs" warning when you go to close your shell. The PID of the loop is recorded (via $!) so it can be stopped later.

The loop just sends the parent shell a USR1 signal every second. This signal gets trapped by TRAPUSR1(), which will cd and reprint the prompt if necessary. I don't fully understand why both zle -R and zle reset-prompt are needed, but that was the magic combination that worked for me.

There is also the _FOLLOWING_PWD flag. Since every terminal will have the TRAPUSR1 function defined, this prevents them from handling that signal (and changing directories) unless you actually specified that behavior.

As with recording the present working directory, you can call stop_following_recorded_pwd to stop the whole auto-cd thing.

Putting both halves together:

function begin_synchronize {
    start_recording_pwd
    start_following_recorded_pwd
}

function end_synchronize {
    stop_recording_pwd
    stop_following_recorded_pwd
}

Finally, you will probably want to do this:

trap 'end_synchronize' EXIT

This will automatically clean up everything just before your terminal exits, thus preventing you from accidentally leaving orphaned signalling loops around.
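As a usage sketch, assuming the functions above are loaded in both shells (e.g. from ~/.zshrc) and that the path is just an example:

# In terminal A:
begin_synchronize

# In terminal B:
begin_synchronize

# Now a cd in either terminal is mirrored in the other within a second:
cd ~/projects/foo

# In either terminal, when you are done:
end_synchronize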

Multithreading in Bash

Sure, just add & after the command:

read_cfg cfgA &
read_cfg cfgB &
read_cfg cfgC &
wait

all those jobs will then run in the background simultaneously. The optional wait command will then wait for all the jobs to finish.

Each command will run in a separate process, so it's technically not "multithreading", but I believe it solves your problem.
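If you also care about whether each job succeeded, a minimal sketch (assuming read_cfg returns a non-zero exit status on failure) is to wait on each PID individually:

read_cfg cfgA & pid_a=$!
read_cfg cfgB & pid_b=$!
read_cfg cfgC & pid_c=$!

failed=0
for pid in "$pid_a" "$pid_b" "$pid_c"; do
    wait "$pid" || failed=1    # wait PID returns that job's exit status
done

if [ "$failed" -ne 0 ]; then
    echo "at least one read_cfg job failed" >&2
fi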

Limiting the number of simultaneous instances of a program executed within a Perl script (to 1)

I'll assume for a moment that your script is the only thing running rclone. If you wanted only 1 copy running, you would just use a lockfile.

For N instances (for small N), I would just have N lockfiles: have the program try each lock in turn, in a loop, and if all the locks are already held, pause and retry a second later. Once it has a lock, run rclone, then release the lock when it is done.

A sounder approach would be to use SysV semaphores but, unless you want a large N, really care about response times, or are worried about fairness between callers, it is not likely to be worth the time learning them.

If your script is not the only program calling rclone, then you would need to intercept all calls: instead of putting this code in your program, you could replace rclone with a wrapper that implements the parallelism constraint as above and then calls the real program, as sketched below.
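A minimal sketch of such a wrapper in shell (rather than Perl), assuming util-linux flock is available, N=4 allowed instances, and that the real binary has been renamed to rclone.real -- the lock directory and names are just illustrative:

#!/bin/sh
# rclone wrapper: allow at most 4 concurrent instances via 4 lock files

LOCKDIR=/var/lock/rclone
mkdir -p "$LOCKDIR"

while true; do
    for i in 1 2 3 4; do
        exec 9>"$LOCKDIR/slot$i.lock"
        if flock -n 9; then
            # Got a slot; run the real program while holding the lock
            rclone.real "$@"
            status=$?
            exec 9>&-          # closing the descriptor releases the lock
            exit "$status"
        fi
        exec 9>&-
    done
    sleep 1                    # all slots busy; try again in a second
done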


