How to Use the Parallel Command to Exploit Multi-Core Parallelism on My MacBook

How can I use the parallel command to exploit multi-core parallelism on my MacBook?

Parallel processing makes sense when your work is CPU bound (the CPU does the work, and the peripherals are mostly idle), but here you are trying to improve the performance of a task which is I/O bound (the CPU is mostly idle, waiting for a busy peripheral). In this situation, adding parallelism only adds congestion, as multiple tasks end up fighting over the already-starved I/O bandwidth.

On macOS, the system already indexes all your data anyway (including the contents of word-processing documents, PDFs, email messages, etc); there's a friendly magnifying glass on the menu bar at the upper right where you can access a much faster and more versatile search, called Spotlight. (Though I agree that some of the more sophisticated controls of find are missing; and the "user friendly" design gets in the way for me when it guesses what I want, and guesses wrong.)
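
If you prefer the command line, the same Spotlight index can be queried with mdfind. A minimal sketch (fnord and ~/Music are just illustrative):

mdfind -name fnord
mdfind -onlyin ~/Music -name ogg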

Some Linux distros offer a similar facility; I would expect that to be the norm for anything with a GUI these days, though the details will differ between systems.

A more traditional solution on any Unix-like system is the locate command, which performs a similar but more limited task; it will create a (very snappy) index on file names, so you can say

locate fnord

to very quickly obtain every file whose name matches fnord. The index is simply a copy of the results of a find run from last night (or however you schedule the backend to run). The command is already installed on macOS, though you have to enable the back end if you want to use it. (Just run locate locate to get further instructions.)
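
On recent macOS versions those instructions boil down to enabling the bundled launchd job; it is typically something along these lines, though you should trust whatever locate itself prints on your system:

sudo launchctl load -w /System/Library/LaunchDaemons/com.apple.locate.plist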

You could build something similar yourself if you find yourself often looking for files with a particular set of permissions and a particular owner, for example (these are not features which locate records); just run a nightly (or hourly etc) find which collects these features into a database -- or even just a text file -- which you can then search nearly instantly.
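
As a minimal sketch, assuming GNU find (gfind from Homebrew's findutils on macOS, since BSD find lacks -printf) and an index file location of your own choosing:

# refresh the index; schedule this via cron or launchd, e.g. nightly
gfind "$HOME" -type f -printf '%m %u %p\n' 2>/dev/null > "$HOME/.fileindex"

# later: near-instant lookups, e.g. files owned by alice with mode 600
grep '^600 alice ' "$HOME/.fileindex"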

For running jobs in parallel, you don't really need GNU parallel, though it does offer a number of conveniences and enhancements for many use cases; you already have xargs -P. (The xargs on macOS which originates from BSD is more limited than GNU xargs which is what you'll find on many Linuxes; but it does have the -P option.)

For example, here's how to run eight parallel find instances with xargs -P:

printf '%s\n' */ | xargs -I {} -P 8 find {} -name '*.ogg'

(This assumes the wildcard doesn't match directories which contain single quotes or newlines or other shenanigans; GNU xargs has the -0 option to fix a large number of corner cases like that; then you'd use '%s\0' as the format string for printf.)
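
The NUL-delimited variant would then look something like this (assuming your xargs accepts -0 together with -I and -P, as GNU xargs does; the xargs shipped with recent macOS accepts -0 as well):

printf '%s\0' */ | xargs -0 -I {} -P 8 find {} -name '*.ogg'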


As the parallel documentation readily explains, its general syntax is

parallel -options command ...

where {} will be replaced with the current input line (if it is missing, it is implicitly added at the end of command ...), and the (obviously optional) ::: special token lets you specify an input source on the command line instead of on standard input.
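
To illustrate the implicit {}: these two invocations are equivalent (the .log files are only an example):

parallel gzip ::: *.log
parallel gzip {} ::: *.log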

Anything outside of those special tokens is passed on verbatim, so you can add find options to your heart's content just by specifying them literally.

parallel -j8 find {} -type f -name '*.ogg' ::: */

I don't speak zsh but refactored for regular POSIX sh your function could be something like

ff () {
    parallel -j8 find {} -type f -iname "$2" ::: "$1"
}

though I would perhaps switch the arguments so you can specify a name pattern and a list of files to search, à la grep.

ff () {
    # "local" is not POSIX but works in many sh versions
    local pat=$1
    shift
    parallel -j8 find {} -type f -iname "$pat" ::: "$@"
}
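
Usage would then look something like this (the pattern and directories are purely illustrative):

ff '*.ogg' ~/Music /Volumes/Media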

But again, spinning your disk to find things which are already indexed is probably something you should stop doing, rather than facilitate.

Using the GNU parallel command with gfind to reduce the runtime of the gupdatedb tool

You don't need ::: if there's nothing after it, and {} is pointless too if you don't have any sources. Without more information about what exactly you would want to parallelize, we can't really tell you what you should use instead.

But for example, if you want to run one find in each of /etc, /usr, /bin, and /opt, that would look like

parallel find {} -options ::: /etc /usr /bin /opt

This could equivalently be expressed without the ::: token:

printf '%s\n' /etc /usr /bin /opt |
parallel find {} -options

So the purpose of ::: is basically to say "I want to specify the things to parallelize over on the command line instead of receiving them on standard input"; but if you don't provide this information one way or the other, parallel doesn't know what to replace {} with.

I'm not saying this particular use makes sense for your use case, just hopefully clarifying the documentation (again).

GNU parallel not working at all

As I was about to complete writing this question, I ran parallel --version to report the version, only to find:

WARNING: YOU ARE USING --tollef. IF THINGS ARE ACTING WEIRD USE --gnu.

It is not clear to me why that flag is set by default. Needless to say, using --gnu worked!

Thought I would post this to save someone hours of frustration and confusion.

EDIT:
To fix this permanently (in Ubuntu at least), delete the --tollef flag in /etc/parallel/config
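
On such systems, a one-liner along these lines would do it (assuming GNU sed and that your config lives at that path):

sudo sed -i '/--tollef/d' /etc/parallel/config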

Modifying gupdatedb (the GNU updatedb command) to insert the parallel command

Updated Answer

The problem is on the line after the line containing A2 in the file /usr/local/Cellar/findutils/4.7.0/libexec/bin/gupdatedb. Currently, it is of the form:

# : A2
$find $SEARCHPATHS $FINDOPTIONS \( $prunefs_exp -type d -regex "$PRUNEREGEX" \) -prune -o $print_option

whereas you want it to be of the form:

# : A2
parallel -j 32 --lb gfind {} $FINDOPTIONS ... ::: BUNCH_OF_PATHS

As you haven't given the paths you wish to search in parallel, the paths at the moment are just /, which means nothing can be done in parallel. You will need to run with --localpaths set to a bunch of places that are worth searching in parallel, or hack the script even more extensively. Though, to be honest, I am not sure why you would want to speed this up, because it should only be run relatively rarely, and then only at times when the system is quiet.
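
For example, an invocation along these lines would give the hacked script several independent roots to fan out over (the paths and the output location are purely illustrative):

gupdatedb --localpaths='/Applications /Users /usr/local /opt' --output="$HOME/locatedb"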

Original Answer

Go to around line 250 of the file /usr/local/Cellar/findutils/4.7.0/libexec/bin/gupdatedb and comment out the checkbinary call with a hash sign so it looks like this:

for binary in $find $frcode
do
    #checkbinary $binary
done

Are you concerned about multicore?

Are your programs typically CPU bound?

If not, forget it. Multicore doesn't concern you; it gives your users a smoother experience without making any demands on you at all.

Cool, eh?

If you are CPU bound, and your problem is parallelizable, you might be able to leverage the multiple cores. That's the time to start worrying about it.


From the comments:

Suggestion for improving answer: give rough explanation of how to tell if your program is CPU bound. – Earwicker

CPU bound means that the thing preventing the program from running faster is a lack of computational horsepower. Compare with I/O bound (or sometimes network bound). A poor choice of motherboard and processor can result in machines being memory bound as well (yes, I'm looking at you, Alpha).

So you'll need to know what your program is doing from moment to moment (and how busy the machine is...). To find out on a Unix-like system, run top. On Windows, use the Task Manager (thanks Roboprog).

On a machine with a load of less than 1 per core (i.e. your desktop machine when you're not doing much of anything), a CPU bound process will consistently have more than 50% of a processor (often more than 90%). When the load average is higher than that (i.e. you have three compiles, SETI@home, and two peer-to-peer networks running in the background), a CPU bound process will get a large fraction of (# of cores)/(load average).
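
As a quick check, something like this is usually enough (the PID is a placeholder for whatever process you are investigating):

top -pid 12345    # macOS syntax; on most Linux systems: top -p 12345
# a steady ~100% CPU suggests CPU bound; a mostly idle CPU alongside heavy
# disk or network activity suggests I/O bound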

Using 100% of all cores with the multiprocessing module

To use 100% of all cores, do not create and destroy new processes.

Create a few processes per core and link them with a pipeline.

At the OS level, all pipelined processes run concurrently.

The less you write (and the more you delegate to the OS) the more likely you are to use as many resources as possible.

python p1.py | python p2.py | python p3.py | python p4.py ...

This will make maximal use of your CPU.


