How can I use the parallel command to exploit multi-core parallelism on my MacBook?
Parallel processing makes sense when your work is CPU bound (the CPU does the work, and the peripherals are mostly idle) but here, you are trying to improve the performance of a task which is I/O bound (the CPU is mostly idle, waiting for a busy peripheral). In this situation, adding parallelism will only add congestion, as multiple tasks will be fighting over the already-starved I/O bandwidth between them.
On macOS, the system already indexes all your data anyway (including the contents of word-processing documents, PDFs, email messages, etc.); there's a friendly magnifying glass on the menu bar at the upper right where you can access a much faster and more versatile search, called Spotlight. (Though I agree that some of the more sophisticated controls of `find` are missing; and the "user friendly" design gets in the way for me when it guesses what I want, and guesses wrong.)
Some Linux distros offer a similar facility; I would expect that to be the norm for anything with a GUI these days, though the details will differ between systems.
A more traditional solution on any Unix-like system is the `locate` command, which performs a similar but more limited task; it will create a (very snappy) index of file names, so you can say

locate fnord

to very quickly obtain every file whose name matches `fnord`. The index is simply a copy of the results of a `find` run from last night (or however you schedule the backend to run). The command is already installed on macOS, though you have to enable the back end if you want to use it. (Just run `locate locate` to get further instructions.)
You could build something similar yourself if you find yourself often looking for files with a particular set of permissions and a particular owner, for example (these are not features which `locate` records); just run a nightly (or hourly, etc.) `find` which collects these features into a database -- or even just a text file -- which you can then search nearly instantly.
For running jobs in parallel, you don't really need GNU `parallel`, though it does offer a number of conveniences and enhancements for many use cases; you already have `xargs -P`. (The `xargs` on macOS, which originates from BSD, is more limited than the GNU `xargs` you'll find on many Linuxes; but it does have the `-P` option.)
For example, here's how to run eight parallel `find` instances with `xargs -P`:
printf '%s\n' */ | xargs -I {} -P 8 find {} -name '*.ogg'
(This assumes the wildcard doesn't match directories which contain single quotes or newlines or other shenanigans; GNU `xargs` has the `-0` option to fix a large number of corner cases like that; then you'd use `'%s\0'` as the format string for `printf`.)
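For completeness, here is the NUL-delimited spelling (a sketch; it assumes GNU or a reasonably modern BSD `xargs`, and the sample directories are made up):

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d) && cd "$tmp"
mkdir -p 'odd name' plain               # a directory name with a space
touch 'odd name/a.ogg' plain/b.ogg
# %s\0 emits each directory NUL-terminated; xargs -0 splits on NUL, so
# spaces, quotes and even newlines in names pass through unharmed.
printf '%s\0' */ | xargs -0 -I {} -P 8 find {} -name '*.ogg' | sort
```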
As the `parallel` documentation readily explains, its general syntax is
parallel -options command ...
where `{}` will be replaced with the current input line (if it is missing, it will be implicitly added at the end of `command ...`) and the (obviously optional) `:::` special token allows you to specify an input source on the command line instead of as standard input.
Anything outside of those special tokens is passed on verbatim, so you can add `find` options to your heart's content just by specifying them literally.
parallel -j8 find {} -type f -name '*.ogg' ::: */
I don't speak `zsh`, but refactored for regular POSIX `sh`, your function could be something like
ff () {
    parallel -j8 find {} -type f -iname "$2" ::: "$1"
}
though I would perhaps switch the arguments so you can specify a name pattern and a list of files to search, à la `grep`:
ff () {
    # "local" is not POSIX but works in many sh versions
    local pat=$1
    shift
    parallel -j8 find {} -type f -iname "$pat" ::: "$@"
}
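If `parallel` isn't installed, much the same shape can be had with the `xargs -P` you already have; a sketch (the function name, the 8-way limit, and the throwaway demo tree are all arbitrary):

```shell
#!/bin/sh
# Same interface as the ff above, but built on xargs -P instead of parallel.
ff() {
  pat=$1; shift
  # One find per directory argument, at most eight running at a time.
  printf '%s\0' "$@" | xargs -0 -I {} -P 8 find {} -type f -iname "$pat"
}
# Illustrative usage against a throwaway tree:
d=$(mktemp -d)
mkdir -p "$d/music" "$d/docs"
touch "$d/music/tune.OGG" "$d/docs/readme.txt"
ff '*.ogg' "$d/music" "$d/docs"
```

Note that `-iname` matches case-insensitively, so the `.OGG` file is found too.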
But again, spinning your disk to find things which are already indexed is probably something you should stop doing, rather than facilitate.
Using the GNU parallel command with gfind to reduce the runtime of the gupdatedb tool
You don't need `:::` if there's nothing after it, and `{}` is pointless too if you don't have any sources. Without more information about what exactly you would want to parallelize, we can't really tell you what you should use instead.
But for example, if you want to run one `find` in each of `/etc`, `/usr`, `/bin`, and `/opt`, that would look like
parallel find {} -options ::: /etc /usr /bin /opt
This could equivalently be expressed without `:::`:
printf '%s\n' /etc /usr /bin /opt |
parallel find {} -options
So the purpose of `:::` is basically to say "I want to specify the things to parallelize over on the command line instead of receiving them on standard input"; but if you don't provide this information either way, `parallel` doesn't know what to replace `{}` with.
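A quick way to convince yourself of the equivalence, with `echo` standing in for `find` (the script skips quietly where GNU `parallel` isn't installed):

```shell
#!/bin/sh
command -v parallel >/dev/null 2>&1 || { echo 'parallel not installed'; exit 0; }
# Same jobs, same {} replacements; only where the input list lives differs.
parallel --gnu echo {} ::: /etc /usr | sort
printf '%s\n' /etc /usr | parallel --gnu echo {} | sort
```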
I'm not saying this particular use makes sense for your use case, just hopefully clarifying the documentation (again).
GNU parallel not working at all
As I was about to complete writing this question, I ran `parallel --version` to report the version, only to find:
WARNING: YOU ARE USING --tollef. IF THINGS ARE ACTING WEIRD USE --gnu.
It is not clear to me why that flag is set by default. Needless to say, using `--gnu` worked!
Thought I would post this to save someone hours of frustration and confusion.
EDIT: To fix this permanently (in Ubuntu, at least), delete the `--tollef` flag in `/etc/parallel/config`.
Modify gupdatedb (GNU updatedb command) to insert parallel command
Updated Answer
The problem is on the line after the line containing `A2` in the file `/usr/local/Cellar/findutils/4.7.0/libexec/bin/gupdatedb`. Currently, it is of the form:
# : A2
$find $SEARCHPATHS $FINDOPTIONS \( $prunefs_exp -type d -regex "$PRUNEREGEX" \) -prune -o $print_option
whereas you want it to be of the form:
# : A2
parallel -j 32 --lb gfind {} $FINDOPTIONS ... ::: BUNCH_OF_PATHS
As you haven't given the paths you wish to search in parallel, the paths at the moment are just `/`, which means nothing can be done in parallel. You will need to run with `--localpaths` set to a bunch of places that are worth searching in parallel, or hack the script even more extensively. Though, to be honest, I am not sure why you would want to speed this up, because it should only be run relatively rarely, and then only at times when the system is quiet.
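For illustration, such an invocation might look like the following (the paths and the database location are assumptions on my part; `--localpaths` and `--output` are standard GNU `updatedb` options):

```shell
#!/bin/sh
# Skipped quietly when GNU findutils' gupdatedb isn't on the PATH.
command -v gupdatedb >/dev/null 2>&1 || exit 0
# Several disjoint roots give the patched parallel invocation real work
# to distribute; a single / leaves nothing to split.
gupdatedb --localpaths='/etc /usr/share' --output="$HOME/.locatedb"
```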
Original Answer
Go to around line 250 of the file `/usr/local/Cellar/findutils/4.7.0/libexec/bin/gupdatedb` and comment out the `checkbinary` call with a hash sign so it looks like this:
for binary in $find $frcode
do
    #checkbinary $binary
done
Are you concerned about multicore?
Are your programs typically CPU bound?
If not, forget it. Multicore doesn't concern you, and it gives your users a smoother experience without making any demands on you at all.
Cool, eh?
If you are CPU bound, and your problem is parallelizable, you might be able to leverage the multiple cores. That's the time to start worrying about it.
From the comments:
Suggestion for improving answer: give rough explanation
of how to tell if your program is CPU bound. – Earwicker
CPU bound means that the thing preventing the program from running faster is a lack of computational horse-power. Compare to IO bound (or sometimes network bound). A poor choice of motherboard and processor can result in machines being memory bound as well (yes, I'm looking at you, Alpha).
So you'll need to know what your program is doing from moment to moment (and how busy the machine is...). To find out on a Unix-like system, run `top`. On Windows, use the Task Manager (thanks, Roboprog).
On a machine with a load of less than 1 per core (i.e. your desktop machine when you're not doing much of anything), a CPU-bound process will consistently have more than 50% of a processor (often more than 90%). When the load average is higher than that (i.e. you have three compiles, SETI@home, and two peer-to-peer networks running in the background), a CPU-bound process will have a large fraction of (# of cores)/(load average).
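A rough way to see this from a script, using `ps` as top's batch-mode cousin (the busy loop and the sleep are stand-ins for your real program):

```shell
#!/bin/sh
# Sample each child's CPU share: a CPU-bound process sits near 100% of a
# core, while an I/O-bound (here: sleeping) one sits near 0%.
sh -c 'while :; do :; done' & spin=$!   # stand-in for a CPU-bound program
sleep 30 & idle=$!                      # stand-in for an I/O-bound program
sleep 2                                 # let the averages accumulate
busy=$(ps -o pcpu= -p "$spin")
lazy=$(ps -o pcpu= -p "$idle")
echo "busy loop: ${busy}% CPU   sleeper: ${lazy}% CPU"
kill "$spin" "$idle"
```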
Using 100% of all cores with the multiprocessing module
To use 100% of all cores, do not create and destroy new processes.
Create a few processes per core and link them with a pipeline.
At the OS-level, all pipelined processes run concurrently.
The less you write (and the more you delegate to the OS) the more likely you are to use as many resources as possible.
python p1.py | python p2.py | python p3.py | python p4.py ...
will make maximal use of your CPU.
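The same effect is easy to observe with ordinary shell tools; each stage below is its own OS process, and the kernel schedules them onto whatever cores are free:

```shell
# Four processes run concurrently with no explicit parallel code at all:
# the generator, the filter, the sorter and the consumer overlap in time.
seq 1000000 | grep -v 0 | sort -rn | head -n 3
# prints the three largest numbers up to 1000000 containing no 0 digit:
# 999999, 999998, 999997
```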