Bash while read loop extremely slow compared to cat, why?
The reason while read is so slow is that the shell is required to make a system call for every byte. It cannot read a large buffer from the pipe, because it must not consume more than one line from the input stream, and therefore has to compare each character against a newline. If you run strace on a while read loop, you can see this behavior. The behavior is desirable, because it makes it possible to reliably do things like:
while read size; do test "$size" -gt 0 || break; dd bs="$size" count=1 of=file$(( i++ )); done
in which the commands inside the loop read from the same stream as the shell. If the shell consumed a big chunk of data by reading large buffers, the inner commands would not have access to that data. An unfortunate side effect is that read is absurdly slow.
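The shared-stream property described above can be sketched with a tiny pipeline (using head -c in place of dd; the "5\nHELLOrest" payload is made up for illustration):

```shell
# the shell's read consumes one line (the size) byte-by-byte, so the
# rest of the stream is left in place for the next command to consume
printf '5\nHELLOrest' | {
  read -r size          # consumes only "5" and its newline
  head -c "$size"       # consumes the next 5 bytes: HELLO
}
```

If read buffered ahead, head would never see those bytes.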
while loop extremely slow read file
- You don't need to keep an iterator to add to arrays. You can simply do array+=(item) (not array+=item).
- Getting the columns in the input is as simple as using read with multiple target variables. As a bonus, the last variable gets the Nth word and all subsequent words. See help read.
This saves a ton of forks, but I haven't tested how fast it is.
ogl_date=()
[...]
ogl_commands=()
while read -r date1 date2 time server id type pid commands
do
ogl_date+=("$date1 $date2")
[...]
ogl_commands+=("$commands")
done < /tmp/ftp_search.14-12-02
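The "last variable gets the rest" behavior mentioned above is easy to check (the sample words here are made up):

```shell
# with more words on the line than target variables, the last
# variable collects the Nth word and everything after it
printf 'alpha beta gamma delta\n' | {
  read -r first second rest
  echo "$second"   # beta
  echo "$rest"     # gamma delta
}
```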
while read incredibly slow
You could speed it up by removing all the pipes and calls to subshells. The following greatly simplifies what you're doing:
while read -r par0 par1 par2 par3; do
echo "$par0 $par1 $par2 $par3"
done < data.txt
While Loop Performance : Extremely slow
When working toward optimization, the first step is to time how long it takes just to read the input file and do nothing with it. On my system that takes only a few hundredths of a second for a 10MB file.
So now that we know the least amount of time it's going to take, we can look at optimization strategies. In your example code you open parts.txt and read it from the filesystem for every record in your input file, which expands the amount of work needed considerably. It would be better to keep the parts file in memory and just grab a random element from it for each record of the input file.
The next optimization you can make is to avoid shuffling the list of parts each time you need a part: it is cheaper to grab a single random element than to shuffle the whole list.
You can also skip any processing for any records that don't begin with CAR, but that seems to be a lesser advantage.
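In shell terms, grabbing one random element looks like this (a bash sketch with a made-up parts list):

```shell
# bash: index with RANDOM instead of shuffling the whole array;
# prints one of the three elements, chosen at random
parts=(alpha beta gamma)
echo "${parts[RANDOM % ${#parts[@]}]}"
```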
Anyway, the following accomplishes those objectives:
#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Long;
use Time::HiRes qw(time);
my ($parts_file, $input_file, $output_file) = ('parts.txt', 'input.txt', 'output.txt');
GetOptions(
"parts=s", \$parts_file,
"input=s", \$input_file,
"output=s", \$output_file,
);
my $t0 = time;
chomp(
my @parts = do {
open my $fh, '<', $parts_file or die "Cannot open $parts_file: $!\n";
<$fh>;
}
);
open my $input_fh, '<', $input_file or die "Cannot open $input_file for input: $!\n";
local $/ = '~';
open my $out_fh, '>', $output_file or die "Cannot open $output_file for output: $!\n";
my $rec_count = 0;
while (my $rec = <$input_fh>) {
chomp $rec;
$rec =~ s{^
(CAR\*(?:[^*]+\*){2})
[^*]+
}{
$1 . $parts[int(rand(@parts))]
}xe;
++$rec_count;
print $out_fh "$rec$/";
}
close $out_fh or die "Cannot close output file $output_file: $!\n";
printf "Elapsed time: %-.03f\nRecords: %d\n", time-$t0, $rec_count;
On my system a file consisting of 488321 records (approximately 10MB in size) takes 0.588 seconds to process.
For your own needs you will want to take this Perl script and modify it to have more robust handling of filenames and filesystem paths; that's not part of the question that was asked, though. The primary objective of this code is to demonstrate where optimizations can be made, chiefly by moving work out of the loop: we open the parts file once, read it once, and never shuffle; we just grab a random item from our in-memory list of parts.
Since command-line "one-liners" are so convenient, we should see if this can be boiled down to one. Mostly equivalent functionality can be achieved in a Perl "one-liner" by using the -l, -a, -p, -F, and -e switches (I'm taking the liberty of letting it flow across multiple lines, though):
perl -l0176 -apF'\*' -e '
BEGIN{
local $/ = "\n";
chomp(@parts = do {open $fh, "<", shift(@ARGV); <$fh>})
}
$F[0] =~ m/^CAR/ && $F[3] =~ s/^\w+$/$parts[int(rand(@parts))]/e;
$_ = join("*", @F);
' parts.txt input.txt >output.txt
Here's how it works:
The -p switch tells Perl to iterate over every line in the file specified on the command line, or, if none is specified, over STDIN. For each line, it places the line's value into $_, and before moving on to the next line, prints the contents of $_ to STDOUT. This gives us the opportunity to modify $_ so that the changes get written to STDOUT. But we also use the -l switch, which lets us specify an octal value representing a different record separator. In this case we use the octal value for the ~ character, which causes -p to iterate over records separated by ~ instead of \n. The -l switch also strips record separators on input and restores them on output.
However, we also use the -a and -F switches. -a tells Perl to auto-split the input into the @F array, and -F lets us specify that we want to auto-split on the * character. Because -F accepts a PCRE pattern, and * is considered a quantifier in PCRE, we escape it with a backslash.
Next, the -e switch says to evaluate the following string as code. Finally we can discuss that string of code. First there is a BEGIN{...} block, which shifts one value off of @ARGV and uses it as the name of the file to open and read the parts list from. Once that filename has been shifted off, it won't be considered for reading by the -p switch later in the script (the BEGIN block happens before the implicit -p loop). So just consider that the code in the BEGIN{...} block temporarily sets the record separator back to newlines, reads the parts file into an array, and then relinquishes the record separator back to being ~ again.
Now we can move on past the BEGIN block. @F has become the container holding the fields within a given record. The 4th field (offset 3) is the one you wish to swap. We check whether the first field (offset 0) starts with CAR. If it does, we set the contents of the 4th field to a random element from our parts array, but only if that field consists of one or more word characters. Then we join the fields back together, delimited with an asterisk, and assign the result back to $_. Our work is done: thanks to the -p switch, Perl writes the contents of $_ to STDOUT and then appends the record separator, ~.
Finally on the command line we first specify the path to the parts file, then the path to the input file, and then redirect STDOUT to our output file.
Bash while VERY slow
Let us analyse your script and try to explain why it is slow.
Let's first start with a micro-optimization of your first line. It's not going to speed things up, but it is educational.
cat /home/maillog |grep "Nov 13" |grep "from=<xxxx@xxxx.com>" |awk '{print $6}' > /home/output_1
In this line you make 4 calls to different binaries, where in the end a single one can do the job. For readability you could keep this line; however, here are two main points:
- Useless use of cat. The program cat is mainly used to concatenate files. If you pass it just a single file, it is basically overkill, especially if you only want to pipe it to grep: cat file | grep ... => grep ... file
  - Useless use of cat?
  - https://en.wikipedia.org/wiki/Cat_(Unix)#Useless_use_of_cat
- Multiple greps in combination with awk can be written as a single awk: awk '/Nov 13/ && /from=<xxxx@xxxx.com>/ {print $6}'
So the entire line can be written as:
awk '/Nov 13/ && /from=<xxxx@xxxx.com>/ {print $6}' /home/maillog > /home/output_1
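With a couple of fabricated log lines, the combined command behaves like the original four-stage pipeline (the queue IDs and field layout are made up):

```shell
# both pattern filters and the column extraction happen in one awk process
printf '%s\n' \
  'Nov 13 host smtp x QID1 from=<xxxx@xxxx.com>' \
  'Nov 12 host smtp x QID2 from=<xxxx@xxxx.com>' |
awk '/Nov 13/ && /from=<xxxx@xxxx.com>/ {print $6}'
```

Only the Nov 13 line matches both patterns, so this prints QID1.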
The second part is where things get slow:
while read line; do
awk -v line="$line" '$6 ~ line { print $0 }' /home/maillog >> /home/output_2 ;
done < /home/output_1
Why is this slow? For every line you read from /home/output_1, you load the program awk into memory, open the file /home/maillog, process every line of it, and close it again. On top of that, for every line you process, you open /home/output_2, move the file pointer to the end of the file, append your output, and close the file again.
The whole program can actually be done with a single awk:
awk 'NR==FNR { if (/Nov 13/ && /from=<xxxx@xxxx.com>/) a[$6]; next } ($6 in a)' /home/maillog /home/maillog > /home/output_2
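A self-contained sketch of the two-pass idea with fabricated log lines: the first pass (NR==FNR) remembers the 6th field of every matching line, the second pass prints every line whose 6th field was remembered.

```shell
printf '%s\n' \
  'Nov 13 host smtp x QID1 from=<xxxx@xxxx.com>' \
  'Nov 13 host smtp x QID1 to=<someone@else>' \
  'Nov 13 host smtp x QID2 from=<other@sender>' > /tmp/maillog_demo
# pass 1: collect matching queue IDs; pass 2: print all lines with those IDs
awk 'NR==FNR { if (/Nov 13/ && /from=<xxxx@xxxx.com>/) a[$6]; next }
     ($6 in a)' /tmp/maillog_demo /tmp/maillog_demo
```

This prints the two QID1 lines (the from= line and the matching to= line), while the QID2 line is dropped.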
bash loop taking extremely long time
One of the key things to understand in looking at bash scripts from a performance perspective is that while the bash interpreter is somewhat slow, the act of spawning an external process is extremely slow. Thus, while it can often speed up your scripts to use a single invocation of awk or sed to process a large stream of input, starting those invocations inside a tight loop will greatly outweigh the performance of those tools once they're running.
Any command substitution -- $() -- causes a second copy of the interpreter to be fork()ed off as a subshell. Invoking any command not built into bash -- date, sed, etc. -- then causes a subprocess to be fork()ed off for that process, and the executable associated with it to be exec()'d -- something that involves a great deal of OS-level overhead (the binary needs to be linked, loaded, etc.).
This loop would be better written as:
IFS=: read -r currentHours currentMinutes < <(date +"%H:%M")
while IFS=: read -r hours minutes _; do
# force base-10 so zero-padded values like "08" aren't parsed as invalid octal
if (( 10#$hours >= 10#$currentHours )) && (( 10#$minutes >= 10#$currentMinutes )); then
break
fi
done <file.txt
In this form only one external command, date +"%H:%M", is run, and it runs outside the loop. If you were only targeting bash 4.2 and newer (with built-in time formatting support), even this would be unnecessary:
printf -v currentHours '%(%H)T' -1
printf -v currentMinutes '%(%M)T' -1
...will directly place the current hour and minute into the variables currentHours and currentMinutes using only functionality built into modern bash releases.
See:
- BashFAQ #1 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
- BashFAQ #100 - How can I do native string manipulations in bash? (Subsection: "Splitting a string into fields")
How to nested loop 2 files then compare columns without using `while read`?
The term "folder" comes from Windows; in Unix the equivalent is a "directory". The following will accommodate spaces in your directory names (as you have in your sample input with /home/me/file 2, though that alone is not adequate to test that a given script handles them) and will work using any awk in any shell on every Unix box:
$ cat tst.sh
#!/usr/bin/env bash
result_dir='/home/directory1'
mkdir -p "$result_dir" || exit
#2 test files
cat << EOF > "$result_dir/old"
1 a /home
5 b /home/me
6 e /home/me/file 2
3 c /home/oth
EOF
cat << EOF > "$result_dir/new"
1 a /home
4 b /home/me
6 f /home/me/file 2
5 c /home/oth/file
EOF
awk '
{
match($0,/^([^ ]+ ){2}/)
dir = substr($0,RLENGTH+1)
$0 = substr($0,1,RLENGTH-1)
}
NR==FNR {
olds[dir] = $0
next
}
dir in olds {
split(olds[dir],old)
for (i=1; i<=NF; i++) {
if ($i != old[i]) {
print dir, $i
}
}
}
' "$result_dir/old" "$result_dir/new"
$ ./tst.sh
/home/me 4
/home/me/file 2 f