Bash While Read Loop Extremely Slow Compared to Cat, Why

Bash while read loop extremely slow compared to cat, why?

The reason while read is so slow is that the shell is required to make a system call for every byte. It cannot read a large buffer from the pipe, because the shell must not read more than one line from the input stream and therefore must compare each character against a newline. If you run strace on a while read loop, you can see this behavior. This behavior is desirable, because it makes it possible to reliably do things like:

while read size; do test "$size" -gt 0 || break; dd bs="$size" count=1 of=file$(( i++ )); done

in which the commands inside the loop are reading from the same stream that the shell reads from. If the shell consumed a big chunk of data by reading large buffers, the inner commands would not have access to that data. An unfortunate side-effect is that read is absurdly slow.
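A rough way to see both the per-byte reads and the speed difference for yourself (bigfile is a placeholder for any large input file):

# With the input coming from a pipe, strace shows read() being called one byte at a time.
cat bigfile | strace -e trace=read bash -c 'while read -r line; do :; done' 2>&1 | head -n 20

# And a crude timing comparison of "just move the bytes" vs. "read line by line".
time cat bigfile > /dev/null
time bash -c 'while read -r line; do :; done' < bigfile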

while loop extremely slow read file

  • You don't need to keep an iterator to add to arrays. You can simply do array+=(item) (not array+=item).
  • Getting the columns in the input is as simple as using read with multiple target variables. As a bonus, the last variable gets the Nth word and all subsequent words. See help read; a short illustration follows the rewritten loop below.

This saves a ton of forks, but I haven't tested how fast it is.

ogl_date=()
[...]
ogl_commands=()

while read -r date1 date2 time server id type pid commands
do
    ogl_date+=("$date1 $date2")
    [...]
    ogl_commands+=("$commands")
done < /tmp/ftp_search.14-12-02
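As a quick illustration of that last-variable behavior (the input string here is made up):

# read assigns one word per variable; the final variable soaks up the rest of the line.
read -r first second rest <<< "alpha beta gamma delta"
printf '%s\n' "$first"   # alpha
printf '%s\n' "$second"  # beta
printf '%s\n' "$rest"    # gamma delta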

while read incredibly slow

You could speed it up by removing all the pipes and calls to subshells. The below greatly simplifies what you're doing:

while read -r par0 par1 par2 par3; do
    echo "$par0 $par1 $par2 $par3"
done < data.txt
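For contrast, the kind of per-line pipeline this replaces usually looks something like the following (a hypothetical reconstruction, not the original script): every line costs four command substitutions, four echo pipes, and four awk processes.

# Hypothetical slow version: one subshell plus one awk per field, per line.
while read -r line; do
    par0=$(echo "$line" | awk '{print $1}')
    par1=$(echo "$line" | awk '{print $2}')
    par2=$(echo "$line" | awk '{print $3}')
    par3=$(echo "$line" | awk '{print $4}')
    echo "$par0 $par1 $par2 $par3"
done < data.txt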

While Loop Performance : Extremely slow

When working toward optimization the first step is to time how long it takes just to read the input file, and do nothing with it. On my system that takes only a few hundredths of a second for a 10MB file.

So now we know the least amount of time it's going to take, we need to look at optimization strategies. In your example code you are opening parts.txt and reading that file from the filesystem for every record in your input file. So you're expanding the amount of work needed considerably. It would be nicer if you could keep the parts file in memory and just grab a random element from it for each record from your input file.

The next optimization you can make is to avoid shuffling the list of parts each time you need one. It is better to grab a single random element than to shuffle the entire list.

You can also skip any processing for any records that don't begin with CAR, but that seems to be a lesser advantage.
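In shell terms, those ideas look roughly like this (a sketch only; parts.txt and input.txt stand in for the real files):

# Load the parts list into memory once, before the loop.
mapfile -t parts < parts.txt

while IFS= read -r record; do
    # Records that don't begin with CAR need no processing.
    [[ $record == CAR* ]] || continue
    # Grab one random element directly; no shuffle required.
    part=${parts[RANDOM % ${#parts[@]}]}
    # ... substitute "$part" into the record here ...
done < input.txt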

Anyway, the following accomplishes those objectives:

#!/usr/bin/env perl

use strict;
use warnings;
use Getopt::Long;
use Time::HiRes qw(time);

my ($parts_file, $input_file, $output_file) = ('parts.txt', 'input.txt', 'output.txt');

GetOptions(
    "parts=s"  => \$parts_file,
    "input=s"  => \$input_file,
    "output=s" => \$output_file,
);

my $t0 = time;

# Read the parts file into memory once, up front.
chomp(
    my @parts = do {
        open my $fh, '<', $parts_file or die "Cannot open $parts_file: $!\n";
        <$fh>;
    }
);

open my $input_fh, '<', $input_file or die "Cannot open $input_file for input: $!\n";
local $/ = '~';    # records are separated by '~' rather than newlines

open my $out_fh, '>', $output_file or die "Cannot open $output_file for output: $!\n";

my $rec_count = 0;
while (my $rec = <$input_fh>) {
    chomp $rec;
    # For CAR records, replace the fourth '*'-delimited field (offset 3)
    # with a random element from the in-memory parts list.
    $rec =~ s{^
        (CAR\*(?:[^*]+\*){2})
        [^*]+
    }{
        $1 . $parts[int(rand(@parts))]
    }xe;
    ++$rec_count;
    print $out_fh "$rec$/";
}

close $out_fh or die "Cannot close output file $output_file: $!\n";
printf "Elapsed time: %-.03f\nRecords: %d\n", time - $t0, $rec_count;

On my system a file consisting of 488321 records (approximately 10MB in size) takes 0.588 seconds to process.

For your own needs you will want to take this Perl script and modify it to have more robust handling of filenames and filesystem paths. That's not part of the question that was asked, though. The primary objective of this code is to demonstrate where optimizations can be made: moving work out of the loop, for example. We only open the parts file once, we read it once, and we never shuffle; we just grab a random item from our in-memory list of parts.

Since command-line "one-liners" are so convenient, we should see if this can be boiled down to one. Mostly equivalent functionality can be achieved in a Perl "one-liner" by using the -0, -l, -a, -p, -F, and -e switches (I'm taking the liberty of letting it flow across multiple lines, though):

perl -0176 -l -apF'\*' -e '
    BEGIN {
        local $/ = "\n";
        chomp(@parts = do { open $fh, "<", shift(@ARGV); <$fh> });
    }
    $F[0] =~ m/^CAR/ && $F[3] =~ s/^\w+$/$parts[int(rand(@parts))]/e;
    $_ = join("*", @F);
' parts.txt input.txt >output.txt

Here's how it works:

The -p switch tells Perl to iterate over every record in the file specified on the command line, or if none is specified, over STDIN. For each record, it places the record's value into $_, and before moving on to the next record, prints the contents of $_ to STDOUT. This gives us the opportunity to modify $_ so that our changes get written to STDOUT. The -0 switch lets us specify an octal value representing a different input record separator; here we use the octal value for the ~ character, which causes -p to iterate over records separated by ~ instead of \n. The -l switch then strips the record separator from each input record and, because we give it no value of its own, sets the output record separator to that same ~ so it is put back on output.

We also use the -a and -F switches. -a tells Perl to auto-split the input into the @F array, and -F lets us specify that we want to autosplit on the * character. Because -F accepts a regular-expression pattern, and * is a quantifier in regular expressions, we escape it with a backslash.

Next, the -e switch says to evaluate the following string as code. Finally we can discuss that string of code. First there is a BEGIN{...} block, which shifts one value off of @ARGV and uses it as the name of the file to read the parts list from. Once that filename has been shifted off, it won't be considered for reading by the -p loop later in the script (the BEGIN block runs before the implicit -p loop starts). So the code in the BEGIN{...} block temporarily sets the record separator back to newlines, reads the parts file into an array, and then restores the record separator to ~ again.

Now we can move on past the BEGIN block. @F is the container holding the fields within a given record. The 4th field (offset 3) is the one you wish to swap. We check whether the first field (offset 0) starts with CAR. If it does, we set the contents of the 4th field to a random element from our parts array, but only if that field consists of one or more word characters (the ^\w+$ pattern).

Then we join back together the fields, delimited with an asterisk and assign that result back to $_. Our work is done. Thanks to the -p switch, Perl writes the contents of $_ to STDOUT and then appends the record separator, ~.

Finally on the command line we first specify the path to the parts file, then the path to the input file, and then redirect STDOUT to our output file.

Bash while VERY slow

Let us analyse your script and try to explain why it is slow.

Let's first start with a micro-optimization of your first line. It's not going to speed up things, but this is merely educational.

cat /home/maillog |grep "Nov 13" |grep "from=<xxxx@xxxx.com>" |awk '{print $6}' > /home/output_1 

In this line you make 4 calls to different binaries which in the end can be done by a single one. For readability, you could keep this line. However, here are two main points:

  1. Useless use of cat. The program cat is mainly used to concatenate files. If you pass it just a single file, it is basically overkill, especially if you only want to feed that file to grep.

    cat file | grep ... => grep ... file
    • Useless use of cat?
    • https://en.wikipedia.org/wiki/Cat_(Unix)#Useless_use_of_cat
  2. multiple greps in combination with awk ... can be written as a single awk

    awk '/Nov 13/ && /from=<xxxx@xxxx.com>/ {print $6}'

So the entire line can be written as:

awk '/Nov 13/ && /from=<xxxx@xxxx.com>/ {print $6}' /home/maillog > /home/output_1
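If you want to convince yourself the two produce the same output, process substitution makes the comparison easy (assuming /home/maillog is readable):

# No output from diff means both pipelines agree line for line.
diff <(cat /home/maillog | grep "Nov 13" | grep "from=<xxxx@xxxx.com>" | awk '{print $6}') \
     <(awk '/Nov 13/ && /from=<xxxx@xxxx.com>/ {print $6}' /home/maillog)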

The second part is where things get slow:

while read line; do 
awk -v line="$line" '$6 ~ line { print $0 }' /home/maillog >> /home/output_2 ;
done < /home/output_1

Why is this slow? For every line you read from /home/output_1, you load the program awk into memory, open the file /home/maillog, process every line of it, and close /home/maillog again. At the same time, for every line you process, you open /home/output_2, move the file pointer to the end of the file, append a line, and close the file again.
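An intermediate improvement, before collapsing everything into awk, is simply to open the output file once for the whole loop instead of once per line. awk is still started per line, so this is still slow, but it removes the repeated open/seek/close on the output side:

while read -r line; do
    awk -v line="$line" '$6 ~ line { print $0 }' /home/maillog
done < /home/output_1 > /home/output_2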

The whole program can actually be done with a single awk:

awk '(NR==FNR) { if (/Nov 13/ && /from=<xxxx@xxxx.com>/) a[$6]; next } ($6 in a)' /home/maillog /home/maillog > /home/output_2
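Spread over several lines with comments, the same program looks like this (identical logic, only the layout differs):

awk '
    # First pass: NR==FNR is only true while reading the first copy of the file.
    # Remember $6 of every line that matches both patterns.
    NR == FNR {
        if (/Nov 13/ && /from=<xxxx@xxxx.com>/) a[$6]
        next
    }
    # Second pass: print any line whose $6 was remembered above.
    ($6 in a)
' /home/maillog /home/maillog > /home/output_2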

bash loop taking extremely long time

One of the key things to understand in looking at bash scripts from a performance perspective is that while the bash interpreter is somewhat slow, the act of spawning an external process is extremely slow. Thus, while it can often speed up your scripts to use a single invocation of awk or sed to process a large stream of input, starting those tools once per iteration of a tight loop will cost far more than any speed they offer once they're running.

Any command substitution -- $() -- causes a second copy of the interpreter to be fork()ed off as a subshell. Invoking any command not built into bash -- date, sed, etc. -- then causes a subprocess to be fork()ed off for that command, and the executable associated with it to be exec()'d -- something that involves a great deal of OS-level overhead (the binary needs to be linked, loaded, etc.).
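A quick, unscientific way to feel that overhead (the iteration count is arbitrary):

# 1000 fork+exec pairs: a subshell plus an external date per iteration.
time for ((i = 0; i < 1000; i++)); do d=$(date +%H:%M); done

# The same work done entirely inside the shell (bash 4.2+): no forks at all.
time for ((i = 0; i < 1000; i++)); do printf -v d '%(%H:%M)T' -1; done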


This loop would be better written as:

IFS=: read -r currentHours currentMinutes < <(date +"%H:%M")
while IFS=: read -r hours minutes _; do
    # Force base 10 so values such as "08" and "09" aren't treated as invalid octal.
    if (( 10#$hours >= 10#$currentHours )) && (( 10#$minutes >= 10#$currentMinutes )); then
        break
    fi
done <file.txt

In this form only one external command, date +"%H:%M", is run, and it runs outside the loop. If you were only targeting bash 4.2 and newer (with built-in time formatting support), even this would be unnecessary:

printf -v currentHours '%(%H)T' -1
printf -v currentMinutes '%(%M)T' -1

...will directly place the current hour and minute into the variables currentHours and currentMinutes using only functionality built into modern bash releases.
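If you'd rather make a single call and still end up with the two variables the loop uses, the same builtin can be combined with a here-string (a small variation, not from the original answer):

# One printf produces "HH:MM"; the builtin read then splits it on ":".
printf -v currentTime '%(%H:%M)T' -1
IFS=: read -r currentHours currentMinutes <<< "$currentTime"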


See:

  • BashFAQ #1 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
  • BashFAQ #100 - How can I do native string manipulations in bash? (Subsection: "Splitting a string into fields")

How to nested loop 2 files then compare columns without using `while read`?

The term "folder" is from Windows. In Unix the equivalent is a "directory". The following will accommodate spaces in your directory names (as you have in your sample input with /home/me/file 2 but that's not adequate to test that a given script accommodates it) and will work using any awk in any shell on every Unix box:

$ cat tst.sh
#!/usr/bin/env bash

result_dir='/home/directory1'
mkdir -p "$result_dir" || exit

#2 test files
cat << EOF > "$result_dir/old"
1 a /home
5 b /home/me
6 e /home/me/file 2
3 c /home/oth
EOF

cat << EOF > "$result_dir/new"
1 a /home
4 b /home/me
6 f /home/me/file 2
5 c /home/oth/file
EOF

awk '
{
    # Split each line into the first two fields and the rest: the directory
    # name, which may itself contain spaces.
    match($0,/^([^ ]+ ){2}/)
    dir = substr($0,RLENGTH+1)
    $0  = substr($0,1,RLENGTH-1)
}
NR==FNR {
    # First file ("old"): remember the first two fields for each directory.
    olds[dir] = $0
    next
}
dir in olds {
    # Second file ("new"): compare field by field and print what changed.
    split(olds[dir],old)
    for (i=1; i<=NF; i++) {
        if ($i != old[i]) {
            print dir, $i
        }
    }
}
' "$result_dir/old" "$result_dir/new"


$ ./tst.sh
/home/me 4
/home/me/file 2 f

