Comparison of Cat Pipe Awk Operation to Awk Command on a File

There are 3 ways to open a file and have awk operate on its contents:

  1. cat opens the file:

    cat file | awk '...'
  2. shell redirection opens the file:

    awk '...' < file
  3. awk opens the file:

    awk '...' file

Of those choices:

  1. is always to be avoided, as the cat and the pipe consume resources while providing no value; google UUOC (Useless Use Of Cat) for details.

Which of the other 2 to use is debatable:


  2. has the advantage that the shell is opening the file rather than the tool, so you can rely on consistent error handling if you do this for all tools.
  3. has the advantage that the tool knows the name of the file it is operating on (e.g. FILENAME in awk), so you can use that internally.

To see the difference, consider these 2 files:

$ ls -l file1 file2
-rw-r--r-- 1 Ed None 4 Mar 30 09:55 file1
--w------- 1 Ed None 0 Mar 30 09:55 file2
$ cat file1
a
b
$ cat file2
cat: file2: Permission denied

and see what happens when you try to run awk on the contents of both using both methods of opening them:

$ awk '{print FILENAME, $0}' < file1
- a
- b

$ awk '{print FILENAME, $0}' file1
file1 a
file1 b

$ awk '{print FILENAME, $0}' < file2
-bash: file2: Permission denied

$ awk '{print FILENAME, $0}' file2
awk: fatal: cannot open file `file2' for reading (Permission denied)

Note that when you use redirection, the error message for opening the unreadable file, file2, came from the shell and so looked exactly like the error message when I first tried to cat it. When letting awk open the file, the error message came from awk instead; it is different from the shell message and would differ across various awks.

Note that when using awk to open the file, FILENAME was populated with the name of the file being operated on but when using redirection to open the file it was set to -.

I personally think that the benefit of "3" (populated FILENAME) vastly outweighs the benefit of "2" (consistent error handling of file open errors) and so I would always use:

awk '...' file

and for your particular problem you'd use:

awk -F':' '{cnt[$1]++} END{for (i in cnt) print cnt[i], i}' fname
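For instance, with a made-up colon-delimited input (the data below is purely illustrative), that one-liner counts occurrences of the first field:

```shell
# Hypothetical sample data: "key:value" lines, keyed on the first field
printf 'alice:x\nbob:y\nalice:z\n' > fname

awk -F':' '{cnt[$1]++} END{for (i in cnt) print cnt[i], i}' fname
# prints "2 alice" and "1 bob" (the order of a "for (i in cnt)" loop is unspecified)
```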

Performance considerations when using pipe | within awk

tl;dr Using a pipe within awk can be twice as slow.

I went and had a quick read through of io.c in the gawk source.

Piping within awk is POSIX as long as you don't use co-processes, i.e. |&.

If you have an OS that doesn't support pipes (this came up in the comments), gawk will simulate them by writing to temporary files, as you'd expect. That will be slow, but at least you get pipes where you otherwise wouldn't.

If you have a real OS, it will fork children and write the output there, so you wouldn't expect a huge performance drop by using the pipe within awk.

Interestingly though gawk has some optimisations for simple cases like

awk '{print $1}'

so I ran a test case.

for i in $(seq 1 10000000); do echo $((10000000 - i)) " " $i; done > infile

Ten million records seemed like enough to smooth out variance from other jobs on the system.

Then

time awk '{ print $1 }' infile | sort -n > /dev/null

real 0m10.350s
user 0m7.770s
sys 0m3.000s

or thereabouts on average.

but

time awk '{ print $1 | " sort -n " }' infile > /dev/null

real 0m25.870s
user 0m13.880s
sys 0m13.030s

As you can see this is quite a dramatic difference.

So the conclusion:
Although it can potentially be much slower, there are plenty of use cases where the gains far outweigh the extra performance hit. It really is only in simple cases like the MCVE where you should keep the pipe outside.
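One hedged illustration of such a use case: printing a header followed by a sorted body. A single outer pipe would sort the header along with the data, but a pipe inside awk can target just the body (the fflush() makes sure the header is written before sort's output arrives):

```shell
printf '3\n1\n2\n' |
awk '
    BEGIN { print "values:"; fflush() }   # header goes straight to stdout, flushed now
    { print | "sort -n" }                 # only body lines pass through the pipe
    END { close("sort -n") }              # wait for sort to finish before awk exits
'
# values:
# 1
# 2
# 3
```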

There is a discussion here about the difference between redirecting into awk versus calling awk with a filename. Although not directly related, it might be of interest if you have bothered to read this far.

awk: process input from pipe, insert result before pattern in output file

Here's a modified version of your executable awk script that produces the ordering you want:

#!/usr/bin/awk -f

BEGIN { FS="[{}]"; mils="0.3527"; built=1 }

FNR==NR {
    if( $1 !~ /set lineno/ ) {
        if( lineno != "" ) { footer[++cnt]=$0; if(cnt==3) { FS = "[\" ]+" } }
        else print
    }
    else { lineno=$2 }
    next
}

FNR!=NR && NF > 0 { built += buildObjs( built+1 ) }

END {
    print "set lineno {" built "}"
    for( i=1; i<=cnt; i++ ) {
        print footer[i]
    }
}

function buildObjs( n )
{
    x=$4*mils; y=-$5*mils; w=$6*mils; h=$7*mils
    print "## element" n " [x]=" x " [y]=" y " [width]=" w " [height]=" h
    print "set fsize(" n ") {FALSE}"
    print "set fmargin(" n ") {FALSE}"
    print "set fmaster(" n ") {TRUE}"
    print "set ftype(" n ") {box}"
    print "set fname(" n ") {" w " " h "}"
    print "set fatt(" n ") {1}"
    print "set dplObjectSetup(" n ",TRA) {" x " " y "}"
    print "set fnum(" n ") {}"
    return 1
}

When put into a file called awko it would be run like:

hunspell -L -H ./text.xml | ./awko ./output.xml -

I don't have hunspell installed, so I tested this by feeding the piped output from Edit3, saved in a file, via cat:

cat ./pipeddata | ./awko ./output.xml -

Notice the - after the output file. It tells awk to read from stdin as the 2nd input to the script, which lets me deal with the first file using the standard FNR==NR { do stuff; next } logic.
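Here is a minimal sketch of that two-input mechanism, with made-up file names and contents:

```shell
printf 'a\n' > first.txt     # stands in for ./output.xml

printf 'b\n' | awk '
    FNR == NR { print "from file:", $0; next }   # records from the named file
    { print "from pipe:", $0 }                   # records from stdin, i.e. "-"
' first.txt -
# from file: a
# from pipe: b
```

(The usual caveat applies: the FNR==NR test misfires if the first input is empty.)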

Here's the breakdown:

  • For personal preference, I moved the buildObjs() function to the end of the script. Notice I added an n argument to it - NR won't be used in the output. I dropped the a array because it didn't seem to be necessary, and changed its return from 0 to 1.
  • In the BEGIN block, set up the output.xml file parsing (FS) and mils
  • Whenever the FILENAME changes to -, change FS for parsing that input. The piped data FS could instead be set on the command line between the output file and the -.
  • When FNR==NR, handle the first file
  • Basically, print the "header" info while your anchor hasn't been read yet
  • When the anchor is read, store its value in lineno
  • After the anchor is read, store the last lines of the output file into the footer array in cnt order. Knowing there are only 3 lines at the end, I "cheated" to adjust the FS before the first record is read from STDIN.
  • When FNR!=NR and the line isn't blank (NF>0), process the piped input, incrementing built and passing it with an offset of 1 as an arg to buildObjs() (as built starts with a value of 1).
  • In the END, the set lineno line is reconstructed/printed with the final value of built.
  • Then the footer from the first file is printed in order based on the cnt variable

Using the cat form, I get the following:

#    file.encoding: UTF-8
# sun.jnu.encoding: UTF-8

set toolVersion {1.20}
set ftype(0) {pgs}
set fsize(0) {FALSE}
set fmargin(0) {FALSE}
set fsize(1) {TRUE}
set fmargin(1) {TRUE}
set fmaster(1) {FALSE}
set ftype(1) {pgs}
set fname(1) {}
set fatt(1) {0}
set dplObjectSetup(1,TRA) {}
set fnum(1) {}
## element2 [x]=32.6389 [y]=-21.7 [width]=3.35171 [height]=0
set fsize(2) {FALSE}
set fmargin(2) {FALSE}
set fmaster(2) {TRUE}
set ftype(2) {box}
set fname(2) {3.35171 0}
set fatt(2) {1}
set dplObjectSetup(2,TRA) {32.6389 -21.7}
set fnum(2) {}
## element3 [x]=32.3073 [y]=-38.0119 [width]=3.68325 [height]=0
set fsize(3) {FALSE}
set fmargin(3) {FALSE}
set fmaster(3) {TRUE}
set ftype(3) {box}
set fname(3) {3.68325 0}
set fatt(3) {1}
set dplObjectSetup(3,TRA) {32.3073 -38.0119}
set fnum(3) {}
## element4 [x]=46.7197 [y]=-11.5499 [width]=2.58776 [height]=0
set fsize(4) {FALSE}
set fmargin(4) {FALSE}
set fmaster(4) {TRUE}
set ftype(4) {box}
set fname(4) {2.58776 0}
set fatt(4) {1}
set dplObjectSetup(4,TRA) {46.7197 -11.5499}
set fnum(4) {}
set lineno {4}
set mode {1}
set preservePDF {1}
set preservePDFAction {Continue}

Seems like your buildObjs() function logic needs some attention to get things just the way you want (I suspect the indexes you've chosen need shifting).

When programs like awk get input through a pipe, do they read it line by line?

Both ways you write your code:

while IFS=, read a b c
do
    echo $a $b $c
done < textfile.txt

OR

cat textfile.txt | awk '{print $1 $2 $3}'

are wrong. The shell loop will be very slow and produce bizarre results based on the content of your input file. The correct way to write it to avoid the bizarre results is (you should arguably use printf instead of echo too):

while IFS=, read -r a b c
do
    echo "$a $b $c"
done < textfile.txt

but it'd still be incredibly slow. The shell is an environment from which to call tools, with a language to sequence those calls; it is NOT a tool for text processing - the UNIX text-processing tool is awk.

The cat | awk command should be written as:

awk '{print $1, $2, $3}' textfile.txt

since awk is perfectly capable of opening files on its own, and NO UNIX command EVER needs cat to open a file for it: they can all either open the file themselves (cmd file) or have the shell open it for them (cmd < file).

awk processes each input record one at a time, where an input record is any chunk of text separated by the value of awk's RS variable (a newline by default). It doesn't matter how/where those records are coming from. The only thing you also [rarely] need to consider is buffering - see your awk and shell man pages for info on that.
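For example, setting RS to the empty string switches awk into paragraph mode, where each blank-line-separated block is one record; where the records come from (file, pipe, or redirection) makes no difference:

```shell
printf 'line1\nline2\n\nline3\n' |
awk 'BEGIN { RS = "" } { print NR ": " $1 }'
# 1: line1
# 2: line3
```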

One way to set shell variables from awk output:

$ cat file
the quick brown fox

$ array=( $(awk '{print $1, $2, $3}' file) )

$ echo "${array[0]}"
the
$ echo "${array[1]}"
quick
$ echo "${array[2]}"
brown

Set individual shell variables from the array contents if you like or just use the array.

Another way:

$ set -- $(awk '{print $1, $2, $3}' file)

$ echo "$1"
the
$ echo "$2"
quick
$ echo "$3"
brown

not equal to operator with awk

For this you just need grep:

$ grep -vf fileA fileB
DaDa 43 Gk
PkPk 22 Aa

This uses fileA to obtain the patterns from. Then, -v inverts the match.
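The question's input files aren't shown, so here is a hypothetical pair consistent with the output above. One caveat worth knowing: grep -f treats each line of fileA as a pattern that can match anywhere in a fileB line, not just in the first column:

```shell
# Made-up file contents, chosen to reproduce the output shown above
printf 'AaAa\nBbBb\n' > fileA
printf 'AaAa 10 Xx\nDaDa 43 Gk\nBbBb 7 Yy\nPkPk 22 Aa\n' > fileB

grep -vf fileA fileB
# DaDa 43 Gk
# PkPk 22 Aa
```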

AwkMan addresses very well why you are not matching lines properly. Now, let's see where your solution needs polishing:

Your code is:

for i in `cat FileA`
do
    cat FileB | awk '{ if ($1!='$i') print $0_}' >> Result
done

Why you don't read lines with "for" explains it well. So you would need to say something like what is described in Read a file line by line assigning the value to a variable:

while IFS= read -r line
do
    cat FileB | awk '{ if ($1!='$i') print $0_}' >> Result
done < fileA

Then, you are saying cat file | awk '...'. For this, awk '...' file is enough:

while IFS= read -r line
do
    awk '{ if ($1!='$i') print $0_}' FileB >> Result
done < fileA

Also, the redirection could be done after the done, so you have a clearer command:

while IFS= read -r line
do
    awk '{ if ($1!='$i') print $0_}' FileB
done < fileA >> Result

Calling awk so many times is wasteful, and you could instead use the FNR==NR trick to process the two files together.
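A sketch of that FNR==NR approach, using made-up file contents and assuming fileA holds bare keys to exclude; unlike grep -vf, it compares only the first field of fileB:

```shell
# Hypothetical inputs, purely for illustration
printf 'AaAa\nBbBb\n' > fileA
printf 'AaAa 10 Xx\nDaDa 43 Gk\nBbBb 7 Yy\nPkPk 22 Aa\n' > fileB

# First file (FNR==NR): remember each key. Second file: print only
# lines whose first field was never remembered.
awk 'FNR == NR { skip[$1]; next } !($1 in skip)' fileA fileB
# DaDa 43 Gk
# PkPk 22 Aa
```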

Let's now look at the awk part. Here you want to use some kind of variable to compare against. However, $i is nothing to awk.

Also, when you have a statement like:

awk '{if (condition) print $0}' file

It is the same as saying:

awk 'condition' file

Because {print $0} is the default action to perform when a condition evaluates to true.

Also, to let awk use a bash variable you need to use awk -v var="$shell_var" and then use var internally.

All together, you should say something like:

while IFS= read -r line
do
awk -v var="$line" '$1 != var' FileB
done < fileA >> Result

But since you are looping through FileB once per line of fileA, it will print the non-matching lines many, many times. That's why you should go back to the top of this answer and just use grep -vf fileA fileB.

Arithmetic operations with awk

Your awk is behaving correctly; the problem is your locale setting, which currently uses , instead of . as the decimal point. That contradicts your data, so the string 0.5 will be treated as 0 in numerical operations, since the intended number would have been 0,5.

Use:

LC_ALL=C awk '{$1=1-$1}1' in_file > out_file

instead (or export LC_ALL=C in your environment to use that setting for all commands) and see https://unix.stackexchange.com/a/87763/133219 for information on locales and LC_ALL.
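A quick demonstration with made-up numbers, forcing the C locale so . is the decimal point:

```shell
printf '0.25\n0.5\n' > in_file

LC_ALL=C awk '{$1 = 1 - $1} 1' in_file
# 0.75
# 0.5
```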


