Comparison of cat pipe awk operation to awk command on a file
There are 3 ways to open a file and have awk operate on its contents:
1. cat opens the file:
cat file | awk '...'
2. shell redirection opens the file:
awk '...' < file
3. awk opens the file:
awk '...' file
Of those choices:
1. is always to be avoided, as the cat and pipe use resources while providing no value; google UUOC (Useless Use Of Cat) for details.
Which of the other 2 to use is debatable:
2. has the advantage that the shell opens the file rather than the tool, so you can rely on consistent error handling if you do this for all tools.
3. has the advantage that the tool knows the name of the file it is operating on (e.g. FILENAME in awk), so you can use that internally.
To see the difference, consider these 2 files:
$ ls -l file1 file2
-rw-r--r-- 1 Ed None 4 Mar 30 09:55 file1
--w------- 1 Ed None 0 Mar 30 09:55 file2
$ cat file1
a
b
$ cat file2
cat: file2: Permission denied
and see what happens when you try to run awk on the contents of both using both methods of opening them:
$ awk '{print FILENAME, $0}' < file1
- a
- b
$ awk '{print FILENAME, $0}' file1
file1 a
file1 b
$ awk '{print FILENAME, $0}' < file2
-bash: file2: Permission denied
$ awk '{print FILENAME, $0}' file2
awk: fatal: cannot open file `file2' for reading (Permission denied)
Note that the error message for opening the unreadable file, file2, when you use redirection came from the shell, and so looked exactly like the error message when I first tried to cat it, while the error message when letting awk open it came from awk, is different from the shell message, and would differ across various awks.
Note that when using awk to open the file, FILENAME was populated with the name of the file being operated on, but when using redirection to open the file it was set to -.
I personally think that the benefit of "3" (populated FILENAME) vastly outweighs the benefit of "2" (consistent error handling of file open errors) and so I would always use:
awk '...' file
and for your particular problem you'd use:
awk -F':' '{cnt[$1]++} END{for (i in cnt) print cnt[i], i}' fname
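As a sketch of what that produces, here is a run against a made-up colon-delimited file (the sample data and the /tmp/fname path are my assumptions for illustration):

```shell
# Hypothetical sample input standing in for the real fname
printf 'alice:x\nbob:y\nalice:z\n' > /tmp/fname
# Count occurrences of each first field; piping to sort only makes the
# output order stable, since "for (i in cnt)" iterates in unspecified order
awk -F':' '{cnt[$1]++} END{for (i in cnt) print cnt[i], i}' /tmp/fname | sort
# -> 1 bob
# -> 2 alice
```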
Performance considerations when using pipe | within awk
tl;dr Using a pipe within awk can be twice as slow.
I went and had a quick read through of io.c in the gawk source.
Piping with awk is POSIX as long as you don't use co-processes, i.e. |&.
If you have an OS that doesn't support pipes (this came up in the comments), gawk will simulate them by writing to files, as you'd expect. That will take a while, but at least you have pipes where you otherwise wouldn't.
If you have a real OS, it will fork a child and write the output there, so you wouldn't expect a huge performance drop from using a pipe within awk.
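As a small illustration of a pipe inside awk (my example, not from the original answer): each distinct command string gets one child process, and close() flushes the pipe and reaps the child:

```shell
# Print 3..1 into a single "sort -n" child; close() ends the pipe so the
# child's sorted output appears before awk exits
awk 'BEGIN { for (i = 3; i >= 1; i--) print i | "sort -n"; close("sort -n") }'
# -> 1
# -> 2
# -> 3
```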
Interestingly though gawk has some optimisations for simple cases like
awk '{print $1}'
so I ran a test case.
for i in $(seq 1 10000000); do echo $(( 10000000-$i )) " " $i;done > infile
Ten million records seemed like enough to smooth out variance from other jobs on the system.
Then
time awk '{ print $1 }' infile | sort -n > /dev/null
real 0m10.350s
user 0m7.770s
sys 0m3.000s
or thereabouts on average.
but
time awk '{ print $1 | " sort -n " }' infile > /dev/null
real 0m25.870s
user 0m13.880s
sys 0m13.030s
As you can see, this is quite a dramatic difference.
So the conclusion:
Although it can potentially be much slower, there are plenty of use cases where the gains far outweigh the extra performance hit. It is really only in simple cases like the MVCE where you should keep the pipe outside awk.
There is a discussion here about the difference between redirecting into awk versus calling awk with a filename. Although not directly related, it might be of interest if you have bothered to read this far.
awk: process input from pipe, insert result before pattern in output file
Here's a modified version of your executable awk script that produces the ordering you want:
#!/usr/bin/awk -f
BEGIN { FS="[{}]"; mils="0.3527"; built=1 }
FNR==NR {
if( $1 !~ /set lineno/ ) {
if( lineno != "" ) { footer[++cnt]=$0; if(cnt==3) { FS = "[\" ]+" } }
else print
}
else { lineno=$2 }
next
}
FNR!=NR && NF > 0 { built += buildObjs( built+1 ) }
END {
print "set lineno {" built "}"
for(i=1;i<=cnt;i++ ) {
print footer[i]
}
}
function buildObjs( n )
{
x=$4*mils; y=-$5*mils; w=$6*mils; h=$7*mils
print "## element" n " [x]=" x " [y]=" y " [width]=" w " [height]=" h
print "set fsize(" n ") {FALSE}"
print "set fmargin(" n ") {FALSE}"
print "set fmaster(" n ") {TRUE}"
print "set ftype(" n ") {box}"
print "set fname(" n ") {" w " " h "}"
print "set fatt(" n ") {1}"
print "set dplObjectSetup(" n ",TRA) {" x " " y "}"
print "set fnum(" n ") {}"
return 1
}
When put into a file called awko, it would be run like:
hunspell -L -H ./text.xml | ./awko ./output.xml -
I don't have hunspell installed, so I tested this by running the Edit3 piped output from a file via cat:
cat ./pipeddata | ./awko ./output.xml -
Notice the - after the output file. It tells awk to read from stdin as the 2nd input to the awk script, which lets me deal with the first file with the standard FNR==NR { do stuff; next } logic.
Here's the breakdown:
- For personal preference, I moved the buildObjs() function to the end of the script. Notice I added an argument to it, so NR won't be used in the output. I dropped the a array because it didn't seem to be necessary, and changed its return from 0 to 1.
- In the BEGIN block, set up the output.xml file parsing and mils.
- Whenever the FILENAME changes to -, change FS for parsing that input. The piped data's FS could instead be set on the command line, between the output file and the -.
- When FNR==NR, handle the first file.
- Basically, print the "header" info while your anchor hasn't been read yet.
- When the anchor is read, store its value in lineno.
- After the anchor is read, store the rest of the output file into the footer array in cnt order. Knowing there are only 3 lines at the end, I "cheated" to adjust the FS before the first record is read from stdin.
- When FNR!=NR and the line isn't blank (NF > 0), process the piped input, incrementing built and passing it with an offset of 1 as an arg to buildObjs() (as built starts with a value of 1).
- In the END, the set lineno line is reconstructed/printed from the final value of built.
- Then the footer from the first file is printed in order, based on the cnt variable.
Using the cat form, I get the following:
# file.encoding: UTF-8
# sun.jnu.encoding: UTF-8
set toolVersion {1.20}
set ftype(0) {pgs}
set fsize(0) {FALSE}
set fmargin(0) {FALSE}
set fsize(1) {TRUE}
set fmargin(1) {TRUE}
set fmaster(1) {FALSE}
set ftype(1) {pgs}
set fname(1) {}
set fatt(1) {0}
set dplObjectSetup(1,TRA) {}
set fnum(1) {}
## element2 [x]=32.6389 [y]=-21.7 [width]=3.35171 [height]=0
set fsize(2) {FALSE}
set fmargin(2) {FALSE}
set fmaster(2) {TRUE}
set ftype(2) {box}
set fname(2) {3.35171 0}
set fatt(2) {1}
set dplObjectSetup(2,TRA) {32.6389 -21.7}
set fnum(2) {}
## element3 [x]=32.3073 [y]=-38.0119 [width]=3.68325 [height]=0
set fsize(3) {FALSE}
set fmargin(3) {FALSE}
set fmaster(3) {TRUE}
set ftype(3) {box}
set fname(3) {3.68325 0}
set fatt(3) {1}
set dplObjectSetup(3,TRA) {32.3073 -38.0119}
set fnum(3) {}
## element4 [x]=46.7197 [y]=-11.5499 [width]=2.58776 [height]=0
set fsize(4) {FALSE}
set fmargin(4) {FALSE}
set fmaster(4) {TRUE}
set ftype(4) {box}
set fname(4) {2.58776 0}
set fatt(4) {1}
set dplObjectSetup(4,TRA) {46.7197 -11.5499}
set fnum(4) {}
set lineno {4}
set mode {1}
set preservePDF {1}
set preservePDFAction {Continue}
Seems like your buildObjs() function logic needs some attention to get things just the way you want (I suspect the indexes you've chosen need shifting).
When programs like awk get input through a pipe, do they read it line by line?
Both ways you write your code:
while IFS=, read a b c
echo $a $b $c
done < textfile.txt
OR
cat textfile.txt | awk '{print $1 $2 $3}'
are wrong. The shell loop will be very slow and produce bizarre results depending on the content of your input file. The correct way to write it to avoid the bizarre results is (you should arguably use printf instead of echo too):
while IFS=, read -r a b c
echo "$a $b $c"
done < textfile.txt
but it'd still be incredibly slow. The shell is an environment from which to call tools, with a language to sequence those calls; it is NOT a tool for text processing - the UNIX text-processing tool is awk.
The cat | awk command should be written as:
awk '{print $1, $2, $3}' textfile.txt
since awk is perfectly capable of opening files on its own, and NO UNIX command EVER needs cat to open a file for it; they can all either open the file themselves (cmd file) or have the shell open it for them (cmd < file).
awk processes each input record one at a time, where an input record is any chunk of text separated by the value of awk's RS variable (a newline by default). It doesn't matter how or where those records are coming from. The only other thing you [rarely] need to consider is buffering - see your awk and shell man pages for info on that.
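For instance, here is a sketch of changing the record separator (my example, not from the original answer): setting RS to the empty string puts awk in paragraph mode, where blank-line-separated blocks are records:

```shell
# Two paragraphs become two records; $1 is the first whitespace-separated
# field of each record (newlines also separate fields in paragraph mode)
printf 'a\nb\n\nc\nd\n' | awk -v RS= '{print NR ": " $1}'
# -> 1: a
# -> 2: c
```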
One way to set shell variables from awk output:
$ cat file
the quick brown fox
$ array=( $(awk '{print $1, $2, $3}' file) )
$ echo "${array[0]}"
the
$ echo "${array[1]}"
quick
$ echo "${array[2]}"
brown
Set individual shell variables from the array contents if you like or just use the array.
Another way:
$ set -- $(awk '{print $1, $2, $3}' file)
$ echo "$1"
the
$ echo "$2"
quick
$ echo "$3"
brown
not equal to operator with awk
For this you just need grep:
$ grep -vf fileA fileB
DaDa 43 Gk
PkPk 22 Aa
This uses fileA to obtain the patterns from. Then, -v inverts the match.
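To make that reproducible, here is a sketch with made-up file contents chosen to match the output shown above (the exclusion keys and data lines are my assumptions):

```shell
# Hypothetical inputs: fileA holds the keys to exclude, fileB the data
printf 'SySy\nToTo\n' > fileA
printf 'SySy 24 aa\nDaDa 43 Gk\nToTo 31 bb\nPkPk 22 Aa\n' > fileB
# -v inverts the match, so lines containing any fileA pattern are dropped
grep -vf fileA fileB
# -> DaDa 43 Gk
# -> PkPk 22 Aa
```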
AwkMan addresses very well why you are not matching lines properly. Now, let's see where your solution needs polishing:
Your code is:
for i in `cat FileA`
do
cat FileB | awk '{ if ($1!='$i') print $0_}'>> Result
done
Why you don't read lines with "for" explains it well. So you would need to say something like what is described in Read a file line by line assigning the value to a variable:
while IFS= read -r line
do
cat FileB | awk '{ if ($1!='$i') print $0_}'>> Result
done < fileA
Then, you are saying cat file | awk '...'
. For this, awk '...' file
is enough:
while IFS= read -r line
do
awk '{ if ($1!='$i') print $0_}' FileB >> Result
done < fileA
Also, the redirection could be done once, after the done, so you have a clearer command:
while IFS= read -r line
do
awk '{ if ($1!='$i') print $0_}' FileB
done < fileA >> Result
Calling awk so many times is not useful, and you can use the FNR==NR trick to process two files together.
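That trick could look like the following sketch (my example; it assumes the values in fileA are exact whole first fields, not patterns):

```shell
# First pass (FNR==NR is true only while reading fileA): remember each key.
# Second pass: print fileB lines whose first field was never seen.
awk 'FNR==NR { skip[$1]; next } !($1 in skip)' fileA fileB
```

This reads each file exactly once, instead of re-running awk for every line of fileA.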
Let's now step into awk. Here you want to use some kind of variable to compare results. However, $i is nothing to awk.
Also, when you have a command like:
awk '{if (condition) print $0}' file
it is the same as saying:
awk 'condition' file
because {print $0} is the default action performed when a condition evaluates to true.
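A quick sketch of that equivalence (my example):

```shell
# Both commands print the lines where the condition holds; the second
# relies on { print $0 } being the implicit action
printf '1\n2\n3\n' | awk '{if ($1 > 1) print $0}'
printf '1\n2\n3\n' | awk '$1 > 1'
# Each prints:
# -> 2
# -> 3
```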
Also, to let awk use a bash variable, you need to use awk -v var="$shell_var" and then use var internally.
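For example, a minimal sketch (the variable name and value are mine):

```shell
shell_var="hello"
# The shell expands "$shell_var"; awk receives its value as the awk variable var
awk -v var="$shell_var" 'BEGIN { print "got: " var }'
# -> got: hello
```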
All together, you should say something like:
while IFS= read -r line
do
awk -v var="$line" '$1 != var' FileB
done < fileA >> Result
But since you are looping through the file many times, it will print the lines many, many times. That's why you have to go all the way up to this answer and use grep -vf fileA fileB.
Arithmetic operations with awk
Your awk is behaving correctly; the problem is your locale setting, which currently uses , instead of . as the decimal point. That contradicts your data, so the string 0.5 will be treated as 0 in numerical operations, since the intended number would have been 0,5.
Use:
LC_ALL=C awk '{$1=1-$1}1' in_file > out_file
instead (or export LC_ALL=C in your environment to use that setting for all commands) and see https://unix.stackexchange.com/a/87763/133219 for information on locales and LC_ALL.
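A minimal sketch of the effect (my example): forcing the C locale makes awk parse . as the decimal point regardless of the environment's locale:

```shell
# In a locale where "," is the decimal point, "0.5" would parse as 0 and
# this would print 1; under LC_ALL=C it parses fully and prints 0.5
echo '0.5' | LC_ALL=C awk '{$1 = 1 - $1} 1'
# -> 0.5
```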