Awk to Read File as a Whole

Awk to read file as a whole

This is a gawk solution

From the docs:

There are times when you might want to treat an entire data file as a single record.
The only way to make this happen is to give RS a value that you know doesn’t occur in the input file.
This is hard to do in a general way, such that a program always works for arbitrary input files.


$ cat file
abcdefghijklmn
pqrstuvwxyzabc
defghijklmnopq

RS must be set to a pattern that does not occur in the file, following Denis Shirokov's suggestion in the docs (thanks @EdMorton):

$ gawk '{print ">>>"$0"<<<<"}' RS='^$' file
>>>abcdefghijklmn
pqrstuvwxyzabc
defghijklmnopq
<<<<

The key part of the explanation in the docs is:

It works by setting RS to ^$, a regular expression that will never
match if the file has contents. gawk reads data from the file into
tmp, attempting to match RS. The match fails after each read, but fails quickly, such that gawk fills tmp with the entire contents of the file


So:

$ gawk '{gsub(/\n/,"");print substr($0,8,10)}' RS='^$' file

Returns:

hijklmnpqr
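As an aside, gawk also ships a bundled readfile extension that loads an entire file into a string without the RS trick. A minimal sketch, assuming gawk 4.1+ with its bundled extensions installed (not part of the original answer):

gawk '
@load "readfile"
BEGIN {
    contents = readfile("file")       # whole file as one string; empty string and ERRNO set on failure
    gsub(/\n/, "", contents)
    print substr(contents, 8, 10)     # same result as above: hijklmnpqr
}'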

AWK: Reading all lines & manipulating one file ENTIRELY based on each line of another file

Like you discovered, Awk can really only process one line at a time. But we can turn things around and read the input file into memory, then loop over its lines repeatedly as we read the other file.

Your example has a comma and a space between the items in file1.txt but I assumed this is not a hard requirement, and so this script expects tab-delimited input instead.

awk -F "\t" 'BEGIN { split(":LSmall:Roman:LCaps", k, /:/) }
NR==FNR { a[NR] = $0; n=NR; next }
FNR==1 { next } # skip header
{
system("mkdir "$1)
filename=$1"/"$1".txt"
for(i=1; i<=n; i++) {
line = a[i]
for (j=2; j<=NF; ++j) {
if (line ~ k[j]) {
gsub(/here/, $j, line)
break
}
}
print line >>filename }
}' file2.txt file1.txt

The BEGIN block initializes an array with substitution key names k. To keep it in sync with the fields in file1.txt, the first item k[1] is empty (it doesn't specify a substitution key).

When NR==FNR we are reading the first input file. We simply collect its lines into the array a.

When we fall through, we are reading the second file, which is the mapping with directory names and substitutions. For each input line, we loop over all the lines in a and perform any substitution specified in the fields of the current line (as soon as one key matches, we consider ourselves done; maybe you want to change this so that multiple keys can trigger on the same line), and finally print the result to the specified output file.

You'll notice how we pull the first field and loop over the subsequent fields, looking up their corresponding key in k by index.

Demo: https://ideone.com/syTv99

If you want to do this on hundreds of files, perhaps refactor some or all of the surrounding loop out into a shell script and concentrate on the substitution actions in the Awk script. The shell can easily loop over the data in file1.txt just as well, which will simplify the Awk script somewhat and make the overall process easier to understand.

# Trim the obnoxious header
tail -n +2 file1.txt |
while read -r directory LSmall Roman LCaps; do
    mkdir "$directory"
    awk -v LSmall="$LSmall" -v Roman="$Roman" -v LCaps="$LCaps" '
        BEGIN { split("LSmall:Roman:LCaps", k, /:/)
                split(LSmall ":" Roman ":" LCaps, r, /:/) }
        {
            for (j=1; j<=3; ++j)
                if ($0 ~ k[j]) {
                    gsub(/here/, r[j])
                    break
                }
        }1' file2.txt >"$directory"/"$directory".txt
done

Demo: https://ideone.com/RUhsUS

How can I use awk to continuously read a file as it grows?

How can I use awk (and awk only) to continuously read a file as it grows?

Awk has no interface to do that - once it reaches EOF, the file is done. You could write your own extension that POSIX open()s the file, blocks on read() into a buffer until the record separator arrives, and exposes that as an input interface to awk (basically the same approach as the tail source code). See: https://www.gnu.org/software/gawk/manual/html_node/Dynamic-Extensions.html

How can I use awk to continuously read a file as it grows?

You would use tail -f together with awk:

tail -f file | awk 'script'
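For example, a hypothetical invocation that follows a growing log file and prints the first and last field of every line containing ERROR (the file name and pattern are made up for illustration):

tail -f /var/log/app.log | awk '/ERROR/ { print $1, $NF }'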

Using awk to parse sections of a text file

Any time you write a loop in shell just to manipulate text you have the wrong approach.

In this case, it LOOKS like all you really need for the whole thing is:

awk 'NF==1{out=$1".txt"} {print > out}' states.txt

If that's not it, please clarify. Oh, and with non-gawk you might need to add close(out) right before out=....
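A sketch of that portable variant, under the same assumption that a one-field line in states.txt starts a new section (the section-header line is still written to the new file, as in the original):

awk 'NF==1 { if (out != "") close(out); out = $1".txt" } { print > out }' states.txt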

Why is AWK printing the whole line in the default read-record-from-file action when I specify the fields to be printed

The answer is to remove the $1 from the end of the script - it is a leftover from running AWK from within a ksh script.

Is it possible to use awk to print all lines in a file and then do a command on a single column?

$ cat input.txt         
7051,95230163,-1,53200703
7051,95230163,-1,53200703
7051,95230163,-1,53200703
53200703,2286,Mon Jul 01 13:30:03 PDT 2013
53200703,2286,Mon Jul 01 13:30:03 PDT 2013
53200703,2286,Mon Jul 01 13:30:03 PDT 2013
$
$ cat trial.sh
gawk -F',' '
function hash(val, var) {
    if (val == "") {
        var = "None"
    }
    else {
        cmd = "echo \"" val "\" | openssl dgst -sha1"
        cmd | getline var
        close(cmd)
        sub(/.* /, "", var)
    }
    return var
}
{ printf "%s, Hash Value: %s\n", $0, hash($2) }
'
$
$ ./trial.sh < input.txt
7051,95230163,-1,53200703, Hash Value: c9b674deec9973f4d0feb83433d6db0b4ea5012a
7051,95230163,-1,53200703, Hash Value: c9b674deec9973f4d0feb83433d6db0b4ea5012a
7051,95230163,-1,53200703, Hash Value: c9b674deec9973f4d0feb83433d6db0b4ea5012a
53200703,2286,Mon Jul 01 13:30:03 PDT 2013, Hash Value: 2a8db89cc6f4ccdc0ce423011e869cb8b29b1003
53200703,2286,Mon Jul 01 13:30:03 PDT 2013, Hash Value: 2a8db89cc6f4ccdc0ce423011e869cb8b29b1003
53200703,2286,Mon Jul 01 13:30:03 PDT 2013, Hash Value: 2a8db89cc6f4ccdc0ce423011e869cb8b29b1003

Note that the above shells out to openssl for every line and pipes the command's output back into awk via getline.

Also, since your sample input contains many duplicates, this would probably be more efficient if it avoided the external command and pipe for duplicate key fields by storing the hash value the first time it is calculated and reusing it thereafter:

$ cat trial.sh
gawk -F',' '
function hash(val) {
    if ( !(val in map) ) {
        if (val == "") {
            map[val] = "None"
        }
        else {
            cmd = "echo \"" val "\" | openssl dgst -sha1"
            cmd | getline map[val]
            close(cmd)
            sub(/.* /, "", map[val])
        }
    }
    return map[val]
}
{ printf "%s, Hash Value: %s\n", $0, hash($2) }
'

And yes, of course you can use awk to print whatever you want from all files in a directory:

awk '{ print <whatever> }' /dir/*
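For instance, a hypothetical variant that prints the file name and the second comma-separated field of every line in every file in a directory:

awk -F',' '{ print FILENAME, $2 }' /dir/*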

Bash script using awk doesn't read the entire line, just the first column

Failing to wrap $line in double quotes causes the \t characters to be replaced with spaces, which in turn screws up the awk -F'\t'.

Consider:

$ line=$(head -1 system.log)

# double quoting ${line} maintains the \t characters:

$ echo "${line}" | od -c
0000000 2 \t c a m i l a \t c r e a t e
0000020 d b \n
0000023

# no (double) quoting of ${line} replaces the \t with spaces:

$ echo ${line} | od -c
0000000   2       c   a   m   i   l   a       c   r   e   a   t   e
0000020   d   b  \n
0000023

The issue is further compounded by how printf handles the unquoted ${line}, eg:

$ printf ${line}
2

$ printf "${line}"
2 camila create db

As for the whole while loop, and assuming the sole purpose of the while loop is to send the modified file contents to stdout (ie, you're not using ${line} for other bash-level operations), you could replace the whole thing with a single awk call, eg:

$ awk -F '\t' '{ print "Entry No. ", $1, ": ", $2, " (action: ", $3, ")" }' system.log
Entry No. 2 : camila (action: create db )
Entry No. 3 : andrew (action: create table )
Entry No. 5 : greg (action: update table )
Entry No. 6 : nataly (action: update view )
Entry No. 7 : greg (action: delete table )
Entry No. 9 : camila (action: update table )
Entry No. 11 : nataly (action: create view )
Entry No. 12 : peter (action: link table )
Entry No. 14 : andrew (action: update view )
Entry No. 15 : greg (action: update db )

NOTE: the extra spaces in the output are due to how the print command is built; separating the arguments with commas inserts the default OFS delimiter (a space) between them, while removing the commas generates:

$ awk -F '\t' '{ print "Entry No. " $1 ": " $2 " (action: " $3 ")" }' system.log
Entry No. 2: camila (action: create db)
Entry No. 3: andrew (action: create table)
Entry No. 5: greg (action: update table)
Entry No. 6: nataly (action: update view)
Entry No. 7: greg (action: delete table)
Entry No. 9: camila (action: update table)
Entry No. 11: nataly (action: create view)
Entry No. 12: peter (action: link table)
Entry No. 14: andrew (action: update view)
Entry No. 15: greg (action: update db)
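If you want full control over the spacing without juggling commas and OFS, printf is another option; a sketch using the same three fields:

awk -F '\t' '{ printf "Entry No. %s: %s (action: %s)\n", $1, $2, $3 }' system.log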

Using awk to print all columns from the nth to the last

Print all columns:

awk '{print $0}' somefile

Print all but the first column:

awk '{$1=""; print $0}' somefile

Print all but the first two columns:

awk '{$1=$2=""; print $0}' somefile

How to use the value in a file as input for a calculation in awk - in bash?

Would you please try the following:

awk -v OFS="\t" '
NR==FNR { # this block is executed in the 1st pass only
if (FNR > 1) sum[$1] += $3
# accumulate the "count" for each "SampleID"
next
}
# the following block is executed in the 2nd pass only
FNR > 1 { # skip the header line
if ($1 != prev_id) {
# SampleID has changed. then update the output filename and print the header line
if (outfile) close(outfile)
# close previous outfile
outfile = $1 "_summary"
print "ASV_ID", "ASV_in_sample", "total_ASVs_inSample", "treshold_for_30%", "ASV_over30%" >> outfile
prev_id = $1
}
mark = ($3 > sum[$1] * 0.3) ? 1 : 0
# set the mark to "1" if the "Count" exceeds 30% of sum
print $2, $3, sum[$1], sum[$1] * 0.3, mark >> outfile
# append the line to the summary file
}
' data.csv data.csv

data.csv:

SampleID    ASV    Count
1000A ASV_1216 14
1000A ASV_12580 150
1000A ASV_12691 260
1000A ASV_135 434
1000A ASV_147 79
1000A ASV_15 287
1000A ASV_16 361
1000A ASV_184 8
1000A ASV_19 42
1000B ASV_1 90
1000B ASV_2 90
1000B ASV_3 20
1000C ASV_4 100
1000C ASV_5 10
1000C ASV_6 10

In the following output examples, the last field ASV_over30% indicates 1 if the count exceeds 30% of the sum value.

1000A_summary:

ASV_ID  ASV_in_sample   total_ASVs_inSample     treshold_for_30%        ASV_over30%
ASV_1216 14 1635 490.5 0
ASV_12580 150 1635 490.5 0
ASV_12691 260 1635 490.5 0
ASV_135 434 1635 490.5 0
ASV_147 79 1635 490.5 0
ASV_15 287 1635 490.5 0
ASV_16 361 1635 490.5 0
ASV_184 8 1635 490.5 0
ASV_19 42 1635 490.5 0

1000B_summary:

ASV_ID  ASV_in_sample   total_ASVs_inSample     treshold_for_30%        ASV_over30%
ASV_1 90 200 60 1
ASV_2 90 200 60 1
ASV_3 20 200 60 0

1000C_summary:

ASV_ID  ASV_in_sample   total_ASVs_inSample     treshold_for_30%        ASV_over30%
ASV_4 100 120 36 1
ASV_5 10 120 36 0
ASV_6 10 120 36 0

[Explanations]

When calculating an aggregate such as the sum of the input data, we need to read through to the end of the data first. If we want to print each input record together with that aggregate (or other information derived from it), we need one of two tricks:

  • To store all input records in memory, or
  • To read the input data twice.

As awk is well suited to reading multiple files and changing the procedure depending on the order of the files, I have picked the 2nd method; for comparison, a sketch of the 1st method appears after the list below.

  • The condition NR==FNR returns TRUE while reading the 1st file only.
    We calculate the sum of the Count field within this block as the 1st pass.
  • The next statement at the end of the block skips the following code.
  • Once the 1st file is done, the script reads the 2nd file, which is of
    course the same as the 1st file.
  • While reading the 2nd file, the condition NR==FNR no longer returns
    TRUE and the 1st block is skipped.
  • The 2nd block reads the input again line by line, opens a summary file
    per SampleID, and adds information such as the sum obtained in the
    1st pass.
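For comparison, a minimal sketch of the 1st method (buffering every record in memory and doing all the work in the END block); it assumes the same data.csv layout with whitespace-separated fields, and is not part of the original answer:

awk -v OFS="\t" '
FNR > 1 { rec[++n] = $0; sum[$1] += $3 }      # buffer each data line and accumulate per-sample sums
END {
    for (i = 1; i <= n; i++) {
        split(rec[i], f, " ")                 # f[1]=SampleID, f[2]=ASV, f[3]=Count
        outfile = f[1] "_summary"
        if (!(f[1] in seen)) {                # first record of this sample: print the header line
            seen[f[1]] = 1
            print "ASV_ID", "ASV_in_sample", "total_ASVs_inSample", "treshold_for_30%", "ASV_over30%" >> outfile
        }
        mark = (f[3] > sum[f[1]] * 0.3) ? 1 : 0
        print f[2], f[3], sum[f[1]], sum[f[1]] * 0.3, mark >> outfile
    }
}' data.csv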

