Awk: Word frequency from one text file, how to output into myFile.txt?

Your pipeline isn't very efficient; you should do the whole thing in awk instead:

awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" file > myfile
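Note that a multi-character, regex-style RS is a gawk extension (POSIX awk only honors the first character of RS). If you need portability to other awks, a minimal sketch that loops over the fields of each line gives the same counts:

awk '{for (i = 1; i <= NF; i++) a[$i]++} END {for (k in a) print a[k], k}' file > myfile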

If you want the output in sorted order:

awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" file | sort > myfile

The actual output given by your pipeline is:

$ tr ' ' '\n' < file | sort | uniq -c | awk '{print $2"@"$1}'
Bastard@1
But@2
Esope@1
holly@1
is@2
the@1
where@2

Note: using cat is useless here; we can just redirect the input with <. The awk script doesn't make sense either: it just reverses the order of the words and their frequencies and separates them with an @. If we drop the awk script, the output is closer to the desired output (note the leading spaces, however, and that it's unsorted):

$ tr ' ' '\n' < file | sort | uniq -c 
      1 Bastard
      2 But
      1 Esope
      1 holly
      2 is
      1 the
      2 where

We could sort again and remove the leading spaces with sed:

$ tr ' ' '\n' < file | sort | uniq -c | sort | sed 's/^\s*//'
1 Bastard
1 Esope
1 holly
1 the
2 But
2 is
2 where
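As an aside, \s in sed is a GNU extension; if you need to stay POSIX-portable, the equivalent character class works everywhere:

tr ' ' '\n' < file | sort | uniq -c | sort | sed 's/^[[:space:]]*//'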

But as I mentioned at the start, just let awk handle it:

$ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" file | sort
1 Bastard
1 Esope
1 holly
1 the
2 But
2 is
2 where

Awk: Character frequency from one text file?

One method:

$ grep -o '\S' file | awk '{a[$1]++}END{for(k in a)print a[k],k}' 
3 옥
4 h
2 u
2 i
3 B
5 !
2 w
4 爸
1 군
4 지
1 y
2 l
1 E
1 會
2 你
1 是
2 a
1 不
2 이
2 o
1 p
2 的
1 d
1 생
3 r
6 e
4 s
1 我
4 t

Use redirection to save the output to a file:

$ grep -o '\S' file | awk '{a[$1]++}END{for(k in a)print a[k],k}' > output

And for sorted output:

$ grep -o '\S' file | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort > output
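If you would rather skip grep entirely, setting an empty field separator makes awk split each record into individual characters. Note that FS="" is an extension (supported by gawk, mawk and BWK awk, but not required by POSIX), so treat this as a sketch:

awk 'BEGIN {FS = ""} {for (i = 1; i <= NF; i++) if ($i !~ /[[:space:]]/) a[$i]++} END {for (k in a) print a[k], k}' file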

How to create a frequency list of every word in a file?

Not sed and grep, but tr, sort, uniq, and awk:

% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF

a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1

In most cases you also want to remove numbers and punctuation, convert everything to lowercase (otherwise "THE", "The" and "the" are counted separately), and suppress entries for zero-length words. For ASCII text you can do all of this with this modified command:

sed -e 's/[^A-Za-z]/ /g' text.txt | tr 'A-Z' 'a-z' | tr ' ' '\n' | grep -v '^$' | sort | uniq -c | sort -rn
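The same normalization can be done in a single awk process; here is a sketch (ASCII-only, like the sed version above):

awk '{
    line = tolower($0)          # fold case so "THE", "The" and "the" match
    gsub(/[^a-z]/, " ", line)   # replace digits and punctuation with spaces
    n = split(line, words, " ") # default splitting skips empty tokens
    for (i = 1; i <= n; i++) count[words[i]]++
} END {for (w in count) print count[w], w}' text.txt | sort -rn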

Add frequency (number of occurrences) to my table of text through awk

Just read the file twice: first to count the values and store the counts in arrays, then to print each line together with its counts:

$ awk 'FNR==NR {col1[$1]++; col2[$2]++; next} {print $0, col2[$2] "/" col1[$1]}' file file
pac1 xxx 2/3
pac1 yyy 1/3
pac1 zzz 3/3
pac2 xxx 2/2
pac2 uuu 2/2
pac3 zzz 3/2
pac3 uuu 2/2
pac4 zzz 3/1

The FNR==NR {things; next} construct is a trick to do things only while reading the first file. It relies on FNR and NR: FNR is the record (line) number within the current input file, while NR is the total number of records read so far across all files. This makes FNR==NR true only while the first file is being read. By adding next we skip the remaining rules and jump to the next line.
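To see the difference, here is a quick demonstration (assuming two hypothetical 2-line files named a and b):

$ awk '{print FILENAME, FNR, NR}' a b
a 1 1
a 2 2
b 1 3
b 2 4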

Find more info in Idiomatic awk.


Regarding your update: if you want the last item to contain the count of distinct values in the first column, just check the length of the array that was created. This tells you how many different indexes it contains, and hence the value you want:

$ awk 'FNR==NR {col1[$1]++; col2[$2]++; next} {print $0, col2[$2] "/" length(col1)}' file file
pac1 xxx 2/4
pac1 yyy 1/4
pac1 zzz 3/4
pac2 xxx 2/4
pac2 uuu 2/4
pac3 zzz 3/4
pac3 uuu 2/4
pac4 zzz 3/4
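Note that calling length() on an array is a gawk extension. If you need portability, here is a sketch that counts the distinct keys itself while reading the first file:

awk 'FNR==NR {if (!($1 in col1)) n1++; col1[$1]++; col2[$2]++; next} {print $0, col2[$2] "/" n1}' file file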

Awk: How to work on multiple .txt files in a folder and subfolders?

You can use the find command for that. Like this:

find . -iname '*.txt' -exec cat {} \; | grep -o '\w*' | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort

I'm using the -exec option to cat every *.txt file in the current directory and its subdirectories. The output is then piped into your grep|awk|sort pipeline.
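If there are many files, -exec cat {} + is worth knowing: find then passes as many file names as possible to each cat invocation instead of starting one process per file:

find . -iname '*.txt' -exec cat {} + | grep -o '\w*' | awk '{a[$1]++}END{for(k in a)print a[k],k}' | sort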

How to analyze frequency of characters in a text file

I've solved my problem with the code below:

cut -c 1-5 <abc.txt | sort | uniq -cd | sort -nbr > pre5.txt
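Note that this counts duplicated 5-character line prefixes rather than single characters, since uniq -cd reports only duplicated lines. For a true per-character frequency, as the heading suggests, a minimal sketch with GNU grep (-o puts each match, here every single character, on its own line) is:

grep -o . abc.txt | sort | uniq -c | sort -nbr > freq.txt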

Awk: What's wrong with CJK characters? #Korean

A single awk script can handle this easily and will be far more efficient than your current pipeline:

$ awk '{a[$1]++}END{for(k in a)print k,a[k]}' RS=" |\n" file 
옥 3
Bastard 1
! 5
爸 4
군 1
지 4
But 2
會 1
你 2
the 1
是 1
不 1
이 2
Esope 1
的 2
holly 1
where 2
생 1
我 1
is 2

If you want to store the results in another file, you can use redirection:

$ awk '{a[$1]++}END{for(k in a)print k,a[k]}' RS=" |\n" file > outfile

How to print words that contain only letters?

To get the words that contain letters only (tr -c complements the [:alnum:] set, so everything else becomes a newline, and -s squeezes runs of newlines into one):

$ tr -cs '[:alnum:]' '[\n*]' <file | grep -E '^[[:alpha:]]+$'
ble
ach
cop
alo

To get your desired output:

$ tr -cs '[:alnum:]' '[\n*]' <file |
grep -E '^[[:alpha:]]+$' |
sort |
paste -sd ' ' -

How can I extract a predetermined range of lines from a text file on Unix?

sed -n '16224,16482p;16483q' filename > newfile

From the sed manual:

p -
Print out the pattern space (to the standard output). This command is usually only used in conjunction with the -n command-line option.

n -
If auto-print is not disabled, print the pattern space, then, regardless, replace the pattern space with the next line of input. If
there is no more input then sed exits without processing any more
commands.

q -
Exit sed without processing any more commands or input.
Note that the current pattern space is printed if auto-print is not disabled with the -n option.

and

Addresses in a sed script can be in any of the following forms:

number
Specifying a line number will match only that line in the input.

An address range can be specified by specifying two addresses
separated by a comma (,). An address range matches lines starting from
where the first address matches, and continues until the second
address matches (inclusively).
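For comparison, here is an awk equivalent that likewise stops reading after the last wanted line:

awk 'NR >= 16224 {print} NR == 16482 {exit}' filename > newfile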

Count the number of errors in a file

Another, similar approach, one that simply brute-forces the output format as opposed to setting the output record separator, could look like this:

awk -F'|' '
NR > 1 { gsub(/ *$/,"",$2 ); a[$2]++ }
END { for (i in a) {
n = n + a[i]
printf "%-4s: %d\n", i, a[i]
}
printf "Total number of errors : %d\n", n}
' errors

Where, for all records after the first (skipping the heading record), trailing spaces are removed from the second field, which is then used as the index into array a[], incrementing the count at that element.

In the END rule, you just loop over all indexes in the array, outputting the symbol and the number of associated errors. The errors are summed into n in the same loop.

Example Use/Output

With your input in the file errors, you can just select-copy the expression above and middle-mouse-paste in a terminal to check the result, e.g.

$ awk -F'|' '
> NR > 1 { gsub(/ *$/,"",$2 ); a[$2]++ }
> END { for (i in a) {
> n = n + a[i]
> printf "%-4s: %d\n", i, a[i]
> }
> printf "Total number of errors : %d\n", n}
> ' errors
 CN : 1
 NS : 1
 PQ : 2
 TNP: 1
 TP : 1
Total number of errors : 6

(Note: a leading space is left before each of the symbols in the output. If you do not want it there, then substr() as used by @Cyrus will remove it without any fuss. Or you can simply extend the gsub() regex to remove leading spaces as well.)
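For example, here is a sketch of that last suggestion, with the gsub() regex extended to strip leading spaces too:

awk -F'|' '
NR > 1 { gsub(/^ +| +$/, "", $2); a[$2]++ }
END { for (i in a) {
          n += a[i]
          printf "%-4s: %d\n", i, a[i]
      }
      printf "Total number of errors : %d\n", n }
' errors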

The formatting is handled by the printf() format strings alone. But pay attention to the special variables noted in the link by @Cyrus; they can provide a shorter and much more elegant solution in complex cases.

Let me know if you have further questions.


