Bash Script to Find The Frequency of Every Letter in a File

Bash script to find the frequency of every letter in a file

Just one awk command is enough:

awk -vFS="" '{for(i=1;i<=NF;i++)w[$i]++}END{for(i in w) print i,w[i]}' file

If you want a case-insensitive count, add tolower():

awk -vFS="" '{for(i=1;i<=NF;i++)w[tolower($i)]++}END{for(i in w) print i,w[i]}' file

And if you want only letters:

awk -vFS="" '{for(i=1;i<=NF;i++){ if($i~/[a-zA-Z]/) { w[tolower($i)]++} } }END{for(i in w) print i,w[i]}' file

And if you want only digits, change /[a-zA-Z]/ to /[0-9]/.

If you do not want multi-byte Unicode characters treated as single characters, run export LC_ALL=C first so the input is handled byte by byte.
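
To illustrate the output format, here is a minimal run of the letters-only variant against a bash here-string (the <<< redirection assumes bash or a compatible shell); for (i in w) iterates in an unspecified order, so your output may be ordered differently:

awk -vFS="" '{for(i=1;i<=NF;i++){ if($i~/[a-zA-Z]/) { w[tolower($i)]++} } }END{for(i in w) print i,w[i]}' <<< "Hello, World"

d 1
e 1
h 1
l 3
o 2
r 1
w 1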

How to create a frequency list of every word in a file?

Not sed and grep, but tr, sort, uniq, and awk:

% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF

a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1

In most cases you also want to remove numbers and punctuation, convert everything to lowercase (otherwise "THE", "The" and "the" are counted separately) and suppress an entry for a zero length word. For ASCII text you can do all these with this modified command:

sed -e 's/[^A-Za-z]/ /g' text.txt | tr 'A-Z' 'a-z' | tr ' ' '\n' | grep -v '^$' | sort | uniq -c | sort -rn
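
Applied to the three-line sample text from the previous answer (saved here as text.txt, an assumed filename), the pipeline gives something like the following; the relative order of words with equal counts depends on your sort implementation:

sed -e 's/[^A-Za-z]/ /g' text.txt | tr 'A-Z' 'a-z' | tr ' ' '\n' | grep -v '^$' | sort | uniq -c | sort -rn | head -5

      3 words
      2 the
      2 some
      2 of
      2 appear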

How can I count the frequency of letters

Counting characters in strings can easily be done with awk. To do this, you make use of the function gsub:

gsub(ere, repl[, in])
Behave like sub (see below), except that it shall replace all occurrences of the regular expression (like the ed utility global substitute) in $0 or in the in argument when specified.

sub(ere, repl[, in ])
Substitute the string repl in place of the first instance of the extended regular expression ERE in string in and return the number of substitutions. <snip> If in is omitted, awk shall use the current record ($0) in its place.

source: the POSIX awk specification
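
The counting trick below leans on that return value: gsub deletes every occurrence of a character and reports how many it removed. A quick illustration (the input string is just an example):

echo "banana" | awk '{ n = gsub(/a/, "", $0); print n, $0 }'
# => 3 bnn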

The following two functions perform the counting in this way:

function countCharacters(str) {
    # grab the first character, delete all of its occurrences from str,
    # and add the number of deletions to that character's counter
    while (str != "") { c = substr(str, 1, 1); a[toupper(c)] += gsub(c, "", str) }
}

or if there might appear a lot of equal consecutive characters, the following solution might shave off a couple of seconds.

function countCharacters2(str) {
    n = length(str)
    # delete whole runs of the first character in one gsub call and derive
    # the count from how much the string shrank
    while (str != "") {
        c = substr(str, 1, 1); gsub(c "+", "", str)
        m = length(str); a[toupper(c)] += n - m; n = m
    }
}

Below you find four implementations based on the first function. The first two run on a standard awk, the latter two on bioawk, a version optimized for FASTA files:

1. Read the sequence and process it line by line:

awk '!/^>/{s=$0; while(s!="") { c=substr(s,1,1); a[c]+=gsub(c,"",s) } }
END {for(c in a) print c,a[c]}' file

2. Concatenate all sequences and process them at the end:

awk '!/^>/{s=s $0 }
END {while(s!="") { c=substr(s,1,1); a[c]+=gsub(c,"",s) }
for(c in a) print c,a[c]}' file

3. Same as 1, but using bioawk:

bioawk -c fastx '{while ($seq!=""){ c=substr($seq,1,1);a[c]+=gsub(c,"",$seq) } }
END{ for(c in a) print c,a[c] }' file

4. Same as 2, but using bioawk:

bioawk -c fastx '{s=s $seq}
END { while(s!="") { c=substr(s,1,1); a[c]+=gsub(c,"",s) }
for(c in a) print c,a[c]}' file

Here are some timing results, based on a test FASTA file:

OP             : grep,sort,uniq : 47.548 s
EdMorton 1     : awk            : 39.992 s
EdMorton 2     : awk,sort,uniq  : 53.965 s
kvantour 1     : awk            : 18.661 s
kvantour 2     : awk            :  9.309 s
kvantour 3     : bioawk         :  1.838 s
kvantour 4     : bioawk         :  1.838 s
karafka        : awk            : 38.139 s
stack0114106 1 : perl           : 22.754 s
stack0114106 2 : perl           : 13.648 s
stack0114106 3 : perl (zdim)    :  7.759 s

Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Alfred Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is POSIX compatible.

Bash script frequency analysis of unique letters and repeating letter pairs, how should I build this script?

TL;DR, to be frank. To the only question I've actually found, the answer is yes :) Please split it into smaller tasks and we'll be happy to assist, if you don't find the answers to those smaller questions first.

If you can put it in pseudocode, it would be easier. There are all kinds of text-manipulation tools in Unix; which ones to use depends on how big your texts are. I assume they are not that big, or you would have used a compiled language.

For example, the easy but costly gawk way to count frequencies:

awk -F "" '{for(i=1;i<=NF;i++) freq[$i]++;}END{for(i in freq) printf("%c %d\n", i, freq[i]);}'

As for transliterating, there is the tr utility. You can forge the translation strings and then pass them to it for each case (that holds for Caesar-like ciphers).
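
For example, a Caesar shift of three can be forged as a pair of range strings for tr (the shift amount and the message are only illustrative):

echo "ATTACK AT DAWN" | tr 'A-Za-z' 'D-ZA-Cd-za-c'
# => DWWDFN DW GDZQ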

frequency count for file column in bash

You can just use awk to do this:

awk -F '|' '{freq[$8]++} END{for (i in freq) print freq[i], i}' file

This awk command uses | as the field delimiter and an array freq keyed on $8. Each time it sees a given $8 value, it increments that key's frequency (the value) by 1.
By the way, you need to add the custom delimiter | to your own command as well and use it like this:

awk -F '|' '{print $8}' file | sort | uniq -c
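
As a quick illustration, assume a small pipe-delimited file sample.txt (a hypothetical name) whose eighth field is a colour:

a|b|c|d|e|f|g|red
a|b|c|d|e|f|g|blue
a|b|c|d|e|f|g|red

The awk version then prints the counts (the for (i in freq) output order is unspecified):

awk -F '|' '{freq[$8]++} END{for (i in freq) print freq[i], i}' sample.txt
2 red
1 blue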

Shell script to show frequency of each word in file and in a directory

The major limitation is that the script assumes there is exactly one word per line. c[$1]++ just increments the occurrence of the first field of each line.

The question didn't mention anything about the number of words in a line, so I'd assume this wasn't the intention - you need to go through each word in a line. Also, what about empty lines? With an empty line, $1 will be the empty string, so your script will end up counting "empty" words (which it will happily show as part of the output).

In awk, the number of fields in a line is stored in the built-in variable NF; thus it is easy to write code to loop through the words and increment the corresponding count (and it has the nice side effect of implicitly ignoring lines without words).

So, I would do something like this instead:

find . -type f -exec cat '{}' \; | awk '{ for (i = 1; i <= NF; i++) w[$i]++ } END { for (i in w) printf("%-10s %10d\n", i, w[i]) }'

I removed the directory-name constraints in the argument to find(1) for the sake of conciseness, and to make it more general.

This is (probably) the main issue with your solution, but the question is (intentionally) vague and there are many things left to discuss:

  • Is it case-sensitive? This solution treats World and world as different words. Is this desired? (One way to fold case, and strip punctuation as well, is sketched right after this list.)
  • What about punctuation? Should hello and hello! be treated as the same word? What about commas? That is, do we need to parse and ignore punctuation?
  • Speaking of which - what about things like what's vs. what? Do we consider them different words? And it's vs. its? English is tricky!
  • Most important of all (and related to the points above), what exactly defines a word? We assumed a word is a sequence of non-blanks (the default in awk). Is this accurate?
  • If there are no words in the input, what do we do? This solution prints nothing - maybe we should print a warning message?
  • Is there a fixed number of words in a line? Or is it arbitrary? (E.g. if there's exactly one word per line, your solution would be enough)
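
If you decide that case should be folded and punctuation ignored, one possible normalisation pass (a sketch built on the same find | awk idea, not the only reasonable definition of a word) is:

find . -type f -exec cat '{}' \; |
  tr '[:upper:]' '[:lower:]' |
  tr -cs '[:alpha:]' '\n' |
  awk 'NF { w[$1]++ } END { for (i in w) printf("%-10s %10d\n", i, w[i]) }'

The second tr squeezes every run of non-letters into a single newline, so digits and punctuation disappear and each remaining line holds exactly one lowercase word; the NF guard skips the empty line that can appear at the start of the stream.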

FWIW, always remember that your success in an interview is not a binary yes/no. It's not like: Oops, you can't do X, so I'm going to reject you. Or: Oops, wrong answer, you're out. More important than the answer is the process that gets you there, and whether or not you are aware of (a) the assumptions you made; and (b) your solution's limitations. The questions above show the ability to consider edge cases, the ability to clarify assumptions and requirements, etc., which is way more important than getting the "right" script (and probably there's no such thing as The Right Script).

Write the frequency of each number in column into next column in bash

Using awk is overkill here IMO; the standard sort and uniq tools will do the job just fine:

sort -n file | uniq -c | sort

Output:

1 0.32832977
2 0.31447645
4 0.27645657
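
If you literally want the frequency written next to each original value as an extra column (rather than the collapsed summary above), a two-pass awk sketch along these lines would do it; the file is named twice on purpose, once to build the counts and once to print them:

awk 'NR==FNR { f[$1]++; next } { print $1, f[$1] }' file file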

How can I use the UNIX shell to count the number of times a letter appears in a text file?

grep -o 'char' filename | wc -l

Here char is the letter you want to count: -o prints each match on its own line, and wc -l counts those lines.
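
For example, counting the letter e in a bash here-string (the input is only an illustration):

grep -o 'e' <<< "cheese" | wc -l
# => 3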

How to count frequency of a word without counting compound words in bash?

With GNU grep, you can use the following command to count occurrences of the word and that are not attached to hyphens:

grep -ioP '\b(?<!-)and\b(?!-)' "$1" | wc -l

Details:

  • -P option enables the PCRE regex syntax
  • \b(?<!-)and\b(?!-) matches
    • \b - a word boundary
    • (?<!-) - a negative lookbehind that fails the match if there is a hyphen immediately to the left of the current location
    • and - a fixed string
    • \b - a word boundary
    • (?!-) - a negative lookahead that fails the match if there is a hyphen immediately to the right of the current location.

See the demo below:

#!/bin/bash
s='jerry-and-jeorge, and, aNd, And.'
grep -ioP '\b(?<!-)and\b(?!-)' <<< "$s" | wc -l
# => 3 (not 4)

generating frequency table from file

You mean you want a count of how many times each item appears in the input file? First sort it (using -n if the input is always numbers, as in your example), then count the unique results.

sort -n input.txt | uniq -c
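
A quick demonstration with a few numbers on standard input (the printf values are only an example; uniq -c left-pads the counts):

printf '3\n1\n3\n2\n3\n' | sort -n | uniq -c
      1 1
      1 2
      3 3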

