Bash script to find the frequency of every letter in a file
Just one awk command:
awk -vFS="" '{for(i=1;i<=NF;i++)w[$i]++}END{for(i in w) print i,w[i]}' file
If you want it case-insensitive, add tolower():
awk -vFS="" '{for(i=1;i<=NF;i++)w[tolower($i)]++}END{for(i in w) print i,w[i]}' file
And if you want only letters:
awk -vFS="" '{for(i=1;i<=NF;i++){ if($i~/[a-zA-Z]/) { w[tolower($i)]++} } }END{for(i in w) print i,w[i]}' file
And if you want only digits, change /[a-zA-Z]/ to /[0-9]/.
If you do not want multi-byte Unicode characters treated as single letters, run export LC_ALL=C first.
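As a small illustration of the digit-only variant, here is a made-up sample run; since awk's for (i in w) iterates in unspecified order, piping through sort makes the output deterministic:

```shell
# Count digit frequencies in a sample string (letters are skipped);
# sort makes the otherwise-unspecified output order stable.
printf 'abc123321\n' |
  awk -vFS="" '{for(i=1;i<=NF;i++) if($i~/[0-9]/) w[$i]++} END{for(i in w) print i,w[i]}' |
  sort
# 1 2
# 2 2
# 3 2
```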
How to create a frequency list of every word in a file?
Not sed and grep, but tr, sort, uniq, and awk:
% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF
a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1
In most cases you also want to remove numbers and punctuation, convert everything to lowercase (otherwise "THE", "The" and "the" are counted separately), and suppress entries for zero-length words. For ASCII text you can do all of this with this modified command:
sed -e 's/[^A-Za-z]/ /g' text.txt | tr 'A-Z' 'a-z' | tr ' ' '\n' | grep -v '^$'| sort | uniq -c | sort -rn
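For instance, feeding a short invented line through the same pipeline (reading stdin instead of text.txt):

```shell
# Punctuation becomes spaces, everything is lowercased, one word per line,
# empty lines are dropped, then counts are produced in descending order.
printf 'The cat, the THE.\n' |
  sed -e 's/[^A-Za-z]/ /g' | tr 'A-Z' 'a-z' | tr ' ' '\n' |
  grep -v '^$' | sort | uniq -c | sort -rn
# counts: 3 the, 1 cat
```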
How can I count the frequency of letters
Counting characters in strings can easily be done with awk. To do this, you make use of the function gsub:
gsub(ere, repl[, in])
Behave like sub (see below), except that it shall replace all occurrences of the regular expression (like the ed utility global substitute) in $0 or in the in argument when specified.
sub(ere, repl[, in])
Substitute the string repl in place of the first instance of the extended regular expression ERE in string in and return the number of substitutions. <snip> If in is omitted, awk shall use the current record ($0) in its place.
Source: the POSIX awk standard
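The key detail is that gsub returns the number of substitutions it made, which is what makes the counting trick work. A minimal illustration:

```shell
# gsub both strips every "a" from the string and reports how many it removed.
awk 'BEGIN{ s = "banana"; n = gsub(/a/, "", s); print n, s }'
# prints: 3 bnn
```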
The following two functions perform the counting in this way:
function countCharacters(str) {
while(str != "") { c=substr(str,1,1); a[toupper(c)]+=gsub(c,"",str) }
}
Or, if there are likely to be long runs of identical consecutive characters, the following variant might shave off a couple of seconds:
function countCharacters2(str) {
n=length(str)
while(str != "") { c=substr(str,1,1); gsub(c"+","",str);
m=length(str); a[toupper(c)]+=n-m; n=m
}
}
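A minimal self-contained driver for the first function might look like this (the sample input is made up; note that toupper is a function call, so the counts are folded to upper case):

```shell
# Feed one sample line through countCharacters and print per-letter counts;
# sort makes the "for (c in a)" output order deterministic.
printf 'AaBb\n' | awk '
function countCharacters(str) {
    while (str != "") { c = substr(str, 1, 1); a[toupper(c)] += gsub(c, "", str) }
}
{ countCharacters($0) }
END { for (c in a) print c, a[c] }' | sort
# A 2
# B 2
```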
Below you find four implementations based on the first function. The first two run on standard awk, the latter two on bioawk, an awk version optimized for FASTA files:
1. Read the sequence and process it line by line:
awk '!/^>/{s=$0; while(s!="") { c=substr(s,1,1); a[c]+=gsub(c,"",s) } }
END {for(c in a) print c,a[c]}' file
2. Concatenate all sequences and process them at the end:
awk '!/^>/{s=s $0 }
END {while(s!="") { c=substr(s,1,1); a[c]+=gsub(c,"",s) }
for(c in a) print c,a[c]}' file
3. Same as 1, but using bioawk:
bioawk -c fastx '{while ($seq!=""){ c=substr($seq,1,1);a[c]+=gsub(c,"",$seq) } }
END{ for(c in a) print c,a[c] }' file
4. Same as 2, but using bioawk:
bioawk -c fastx '{s=s $seq}
END { while(s!="") { c=substr(s,1,1); a[c]+=gsub(c,"",s) }
for(c in a) print c,a[c]}' file
Here are some timing results based on this fasta-file:
OP : grep,sort,uniq : 47.548 s
EdMorton 1 : awk : 39.992 s
EdMorton 2 : awk,sort,uniq : 53.965 s
kvantour 1 : awk : 18.661 s
kvantour 2 : awk : 9.309 s
kvantour 3 : bioawk : 1.838 s
kvantour 4 : bioawk : 1.838 s
karafka : awk : 38.139 s
stack0114106 1: perl : 22.754 s
stack0114106 2: perl : 13.648 s
stack0114106 3: perl (zdim) : 7.759 s
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.
Bash script frequency analysis of unique letters and repeating letter pairs how should i build this script?
TL;DR, to be frank. To the only question I've found, the answer is yes :) Please split it into smaller tasks, and we'll be happy to assist, if you haven't already found the answers to those smaller questions.
If you can put it out in pseudocode, it would be easier. There's all kinds of text-manipulating stuff in Unix. The means to employ depend on how big your texts are. I believe they are not so big, or you would have used some compiled language.
For example, the easy but costly gawk way to count frequencies:
awk -F "" '{for(i=1;i<=NF;i++) freq[$i]++;}END{for(i in freq) printf("%c %d\n", i, freq[i]);}'
As for transliterating, there is the tr utility. You can forge the needed strings for each case and then pass them to it (that holds true for Caesar-like ciphers).
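For instance, a Caesar shift of three can be forged as a pair of character ranges for tr (sample word invented here):

```shell
# Rotate each lowercase letter forward by 3; x, y, z wrap around to a, b, c.
printf 'hello\n' | tr 'a-z' 'd-za-c'
# prints: khoor
```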
frequency count for file column in bash
You can just use awk to do this:
awk -F '|' '{freq[$8]++} END{for (i in freq) print freq[i], i}' file
This awk command uses | as the delimiter and an array freq keyed on $8. Each time it sees a value in $8, it increments that key's frequency (value) by 1.
Btw, you need to add the custom delimiter | in your own command and use it like this:
awk -F '|' '{print $8}' file | sort | uniq -c
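As a made-up illustration with only three '|'-separated fields (counting field 3 instead of 8), the awk approach behaves like this:

```shell
# Count how often each value appears in field 3 of a pipe-delimited stream,
# then list counts in descending order.
printf 'a|b|x\nc|d|y\ne|f|x\n' |
  awk -F '|' '{freq[$3]++} END{for (i in freq) print freq[i], i}' |
  sort -rn
# 2 x
# 1 y
```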
Shell script to show frequency of each word in file and in a directory
The major limitation is that the script assumes there is exactly one word per line: c[$1]++ just increments the occurrence count of the first field of each line.
The question didn't mention anything about the number of words in a line, so I'd assume this wasn't the intention - you need to go through each word in a line. Also, what about empty lines? With an empty line, $1 will be the empty string, so your script will end up counting "empty" words (which it will happily show as part of the output).
In awk, the number of fields in a line is stored in the built-in variable NF; thus it is easy to write code to loop through the words and increment the corresponding count (and it has the nice side effect of implicitly ignoring lines without words).
So, I would do something like this instead:
find . -type f -exec cat '{}' \; | awk '{ for (i = 1; i <= NF; i++) w[$i]++ } END { for (i in w) printf("%-10s %10d\n", i, w[i]) }'
I removed the directory-name constraints in the argument to find(1) for the sake of conciseness, and to make it more general.
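Stripped of the find part, the awk half of that pipeline can be sketched on a made-up two-line input; the exact column widths come from the printf format, and sort stabilizes the unspecified "for (i in w)" order:

```shell
# Count every word across all input lines and print word/count columns.
printf 'foo bar\nfoo\n' |
  awk '{ for (i = 1; i <= NF; i++) w[$i]++ }
       END { for (i in w) printf("%-10s %10d\n", i, w[i]) }' |
  sort
```

This prints "bar" with a count of 1 and "foo" with a count of 2, padded into the two columns.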
This is (probably) the main issue with your solution, but the question is (intentionally) vague and there are many things left to discuss:
- Is it case-sensitive? This solution treats World and world as different words. Is this desired?
- What about punctuation? Should hello and hello! be treated as the same word? What about commas? That is, do we need to parse and ignore punctuation?
- Speaking of which - what about things like what's vs. what? Do we consider them different words? And it's vs. its? English is tricky!
- Most important of all (and related to the points above), what exactly defines a word? We assumed a word is a sequence of non-blanks (the default in awk). Is this accurate?
- If there are no words in the input, what do we do? This solution prints nothing - maybe we should print a warning message?
- Is there a fixed number of words in a line? Or is it arbitrary? (E.g. if there's exactly one word per line, your solution would be enough)
FWIW, always remember that your success in an interview is not a binary yes/no. It's not like: Oops, you can't do X, so I'm going to reject you. Or: Oops, wrong answer, you're out. More important than the answer is the process that gets you there, and whether or not you are aware of (a) the assumptions you made; and (b) your solution's limitations. The questions above show ability to consider edge cases, ability to clarify assumptions and requirements, etc, which is way more important than getting the "right" script (and probably there's no such thing as The Right Script).
Write the frequency of each number in column into next column in bash
Using awk is overkill here, IMO; the built-in tools will do the job just fine:
sort -n file | uniq -c | sort
Output:
1 0.32832977
2 0.31447645
4 0.27645657
How can I use the UNIX shell to count the number of times a letter appears in a text file?
grep -o 'char' filename | wc -l
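For example, counting the letter a in an invented sample word:

```shell
# -o prints each match on its own line, so wc -l counts the occurrences (3 here).
printf 'banana\n' | grep -o 'a' | wc -l
```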
How to count frequency of a word without counting compound words in bash?
With GNU grep, you can use the following command to count occurrences of the word "and" that are not enclosed with hyphens:
grep -ioP '\b(?<!-)and\b(?!-)' "$1" | wc -l
Details:
- The P option enables the PCRE regex syntax.
- \b(?<!-)and\b(?!-) matches:
- \b - a word boundary
- (?<!-) - a negative lookbehind that fails the match if there is a hyphen immediately to the left of the current location
- and - a fixed string
- \b - a word boundary
- (?!-) - a negative lookahead that fails the match if there is a hyphen immediately to the right of the current location.
See the online demo:
#!/bin/bash
s='jerry-and-jeorge, and, aNd, And.'
grep -ioP '\b(?<!-)and\b(?!-)' <<< "$s" | wc -l
# => 3 (not 4)
generating frequency table from file
You mean you want a count of how many times an item appears in the input file? First sort it (using -n if the input is always numbers, as in your example), then count the unique results.
sort -n input.txt | uniq -c