Calculate Word Occurrences from File in Bash

Count occurrences of a list of words in a text file

You can use fgrep (equivalent to grep -F, fixed-string matching) to do this efficiently:

fgrep -of f1.txt f2.txt | sort | uniq -c | awk '{print $2 " " $1}'

Gives this output:

apple 3
cat 1
dog 2
  • fgrep -of f1.txt f2.txt extracts every matching part (the -o option) of f2.txt based on the patterns listed in f1.txt
  • sort | uniq -c counts how often each pattern matched
  • finally, awk swaps the order of the two columns in the uniq -c output; a concrete pair of input files is sketched below
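
For concreteness, here is a pair of input files (hypothetical contents, not from the original answer) that would produce the output above:

$ cat f1.txt
apple
cat
dog
$ cat f2.txt
an apple and a dog
another apple, a cat, and a dog
one more apple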

Calculate Word Occurrences from File in Bash

Well, I'm not sure I've understood exactly what you are trying to do, but I would do it this way:

while read -r file
do
    # normalize each file to one lowercase word per line, then count
    tr -cs "A-Za-z'" '\n' < "$file" | tr 'A-Z' 'a-z' | sort | uniq -c > "stat.$file"
done < file-list

Now you have statistics for all your files; aggregate them by merging the per-file counts and summing the counts for identical words:

while read -r file
do
    cat "stat.$file"
done < file-list \
| sort -k2 \
| awk 'NR>1 && $2!=prev {print s, prev; s=0} {s+=$1; prev=$2} END {print s, prev}'

Example of usage:

$ for i in ls bash cp; do man $i > $i.txt ; done
$ cat <<EOF > file-list
> ls.txt
> bash.txt
> cp.txt
> EOF

$ while read -r file; do
> tr -cs "A-Za-z'" '\n' < "$file" | tr 'A-Z' 'a-z' | sort | uniq -c > "stat.$file"
> done < file-list

$ while read -r file
> do
> cat "stat.$file"
> done < file-list \
> | sort -k2 \
> | awk 'NR>1 && $2!=prev {print s, prev; s=0} {s+=$1; prev=$2} END {print s, prev}' | sort -rn | head

3875 the
1671 is
1137 to
1118 a
1072 of
793 if
744 and
533 command
514 in
507 shell

Shell Script to Count the Occurrences of a Word in a File

Using tr to separate the words and then grep and wc to count them is a workable approach:

tr -s ' ' '\n' < file.txt | grep file | wc -l
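
Note that grep file also counts words that merely contain file (for example files or profile). If you only want exact matches, here is a variant using grep's -x (match whole line) and -c (count) options, sketched here rather than taken from the original answer:

tr -s ' ' '\n' < file.txt | grep -cx 'file'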

How to count occurrences of a specific word in a group of files with bash/shell script

This alternative requires no pipelines:

$ awk -v RS='[[:space:]]+' '/^h/{i++} END{print i+0}' simple.txt simple1.txt
7

How it works

  • -v RS='[[:space:]]+'

    This sets the record separator to any run of whitespace, so awk treats each word as a separate record.

  • /^h/{i++}

    For any record (word) that starts with h, we increment variable i by 1.

  • END{print i+0}

    After we have finished reading all the files, we print out the value of i. Adding 0 forces a numeric result, so 0 is printed (instead of an empty string) when nothing matched.
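
One caveat: a multi-character, regular-expression RS like this is a GNU awk (gawk) extension; a strictly POSIX awk uses only the first character of RS as the record separator.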

How to create a frequency list of every word in a file?

Not sed and grep, but tr, sort, uniq, and awk:

% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF

a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1

In most cases you also want to remove numbers and punctuation, convert everything to lowercase (otherwise "THE", "The" and "the" are counted separately) and suppress an entry for a zero-length word. For ASCII text you can do all of this with this modified command:

sed -e 's/[^A-Za-z]/ /g' text.txt | tr 'A-Z' 'a-z' | tr ' ' '\n' | grep -v '^$' | sort | uniq -c | sort -rn
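
The same normalization can also be written a little more compactly with tr's -c (complement) and -s (squeeze) flags; this is an equivalent sketch, not part of the original answer:

tr -cs 'A-Za-z' '\n' < text.txt | tr 'A-Z' 'a-z' | grep -v '^$' | sort | uniq -c | sort -rn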

How can I count the occurrences of a string within a file?

This will output the number of lines that contain your search string.

grep -c "echo" FILE

This won't, however, count the total number of occurrences in the file (i.e., if echo appears multiple times on one line, that line is still counted only once).

Edit:

After playing around a bit, you could get the number of occurrences using this dirty little bit of code:

sed 's/echo/echo\n/g' FILE | grep -c "echo"

This basically adds a newline following every instance of echo so they're each on their own line, allowing grep to count those lines. You can refine the regex if you only want the word "echo", as opposed to "echoing", for example.
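
Two caveats apply: the \n in the replacement text is a GNU sed feature (BSD sed wants a backslash followed by a literal newline), and if your grep supports -o, which prints each match on its own line, you can skip sed entirely:

grep -o 'echo' FILE | wc -l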

Print every word and its number of occurrences, using pure `bash`

You could use an associative array for counting the words, a bit like this:

$ cat foo.sh
#!/bin/bash

declare -A words

# count every whitespace-separated word read from standard input
while read -r line
do
    for word in $line
    do
        ((words[$word]++))
    done
done

# print each distinct word with its count
for i in "${!words[@]}"
do
    echo "$i:" "${words[$i]}"
done

Testing it:

$ echo this is a test is this | bash foo.sh
is: 2
this: 2
a: 1
test: 1
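
To list the words by frequency instead of in arbitrary order, pipe the script's output through sort (a usage sketch; -t: splits each line on the colon the script prints):

$ echo this is a test is this | bash foo.sh | sort -t: -k2 -rn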


Count occurrences of a list of words in multiple files

Take the output from your script and pipe it into:

awk '{ arry[$1]+=$2 } END { for (i in arry) { print i" "arry[i] } }'

This sums the counts in column 2 for each word in column 1 and prints one aggregated total per word.
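
For example, with made-up input standing in for your script's per-file output:

$ printf 'apple 2\napple 3\ncat 1\n' | awk '{ arry[$1]+=$2 } END { for (i in arry) { print i" "arry[i] } }'
apple 5
cat 1

(The order of awk's for (i in arry) loop is unspecified, so the two lines may appear in either order.)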

