How to Count the Most Occurring Sequence of 3 Letters Within a Word with a Bash Script

How can I count the most occurring sequence of 3 letters within a word with a bash script?

Here's how to get started with what I THINK you're trying to do:

$ cat tst.awk
BEGIN { stringLgth = 3 }
{
    # Visit every whitespace-separated word on the line.
    for (fldNr=1; fldNr<=NF; fldNr++) {
        field = $fldNr
        fieldLgth = length(field)
        if ( fieldLgth >= stringLgth ) {
            # Slide a 3-character window across the word, counting
            # each case-folded substring it produces.
            maxBegPos = fieldLgth - (stringLgth - 1)
            for (begPos=1; begPos<=maxBegPos; begPos++) {
                string = tolower(substr(field,begPos,stringLgth))
                cnt[string]++
            }
        }
    }
}
END {
    for (string in cnt) {
        print string, cnt[string]
    }
}


$ awk -f tst.awk file | sort -k2,2nr
acc 5
cou 5
cco 4
ing 4
nti 4
oun 4
tin 4
unt 4
aco 3
abc 1
ant 1
any 1
bca 1
cac 1
cal 1
com 1
con 1
fir 1
ica 1
irm 1
lta 1
mpa 1
nsu 1
omp 1
ons 1
ous 1
pan 1
sti 1
sul 1
tan 1
tic 1
ult 1
ust 1
xyz 1
yza 1
zac 1
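If you only want the single most frequent sequence rather than the full table, trim the sorted output (a minimal sketch, assuming the same tst.awk and file as above; ties are broken by sort's last-resort whole-line comparison):

$ awk -f tst.awk file | sort -k2,2nr | head -1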

Count occurrences of a char in a string using Bash

I would use the following awk command:

string="text,text,text,text"
char=","
awk -F"${char}" '{print NF-1}' <<< "${string}"

I'm splitting the string by $char and printing the number of resulting fields minus 1.

If your shell does not support the <<< operator, use echo:

echo "${string}" | awk -F"${char}" '{print NF-1}'

How to create a frequency list of every word in a file?

Not sed and grep, but tr, sort, uniq, and awk:

% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF

a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1

In most cases you also want to remove numbers and punctuation, convert everything to lowercase (otherwise "THE", "The" and "the" are counted separately), and suppress entries for zero-length words. For ASCII text you can do all of this with this modified command:

sed -e 's/[^A-Za-z]/ /g' text.txt | tr 'A-Z' 'a-z' | tr ' ' '\n' | grep -v '^$' | sort | uniq -c | sort -rn
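The same normalization and counting can also be done in a single awk process instead of a pipeline (a sketch, assuming the same text.txt):

awk '
{
    gsub(/[^A-Za-z]/, " ")           # strip digits and punctuation
    $0 = tolower($0)                 # fold case so "The" and "the" merge
    for (i = 1; i <= NF; i++)
        cnt[$i]++                    # count each remaining word
}
END {
    for (w in cnt)
        print cnt[w], w              # frequency first, for numeric sorting
}' text.txt | sort -rn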

Find the most popular 5 words with bash command

$ cat lorem.txt | tr \  '\n' | tr -c -d '[:alpha:]\n' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | head -5
10 vitae
10 in
9 quis
9 nunc
9 eget

Walk-thru:

  • tr \  '\n' puts each word on its own line
  • tr -c -d '[:alpha:]\n' removes everything that is not a letter or a newline
  • tr '[:upper:]' '[:lower:]' converts to lower case
  • sort | uniq -c | sort -nr sorts, counts, and re-sorts in descending frequency order
  • head -5 prints only the top five
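An alternative sketch that sidesteps the word-splitting step by extracting runs of letters directly (assuming the same lorem.txt):

$ grep -oE '[[:alpha:]]+' lorem.txt | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | head -5

grep -o prints each match on its own line, so every maximal run of letters becomes one record and empty lines never appear.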

Deleting lines with more than 30% lowercase letters

One way, using GNU awk (gawk):

awk '{n=length(gensub(/[A-Z]/,"","g"));if(NF && n/length*100 < 30)print a $0;a=RT}' RS='>[a-z]+\n' file

Note that gensub and RT are gawk extensions; a portable sketch follows the numbered list below.

  1. RS='>[a-z]+\n' - sets the record separator to the header line containing '>' and the name

  2. RT - this variable holds the text that RS actually matched

  3. a=RT - saves the header that RS just matched, so it can be printed in front of the next record

  4. n=length(gensub(/[A-Z]/,"","g")); - deletes the upper-case characters and takes the length of what remains, i.e. the lower-case count

  5. if(NF && n/length*100 < 30)print a $0; - checks that the record is non-empty and that lower-case characters make up less than 30% of it, and if so prints the saved header followed by the record

How to get from a file only the words with a given count of a particular letter

There are various ways to split input so that grep sees a single word per line. tr is the most common. Note that tr reads standard input, so the file must be redirected in. For example:

tr -s '[:space:]' '\n' < file | ...

We can build a function to select words containing a specific number of a particular letter:

NofL(){
    num=$1
    letter=$2
    # Match lines with exactly $num occurrences of $letter:
    # leading non-$letter chars, then ($letter followed by non-$letter chars) $num times.
    regex="^[^$letter]*($letter[^$letter]*){$num}$"
    grep -E "$regex"
}

Then:

# letter=a number=1
tr -s '[:space:]' '\n' < file | NofL 1 a

# letters=a,b number=3
tr -s '[:space:]' '\n' < file | NofL 3 a | NofL 3 b

# letters=a,b,c number=2
tr -s '[:space:]' '\n' < file | NofL 2 a | NofL 2 b | NofL 2 c
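A quick sanity check, assuming the function above is already defined in the current shell ('banana' and 'bazaar' each contain exactly three a's, 'apple' only one):

$ printf 'banana\napple\nbazaar\n' | NofL 3 a
banana
bazaar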

