How can I count the most frequently occurring sequence of 3 letters within a word with a bash script
Here's how to get started with what I THINK you're trying to do:
$ cat tst.awk
BEGIN { stringLgth = 3 }
{
    for (fldNr=1; fldNr<=NF; fldNr++) {
        field = $fldNr
        fieldLgth = length(field)
        if ( fieldLgth >= stringLgth ) {
            maxBegPos = fieldLgth - (stringLgth - 1)
            for (begPos=1; begPos<=maxBegPos; begPos++) {
                string = tolower(substr(field,begPos,stringLgth))
                cnt[string]++
            }
        }
    }
}
END {
    for (string in cnt) {
        print string, cnt[string]
    }
}
$ awk -f tst.awk file | sort -k2,2nr
acc 5
cou 5
cco 4
ing 4
nti 4
oun 4
tin 4
unt 4
aco 3
abc 1
ant 1
any 1
bca 1
cac 1
cal 1
com 1
con 1
fir 1
ica 1
irm 1
lta 1
mpa 1
nsu 1
omp 1
ons 1
ous 1
pan 1
sti 1
sul 1
tan 1
tic 1
ult 1
ust 1
xyz 1
yza 1
zac 1
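If only the single most frequent sequence is wanted, the sorting step can be skipped by tracking a running maximum inside awk. The following is a minimal sketch under that assumption, not part of the answer above; the function name top_trigram and the sample input are hypothetical, and ties go to whichever sequence reached the top count first.

```shell
# Hypothetical helper: read text on stdin and print the most frequent
# n-letter sequence (n=3 here) together with its count.
top_trigram() {
    awk -v n=3 '
    {
        for (f = 1; f <= NF; f++) {
            word = tolower($f)
            for (p = 1; p <= length(word) - n + 1; p++) {
                s = substr(word, p, n)
                # counts only ever grow, so a running maximum is enough
                if (++cnt[s] > max) { max = cnt[s]; best = s }
            }
        }
    }
    END { if (max) print best, max }'
}

printf 'counting accounts\n' | top_trigram   # prints: cou 2
```

Like the script above, this counts every 3-character window inside each whitespace-separated field, so punctuation attached to a word is part of the window.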
Count occurrences of a char in a string using Bash
I would use the following awk command:
string="text,text,text,text"
char=","
awk -F"${char}" '{print NF-1}' <<< "${string}"
This splits the string on $char and prints the number of resulting fields minus 1.
If your shell does not support the <<< operator, use echo:
echo "${string}" | awk -F"${char}" '{print NF-1}'
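A pure-Bash alternative, which is a different technique from the awk command above, is to delete every occurrence of the character with parameter expansion and compare the lengths before and after. This is a sketch only: the function name count_char is hypothetical, and it assumes Bash and a single-character search string.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count occurrences of a single character in a string.
count_char() {
    local string=$1 char=$2
    # Quoting "$char" inside the pattern makes it match literally,
    # so characters such as * or ? are not treated as globs.
    local stripped=${string//"$char"/}
    echo $(( ${#string} - ${#stripped} ))
}

count_char "text,text,text,text" ","   # prints: 3
```

No external process is started, which can matter when the count is taken inside a loop.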
How to create a frequency list of every word in a file?
Not sed and grep, but tr, sort, uniq, and awk:
% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF
a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1
In most cases you also want to remove numbers and punctuation, convert everything to lowercase (otherwise "THE", "The" and "the" are counted separately) and suppress the entry for a zero-length word. For ASCII text you can do all of this with the following modified command:
sed -e 's/[^A-Za-z]/ /g' text.txt | tr 'A-Z' 'a-z' | tr ' ' '\n' | grep -v '^$'| sort | uniq -c | sort -rn
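The same clean-up can be sketched as a reusable stdin filter. Here tr -cs turns every run of non-letters into a single newline, which stands in for the sed and first tr steps. The function name word_freq and the sample sentence are hypothetical, and POSIX character classes are used in place of the A-Za-z ranges.

```shell
# Hypothetical filter: read text on stdin, print "count word" pairs,
# most frequent first.
word_freq() {
    tr -cs '[:alpha:]' '\n' |        # each run of non-letters becomes one newline
        tr '[:upper:]' '[:lower:]' | # fold case so "The" and "the" are one word
        grep -v '^$' |               # drop the empty line left by leading non-letters
        sort | uniq -c | sort -rn    # count, then order by descending frequency
}

printf 'The cat saw the other cat. THE END\n' | word_freq
```

On the sample sentence the first line of output reports "the" with a count of 3, followed by "cat" with 2.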
Find the most popular 5 words with bash command
$ cat lorem.txt | tr ' ' '\n' | tr -c -d '[:alpha:]\n' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | head -5
10 vitae
10 in
9 quis
9 nunc
9 eget
Walk-thru:
tr ' ' '\n'
- put each word on a separate record
tr -c -d '[:alpha:]\n'
- remove non-letters
tr '[:upper:]' '[:lower:]'
- convert to lower case
sort | uniq -c | sort -nr
- sort, count and print in frequency order
head -5
- exit after five lines
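The pipeline can be wrapped in a function that takes the number of words as a parameter. This is a sketch, not the answer's own code: the name top_words is hypothetical, the grep -v '^$' stage is an addition so that blank lines and doubled spaces are not counted as an empty word, and head -n N is used as the POSIX spelling of head -N.

```shell
# Hypothetical wrapper: top_words N [FILE...]
# Prints the N most frequent words; reads stdin when no file is given.
top_words() {
    local n=$1
    shift
    cat -- "$@" |
        tr ' ' '\n' |                  # one word per record
        tr -c -d '[:alpha:]\n' |       # remove non-letters
        tr '[:upper:]' '[:lower:]' |   # convert to lower case
        grep -v '^$' |                 # added: skip empty records
        sort | uniq -c | sort -nr |    # count and order by frequency
        head -n "$n"                   # stop after N lines
}

printf 'b a  b\n\nc b a\n' | top_words 2
```

On this input, "b" (3 occurrences) and "a" (2 occurrences) are reported.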
Deleting lines with more than 30% lowercase letters
Or, with GNU awk (gensub() and RT are gawk extensions):
awk '{n=length(gensub(/[A-Z]/,"","g"));if(NF && n/length*100 < 30)print a $0;a=RT}' RS='>[a-z]+\n' file
RS='>[a-z]+\n'
- Sets the record separator to the line containing '>' and name
RT
- This value is set by what is matched by RS above
a=RT
- save previous RT value
n=length(gensub(/[A-Z]/,"","g"));
- get the length of lower case chars
if(NF && n/length*100 < 30)print a $0;
- check we have a value and that the percentage is less than 30 for lower case chars
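For input without '>' header lines, the percentage test can be sketched in portable awk, since gsub returns the number of replacements it made. This is an illustration under stated assumptions rather than a drop-in replacement for the gawk command: it treats every line independently, counts ASCII a-z only, and the name drop_lowercase_heavy and the 30 limit variable are hypothetical.

```shell
# Hypothetical filter: drop lines in which more than `limit` percent of the
# characters are lowercase ASCII letters; all other lines pass through.
drop_lowercase_heavy() {
    awk -v limit=30 '
    {
        line = $0
        lower = gsub(/[a-z]/, "", line)   # number of lowercase letters in the line
        # integer comparison avoids floating-point rounding at the boundary
        if (lower * 100 <= limit * length($0)) print
    }'
}

printf 'ACGTACGTAC\nACGTacgtAC\nACGTACGacg\n' | drop_lowercase_heavy
```

The second sample line is 40% lowercase and is dropped; the first (0%) and third (exactly 30%) are kept.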
How to get from a file only the character with repeated value
There are various ways to split input so that grep sees a single word per line; tr is the most common. Note that tr reads only standard input, so the file has to be redirected. For example:
tr -s '[:space:]' '\n' < file | ...
We can build a function to find a specific number of a particular letter:
NofL(){
    num=$1
    letter=$2
    regex="^[^$letter]*($letter[^$letter]*){$num}$"
    grep -E "$regex"
}
Then:
# letter=a number=1
tr -s '[:space:]' '\n' < file | NofL 1 a
# letters=a,b number=3
tr -s '[:space:]' '\n' < file | NofL 3 a | NofL 3 b
# letters=a,b,c number=2
tr -s '[:space:]' '\n' < file | NofL 2 a | NofL 2 b | NofL 2 c
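To see the filter behave on concrete input, here is a self-contained sketch that repeats the function and feeds it a few sample words. The words are invented for illustration and are piped in with printf, so no file is needed.

```shell
# Keep only the lines (words) that contain exactly $1 occurrences of letter $2.
NofL() {
    num=$1
    letter=$2
    regex="^[^$letter]*($letter[^$letter]*){$num}$"
    grep -E "$regex"
}

# Hypothetical sample data: four words spread over two lines.
words='banana apple
cabbage bab'

# Exactly three a's.
printf '%s\n' "$words" | tr -s '[:space:]' '\n' | NofL 3 a            # prints: banana
# Exactly two a's and exactly two b's.
printf '%s\n' "$words" | tr -s '[:space:]' '\n' | NofL 2 a | NofL 2 b # prints: cabbage
```

Chaining the function works because each stage only passes on the words that met the previous condition.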