bash: getting percentage from a frequency table
Try this (with the sort moved to the end):
cut -f $1 $2| sort | uniq -c | awk '{array[$2]=$1; sum+=$1} END { for (i in array) printf "%-20s %-15d %6.2f%%\n", i, array[i], array[i]/sum*100}' | sort -r -k2,2 -n
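As a quick sanity check, here is the same pipeline run on inline sample data instead of `cut -f $1 $2` (the fruit values are made up): `uniq -c` emits count/value pairs, awk accumulates the grand total in `sum`, and the final sort orders by count descending.

```shell
# Hypothetical stand-in for: cut -f 1 file.tsv
printf 'apple\nbanana\napple\ncherry\n' \
  | sort | uniq -c \
  | awk '{array[$2]=$1; sum+=$1}
         END {for (i in array) printf "%-10s %-5d %6.2f%%\n", i, array[i], array[i]/sum*100}' \
  | sort -r -k2,2 -n
# apple (count 2, 50.00%) sorts first; banana and cherry get 25.00% each
```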
calculating percentage of frequencies in R
You may want to try data.table. You also get a speed advantage when working with large tables.
library(data.table)
# if your data is already stored as a data frame,
# skip the next step and convert it with data <- data.table(data)
data <- data.table(name=rep(c("A","B"), each=4), cat1=c(1,1,0,0,1,1,0,0), cat2=c(1,0,1,0,1,0,1,0), freq=c(32,56,36,25,14,68,58,90))
data[, percen := sum(freq), by=list(name,cat1)]
data[, percen := freq/percen]
data
> data
   name cat1 cat2 freq    percen
1:    A    1    1   32 0.3636364
2:    A    1    0   56 0.6363636
3:    A    0    1   36 0.5901639
4:    A    0    0   25 0.4098361
5:    B    1    1   14 0.1707317
6:    B    1    0   68 0.8292683
7:    B    0    1   58 0.3918919
8:    B    0    0   90 0.6081081
Hope this helps.
Bash script to find the frequency of every letter in a file
Just one awk command
awk -vFS="" '{for(i=1;i<=NF;i++)w[$i]++}END{for(i in w) print i,w[i]}' file
If you want it case-insensitive, add tolower():
awk -vFS="" '{for(i=1;i<=NF;i++)w[tolower($i)]++}END{for(i in w) print i,w[i]}' file
And if you want only letters:
awk -vFS="" '{for(i=1;i<=NF;i++){ if($i~/[a-zA-Z]/) { w[tolower($i)]++} } }END{for(i in w) print i,w[i]}' file
And if you want only digits, change /[a-zA-Z]/ to /[0-9]/.
If you do not want Unicode (multibyte) characters handled as such, run export LC_ALL=C first.
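For example, the letters-only, case-insensitive variant applied to a short pipe-fed string (sorted here because `for (i in w)` iterates in arbitrary order):

```shell
echo 'Hello!' \
  | awk -vFS="" '{for(i=1;i<=NF;i++){ if($i~/[a-zA-Z]/) { w[tolower($i)]++} } }END{for(i in w) print i,w[i]}' \
  | sort
# e 1
# h 1
# l 2
# o 1
```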
Bash script frequency analysis of unique letters and repeating letter pairs how should i build this script?
TL;DR, to be frank. The only direct question I found here can be answered with yes :) Please split this into smaller tasks and we'll be happy to assist with each one - assuming you don't find the answers to those smaller questions first.
If you can lay it out in pseudocode, it will be easier. Unix has all kinds of text-manipulation tools; which ones to use depends on how big your texts are. I assume they are not that big, or you would have used a compiled language.
For example, the easy but costly gawk way to count frequencies:
awk -F "" '{for(i=1;i<=NF;i++) freq[$i]++;}END{for(i in freq) printf("%c %d\n", i, freq[i]);}'
As for transliterating, there is the tr utility. You can build the actual translation strings for each case and pass them to it (this also works for Caesar-like ciphers).
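For instance, a Caesar shift of 3 (the shift amount is just an assumption for illustration) can be expressed with tr range sets:

```shell
# Encode: shift each lowercase letter forward by 3, wrapping x-z back to a-c.
echo 'attack at dawn' | tr 'a-z' 'd-za-c'
# dwwdfn dw gdzq

# Decode: apply the inverse mapping.
echo 'dwwdfn dw gdzq' | tr 'd-za-c' 'a-z'
# attack at dawn
```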
How to create a frequency list of every word in a file?
Not sed and grep, but tr, sort, uniq, and awk:
% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF
a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1
In most cases you also want to remove numbers and punctuation, convert everything to lowercase (otherwise "THE", "The" and "the" are counted separately) and suppress an entry for a zero length word. For ASCII text you can do all these with this modified command:
sed -e 's/[^A-Za-z]/ /g' text.txt | tr 'A-Z' 'a-z' | tr ' ' '\n' | grep -v '^$'| sort | uniq -c | sort -rn
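Run on a made-up sentence, the modified command looks like this; the words and their counts are deterministic, though the exact whitespace padding from uniq -c varies by implementation:

```shell
echo 'the cat saw the cat and the dog' \
  | sed -e 's/[^A-Za-z]/ /g' | tr 'A-Z' 'a-z' | tr ' ' '\n' \
  | grep -v '^$' | sort | uniq -c | sort -rn
# "the" (3) comes first, then "cat" (2), then the single-occurrence words
```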
Two-Way Contingency Table with frequencies and percentages
We can change the position argument in adorn_ns from rear (the default) to front:
library(tidyverse)
starwars %>%
filter(species == "Human") %>%
tabyl(gender, eye_color) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits = 2) %>%
adorn_ns(position = "front")
# gender blue blue-gray brown dark hazel yellow
# female 3 (33.33%) 0 (0.00%) 5 (55.56%) 0 (0.00%) 1 (11.11%) 0 (0.00%)
# male 9 (34.62%) 1 (3.85%) 12 (46.15%) 1 (3.85%) 1 (3.85%) 2 (7.69%)
Alternatively, if the object is already created, post-process with mutate_at to reformat every column except the first: capture the characters in two groups, swap their positions by reversing the backreferences, and add parentheses around the percentage:
library(tidyverse)
starwars %>%
filter(species == "Human") %>%
tabyl(gender, eye_color) %>%
adorn_percentages("row") %>%
adorn_pct_formatting(digits = 2) %>%
adorn_ns() %>%
mutate_at(-1, list(~ str_replace(., "^([0-9.%]+)\\s+\\((\\d+)\\)", "\\2 (\\1)")))
# gender blue blue-gray brown dark hazel yellow
#1 female 3 (33.33%) 0 (0.00%) 5 (55.56%) 0 (0.00%) 1 (11.11%) 0 (0.00%)
#2 male 9 (34.62%) 1 (3.85%) 12 (46.15%) 1 (3.85%) 1 (3.85%) 2 (7.69%)
How to transform raw counts in a table to percent relative abundance on R or bash?
Here are 3 base R solutions:
#1.
df[-1] <-sweep(df[-1], 2, colSums(df[,-1]), `/`) * 100
#2.
df[-1] <- t(t(df[-1])/colSums(df[,-1])) * 100
#3.
df[-1] <- sapply(df[-1], prop.table) * 100
All of which return :
df
# Taxa Sample1 Sample2
#1 Eukaryota;Alveolata;Apicomplexa 20 10
#2 Eukaryota;Alveolata;Dinophyceae 40 10
#3 Eukaryota;Alveolata;UnclassifiedAlveolata 10 20
#4 Eukaryota;Choanoflagellida;Acanthoecidae 10 20
#5 Eukaryota;Choanoflagellida;Codonosigidae 20 40
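Since the question also allows bash, here is an awk sketch of the same normalization (counts.tsv is a hypothetical tab-separated file with Taxa in column 1): the file is read twice, first to total each sample column, then to print each count as a percentage of its column total.

```shell
awk -F'\t' -v OFS='\t' '
  NR==FNR { if (FNR > 1) for (i = 2; i <= NF; i++) sum[i] += $i; next }  # pass 1: column totals
  FNR==1  { print; next }                                                # pass 2: keep header as-is
  { for (i = 2; i <= NF; i++) $i = sprintf("%.1f", $i / sum[i] * 100); print }
' counts.tsv counts.tsv
```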