Bash: Getting Percentage from a Frequency Table

Try this (with the sort moved to the end):

cut -f $1 $2 | sort | uniq -c | awk '{array[$2]=$1; sum+=$1} END { for (i in array) printf "%-20s %-15d %6.2f%%\n", i, array[i], array[i]/sum*100 }' | sort -r -k2,2 -n
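For a concrete run, here is the same pipeline with the positional parameters filled in; the file name and data below are made up for illustration ($1 is the column number, $2 the file):

```shell
# Hypothetical two-column TSV standing in for "$2"; we count column 2 ("$1").
printf 'u1\tapple\nu2\tapple\nu3\tbanana\nu4\tapple\n' > /tmp/sample.tsv

cut -f 2 /tmp/sample.tsv | sort | uniq -c \
  | awk '{array[$2]=$1; sum+=$1}
         END { for (i in array)
                 printf "%-20s %-15d %6.2f%%\n", i, array[i], array[i]/sum*100 }' \
  | sort -r -k2,2 -n
```

apple comes out on top with a count of 3 and 75.00%, banana below it with 25.00%.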

calculating percentage of frequencies in R

You may want to try data.table. You also get a speed advantage when working with large tables.

library(data.table)
#if your data is already stored as a data frame,
#you can always skip the next step and continue with data <- data.table(data)

data <- data.table(name=rep(c("A","B"), each=4), cat1=c(1,1,0,0,1,1,0,0), cat2=c(1,0,1,0,1,0,1,0), freq=c(32,56,36,25,14,68,58,90))
data[, percen := sum(freq), by=list(name,cat1)]
data[, percen := freq/percen]
data
> data
   name cat1 cat2 freq    percen
1:    A    1    1   32 0.3636364
2:    A    1    0   56 0.6363636
3:    A    0    1   36 0.5901639
4:    A    0    0   25 0.4098361
5:    B    1    1   14 0.1707317
6:    B    1    0   68 0.8292683
7:    B    0    1   58 0.3918919
8:    B    0    0   90 0.6081081

Hope this helps.

Bash script to find the frequency of every letter in a file

Just one awk command:

awk -vFS="" '{for(i=1;i<=NF;i++)w[$i]++}END{for(i in w) print i,w[i]}' file

If you want it case-insensitive, add tolower():

awk -vFS="" '{for(i=1;i<=NF;i++)w[tolower($i)]++}END{for(i in w) print i,w[i]}' file

And if you want only letters:

awk -vFS="" '{for(i=1;i<=NF;i++){ if($i~/[a-zA-Z]/) { w[tolower($i)]++} } }END{for(i in w) print i,w[i]}' file

And if you want only digits, change /[a-zA-Z]/ to /[0-9]/.

If you do not want Unicode characters handled as such, run export LC_ALL=C first.
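A quick sanity check of the case-insensitive variant on a tiny input (the file name is made up):

```shell
printf 'Hello\n' > /tmp/freq.txt

# FS="" splits each line into one character per field; tolower() merges cases.
# The trailing sort is only for stable display, since awk's "for (i in w)"
# iterates in an unspecified order.
awk -v FS="" '{for(i=1;i<=NF;i++) w[tolower($i)]++} END{for(i in w) print i, w[i]}' /tmp/freq.txt | sort
# e 1
# h 1
# l 2
# o 1
```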

Bash script frequency analysis of unique letters and repeating letter pairs: how should I build this script?

TL;DR, to be frank. The answer to the only question I found here is yes. :) Please split this into smaller tasks, and we'll be happy to assist if you can't find the answers to those smaller questions on your own first.

If you can put it in pseudocode, it would be easier. There are all kinds of text-manipulating tools in Unix. Which ones to use depends on how big your texts are. I believe they are not that big, or you would have used a compiled language.

For example, the easy but costly gawk way to count frequencies:

awk -F "" '{for(i=1;i<=NF;i++) freq[$i]++;}END{for(i in freq) printf("%c %d\n", i, freq[i]);}'

As for transliterating, there is the tr utility. You can build the translation strings for each case and then pass them to it (that holds true for Caesar-like ciphers).
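For instance, a Caesar shift of 3 can be done by handing tr a rotated alphabet; the shift amount and sample text here are just for illustration:

```shell
# Encode: the second set is the lowercase alphabet rotated left by 3.
echo 'attack at dawn' | tr 'a-z' 'd-za-c'
# dwwdfn dw gdzq

# Decode: swap the two sets to invert the mapping.
echo 'dwwdfn dw gdzq' | tr 'd-za-c' 'a-z'
# attack at dawn
```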

How to create a frequency list of every word in a file?

Not sed and grep, but tr, sort, uniq, and awk:

% (tr ' ' '\n' | sort | uniq -c | awk '{print $2"@"$1}') <<EOF
This is a file with many words.
Some of the words appear more than once.
Some of the words only appear one time.
EOF

a@1
appear@2
file@1
is@1
many@1
more@1
of@2
once.@1
one@1
only@1
Some@2
than@1
the@2
This@1
time.@1
with@1
words@2
words.@1

In most cases you also want to remove numbers and punctuation, convert everything to lowercase (otherwise "THE", "The" and "the" are counted separately), and suppress entries for zero-length words. For ASCII text you can do all of this with this modified command:

sed -e 's/[^A-Za-z]/ /g' text.txt | tr 'A-Z' 'a-z' | tr ' ' '\n' | grep -v '^$' | sort | uniq -c | sort -rn
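Run against a small inline sample instead of text.txt, the normalisation makes "The" and "the" collapse into one entry:

```shell
printf 'The cat saw the Cat.\n' \
  | sed -e 's/[^A-Za-z]/ /g' | tr 'A-Z' 'a-z' | tr ' ' '\n' \
  | grep -v '^$' | sort | uniq -c | sort -rn
# "the" and "cat" both get a count of 2, "saw" a count of 1
```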

Two-Way Contingency Table with frequencies and percentages

We can change the position argument in adorn_ns() from "rear" (the default) to "front". Note that tabyl() and the adorn_*() functions come from the janitor package, which needs to be loaded as well:

library(tidyverse)
library(janitor)

starwars %>%
  filter(species == "Human") %>%
  tabyl(gender, eye_color) %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 2) %>%
  adorn_ns(position = "front")
# gender blue blue-gray brown dark hazel yellow
# female 3 (33.33%) 0 (0.00%) 5 (55.56%) 0 (0.00%) 1 (11.11%) 0 (0.00%)
# male 9 (34.62%) 1 (3.85%) 12 (46.15%) 1 (3.85%) 1 (3.85%) 2 (7.69%)

Or, if the object has already been created, another option is to post-process with mutate_at, changing the formatting of all columns except the first: capture the characters in two groups, reverse their positions by swapping the backreferences, and add () around the percentage.

library(tidyverse)
library(janitor)

starwars %>%
  filter(species == "Human") %>%
  tabyl(gender, eye_color) %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 2) %>%
  adorn_ns() %>%
  mutate_at(-1, list(~ str_replace(., "^([0-9.%]+)\\s+\\((\\d+)\\)", "\\2 (\\1)")))
# gender blue blue-gray brown dark hazel yellow
#1 female 3 (33.33%) 0 (0.00%) 5 (55.56%) 0 (0.00%) 1 (11.11%) 0 (0.00%)
#2 male 9 (34.62%) 1 (3.85%) 12 (46.15%) 1 (3.85%) 1 (3.85%) 2 (7.69%)

How to transform raw counts in a table to percent relative abundance on R or bash?

Here are 3 base R solutions:

#1.
df[-1] <- sweep(df[-1], 2, colSums(df[,-1]), `/`) * 100

#2.
df[-1] <- t(t(df[-1])/colSums(df[,-1])) * 100

#3.
df[-1] <- sapply(df[-1], prop.table) * 100

All of which return:

df
# Taxa Sample1 Sample2
#1 Eukaryota;Alveolata;Apicomplexa 20 10
#2 Eukaryota;Alveolata;Dinophyceae 40 10
#3 Eukaryota;Alveolata;UnclassifiedAlveolata 10 20
#4 Eukaryota;Choanoflagellida;Acanthoecidae 10 20
#5 Eukaryota;Choanoflagellida;Codonosigidae 20 40

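Since the question also allows bash, here is a rough awk sketch of the same rescaling; the file name, separator, and layout are assumptions (tab-separated counts with a header row and the taxa in column 1):

```shell
# Hypothetical counts file: header row, taxa in column 1, counts after that.
printf 'Taxa\tSample1\tSample2\nTaxonA\t20\t10\nTaxonB\t80\t90\n' > /tmp/counts.tsv

# Two passes over the file: the first accumulates each column's total,
# the second divides every count by its column total and scales to 100.
awk 'BEGIN{FS=OFS="\t"}
     NR==FNR { if (FNR>1) for (i=2;i<=NF;i++) tot[i]+=$i; next }
     FNR==1  { print; next }
     { for (i=2;i<=NF;i++) $i = sprintf("%.1f", $i/tot[i]*100); print }' \
    /tmp/counts.tsv /tmp/counts.tsv
```

Each sample column then sums to 100: TaxonA becomes 20.0 and 10.0, TaxonB becomes 80.0 and 90.0.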
