Calculate Percentages/Proportions of Values by Group Using Data.Table

Calculate percentages / proportions of values by group using data.table

I don't quite understand the data.table solution already posted, so I would do it as shown below. (I would also rename the columns so that they contain no parentheses, which avoids having to backtick-quote them everywhere.)

dt[ , `percentage(counts)` := `sum(count)` / sum( `sum(count)` ) * 100 , by = "x" ]
#   x y sum(count) percentage(counts)
#1: 1 1          3           16.66667
#2: 1 2          7           38.88889
#3: 1 3          8           44.44444
#4: 2 1          4           23.52941
#5: 2 2          3           17.64706
#6: 2 3         10           58.82353
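
The renaming suggested above could look roughly like this. It is only a sketch: the data is rebuilt from the output shown, and the new names total and pct are illustrative choices, not names from the original question.

```r
library(data.table)

# Data rebuilt from the printed output above
dt <- data.table(
  x = c(1, 1, 1, 2, 2, 2),
  y = c(1, 2, 3, 1, 2, 3),
  `sum(count)` = c(3, 7, 8, 4, 3, 10)
)

# Rename once so later expressions need no backticks
# ("total" is an illustrative name, not from the original post)
setnames(dt, "sum(count)", "total")

# Percentage of each row's total within its x group
dt[, pct := total / sum(total) * 100, by = x]
print(dt)
```

After the rename, every later expression can refer to total and pct directly.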

Ratio of row value to sum of rows in a group using r data.table

You can use prop.table to get each value's share of the total within each year and quarter.

library(data.table)

dt[, pct_byQtrYr := prop.table(value), .(year, quarter)]
dt

#    ID year quarter value pct_byQtrYr
# 1:  A 2020       4   4.0   0.1951220
# 2:  B 2020       4  10.5   0.5121951
# 3:  C 2020       4   6.0   0.2926829
# 4:  A 2021       1   6.6   0.2933333
# 5:  B 2021       1  15.0   0.6666667
# 6:  C 2021       1   0.9   0.0400000
# 7:  A 2021       2   6.2   0.1980831
# 8:  B 2021       2   9.8   0.3130990
# 9:  C 2021       2  15.3   0.4888179
#10:  A 2021       3   5.0   0.5263158
#11:  B 2021       3   3.4   0.3578947
#12:  C 2021       3   1.1   0.1157895

For a numeric vector this is equivalent to dividing each value by the sum of its group.

dt[, pct_byQtrYr := value/sum(value), .(year, quarter)]
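
To check that the two forms agree, here is a small sketch with the data rebuilt from the output above. The column names pct_prop and pct_div are made up for the comparison.

```r
library(data.table)

# Data rebuilt from the printed table above
dt <- data.table(
  ID      = rep(c("A", "B", "C"), times = 4),
  year    = c(rep(2020, 3), rep(2021, 9)),
  quarter = rep(c(4, 1, 2, 3), each = 3),
  value   = c(4, 10.5, 6, 6.6, 15, 0.9, 6.2, 9.8, 15.3, 5, 3.4, 1.1)
)

# Same share computed two ways (column names are illustrative)
dt[, pct_prop := prop.table(value), by = .(year, quarter)]
dt[, pct_div  := value / sum(value), by = .(year, quarter)]

# Both columns agree
print(all.equal(dt$pct_prop, dt$pct_div))

# Within each year/quarter the shares add up to 1
print(dt[, .(total = sum(pct_prop)), by = .(year, quarter)])
```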

Calculating the proportion per subgroup with data.table

Using data.table:

df <- read.table(header = TRUE, text = "row  country year
1  NLD     2005
2  NLD     2005
3  BLG     2006
4  BLG     2005
5  GER     2005
6  NLD     2007
7  NLD     2005
8  NLD     2008")

setDT(df)[, sum := .N, by = country][, prop := .N, by = c("country", "year")][, prop := prop/sum][, sum := NULL]

    row country year prop
1:   1     NLD 2005  0.6
2:   2     NLD 2005  0.6
3:   3     BLG 2006  0.5
4:   4     BLG 2005  0.5
5:   5     GER 2005  1.0
6:   6     NLD 2007  0.2
7:   7     NLD 2005  0.6
8:   8     NLD 2008  0.2
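
If one row per country/year pair is enough, instead of a value on every original row, a shorter variant is possible. This is a sketch of a different technique from the chain above: it counts first and then normalises. summary_dt is an illustrative name.

```r
library(data.table)

# Same data as above, built directly as a data.table
df <- data.table(
  row     = 1:8,
  country = c("NLD", "NLD", "BLG", "BLG", "GER", "NLD", "NLD", "NLD"),
  year    = c(2005, 2005, 2006, 2005, 2005, 2007, 2005, 2008)
)

# Count rows per (country, year), then divide by the total for the country
summary_dt <- df[, .N, by = .(country, year)][, prop := N / sum(N), by = country]
print(summary_dt)
```

The proportions match the ones above (0.6 for NLD in 2005, for example), but each combination appears only once.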

Calculate Percentage and other functions using data.table

We can use a similar approach with data.table:

res <- IData[, .(numbers1.mean = mean(numbers1),
          numbers1.median = median(numbers1),
          numbers2.mean=mean(numbers2),
          sum.numbers1.n = sum(numbers1)), let
          ][, perc.numbers1 := sum.numbers1.n/sum(sum.numbers1.n)
           ][, c("let", "numbers1.mean",  "numbers1.median", 
                        "numbers2.mean", "perc.numbers1"), with = FALSE]

head(res)
#    let numbers1.mean numbers1.median numbers2.mean perc.numbers1
#1:   N     10320.951         10473.0      9374.435    0.03567927
#2:   H      9683.590          9256.5      9328.035    0.03648391
#3:   L     10223.322         10226.0      9806.210    0.04005400
#4:   S      9922.486          9618.0     10233.849    0.03678742
#5:   C      9592.620          9226.0      9791.221    0.03517997
#6:   F     10323.867         10382.0     10036.561    0.03962035

Using dplyr function to calculate percentage within groups

library(dplyr)

df %>%
  # line below to freeze order of type_n if not ordered factor already
  mutate(type_n = forcats::fct_inorder(type_n)) %>%
  group_by(type_n) %>%
  summarize(n = n(), total = sum(population)) %>%
  mutate(new_col = (n / total) %>% scales::percent(decimal.mark = ",", suffix = ""))

# A tibble: 3 x 4
  type_n     n total new_col
  <fct>  <int> <int> <chr>  
1 small      2     7 28,6   
2 medium     2    14 14,3   
3 large      3    15 20,0

R data.table: subgroup weighted percent of group

This is almost a single step:

# A
widgets[,{
    totwt = .N
    .SD[,.(frac=.N/totwt),by=style]
},by=color]
    # color  style frac
 # 1:   red  round 0.36
 # 2:   red pointy 0.32
 # 3:   red   flat 0.32
 # 4: green pointy 0.36
 # 5: green   flat 0.32
 # 6: green  round 0.32
 # 7:  blue   flat 0.36
 # 8:  blue  round 0.32
 # 9:  blue pointy 0.32
# 10: black  round 0.36
# 11: black pointy 0.32
# 12: black   flat 0.32

# B
widgets[,{
    totwt = sum(weight)
    .SD[,.(frac=sum(weight)/totwt),by=style]
},by=color]
 #    color  style      frac
 # 1:   red  round 0.3466667
 # 2:   red pointy 0.3466667
 # 3:   red   flat 0.3066667
 # 4: green pointy 0.3333333
 # 5: green   flat 0.3200000
 # 6: green  round 0.3466667
 # 7:  blue   flat 0.3866667
 # 8:  blue  round 0.2933333
 # 9:  blue pointy 0.3200000
# 10: black  round 0.3733333
# 11: black pointy 0.3333333
# 12: black   flat 0.2933333

How it works: Construct your denominator for the top-level group (color) before going to the finer group (color with style) to tabulate.
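
The same idea can also be written as two chained steps. This is a sketch on made-up data, because the original widgets table is not shown here. The values and the helper column w are assumptions for illustration only.

```r
library(data.table)

# Hypothetical data: the original `widgets` table is not shown in the text
widgets <- data.table(
  color  = c("red", "red", "red", "blue", "blue"),
  style  = c("round", "round", "flat", "flat", "pointy"),
  weight = c(2, 1, 3, 4, 1)
)

# Step 1: total weight per (color, style)
# Step 2: divide by the total weight of the enclosing color group
res <- widgets[, .(w = sum(weight)), by = .(color, style)][
  , frac := w / sum(w), by = color]
print(res)
```

Here the denominator is built in the second step, after aggregating to the finer group, rather than before it. The resulting fractions are the same either way.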

Alternatives. If styles repeat within each color and this is only for display purposes, try a table:

# A
widgets[,
  prop.table(table(color,style),1)
]
#        style
# color   flat pointy round
#   black 0.32   0.32  0.36
#   blue  0.36   0.32  0.32
#   green 0.32   0.36  0.32
#   red   0.32   0.32  0.36

# B
widgets[,rep(1L,sum(weight)),by=.(color,style)][,
  prop.table(table(color,style),1)
]

#        style
# color        flat    pointy     round
#   black 0.2933333 0.3333333 0.3733333
#   blue  0.3866667 0.3200000 0.2933333
#   green 0.3200000 0.3333333 0.3466667
#   red   0.3066667 0.3466667 0.3466667

For B, this expands the data so that there is one observation for each unit of weight. With large data, such an expansion would be a bad idea, since it costs a lot of memory. Also, weight has to be an integer; otherwise, its sum will be silently truncated to a whole number (e.g., rep(1, 2.5) returns 1 1).
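
One way to get the weighted table without expanding the data is xtabs from base R, which sums a left-hand-side variable within each cell. This is a different technique from the one above, shown as a sketch on made-up data with non-integer weights.

```r
library(data.table)

# Hypothetical data with non-integer weights (not from the original post)
widgets <- data.table(
  color  = c("red", "red", "red", "blue", "blue"),
  style  = c("round", "round", "flat", "flat", "pointy"),
  weight = c(1.5, 1.0, 2.5, 3.0, 1.0)
)

# xtabs() sums `weight` within each color/style cell; no row expansion needed
wt_tab <- xtabs(weight ~ color + style, data = widgets)

# Row-wise proportions: each color's row sums to 1
p <- prop.table(wt_tab, margin = 1)
print(p)
```

Because the weights are summed rather than used as repeat counts, fractional weights are handled without truncation.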

KDB/Q: compute the percentage by group

You can use fby to do this in one query:

q)table:flip`day`week`item!(`mon`tue`wed`mon`tue`wed;1 1 1 2 2 2;2 7 1 1 2 1)
q)update proportion:item % (sum;item) fby week from table
day week item proportion
------------------------
mon 1    2    0.2
tue 1    7    0.7
wed 1    1    0.1
mon 2    1    0.25
tue 2    2    0.5
wed 2    1    0.25

Percentage of factor levels by group in R

Another solution (with base-R):

prop.table(table(mydata$CNT, mydata$FACTOR), margin = 1)

            1         2
  A 0.6000000 0.4000000
  B 0.6666667 0.3333333
  C 0.5000000 0.5000000
  D 1.0000000 0.0000000
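
For reference, a self-contained sketch of the call above. The original mydata is not shown, so the counts below are an assumption, chosen only so that they reproduce the proportions printed above.

```r
# Hypothetical data: counts chosen to reproduce the proportions shown above
mydata <- data.frame(
  CNT    = c(rep("A", 5), rep("B", 3), rep("C", 2), "D"),
  FACTOR = c(1, 1, 1, 2, 2,  1, 1, 2,  1, 2,  1)
)

tab <- table(mydata$CNT, mydata$FACTOR)

# margin = 1 normalises within rows, so each CNT level sums to 1
p <- prop.table(tab, margin = 1)
print(p)
```

With margin = 2 the columns would sum to 1 instead, which gives the share of each CNT level within a FACTOR level.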

How to use data.table to efficiently calculate allele frequencies (proportions) by group across multiple columns (loci)

It's probably wise to transform your data.table into long format first. This will make it easier to use for further calculations (or making visualisations with ggplot2 for example). With the melt function of data.table (which works the same as the melt function of the reshape2 package) you can transform from wide to long format:

DT2 <- melt(DT, id = "Group", variable.name = "loci")

If you want to remove the NA-values during the melt-operation, you can add na.rm = TRUE to the above call (na.rm = FALSE is the default behaviour).

Then you can make count and proportion variables as follows:

DT2 <- DT2[, .N, by = .(Group, loci, value)][, prop := N/sum(N), by = .(Group, loci)]

which gives the following result:

> DT2
    Group loci value N      prop
 1:    G1 Loc1     G 3 1.0000000
 2:    G2 Loc1    NA 1 0.2500000
 3:    G2 Loc1     G 1 0.2500000
 4:    G2 Loc1     T 2 0.5000000
 5:    G3 Loc1     T 2 0.6666667
 6:    G3 Loc1    NA 1 0.3333333
 7:    G1 Loc2    NA 1 0.3333333
 8:    G1 Loc2     A 1 0.3333333
 9:    G1 Loc2     C 1 0.3333333
10:    G2 Loc2    NA 1 0.2500000
11:    G2 Loc2     C 2 0.5000000
12:    G2 Loc2     A 1 0.2500000
13:    G3 Loc2     A 2 0.6666667
14:    G3 Loc2     C 1 0.3333333
15:    G1 Loc3     C 1 0.3333333
16:    G1 Loc3     G 2 0.6666667
17:    G2 Loc3    NA 2 0.5000000
18:    G2 Loc3     G 2 0.5000000
19:    G3 Loc3     G 3 1.0000000
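
For comparison, a sketch of the same steps with na.rm = TRUE in the melt call, so that proportions are computed over the non-missing values only. The data matches the "Used data" block at the end of this answer. The names long and freq are illustrative.

```r
library(data.table)

# Same data as in the "Used data" block of this answer
DT <- data.table(
  Loc1  = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA),
  Loc2  = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"),
  Loc3  = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"),
  Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")
)

# na.rm = TRUE drops the missing values while melting
long <- melt(DT, id.vars = "Group", variable.name = "loci", na.rm = TRUE)

# Proportions are now relative to the non-missing values only
freq <- long[, .N, by = .(Group, loci, value)][
  , prop := N / sum(N), by = .(Group, loci)]
print(freq)
```

For G2 at Loc1, for instance, T now has a proportion of 2/3 rather than 0.5, because the NA row no longer counts toward the denominator.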

If you want it back in wide format, you can use dcast on multiple variables:

DT3 <- dcast(DT2, Group + loci ~ value, value.var = c("N", "prop"), fill = 0)

which results in:

> DT3
   Group loci N_A N_C N_G N_T N_NA    prop_A    prop_C    prop_G    prop_T   prop_NA
1:    G1 Loc1   0   0   3   0    0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
2:    G1 Loc2   1   1   0   0    1 0.3333333 0.3333333 0.0000000 0.0000000 0.3333333
3:    G1 Loc3   0   1   2   0    0 0.0000000 0.3333333 0.6666667 0.0000000 0.0000000
4:    G2 Loc1   0   0   1   2    1 0.0000000 0.0000000 0.2500000 0.5000000 0.2500000
5:    G2 Loc2   1   2   0   0    1 0.2500000 0.5000000 0.0000000 0.0000000 0.2500000
6:    G2 Loc3   0   0   2   0    2 0.0000000 0.0000000 0.5000000 0.0000000 0.5000000
7:    G3 Loc1   0   0   0   2    1 0.0000000 0.0000000 0.0000000 0.6666667 0.3333333
8:    G3 Loc2   2   1   0   0    0 0.6666667 0.3333333 0.0000000 0.0000000 0.0000000
9:    G3 Loc3   0   0   3   0    0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000

Another, more straightforward approach is to use melt and dcast in one call (which is a simplified version of the first part of @Frank's answer):

DT2 <- dcast(melt(DT, id="Group"), Group + variable ~ value)

which gives:

> DT2
   Group variable A C G T NA
1:    G1     Loc1 0 0 3 0  0
2:    G1     Loc2 1 1 0 0  1
3:    G1     Loc3 0 1 2 0  0
4:    G2     Loc1 0 0 1 2  1
5:    G2     Loc2 1 2 0 0  1
6:    G2     Loc3 0 0 2 0  2
7:    G3     Loc1 0 0 0 2  1
8:    G3     Loc2 2 1 0 0  0
9:    G3     Loc3 0 0 3 0  0

Because dcast falls back to length as the aggregation function when none is supplied and the formula does not identify rows uniquely, you will automatically get the counts for each of the values.
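
To make the counting explicit rather than relying on that fallback, the aggregation function can be passed directly. A minimal sketch with the same data, where long and wide are illustrative names:

```r
library(data.table)

# Same data as in the "Used data" block of this answer
DT <- data.table(
  Loc1  = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA),
  Loc2  = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"),
  Loc3  = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"),
  Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")
)

long <- melt(DT, id.vars = "Group")

# Passing fun.aggregate = length states the intent (counting) explicitly
wide <- dcast(long, Group + variable ~ value, fun.aggregate = length)
print(wide)
```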

Used data:

DT <- structure(list(Loc1 = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA), 
                     Loc2 = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"), 
                     Loc3 = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"), 
                     Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")), 
                .Names = c("Loc1", "Loc2", "Loc3", "Group"), row.names = c(NA, -10L), class = c("data.table", "data.frame"))
