Calculate Percentages/Proportions of Values by Group Using Data.Table

Calculate percentages / proportions of values by group using data.table

I don't quite understand the data.table solution already posted, so I would do it like this (and I would change the name of the columns to not have parentheses to avoid lots of backtick quoting(!) of column names):

dt[ , `percentage(counts)` := `sum(count)` / sum( `sum(count)` ) * 100 , by = "x" ]
# x y sum(count) percentage(counts)
#1: 1 1 3 16.66667
#2: 1 2 7 38.88889
#3: 1 3 8 44.44444
#4: 2 1 4 23.52941
#5: 2 2 3 17.64706
#6: 2 3 10 58.82353

Ratio of row value to sum of rows in a group using r data.table

You can use prop.table to get ratio for value in each year and quarter.

library(data.table)

dt[, pct_byQtrYr := prop.table(value), .(year, quarter)]
dt

# ID year quarter value pct_byQtrYr
# 1: A 2020 4 4.0 0.1951220
# 2: B 2020 4 10.5 0.5121951
# 3: C 2020 4 6.0 0.2926829
# 4: A 2021 1 6.6 0.2933333
# 5: B 2021 1 15.0 0.6666667
# 6: C 2021 1 0.9 0.0400000
# 7: A 2021 2 6.2 0.1980831
# 8: B 2021 2 9.8 0.3130990
# 9: C 2021 2 15.3 0.4888179
#10: A 2021 3 5.0 0.5263158
#11: B 2021 3 3.4 0.3578947
#12: C 2021 3 1.1 0.1157895

This is similar to dividing value by sum of the group.

dt[, pct_byQtrYr := value/sum(value), .(year, quarter)]

Calculating the proportion per subgroup with data.table

Using data.table:

df <- read.table(header = T, text = "row  country year
1 NLD 2005
2 NLD 2005
3 BLG 2006
4 BLG 2005
5 GER 2005
6 NLD 2007
7 NLD 2005
8 NLD 2008")

setDT(df)[, sum := .N, by = country][, prop := .N, by = c("country", "year")][, prop := prop/sum][, sum := NULL]

row country year prop
1: 1 NLD 2005 0.6
2: 2 NLD 2005 0.6
3: 3 BLG 2006 0.5
4: 4 BLG 2005 0.5
5: 5 GER 2005 1.0
6: 6 NLD 2007 0.2
7: 7 NLD 2005 0.6
8: 8 NLD 2008 0.2

Calculate Percentage and other functions using data.table

We can use the similar approach with data.table

res <- IData[, .(numbers1.mean = mean(numbers1),
numbers1.median = median(numbers1),
numbers2.mean=mean(numbers2),
sum.numbers1.n = sum(numbers1)), let
][, perc.numbers1 := sum.numbers1.n/sum(sum.numbers1.n)
][, c("let", "numbers1.mean", "numbers1.median",
"numbers2.mean", "perc.numbers1"), with = FALSE]

head(res)
# let numbers1.mean numbers1.median numbers2.mean perc.numbers1
#1: N 10320.951 10473.0 9374.435 0.03567927
#2: H 9683.590 9256.5 9328.035 0.03648391
#3: L 10223.322 10226.0 9806.210 0.04005400
#4: S 9922.486 9618.0 10233.849 0.03678742
#5: C 9592.620 9226.0 9791.221 0.03517997
#6: F 10323.867 10382.0 10036.561 0.03962035

Using dplyr function to calculate percentage within groups

library(dplyr)

df %>%
# line below to freeze order of type_n if not ordered factor already
mutate(type_n = forcats::fct_inorder(type_n)) %>%
group_by(type_n) %>%
summarize(n = n(), total = sum(population)) %>%
mutate(new_col = (n / total) %>% scales::percent(decimal.mark = ",", suffix = ""))

# A tibble: 3 x 4
type_n n total new_col
<fct> <int> <int> <chr>
1 small 2 7 28,6
2 medium 2 14 14,3
3 large 3 15 20,0

R data.table: subgroup weighted percent of group

This is almost a single step:

# A
widgets[,{
totwt = .N
.SD[,.(frac=.N/totwt),by=style]
},by=color]
# color style frac
# 1: red round 0.36
# 2: red pointy 0.32
# 3: red flat 0.32
# 4: green pointy 0.36
# 5: green flat 0.32
# 6: green round 0.32
# 7: blue flat 0.36
# 8: blue round 0.32
# 9: blue pointy 0.32
# 10: black round 0.36
# 11: black pointy 0.32
# 12: black flat 0.32

# B
widgets[,{
totwt = sum(weight)
.SD[,.(frac=sum(weight)/totwt),by=style]
},by=color]
# color style frac
# 1: red round 0.3466667
# 2: red pointy 0.3466667
# 3: red flat 0.3066667
# 4: green pointy 0.3333333
# 5: green flat 0.3200000
# 6: green round 0.3466667
# 7: blue flat 0.3866667
# 8: blue round 0.2933333
# 9: blue pointy 0.3200000
# 10: black round 0.3733333
# 11: black pointy 0.3333333
# 12: black flat 0.2933333

How it works: Construct your denominator for the top-level group (color) before going to the finer group (color with style) to tabulate.


Alternatives. If styles repeat within each color and this is only for display purposes, try a table:

# A
widgets[,
prop.table(table(color,style),1)
]
# style
# color flat pointy round
# black 0.32 0.32 0.36
# blue 0.36 0.32 0.32
# green 0.32 0.36 0.32
# red 0.32 0.32 0.36

# B
widgets[,rep(1L,sum(weight)),by=.(color,style)][,
prop.table(table(color,style),1)
]

# style
# color flat pointy round
# black 0.2933333 0.3333333 0.3733333
# blue 0.3866667 0.3200000 0.2933333
# green 0.3200000 0.3333333 0.3466667
# red 0.3066667 0.3466667 0.3466667

For B, this expands the data so that there is one observation for each unit of weight. With large data, such an expansion would be a bad idea (since it costs so much memory). Also, weight has to be an integer; otherwise, its sum will be silently truncated to one (e.g., try rep(1,2.5) # [1] 1 1).

KDB/Q: compute the percentage by group

You can use fby to do this in one query:

q)table:flip`day`week`item!(`mon`tue`wed`mon`tue`wed;1 1 1 2 2 2;2 7 1 1 2 1)
q)update proportion:item % (sum;item) fby week from table
day week item proportion
------------------------
mon 1 2 0.2
tue 1 7 0.7
wed 1 1 0.1
mon 2 1 0.25
tue 2 2 0.5
wed 2 1 0.25

Percentage of factor levels by group in R

Another solution (with base-R):

prop.table(table(mydata$CNT, mydata$FACTOR), margin = 1)
            1         2
A 0.6000000 0.4000000
B 0.6666667 0.3333333
C 0.5000000 0.5000000
D 1.0000000 0.0000000

How to use data.table to efficiently calculate allele frequencies (proportions) by group across multiple columns (loci)

It's probably wise to transform your data.table into long format first. This will make it easier to use for further calculations (or making visualisations with ggplot2 for example). With the melt function of data.table (which works the same as the melt function of the reshape2 package) you can transform from wide to long format:

DT2 <- melt(DT, id = "Group", variable.name = "loci")

When you want to remove the NA-values during the melt-operation, you can add na.rm = TRUE in the above call (na.rm = FALSE is the default behaviour).

Then you can make count and proportion variables as follows:

DT2 <- DT2[, .N, by = .(Group, loci, value)][, prop := N/sum(N), by = .(Group, loci)]

which gives the following result:

> DT2
Group loci value N prop
1: G1 Loc1 G 3 1.0000000
2: G2 Loc1 NA 1 0.2500000
3: G2 Loc1 G 1 0.2500000
4: G2 Loc1 T 2 0.5000000
5: G3 Loc1 T 2 0.6666667
6: G3 Loc1 NA 1 0.3333333
7: G1 Loc2 NA 1 0.3333333
8: G1 Loc2 A 1 0.3333333
9: G1 Loc2 C 1 0.3333333
10: G2 Loc2 NA 1 0.2500000
11: G2 Loc2 C 2 0.5000000
12: G2 Loc2 A 1 0.2500000
13: G3 Loc2 A 2 0.6666667
14: G3 Loc2 C 1 0.3333333
15: G1 Loc3 C 1 0.3333333
16: G1 Loc3 G 2 0.6666667
17: G2 Loc3 NA 2 0.5000000
18: G2 Loc3 G 2 0.5000000
19: G3 Loc3 G 3 1.0000000

I you want it back in wide format, you can use dcast on multiple variables:

DT3 <- dcast(DT2, Group + loci ~ value, value.var = c("N", "prop"), fill = 0)

which results in:

> DT3
Group loci N_A N_C N_G N_T N_NA prop_A prop_C prop_G prop_T prop_NA
1: G1 Loc1 0 0 3 0 0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
2: G1 Loc2 1 1 0 0 1 0.3333333 0.3333333 0.0000000 0.0000000 0.3333333
3: G1 Loc3 0 1 2 0 0 0.0000000 0.3333333 0.6666667 0.0000000 0.0000000
4: G2 Loc1 0 0 1 2 1 0.0000000 0.0000000 0.2500000 0.5000000 0.2500000
5: G2 Loc2 1 2 0 0 1 0.2500000 0.5000000 0.0000000 0.0000000 0.2500000
6: G2 Loc3 0 0 2 0 2 0.0000000 0.0000000 0.5000000 0.0000000 0.5000000
7: G3 Loc1 0 0 0 2 1 0.0000000 0.0000000 0.0000000 0.6666667 0.3333333
8: G3 Loc2 2 1 0 0 0 0.6666667 0.3333333 0.0000000 0.0000000 0.0000000
9: G3 Loc3 0 0 3 0 0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000

Another and straightforward approach is using melt and dcast in one call (which is a simplified version of the first part of @Frank's answer):

DT2 <- dcast(melt(DT, id="Group"), Group + variable ~ value)

which gives:

> DT2
Group variable A C G T NA
1: G1 Loc1 0 0 3 0 0
2: G1 Loc2 1 1 0 0 1
3: G1 Loc3 0 1 2 0 0
4: G2 Loc1 0 0 1 2 1
5: G2 Loc2 1 2 0 0 1
6: G2 Loc3 0 0 2 0 2
7: G3 Loc1 0 0 0 2 1
8: G3 Loc2 2 1 0 0 0
9: G3 Loc3 0 0 3 0 0

Because the default aggregation function in dcast is length, you will automatically get the counts for each of the values.


Used data:

DT <- structure(list(Loc1 = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA), 
Loc2 = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"),
Loc3 = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"),
Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")),
.Names = c("Loc1", "Loc2", "Loc3", "Group"), row.names = c(NA, -10L), class = c("data.table", "data.frame"))


Related Topics



Leave a reply



Submit