Calculate percentages / proportions of values by group using data.table
I don't quite understand the data.table solution already posted, so I would do it like this. I would also rename the columns to remove the parentheses, which avoids a lot of backtick quoting of column names:
dt[ , `percentage(counts)` := `sum(count)` / sum( `sum(count)` ) * 100 , by = "x" ]
# x y sum(count) percentage(counts)
#1: 1 1 3 16.66667
#2: 1 2 7 38.88889
#3: 1 3 8 44.44444
#4: 2 1 4 23.52941
#5: 2 2 3 17.64706
#6: 2 3 10 58.82353
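The renaming suggestion could look like the sketch below. It assumes a table shaped like the printed output above; the replacement names `total` and `pct` are illustrative and do not come from the original answer.

```r
library(data.table)

# Rebuild the example table (values copied from the printed output above)
dt <- data.table(
  x = c(1, 1, 1, 2, 2, 2),
  y = c(1, 2, 3, 1, 2, 3),
  `sum(count)` = c(3, 7, 8, 4, 3, 10)
)

# Rename once so later expressions need no backticks
# ("total" is a hypothetical name chosen for this sketch)
setnames(dt, "sum(count)", "total")

# Percentage of each row's total within its x group
dt[, pct := total / sum(total) * 100, by = x]
dt
```

After the rename, every later expression can refer to `total` and `pct` directly.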
Ratio of row value to sum of rows in a group using r data.table
You can use prop.table to get the ratio of each value within each year and quarter.
library(data.table)
dt[, pct_byQtrYr := prop.table(value), .(year, quarter)]
dt
# ID year quarter value pct_byQtrYr
# 1: A 2020 4 4.0 0.1951220
# 2: B 2020 4 10.5 0.5121951
# 3: C 2020 4 6.0 0.2926829
# 4: A 2021 1 6.6 0.2933333
# 5: B 2021 1 15.0 0.6666667
# 6: C 2021 1 0.9 0.0400000
# 7: A 2021 2 6.2 0.1980831
# 8: B 2021 2 9.8 0.3130990
# 9: C 2021 2 15.3 0.4888179
#10: A 2021 3 5.0 0.5263158
#11: B 2021 3 3.4 0.3578947
#12: C 2021 3 1.1 0.1157895
This is equivalent to dividing value by the sum of value within each group.
dt[, pct_byQtrYr := value/sum(value), .(year, quarter)]
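As a quick check, the two forms can be compared on a small slice of the data. This is a minimal sketch using only the three 2020 Q4 rows from the output above; the column names `pct_prop` and `pct_div` are made up for the comparison.

```r
library(data.table)

# The three rows for year 2020, quarter 4 from the example above
dt <- data.table(
  ID      = c("A", "B", "C"),
  year    = 2020,
  quarter = 4,
  value   = c(4.0, 10.5, 6.0)
)

# Same ratio computed two ways (hypothetical column names)
dt[, pct_prop := prop.table(value),  by = .(year, quarter)]
dt[, pct_div  := value / sum(value), by = .(year, quarter)]

# TRUE when both columns agree
all.equal(dt$pct_prop, dt$pct_div)
```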
Calculating the proportion per subgroup with data.table
Using data.table:
df <- read.table(header = TRUE, text = "row country year
1 NLD 2005
2 NLD 2005
3 BLG 2006
4 BLG 2005
5 GER 2005
6 NLD 2007
7 NLD 2005
8 NLD 2008")
setDT(df)[, sum := .N, by = country][, prop := .N, by = c("country", "year")][, prop := prop/sum][, sum := NULL]
row country year prop
1: 1 NLD 2005 0.6
2: 2 NLD 2005 0.6
3: 3 BLG 2006 0.5
4: 4 BLG 2005 0.5
5: 5 GER 2005 1.0
6: 6 NLD 2007 0.2
7: 7 NLD 2005 0.6
8: 8 NLD 2008 0.2
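The same result can be reached without naming a temporary column `sum`, which is easy to confuse with the `sum()` function. The sketch below is a variant, not the original answer's code; `n_country` is a hypothetical helper column.

```r
library(data.table)

# Same rows as the example above
df <- data.table(
  row     = 1:8,
  country = c("NLD", "NLD", "BLG", "BLG", "GER", "NLD", "NLD", "NLD"),
  year    = c(2005, 2005, 2006, 2005, 2005, 2007, 2005, 2008)
)

# Rows per country (the denominator), stored in a temporary column
df[, n_country := .N, by = country]

# Rows per country-year divided by rows per country
df[, prop := .N / n_country, by = .(country, year)]

# Drop the helper column
df[, n_country := NULL]
df
```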
Calculate Percentage and other functions using data.table
We can use a similar approach with data.table:
res <- IData[, .(numbers1.mean = mean(numbers1),
                 numbers1.median = median(numbers1),
                 numbers2.mean = mean(numbers2),
                 sum.numbers1.n = sum(numbers1)), let
    ][, perc.numbers1 := sum.numbers1.n / sum(sum.numbers1.n)
    ][, c("let", "numbers1.mean", "numbers1.median",
          "numbers2.mean", "perc.numbers1"), with = FALSE]
head(res)
# let numbers1.mean numbers1.median numbers2.mean perc.numbers1
#1: N 10320.951 10473.0 9374.435 0.03567927
#2: H 9683.590 9256.5 9328.035 0.03648391
#3: L 10223.322 10226.0 9806.210 0.04005400
#4: S 9922.486 9618.0 10233.849 0.03678742
#5: C 9592.620 9226.0 9791.221 0.03517997
#6: F 10323.867 10382.0 10036.561 0.03962035
Using dplyr function to calculate percentage within groups
library(dplyr)
df %>%
# line below to freeze order of type_n if not ordered factor already
mutate(type_n = forcats::fct_inorder(type_n)) %>%
group_by(type_n) %>%
summarize(n = n(), total = sum(population)) %>%
mutate(new_col = (n / total) %>% scales::percent(decimal.mark = ",", suffix = ""))
# A tibble: 3 x 4
type_n n total new_col
<fct> <int> <int> <chr>
1 small 2 7 28,6
2 medium 2 14 14,3
3 large 3 15 20,0
R data.table: subgroup weighted percent of group
This is almost a single step:
# A
widgets[,{
totwt = .N
.SD[,.(frac=.N/totwt),by=style]
},by=color]
# color style frac
# 1: red round 0.36
# 2: red pointy 0.32
# 3: red flat 0.32
# 4: green pointy 0.36
# 5: green flat 0.32
# 6: green round 0.32
# 7: blue flat 0.36
# 8: blue round 0.32
# 9: blue pointy 0.32
# 10: black round 0.36
# 11: black pointy 0.32
# 12: black flat 0.32
# B
widgets[,{
totwt = sum(weight)
.SD[,.(frac=sum(weight)/totwt),by=style]
},by=color]
# color style frac
# 1: red round 0.3466667
# 2: red pointy 0.3466667
# 3: red flat 0.3066667
# 4: green pointy 0.3333333
# 5: green flat 0.3200000
# 6: green round 0.3466667
# 7: blue flat 0.3866667
# 8: blue round 0.2933333
# 9: blue pointy 0.3200000
# 10: black round 0.3733333
# 11: black pointy 0.3333333
# 12: black flat 0.2933333
How it works: Construct your denominator for the top-level group (color) before going to the finer group (color with style) to tabulate.
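The denominator-first idea can also be written as two chained steps. The sketch below compares the nested form with such a variant; the `widgets` data here is hypothetical toy data (the original table is not shown), and the two-step version is an alternative phrasing rather than the answer's own code.

```r
library(data.table)

# Hypothetical toy data standing in for the widgets table
widgets <- data.table(
  color  = c("red", "red", "red", "blue", "blue"),
  style  = c("round", "flat", "round", "flat", "round"),
  weight = c(1L, 2L, 3L, 4L, 4L)
)

# Nested form: denominator per color, then tabulate by style inside it
nested <- widgets[, {
  totwt <- sum(weight)
  .SD[, .(frac = sum(weight) / totwt), by = style]
}, by = color]

# Two-step variant: aggregate to (color, style), then divide within color
two_step <- widgets[, .(w = sum(weight)), by = .(color, style)][
  , frac := w / sum(w), by = color][
  , w := NULL]

# Both give the same fractions
all.equal(nested$frac, two_step$frac)
```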
Alternatives. If styles repeat within each color and this is only for display purposes, try a table:
# A
widgets[,
prop.table(table(color,style),1)
]
# style
# color flat pointy round
# black 0.32 0.32 0.36
# blue 0.36 0.32 0.32
# green 0.32 0.36 0.32
# red 0.32 0.32 0.36
# B
widgets[,rep(1L,sum(weight)),by=.(color,style)][,
prop.table(table(color,style),1)
]
# style
# color flat pointy round
# black 0.2933333 0.3333333 0.3733333
# blue 0.3866667 0.3200000 0.2933333
# green 0.3200000 0.3333333 0.3466667
# red 0.3066667 0.3466667 0.3466667
For B, this expands the data so that there is one observation for each unit of weight. With large data, such an expansion would be a bad idea, since it costs a lot of memory. Also, weight has to be an integer; otherwise, its sum will be silently truncated to an integer (e.g., try rep(1,2.5) # [1] 1 1).
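If the expansion in B is a concern, one option is base R's `xtabs`, which sums a numeric column within each cell of a cross-tabulation and so also handles non-integer weights. This is a sketch of a swapped-in technique, not part of the original answer, and the `widgets` data is hypothetical toy data.

```r
library(data.table)

# Hypothetical toy data with non-integer weights
widgets <- data.table(
  color  = c("red", "red", "red", "blue", "blue"),
  style  = c("round", "flat", "round", "flat", "round"),
  weight = c(1.5, 2, 2.5, 4, 4)
)

# Sum of weight per color/style cell, with no row expansion
wt_tab <- xtabs(weight ~ color + style, data = widgets)

# Row-wise proportions: each color's row sums to 1
wt_prop <- prop.table(wt_tab, margin = 1)
wt_prop
```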
KDB/Q: compute the percentage by group
You can use fby to do this in one query:
q)table:flip`day`week`item!(`mon`tue`wed`mon`tue`wed;1 1 1 2 2 2;2 7 1 1 2 1)
q)update proportion:item % (sum;item) fby week from table
day week item proportion
------------------------
mon 1 2 0.2
tue 1 7 0.7
wed 1 1 0.1
mon 2 1 0.25
tue 2 2 0.5
wed 2 1 0.25
Percentage of factor levels by group in R
Another solution (with base-R):
prop.table(table(mydata$CNT, mydata$FACTOR), margin = 1)
1 2
A 0.6000000 0.4000000
B 0.6666667 0.3333333
C 0.5000000 0.5000000
D 1.0000000 0.0000000
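The `margin` argument decides which direction the proportions are normalised in. The sketch below uses a small hypothetical `mydata` (the original data is not shown) to contrast row-wise and column-wise proportions.

```r
# Hypothetical stand-in for mydata
mydata <- data.frame(
  CNT    = c("A", "A", "A", "B", "B"),
  FACTOR = c(1, 1, 2, 1, 2)
)

tab <- table(mydata$CNT, mydata$FACTOR)

# margin = 1: proportions within each row (each CNT level sums to 1)
row_prop <- prop.table(tab, margin = 1)

# margin = 2: proportions within each column (each FACTOR level sums to 1)
col_prop <- prop.table(tab, margin = 2)

row_prop
col_prop
```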
How to use data.table to efficiently calculate allele frequencies (proportions) by group across multiple columns (loci)
It's probably wise to transform your data.table into long format first. This will make it easier to use for further calculations (or for making visualisations with ggplot2, for example). With the melt function of data.table (which works the same as the melt function of the reshape2 package) you can transform from wide to long format:
DT2 <- melt(DT, id = "Group", variable.name = "loci")
When you want to remove the NA-values during the melt-operation, you can add na.rm = TRUE in the above call (na.rm = FALSE is the default behaviour).
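The effect of `na.rm` is easiest to see on a tiny table. The sketch below uses a hypothetical three-row `DT` rather than the full data set; note that dropping the NA rows also changes the denominators of any proportions computed afterwards.

```r
library(data.table)

# Hypothetical miniature of the wide table
DT <- data.table(
  Group = c("G1", "G1", "G2"),
  Loc1  = c("G", NA, "T"),
  Loc2  = c("A", "C", NA)
)

# Default: NA values are kept as rows in the long table
long_all <- melt(DT, id.vars = "Group", variable.name = "loci")

# na.rm = TRUE: rows whose value is NA are dropped while melting
long_obs <- melt(DT, id.vars = "Group", variable.name = "loci", na.rm = TRUE)

# Proportions among the non-missing values only
props <- long_obs[, .N, by = .(Group, loci, value)][
  , prop := N / sum(N), by = .(Group, loci)]
props
```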
Then you can make count and proportion variables as follows:
DT2 <- DT2[, .N, by = .(Group, loci, value)][, prop := N/sum(N), by = .(Group, loci)]
which gives the following result:
> DT2
Group loci value N prop
1: G1 Loc1 G 3 1.0000000
2: G2 Loc1 NA 1 0.2500000
3: G2 Loc1 G 1 0.2500000
4: G2 Loc1 T 2 0.5000000
5: G3 Loc1 T 2 0.6666667
6: G3 Loc1 NA 1 0.3333333
7: G1 Loc2 NA 1 0.3333333
8: G1 Loc2 A 1 0.3333333
9: G1 Loc2 C 1 0.3333333
10: G2 Loc2 NA 1 0.2500000
11: G2 Loc2 C 2 0.5000000
12: G2 Loc2 A 1 0.2500000
13: G3 Loc2 A 2 0.6666667
14: G3 Loc2 C 1 0.3333333
15: G1 Loc3 C 1 0.3333333
16: G1 Loc3 G 2 0.6666667
17: G2 Loc3 NA 2 0.5000000
18: G2 Loc3 G 2 0.5000000
19: G3 Loc3 G 3 1.0000000
If you want it back in wide format, you can use dcast on multiple variables:
DT3 <- dcast(DT2, Group + loci ~ value, value.var = c("N", "prop"), fill = 0)
which results in:
> DT3
Group loci N_A N_C N_G N_T N_NA prop_A prop_C prop_G prop_T prop_NA
1: G1 Loc1 0 0 3 0 0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
2: G1 Loc2 1 1 0 0 1 0.3333333 0.3333333 0.0000000 0.0000000 0.3333333
3: G1 Loc3 0 1 2 0 0 0.0000000 0.3333333 0.6666667 0.0000000 0.0000000
4: G2 Loc1 0 0 1 2 1 0.0000000 0.0000000 0.2500000 0.5000000 0.2500000
5: G2 Loc2 1 2 0 0 1 0.2500000 0.5000000 0.0000000 0.0000000 0.2500000
6: G2 Loc3 0 0 2 0 2 0.0000000 0.0000000 0.5000000 0.0000000 0.5000000
7: G3 Loc1 0 0 0 2 1 0.0000000 0.0000000 0.0000000 0.6666667 0.3333333
8: G3 Loc2 2 1 0 0 0 0.6666667 0.3333333 0.0000000 0.0000000 0.0000000
9: G3 Loc3 0 0 3 0 0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
Another, more straightforward approach is to use melt and dcast in one call (which is a simplified version of the first part of @Frank's answer):
DT2 <- dcast(melt(DT, id="Group"), Group + variable ~ value)
which gives:
> DT2
Group variable A C G T NA
1: G1 Loc1 0 0 3 0 0
2: G1 Loc2 1 1 0 0 1
3: G1 Loc3 0 1 2 0 0
4: G2 Loc1 0 0 1 2 1
5: G2 Loc2 1 2 0 0 1
6: G2 Loc3 0 0 2 0 2
7: G3 Loc1 0 0 0 2 1
8: G3 Loc2 2 1 0 0 0
9: G3 Loc3 0 0 3 0 0
Because the default aggregation function in dcast is length, you will automatically get the counts for each of the values.
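Spelling out the aggregation makes the counting explicit. The sketch below uses a hypothetical four-row long table and passes `fun.aggregate = length` and `value.var` explicitly, which should give the same counts as relying on the default.

```r
library(data.table)

# Hypothetical long-format data: one row per observed allele
long <- data.table(
  Group    = c("G1", "G1", "G1", "G2"),
  variable = c("Loc1", "Loc1", "Loc1", "Loc1"),
  value    = c("G", "G", "T", "T")
)

# Count occurrences of each value per Group and variable;
# combinations that never occur are filled with 0
counts <- dcast(long, Group + variable ~ value,
                value.var = "value", fun.aggregate = length)
counts
```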
Used data:
DT <- structure(list(Loc1 = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA),
Loc2 = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"),
Loc3 = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"),
Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")),
.Names = c("Loc1", "Loc2", "Loc3", "Group"), row.names = c(NA, -10L), class = c("data.table", "data.frame"))