How to Use Data.Table to Efficiently Calculate Allele Frequencies (Proportions) by Group Across Multiple Columns (Loci)

How to use data.table to efficiently calculate allele frequencies (proportions) by group across multiple columns (loci)

It's probably wise to transform your data.table into long format first. This will make it easier to use for further calculations (or making visualisations with ggplot2 for example). With the melt function of data.table (which works the same as the melt function of the reshape2 package) you can transform from wide to long format:

DT2 <- melt(DT, id = "Group", variable.name = "loci")

When you want to remove the NA-values during the melt-operation, you can add na.rm = TRUE in the above call (na.rm = FALSE is the default behaviour).

Then you can make count and proportion variables as follows:

DT2 <- DT2[, .N, by = .(Group, loci, value)][, prop := N/sum(N), by = .(Group, loci)]

which gives the following result:

> DT2
Group loci value N prop
1: G1 Loc1 G 3 1.0000000
2: G2 Loc1 NA 1 0.2500000
3: G2 Loc1 G 1 0.2500000
4: G2 Loc1 T 2 0.5000000
5: G3 Loc1 T 2 0.6666667
6: G3 Loc1 NA 1 0.3333333
7: G1 Loc2 NA 1 0.3333333
8: G1 Loc2 A 1 0.3333333
9: G1 Loc2 C 1 0.3333333
10: G2 Loc2 NA 1 0.2500000
11: G2 Loc2 C 2 0.5000000
12: G2 Loc2 A 1 0.2500000
13: G3 Loc2 A 2 0.6666667
14: G3 Loc2 C 1 0.3333333
15: G1 Loc3 C 1 0.3333333
16: G1 Loc3 G 2 0.6666667
17: G2 Loc3 NA 2 0.5000000
18: G2 Loc3 G 2 0.5000000
19: G3 Loc3 G 3 1.0000000

I you want it back in wide format, you can use dcast on multiple variables:

DT3 <- dcast(DT2, Group + loci ~ value, value.var = c("N", "prop"), fill = 0)

which results in:

> DT3
Group loci N_A N_C N_G N_T N_NA prop_A prop_C prop_G prop_T prop_NA
1: G1 Loc1 0 0 3 0 0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
2: G1 Loc2 1 1 0 0 1 0.3333333 0.3333333 0.0000000 0.0000000 0.3333333
3: G1 Loc3 0 1 2 0 0 0.0000000 0.3333333 0.6666667 0.0000000 0.0000000
4: G2 Loc1 0 0 1 2 1 0.0000000 0.0000000 0.2500000 0.5000000 0.2500000
5: G2 Loc2 1 2 0 0 1 0.2500000 0.5000000 0.0000000 0.0000000 0.2500000
6: G2 Loc3 0 0 2 0 2 0.0000000 0.0000000 0.5000000 0.0000000 0.5000000
7: G3 Loc1 0 0 0 2 1 0.0000000 0.0000000 0.0000000 0.6666667 0.3333333
8: G3 Loc2 2 1 0 0 0 0.6666667 0.3333333 0.0000000 0.0000000 0.0000000
9: G3 Loc3 0 0 3 0 0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000

Another and straightforward approach is using melt and dcast in one call (which is a simplified version of the first part of @Frank's answer):

DT2 <- dcast(melt(DT, id="Group"), Group + variable ~ value)

which gives:

> DT2
Group variable A C G T NA
1: G1 Loc1 0 0 3 0 0
2: G1 Loc2 1 1 0 0 1
3: G1 Loc3 0 1 2 0 0
4: G2 Loc1 0 0 1 2 1
5: G2 Loc2 1 2 0 0 1
6: G2 Loc3 0 0 2 0 2
7: G3 Loc1 0 0 0 2 1
8: G3 Loc2 2 1 0 0 0
9: G3 Loc3 0 0 3 0 0

Because the default aggregation function in dcast is length, you will automatically get the counts for each of the values.


Used data:

DT <- structure(list(Loc1 = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA), 
Loc2 = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"),
Loc3 = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"),
Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")),
.Names = c("Loc1", "Loc2", "Loc3", "Group"), row.names = c(NA, -10L), class = c("data.table", "data.frame"))

calculating the proportion of count variable per group in data.table in R

If you are looking for the ratio, you can do :

library(data.table)
mydata[, prop := count/sum(count) * 100, by = .(startYear, groupSize)]

# groupSize gender startYear count prop
# 1: intermediate F 2014 7546 55.9958445
# 2: small F 2014 3500 31.3395415
# 3: intermediate M 2014 5930 44.0041555
# 4: small M 2014 7668 68.6604585
# 5: huge F 2014 18114 56.7125861
# 6: huge M 2014 13826 43.2874139
# 7: large F 2014 11943 54.2222828
# 8: large M 2014 10083 45.7777172
#....

How to apply function over subset of columns in data.table while grouping by some other column?

We could specify the .SDcols and assign the output back to the same columns.

foo[, names(foo)[3:5]  := lapply(.SD, function(x) x/sum(x)),
by = group, .SDcols = x:z]

Note that the type should be the same for the output and the input. If the input is integer and output is numeric it would have problem. So, change the class to numeric first and then do the assignment

nm1 <- names(foo)[3:5]
#or programmatically based on checking whether column is numeric
#nm1 <- foo[, which(unlist(lapply(.SD, is.numeric)))]
foo[, (nm1) := lapply(.SD, as.numeric), .SDcols = nm1
][, (nm1) := lapply(.SD, function(x) x/sum(x)),
by = group, .SDcols = nm1][]

A tidyverse approach to the above would be

library(dplyr)
foo %>%
group_by(group) %>%
mutate_if(is.numeric, funs(./sum(.)))

Data.table: Applying function to rows of grouped columns (could also call it collapsing columns)

If they are in the same order, use split.default

setDT(lapply(split.default(data, group$Group), rowSums))[]

-output

   X Y
1: 2 4
2: 2 4
3: 2 4
4: 2 4
5: 2 4

If the column names are not in the same order, then use a matching with named vector

nm1 <- setNames(group$Group, group$Sample)[colnames(data)]
setDT(lapply(split.default(data, nm1), rowSums))[]

Or may also do the split from the 'group' data and loop over the list, extract the columns, and do rowSums

setDT(lapply(split(group$Sample, group$Group),
function(x) rowSums(data[, ..x])))[]

Benchmarks

set.seed(24)
data_test <- as.data.table(matrix(rnorm(5000 * 5000), ncol = 5000, dimnames = list(NULL, paste0("C", 1:5000))))

group_test <- data.table(Sample= paste0("C", 1:5000),Group = rep(LETTERS[1:10], 500) )

system.time({
nm1 <- setNames(group_test$Group, group_test$Sample)[colnames(data_test)]
setDT(lapply(split.default(data_test, nm1), rowSums))[]

})
# user system elapsed
# 0.167 0.048 0.219

system.time({
long <- melt(data_test[, rn := .I], "rn")
dcast(long[group_test, on = "variable==Sample"], rn ~ Group, sum)

})
# user system elapsed
# 2.897 0.305 3.189

Calculate the difference between consecutive, grouped columns in a data.table

I think your loop is fine, except you should use := instead of cbind to add columns:

ref <- lapply(reference,function(x) names(DT)[x])

for (g in ref){
if (length(g)==1) next
gx = tail(g,-1)
gy = head(g,-1)
gn = paste0("D[",gy,":",gx,"]")
DT[,(gn) := mapply(function(x,y).SD[[x]]-.SD[[y]], gx, gy, SIMPLIFY=FALSE)]
}

How to find new individuals in repeated surveys with data.table

This is the first survey each individual appeared in each group,

dat[order(ind, surv), .SD[1], .(group, ind)][surv != 1]
#> group ind surv
#> 1: 1 c 2
#> 2: 2 d 2
#> 3: 1 d 3

Note I've excluded the individuals who first appeared in the first survey.

Merging two dataset - include unique rows

Are you looking for something like this?
The script counts number of each "loci" and divides it by the number of all loci per location.

xy <- read.table(text = "Island,Individual,all1,all2
Santiago,CVW3,01,01
Santiago,CVW8,01,02
Santiago,CVW9,03,03
Santiago,CVW10,01,01
Santiago,CVW12,03,03
Santiago,CVW19,01,01
Santiago,CVW25,01,04
Santiago,CVW39,04,04
Santiago,CVW40,01,01
Santiago,CVW41,01,01
Fogo,CVW64,04,04
Fogo,CVW67,01,01
Fogo,CVW70,02,04
Fogo,CVW74,03,05
Fogo,CVW80,01,02
Fogo,CVW99,01,04
Fogo,CVW101,02,02
Fogo,CVW103,01,02
Fogo,CVW104,01,04
Fogo,CVW111,02,04
SaoNicolau,CVW81,01,01
SaoNicolau,CVW87,01,01
SaoNicolau,CVW92,01,01
SaoNicolau,CVW124,01,01
SaoNicolau,CVW125,01,01
SaoNicolau,CVW133,01,01
SaoNicolau,CVW136,01,01", header = TRUE, sep = ",",
colClasses = c("factor", "factor", "character", "character"),
stringsAsFactors = FALSE)

by(xy, xy$Island, FUN = function(x) {
table(c(x$all1, x$all2))/length(c(x$all1, x$all2))
})

xy$Island: Fogo

01 02 03 04 05
0.30 0.30 0.05 0.30 0.05
------------------------------------------------------------------------------------------------------
xy$Island: Santiago

01 02 03 04
0.60 0.05 0.20 0.15
------------------------------------------------------------------------------------------------------
xy$Island: SaoNicolau

01
1


Related Topics



Leave a reply



Submit