How to Use Data.Table to Efficiently Calculate Allele Frequencies (Proportions) by Group Across Multiple Columns (Loci)

How to use data.table to efficiently calculate allele frequencies (proportions) by group across multiple columns (loci)

It's probably wise to transform your data.table into long format first. This will make it easier to use for further calculations (or making visualisations with ggplot2 for example). With the melt function of data.table (which works the same as the melt function of the reshape2 package) you can transform from wide to long format:

DT2 <- melt(DT, id = "Group", variable.name = "loci")

When you want to remove the NA-values during the melt-operation, you can add na.rm = TRUE in the above call (na.rm = FALSE is the default behaviour).

Then you can make count and proportion variables as follows:

DT2 <- DT2[, .N, by = .(Group, loci, value)][, prop := N/sum(N), by = .(Group, loci)]

which gives the following result:

> DT2
    Group loci value N      prop
 1:    G1 Loc1     G 3 1.0000000
 2:    G2 Loc1    NA 1 0.2500000
 3:    G2 Loc1     G 1 0.2500000
 4:    G2 Loc1     T 2 0.5000000
 5:    G3 Loc1     T 2 0.6666667
 6:    G3 Loc1    NA 1 0.3333333
 7:    G1 Loc2    NA 1 0.3333333
 8:    G1 Loc2     A 1 0.3333333
 9:    G1 Loc2     C 1 0.3333333
10:    G2 Loc2    NA 1 0.2500000
11:    G2 Loc2     C 2 0.5000000
12:    G2 Loc2     A 1 0.2500000
13:    G3 Loc2     A 2 0.6666667
14:    G3 Loc2     C 1 0.3333333
15:    G1 Loc3     C 1 0.3333333
16:    G1 Loc3     G 2 0.6666667
17:    G2 Loc3    NA 2 0.5000000
18:    G2 Loc3     G 2 0.5000000
19:    G3 Loc3     G 3 1.0000000

I you want it back in wide format, you can use dcast on multiple variables:

DT3 <- dcast(DT2, Group + loci ~ value, value.var = c("N", "prop"), fill = 0)

which results in:

> DT3
   Group loci N_A N_C N_G N_T N_NA    prop_A    prop_C    prop_G    prop_T   prop_NA
1:    G1 Loc1   0   0   3   0    0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
2:    G1 Loc2   1   1   0   0    1 0.3333333 0.3333333 0.0000000 0.0000000 0.3333333
3:    G1 Loc3   0   1   2   0    0 0.0000000 0.3333333 0.6666667 0.0000000 0.0000000
4:    G2 Loc1   0   0   1   2    1 0.0000000 0.0000000 0.2500000 0.5000000 0.2500000
5:    G2 Loc2   1   2   0   0    1 0.2500000 0.5000000 0.0000000 0.0000000 0.2500000
6:    G2 Loc3   0   0   2   0    2 0.0000000 0.0000000 0.5000000 0.0000000 0.5000000
7:    G3 Loc1   0   0   0   2    1 0.0000000 0.0000000 0.0000000 0.6666667 0.3333333
8:    G3 Loc2   2   1   0   0    0 0.6666667 0.3333333 0.0000000 0.0000000 0.0000000
9:    G3 Loc3   0   0   3   0    0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000

Another and straightforward approach is using melt and dcast in one call (which is a simplified version of the first part of @Frank's answer):

DT2 <- dcast(melt(DT, id="Group"), Group + variable ~ value)

which gives:

> DT2
   Group variable A C G T NA
1:    G1     Loc1 0 0 3 0  0
2:    G1     Loc2 1 1 0 0  1
3:    G1     Loc3 0 1 2 0  0
4:    G2     Loc1 0 0 1 2  1
5:    G2     Loc2 1 2 0 0  1
6:    G2     Loc3 0 0 2 0  2
7:    G3     Loc1 0 0 0 2  1
8:    G3     Loc2 2 1 0 0  0
9:    G3     Loc3 0 0 3 0  0

Because the default aggregation function in dcast is length, you will automatically get the counts for each of the values.

Used data:

DT <- structure(list(Loc1 = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA), 
                     Loc2 = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"), 
                     Loc3 = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"), 
                     Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")), 
                .Names = c("Loc1", "Loc2", "Loc3", "Group"), row.names = c(NA, -10L), class = c("data.table", "data.frame"))

calculating the proportion of count variable per group in data.table in R

If you are looking for the ratio, you can do :

library(data.table)
mydata[, prop := count/sum(count) * 100, by = .(startYear, groupSize)]

#       groupSize gender startYear count       prop
# 1: intermediate      F      2014  7546 55.9958445
# 2:        small      F      2014  3500 31.3395415
# 3: intermediate      M      2014  5930 44.0041555
# 4:        small      M      2014  7668 68.6604585
# 5:         huge      F      2014 18114 56.7125861
# 6:         huge      M      2014 13826 43.2874139
# 7:        large      F      2014 11943 54.2222828
# 8:        large      M      2014 10083 45.7777172
#....

How to apply function over subset of columns in data.table while grouping by some other column?

We could specify the .SDcols and assign the output back to the same columns.

foo[, names(foo)[3:5]  := lapply(.SD, function(x) x/sum(x)),
                 by = group, .SDcols = x:z]

Note that the type should be the same for the output and the input. If the input is integer and output is numeric it would have problem. So, change the class to numeric first and then do the assignment

nm1 <- names(foo)[3:5]
#or programmatically based on checking whether column is numeric
#nm1 <- foo[, which(unlist(lapply(.SD, is.numeric)))]
foo[, (nm1) := lapply(.SD, as.numeric), .SDcols = nm1
      ][, (nm1) := lapply(.SD, function(x) x/sum(x)), 
                by = group, .SDcols = nm1][]

A tidyverse approach to the above would be

library(dplyr)
foo %>% 
     group_by(group) %>%
     mutate_if(is.numeric, funs(./sum(.)))

Data.table: Applying function to rows of grouped columns (could also call it collapsing columns)

If they are in the same order, use split.default

setDT(lapply(split.default(data, group$Group), rowSums))[]

-output

If the column names are not in the same order, then use a matching with named vector

nm1 <- setNames(group$Group, group$Sample)[colnames(data)]
setDT(lapply(split.default(data, nm1), rowSums))[]

Or may also do the split from the 'group' data and loop over the list, extract the columns, and do rowSums

setDT(lapply(split(group$Sample, group$Group),
       function(x) rowSums(data[, ..x])))[]

Benchmarks

set.seed(24)
data_test <- as.data.table(matrix(rnorm(5000 * 5000), ncol = 5000, dimnames = list(NULL, paste0("C", 1:5000))))

group_test <- data.table(Sample= paste0("C", 1:5000),Group = rep(LETTERS[1:10], 500) )

system.time({
nm1 <- setNames(group_test$Group, group_test$Sample)[colnames(data_test)]
setDT(lapply(split.default(data_test, nm1), rowSums))[]

})
#   user  system elapsed 
#  0.167   0.048   0.219 

system.time({
long <- melt(data_test[, rn := .I], "rn")
dcast(long[group_test, on = "variable==Sample"], rn ~ Group, sum)

})
#   user  system elapsed 
#  2.897   0.305   3.189

Calculate the difference between consecutive, grouped columns in a data.table

I think your loop is fine, except you should use := instead of cbind to add columns:

ref <- lapply(reference,function(x) names(DT)[x])

for (g in ref){
    if (length(g)==1) next
    gx   = tail(g,-1)
    gy   = head(g,-1)
    gn   = paste0("D[",gy,":",gx,"]")
    DT[,(gn) := mapply(function(x,y).SD[[x]]-.SD[[y]], gx, gy, SIMPLIFY=FALSE)]
}

How to find new individuals in repeated surveys with data.table

This is the first survey each individual appeared in each group,

dat[order(ind, surv), .SD[1], .(group, ind)][surv != 1]
#>    group ind surv
#> 1:     1   c    2
#> 2:     2   d    2
#> 3:     1   d    3

Note I've excluded the individuals who first appeared in the first survey.

Merging two dataset - include unique rows

Are you looking for something like this?
The script counts number of each "loci" and divides it by the number of all loci per location.

xy <- read.table(text = "Island,Individual,all1,all2
                 Santiago,CVW3,01,01
                 Santiago,CVW8,01,02
                 Santiago,CVW9,03,03
                 Santiago,CVW10,01,01
                 Santiago,CVW12,03,03
                 Santiago,CVW19,01,01
                 Santiago,CVW25,01,04
                 Santiago,CVW39,04,04
                 Santiago,CVW40,01,01
                 Santiago,CVW41,01,01
                 Fogo,CVW64,04,04
                 Fogo,CVW67,01,01
                 Fogo,CVW70,02,04
                 Fogo,CVW74,03,05
                 Fogo,CVW80,01,02
                 Fogo,CVW99,01,04
                 Fogo,CVW101,02,02
                 Fogo,CVW103,01,02
                 Fogo,CVW104,01,04
                 Fogo,CVW111,02,04
                 SaoNicolau,CVW81,01,01
                 SaoNicolau,CVW87,01,01
                 SaoNicolau,CVW92,01,01
                 SaoNicolau,CVW124,01,01
                 SaoNicolau,CVW125,01,01
                 SaoNicolau,CVW133,01,01
                 SaoNicolau,CVW136,01,01", header = TRUE, sep = ",", 
                 colClasses = c("factor", "factor", "character", "character"),
                 stringsAsFactors = FALSE)

by(xy, xy$Island, FUN = function(x) {
  table(c(x$all1, x$all2))/length(c(x$all1, x$all2))
})

xy$Island:                  Fogo

  01   02   03   04   05 
0.30 0.30 0.05 0.30 0.05 
------------------------------------------------------------------------------------------------------ 
xy$Island:                  Santiago

  01   02   03   04 
0.60 0.05 0.20 0.15 
------------------------------------------------------------------------------------------------------ 
xy$Island:                  SaoNicolau

01 
 1

How to Use Data.Table to Efficiently Calculate Allele Frequencies (Proportions) by Group Across Multiple Columns (Loci)