How to use data.table to efficiently calculate allele frequencies (proportions) by group across multiple columns (loci)
It's probably wise to transform your data.table into long format first. This will make it easier to use for further calculations (or making visualisations with ggplot2
for example). With the melt
function of data.table
(which works the same as the melt
function of the reshape2
package) you can transform from wide to long format:
DT2 <- melt(DT, id = "Group", variable.name = "loci")
When you want to remove the NA
-values during the melt-operation, you can add na.rm = TRUE
in the above call (na.rm = FALSE
is the default behaviour).
Then you can make count and proportion variables as follows:
DT2 <- DT2[, .N, by = .(Group, loci, value)][, prop := N/sum(N), by = .(Group, loci)]
which gives the following result:
> DT2
Group loci value N prop
1: G1 Loc1 G 3 1.0000000
2: G2 Loc1 NA 1 0.2500000
3: G2 Loc1 G 1 0.2500000
4: G2 Loc1 T 2 0.5000000
5: G3 Loc1 T 2 0.6666667
6: G3 Loc1 NA 1 0.3333333
7: G1 Loc2 NA 1 0.3333333
8: G1 Loc2 A 1 0.3333333
9: G1 Loc2 C 1 0.3333333
10: G2 Loc2 NA 1 0.2500000
11: G2 Loc2 C 2 0.5000000
12: G2 Loc2 A 1 0.2500000
13: G3 Loc2 A 2 0.6666667
14: G3 Loc2 C 1 0.3333333
15: G1 Loc3 C 1 0.3333333
16: G1 Loc3 G 2 0.6666667
17: G2 Loc3 NA 2 0.5000000
18: G2 Loc3 G 2 0.5000000
19: G3 Loc3 G 3 1.0000000
I you want it back in wide format, you can use dcast
on multiple variables:
DT3 <- dcast(DT2, Group + loci ~ value, value.var = c("N", "prop"), fill = 0)
which results in:
> DT3
Group loci N_A N_C N_G N_T N_NA prop_A prop_C prop_G prop_T prop_NA
1: G1 Loc1 0 0 3 0 0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
2: G1 Loc2 1 1 0 0 1 0.3333333 0.3333333 0.0000000 0.0000000 0.3333333
3: G1 Loc3 0 1 2 0 0 0.0000000 0.3333333 0.6666667 0.0000000 0.0000000
4: G2 Loc1 0 0 1 2 1 0.0000000 0.0000000 0.2500000 0.5000000 0.2500000
5: G2 Loc2 1 2 0 0 1 0.2500000 0.5000000 0.0000000 0.0000000 0.2500000
6: G2 Loc3 0 0 2 0 2 0.0000000 0.0000000 0.5000000 0.0000000 0.5000000
7: G3 Loc1 0 0 0 2 1 0.0000000 0.0000000 0.0000000 0.6666667 0.3333333
8: G3 Loc2 2 1 0 0 0 0.6666667 0.3333333 0.0000000 0.0000000 0.0000000
9: G3 Loc3 0 0 3 0 0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
Another and straightforward approach is using melt
and dcast
in one call (which is a simplified version of the first part of @Frank's answer):
DT2 <- dcast(melt(DT, id="Group"), Group + variable ~ value)
which gives:
> DT2
Group variable A C G T NA
1: G1 Loc1 0 0 3 0 0
2: G1 Loc2 1 1 0 0 1
3: G1 Loc3 0 1 2 0 0
4: G2 Loc1 0 0 1 2 1
5: G2 Loc2 1 2 0 0 1
6: G2 Loc3 0 0 2 0 2
7: G3 Loc1 0 0 0 2 1
8: G3 Loc2 2 1 0 0 0
9: G3 Loc3 0 0 3 0 0
Because the default aggregation function in dcast
is length
, you will automatically get the counts for each of the values.
Used data:
DT <- structure(list(Loc1 = c("G", "G", "G", NA, "G", "T", "T", "T", "T", NA),
Loc2 = c(NA, "A", "C", NA, "C", "A", "C", "A", "C", "A"),
Loc3 = c("C", "G", "G", NA, NA, "G", "G", "G", "G", "G"),
Group = c("G1", "G1", "G1", "G2", "G2", "G2", "G2", "G3", "G3", "G3")),
.Names = c("Loc1", "Loc2", "Loc3", "Group"), row.names = c(NA, -10L), class = c("data.table", "data.frame"))
calculating the proportion of count variable per group in data.table in R
If you are looking for the ratio, you can do :
library(data.table)
mydata[, prop := count/sum(count) * 100, by = .(startYear, groupSize)]
# groupSize gender startYear count prop
# 1: intermediate F 2014 7546 55.9958445
# 2: small F 2014 3500 31.3395415
# 3: intermediate M 2014 5930 44.0041555
# 4: small M 2014 7668 68.6604585
# 5: huge F 2014 18114 56.7125861
# 6: huge M 2014 13826 43.2874139
# 7: large F 2014 11943 54.2222828
# 8: large M 2014 10083 45.7777172
#....
How to apply function over subset of columns in data.table while grouping by some other column?
We could specify the .SDcols
and assign the output back to the same columns.
foo[, names(foo)[3:5] := lapply(.SD, function(x) x/sum(x)),
by = group, .SDcols = x:z]
Note that the type should be the same for the output and the input. If the input is integer
and output is numeric
it would have problem. So, change the class
to numeric
first and then do the assignment
nm1 <- names(foo)[3:5]
#or programmatically based on checking whether column is numeric
#nm1 <- foo[, which(unlist(lapply(.SD, is.numeric)))]
foo[, (nm1) := lapply(.SD, as.numeric), .SDcols = nm1
][, (nm1) := lapply(.SD, function(x) x/sum(x)),
by = group, .SDcols = nm1][]
A tidyverse
approach to the above would be
library(dplyr)
foo %>%
group_by(group) %>%
mutate_if(is.numeric, funs(./sum(.)))
Data.table: Applying function to rows of grouped columns (could also call it collapsing columns)
If they are in the same order, use split.default
setDT(lapply(split.default(data, group$Group), rowSums))[]
-output
X Y
1: 2 4
2: 2 4
3: 2 4
4: 2 4
5: 2 4
If the column names are not in the same order, then use a matching with named vector
nm1 <- setNames(group$Group, group$Sample)[colnames(data)]
setDT(lapply(split.default(data, nm1), rowSums))[]
Or may also do the split
from the 'group' data and loop over the list
, extract the columns, and do rowSums
setDT(lapply(split(group$Sample, group$Group),
function(x) rowSums(data[, ..x])))[]
Benchmarks
set.seed(24)
data_test <- as.data.table(matrix(rnorm(5000 * 5000), ncol = 5000, dimnames = list(NULL, paste0("C", 1:5000))))
group_test <- data.table(Sample= paste0("C", 1:5000),Group = rep(LETTERS[1:10], 500) )
system.time({
nm1 <- setNames(group_test$Group, group_test$Sample)[colnames(data_test)]
setDT(lapply(split.default(data_test, nm1), rowSums))[]
})
# user system elapsed
# 0.167 0.048 0.219
system.time({
long <- melt(data_test[, rn := .I], "rn")
dcast(long[group_test, on = "variable==Sample"], rn ~ Group, sum)
})
# user system elapsed
# 2.897 0.305 3.189
Calculate the difference between consecutive, grouped columns in a data.table
I think your loop is fine, except you should use :=
instead of cbind
to add columns:
ref <- lapply(reference,function(x) names(DT)[x])
for (g in ref){
if (length(g)==1) next
gx = tail(g,-1)
gy = head(g,-1)
gn = paste0("D[",gy,":",gx,"]")
DT[,(gn) := mapply(function(x,y).SD[[x]]-.SD[[y]], gx, gy, SIMPLIFY=FALSE)]
}
How to find new individuals in repeated surveys with data.table
This is the first survey each individual appeared in each group,
dat[order(ind, surv), .SD[1], .(group, ind)][surv != 1]
#> group ind surv
#> 1: 1 c 2
#> 2: 2 d 2
#> 3: 1 d 3
Note I've excluded the individuals who first appeared in the first survey.
Merging two dataset - include unique rows
Are you looking for something like this?
The script counts number of each "loci" and divides it by the number of all loci per location.
xy <- read.table(text = "Island,Individual,all1,all2
Santiago,CVW3,01,01
Santiago,CVW8,01,02
Santiago,CVW9,03,03
Santiago,CVW10,01,01
Santiago,CVW12,03,03
Santiago,CVW19,01,01
Santiago,CVW25,01,04
Santiago,CVW39,04,04
Santiago,CVW40,01,01
Santiago,CVW41,01,01
Fogo,CVW64,04,04
Fogo,CVW67,01,01
Fogo,CVW70,02,04
Fogo,CVW74,03,05
Fogo,CVW80,01,02
Fogo,CVW99,01,04
Fogo,CVW101,02,02
Fogo,CVW103,01,02
Fogo,CVW104,01,04
Fogo,CVW111,02,04
SaoNicolau,CVW81,01,01
SaoNicolau,CVW87,01,01
SaoNicolau,CVW92,01,01
SaoNicolau,CVW124,01,01
SaoNicolau,CVW125,01,01
SaoNicolau,CVW133,01,01
SaoNicolau,CVW136,01,01", header = TRUE, sep = ",",
colClasses = c("factor", "factor", "character", "character"),
stringsAsFactors = FALSE)
by(xy, xy$Island, FUN = function(x) {
table(c(x$all1, x$all2))/length(c(x$all1, x$all2))
})
xy$Island: Fogo
01 02 03 04 05
0.30 0.30 0.05 0.30 0.05
------------------------------------------------------------------------------------------------------
xy$Island: Santiago
01 02 03 04
0.60 0.05 0.20 0.15
------------------------------------------------------------------------------------------------------
xy$Island: SaoNicolau
01
1
Related Topics
How to Get All Possible Combinations of N Number of Data Set
Extract First N Digits from a String
Plot Weighted Frequency Matrix
Select List Element Programmatically Using Name Stored as String
How to Remove Trailing Zeros in R Dataframe
R Plotly: Preserving Appearance of Two Legends When Converting Ggplot2 with Ggplotly
Conda Build R Package Fails at C Compiler Issue on Macos Mojave
Calculate Peak Values in a Plot Using R
R Markdown Add Tag to Head of HTML Output
Embed Instagram/Youtube into Shiny R App
Coloring a Geom_Histogram by Gradient
Install Previous Versions of R on Ubuntu
How to Simulate Bimodal Distribution
Change Position of Tick Marks of a Single Graph, Using Ggplot2
Change The Color of a Ggplot Geom a Posteriori (After Having Specified Another Color)
Passing a List of Arguments to a Function with Quasiquotation