Calculate Multiple Aggregations on Several Variables Using Lapply(.Sd, ...)

Calculate multiple aggregations on several variables using lapply(.SD, ...)

You're missing a [[1]] or $mpg:

mtcars.dt[, lapply(.SD, function(x) list(mean(x), median(x)))[[1]],
by="cyl", .SDcols=c("mpg")]
#or
mtcars.dt[, lapply(.SD, function(x) list(mean(x), median(x)))$mpg,
by="cyl", .SDcols=c("mpg")]
# cyl V1 V2
#1: 6 19.74286 19.7
#2: 4 26.66364 26.0
#3: 8 15.10000 15.2

For the more general case, try:

mtcars.dt[, as.list(unlist(lapply(.SD, function(x) list(mean=mean(x),
median=median(x))))),
by="cyl", .SDcols=c("mpg", "hp")]
# cyl mpg.mean mpg.median hp.mean hp.median
# 1: 6 19.74 19.7 122.29 110.0
# 2: 4 26.66 26.0 82.64 91.0
# 3: 8 15.10 15.2 209.21 192.5

(or as.list(sapply(.SD, ...)))

How to apply multiple functions to multiple columns within by?

There is an option in unlist to avoid unlisting recursively - the recursive parameter (By default, the recursive = TRUE)

DT[,unlist(lapply(.SD,my.sum.fun), 
recursive = FALSE),.SDcols=c("mpg","hp"),by=list(cyl)]
# cyl mpg.mean mpg.median mpg.sd hp.mean hp.median hp.sd
#1: 6 19.74286 19.7 1.453567 122.28571 110.0 24.26049
#2: 4 26.66364 26.0 4.509828 82.63636 91.0 20.93453
#3: 8 15.10000 15.2 2.560048 209.21429 192.5 50.97689

Apply multiple functions to multiple columns in data.table by group

First you need to change your function. data.table expects consistent types and median can return integer or double values depending on input.

my.summary <- function(x) list(mean = mean(x), median = as.numeric(median(x)))

Then you need to ensure that only the first level of the nested list is unlisted. The result of the unlist call still needs to be a list (remember, a data.table is a list of column vectors).

DT[, unlist(lapply(.SD, my.summary), recursive = FALSE), by = c, .SDcols = c("a", "b")]
# c a.mean a.median b.mean b.median
#1: 1 1.5 1.5 2.5 2.5
#2: 2 4.0 4.0 5.0 5.0

Apply multiple functions to multiple columns in data.table

I'd normally do this:

my.summary = function(x) list(mean = mean(x), median = median(x))

DT[, unlist(lapply(.SD, my.summary)), .SDcols = c('a', 'b')]
#a.mean a.median b.mean b.median
# 3 3 4 4

Apply several summary functions on several variables by group in one call

You can do it all in one step and get proper labeling:

> aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) )
# id1 id2 val1.mn val1.n val2.mn val2.n
# 1 a x 1.5 2.0 6.5 2.0
# 2 b x 2.0 2.0 8.0 2.0
# 3 a y 3.5 2.0 7.0 2.0
# 4 b y 3.0 2.0 6.0 2.0

This creates a dataframe with two id columns and two matrix columns:

str( aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) )
'data.frame': 4 obs. of 4 variables:
$ id1 : Factor w/ 2 levels "a","b": 1 2 1 2
$ id2 : Factor w/ 2 levels "x","y": 1 1 2 2
$ val1: num [1:4, 1:2] 1.5 2 3.5 3 2 2 2 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "mn" "n"
$ val2: num [1:4, 1:2] 6.5 8 7 6 2 2 2 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "mn" "n"

As pointed out by @lord.garbage below, this can be converted to a dataframe with "simple" columns by using do.call(data.frame, ...)

str( do.call(data.frame, aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) ) 
)
'data.frame': 4 obs. of 6 variables:
$ id1 : Factor w/ 2 levels "a","b": 1 2 1 2
$ id2 : Factor w/ 2 levels "x","y": 1 1 2 2
$ val1.mn: num 1.5 2 3.5 3
$ val1.n : num 2 2 2 2
$ val2.mn: num 6.5 8 7 6
$ val2.n : num 2 2 2 2

This is the syntax for multiple variables on the LHS:

aggregate(cbind(val1, val2) ~ id1 + id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) )

How to avoid same column names when multiple transformations in data.table?

You could try

library(data.table)
dt[, unlist(lapply(.SD, function(x) list(Mean=mean(x),
SD=sd(x))),recursive=FALSE), by=ID]
# ID Obs_1.Mean Obs_1.SD Obs_2.Mean Obs_2.SD Obs_3.Mean Obs_3.SD
#1: 1 0.4854187 1.1108687 -0.3238542 0.2885969 0.7410611 0.1067961
#2: 2 0.4171586 0.2875411 -0.2397030 1.8732682 0.2041125 0.3438338
#3: 3 -0.3601052 0.8105370 0.8195368 0.3829833 -0.4087233 1.4705692

Or a variation as suggested by @David Arenburg

 dt[, as.list(unlist(lapply(.SD, function(x) list(Mean=mean(x),
SD=sd(x))))), by=ID]
# ID Obs_1.Mean Obs_1.SD Obs_2.Mean Obs_2.SD Obs_3.Mean Obs_3.SD
#1: 1 0.4854187 1.1108687 -0.3238542 0.2885969 0.7410611 0.1067961
#2: 2 0.4171586 0.2875411 -0.2397030 1.8732682 0.2041125 0.3438338
#3: 3 -0.3601052 0.8105370 0.8195368 0.3829833 -0.4087233 1.4705692

Aggregate multiple columns at once

We can use the formula method of aggregate. The variables on the 'rhs' of ~ are the grouping variables while the . represents all other variables in the 'df1' (from the example, we assume that we need the mean for all the columns except the grouping), specify the dataset and the function (mean).

aggregate(.~id1+id2, df1, mean)

Or we can use summarise_each from dplyr after grouping (group_by)

library(dplyr)
df1 %>%
group_by(id1, id2) %>%
summarise_each(funs(mean))

Or using summarise with across (dplyr devel version - ‘0.8.99.9000’)

df1 %>% 
group_by(id1, id2) %>%
summarise(across(starts_with('val'), mean))

Or another option is data.table. We convert the 'data.frame' to 'data.table' (setDT(df1), grouped by 'id1' and 'id2', we loop through the subset of data.table (.SD) and get the mean.

library(data.table)
setDT(df1)[, lapply(.SD, mean), by = .(id1, id2)]

data

df1 <- structure(list(id1 = c("a", "a", "a", "a", "b", "b", 
"b", "b"
), id2 = c("x", "x", "y", "y", "x", "y", "x", "y"),
val1 = c(1L,
2L, 3L, 4L, 1L, 4L, 3L, 2L), val2 = c(9L, 4L, 5L, 9L, 7L, 4L,
9L, 8L)), .Names = c("id1", "id2", "val1", "val2"),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))

data.table: Group by, then aggregate with custom function returning several new columns

You are already returning a list from the function. You do not need to list them again. So remove the list and have the code like below

mtcars_dt[,
returnsMultipleColumns(dt_group_all_columns = .SD),
by = c("mpg", "cyl"),
.SDcols = colnames(mtcars_dt)
]
mpg cyl new_column_1 new_column_2
1: 21.0 6 returned_value_1 returned_value_2
2: 22.8 4 returned_value_1 returned_value_2
3: 21.4 6 returned_value_1 returned_value_2
4: 18.7 8 returned_value_1 returned_value_2


Related Topics



Leave a reply



Submit