Calculate multiple aggregations on several variables using lapply(.SD, ...)
You're missing a [[1]] or $mpg:
mtcars.dt[, lapply(.SD, function(x) list(mean(x), median(x)))[[1]],
by="cyl", .SDcols=c("mpg")]
#or
mtcars.dt[, lapply(.SD, function(x) list(mean(x), median(x)))$mpg,
by="cyl", .SDcols=c("mpg")]
# cyl V1 V2
#1: 6 19.74286 19.7
#2: 4 26.66364 26.0
#3: 8 15.10000 15.2
For the more general case, try:
mtcars.dt[, as.list(unlist(lapply(.SD, function(x) list(mean=mean(x),
median=median(x))))),
by="cyl", .SDcols=c("mpg", "hp")]
# cyl mpg.mean mpg.median hp.mean hp.median
# 1: 6 19.74 19.7 122.29 110.0
# 2: 4 26.66 26.0 82.64 91.0
# 3: 8 15.10 15.2 209.21 192.5
(or as.list(sapply(.SD, ...)))
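For a self-contained run of the general case, the sketch below assumes that mtcars.dt is simply the built-in mtcars data set converted with as.data.table (the original setup is not shown here).

```r
library(data.table)

# Assumption: mtcars.dt is the built-in mtcars data converted to a data.table
mtcars.dt <- as.data.table(mtcars)

# Each column yields a named list; unlist() flattens the nested list into a
# named vector ("mpg.mean", "mpg.median", ...), and as.list() turns that
# vector back into one list element per output column.
res <- mtcars.dt[, as.list(unlist(lapply(.SD, function(x) list(mean = mean(x),
                                                               median = median(x))))),
                 by = "cyl", .SDcols = c("mpg", "hp")]
print(res)
```

The column names come from pasting the outer names (the columns in .SDcols) to the inner names (the names given inside list()), so naming the list elements is what makes the output readable.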
How to apply multiple functions to multiple columns within by?
unlist has a recursive parameter (TRUE by default) that controls whether nested lists are flattened all the way down. Setting recursive = FALSE removes only the first level of nesting:
DT[,unlist(lapply(.SD,my.sum.fun),
recursive = FALSE),.SDcols=c("mpg","hp"),by=list(cyl)]
# cyl mpg.mean mpg.median mpg.sd hp.mean hp.median hp.sd
#1: 6 19.74286 19.7 1.453567 122.28571 110.0 24.26049
#2: 4 26.66364 26.0 4.509828 82.63636 91.0 20.93453
#3: 8 15.10000 15.2 2.560048 209.21429 192.5 50.97689
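The effect of the recursive argument is easier to see on a plain nested list. The sketch below assumes DT is mtcars converted to a data.table and reconstructs my.sum.fun from the column names in the output above (mean, median, sd); the original definition is not shown, so treat it as a guess.

```r
library(data.table)

# Assumptions: DT is mtcars as a data.table; my.sum.fun is reconstructed
# from the output column names (mean, median, sd)
DT <- as.data.table(mtcars)
my.sum.fun <- function(x) list(mean = mean(x), median = median(x), sd = sd(x))

# A list of lists: one inner list of summaries per column
nested <- lapply(mtcars[c("mpg", "hp")], my.sum.fun)

flat_vector <- unlist(nested)                     # recursive = TRUE: atomic vector
flat_list   <- unlist(nested, recursive = FALSE)  # one level removed: still a list

is.list(flat_vector)  # FALSE
is.list(flat_list)    # TRUE

# Because j now returns a list, data.table makes one column per element
res <- DT[, unlist(lapply(.SD, my.sum.fun), recursive = FALSE),
          .SDcols = c("mpg", "hp"), by = list(cyl)]
print(res)
```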
Apply multiple functions to multiple columns in data.table by group
First you need to change your function. data.table expects each result column to have a consistent type across groups, and median can return either an integer or a double depending on its input.
my.summary <- function(x) list(mean = mean(x), median = as.numeric(median(x)))
Then you need to ensure that only the first level of the nested list is unlisted. The result of the unlist call still needs to be a list (remember, a data.table is a list of column vectors).
DT[, unlist(lapply(.SD, my.summary), recursive = FALSE), by = c, .SDcols = c("a", "b")]
# c a.mean a.median b.mean b.median
#1: 1 1.5 1.5 2.5 2.5
#2: 2 4.0 4.0 5.0 5.0
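To see why the as.numeric() wrapper matters, here is a minimal sketch. The DT below is an assumption, chosen so that it reproduces the output above; the original data is not shown.

```r
library(data.table)

# Assumption: DT is reconstructed to match the output above
DT <- data.table(a = 1:5, b = 2:6, c = c(1, 1, 2, 2, 2))

# median() keeps the integer type for odd-length integer input, but averages
# the two middle values (giving a double) for even-length input
class(median(1:3))  # "integer"
class(median(1:4))  # "numeric"

# as.numeric() forces the same type in every group
my.summary <- function(x) list(mean = mean(x), median = as.numeric(median(x)))

res <- DT[, unlist(lapply(.SD, my.summary), recursive = FALSE),
          by = "c", .SDcols = c("a", "b")]
print(res)
```

Here group 1 has two rows and group 2 has three, so without the coercion the median columns would be a double in one group and an integer in the other.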
Apply multiple functions to multiple columns in data.table
I'd normally do this:
my.summary = function(x) list(mean = mean(x), median = median(x))
DT[, unlist(lapply(.SD, my.summary)), .SDcols = c('a', 'b')]
#a.mean a.median b.mean b.median
# 3 3 4 4
Apply several summary functions on several variables by group in one call
You can do it all in one step and get proper labeling:
aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) )
# id1 id2 val1.mn val1.n val2.mn val2.n
# 1 a x 1.5 2.0 6.5 2.0
# 2 b x 2.0 2.0 8.0 2.0
# 3 a y 3.5 2.0 7.0 2.0
# 4 b y 3.0 2.0 6.0 2.0
This creates a dataframe with two id columns and two matrix columns:
str( aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) )
'data.frame': 4 obs. of 4 variables:
$ id1 : Factor w/ 2 levels "a","b": 1 2 1 2
$ id2 : Factor w/ 2 levels "x","y": 1 1 2 2
$ val1: num [1:4, 1:2] 1.5 2 3.5 3 2 2 2 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "mn" "n"
$ val2: num [1:4, 1:2] 6.5 8 7 6 2 2 2 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "mn" "n"
As pointed out by @lord.garbage below, this can be converted to a dataframe with "simple" columns by using do.call(data.frame, ...)
str( do.call(data.frame, aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) ) )
'data.frame': 4 obs. of 6 variables:
$ id1 : Factor w/ 2 levels "a","b": 1 2 1 2
$ id2 : Factor w/ 2 levels "x","y": 1 1 2 2
$ val1.mn: num 1.5 2 3.5 3
$ val1.n : num 2 2 2 2
$ val2.mn: num 6.5 8 7 6
$ val2.n : num 2 2 2 2
This is the syntax for multiple variables on the LHS:
aggregate(cbind(val1, val2) ~ id1 + id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) )
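The steps above can be sketched end to end in base R. The data frame x below is an assumption: its values were chosen to be consistent with the output shown above, since the original x is not given in this answer.

```r
# Assumption: x is reconstructed so that it reproduces the output above
x <- data.frame(id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
                id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
                val1 = c(1L, 2L, 3L, 4L, 1L, 4L, 3L, 2L),
                val2 = c(9L, 4L, 5L, 9L, 7L, 4L, 9L, 8L))

# FUN returns a named vector of length 2, so each value column of the
# result is a two-column matrix
agg <- aggregate(cbind(val1, val2) ~ id1 + id2, data = x,
                 FUN = function(v) c(mn = mean(v), n = length(v)))

# do.call(data.frame, ...) expands every matrix column into ordinary columns
flat <- do.call(data.frame, agg)

dim(agg)     # 4 rows, 4 columns (two of them matrices)
dim(flat)    # 4 rows, 6 columns
names(flat)
```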
How to avoid same column names when multiple transformations in data.table?
You could try
library(data.table)
dt[, unlist(lapply(.SD, function(x) list(Mean=mean(x),
SD=sd(x))),recursive=FALSE), by=ID]
# ID Obs_1.Mean Obs_1.SD Obs_2.Mean Obs_2.SD Obs_3.Mean Obs_3.SD
#1: 1 0.4854187 1.1108687 -0.3238542 0.2885969 0.7410611 0.1067961
#2: 2 0.4171586 0.2875411 -0.2397030 1.8732682 0.2041125 0.3438338
#3: 3 -0.3601052 0.8105370 0.8195368 0.3829833 -0.4087233 1.4705692
Or a variation as suggested by @David Arenburg
dt[, as.list(unlist(lapply(.SD, function(x) list(Mean=mean(x),
SD=sd(x))))), by=ID]
# ID Obs_1.Mean Obs_1.SD Obs_2.Mean Obs_2.SD Obs_3.Mean Obs_3.SD
#1: 1 0.4854187 1.1108687 -0.3238542 0.2885969 0.7410611 0.1067961
#2: 2 0.4171586 0.2875411 -0.2397030 1.8732682 0.2041125 0.3438338
#3: 3 -0.3601052 0.8105370 0.8195368 0.3829833 -0.4087233 1.4705692
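The two variants give the same result here because every summary is a double. They can differ when the summaries have mixed types: a fully recursive unlist() coerces everything to one common type, while recursive = FALSE keeps each element as it is. The sketch below uses hypothetical data and a hypothetical summary function f to illustrate this.

```r
library(data.table)

# Hypothetical data; the original dt is not shown
dt <- data.table(ID    = rep(1:3, each = 2),
                 Obs_1 = c(1, 2, 3, 4, 5, 6),
                 Obs_2 = c(6, 5, 4, 3, 2, 1))

# Hypothetical summary mixing an integer (N) with a double (Mean)
f <- function(x) list(N = length(x), Mean = mean(x))

keep_types <- dt[, unlist(lapply(.SD, f), recursive = FALSE), by = ID]
coerced    <- dt[, as.list(unlist(lapply(.SD, f))), by = ID]

class(keep_types$Obs_1.N)  # "integer"
class(coerced$Obs_1.N)     # "numeric"
```

With character or factor summaries in the mix, the coercion would be more drastic, so recursive = FALSE is probably the safer default when column types matter.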
Aggregate multiple columns at once
We can use the formula method of aggregate. The variables on the right-hand side of ~ are the grouping variables, while the . represents all other variables in 'df1' (from the example, we assume that the mean is needed for every column except the grouping ones). We then specify the dataset and the function (mean).
aggregate(.~id1+id2, df1, mean)
Or we can use summarise_each from dplyr after grouping (group_by). Note that summarise_each and funs are deprecated in later dplyr releases.
library(dplyr)
df1 %>%
group_by(id1, id2) %>%
summarise_each(funs(mean))
Or using summarise with across (added in the dplyr devel version '0.8.99.9000', later released as dplyr 1.0.0):
df1 %>%
group_by(id1, id2) %>%
summarise(across(starts_with('val'), mean))
Another option is data.table. We convert the 'data.frame' to a 'data.table' with setDT(df1), group by 'id1' and 'id2', loop through the Subset of Data.table (.SD) with lapply, and take the mean of each column.
library(data.table)
setDT(df1)[, lapply(.SD, mean), by = .(id1, id2)]
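As a quick cross-check, the base R and data.table approaches can be compared on the same data. This is a sketch, not part of the original answer: the data is defined inline with data.frame() so the block runs on its own, and as.data.table() is used instead of setDT() so that df1 is not modified by reference.

```r
library(data.table)

df1 <- data.frame(id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
                  id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
                  val1 = c(1L, 2L, 3L, 4L, 1L, 4L, 3L, 2L),
                  val2 = c(9L, 4L, 5L, 9L, 7L, 4L, 9L, 8L))

res_base <- aggregate(. ~ id1 + id2, df1, mean)

# as.data.table() makes a copy, so df1 itself stays a plain data.frame
res_dt <- as.data.table(df1)[, lapply(.SD, mean), by = .(id1, id2)]

# aggregate() returns groups sorted by the grouping variables, whereas
# data.table keeps them in order of first appearance; sort before comparing
setorder(res_dt, id2, id1)

all.equal(res_base$val1, res_dt$val1)
all.equal(res_base$val2, res_dt$val2)
```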
data
df1 <- structure(list(id1 = c("a", "a", "a", "a", "b", "b", "b", "b"),
id2 = c("x", "x", "y", "y", "x", "y", "x", "y"),
val1 = c(1L, 2L, 3L, 4L, 1L, 4L, 3L, 2L),
val2 = c(9L, 4L, 5L, 9L, 7L, 4L, 9L, 8L)),
.Names = c("id1", "id2", "val1", "val2"),
class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))
data.table: Group by, then aggregate with custom function returning several new columns
Your function already returns a list, so you do not need to wrap its result in list() again. Remove the list() call and write the code as below:
mtcars_dt[,
returnsMultipleColumns(dt_group_all_columns = .SD),
by = c("mpg", "cyl"),
.SDcols = colnames(mtcars_dt)
]
mpg cyl new_column_1 new_column_2
1: 21.0 6 returned_value_1 returned_value_2
2: 22.8 4 returned_value_1 returned_value_2
3: 21.4 6 returned_value_1 returned_value_2
4: 18.7 8 returned_value_1 returned_value_2
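A runnable version of the same idea is sketched below. The body of returnsMultipleColumns is hypothetical (the original function is not shown), and the grouping is simplified to cyl so that each group has more than one row.

```r
library(data.table)

mtcars_dt <- as.data.table(mtcars)

# Hypothetical stand-in for the original function: it takes the per-group
# subset and returns a named list, one element per new column
returnsMultipleColumns <- function(dt_group_all_columns) {
  list(new_column_1 = mean(dt_group_all_columns$hp),
       new_column_2 = nrow(dt_group_all_columns))
}

# The returned list is used directly as j, so each element becomes a column
res <- mtcars_dt[, returnsMultipleColumns(dt_group_all_columns = .SD),
                 by = "cyl", .SDcols = c("hp", "wt")]
print(res)
```

Because j is itself a list, data.table spreads its elements into columns; wrapping the call in another list() would typically nest the result in a single list column instead.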