Summarizing multiple columns with data.table
You can use a simple lapply statement with .SD:
dt[, lapply(.SD, sum, na.rm=TRUE), by=category ]
category index a b z c d
1: c 19 51.13289 48.49994 42.50884 9.535588 11.53253
2: b 9 17.34860 20.35022 10.32514 11.764105 10.53127
3: a 27 25.91616 31.12624 0.00000 29.197343 31.71285
If you only want to summarize over certain columns, you can add the .SDcols argument:
# note that .SDcols also allows reordering of the columns
dt[, lapply(.SD, sum, na.rm=TRUE), by=category, .SDcols=c("a", "c", "z") ]
category a c z
1: c 51.13289 9.535588 42.50884
2: b 17.34860 11.764105 10.32514
3: a 25.91616 29.197343 0.00000
This is, of course, not limited to sum: you can use any function with lapply, including anonymous functions (i.e., it's a regular lapply statement).
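As a minimal sketch of the anonymous-function case, the example below computes the per-group range (max minus min) of every non-grouping column. The toy table and its column names are made up for illustration and are not the dt used above.

```r
library(data.table)

# Toy data; the column names and values are illustrative only
dt_toy <- data.table(
  category = c("a", "a", "b", "b"),
  a = c(1, 2, NA, 4),
  b = c(10, 20, 30, 40)
)

# Anonymous function applied to each column of .SD within each group:
# range (max minus min), ignoring missing values
res <- dt_toy[, lapply(.SD, function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE)),
              by = category]
print(res)
```

Any function that returns one value per column works the same way; .SDcols can still be added to restrict which columns are passed to it.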
Lastly, there is no need to use i=T and j= <..>. Personally, I think that makes the code less readable, but it is just a style preference.
Documentation
See ?.SD, ?data.table and its .SDcols argument, and the vignette Using .SD for Data Analysis.
Also have a look at data.table FAQ 2.1.
R data.table to aggregate by multiple columns and retaining all columns
The easiest way is to copy the data.table, since you already want to return all the columns in a new data.table, and then append the columns x_agg and y_agg:
library(data.table)
dt <- data.frame(x=rnorm(40), y=rnorm(20), z= rnorm(10), year=rep(2019:2020,times=2, each=10), month=rep(1:4, 10), day=rep(1:4,10))
setDT(dt)
dt2<- copy(dt)
names <- c("x","y")
dt2[, paste0(names, "_agg"):= lapply(.SD, sum),
.SDcols=names, by = .(year, month, day)][]
x y z year month day x_agg y_agg
1: 0.52378890 0.19143318 -0.3387854 2019 1 1 -0.1709390 -2.967623395
2: -0.35158261 1.62461341 -0.9818403 2019 2 2 -3.6556367 5.940791892
3: 1.29391093 -0.73192766 -2.5227705 2019 3 3 2.1449165 -0.009080778
4: 1.15131966 -0.96903745 -0.5124389 2019 4 4 2.7530336 -1.763717065
5: -0.97305571 -1.16620834 0.8567205 2019 1 1 -0.1709390 -2.967623395
6: -1.73289458 1.74064829 -0.7019242 2019 2 2 -3.6556367 5.940791892
7: 0.14822163 0.72738728 -1.4267469 2019 3 3 2.1449165 -0.009080778
8: -0.17853639 0.08717892 2.0463365 2019 4 4 2.7530336 -1.763717065
9: 0.43857404 -0.50903654 -0.6887948 2019 1 1 -0.1709390 -2.967623395
10: 0.56904083 -0.39486575 -0.1134194 2019 2 2 -3.6556367 5.940791892
11: 0.54823107 -0.28118769 -0.3387854 2020 3 3 1.3975639 -5.470426871
12: 1.12885306 -0.80344406 -0.9818403 2020 4 4 2.5982909 -3.062138945
13: 0.98747699 0.72247033 -2.5227705 2020 1 1 2.4807741 0.134137894
14: -2.60859806 -1.37195721 -0.5124389 2020 2 2 -0.8401949 -2.285724235
15: -0.44170249 -1.47594529 0.8567205 2020 3 3 1.3975639 -5.470426871
16: 0.02994275 0.01272509 -0.7019242 2020 4 4 2.5982909 -3.062138945
17: -0.11760158 -0.65540139 -1.4267469 2020 1 1 2.4807741 0.134137894
18: 0.87222687 0.22909510 2.0463365 2020 2 2 -0.8401949 -2.285724235
19: 0.33379209 -0.97808045 -0.6887948 2020 3 3 1.3975639 -5.470426871
20: -0.70379104 -0.74035050 -0.1134194 2020 4 4 2.5982909 -3.062138945
21: 0.22151323 0.19143318 -0.3387854 2019 1 1 -0.1709390 -2.967623395
22: -0.91018028 1.62461341 -0.9818403 2019 2 2 -3.6556367 5.940791892
23: -0.05931458 -0.73192766 -2.5227705 2019 3 3 2.1449165 -0.009080778
24: 0.51606540 -0.96903745 -0.5124389 2019 4 4 2.7530336 -1.763717065
25: -0.81728153 -1.16620834 0.8567205 2019 1 1 -0.1709390 -2.967623395
26: -1.43174995 1.74064829 -0.7019242 2019 2 2 -3.6556367 5.940791892
27: 0.76209854 0.72738728 -1.4267469 2019 3 3 2.1449165 -0.009080778
28: 1.26418496 0.08717892 2.0463365 2019 4 4 2.7530336 -1.763717065
29: 0.43552206 -0.50903654 -0.6887948 2019 1 1 -0.1709390 -2.967623395
30: 0.20172988 -0.39486575 -0.1134194 2019 2 2 -3.6556367 5.940791892
31: 0.21270847 -0.28118769 -0.3387854 2020 3 3 1.3975639 -5.470426871
32: 1.21382327 -0.80344406 -0.9818403 2020 4 4 2.5982909 -3.062138945
33: 0.41322214 0.72247033 -2.5227705 2020 1 1 2.4807741 0.134137894
34: 0.09986465 -1.37195721 -0.5124389 2020 2 2 -0.8401949 -2.285724235
35: -0.09185291 -1.47594529 0.8567205 2020 3 3 1.3975639 -5.470426871
36: 0.13209497 0.01272509 -0.7019242 2020 4 4 2.5982909 -3.062138945
37: 1.19767652 -0.65540139 -1.4267469 2020 1 1 2.4807741 0.134137894
38: 0.79631162 0.22909510 2.0463365 2020 2 2 -0.8401949 -2.285724235
39: 0.83638763 -0.97808045 -0.6887948 2020 3 3 1.3975639 -5.470426871
40: 0.79736792 -0.74035050 -0.1134194 2020 4 4 2.5982909 -3.062138945
x y z year month day x_agg y_agg
aggregating multiple columns in data.table
This is actually what I was looking for, and it is mentioned in the FAQ:
dtb[,lapply(.SD,mean),by="id"]
aggregate multiple columns in a data frame at once calculating different statistics on different columns - R
We could use dplyr for flexibility:
library(dplyr)
df1 %>%
group_by(name) %>%
summarise(v1 = mean(v1, na.rm = TRUE),
v2 = sd(v2, na.rm = TRUE), v3 = max(v3, na.rm = TRUE),
v4 = sum(v4, na.rm = TRUE))
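If several columns should share the same function, group them with across:
df1 %>%
group_by(name) %>%
# formula lambdas pass na.rm explicitly; supplying extra arguments
# through across() itself is deprecated since dplyr 1.1.0
summarise(across(c(v1, v2), ~ mean(.x, na.rm = TRUE)),
v3 = sd(v3, na.rm = TRUE),
across(c(v4, v5), ~ sum(.x, na.rm = TRUE)))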
Or use collap from the collapse package:
library(collapse)
collap(df1, ~ name, custom = list(fmean = c("v1", "v2"),
fsd = "v3", fsum = c("v4", "v5")))
Aggregate multiple columns at once
We can use the formula method of aggregate. The variables on the 'rhs' of ~ are the grouping variables, while the . represents all other variables in 'df1' (from the example, we assume that we need the mean for all the columns except the grouping ones). We then specify the dataset and the function (mean).
aggregate(.~id1+id2, df1, mean)
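Or we can use summarise_each from dplyr after grouping (group_by). Note that summarise_each and funs are deprecated in current dplyr releases, so this form applies to older versions: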
library(dplyr)
df1 %>%
group_by(id1, id2) %>%
summarise_each(funs(mean))
Or using summarise with across (dplyr devel version ‘0.8.99.9000’ at the time of writing; across is available from dplyr 1.0.0 onward):
df1 %>%
group_by(id1, id2) %>%
summarise(across(starts_with('val'), mean))
Or another option is data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)); then, grouped by 'id1' and 'id2', we loop through the subset of the data.table (.SD) and get the mean of each column.
library(data.table)
setDT(df1)[, lapply(.SD, mean), by = .(id1, id2)]
data
df1 <- structure(list(
id1 = c("a", "a", "a", "a", "b", "b", "b", "b"),
id2 = c("x", "x", "y", "y", "x", "y", "x", "y"),
val1 = c(1L, 2L, 3L, 4L, 1L, 4L, 3L, 2L),
val2 = c(9L, 4L, 5L, 9L, 7L, 4L, 9L, 8L)),
.Names = c("id1", "id2", "val1", "val2"),
class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))
Aggregate Multiple Columns with Different Formula in large data.tables using Data.table r
group1 = c("cty", "hwy")
group2 = c("manufacturer", "model")
dat1[, c(
lapply(.SD[, ..group1], mean),
lapply(.SD[, ..group2], first)
), by=.(cyl, trans)]
gives
cyl trans cty hwy manufacturer model
1: 4 auto(l5) 20.33333 31.00000 audi a4
2: 4 manual(m5) 21.54545 29.27273 audi a4
3: 4 manual(m6) 21.00000 29.57143 audi a4
4: 4 auto(av) 22.00000 30.50000 audi a4
5: 6 auto(l5) 15.18750 21.43750 audi a4
6: 6 manual(m5) 16.66667 22.66667 audi a4
7: 6 auto(av) 18.66667 26.00000 audi a4
8: 4 auto(s6) 20.50000 28.25000 audi a4 quattro
9: 6 auto(s6) 17.40000 26.00000 audi a4 quattro
10: 6 manual(m6) 16.00000 22.60000 audi a4 quattro
11: 8 auto(s6) 13.60000 20.40000 audi a6 quattro
12: 8 auto(l4) 12.20000 16.73333 chevrolet c1500 suburban 2wd
13: 8 manual(m6) 13.42857 20.00000 chevrolet corvette
14: 4 auto(l4) 20.50000 27.62500 chevrolet malibu
15: 6 auto(l4) 16.03448 22.68966 chevrolet malibu
16: 4 auto(l3) 21.00000 27.00000 dodge caravan 2wd
17: 6 auto(l6) 16.00000 23.00000 dodge caravan 2wd
18: 8 auto(l5) 12.29412 16.41176 dodge dakota pickup 4wd
19: 8 manual(m5) 13.00000 18.80000 dodge dakota pickup 4wd
20: 8 auto(l6) 12.50000 18.50000 ford expedition 2wd
21: 8 auto(s5) 12.00000 18.00000 nissan pathfinder 4wd
22: 8 auto(s4) 16.00000 25.00000 pontiac grand prix
23: 4 auto(s4) 20.00000 26.00000 subaru impreza awd
24: 4 auto(s5) 22.00000 31.00000 toyota camry solara
25: 6 auto(s5) 18.00000 27.00000 toyota camry solara
26: 5 auto(s6) 20.50000 29.00000 volkswagen jetta
27: 5 manual(m5) 20.50000 28.50000 volkswagen jetta
cyl trans cty hwy manufacturer model
Flexible mixing of multiple aggregations in data.table for different column combinations
The first thing to notice is that data.table's j argument expects a list output, which can be built with c, as mentioned in @akrun's answer. Here are two ways to do it:
set.seed(1)
DT <- data.table(C1=c("a","b","b"), C2=round(rnorm(4),4), C3=1:12, C4=9:12)
sum_cols <- c("C2","C3")
mean_cols <- c("C3","C4")
# with the development version, 1.10.1+
DT[, c(
.N,
sum = lapply(.SD[, ..sum_cols], sum),
mean = lapply(.SD[, ..mean_cols], mean)
), by=C1]
# in earlier versions
DT[, c(
.N,
sum = lapply(.SD[, sum_cols, with=FALSE], sum),
mean = lapply(.SD[, mean_cols, with=FALSE], mean)
), by=C1]
mget returns a list and c connects elements together to make a list.
Comments
If you turn on the verbose data.table option for these calls, you'll see a message:
The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
Also, you'll see that the optimized group mean and sum are not being used (see ?GForce for details). We can get around this by following FAQ 1.6 perhaps, but I couldn't figure out how.
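One possible workaround, offered as a sketch rather than as the FAQ 1.6 approach: run each aggregation as its own plain lapply(.SD, f) call with .SDcols, a form that should be eligible for the optimized grouping functions, and then join the results on the grouping column. The sum. and mean. prefixes are my own naming choice, and adding verbose=TRUE to each call is one way to check whether the optimization is actually applied on your version.

```r
library(data.table)

set.seed(1)
DT <- data.table(C1 = c("a", "b", "b"), C2 = round(rnorm(4), 4), C3 = 1:12, C4 = 9:12)
sum_cols  <- c("C2", "C3")
mean_cols <- c("C3", "C4")

# One simple aggregation per call, each restricted with .SDcols
sums  <- DT[, lapply(.SD, sum),  by = C1, .SDcols = sum_cols]
means <- DT[, lapply(.SD, mean), by = C1, .SDcols = mean_cols]

# Prefix the result columns so that C3, which appears in both sets, does not collide
setnames(sums,  sum_cols,  paste0("sum.",  sum_cols))
setnames(means, mean_cols, paste0("mean.", mean_cols))

# Join the two per-group results on the grouping column
res <- merge(sums, means, by = "C1")
print(res)
```

The trade-off is one grouping pass per function instead of a single pass, so whether this is faster than the combined c(...) call depends on the data.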
data.table summary by group for multiple columns
dt <- dt[, .(y = mean(y), z = mean(z)), by=.(a)]
Aggregate over combinations of columns with data.table
One option is rollup from data.table:
library(data.table)
setDT(data)
rollup(data, j = sum(Expenditure), by = c("Country","Gender","Age", "Civil_Status"))
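To make the shape of the result concrete, here is a small sketch on made-up data with only two grouping columns; the values are hypothetical and only the column names mirror the call above. rollup returns the finest-level groups first, followed by progressively coarser subtotals in which the dropped grouping columns are NA.

```r
library(data.table)

# Hypothetical data; only the column names follow the call above
data_toy <- data.table(
  Country = c("A", "A", "B", "B"),
  Gender  = c("F", "M", "F", "M"),
  Expenditure = c(10, 20, 30, 40)
)

# Totals by (Country, Gender), then by Country alone, then the grand total
res <- rollup(data_toy, j = sum(Expenditure), by = c("Country", "Gender"))
print(res)
```

Note that rollup only drops grouping columns from right to left; if every combination of the grouping columns is needed, cube from the same package is the analogous function.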