Aggregating Multiple Columns in Data.Table

Summarizing multiple columns with data.table

You can use a simple lapply statement with .SD:

dt[, lapply(.SD, sum, na.rm=TRUE), by=category ]

   category index        a        b        z         c        d
1:        c    19 51.13289 48.49994 42.50884  9.535588 11.53253
2:        b     9 17.34860 20.35022 10.32514 11.764105 10.53127
3:        a    27 25.91616 31.12624  0.00000 29.197343 31.71285

If you only want to summarize certain columns, add the .SDcols argument:

#  note that .SDcols also allows reordering of the columns
dt[, lapply(.SD, sum, na.rm=TRUE), by=category, .SDcols=c("a", "c", "z") ]

   category        a         c        z
1:        c 51.13289  9.535588 42.50884
2:        b 17.34860 11.764105 10.32514
3:        a 25.91616 29.197343  0.00000
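
As a rough sketch of other ways to fill in .SDcols: reasonably recent data.table versions also accept a predicate function or patterns() there, so the columns do not have to be typed out by name. The toy table and the names by_type and by_name are made up for this illustration.

```r
library(data.table)

# Hypothetical toy data: three numeric columns plus a character column
dt <- data.table(category = c("a", "a", "b"),
                 a = c(1, 2, 3), c = c(4, NA, 6), z = c(7, 8, 9),
                 label = c("u", "v", "w"))

# Select every numeric column with a predicate function
by_type <- dt[, lapply(.SD, sum, na.rm = TRUE), by = category,
              .SDcols = is.numeric]

# Select columns whose names match a regular expression (here exactly "a" or "c")
by_name <- dt[, lapply(.SD, sum, na.rm = TRUE), by = category,
              .SDcols = patterns("^[ac]$")]
```

If your installed version rejects these forms, the character vector shown above remains the safe choice.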

This is, of course, not limited to sum: you can use any function with lapply, including anonymous functions (i.e., it's a regular lapply statement).
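
For instance, a minimal sketch with an anonymous function that computes the per-group range (max minus min) of each selected column. The data and the name res are invented for the example.

```r
library(data.table)

# Hypothetical toy data with one missing value
dt <- data.table(category = c("a", "a", "b"),
                 a = c(1, NA, 3),
                 c = c(2, 4, 6))

# Any function works inside lapply, including one defined inline
res <- dt[, lapply(.SD, function(x) max(x, na.rm = TRUE) - min(x, na.rm = TRUE)),
          by = category, .SDcols = c("a", "c")]
```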

Lastly, there is no need to spell out i=T and j=<..>. Personally, I think that makes the code less readable, but it is just a style preference.



Documentation

See ?.SD, ?data.table and its .SDcols argument, and the vignette Using .SD for Data Analysis.

Also have a look at data.table FAQ 2.1.

R data.table to aggregate by multiple columns and retaining all columns

The easiest way is to copy the data.table, since you want all the columns returned in a new data.table anyway, and then append the columns x_agg and y_agg:

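library(data.table)

# y and z are shorter than x; data.frame() recycles them to 40 rows
dt <- data.frame(x = rnorm(40), y = rnorm(20), z = rnorm(10),
                 year = rep(2019:2020, times = 2, each = 10),
                 month = rep(1:4, 10), day = rep(1:4, 10))
setDT(dt)

dt2 <- copy(dt)        # work on a copy so that dt itself stays unchanged
cols <- c("x", "y")    # columns to aggregate

# add the group sums by reference as x_agg and y_agg; the trailing [] prints the result
dt2[, paste0(cols, "_agg") := lapply(.SD, sum),
    .SDcols = cols, by = .(year, month, day)][]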
              x           y          z year month day      x_agg        y_agg
 1:  0.52378890  0.19143318 -0.3387854 2019     1   1 -0.1709390 -2.967623395
 2: -0.35158261  1.62461341 -0.9818403 2019     2   2 -3.6556367  5.940791892
 3:  1.29391093 -0.73192766 -2.5227705 2019     3   3  2.1449165 -0.009080778
 4:  1.15131966 -0.96903745 -0.5124389 2019     4   4  2.7530336 -1.763717065
 5: -0.97305571 -1.16620834  0.8567205 2019     1   1 -0.1709390 -2.967623395
 6: -1.73289458  1.74064829 -0.7019242 2019     2   2 -3.6556367  5.940791892
 7:  0.14822163  0.72738728 -1.4267469 2019     3   3  2.1449165 -0.009080778
 8: -0.17853639  0.08717892  2.0463365 2019     4   4  2.7530336 -1.763717065
 9:  0.43857404 -0.50903654 -0.6887948 2019     1   1 -0.1709390 -2.967623395
10:  0.56904083 -0.39486575 -0.1134194 2019     2   2 -3.6556367  5.940791892
11:  0.54823107 -0.28118769 -0.3387854 2020     3   3  1.3975639 -5.470426871
12:  1.12885306 -0.80344406 -0.9818403 2020     4   4  2.5982909 -3.062138945
13:  0.98747699  0.72247033 -2.5227705 2020     1   1  2.4807741  0.134137894
14: -2.60859806 -1.37195721 -0.5124389 2020     2   2 -0.8401949 -2.285724235
15: -0.44170249 -1.47594529  0.8567205 2020     3   3  1.3975639 -5.470426871
16:  0.02994275  0.01272509 -0.7019242 2020     4   4  2.5982909 -3.062138945
17: -0.11760158 -0.65540139 -1.4267469 2020     1   1  2.4807741  0.134137894
18:  0.87222687  0.22909510  2.0463365 2020     2   2 -0.8401949 -2.285724235
19:  0.33379209 -0.97808045 -0.6887948 2020     3   3  1.3975639 -5.470426871
20: -0.70379104 -0.74035050 -0.1134194 2020     4   4  2.5982909 -3.062138945
21:  0.22151323  0.19143318 -0.3387854 2019     1   1 -0.1709390 -2.967623395
22: -0.91018028  1.62461341 -0.9818403 2019     2   2 -3.6556367  5.940791892
23: -0.05931458 -0.73192766 -2.5227705 2019     3   3  2.1449165 -0.009080778
24:  0.51606540 -0.96903745 -0.5124389 2019     4   4  2.7530336 -1.763717065
25: -0.81728153 -1.16620834  0.8567205 2019     1   1 -0.1709390 -2.967623395
26: -1.43174995  1.74064829 -0.7019242 2019     2   2 -3.6556367  5.940791892
27:  0.76209854  0.72738728 -1.4267469 2019     3   3  2.1449165 -0.009080778
28:  1.26418496  0.08717892  2.0463365 2019     4   4  2.7530336 -1.763717065
29:  0.43552206 -0.50903654 -0.6887948 2019     1   1 -0.1709390 -2.967623395
30:  0.20172988 -0.39486575 -0.1134194 2019     2   2 -3.6556367  5.940791892
31:  0.21270847 -0.28118769 -0.3387854 2020     3   3  1.3975639 -5.470426871
32:  1.21382327 -0.80344406 -0.9818403 2020     4   4  2.5982909 -3.062138945
33:  0.41322214  0.72247033 -2.5227705 2020     1   1  2.4807741  0.134137894
34:  0.09986465 -1.37195721 -0.5124389 2020     2   2 -0.8401949 -2.285724235
35: -0.09185291 -1.47594529  0.8567205 2020     3   3  1.3975639 -5.470426871
36:  0.13209497  0.01272509 -0.7019242 2020     4   4  2.5982909 -3.062138945
37:  1.19767652 -0.65540139 -1.4267469 2020     1   1  2.4807741  0.134137894
38:  0.79631162  0.22909510  2.0463365 2020     2   2 -0.8401949 -2.285724235
39:  0.83638763 -0.97808045 -0.6887948 2020     3   3  1.3975639 -5.470426871
40:  0.79736792 -0.74035050 -0.1134194 2020     4   4  2.5982909 -3.062138945
              x           y          z year month day      x_agg        y_agg
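
To make the behaviour easier to check by eye, here is a small deterministic sketch of the same idea: := with by adds the group sum to every row of the copy and leaves the original table alone. The four-row table and the grouping column g are assumptions made for the illustration.

```r
library(data.table)

# Hypothetical toy data with a single grouping column g
dt <- data.table(x = c(1, 2, 3, 4),
                 y = c(10, 20, 30, 40),
                 z = c(0.1, 0.2, 0.3, 0.4),
                 g = c("p", "p", "q", "q"))

cols <- c("x", "y")
dt2 <- copy(dt)   # := modifies by reference, so aggregate on a copy

# every row keeps its original columns and gains the sum of its group
dt2[, paste0(cols, "_agg") := lapply(.SD, sum), by = g, .SDcols = cols]
```

Without copy(), dt2 would just be another name for dt and the new columns would appear in both.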

aggregating multiple columns in data.table

This is actually what I was looking for, and it is mentioned in the FAQ:

dtb[,lapply(.SD,mean),by="id"]

aggregate multiple columns in a data frame at once calculating different statistics on different columns - R

We could use dplyr for flexibility:

library(dplyr)
df1 %>%
  group_by(name) %>%
  summarise(v1 = mean(v1, na.rm = TRUE),
            v2 = sd(v2, na.rm = TRUE),
            v3 = max(v3, na.rm = TRUE),
            v4 = sum(v4, na.rm = TRUE))

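If different functions apply to different blocks of columns, use across:

# passing na.rm through across(...) is deprecated since dplyr 1.1.0,
# so the extra argument goes inside a lambda instead
df1 %>%
  group_by(name) %>%
  summarise(across(c(v1, v2), ~ mean(.x, na.rm = TRUE)),
            v3 = sd(v3, na.rm = TRUE),
            across(c(v4, v5), ~ sum(.x, na.rm = TRUE)))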

Or use collap from collapse:

library(collapse)
collap(df1, ~ name,
       custom = list(fmean = c("v1", "v2"),
                     fsd = "v3",
                     fsum = c("v4", "v5")))

Aggregate multiple columns at once

We can use the formula method of aggregate. The variables on the right-hand side of ~ are the grouping variables, while the . on the left-hand side stands for all other variables in 'df1' (from the example, we assume that the mean is needed for every column except the grouping ones). We then specify the dataset and the function (mean).

aggregate(.~id1+id2, df1, mean)

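Or we can use summarise_each from dplyr after grouping (group_by). Note that summarise_each and funs are deprecated in current dplyr, so this form is mainly of interest for older code.

library(dplyr)
df1 %>%
  group_by(id1, id2) %>%
  summarise_each(funs(mean))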

Or use summarise with across (released in dplyr 1.0.0; the devel version was ‘0.8.99.9000’ when this was written):

df1 %>%
  group_by(id1, id2) %>%
  summarise(across(starts_with('val'), mean))

Another option is data.table. We convert the 'data.frame' to a 'data.table' with setDT(df1), group by 'id1' and 'id2', loop through the Subset of Data.table (.SD) with lapply, and get the mean of each column.

library(data.table)
setDT(df1)[, lapply(.SD, mean), by = .(id1, id2)]

data

df1 <- structure(list(
  id1 = c("a", "a", "a", "a", "b", "b", "b", "b"),
  id2 = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1L, 2L, 3L, 4L, 1L, 4L, 3L, 2L),
  val2 = c(9L, 4L, 5L, 9L, 7L, 4L, 9L, 8L)),
  .Names = c("id1", "id2", "val1", "val2"),
  class = "data.frame",
  row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))
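
As a quick sanity check, the base R and data.table approaches can be run side by side on this data. This is only a sketch: it rebuilds df1 with data.frame() for brevity and uses as.data.table() instead of setDT() so that df1 stays a plain data.frame; the names base_res, dt_res and both are made up here.

```r
library(data.table)

# Same values as the example data, built with data.frame() for brevity
df1 <- data.frame(id1 = c("a", "a", "a", "a", "b", "b", "b", "b"),
                  id2 = c("x", "x", "y", "y", "x", "y", "x", "y"),
                  val1 = c(1L, 2L, 3L, 4L, 1L, 4L, 3L, 2L),
                  val2 = c(9L, 4L, 5L, 9L, 7L, 4L, 9L, 8L))

# base R: formula method of aggregate
base_res <- aggregate(. ~ id1 + id2, df1, mean)

# data.table: lapply over .SD by group (on a converted copy)
dt_res <- as.data.table(df1)[, lapply(.SD, mean), by = .(id1, id2)]

# line the two results up by group so they can be compared column by column
both <- merge(as.data.table(base_res), dt_res,
              by = c("id1", "id2"), suffixes = c(".base", ".dt"))
```

The two methods return the groups in a different row order, which is why the comparison goes through a merge on the grouping columns.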

Aggregate Multiple Columns with Different Formula in large data.tables using Data.table r

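group1 = c("cty", "hwy")               # columns to average
group2 = c("manufacturer", "model")    # columns to take the first value of

# c() joins the two lists into the single list that j expects
dat1[, c(
  lapply(.SD[, ..group1], mean),
  lapply(.SD[, ..group2], first)
), by=.(cyl, trans)]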

gives

    cyl      trans      cty      hwy manufacturer              model
 1:   4   auto(l5) 20.33333 31.00000         audi                 a4
 2:   4 manual(m5) 21.54545 29.27273         audi                 a4
 3:   4 manual(m6) 21.00000 29.57143         audi                 a4
 4:   4   auto(av) 22.00000 30.50000         audi                 a4
 5:   6   auto(l5) 15.18750 21.43750         audi                 a4
 6:   6 manual(m5) 16.66667 22.66667         audi                 a4
 7:   6   auto(av) 18.66667 26.00000         audi                 a4
 8:   4   auto(s6) 20.50000 28.25000         audi         a4 quattro
 9:   6   auto(s6) 17.40000 26.00000         audi         a4 quattro
10:   6 manual(m6) 16.00000 22.60000         audi         a4 quattro
11:   8   auto(s6) 13.60000 20.40000         audi         a6 quattro
12:   8   auto(l4) 12.20000 16.73333    chevrolet c1500 suburban 2wd
13:   8 manual(m6) 13.42857 20.00000    chevrolet           corvette
14:   4   auto(l4) 20.50000 27.62500    chevrolet             malibu
15:   6   auto(l4) 16.03448 22.68966    chevrolet             malibu
16:   4   auto(l3) 21.00000 27.00000        dodge        caravan 2wd
17:   6   auto(l6) 16.00000 23.00000        dodge        caravan 2wd
18:   8   auto(l5) 12.29412 16.41176        dodge  dakota pickup 4wd
19:   8 manual(m5) 13.00000 18.80000        dodge  dakota pickup 4wd
20:   8   auto(l6) 12.50000 18.50000         ford     expedition 2wd
21:   8   auto(s5) 12.00000 18.00000       nissan     pathfinder 4wd
22:   8   auto(s4) 16.00000 25.00000      pontiac         grand prix
23:   4   auto(s4) 20.00000 26.00000       subaru        impreza awd
24:   4   auto(s5) 22.00000 31.00000       toyota       camry solara
25:   6   auto(s5) 18.00000 27.00000       toyota       camry solara
26:   5   auto(s6) 20.50000 29.00000   volkswagen              jetta
27:   5 manual(m5) 20.50000 28.50000   volkswagen              jetta
    cyl      trans      cty      hwy manufacturer              model

Flexible mixing of multiple aggregations in data.table for different column combinations

The first thing to notice is that data.table's j argument expects a list output, which can be built with c, as mentioned in @akrun's answer. Here are two ways to do it:

library(data.table)

set.seed(1)
DT <- data.table(C1=c("a","b","b"), C2=round(rnorm(4),4), C3=1:12, C4=9:12)
sum_cols <- c("C2","C3")
mean_cols <- c("C3","C4")

# with data.table 1.10.1+ (the development version at the time of writing):
# the .. prefix looks up sum_cols and mean_cols in the calling scope
DT[, c(
  .N,
  sum = lapply(.SD[, ..sum_cols], sum),
  mean = lapply(.SD[, ..mean_cols], mean)
), by=C1]

# in earlier versions
DT[, c(
  .N,
  sum = lapply(.SD[, sum_cols, with=FALSE], sum),
  mean = lapply(.SD[, mean_cols, with=FALSE], mean)
), by=C1]

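In both versions, lapply returns a list and c concatenates the elements into a single list for j. mget(sum_cols) also returns a list of columns, so it can be used in place of the .SD subset.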
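
A minimal sketch of the mget variant, assuming the same toy DT as above. Naming the count N and storing the result in res are choices made for this illustration only.

```r
library(data.table)

set.seed(1)
DT <- data.table(C1 = c("a", "b", "b"), C2 = round(rnorm(4), 4), C3 = 1:12, C4 = 9:12)
sum_cols <- c("C2", "C3")
mean_cols <- c("C3", "C4")

# mget() fetches the named columns of the current group as a list;
# c() then joins the count and the two lists into one list for j
res <- DT[, c(
  N = .N,
  sum = lapply(mget(sum_cols), sum),
  mean = lapply(mget(mean_cols), mean)
), by = C1]
```

The result columns pick up prefixed names such as sum.C2 and mean.C4, because c() combines the outer name with each inner list name.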


Comments

If you turn on the verbose data.table option for these calls, you'll see a message:

The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.

Also, you'll see that the optimized group mean and sum are not being used (see ?GForce for details). We can get around this by following FAQ 1.6 perhaps, but I couldn't figure out how.
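
One possible workaround, offered as an assumption rather than as the FAQ 1.6 approach: run each aggregation as its own plain lapply(.SD, fun) call with .SDcols, which is the form data.table can usually optimize, and merge the per-group results afterwards. The names counts, sums, means and res are made up for this sketch.

```r
library(data.table)

set.seed(1)
DT <- data.table(C1 = c("a", "b", "b"), C2 = round(rnorm(4), 4), C3 = 1:12, C4 = 9:12)
sum_cols <- c("C2", "C3")
mean_cols <- c("C3", "C4")

# one simple grouped call per statistic; add verbose = TRUE to any of them
# to see whether the optimized sum/mean were used
counts <- DT[, .N, by = C1]
sums   <- DT[, lapply(.SD, sum),  by = C1, .SDcols = sum_cols]
means  <- DT[, lapply(.SD, mean), by = C1, .SDcols = mean_cols]

# prefix the columns so that sum.C3 and mean.C3 do not clash
setnames(sums,  sum_cols,  paste0("sum.",  sum_cols))
setnames(means, mean_cols, paste0("mean.", mean_cols))

# join the pieces back together on the grouping column
res <- Reduce(function(x, y) merge(x, y, by = "C1"), list(counts, sums, means))
```

Whether this is actually faster depends on the number of groups and on the cost of the merges, so it is worth timing on real data.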

data.table summary by group for multiple columns

dt <- dt[, .(y = mean(y), z = mean(z)), by=.(a)]

Aggregate over combinations of columns with data.table

One option is rollup from data.table:

library(data.table)
setDT(data)
rollup(data, j = sum(Expenditure), by = c("Country","Gender","Age", "Civil_Status"))
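
To see what rollup returns, here is a small sketch on invented data with just two grouping columns. rollup aggregates over successively shorter prefixes of by, while cube (also in data.table) covers every combination of the by columns; rows belonging to a coarser level show NA in the columns that were dropped. The toy values and the names res_rollup and res_cube are assumptions for the example.

```r
library(data.table)

# Hypothetical toy data: two countries, two genders
data <- data.table(Country = c("A", "A", "B", "B"),
                   Gender = c("F", "M", "F", "M"),
                   Expenditure = c(10, 20, 30, 40))

# levels: (Country, Gender), (Country), grand total
res_rollup <- rollup(data, j = list(total = sum(Expenditure)),
                     by = c("Country", "Gender"))

# levels: (Country, Gender), (Country), (Gender), grand total
res_cube <- cube(data, j = list(total = sum(Expenditure)),
                 by = c("Country", "Gender"))
```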

