Aggregate/Summarize Multiple Variables Per Group (E.G. Sum, Mean)

Aggregate / summarize multiple variables per group (e.g. sum, mean)

Where is this year() function from?

You could also use the reshape2 package for this task:

require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
# year month x1 x2
1 2000 1 -80.83405 -224.9540159
2 2000 2 -223.76331 -288.2418017
3 2000 3 -188.83930 -481.5601913
4 2000 4 -197.47797 -473.7137420
5 2000 5 -259.07928 -372.4563522

Aggregate multiple columns at once

We can use the formula method of aggregate. The variables on the 'rhs' of ~ are the grouping variables while the . represents all other variables in the 'df1' (from the example, we assume that we need the mean for all the columns except the grouping), specify the dataset and the function (mean).

aggregate(.~id1+id2, df1, mean)

Or we can use summarise_each from dplyr after grouping (group_by)

library(dplyr)
df1 %>%
group_by(id1, id2) %>%
summarise_each(funs(mean))

Or using summarise with across (dplyr devel version - ‘0.8.99.9000’)

df1 %>% 
group_by(id1, id2) %>%
summarise(across(starts_with('val'), mean))

Or another option is data.table. We convert the 'data.frame' to 'data.table' (setDT(df1), grouped by 'id1' and 'id2', we loop through the subset of data.table (.SD) and get the mean.

library(data.table)
setDT(df1)[, lapply(.SD, mean), by = .(id1, id2)]

data

df1 <- structure(list(id1 = c("a", "a", "a", "a", "b", "b", 
"b", "b"
), id2 = c("x", "x", "y", "y", "x", "y", "x", "y"),
val1 = c(1L,
2L, 3L, 4L, 1L, 4L, 3L, 2L), val2 = c(9L, 4L, 5L, 9L, 7L, 4L,
9L, 8L)), .Names = c("id1", "id2", "val1", "val2"),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))

Summarizing multiple columns with dplyr?

In dplyr (>=1.00) you may use across(everything() in summarise to apply a function to all variables:

library(dplyr)

df %>% group_by(grp) %>% summarise(across(everything(), list(mean)))
#> # A tibble: 3 x 5
#> grp a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.08 2.98 2.98 2.91
#> 2 2 3.03 3.04 2.97 2.87
#> 3 3 2.85 2.95 2.95 3.06

Alternatively, the purrrlyr package provides the same functionality:

library(purrrlyr)
df %>% slice_rows("grp") %>% dmap(mean)
#> # A tibble: 3 x 5
#> grp a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.08 2.98 2.98 2.91
#> 2 2 3.03 3.04 2.97 2.87
#> 3 3 2.85 2.95 2.95 3.06

Also don't forget about data.table (use keyby to sort sort groups):

library(data.table)
setDT(df)[, lapply(.SD, mean), keyby = grp]
#> grp a b c d
#> 1: 1 3.079412 2.979412 2.979412 2.914706
#> 2: 2 3.029126 3.038835 2.967638 2.873786
#> 3: 3 2.854701 2.948718 2.951567 3.062678

Let's try to compare performance.

library(dplyr)
library(purrrlyr)
library(data.table)
library(bench)
set.seed(123)
n <- 10000
df <- data.frame(
a = sample(1:5, n, replace = TRUE),
b = sample(1:5, n, replace = TRUE),
c = sample(1:5, n, replace = TRUE),
d = sample(1:5, n, replace = TRUE),
grp = sample(1:3, n, replace = TRUE)
)
dt <- setDT(df)
mark(
dplyr = df %>% group_by(grp) %>% summarise(across(everything(), list(mean))),
purrrlyr = df %>% slice_rows("grp") %>% dmap(mean),
data.table = dt[, lapply(.SD, mean), keyby = grp],
check = FALSE
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dplyr 2.81ms 2.85ms 328. NA 17.3
#> 2 purrrlyr 7.96ms 8.04ms 123. NA 24.5
#> 3 data.table 596.33µs 707.91µs 1409. NA 10.3

Groupby and combine and aggregate multiple groups into one single group based on condition

You can do this in multiple steps. First partition the dataframe into 2 where the first one contains all rows that need to be aggregated (both more than 12 time points and more than one level1 group).

grp = grp.reset_index()
grp['nunique'] = grp.groupby(['level0'])['level1'].transform('nunique')

# partition
grp_small = grp.loc[grp['nunique'] > 1].groupby(['level0', 'level1', 'level2']).filter(lambda x: len(x) < 12)
idx_small = grp_small.index
grp_large = grp.loc[set(grp.index) - set(idx_small)]

Now we can apply the sum aggregation on the grp_small dataframe while leaving grp_large as it is.

grp_small = grp_small.groupby(['level0', 'date'], as_index=False).sum()
grp_small[['level1', 'level2']] = ['agg_lv1', 'agg_lv2']

And finally, we concat the two dataframes together and apply some final postprocessing:

df = pd.concat([grp_large, grp_small], ignore_index=True)
df = df.drop(columns='nunique').set_index(['level0', 'level1', 'level2', 'date']).sort_index()

Result with the given data (with added rows to the first group during computation):

                                   values
level0 level1 level2 date
A AA AA_1 2006-10-31 300
2006-11-30 220
2006-12-31 415
... ...
2007-04-30 19
2007-05-31 77
2007-08-31 463
agg_lv1 agg_lv2 2006-04-30 700
2006-05-31 2600
2006-08-31 200
2007-06-30 300
2007-09-30 7000
... ... ... ... ...
Z ZZ ZZ_9 2006-04-30 3680
2006-09-30 277
2007-03-31 1490
2007-09-30 289
2007-10-31 387

How to aggregate multiple columns in a dataframe using values multiple columns

I always prefer using base packages and packages preinstalled with R. In terms of aggregation however I much prefer the ddply way because of its flexibility. You can aggregate with mean sum sd or whatever descriptive you choose.

column1<-c("S104259","S2914138","S999706","S1041120",rep("S1042529",6),rep('S1235729',4))
column2<-c("T6-R190116","T2-R190213","T8-R190118",rep("T8-R190118",3),rep('T2-R190118',3),rep('T6-R200118',4),'T1-R200118')
column3<-c(rep("3S_DMSO",7),rep("uns_DMSO",5),rep("3s_DMSO",2))
output_1<-c(664,292,1158,574,38,0,2850,18,74,8,10,0,664,30)
output_2<-c(364,34,0,74,8,0,850,8,7,8,310,0,64,380)
df<-data.frame(column1,column2,column3,output_1,output_2)

library(plyr)
factornames<-c("column1","column2","column3")
plyr::ddply(df,factornames,plyr::numcolwise(mean,na.rm=TRUE))
plyr::ddply(df,factornames,plyr::numcolwise(sum,na.rm=TRUE))
plyr::ddply(df,factornames,plyr::numcolwise(sd,na.rm=TRUE))

Aggregate multiple variables with rbind

A dplyr approach would be:

df %>%
bind_rows(df %>%
group_by(year) %>%
summarize(county = 'Florida', across(starts_with('value'), sum))) %>%
arrange(year, county)
#> year county value1 value2 value3 value4
#> 1 2005 Alachua County 3 3 3 3
#> 2 2005 Baker County 9 9 9 9
#> 3 2005 Bay County 5 5 5 5
#> 4 2005 Florida 17 17 17 17
#> 5 2006 Alachua County 6 6 6 6
#> 6 2006 Baker County 8 8 8 8
#> 7 2006 Bay County 8 8 8 8
#> 8 2006 Florida 22 22 22 22
#> 9 2007 Alachua County 8 8 8 8
#> 10 2007 Baker County 4 4 4 4
#> 11 2007 Bay County 10 10 10 10
#> 12 2007 Florida 22 22 22 22

Apply several summary functions on several variables by group in one call

You can do it all in one step and get proper labeling:

> aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) )
# id1 id2 val1.mn val1.n val2.mn val2.n
# 1 a x 1.5 2.0 6.5 2.0
# 2 b x 2.0 2.0 8.0 2.0
# 3 a y 3.5 2.0 7.0 2.0
# 4 b y 3.0 2.0 6.0 2.0

This creates a dataframe with two id columns and two matrix columns:

str( aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) )
'data.frame': 4 obs. of 4 variables:
$ id1 : Factor w/ 2 levels "a","b": 1 2 1 2
$ id2 : Factor w/ 2 levels "x","y": 1 1 2 2
$ val1: num [1:4, 1:2] 1.5 2 3.5 3 2 2 2 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "mn" "n"
$ val2: num [1:4, 1:2] 6.5 8 7 6 2 2 2 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "mn" "n"

As pointed out by @lord.garbage below, this can be converted to a dataframe with "simple" columns by using do.call(data.frame, ...)

str( do.call(data.frame, aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) ) 
)
'data.frame': 4 obs. of 6 variables:
$ id1 : Factor w/ 2 levels "a","b": 1 2 1 2
$ id2 : Factor w/ 2 levels "x","y": 1 1 2 2
$ val1.mn: num 1.5 2 3.5 3
$ val1.n : num 2 2 2 2
$ val2.mn: num 6.5 8 7 6
$ val2.n : num 2 2 2 2

This is the syntax for multiple variables on the LHS:

aggregate(cbind(val1, val2) ~ id1 + id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) )

How to group and sum using a loop in R?

Instead of a loop, it's easier to use tidyverse functions. To do this, you "group" by your variable and summarize with the summary function being sum.

library(tidyverse)
df %>%
group_by(taxa) %>%
summarize(across(ON1:ON3, sum))
#> # A tibble: 2 × 4
#> taxa ON1 ON2 ON3
#> <chr> <dbl> <dbl> <dbl>
#> 1 arch 28 118 163
#> 2 bac 210 266 205
Created on 2021-09-29 by the reprex package (v2.0.1)


Related Topics



Leave a reply



Submit