Aggregate / summarize multiple variables per group (e.g. sum, mean)
Where is this year() function from?
You could also use the reshape2 package for this task:
library(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
#   year month         x1           x2
# 1 2000     1  -80.83405 -224.9540159
# 2 2000     2 -223.76331 -288.2418017
# 3 2000     3 -188.83930 -481.5601913
# 4 2000     4 -197.47797 -473.7137420
# 5 2000     5 -259.07928 -372.4563522
Aggregate multiple columns at once
We can use the formula method of aggregate. The variables on the 'rhs' of ~ are the grouping variables, while the . represents all other variables in 'df1' (from the example, we assume that we need the mean for all the columns except the grouping ones). Then specify the dataset and the function (mean).
aggregate(.~id1+id2, df1, mean)
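Or we can use summarise_each from dplyr after grouping (group_by). Note that summarise_each and funs are deprecated in current dplyr releases, so prefer the across version further down.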
library(dplyr)
df1 %>%
group_by(id1, id2) %>%
summarise_each(funs(mean))
Or using summarise with across (dplyr devel version '0.8.99.9000' at the time of writing; across is part of the released package from 1.0.0 onwards)
df1 %>%
group_by(id1, id2) %>%
summarise(across(starts_with('val'), mean))
Or another option is data.table. We convert the 'data.frame' to 'data.table' (setDT(df1)), group by 'id1' and 'id2', loop through the subset of data.table (.SD) and get the mean.
library(data.table)
setDT(df1)[, lapply(.SD, mean), by = .(id1, id2)]
data
df1 <- structure(list(id1 = c("a", "a", "a", "a", "b", "b", "b", "b"),
    id2 = c("x", "x", "y", "y", "x", "y", "x", "y"),
    val1 = c(1L, 2L, 3L, 4L, 1L, 4L, 3L, 2L),
    val2 = c(9L, 4L, 5L, 9L, 7L, 4L, 9L, 8L)),
    .Names = c("id1", "id2", "val1", "val2"),
    class = "data.frame",
    row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))
Summarizing multiple columns with dplyr?
In dplyr (>= 1.0.0) you may use across(everything()) in summarise to apply a function to all variables:
library(dplyr)
# pass mean directly so the result keeps the original column names (a, b, c, d)
df %>% group_by(grp) %>% summarise(across(everything(), mean))
#> # A tibble: 3 x 5
#> grp a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.08 2.98 2.98 2.91
#> 2 2 3.03 3.04 2.97 2.87
#> 3 3 2.85 2.95 2.95 3.06
Alternatively, the purrrlyr package provides the same functionality:
library(purrrlyr)
df %>% slice_rows("grp") %>% dmap(mean)
#> # A tibble: 3 x 5
#> grp a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.08 2.98 2.98 2.91
#> 2 2 3.03 3.04 2.97 2.87
#> 3 3 2.85 2.95 2.95 3.06
Also don't forget about data.table (use keyby to sort the groups):
library(data.table)
setDT(df)[, lapply(.SD, mean), keyby = grp]
#> grp a b c d
#> 1: 1 3.079412 2.979412 2.979412 2.914706
#> 2: 2 3.029126 3.038835 2.967638 2.873786
#> 3: 3 2.854701 2.948718 2.951567 3.062678
Let's try to compare performance.
library(dplyr)
library(purrrlyr)
library(data.table)
library(bench)
set.seed(123)
n <- 10000
df <- data.frame(
a = sample(1:5, n, replace = TRUE),
b = sample(1:5, n, replace = TRUE),
c = sample(1:5, n, replace = TRUE),
d = sample(1:5, n, replace = TRUE),
grp = sample(1:3, n, replace = TRUE)
)
dt <- setDT(df)
mark(
dplyr = df %>% group_by(grp) %>% summarise(across(everything(), list(mean))),
purrrlyr = df %>% slice_rows("grp") %>% dmap(mean),
data.table = dt[, lapply(.SD, mean), keyby = grp],
check = FALSE
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dplyr 2.81ms 2.85ms 328. NA 17.3
#> 2 purrrlyr 7.96ms 8.04ms 123. NA 24.5
#> 3 data.table 596.33µs 707.91µs 1409. NA 10.3
Groupby and combine and aggregate multiple groups into one single group based on condition
You can do this in multiple steps. First partition the dataframe into 2 parts, where the first one contains all rows that need to be aggregated (groups with fewer than 12 time points whose level0 contains more than one level1 group).
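import pandas as pd

grp = grp.reset_index()
# number of distinct level1 values within each level0 group, repeated on every row
grp['nunique'] = grp.groupby(['level0'])['level1'].transform('nunique')
# partition
grp_small = grp.loc[grp['nunique'] > 1].groupby(['level0', 'level1', 'level2']).filter(lambda x: len(x) < 12)
idx_small = grp_small.index
# Index.difference instead of a set: pandas 2.x rejects a set as a .loc indexer
grp_large = grp.loc[grp.index.difference(idx_small)]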
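To see what transform and filter each contribute to the partition step, here is a minimal sketch on a made-up frame. The data and the threshold of 3 rows (standing in for 12) are assumptions chosen only to keep the example small; they are not from the question.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "level0": ["A", "A", "A", "A", "A", "B", "B"],
    "level1": ["AA", "AA", "AA", "AB", "AB", "BA", "BA"],
    "values": [1, 2, 3, 4, 5, 6, 7],
})

# transform returns one value per input row: the number of distinct
# level1 values in that row's level0 group.
toy["nunique"] = toy.groupby("level0")["level1"].transform("nunique")

# filter keeps or drops whole groups; here it keeps groups with fewer
# than 3 rows, after discarding level0 groups with a single level1 value.
small = (
    toy.loc[toy["nunique"] > 1]
       .groupby(["level0", "level1"])
       .filter(lambda g: len(g) < 3)
)

print(toy["nunique"].tolist())
print(small.index.tolist())
```

Only the two "AB" rows survive: the "AA" group is too long, and the "B" rows are excluded because their level0 group has a single level1 value.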
Now we can apply the sum aggregation on the grp_small dataframe while leaving grp_large as it is.
grp_small = grp_small.groupby(['level0', 'date'], as_index=False).sum()
grp_small[['level1', 'level2']] = ['agg_lv1', 'agg_lv2']
And finally, we concat the two dataframes together and apply some final postprocessing:
df = pd.concat([grp_large, grp_small], ignore_index=True)
df = df.drop(columns='nunique').set_index(['level0', 'level1', 'level2', 'date']).sort_index()
Result with the given data (with added rows to the first group during computation):
values
level0 level1 level2 date
A AA AA_1 2006-10-31 300
2006-11-30 220
2006-12-31 415
... ...
2007-04-30 19
2007-05-31 77
2007-08-31 463
agg_lv1 agg_lv2 2006-04-30 700
2006-05-31 2600
2006-08-31 200
2007-06-30 300
2007-09-30 7000
... ... ... ... ...
Z ZZ ZZ_9 2006-04-30 3680
2006-09-30 277
2007-03-31 1490
2007-09-30 289
2007-10-31 387
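For experimenting with the approach, the steps can be wrapped in one function and run on a tiny frame. This is only a sketch: merge_small_groups and its min_len parameter are hypothetical names, the toy data is invented, and unlike the code above it sums only the values column rather than every column.

```python
import pandas as pd

def merge_small_groups(frame, min_len):
    """Hypothetical helper: collapse short series into one series per level0.

    A (level0, level1, level2) series counts as short when it has fewer than
    min_len rows and its level0 group has more than one distinct level1 value.
    """
    frame = frame.copy()
    frame["nunique"] = frame.groupby("level0")["level1"].transform("nunique")

    small = (
        frame.loc[frame["nunique"] > 1]
             .groupby(["level0", "level1", "level2"])
             .filter(lambda g: len(g) < min_len)
    )
    large = frame.loc[frame.index.difference(small.index)]

    # Sum the short series date by date, then attach placeholder labels.
    small = small.groupby(["level0", "date"], as_index=False)["values"].sum()
    small["level1"] = "agg_lv1"
    small["level2"] = "agg_lv2"

    out = pd.concat([large.drop(columns="nunique"), small], ignore_index=True)
    return out.set_index(["level0", "level1", "level2", "date"]).sort_index()


# Toy data: one series of 3 rows and two series of 2 rows, all under level0 "A".
dates = list(pd.to_datetime(["2006-01-31", "2006-02-28", "2006-03-31"]))
toy = pd.DataFrame({
    "level0": ["A"] * 7,
    "level1": ["AA"] * 3 + ["AB"] * 2 + ["AC"] * 2,
    "level2": ["AA_1"] * 3 + ["AB_1"] * 2 + ["AC_1"] * 2,
    "date": dates + dates[:2] + dates[:2],
    "values": [10, 20, 30, 1, 2, 100, 200],
})

result = merge_small_groups(toy, min_len=3)
print(result)
```

With min_len=3 the "AA" series is left untouched, while "AB" and "AC" are summed per date into a single agg_lv1/agg_lv2 series.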
How to aggregate multiple columns in a dataframe using values multiple columns
I always prefer using base packages and packages preinstalled with R. In terms of aggregation, however, I much prefer the ddply way (from the plyr package) because of its flexibility. You can aggregate with mean, sum, sd or whatever descriptive statistic you choose.
column1<-c("S104259","S2914138","S999706","S1041120",rep("S1042529",6),rep('S1235729',4))
column2<-c("T6-R190116","T2-R190213","T8-R190118",rep("T8-R190118",3),rep('T2-R190118',3),rep('T6-R200118',4),'T1-R200118')
column3<-c(rep("3S_DMSO",7),rep("uns_DMSO",5),rep("3s_DMSO",2))
output_1<-c(664,292,1158,574,38,0,2850,18,74,8,10,0,664,30)
output_2<-c(364,34,0,74,8,0,850,8,7,8,310,0,64,380)
df<-data.frame(column1,column2,column3,output_1,output_2)
library(plyr)
factornames<-c("column1","column2","column3")
plyr::ddply(df,factornames,plyr::numcolwise(mean,na.rm=TRUE))
plyr::ddply(df,factornames,plyr::numcolwise(sum,na.rm=TRUE))
plyr::ddply(df,factornames,plyr::numcolwise(sd,na.rm=TRUE))
Aggregate multiple variables with rbind
A dplyr approach would be:
df %>%
bind_rows(df %>%
group_by(year) %>%
summarize(county = 'Florida', across(starts_with('value'), sum))) %>%
arrange(year, county)
#> year county value1 value2 value3 value4
#> 1 2005 Alachua County 3 3 3 3
#> 2 2005 Baker County 9 9 9 9
#> 3 2005 Bay County 5 5 5 5
#> 4 2005 Florida 17 17 17 17
#> 5 2006 Alachua County 6 6 6 6
#> 6 2006 Baker County 8 8 8 8
#> 7 2006 Bay County 8 8 8 8
#> 8 2006 Florida 22 22 22 22
#> 9 2007 Alachua County 8 8 8 8
#> 10 2007 Baker County 4 4 4 4
#> 11 2007 Bay County 10 10 10 10
#> 12 2007 Florida 22 22 22 22
Apply several summary functions on several variables by group in one call
You can do it all in one step and get proper labeling:
aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) )
# id1 id2 val1.mn val1.n val2.mn val2.n
# 1 a x 1.5 2.0 6.5 2.0
# 2 b x 2.0 2.0 8.0 2.0
# 3 a y 3.5 2.0 7.0 2.0
# 4 b y 3.0 2.0 6.0 2.0
This creates a dataframe with two id columns and two matrix columns:
str( aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) )
'data.frame': 4 obs. of 4 variables:
$ id1 : Factor w/ 2 levels "a","b": 1 2 1 2
$ id2 : Factor w/ 2 levels "x","y": 1 1 2 2
$ val1: num [1:4, 1:2] 1.5 2 3.5 3 2 2 2 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "mn" "n"
$ val2: num [1:4, 1:2] 6.5 8 7 6 2 2 2 2
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "mn" "n"
As pointed out by @lord.garbage below, this can be converted to a dataframe with "simple" columns by using do.call(data.frame, ...):
str( do.call(data.frame, aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) ) )
'data.frame': 4 obs. of 6 variables:
$ id1 : Factor w/ 2 levels "a","b": 1 2 1 2
$ id2 : Factor w/ 2 levels "x","y": 1 1 2 2
$ val1.mn: num 1.5 2 3.5 3
$ val1.n : num 2 2 2 2
$ val2.mn: num 6.5 8 7 6
$ val2.n : num 2 2 2 2
This is the syntax for multiple variables on the LHS:
aggregate(cbind(val1, val2) ~ id1 + id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) )
How to group and sum using a loop in R?
Instead of a loop, it's easier to use tidyverse functions. To do this, you "group" by your variable and summarize with the summary function being sum.
library(tidyverse)
df %>%
group_by(taxa) %>%
summarize(across(ON1:ON3, sum))
#> # A tibble: 2 × 4
#> taxa ON1 ON2 ON3
#> <chr> <dbl> <dbl> <dbl>
#> 1 arch 28 118 163
#> 2 bac 210 266 205
Created on 2021-09-29 by the reprex package (v2.0.1)