Aggregate Multiple Columns At Once

Aggregate multiple columns at once

We can use the formula method of aggregate. The variables on the 'rhs' of ~ are the grouping variables while the . represents all other variables in the 'df1' (from the example, we assume that we need the mean for all the columns except the grouping), specify the dataset and the function (mean).

aggregate(.~id1+id2, df1, mean)

Or we can use summarise_each from dplyr after grouping (group_by)

library(dplyr)
df1 %>%
    group_by(id1, id2) %>% 
    summarise_each(funs(mean))

Or using summarise with across (dplyr devel version - ‘0.8.99.9000’)

df1 %>% 
    group_by(id1, id2) %>%
    summarise(across(starts_with('val'), mean))

Or another option is data.table. We convert the 'data.frame' to 'data.table' (setDT(df1), grouped by 'id1' and 'id2', we loop through the subset of data.table (.SD) and get the mean.

library(data.table)
setDT(df1)[, lapply(.SD, mean), by = .(id1, id2)]

data

df1 <- structure(list(id1 = c("a", "a", "a", "a", "b", "b", 
"b", "b"
), id2 = c("x", "x", "y", "y", "x", "y", "x", "y"), 
val1 = c(1L, 
2L, 3L, 4L, 1L, 4L, 3L, 2L), val2 = c(9L, 4L, 5L, 9L, 7L, 4L, 
9L, 8L)), .Names = c("id1", "id2", "val1", "val2"), 
class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8"))

aggregate multiple columns in a data frame at once calculating different statistics on different columns - R

We could use dplyr for flexibility

library(dplyr)
df1 %>%
     group_by(name) %>% 
     summarise(v1 = mean(v1, na.rm = TRUE),
    v2 = sd(v2, na.rm = TRUE), v3 = max(v3, na.rm = TRUE),
            v4 = sum(v4, na.rm = TRUE))

If there are multiple columns to be blocked for different functions, use across

df1 %>%
    group_by(name) %>%
    summarise(across(c(v1, v2), mean, na.rm = TRUE),
               v3 = sd(v3, na.rm = TRUE),
               across(c(v4, v5), sum, na.rm = TRUE))

Or use collap from collapse

library(collapse)
collap(df1, ~ name, custom = list(fmean = c("v1", "v2"),
      fsd = "v3", fsum = c("v4", "v5")))

How to aggregate multiple columns in a dataframe using values multiple columns

I always prefer using base packages and packages preinstalled with R. In terms of aggregation however I much prefer the ddply way because of its flexibility. You can aggregate with mean sum sd or whatever descriptive you choose.

column1<-c("S104259","S2914138","S999706","S1041120",rep("S1042529",6),rep('S1235729',4))
column2<-c("T6-R190116","T2-R190213","T8-R190118",rep("T8-R190118",3),rep('T2-R190118',3),rep('T6-R200118',4),'T1-R200118')
column3<-c(rep("3S_DMSO",7),rep("uns_DMSO",5),rep("3s_DMSO",2))
output_1<-c(664,292,1158,574,38,0,2850,18,74,8,10,0,664,30)
output_2<-c(364,34,0,74,8,0,850,8,7,8,310,0,64,380)
df<-data.frame(column1,column2,column3,output_1,output_2)

library(plyr)
factornames<-c("column1","column2","column3")
plyr::ddply(df,factornames,plyr::numcolwise(mean,na.rm=TRUE))
plyr::ddply(df,factornames,plyr::numcolwise(sum,na.rm=TRUE))
plyr::ddply(df,factornames,plyr::numcolwise(sd,na.rm=TRUE))

Aggregate / summarize multiple variables per group (e.g. sum, mean)

Where is this year() function from?

You could also use the reshape2 package for this task:

require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
#  year month         x1           x2
1  2000     1  -80.83405 -224.9540159
2  2000     2 -223.76331 -288.2418017
3  2000     3 -188.83930 -481.5601913
4  2000     4 -197.47797 -473.7137420
5  2000     5 -259.07928 -372.4563522

Using aggregate() function in R for multiple columns

An alternative way using only aggregate() and/or colMeans(), showing how to select many columns without having to name them:

# Sum    
aggregate(dat[4:ncol(dat)], by = list(Year = dat$Year), FUN = sum, na.rm = T, na.action = NULL)
# Mean by year
aggregate(dat[4:ncol(dat)], by = list(Year = dat$Year), FUN = mean, na.rm = T, na.action = NULL)
# Mean without grouping per year
colMeans(dat[4:ncol(dat)])

The output:

> aggregate(dat[4:ncol(dat)], by = list(Year = dat$Year), FUN = sum, na.rm = T, na.action = NULL)
  Year Rainfall   test
1 1979    833.1  833.1
2 1980   1492.6 1492.6
> aggregate(dat[4:ncol(dat)], by = list(Year = dat$Year), FUN = mean, na.rm = T, na.action = NULL)
  Year Rainfall     test
1 1979 14.12034 14.12034
2 1980 24.87667 24.87667
> colMeans(dat[4:ncol(dat)])
Rainfall     test 
 19.5437  19.5437

EDIT: in case they are some non-numeric columns in the mix that should stay as non-numeric, you can replace dat[4:ncol(dat)] by Filter(is.numeric, dat[4:ncol(dat)], see an example below:

> dat$test <- as.character(dat$test)
> aggregate(Filter(is.numeric, dat[4:ncol(dat)]), by = list(Year = dat$Year), FUN = sum, na.rm = T, na.action = NULL)
  Year Rainfall
1 1979    833.1
2 1980   1492.6

Python pandas groupby aggregate on multiple columns, then pivot

Edited for Pandas 0.22+ considering the deprecation of the use of dictionaries in a group by aggregation.

We set up a very similar dictionary where we use the keys of the dictionary to specify our functions and the dictionary itself to rename the columns.

rnm_cols = dict(size='Size', sum='Sum', mean='Mean', std='Std')
df.set_index(['Category', 'Item']).stack().groupby('Category') \
  .agg(rnm_cols.keys()).rename(columns=rnm_cols)

            Size   Sum        Mean        Std
Category                                     
Books          3    58   19.333333   2.081666
Clothes        3   148   49.333333   4.041452
Technology     6  1800  300.000000  70.710678

option 1

use agg ← link to docs

agg_funcs = dict(Size='size', Sum='sum', Mean='mean', Std='std')
df.set_index(['Category', 'Item']).stack().groupby(level=0).agg(agg_funcs)

                  Std   Sum        Mean  Size
Category                                     
Books        2.081666    58   19.333333     3
Clothes      4.041452   148   49.333333     3
Technology  70.710678  1800  300.000000     6

option 2

more for less

use describe ← link to docs

df.set_index(['Category', 'Item']).stack().groupby(level=0).describe().unstack()

            count        mean        std    min    25%    50%    75%    max
Category                                                                   
Books         3.0   19.333333   2.081666   17.0   18.5   20.0   20.5   21.0
Clothes       3.0   49.333333   4.041452   45.0   47.5   50.0   51.5   53.0
Technology    6.0  300.000000  70.710678  200.0  262.5  300.0  337.5  400.0

Aggregating rows for multiple columns in R

We can use the formula method of aggregate. By specifying . on the LHS of ~, we select all the columns except the 'Id' column.

aggregate(.~Id, df, sum)
#   Id  A B C total
#1  3 11 4 7    22
#2  4  9 7 8    24

Or we can also specify the columns without using the formula method

aggregate(df[2:ncol(df)],df['Id'], FUN=sum)
#  Id  A B C total
#1  3 11 4 7    22
#2  4  9 7 8    24

Other options include dplyr and data.table.

Using dplyr, we group by 'Id' and get the sum of all columns with summarise_each.

library(dplyr)
df %>%
  group_by(Id) %>%
  summarise_each(funs(sum))

Or with data.table, we convert the 'data.frame' to 'data.table' (setDT(df)), grouped by 'Id', we loop (lapply(..) through the Subset of Data.table (.SD) and get the sum.

library(data.table)
setDT(df)[, lapply(.SD, sum), by = Id]

R Aggregate over multiple columns

Here is an answer that uses base R, and since none of the data in the example data is above 120, we set a criterion of above 70.

data <- structure(
     list(
          date = structure(c(9131, 9132, 9133, 9134, 9135,
                             9136), class = "Date"),
          x1 = c(50.75, 62.625, 57.25, 56.571,
                 36.75, 39.125),
          x2 = c(62.25, 58.714, 49.875, 56.375, 43.25,
                 41.625),
          x3 = c(90.25, NA, 70.125, 75.75, 83.286, 98.5),
          x4 = c(60, 72, 68.375, 65.5, 63.25, 55.875),
          x5 = c(NA_real_,
                 NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
          xn = c(53.25,
                 61.143, 56.571, 58.571, 36.25, 44.375),
          year = c(1995, 1995, 1995, 1995,
                   1995, 1995),
          month = c(1, 1, 1, 1, 1, 1),
          day = c(1, 2, 3,
                  4, 5, 6)
     ),
     row.names = c(NA,-6L),
     class = c("tbl_df", "tbl",
               "data.frame"
     ))

First, we create a subset of the data that contains all columns containing x, and set them to TRUE or FALSE based on whether the value is greater than 70.

theCols <- data[,colnames(data)[grepl("x",colnames(data))]]

Second, we cbind() the year onto the matrix of logical values.

x_logical <- cbind(year = data$year,as.data.frame(apply(theCols,2,function(x) x > 70)))

Finally, we use aggregate across all columns other than year and sum the columns.

aggregate(x_logical[2:ncol(x_logical)],by = list(x_logical$year),sum,na.rm=TRUE)

...and the output:

  Group.1 x1 x2 x3 x4 x5 xn
1    1995  0  0  5  1  0  0
>

Note that by using colnames() to extract the columns that start with x and nrow() in the aggregate() function, we make this a general solution that will handle a varying number of x locations.

Two tidyverse solutions

A tidyverse solution to the same problem is as follows. It includes the following steps.

Use mutate() with across() to create the TRUE / FALSE versions of the x variables. Note that across() requires dplyr 1.0.0, which is currently in development but due for production release the week of May 25th.
Use pivot_longer() to allow us to summarise() multiple measures without a lot of complicated code.
Use pivot_wider() to convert the data back to one column for each x measurement.

...and the code is:

devtools::install_github("tidyverse/dplyr") # needed for across()
library(dplyr)
library(tidyr) 
library(lubridate) 
data %>%
     mutate(.,across(starts_with("x"),~if_else(. > 70,TRUE,FALSE))) %>%
        select(-year,-month,-day) %>% group_by(date) %>% 
        pivot_longer(starts_with("x"),names_to = "measure",values_to = "value") %>% 
        mutate(year = year(date)) %>% group_by(year,measure) %>%
        select(-date) %>% 
                summarise(value = sum(value,na.rm=TRUE)) %>%
        pivot_wider(id_cols = year,names_from = "measure",
                    values_from = value)

...and the output, which matches the Base R solution that I originally posted:

`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 1 x 7
# Groups:   year [1]
   year    x1    x2    x3    x4    x5    xn
  <dbl> <int> <int> <int> <int> <int> <int>
1  1995     0     0     5     1     0     0
>

...and here's an edited version of the other answer that will also produce the same results as above. This solution implements pivot_longer() before creating the logical variable for exceeding the threshold, so it does not require the across() function. Also note that since this uses 120 as the threshold value and none of the data meets this threshold, the sums are all 0.

df_example %>% 
        pivot_longer(x1:x5) %>% 
        mutate(greater_120 = value > 120) %>% 
        group_by(year,name) %>% 
        summarise(sum_120 = sum(greater_120,na.rm = TRUE)) %>%
        pivot_wider(id_cols = year,names_from = "name", values_from = sum_120)

...and the output:

`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 1 x 6
# Groups:   year [1]
   year    x1    x2    x3    x4    x5
  <dbl> <int> <int> <int> <int> <int>
1  1995     0     0     0     0     0
>

Conclusions

As usual, there are many ways to accomplish a given task in R. Depending on one's preferences, the problem can be solved with Base R or the tidyverse. One of the quirks of the tidyverse is that some operations such as summarise() are much easier to perform on narrow format tidy data than on wide format data. Therefore, it's important to be proficient with tidyr::pivot_longer() and pivot_wider() when working in the tidyverse.

That said, with the production release of dplyr 1.0.0, the team at RStudio continues to add features that facilitate working with wide format data.

Apply Same Aggregation on Multiple Columns when Using Groupby (python)

You can generate dict:

d = {**{"payment_amount": 'sum'}, 
     **dict.fromkeys(["user_id" , "category" , "name"], 'first')}

print (d)
{'payment_amount': 'sum', 'user_id': 'first', 'category': 'first', 'name': 'first'}

expected_output = example_df.groupby("user_id").agg(d)

Aggregate Multiple Columns At Once