Aggregate Multiple Columns At Once

Aggregate multiple columns at once

We can use the formula method of aggregate(). The variables on the right-hand side of ~ are the grouping variables, while the . represents all other variables in df1 (in this example, we assume we want the mean of every non-grouping column). We then specify the dataset and the function (mean).

aggregate(.~id1+id2, df1, mean)

Or we can use summarise_each from dplyr (since superseded by across) after grouping with group_by:

library(dplyr)
df1 %>%
  group_by(id1, id2) %>%
  summarise_each(funs(mean))

Or use summarise with across (available from dplyr 1.0.0):

df1 %>%
  group_by(id1, id2) %>%
  summarise(across(starts_with('val'), mean))

Another option is data.table. We convert the 'data.frame' to a 'data.table' with setDT(df1); then, grouped by 'id1' and 'id2', we loop through the subset of the data.table (.SD) and get the mean of each column.

library(data.table)
setDT(df1)[, lapply(.SD, mean), by = .(id1, id2)]

data

df1 <- structure(list(id1 = c("a", "a", "a", "a", "b", "b", "b", "b"),
  id2 = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1L, 2L, 3L, 4L, 1L, 4L, 3L, 2L),
  val2 = c(9L, 4L, 5L, 9L, 7L, 4L, 9L, 8L)),
  .Names = c("id1", "id2", "val1", "val2"), class = "data.frame",
  row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))
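As an aside for pandas users, the same grouped mean is a one-liner there as well; this is a minimal sketch (not part of the original R answers), rebuilding df1 from the data block above:

```python
import pandas as pd

# The same df1 as in the R data block above.
df1 = pd.DataFrame({
    'id1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'id2': ['x', 'x', 'y', 'y', 'x', 'y', 'x', 'y'],
    'val1': [1, 2, 3, 4, 1, 4, 3, 2],
    'val2': [9, 4, 5, 9, 7, 4, 9, 8],
})

# groupby + mean is the pandas analogue of aggregate(. ~ id1 + id2, df1, mean)
res = df1.groupby(['id1', 'id2'], as_index=False).mean()
print(res)
```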

Aggregate multiple columns in a data frame at once, calculating different statistics on different columns - R

We could use dplyr for flexibility:

library(dplyr)
df1 %>%
  group_by(name) %>%
  summarise(v1 = mean(v1, na.rm = TRUE),
            v2 = sd(v2, na.rm = TRUE),
            v3 = max(v3, na.rm = TRUE),
            v4 = sum(v4, na.rm = TRUE))

If different functions are to be applied to different sets of columns, use across:

df1 %>%
  group_by(name) %>%
  summarise(across(c(v1, v2), ~ mean(.x, na.rm = TRUE)),
            v3 = sd(v3, na.rm = TRUE),
            across(c(v4, v5), ~ sum(.x, na.rm = TRUE)))

Or use collap from collapse

library(collapse)
collap(df1, ~ name, custom = list(fmean = c("v1", "v2"),
                                  fsd = "v3", fsum = c("v4", "v5")))

How to aggregate multiple columns in a data frame using values from multiple columns

I always prefer using base packages and packages preinstalled with R. For aggregation, however, I much prefer the ddply approach because of its flexibility: you can aggregate with mean, sum, sd, or whatever descriptive statistic you choose.

column1 <- c("S104259", "S2914138", "S999706", "S1041120", rep("S1042529", 6), rep("S1235729", 4))
column2 <- c("T6-R190116", "T2-R190213", "T8-R190118", rep("T8-R190118", 3), rep("T2-R190118", 3), rep("T6-R200118", 4), "T1-R200118")
column3 <- c(rep("3S_DMSO", 7), rep("uns_DMSO", 5), rep("3s_DMSO", 2))
output_1 <- c(664, 292, 1158, 574, 38, 0, 2850, 18, 74, 8, 10, 0, 664, 30)
output_2 <- c(364, 34, 0, 74, 8, 0, 850, 8, 7, 8, 310, 0, 64, 380)
df <- data.frame(column1, column2, column3, output_1, output_2)

library(plyr)
factornames <- c("column1", "column2", "column3")
plyr::ddply(df, factornames, plyr::numcolwise(mean, na.rm = TRUE))
plyr::ddply(df, factornames, plyr::numcolwise(sum, na.rm = TRUE))
plyr::ddply(df, factornames, plyr::numcolwise(sd, na.rm = TRUE))

Aggregate / summarize multiple variables per group (e.g. sum, mean)

(As for the year() function used in some answers to this question: it comes from the lubridate package.)

You could also use the reshape2 package for this task:

require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
#   year month         x1           x2
# 1 2000     1  -80.83405 -224.9540159
# 2 2000     2 -223.76331 -288.2418017
# 3 2000     3 -188.83930 -481.5601913
# 4 2000     4 -197.47797 -473.7137420
# 5 2000     5 -259.07928 -372.4563522

Using aggregate() function in R for multiple columns

An alternative way using only aggregate() and/or colMeans(), showing how to select many columns without having to name them:

# Sum    
aggregate(dat[4:ncol(dat)], by = list(Year = dat$Year), FUN = sum, na.rm = T, na.action = NULL)
# Mean by year
aggregate(dat[4:ncol(dat)], by = list(Year = dat$Year), FUN = mean, na.rm = T, na.action = NULL)
# Mean without grouping per year
colMeans(dat[4:ncol(dat)])

The output:

> aggregate(dat[4:ncol(dat)], by = list(Year = dat$Year), FUN = sum, na.rm = T, na.action = NULL)
  Year Rainfall   test
1 1979    833.1  833.1
2 1980   1492.6 1492.6
> aggregate(dat[4:ncol(dat)], by = list(Year = dat$Year), FUN = mean, na.rm = T, na.action = NULL)
  Year Rainfall     test
1 1979 14.12034 14.12034
2 1980 24.87667 24.87667
> colMeans(dat[4:ncol(dat)])
Rainfall     test
 19.5437  19.5437

EDIT: in case there are some non-numeric columns in the mix that should stay non-numeric, you can replace dat[4:ncol(dat)] with Filter(is.numeric, dat[4:ncol(dat)]); see the example below:

> dat$test <- as.character(dat$test)
> aggregate(Filter(is.numeric, dat[4:ncol(dat)]), by = list(Year = dat$Year), FUN = sum, na.rm = T, na.action = NULL)
  Year Rainfall
1 1979    833.1
2 1980   1492.6

Python pandas groupby aggregate on multiple columns, then pivot

Edited for Pandas 0.22+ considering the deprecation of the use of dictionaries in a group by aggregation.

We set up a very similar dictionary where we use the keys of the dictionary to specify our functions and the dictionary itself to rename the columns.

rnm_cols = dict(size='Size', sum='Sum', mean='Mean', std='Std')
df.set_index(['Category', 'Item']).stack().groupby('Category') \
  .agg(list(rnm_cols.keys())).rename(columns=rnm_cols)

            Size   Sum        Mean        Std
Category
Books          3    58   19.333333   2.081666
Clothes        3   148   49.333333   4.041452
Technology     6  1800  300.000000  70.710678
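In pandas 0.25 and later, the same renamed summary can also be written with named aggregation, which does the renaming in the same step as the aggregation. A sketch with hypothetical data chosen to be consistent with the summary tables shown (the original df is not reproduced in this excerpt):

```python
import pandas as pd

# Hypothetical data consistent with the Size/Sum/Mean/Std table above.
df = pd.DataFrame({
    'Category': ['Books'] * 3 + ['Clothes'] * 3 + ['Technology'] * 6,
    'Item': list('ABCDEFGHIJKL'),
    'Price': [17, 20, 21, 45, 50, 53, 200, 250, 300, 300, 350, 400],
})

# Named aggregation: keyword = output column name, value = function name.
out = df.groupby('Category')['Price'].agg(
    Size='size', Sum='sum', Mean='mean', Std='std')
print(out)
```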

option 1

use agg

agg_funcs = dict(Size='size', Sum='sum', Mean='mean', Std='std')
df.set_index(['Category', 'Item']).stack().groupby(level=0).agg(agg_funcs)

                  Std   Sum        Mean  Size
Category
Books        2.081666    58   19.333333     3
Clothes      4.041452   148   49.333333     3
Technology  70.710678  1800  300.000000     6

option 2

more for less

use describe

df.set_index(['Category', 'Item']).stack().groupby(level=0).describe().unstack()

            count        mean        std    min    25%    50%    75%    max
Category
Books         3.0   19.333333   2.081666   17.0   18.5   20.0   20.5   21.0
Clothes       3.0   49.333333   4.041452   45.0   47.5   50.0   51.5   53.0
Technology    6.0  300.000000  70.710678  200.0  262.5  300.0  337.5  400.0

Aggregating rows for multiple columns in R

We can use the formula method of aggregate. By specifying . on the LHS of ~, we select all the columns except the 'Id' column.

aggregate(.~Id, df, sum)
# Id A B C total
#1 3 11 4 7 22
#2 4 9 7 8 24

Or we can specify the columns directly without the formula method:

aggregate(df[2:ncol(df)], df['Id'], FUN = sum)
# Id A B C total
#1 3 11 4 7 22
#2 4 9 7 8 24

Other options include dplyr and data.table.

Using dplyr, we group by 'Id' and get the sum of all columns with summarise_each (superseded by across in current dplyr):

library(dplyr)
df %>%
  group_by(Id) %>%
  summarise_each(funs(sum))

Or with data.table: we convert the 'data.frame' to a 'data.table' with setDT(df); then, grouped by 'Id', we loop through the Subset of Data.table (.SD) with lapply() and get the sum.

library(data.table)
setDT(df)[, lapply(.SD, sum), by = Id]

R Aggregate over multiple columns

Here is an answer that uses base R. Since none of the values in the example data exceed 120, we set a criterion of above 70.

data <- structure(list(
  date = structure(c(9131, 9132, 9133, 9134, 9135, 9136), class = "Date"),
  x1 = c(50.75, 62.625, 57.25, 56.571, 36.75, 39.125),
  x2 = c(62.25, 58.714, 49.875, 56.375, 43.25, 41.625),
  x3 = c(90.25, NA, 70.125, 75.75, 83.286, 98.5),
  x4 = c(60, 72, 68.375, 65.5, 63.25, 55.875),
  x5 = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
  xn = c(53.25, 61.143, 56.571, 58.571, 36.25, 44.375),
  year = c(1995, 1995, 1995, 1995, 1995, 1995),
  month = c(1, 1, 1, 1, 1, 1),
  day = c(1, 2, 3, 4, 5, 6)),
  row.names = c(NA, -6L),
  class = c("tbl_df", "tbl", "data.frame"))

First, we create a subset of the data containing every column whose name contains x, and convert the values to TRUE or FALSE based on whether they are greater than 70.

theCols <- data[,colnames(data)[grepl("x",colnames(data))]]

Second, we cbind() the year onto the matrix of logical values.

x_logical <- cbind(year = data$year, as.data.frame(apply(theCols, 2, function(x) x > 70)))

Finally, we use aggregate across all columns other than year and sum the columns.

aggregate(x_logical[2:ncol(x_logical)], by = list(x_logical$year), sum, na.rm = TRUE)

...and the output:

  Group.1 x1 x2 x3 x4 x5 xn
1    1995  0  0  5  1  0  0

Note that by using colnames() with grepl() to extract the columns that contain x, and ncol() in the aggregate() call, we make this a general solution that will handle a varying number of x locations.

Two tidyverse solutions

A tidyverse solution to the same problem is as follows. It includes the following steps.

  1. Use mutate() with across() to create the TRUE / FALSE versions of the x variables. Note that across() requires dplyr 1.0.0 or later.

  2. Use pivot_longer() to allow us to summarise() multiple measures without a lot of complicated code.

  3. Use pivot_wider() to convert the data back to one column for each x measurement.

...and the code is:

devtools::install_github("tidyverse/dplyr") # only needed if the installed dplyr is older than 1.0.0
library(dplyr)
library(tidyr)
library(lubridate)
data %>%
  mutate(across(starts_with("x"), ~ if_else(. > 70, TRUE, FALSE))) %>%
  select(-year, -month, -day) %>%
  group_by(date) %>%
  pivot_longer(starts_with("x"), names_to = "measure", values_to = "value") %>%
  mutate(year = year(date)) %>%
  group_by(year, measure) %>%
  select(-date) %>%
  summarise(value = sum(value, na.rm = TRUE)) %>%
  pivot_wider(id_cols = year, names_from = "measure", values_from = value)

...and the output, which matches the Base R solution that I originally posted:

`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 1 x 7
# Groups:   year [1]
   year    x1    x2    x3    x4    x5    xn
  <dbl> <int> <int> <int> <int> <int> <int>
1  1995     0     0     5     1     0     0

...and here's an edited version of the other answer that will also produce the same results as above. This solution implements pivot_longer() before creating the logical variable for exceeding the threshold, so it does not require the across() function. Also note that since this uses 120 as the threshold value and none of the data meets this threshold, the sums are all 0.

df_example %>%
  pivot_longer(x1:x5) %>%
  mutate(greater_120 = value > 120) %>%
  group_by(year, name) %>%
  summarise(sum_120 = sum(greater_120, na.rm = TRUE)) %>%
  pivot_wider(id_cols = year, names_from = "name", values_from = sum_120)

...and the output:

`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 1 x 6
# Groups:   year [1]
   year    x1    x2    x3    x4    x5
  <dbl> <int> <int> <int> <int> <int>
1  1995     0     0     0     0     0

Conclusions

As usual, there are many ways to accomplish a given task in R. Depending on one's preferences, the problem can be solved with Base R or the tidyverse. One of the quirks of the tidyverse is that some operations such as summarise() are much easier to perform on narrow format tidy data than on wide format data. Therefore, it's important to be proficient with tidyr::pivot_longer() and pivot_wider() when working in the tidyverse.

That said, with the production release of dplyr 1.0.0, the team at RStudio continues to add features that facilitate working with wide format data.

Apply Same Aggregation on Multiple Columns when Using Groupby (python)

You can generate dict:

d = {"payment_amount": "sum",
     **dict.fromkeys(["user_id", "category", "name"], "first")}

print (d)
{'payment_amount': 'sum', 'user_id': 'first', 'category': 'first', 'name': 'first'}

expected_output = example_df.groupby("user_id").agg(d)

A more general solution:

d = dict.fromkeys(example_df.columns, 'first')
d['payment_amount'] = 'sum'
print (d)
{'user_id': 'first', 'category': 'first', 'name': 'first', 'payment_amount': 'sum'}

expected_output = example_df.groupby("user_id").agg(d)
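A runnable sketch of the pattern above, using hypothetical data with the question's column names. Note that in recent pandas versions the grouping key cannot also appear in the aggregation dict (it is no longer a regular column after groupby), so the sketch drops it from d first:

```python
import pandas as pd

# Hypothetical data with the column names used in the question.
example_df = pd.DataFrame({
    'user_id': [1, 1, 2],
    'category': ['a', 'a', 'b'],
    'name': ['x', 'x', 'y'],
    'payment_amount': [10, 20, 5],
})

# 'first' for every column except the grouping key, 'sum' for the payments.
d = dict.fromkeys(example_df.columns.drop('user_id'), 'first')
d['payment_amount'] = 'sum'

# as_index=False keeps user_id as an ordinary column in the result.
expected_output = example_df.groupby('user_id', as_index=False).agg(d)
print(expected_output)
```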

