Aggregate multiple columns at once
We can use the formula method of aggregate
. The variables on the 'rhs' of ~
are the grouping variables while the .
represents all other variables in the 'df1' (from the example, we assume that we need the mean
for all the columns except the grouping), specify the dataset and the function (mean
).
aggregate(.~id1+id2, df1, mean)
Or we can use summarise_each
from dplyr
after grouping (group_by
)
library(dplyr)
df1 %>%
group_by(id1, id2) %>%
summarise_each(funs(mean))
Or using summarise
with across
(dplyr
devel version - ‘0.8.99.9000’
)
df1 %>%
group_by(id1, id2) %>%
summarise(across(starts_with('val'), mean))
Or another option is data.table
. We convert the 'data.frame' to 'data.table' (setDT(df1)
, grouped by 'id1' and 'id2', we loop through the subset of data.table (.SD
) and get the mean
.
library(data.table)
setDT(df1)[, lapply(.SD, mean), by = .(id1, id2)]
data
df1 <- structure(list(id1 = c("a", "a", "a", "a", "b", "b",
"b", "b"
), id2 = c("x", "x", "y", "y", "x", "y", "x", "y"),
val1 = c(1L,
2L, 3L, 4L, 1L, 4L, 3L, 2L), val2 = c(9L, 4L, 5L, 9L, 7L, 4L,
9L, 8L)), .Names = c("id1", "id2", "val1", "val2"),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
aggregate multiple columns in a data frame at once calculating different statistics on different columns - R
We could use dplyr
for flexibility
library(dplyr)
df1 %>%
group_by(name) %>%
summarise(v1 = mean(v1, na.rm = TRUE),
v2 = sd(v2, na.rm = TRUE), v3 = max(v3, na.rm = TRUE),
v4 = sum(v4, na.rm = TRUE))
If there are multiple columns to be blocked for different functions, use across
df1 %>%
group_by(name) %>%
summarise(across(c(v1, v2), mean, na.rm = TRUE),
v3 = sd(v3, na.rm = TRUE),
across(c(v4, v5), sum, na.rm = TRUE))
Or use collap
from collapse
library(collapse)
collap(df1, ~ name, custom = list(fmean = c("v1", "v2"),
fsd = "v3", fsum = c("v4", "v5")))
How to aggregate multiple columns in a dataframe using values multiple columns
I always prefer using base packages and packages preinstalled with R. In terms of aggregation however I much prefer the ddply way because of its flexibility. You can aggregate with mean sum sd or whatever descriptive you choose.
column1<-c("S104259","S2914138","S999706","S1041120",rep("S1042529",6),rep('S1235729',4))
column2<-c("T6-R190116","T2-R190213","T8-R190118",rep("T8-R190118",3),rep('T2-R190118',3),rep('T6-R200118',4),'T1-R200118')
column3<-c(rep("3S_DMSO",7),rep("uns_DMSO",5),rep("3s_DMSO",2))
output_1<-c(664,292,1158,574,38,0,2850,18,74,8,10,0,664,30)
output_2<-c(364,34,0,74,8,0,850,8,7,8,310,0,64,380)
df<-data.frame(column1,column2,column3,output_1,output_2)
library(plyr)
factornames<-c("column1","column2","column3")
plyr::ddply(df,factornames,plyr::numcolwise(mean,na.rm=TRUE))
plyr::ddply(df,factornames,plyr::numcolwise(sum,na.rm=TRUE))
plyr::ddply(df,factornames,plyr::numcolwise(sd,na.rm=TRUE))
Aggregate / summarize multiple variables per group (e.g. sum, mean)
Where is this year()
function from?
You could also use the reshape2
package for this task:
require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
# year month x1 x2
1 2000 1 -80.83405 -224.9540159
2 2000 2 -223.76331 -288.2418017
3 2000 3 -188.83930 -481.5601913
4 2000 4 -197.47797 -473.7137420
5 2000 5 -259.07928 -372.4563522
Using aggregate() function in R for multiple columns
An alternative way using only aggregate()
and/or colMeans()
, showing how to select many columns without having to name them:
# Sum
aggregate(dat[4:ncol(dat)], by = list(Year = dat$Year), FUN = sum, na.rm = T, na.action = NULL)
# Mean by year
aggregate(dat[4:ncol(dat)], by = list(Year = dat$Year), FUN = mean, na.rm = T, na.action = NULL)
# Mean without grouping per year
colMeans(dat[4:ncol(dat)])
The output:
> aggregate(dat[4:ncol(dat)], by = list(Year = dat$Year), FUN = sum, na.rm = T, na.action = NULL)
Year Rainfall test
1 1979 833.1 833.1
2 1980 1492.6 1492.6
> aggregate(dat[4:ncol(dat)], by = list(Year = dat$Year), FUN = mean, na.rm = T, na.action = NULL)
Year Rainfall test
1 1979 14.12034 14.12034
2 1980 24.87667 24.87667
> colMeans(dat[4:ncol(dat)])
Rainfall test
19.5437 19.5437
EDIT: in case they are some non-numeric columns in the mix that should stay as non-numeric, you can replace dat[4:ncol(dat)]
by Filter(is.numeric, dat[4:ncol(dat)]
, see an example below:
> dat$test <- as.character(dat$test)
> aggregate(Filter(is.numeric, dat[4:ncol(dat)]), by = list(Year = dat$Year), FUN = sum, na.rm = T, na.action = NULL)
Year Rainfall
1 1979 833.1
2 1980 1492.6
Python pandas groupby aggregate on multiple columns, then pivot
Edited for Pandas 0.22+ considering the deprecation of the use of dictionaries in a group by aggregation.
We set up a very similar dictionary where we use the keys of the dictionary to specify our functions and the dictionary itself to rename the columns.
rnm_cols = dict(size='Size', sum='Sum', mean='Mean', std='Std')
df.set_index(['Category', 'Item']).stack().groupby('Category') \
.agg(rnm_cols.keys()).rename(columns=rnm_cols)
Size Sum Mean Std
Category
Books 3 58 19.333333 2.081666
Clothes 3 148 49.333333 4.041452
Technology 6 1800 300.000000 70.710678
option 1
use agg
← link to docs
agg_funcs = dict(Size='size', Sum='sum', Mean='mean', Std='std')
df.set_index(['Category', 'Item']).stack().groupby(level=0).agg(agg_funcs)
Std Sum Mean Size
Category
Books 2.081666 58 19.333333 3
Clothes 4.041452 148 49.333333 3
Technology 70.710678 1800 300.000000 6
option 2
more for less
use describe
← link to docs
df.set_index(['Category', 'Item']).stack().groupby(level=0).describe().unstack()
count mean std min 25% 50% 75% max
Category
Books 3.0 19.333333 2.081666 17.0 18.5 20.0 20.5 21.0
Clothes 3.0 49.333333 4.041452 45.0 47.5 50.0 51.5 53.0
Technology 6.0 300.000000 70.710678 200.0 262.5 300.0 337.5 400.0
Aggregating rows for multiple columns in R
We can use the formula method of aggregate
. By specifying .
on the LHS
of ~
, we select all the columns except the 'Id' column.
aggregate(.~Id, df, sum)
# Id A B C total
#1 3 11 4 7 22
#2 4 9 7 8 24
Or we can also specify the columns without using the formula method
aggregate(df[2:ncol(df)],df['Id'], FUN=sum)
# Id A B C total
#1 3 11 4 7 22
#2 4 9 7 8 24
Other options include dplyr
and data.table
.
Using dplyr
, we group by 'Id' and get the sum
of all columns with summarise_each
.
library(dplyr)
df %>%
group_by(Id) %>%
summarise_each(funs(sum))
Or with data.table
, we convert the 'data.frame' to 'data.table' (setDT(df)
), grouped by 'Id', we loop (lapply(..
) through the Subset of Data.table (.SD
) and get the sum
.
library(data.table)
setDT(df)[, lapply(.SD, sum), by = Id]
R Aggregate over multiple columns
Here is an answer that uses base R, and since none of the data in the example data is above 120, we set a criterion of above 70.
data <- structure(
list(
date = structure(c(9131, 9132, 9133, 9134, 9135,
9136), class = "Date"),
x1 = c(50.75, 62.625, 57.25, 56.571,
36.75, 39.125),
x2 = c(62.25, 58.714, 49.875, 56.375, 43.25,
41.625),
x3 = c(90.25, NA, 70.125, 75.75, 83.286, 98.5),
x4 = c(60, 72, 68.375, 65.5, 63.25, 55.875),
x5 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
xn = c(53.25,
61.143, 56.571, 58.571, 36.25, 44.375),
year = c(1995, 1995, 1995, 1995,
1995, 1995),
month = c(1, 1, 1, 1, 1, 1),
day = c(1, 2, 3,
4, 5, 6)
),
row.names = c(NA,-6L),
class = c("tbl_df", "tbl",
"data.frame"
))
First, we create a subset of the data that contains all columns containing x
, and set them to TRUE or FALSE based on whether the value is greater than 70.
theCols <- data[,colnames(data)[grepl("x",colnames(data))]]
Second, we cbind()
the year onto the matrix of logical values.
x_logical <- cbind(year = data$year,as.data.frame(apply(theCols,2,function(x) x > 70)))
Finally, we use aggregate across all columns other than year
and sum the columns.
aggregate(x_logical[2:ncol(x_logical)],by = list(x_logical$year),sum,na.rm=TRUE)
...and the output:
Group.1 x1 x2 x3 x4 x5 xn
1 1995 0 0 5 1 0 0
>
Note that by using colnames()
to extract the columns that start with x
and nrow()
in the aggregate()
function, we make this a general solution that will handle a varying number of x
locations.
Two tidyverse solutions
A tidyverse solution to the same problem is as follows. It includes the following steps.
Use
mutate()
withacross()
to create the TRUE / FALSE versions of thex
variables. Note thatacross()
requires dplyr 1.0.0, which is currently in development but due for production release the week of May 25th.Use
pivot_longer()
to allow us tosummarise()
multiple measures without a lot of complicated code.Use
pivot_wider()
to convert the data back to one column for eachx
measurement.
...and the code is:
devtools::install_github("tidyverse/dplyr") # needed for across()
library(dplyr)
library(tidyr)
library(lubridate)
data %>%
mutate(.,across(starts_with("x"),~if_else(. > 70,TRUE,FALSE))) %>%
select(-year,-month,-day) %>% group_by(date) %>%
pivot_longer(starts_with("x"),names_to = "measure",values_to = "value") %>%
mutate(year = year(date)) %>% group_by(year,measure) %>%
select(-date) %>%
summarise(value = sum(value,na.rm=TRUE)) %>%
pivot_wider(id_cols = year,names_from = "measure",
values_from = value)
...and the output, which matches the Base R solution that I originally posted:
`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 1 x 7
# Groups: year [1]
year x1 x2 x3 x4 x5 xn
<dbl> <int> <int> <int> <int> <int> <int>
1 1995 0 0 5 1 0 0
>
...and here's an edited version of the other answer that will also produce the same results as above. This solution implements pivot_longer()
before creating the logical variable for exceeding the threshold, so it does not require the across()
function. Also note that since this uses 120 as the threshold value and none of the data meets this threshold, the sums are all 0.
df_example %>%
pivot_longer(x1:x5) %>%
mutate(greater_120 = value > 120) %>%
group_by(year,name) %>%
summarise(sum_120 = sum(greater_120,na.rm = TRUE)) %>%
pivot_wider(id_cols = year,names_from = "name", values_from = sum_120)
...and the output:
`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 1 x 6
# Groups: year [1]
year x1 x2 x3 x4 x5
<dbl> <int> <int> <int> <int> <int>
1 1995 0 0 0 0 0
>
Conclusions
As usual, there are many ways to accomplish a given task in R. Depending on one's preferences, the problem can be solved with Base R or the tidyverse. One of the quirks of the tidyverse is that some operations such as summarise()
are much easier to perform on narrow format tidy data than on wide format data. Therefore, it's important to be proficient with tidyr::pivot_longer()
and pivot_wider()
when working in the tidyverse.
That said, with the production release of dplyr 1.0.0, the team at RStudio continues to add features that facilitate working with wide format data.
Apply Same Aggregation on Multiple Columns when Using Groupby (python)
You can generate dict
:
d = {**{"payment_amount": 'sum'},
**dict.fromkeys(["user_id" , "category" , "name"], 'first')}
print (d)
{'payment_amount': 'sum', 'user_id': 'first', 'category': 'first', 'name': 'first'}
expected_output = example_df.groupby("user_id").agg(d)
More general solution should be:
d = dict.fromkeys(example_df.columns, 'first')
d['payment_amount'] = 'sum'
print (d)
{'user_id': 'first', 'category': 'first', 'name': 'first', 'payment_amount': 'sum'}
expected_output = example_df.groupby("user_id").agg(d)
Related Topics
R on Macos Error: Vector Memory Exhausted (Limit Reached)
Read All Files in Directory and Apply Multiple Functions to Each Data Frame
Offline Install of R Package and Dependencies
Create Sequence of Repeated Values, in Sequence
Conditional Merge/Replacement in R
Error in ≪My Code≫: Target of Assignment Expands to Non-Language Object
A Similar Function to R'S Rep in Matlab
Unordered Combinations of All Lengths
How to Remove Outliers from a Dataset
Why Is Rbindlist "Better" Than Rbind
How to Quickly Form Groups (Quartiles, Deciles, etc) by Ordering Column(S) in a Data Frame
Splitting a Continuous Variable into Equal Sized Groups
Geom_Rect and Alpha - Does This Work With Hard Coded Values
Subset Rows in a Data Frame Based on a Vector of Values