Aggregating Rows for Multiple Columns in R

Aggregate multiple columns at once

We can use the formula method of aggregate. The variables on the right-hand side of ~ are the grouping variables, while the . on the left-hand side represents all other variables in 'df1' (from the example, we assume that we need the mean of every column except the grouping columns). We then specify the dataset and the function (mean).

aggregate(.~id1+id2, df1, mean)
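
If only some of the columns should be averaged, one option is to name them explicitly with cbind() on the left-hand side instead of using the dot. This is an illustrative sketch rather than part of the original answer; the data frame toy mirrors the layout of df1, and the names toy, by_dot and by_cbind are made up for the example.

```r
# Toy data with the same layout as df1: two grouping columns, two value columns
toy <- data.frame(
  id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
  id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
  val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
)

# The dot expands to every column that is not used on the right-hand side
by_dot <- aggregate(. ~ id1 + id2, data = toy, FUN = mean)

# cbind() names the value columns explicitly
by_cbind <- aggregate(cbind(val1, val2) ~ id1 + id2, data = toy, FUN = mean)

# Each result has one row per (id1, id2) combination
print(by_dot)
print(by_cbind)
```

The explicit form is mainly useful when the data frame also contains columns that should not be aggregated at all.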

Or we can use summarise_each from dplyr after grouping (group_by). Note that summarise_each and funs are deprecated in recent dplyr releases in favour of across.

library(dplyr)
df1 %>%
group_by(id1, id2) %>%
summarise_each(funs(mean))

Or using summarise with across (available from dplyr 1.0.0; the devel version at the time of writing was ‘0.8.99.9000’)

df1 %>% 
group_by(id1, id2) %>%
summarise(across(starts_with('val'), mean))

Another option is data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)), group by 'id1' and 'id2', loop through the subset of the data.table (.SD) with lapply, and get the mean of each column.

library(data.table)
setDT(df1)[, lapply(.SD, mean), by = .(id1, id2)]
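
When the table also holds columns that should not be averaged (a free-text column, say), .SDcols can restrict .SD to the columns of interest. The following is a minimal sketch under that assumption; the table dt, the extra column note and the vector value_cols are hypothetical and not part of the original example.

```r
library(data.table)

# Hypothetical table: two grouping columns, two value columns, one text column
dt <- data.table(
  id1  = c("a", "a", "b", "b"),
  id2  = c("x", "x", "y", "y"),
  val1 = c(1, 2, 3, 4),
  val2 = c(9, 4, 5, 9),
  note = c("p", "q", "r", "s")
)

# Only the columns listed in .SDcols are passed to lapply() through .SD
value_cols <- c("val1", "val2")
means <- dt[, lapply(.SD, mean), by = .(id1, id2), .SDcols = value_cols]
print(means)
```

Without .SDcols, mean() would also be applied to the text column, which produces NA and a warning.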

data

df1 <- structure(list(id1 = c("a", "a", "a", "a", "b", "b", 
"b", "b"
), id2 = c("x", "x", "y", "y", "x", "y", "x", "y"),
val1 = c(1L,
2L, 3L, 4L, 1L, 4L, 3L, 2L), val2 = c(9L, 4L, 5L, 9L, 7L, 4L,
9L, 8L)), .Names = c("id1", "id2", "val1", "val2"),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))

Aggregating rows for multiple columns in R

We can use the formula method of aggregate. By specifying . on the LHS of ~, we select all the columns except the 'Id' column.

aggregate(.~Id, df, sum)
# Id A B C total
#1 3 11 4 7 22
#2 4 9 7 8 24

Or we can also specify the columns without using the formula method

aggregate(df[2:ncol(df)],df['Id'], FUN=sum)
# Id A B C total
#1 3 11 4 7 22
#2 4 9 7 8 24

Other options include dplyr and data.table.

Using dplyr, we group by 'Id' and get the sum of all columns with summarise_each.

library(dplyr)
df %>%
group_by(Id) %>%
summarise_each(funs(sum))

Or with data.table, we convert the 'data.frame' to a 'data.table' (setDT(df)), group by 'Id', loop (lapply(..)) through the Subset of Data.table (.SD), and get the sum.

library(data.table)
setDT(df)[, lapply(.SD, sum), by = Id]

R Aggregate over multiple columns

Here is an answer that uses base R. Since none of the values in the example data is above 120, we use a threshold of 70 instead.

data <- structure(
list(
date = structure(c(9131, 9132, 9133, 9134, 9135,
9136), class = "Date"),
x1 = c(50.75, 62.625, 57.25, 56.571,
36.75, 39.125),
x2 = c(62.25, 58.714, 49.875, 56.375, 43.25,
41.625),
x3 = c(90.25, NA, 70.125, 75.75, 83.286, 98.5),
x4 = c(60, 72, 68.375, 65.5, 63.25, 55.875),
x5 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
xn = c(53.25,
61.143, 56.571, 58.571, 36.25, 44.375),
year = c(1995, 1995, 1995, 1995,
1995, 1995),
month = c(1, 1, 1, 1, 1, 1),
day = c(1, 2, 3,
4, 5, 6)
),
row.names = c(NA,-6L),
class = c("tbl_df", "tbl",
"data.frame"
))

First, we create a subset of the data that contains all columns whose names contain x.

theCols <- data[,colnames(data)[grepl("x",colnames(data))]]

Second, we convert those columns to TRUE or FALSE based on whether each value is greater than 70, and cbind() the year onto the resulting logical values.

x_logical <- cbind(year = data$year,as.data.frame(apply(theCols,2,function(x) x > 70)))

Finally, we use aggregate across all columns other than year and sum the columns.

aggregate(x_logical[2:ncol(x_logical)],by = list(x_logical$year),sum,na.rm=TRUE)

...and the output:

  Group.1 x1 x2 x3 x4 x5 xn
1 1995 0 0 5 1 0 0

Note that by using colnames() with grepl() to extract the columns whose names contain x, and ncol() in the aggregate() call, we make this a general solution that will handle a varying number of x locations.
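
The three base R steps can also be wrapped in a small function so that the threshold becomes a parameter. This is only a sketch: the function name count_above, its arguments, and the toy data are invented for illustration, and the pattern "^x" assumes the measurement columns start with x.

```r
# Hypothetical helper: for each year, count how many values in each
# "x" column exceed a threshold (missing values are ignored)
count_above <- function(data, threshold, pattern = "^x") {
  x_cols <- grep(pattern, names(data), value = TRUE)
  # TRUE / FALSE (or NA) for every value in the selected columns
  flags <- as.data.frame(lapply(data[x_cols], function(col) col > threshold))
  # Summing logical values counts the TRUEs within each year
  aggregate(flags, by = list(year = data$year), FUN = sum, na.rm = TRUE)
}

toy <- data.frame(
  year = c(1995, 1995, 1996),
  x1   = c(50, 80, 90),
  x2   = c(NA, 75, 60)
)

result <- count_above(toy, threshold = 70)
print(result)
```

Calling count_above(toy, threshold = 120) on the same data would reproduce the original 120 criterion without editing the body of the function.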

Two tidyverse solutions

A tidyverse solution to the same problem is as follows. It includes the following steps.

  1. Use mutate() with across() to create the TRUE / FALSE versions of the x variables. Note that across() requires dplyr 1.0.0 or later, which was still in development when this answer was first written.

  2. Use pivot_longer() to allow us to summarise() multiple measures without a lot of complicated code.

  3. Use pivot_wider() to convert the data back to one column for each x measurement.

...and the code is:

# across() needs dplyr 1.0.0 or later; if your version is older, update with install.packages("dplyr")
library(dplyr)
library(tidyr)
library(lubridate)
data %>%
  mutate(., across(starts_with("x"), ~if_else(. > 70, TRUE, FALSE))) %>%
  select(-year, -month, -day) %>%
  group_by(date) %>%
  pivot_longer(starts_with("x"), names_to = "measure", values_to = "value") %>%
  mutate(year = year(date)) %>%
  group_by(year, measure) %>%
  select(-date) %>%
  summarise(value = sum(value, na.rm = TRUE)) %>%
  pivot_wider(id_cols = year, names_from = "measure",
              values_from = value)

...and the output, which matches the Base R solution that I originally posted:

`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 1 x 7
# Groups: year [1]
year x1 x2 x3 x4 x5 xn
<dbl> <int> <int> <int> <int> <int> <int>
1 1995 0 0 5 1 0 0

...and here's an edited version of the other answer that will also produce the same results as above. This solution implements pivot_longer() before creating the logical variable for exceeding the threshold, so it does not require the across() function. Also note that since this uses 120 as the threshold value and none of the data meets this threshold, the sums are all 0.

df_example %>% 
pivot_longer(x1:x5) %>%
mutate(greater_120 = value > 120) %>%
group_by(year,name) %>%
summarise(sum_120 = sum(greater_120,na.rm = TRUE)) %>%
pivot_wider(id_cols = year,names_from = "name", values_from = sum_120)

...and the output:

`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 1 x 6
# Groups: year [1]
year x1 x2 x3 x4 x5
<dbl> <int> <int> <int> <int> <int>
1 1995 0 0 0 0 0

Conclusions

As usual, there are many ways to accomplish a given task in R. Depending on one's preferences, the problem can be solved with Base R or the tidyverse. One of the quirks of the tidyverse is that some operations such as summarise() are much easier to perform on narrow format tidy data than on wide format data. Therefore, it's important to be proficient with tidyr::pivot_longer() and pivot_wider() when working in the tidyverse.

That said, with the production release of dplyr 1.0.0, the team at RStudio continues to add features that facilitate working with wide format data.
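
As a rough illustration of that last point, across() can count the values above a threshold directly on wide data, with no pivoting step. This sketch assumes dplyr 1.0.0 or later; the data frame toy and the result name wide_counts are hypothetical.

```r
library(dplyr)

# Hypothetical wide-format data: one row per day, one column per x location
toy <- data.frame(
  year = c(1995, 1995, 1996),
  x1   = c(50, 80, 90),
  x2   = c(NA, 75, 60)
)

# For every column starting with "x", count the values above 70 within each year
wide_counts <- toy %>%
  group_by(year) %>%
  summarise(across(starts_with("x"), ~ sum(.x > 70, na.rm = TRUE)))

print(wide_counts)
```

The comparison and the sum happen inside a single lambda, so the intermediate TRUE / FALSE columns never need to be stored.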

How to aggregate multiple columns in a dataframe using values in multiple columns

I usually prefer base packages and the packages that come preinstalled with R. For aggregation, however, I much prefer ddply from the plyr package because of its flexibility: you can aggregate with mean, sum, sd, or whatever descriptive statistic you choose.

column1<-c("S104259","S2914138","S999706","S1041120",rep("S1042529",6),rep('S1235729',4))
column2<-c("T6-R190116","T2-R190213","T8-R190118",rep("T8-R190118",3),rep('T2-R190118',3),rep('T6-R200118',4),'T1-R200118')
column3<-c(rep("3S_DMSO",7),rep("uns_DMSO",5),rep("3s_DMSO",2))
output_1<-c(664,292,1158,574,38,0,2850,18,74,8,10,0,664,30)
output_2<-c(364,34,0,74,8,0,850,8,7,8,310,0,64,380)
df<-data.frame(column1,column2,column3,output_1,output_2)

library(plyr)
factornames<-c("column1","column2","column3")
plyr::ddply(df,factornames,plyr::numcolwise(mean,na.rm=TRUE))
plyr::ddply(df,factornames,plyr::numcolwise(sum,na.rm=TRUE))
plyr::ddply(df,factornames,plyr::numcolwise(sd,na.rm=TRUE))

R data.table to aggregate by multiple columns and retaining all columns

The easiest way is to copy the data.table, since you want to return all the columns in a new data.table anyway, and then append the columns x_agg and y_agg by reference.

library(data.table)
dt <- data.frame(x=rnorm(40), y=rnorm(20), z= rnorm(10), year=rep(2019:2020,times=2, each=10), month=rep(1:4, 10), day=rep(1:4,10))

setDT(dt)

dt2<- copy(dt)
names <- c("x","y")

dt2[, paste0(names, "_agg"):= lapply(.SD, sum),
.SDcols=names, by = .(year, month, day)][]
            x           y          z year month day      x_agg        y_agg
1: 0.52378890 0.19143318 -0.3387854 2019 1 1 -0.1709390 -2.967623395
2: -0.35158261 1.62461341 -0.9818403 2019 2 2 -3.6556367 5.940791892
3: 1.29391093 -0.73192766 -2.5227705 2019 3 3 2.1449165 -0.009080778
4: 1.15131966 -0.96903745 -0.5124389 2019 4 4 2.7530336 -1.763717065
5: -0.97305571 -1.16620834 0.8567205 2019 1 1 -0.1709390 -2.967623395
6: -1.73289458 1.74064829 -0.7019242 2019 2 2 -3.6556367 5.940791892
7: 0.14822163 0.72738728 -1.4267469 2019 3 3 2.1449165 -0.009080778
8: -0.17853639 0.08717892 2.0463365 2019 4 4 2.7530336 -1.763717065
9: 0.43857404 -0.50903654 -0.6887948 2019 1 1 -0.1709390 -2.967623395
10: 0.56904083 -0.39486575 -0.1134194 2019 2 2 -3.6556367 5.940791892
11: 0.54823107 -0.28118769 -0.3387854 2020 3 3 1.3975639 -5.470426871
12: 1.12885306 -0.80344406 -0.9818403 2020 4 4 2.5982909 -3.062138945
13: 0.98747699 0.72247033 -2.5227705 2020 1 1 2.4807741 0.134137894
14: -2.60859806 -1.37195721 -0.5124389 2020 2 2 -0.8401949 -2.285724235
15: -0.44170249 -1.47594529 0.8567205 2020 3 3 1.3975639 -5.470426871
16: 0.02994275 0.01272509 -0.7019242 2020 4 4 2.5982909 -3.062138945
17: -0.11760158 -0.65540139 -1.4267469 2020 1 1 2.4807741 0.134137894
18: 0.87222687 0.22909510 2.0463365 2020 2 2 -0.8401949 -2.285724235
19: 0.33379209 -0.97808045 -0.6887948 2020 3 3 1.3975639 -5.470426871
20: -0.70379104 -0.74035050 -0.1134194 2020 4 4 2.5982909 -3.062138945
21: 0.22151323 0.19143318 -0.3387854 2019 1 1 -0.1709390 -2.967623395
22: -0.91018028 1.62461341 -0.9818403 2019 2 2 -3.6556367 5.940791892
23: -0.05931458 -0.73192766 -2.5227705 2019 3 3 2.1449165 -0.009080778
24: 0.51606540 -0.96903745 -0.5124389 2019 4 4 2.7530336 -1.763717065
25: -0.81728153 -1.16620834 0.8567205 2019 1 1 -0.1709390 -2.967623395
26: -1.43174995 1.74064829 -0.7019242 2019 2 2 -3.6556367 5.940791892
27: 0.76209854 0.72738728 -1.4267469 2019 3 3 2.1449165 -0.009080778
28: 1.26418496 0.08717892 2.0463365 2019 4 4 2.7530336 -1.763717065
29: 0.43552206 -0.50903654 -0.6887948 2019 1 1 -0.1709390 -2.967623395
30: 0.20172988 -0.39486575 -0.1134194 2019 2 2 -3.6556367 5.940791892
31: 0.21270847 -0.28118769 -0.3387854 2020 3 3 1.3975639 -5.470426871
32: 1.21382327 -0.80344406 -0.9818403 2020 4 4 2.5982909 -3.062138945
33: 0.41322214 0.72247033 -2.5227705 2020 1 1 2.4807741 0.134137894
34: 0.09986465 -1.37195721 -0.5124389 2020 2 2 -0.8401949 -2.285724235
35: -0.09185291 -1.47594529 0.8567205 2020 3 3 1.3975639 -5.470426871
36: 0.13209497 0.01272509 -0.7019242 2020 4 4 2.5982909 -3.062138945
37: 1.19767652 -0.65540139 -1.4267469 2020 1 1 2.4807741 0.134137894
38: 0.79631162 0.22909510 2.0463365 2020 2 2 -0.8401949 -2.285724235
39: 0.83638763 -0.97808045 -0.6887948 2020 3 3 1.3975639 -5.470426871
40: 0.79736792 -0.74035050 -0.1134194 2020 4 4 2.5982909 -3.062138945
x y z year month day x_agg y_agg
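
For comparison, base R's ave() also returns a group-level value repeated for every row of the group, so all original columns are kept. This is a sketch of an alternative technique, not the data.table approach from the answer; the data frame toy is made up and uses fixed numbers instead of rnorm() so the result is easy to check by hand.

```r
# Hypothetical data: two value columns and two grouping columns
toy <- data.frame(
  x     = c(1, 2, 3, 4),
  y     = c(10, 20, 30, 40),
  year  = c(2019, 2019, 2020, 2020),
  month = c(1, 1, 1, 2)
)

# ave() applies FUN within each (year, month) group and returns a vector
# of the same length as the input, so it can be stored as a new column
for (col in c("x", "y")) {
  toy[[paste0(col, "_agg")]] <- ave(toy[[col]], toy$year, toy$month, FUN = sum)
}

print(toy)
```

FUN has to be passed by name in ave(), because every unnamed argument after the first is treated as a grouping variable.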

Aggregate by multiple columns, sum one column and keep other columns? Create new column based on aggregated values?

In data.table:

library(data.table)

setDT(df)[, .(Amount = sum(Amount, na.rm = TRUE),
UniqueStores = uniqueN(Store, na.rm = TRUE)),
by = .(ProductID, Day, Product)
]

Output:

   ProductID       Day Product Amount UniqueStores
1: 1 Monday Food 10 1
2: 1 Tuesday Food 10 2
3: 2 Wednesday Toys 15 2
4: 2 Friday Toys 7 1

Aggregating data from multiple columns instead of a single column

With aggregate, it works if we remove the second column (Transcript_ID), which is a character column and cannot be summed:

aggregate(. ~ Gene, df[-2], FUN=sum)

Output:

                Gene V1 V2 V3 V4 V5
1 ENSG00000000003.14 4 9 5 3 22

OR

We could use summarise with across from the dplyr package.
Credit to Chris Ruehlemann, whose answer was posted 3 minutes earlier.

library(dplyr)
df %>%
group_by(Gene) %>%
summarise(across(starts_with('V'), sum))

Output:

 Gene                  V1    V2    V3    V4    V5
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG00000000003.14 4 9 5 3 22

data:

df <- structure(list(Gene = c("ENSG00000000003.14", "ENSG00000000003.14", 
"ENSG00000000003.14", "ENSG00000000003.14"), Transcript_ID = c("ENST00000612152.4",
"ENST00000373020.8", "ENST00000614008.4", "ENST00000496771.5"
), V1 = c(0, 4, 0, 0), V2 = c(6, 0, 0, 3), V3 = c(0, 5, 0, 0),
V4 = c(3, 0, 0, 0), V5 = c(15, 0, 0, 7)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L), spec = structure(list(
cols = list(Gene = structure(list(), class = c("collector_character",
"collector")), Transcript_ID = structure(list(), class = c("collector_character",
"collector")), V1 = structure(list(), class = c("collector_double",
"collector")), V2 = structure(list(), class = c("collector_double",
"collector")), V3 = structure(list(), class = c("collector_double",
"collector")), V4 = structure(list(), class = c("collector_double",
"collector")), V5 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))

How can I aggregate multiple columns in a data.frame with a custom function in R?

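You can do this with dplyr. Because sort() drops NA values, sort(.)[1] returns the smallest non-missing value in each group, or NA if the group has none. (funs() is deprecated in recent dplyr releases; the lambda ~ sort(.)[1] is the current equivalent.)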

library(dplyr)
df %>%
group_by(Name) %>%
summarize_all(funs(sort(.)[1]))

Result:

# A tibble: 3 x 4
Name Height Weight Age
<fctr> <int> <int> <int>
1 Alice 180 70 35
2 Bob NA 80 27
3 Charles 170 75 NA

Data:

df = read.table(text = "Name     Height     Weight   Age
Alice 180 NA 35
Bob NA 80 27
Alice NA 70 NA
Charles 170 75 NA", header = TRUE)
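
A base R version of the same idea is sketched below with the formula method of aggregate. The one non-obvious detail is na.action = na.pass: with the default na.action = na.omit, the formula method discards every row that contains an NA before grouping, and here every row has at least one. The names people and smallest_non_missing are hypothetical.

```r
people <- read.table(text = "Name Height Weight Age
Alice 180 NA 35
Bob NA 80 27
Alice NA 70 NA
Charles 170 75 NA", header = TRUE)

# sort() drops NA, so this returns the smallest non-missing value,
# or NA when the group has no non-missing value at all
smallest_non_missing <- function(v) sort(v)[1]

# na.pass keeps rows with missing values so the custom function can handle them
result <- aggregate(. ~ Name, data = people, FUN = smallest_non_missing,
                    na.action = na.pass)
print(result)
```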

Aggregate / summarize multiple variables per group (e.g. sum, mean)

Where is this year() function from?

You could also use the reshape2 package for this task:

require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
#   year month         x1           x2
# 1 2000     1  -80.83405 -224.9540159
# 2 2000     2 -223.76331 -288.2418017
# 3 2000     3 -188.83930 -481.5601913
# 4 2000     4 -197.47797 -473.7137420
# 5 2000     5 -259.07928 -372.4563522

