R: How to Sum Columns Grouped by a Factor

How to sum a variable by group

Using aggregate:

aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
  Category  x
1    First 30
2   Second  5
3    Third 34

In the example above, additional grouping variables can be added to the by list. Several metrics of the same data type can be aggregated at once by combining them with cbind:

aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...
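The truncated call above can be illustrated with a minimal sketch. It assumes a second numeric column, Metric2, whose values are invented here purely for illustration; naming the columns inside cbind is also a choice made here so that the output keeps readable headers.

```r
# Hypothetical data: Category and Frequency follow the example data on this
# page; the Metric2 values are made up for this sketch.
x <- data.frame(
  Category  = factor(c("First", "First", "First", "Second",
                       "Third", "Third", "Second")),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Metric2   = c(1, 2, 3, 4, 5, 6, 7)
)

# cbind() builds a matrix; each named column is summed within each Category.
res <- aggregate(cbind(Frequency = x$Frequency, Metric2 = x$Metric2),
                 by = list(Category = x$Category), FUN = sum)
print(res)
```

The result has one row per Category and one column per metric passed to cbind.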

As @thelatemail points out in a comment, aggregate has a formula interface too:

aggregate(Frequency ~ Category, x, sum)

Or, if you want to aggregate multiple columns, you can use the . notation, which stands for every column not otherwise named in the formula (it works for one column too):

aggregate(. ~ Category, x, sum)

or tapply:

tapply(x$Frequency, x$Category, FUN=sum)
 First Second  Third 
    30      5     34

All of the examples above use this data:

x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                      "Third", "Third", "Second")), 
                    Frequency=c(10,15,5,2,14,20,3))

R: how to sum columns grouped by a factor?

You can use dplyr for this:

library(dplyr)
df = data.frame(
  user = c("a", "a", "b", "b", "c"),
  v1   = c(1, 1, 1, 2, 1),
  v2   = c(0, 0, 0, 0, 1),
  v3   = c(0, 1, 0, 3, 1))

group_by(df, user) %>% 
summarize(v1_sum = sum(v1),
          v2_sum = sum(v2),
          v3_sum = sum(v3))

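If you're not familiar with the %>% notation, it works much like a pipe in bash: it takes the output of the expression on its left and passes it as the first argument to the function on its right. Here, the output of group_by() is fed into summarize(). The same thing can be accomplished without the pipe: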

by_user = group_by(df, user)
df_summarized = summarize(by_user, 
                          v1_sum = sum(v1),
                          v2_sum = sum(v2),
                          v3_sum = sum(v3))
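Writing one sum() per column gets repetitive as the number of columns grows. A minimal sketch of a shorter form, assuming dplyr 1.0.0 or later (where across() is available), might look like this; the .names pattern is chosen here only to reproduce the v1_sum-style column names used above.

```r
library(dplyr)

df <- data.frame(
  user = c("a", "a", "b", "b", "c"),
  v1   = c(1, 1, 1, 2, 1),
  v2   = c(0, 0, 0, 0, 1),
  v3   = c(0, 1, 0, 3, 1))

# across() applies sum() to every column from v1 through v3;
# .names appends "_sum" to each original column name.
out <- df %>%
  group_by(user) %>%
  summarise(across(v1:v3, sum, .names = "{.col}_sum"))
print(out)
```

This should give the same three summary columns as the explicit version, with one row per user.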

How to group by factor levels from two columns and output new column that shows sum of each level in R?

Instead of grouping by 'RawDate', group by 'ID' and 'YEAR', then take the sum of a logical vector (each TRUE counts as 1):

library(dplyr)
complete_df %>%
       group_by(ID, YEAR) %>%
       mutate(TotalWon = sum(Renewal == 'WON'), TotalLost = sum(Renewal == 'LOST'))

If you need summarised output (one row per group), use summarise instead of mutate.
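As a rough sketch of the summarise variant: the original complete_df is not shown, so the data frame below is a hypothetical stand-in with the same three column names. The .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical stand-in for complete_df; the values are invented.
complete_df <- data.frame(
  ID      = c(1, 1, 1, 2, 2),
  YEAR    = c(2019, 2019, 2020, 2019, 2019),
  Renewal = c("WON", "LOST", "WON", "LOST", "LOST")
)

# summarise() collapses each (ID, YEAR) group to a single row;
# summing a logical vector counts its TRUE values.
totals <- complete_df %>%
  group_by(ID, YEAR) %>%
  summarise(TotalWon  = sum(Renewal == "WON"),
            TotalLost = sum(Renewal == "LOST"),
            .groups = "drop")
print(totals)
```

With mutate the totals are repeated on every original row; with summarise only one row per ID and YEAR combination remains.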

Summing values in a column and grouping by another column in R

summarise_all is your friend here.

summarise_all(group_by(df, Dept), sum)
# # A tibble: 2 x 4
#    Dept   Mike Steve   Tom
#   <chr> <dbl> <dbl> <dbl>
# 1 Dept1     2     2     1
# 2 Dept2     0     3     2
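The df used above is not shown in the answer. The sketch below uses a hypothetical data frame, invented so that it yields the same totals, and compares summarise_all with the across() form that dplyr 1.0.0 and later recommend in its place.

```r
library(dplyr)

# Hypothetical data chosen to reproduce the totals printed above.
df <- data.frame(
  Dept  = c("Dept1", "Dept1", "Dept2", "Dept2"),
  Mike  = c(1, 1, 0, 0),
  Steve = c(1, 1, 1, 2),
  Tom   = c(0, 1, 1, 1)
)

# Scoped verb, as in the answer above.
old_style <- summarise_all(group_by(df, Dept), sum)

# across(everything(), ...) covers every non-grouping column.
new_style <- df %>%
  group_by(Dept) %>%
  summarise(across(everything(), sum))
print(new_style)
```

Both calls should return the same two-row tibble.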

R sum a variable by two groups

You can group_by ID and Year, then use sum within summarise:

library(dplyr)

txt <- "ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50"

df <- read.table(text = txt, header = TRUE)

df %>% 
  group_by(ID, Year) %>% 
  summarise(Total = sum(Amount, na.rm = TRUE))
#> # A tibble: 6 x 3
#> # Groups:   ID [?]
#>      ID  Year Total
#>   <int> <int> <int>
#> 1     3  2000   100
#> 2     3  2002    20
#> 3     3  2004    30
#> 4     4  2000    25
#> 5     4  2002    55
#> 6     4  2004    95

If you have more than one Amount column and want to apply more than one function, you can use either summarise_if or summarise_all:

df %>% 
  group_by(ID, Year) %>% 
  summarise_if(is.numeric, funs(sum, mean))
#> # A tibble: 6 x 4
#> # Groups:   ID [?]
#>      ID  Year   sum  mean
#>   <int> <int> <int> <dbl>
#> 1     3  2000   100  50  
#> 2     3  2002    20  10  
#> 3     3  2004    30  30  
#> 4     4  2000    25  25  
#> 5     4  2002    55  27.5
#> 6     4  2004    95  47.5

df %>% 
  group_by(ID, Year) %>% 
  summarise_all(funs(sum, mean, max, min))
#> # A tibble: 6 x 6
#> # Groups:   ID [?]
#>      ID  Year   sum  mean   max   min
#>   <int> <int> <int> <dbl> <dbl> <dbl>
#> 1     3  2000   100  50      55    45
#> 2     3  2002    20  10      10    10
#> 3     3  2004    30  30      30    30
#> 4     4  2000    25  25      25    25
#> 5     4  2002    55  27.5    40    15
#> 6     4  2004    95  47.5    50    45

Created on 2018-09-19 by the reprex package (v0.2.1.9000)
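Note that funs() has been deprecated since dplyr 0.8.0. A minimal sketch of the same idea for current versions, assuming dplyr 1.0.0 or later, passes a named list of functions to across(). The column names differ from the output above: across() prefixes each function name with the column name.

```r
library(dplyr)

txt <- "ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50"

df <- read.table(text = txt, header = TRUE)

# A named list of functions yields one output column per function,
# named Amount_sum and Amount_mean by default.
res <- df %>%
  group_by(ID, Year) %>%
  summarise(across(Amount, list(sum = sum, mean = mean)),
            .groups = "drop")
print(res)
```

To cover every numeric column rather than Amount alone, where(is.numeric) can be used as the column selection.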

R: Sum specific columns grouped by a particular column

1. Minimal reproducible example data:

df <- structure(list(Col1 = c(10L, 10L, 30L, 45L, 45L),
                     Col2 = c("A", "A", "B", "C", "C"), 
                     Col3 = c(5L, 6L, 2L, 5L, 2L),
                     Col4 = c(4L, 3L, 7L, 1L, 1L)),
                row.names = c(NA, -5L), class = "data.frame")

2. Solution using dplyr:

library(dplyr)

df %>%
  group_by(Col1, Col2) %>%
  summarise(Col3 = sum(Col3),
            Col4 = sum(Col4))

Returns:

   Col1 Col2   Col3  Col4
  <int> <chr> <int> <int>
1    10 A        11     7
2    30 B         2     7
3    45 C         7     2
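For comparison, the same result can be sketched in base R without dplyr, using the formula interface of aggregate with cbind on the left-hand side. This is offered as an alternative, not as part of the original answer.

```r
df <- structure(list(Col1 = c(10L, 10L, 30L, 45L, 45L),
                     Col2 = c("A", "A", "B", "C", "C"),
                     Col3 = c(5L, 6L, 2L, 5L, 2L),
                     Col4 = c(4L, 3L, 7L, 1L, 1L)),
                row.names = c(NA, -5L), class = "data.frame")

# Left of ~: the columns to sum; right of ~: the grouping columns.
base_res <- aggregate(cbind(Col3, Col4) ~ Col1 + Col2, data = df, FUN = sum)
print(base_res)
```

The output is a plain data frame with one row per Col1 and Col2 combination.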

How to sum total of elements in several columns of factor type that are not empty?

What am I doing wrong with dplyr's code block?

It's because there are NAs: sum returns NA whenever any input is NA, unless na.rm=TRUE is set. Try

library(dplyr)  

df2 = df %>% 
      select(Group, A_n, B_n) %>% 
      group_by(Group) %>% 
      summarise_all(sum, na.rm=TRUE)

instead.

Output on my machine:

# A tibble: 2 x 3
   Group   A_n   B_n
  <fctr> <dbl> <dbl>
1 Group1     2     1
2 Group2     1     1

I'm afraid my approach ... is too verbose and maybe overkill

You can just do this:

df <- data.frame(list(Group = c("Group1", "Group1", "Group2", "Group2"),
                      A=c("Some text", "Text here too", "Some other text", NA), 
                      B=c(NA, "Some random text", NA, "Random here too")))

library(dplyr)

df2 = df %>% 
    group_by(Group) %>% 
    summarise_all(.funs=function(x) length(na.omit(x)))

Output on my machine:

# A tibble: 2 x 3
   Group     A     B
  <fctr> <int> <int>
1 Group1     2     1
2 Group2     1     1

A little explanation

If you look at help(summarise_all), you'll see its arguments are .tbl, .funs, and ... (we won't worry about the ellipsis for now). We feed df into group_by() using the pipe %>%, then feed the result into summarise_all(), again using the pipe. That takes care of the .tbl argument. The .funs argument specifies which function(s) are applied to every non-grouping column in .tbl. Here we want to know how many elements of each column are not NA, which we can get (as one approach) by applying length(na.omit(x)) to each non-grouping column x in .tbl.
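The same count can be sketched with across(), assuming dplyr 1.0.0 or later. Here sum(!is.na(.x)) is swapped in for length(na.omit(x)); both count the non-missing entries of a column.

```r
library(dplyr)

df <- data.frame(Group = c("Group1", "Group1", "Group2", "Group2"),
                 A = c("Some text", "Text here too", "Some other text", NA),
                 B = c(NA, "Some random text", NA, "Random here too"))

# For each non-grouping column, count the entries that are not NA.
counts <- df %>%
  group_by(Group) %>%
  summarise(across(everything(), ~ sum(!is.na(.x))))
print(counts)
```

The grouping column is excluded from everything() automatically, so only A and B are counted.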

My best suggestion for a resource to learn about dplyr is Chapter 5 of R for Data Science, a book by Hadley Wickham, who wrote the dplyr package (among many others).
