R: How to Sum Columns Grouped by a Factor

How to sum a variable by group

Using aggregate:

aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
  Category  x
1    First 30
2   Second  5
3    Third 34

In the example above, the list passed to by can hold more than one grouping variable. Several metrics of the same data type can be aggregated at once by combining them with cbind:

aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...
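A minimal sketch of the cbind approach, assuming a second numeric column named Metric2 sits alongside Frequency. The data frame x2 and its values are invented here for illustration; naming the cbind columns is a choice made for this sketch so the result keeps readable column names:

```r
# Hypothetical data: x2 and its values are made up for this sketch.
x2 <- data.frame(
  Category  = c("First", "First", "Second"),
  Frequency = c(10, 15, 2),
  Metric2   = c(1, 2, 3)
)

# Named cbind() columns carry through to the aggregated result.
res <- aggregate(
  cbind(Frequency = x2$Frequency, Metric2 = x2$Metric2),
  by  = list(Category = x2$Category),
  FUN = sum
)
print(res)
```

Each metric is summed separately within every Category, so the result has one row per group and one column per metric.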

As @thelatemail pointed out in a comment, aggregate has a formula interface too:

aggregate(Frequency ~ Category, x, sum)

Or, to aggregate multiple columns, use the . notation, which stands for every column not otherwise named in the formula (it works for one column too):

aggregate(. ~ Category, x, sum)

or tapply:

tapply(x$Frequency, x$Category, FUN=sum)
 First Second  Third
    30      5     34

Using this data:

x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                  "Third", "Third", "Second")),
                Frequency=c(10,15,5,2,14,20,3))
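As a quick cross-check, the sketch below runs the three base R approaches from this answer on the same data and compares the group totals. The variable names by_list, by_formula, and by_tapply are introduced here only for illustration:

```r
x <- data.frame(
  Category  = factor(c("First", "First", "First", "Second",
                       "Third", "Third", "Second")),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# 1. aggregate() with a by-list: the result column is named "x".
by_list <- aggregate(x$Frequency, by = list(Category = x$Category), FUN = sum)

# 2. aggregate() with a formula: the result column keeps the name "Frequency".
by_formula <- aggregate(Frequency ~ Category, data = x, FUN = sum)

# 3. tapply(): returns a named array rather than a data frame.
by_tapply <- tapply(x$Frequency, x$Category, FUN = sum)

# All three should agree on the per-group totals.
print(all(by_list$x == by_formula$Frequency))
print(all(by_list$x == as.vector(by_tapply)))
```

The main practical difference is the shape of the result: aggregate returns a data frame, while tapply returns a named array that is convenient for lookups such as by_tapply[["Third"]].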

R: how to sum columns grouped by a factor?

You can use dplyr for this:

library(dplyr)
df = data.frame(
  user = c("a", "a", "b", "b", "c"),
  v1 = c(1, 1, 1, 2, 1),
  v2 = c(0, 0, 0, 0, 1),
  v3 = c(0, 1, 0, 3, 1))

group_by(df, user) %>%
  summarize(v1_sum = sum(v1),
            v2_sum = sum(v2),
            v3_sum = sum(v3))

If you're not familiar with the %>% notation, it works much like a pipe in bash: it takes the output of group_by() and passes it to summarize() as the first argument. The same result can be obtained without the pipe:

by_user = group_by(df, user)
df_summarized = summarize(by_user,
                          v1_sum = sum(v1),
                          v2_sum = sum(v2),
                          v3_sum = sum(v3))
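To make the equivalence concrete, here is a small sketch that computes the same summary with the pipe and with a nested call, then compares the two. It assumes dplyr is installed; the names piped and nested are introduced only for this illustration:

```r
library(dplyr)

df <- data.frame(
  user = c("a", "a", "b", "b", "c"),
  v1 = c(1, 1, 1, 2, 1),
  v2 = c(0, 0, 0, 0, 1),
  v3 = c(0, 1, 0, 3, 1)
)

# Piped form: the left-hand side becomes the first argument on the right.
piped <- df %>%
  group_by(user) %>%
  summarize(v1_sum = sum(v1))

# Nested form: the same calls written inside out.
nested <- summarize(group_by(df, user), v1_sum = sum(v1))

# The two results should match.
print(all.equal(piped, nested))
```

In R 4.1.0 and later, the native pipe |> can be used in the same way for simple chains like this one.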

How to group by factor levels from two columns and output new column that shows sum of each level in R?

Instead of grouping by 'RawDate', group by 'ID' and 'YEAR', then take the sum of a logical vector (each TRUE counts as 1):

library(dplyr)
complete_df %>%
  group_by(ID, YEAR) %>%
  mutate(TotalWon = sum(Renewal == 'WON'), TotalLost = sum(Renewal == 'LOST'))

If you need one row per group rather than the totals repeated on every row, use summarise instead of mutate.
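A sketch of the summarise variant follows. The original complete_df is not shown in the question, so the data below is invented purely for illustration. The .groups = "drop" argument (available since dplyr 1.0.0) is an addition of this sketch that returns an ungrouped result:

```r
library(dplyr)

# Hypothetical stand-in for complete_df; values are made up.
complete_df <- data.frame(
  ID      = c(1, 1, 1, 2, 2),
  YEAR    = c(2019, 2019, 2020, 2019, 2019),
  Renewal = c("WON", "LOST", "WON", "WON", "WON")
)

# Comparing to a string yields a logical vector; sum() counts the TRUEs.
totals <- complete_df %>%
  group_by(ID, YEAR) %>%
  summarise(TotalWon  = sum(Renewal == "WON"),
            TotalLost = sum(Renewal == "LOST"),
            .groups = "drop")
print(totals)
```

With mutate the output keeps all five input rows; with summarise it collapses to one row per ID and YEAR combination.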

Summing values in a column and grouping by another column in R

summarise_all is your friend here.

summarise_all(group_by(df, Dept), sum)
# # A tibble: 2 x 4
# Dept Mike Steve Tom
# <chr> <dbl> <dbl> <dbl>
# 1 Dept1 2 2 1
# 2 Dept2 0 3 2
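Since dplyr 1.0.0, summarise_all has been superseded by across(), although it still works. The sketch below shows the across() form next to a base R equivalent. The input data frame is not shown in the question, so the one here is invented to reproduce the totals above:

```r
library(dplyr)

# Hypothetical input chosen so the group sums match the output shown above.
df <- data.frame(
  Dept  = c("Dept1", "Dept1", "Dept2", "Dept2"),
  Mike  = c(1, 1, 0, 0),
  Steve = c(1, 1, 2, 1),
  Tom   = c(0, 1, 1, 1)
)

# across(everything(), sum) sums every non-grouping column.
with_across <- df %>%
  group_by(Dept) %>%
  summarise(across(everything(), sum))

# Base R equivalent using the formula interface of aggregate().
with_base <- aggregate(. ~ Dept, data = df, FUN = sum)

print(with_across)
print(with_base)
```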

R sum a variable by two groups

You can group_by ID and Year, then use sum within summarise:

library(dplyr)

txt <- "ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50"

df <- read.table(text = txt, header = TRUE)

df %>%
  group_by(ID, Year) %>%
  summarise(Total = sum(Amount, na.rm = TRUE))
#> # A tibble: 6 x 3
#> # Groups: ID [?]
#> ID Year Total
#> <int> <int> <int>
#> 1 3 2000 100
#> 2 3 2002 20
#> 3 3 2004 30
#> 4 4 2000 25
#> 5 4 2002 55
#> 6 4 2004 95

If you have more than one Amount column and want to apply more than one function, you can use either summarise_if or summarise_all:

df %>%
  group_by(ID, Year) %>%
  # funs() is deprecated since dplyr 0.8.0; a named list works instead
  summarise_if(is.numeric, list(sum = sum, mean = mean))
#> # A tibble: 6 x 4
#> # Groups: ID [?]
#> ID Year sum mean
#> <int> <int> <int> <dbl>
#> 1 3 2000 100 50
#> 2 3 2002 20 10
#> 3 3 2004 30 30
#> 4 4 2000 25 25
#> 5 4 2002 55 27.5
#> 6 4 2004 95 47.5

df %>%
  group_by(ID, Year) %>%
  # funs() is deprecated since dplyr 0.8.0; a named list works instead
  summarise_all(list(sum = sum, mean = mean, max = max, min = min))
#> # A tibble: 6 x 6
#> # Groups: ID [?]
#> ID Year sum mean max min
#> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 3 2000 100 50 55 45
#> 2 3 2002 20 10 10 10
#> 3 3 2004 30 30 30 30
#> 4 4 2000 25 25 25 25
#> 5 4 2002 55 27.5 40 15
#> 6 4 2004 95 47.5 50 45

Created on 2018-09-19 by the reprex package (v0.2.1.9000)
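For newer dplyr versions (1.0.0 and later), the same idea can be sketched with across(), which supersedes the summarise_if and summarise_all variants. The variable name res is introduced here for illustration. Note that across() names its output columns as column_function, so the names differ from the output above:

```r
library(dplyr)

df <- read.table(text = "ID Year Amount
3 2000 45
3 2000 55
3 2002 10
3 2002 10
3 2004 30
4 2000 25
4 2002 40
4 2002 15
4 2004 45
4 2004 50", header = TRUE)

# A named list of functions yields one output column per function,
# named Amount_sum and Amount_mean.
res <- df %>%
  group_by(ID, Year) %>%
  summarise(across(Amount, list(sum = sum, mean = mean)), .groups = "drop")
print(res)
```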

R: Sum specific columns grouped by a particular column

1. Minimal reproducible example data:

df <- structure(list(Col1 = c(10L, 10L, 30L, 45L, 45L),
                     Col2 = c("A", "A", "B", "C", "C"),
                     Col3 = c(5L, 6L, 2L, 5L, 2L),
                     Col4 = c(4L, 3L, 7L, 1L, 1L)),
                row.names = c(NA, -5L), class = "data.frame")

2. Solution using dplyr:

library(dplyr)

df %>%
  group_by(Col1, Col2) %>%
  summarise(Col3 = sum(Col3),
            Col4 = sum(Col4))

Returns:

   Col1 Col2   Col3  Col4
  <int> <chr> <int> <int>
1    10 A        11     7
2    30 B         2     7
3    45 C         7     2
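If dplyr is not available, roughly the same result can be obtained in base R. This is a sketch using the formula interface of aggregate, with cbind on the left for the columns to sum and the grouping columns on the right. The name res is introduced only for illustration:

```r
df <- structure(list(Col1 = c(10L, 10L, 30L, 45L, 45L),
                     Col2 = c("A", "A", "B", "C", "C"),
                     Col3 = c(5L, 6L, 2L, 5L, 2L),
                     Col4 = c(4L, 3L, 7L, 1L, 1L)),
                row.names = c(NA, -5L), class = "data.frame")

# Sum Col3 and Col4 within each combination of Col1 and Col2.
res <- aggregate(cbind(Col3, Col4) ~ Col1 + Col2, data = df, FUN = sum)
print(res)
```

Be aware that the formula interface drops rows containing NA by default (na.action = na.omit), which does not matter for this data but can for others.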

How to sum total of elements in several columns of factor type that are not empty?

What am I doing wrong with dplyr's code block?

It's because there are NAs. Try

library(dplyr)

df2 = df %>%
  select(Group, A_n, B_n) %>%
  group_by(Group) %>%
  summarise_all(sum, na.rm=TRUE)

instead.
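The effect of na.rm can be seen in a small base R sketch. The vectors vals and grp are invented for illustration; tapply forwards extra arguments to the function in the same way summarise_all forwards na.rm above:

```r
# Hypothetical values with one missing entry in Group1.
vals <- c(1, NA, 2, 1)
grp  <- c("Group1", "Group1", "Group2", "Group2")

# A single NA makes the whole sum NA by default.
print(sum(vals))

# Without na.rm, the group containing the NA also sums to NA.
without_rm <- tapply(vals, grp, sum)

# With na.rm = TRUE, missing values are dropped before summing.
with_rm <- tapply(vals, grp, sum, na.rm = TRUE)

print(without_rm)
print(with_rm)
```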

Output on my machine:

# A tibble: 2 x 3
Group A_n B_n
<fctr> <dbl> <dbl>
1 Group1 2 1
2 Group2 1 1

I'm afraid my approach ... is too verbose and maybe overkill

You can just do this:

df <- data.frame(list(Group = c("Group1", "Group1", "Group2", "Group2"),
                      A=c("Some text", "Text here too", "Some other text", NA),
                      B=c(NA, "Some random text", NA, "Random here too")))

library(dplyr)

df2 = df %>%
  group_by(Group) %>%
  summarise_all(.funs=function(x) length(na.omit(x)))

Output on my machine:

# A tibble: 2 x 3
Group A B
<fctr> <int> <int>
1 Group1 2 1
2 Group2 1 1

A little explanation

If you look at help(summarise_all), you'll see its arguments are .tbl, .funs, and ... (we won't worry about the ellipsis for now). So, we feed df into group_by() using the pipe %>%, then feed that into summarise_all(), again using the pipe %>%. That takes care of the .tbl argument. The .funs argument is how you specify which function(s) should be used to summarise all non-grouping columns in .tbl. Here we want to know how many elements of each column are not NA, which we can do (as one approach) by applying length(na.omit(x)) to each non-grouping column x in .tbl.
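As a sketch of the counting step on its own, the base R code below applies the same idea per group with aggregate, and checks that sum(!is.na(x)) gives the same count as length(na.omit(x)). The helper name count_non_na is introduced here for illustration and is not part of any package:

```r
df <- data.frame(
  Group = c("Group1", "Group1", "Group2", "Group2"),
  A = c("Some text", "Text here too", "Some other text", NA),
  B = c(NA, "Some random text", NA, "Random here too")
)

# Hypothetical helper: counts the elements of x that are not NA.
count_non_na <- function(x) sum(!is.na(x))

# Both expressions give the same count for a single column.
print(count_non_na(df$A) == length(na.omit(df$A)))

# The by-list interface passes NA values through to the function,
# unlike the formula interface, which drops incomplete rows first.
res <- aggregate(df[c("A", "B")], by = list(Group = df$Group),
                 FUN = count_non_na)
print(res)
```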

My best suggestion for a resource to learn about dplyr is Chapter 5 of R for Data Science, a book by Hadley Wickham, who wrote the dplyr package (among many others).


