Ddply for Sum by Group in R

As pointed out in a comment, you can do multiple operations inside the summarize.

This reduces your code to one line of ddply() and one line of subsetting, which is easy enough with the [ operator:

library(plyr)
x <- ddply(data, .(Y), summarize, freq=length(Y), tot=sum(income))
x[x$freq > 3, ]

       Y freq  tot
3 228122    4 6778
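
The data frame behind this output is not shown, so here is a minimal sketch of the same two steps on made-up data. The `toy` data frame and its values are assumptions for illustration only; it needs the plyr package installed.

```r
library(plyr)

# Made-up data: group 1 has four rows, the other groups have fewer.
toy <- data.frame(
  Y      = c(1, 1, 1, 1, 2, 2, 3),
  income = c(10, 20, 30, 40, 5, 5, 7)
)

# Step 1: one row per group, with the row count and the income total.
x <- ddply(toy, .(Y), summarize, freq = length(Y), tot = sum(income))

# Step 2: keep only the groups with more than three rows.
big <- x[x$freq > 3, ]
print(big)
```

Only group 1 survives the filter here, with `freq` 4 and `tot` 100.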

This is also exceptionally easy with the data.table package:

library(data.table)
data.table(data)[, list(freq=length(income), tot=sum(income)), by=Y][freq > 3]
        Y freq  tot
1: 228122    4 6778

In fact, data.table has a built-in shortcut for counting the rows in each group, the special symbol .N:

data.table(data)[, list(freq=.N, tot=sum(income)), by=Y][freq > 3]
        Y freq  tot
1: 228122    4 6778
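
To check that `.N` and `length()` give the same count, here is a small sketch on made-up data (the `toy` table is an assumption, not the original `data`); it needs the data.table package installed.

```r
library(data.table)

# Made-up data: group 1 has four rows, the other groups have fewer.
toy <- data.table(
  Y      = c(1, 1, 1, 1, 2, 2, 3),
  income = c(10, 20, 30, 40, 5, 5, 7)
)

# Count rows per group with length() ...
res_len <- toy[, .(freq = length(income), tot = sum(income)), by = Y][freq > 3]

# ... and with the .N shortcut; the chained [freq > 3] filters the summary.
res_n <- toy[, .(freq = .N, tot = sum(income)), by = Y][freq > 3]

print(res_n)
```

Both versions should return the same single row for group 1.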

Aggregate sum and mean in R with ddply

Another solution uses dplyr. First apply both aggregate functions to every variable you want to aggregate, then select only the function/variable combinations you need from the result.

library(dplyr)
library(ggplot2)

diamonds %>%
  group_by(cut) %>%
  summarise_each(funs(sum, mean), x:z, price) %>%
  select(cut, matches("[xyz]_sum"), price_mean)
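
Note that summarise_each() and funs() are deprecated in current dplyr releases. A sketch of the same idea with across(), assuming dplyr 1.0.0 or later, is below. The small `gems` data frame is made up so the example runs without ggplot2; it only mimics the relevant columns of diamonds.

```r
library(dplyr)

# Made-up stand-in for the diamonds columns used above.
gems <- data.frame(
  cut   = c("Fair", "Fair", "Ideal", "Ideal"),
  x     = c(1, 2, 3, 4),
  y     = c(1, 1, 2, 2),
  z     = c(0.5, 0.5, 1, 1),
  price = c(100, 200, 300, 500)
)

res <- gems %>%
  group_by(cut) %>%
  # A named list of functions yields columns named <column>_<function>.
  summarise(across(c(x:z, price), list(sum = sum, mean = mean))) %>%
  # Keep the sums of x, y, z and the mean of price only.
  select(cut, matches("[xyz]_sum"), price_mean)

print(res)
```

The named list is what produces names such as `x_sum` and `price_mean`, so the same select() call still works.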

How to sum a variable by group

Using aggregate:

aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
  Category  x
1    First 30
2   Second  5
3    Third 34

In the example above, multiple dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be incorporated via cbind:

aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...

As @thelatemail noted in a comment, aggregate has a formula interface too:

aggregate(Frequency ~ Category, x, sum)

Or, if you want to aggregate multiple columns, you could use the . notation (which works for one column too):

aggregate(. ~ Category, x, sum)
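
The cbind approach also works on the left-hand side of the formula. A minimal base R sketch follows; the `Metric2` column and its values are made up for illustration and are not part of the original data.

```r
# Same Category/Frequency values as the example data, plus a made-up Metric2.
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Metric2   = c(1, 2, 3, 4, 5, 6, 7)
)

# Sum both metrics per Category in a single call.
res <- aggregate(cbind(Frequency, Metric2) ~ Category, data = x, FUN = sum)
print(res)
```

This returns one row per Category with a summed column for each metric.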

or tapply:

tapply(x$Frequency, x$Category, FUN=sum)
 First Second  Third
    30      5     34

Using this data:

x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                  "Third", "Third", "Second")),
                Frequency=c(10,15,5,2,14,20,3))

Summarize variable for different time periods and by group using ddply

DF <- read.table(text="   Client   Q   Sales   Date
A 2 30 01/01/2014
A 3 24 02/01/2014
A 1 10 03/01/2014
B 4 10 01/01/2014
B 1 20 02/01/2014
B 3 30 03/01/2014", header=TRUE)

library(plyr)
ddply(DF, .(Client), summarise,
      Q = sum(Q),
      `Sales03/01/2014` = Sales[Date=="03/01/2014"],
      Sales = sum(Sales))

#  Client Q Sales03/01/2014 Sales
#1      A 6              10    64
#2      B 8              30    60

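Note that the order of evaluation matters here if the output column reuses the name of the input column Sales: the per-date value has to be extracted before Sales is overwritten by its sum. Also, it is best to avoid column names that are not syntactically valid, such as `Sales03/01/2014`, which needs backticks.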
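
Here is a sketch of the same call with a syntactically valid output name. The name `Sales_last` is my own choice, not from the question, and DF is rebuilt with data.frame() so the block is self-contained. plyr's summarise evaluates its arguments in order, so `Sales_last` is computed while Sales still holds the per-row values.

```r
library(plyr)

# Same values as the read.table() example above.
DF <- data.frame(
  Client = c("A", "A", "A", "B", "B", "B"),
  Q      = c(2, 3, 1, 4, 1, 3),
  Sales  = c(30, 24, 10, 10, 20, 30),
  Date   = rep(c("01/01/2014", "02/01/2014", "03/01/2014"), times = 2)
)

res <- ddply(DF, .(Client), summarise,
             Q = sum(Q),
             # Extract the single-date value first ...
             Sales_last = Sales[Date == "03/01/2014"],
             # ... and only then replace Sales with its group total.
             Sales = sum(Sales))
print(res)
```

If the `Sales = sum(Sales)` line came first, the date subset would be applied to the already summed value rather than to the original rows.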

Better & faster way to sum & ifelse for a large set of columns in a big data frame using ddply R

We could possibly reduce the run time by switching to dplyr. Also, instead of computing the sum and then using ifelse to convert it back to 0/1, we can check directly whether any value in the group is greater than 0:

library(dplyr)
dummies %>%
  dplyr::select(id, where(is.numeric)) %>%
  dplyr::group_by(id) %>%
  dplyr::summarise(across(everything(), ~ +(any(. > 0, na.rm = TRUE))))

Or using data.table:

library(data.table)
setDT(dummies)[, lapply(.SD, function(x)
    +(any(x > 0, na.rm = TRUE))), id, .SDcols = patterns('group')]
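
The `+(any(...))` idiom in both versions relies on the unary plus turning a logical into an integer. A small base R sketch shows it in isolation; the helper names and test vectors are made up for illustration.

```r
# Hypothetical helper: 1L if any value is positive, 0L otherwise.
flag_any_positive <- function(x) +(any(x > 0, na.rm = TRUE))

# The sum-then-ifelse approach it replaces, for comparison.
flag_via_sum <- function(x) ifelse(sum(x, na.rm = TRUE) > 0, 1, 0)

a <- c(0, 0, 1, NA)  # contains a positive value
b <- c(0, NA, 0)     # contains none

flag_any_positive(a)
flag_any_positive(b)
flag_via_sum(a)
flag_via_sum(b)
```

The two helpers agree for 0/1 dummy columns. With negative values a sum could cancel out, so the any() check is the safer of the two.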

R sort summarise ddply by group sum

Here is a base R version, where DF is the result of your ddply call:

with(DF, DF[order(-ave(Total, Category, FUN=sum), Category, -Total), ])

produces:

  Category City Total
7      Two    A    34
8      Two    B     2
6    Three    D     8
5    Three    C     5
4      One    D    10
3      One    A     1
1     Four    B     5
2     Four    C     3

The logic is basically the same as David's: calculate the sum of Total for each Category, repeat that number on every row of the Category (this is what ave(..., FUN=sum) does), and then sort by it, plus some tie-breakers to make sure the rows come out in the expected order.
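
To make the intermediate step visible, here is a base R sketch that splits the one-liner in two. DF is rebuilt from the printed result above, using the row names as the original row order; `group_total` is a name I introduce for illustration.

```r
# Rebuilt from the printed output (row names 1 to 8 give the original order).
DF <- data.frame(
  Category = c("Four", "Four", "One", "One", "Three", "Three", "Two", "Two"),
  City     = c("B", "C", "A", "D", "C", "D", "A", "B"),
  Total    = c(5, 3, 1, 10, 5, 8, 34, 2)
)

# ave() returns a vector as long as the input: each row gets its group's sum.
group_total <- ave(DF$Total, DF$Category, FUN = sum)

# Sort by group sum (descending), then Category, then Total (descending).
sorted <- DF[order(-group_total, DF$Category, -DF$Total), ]
print(sorted)
```

Printing `group_total` on its own shows the repeated per-group sums that drive the sort.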

(How) can I use ddply to summarize a dataframe grouped by two factors?

Just remove the c in the .variables argument, so your code is:

library(plyr)
ddply(ExampleData, .(Condition, Block), summarize,
      Average = mean(Var1, na.rm = TRUE),
      SD = sd(Var1),
      N = length(Var1),
      Med = median(Var1))

By the way, you might want to switch to using dplyr instead of plyr.
https://blog.rstudio.com/2014/01/17/introducing-dplyr/

If you were to do this in dplyr:

library(dplyr)
summarize(group_by(ExampleData, Condition, Block),
          Average = mean(Var1, na.rm = TRUE),
          SD = sd(Var1),
          N = length(Var1),
          Med = median(Var1))

You could also use the pipe operator, so this becomes:

ExampleData %>%
  group_by(Condition, Block) %>%
  summarise(Average = mean(Var1, na.rm = TRUE),
            SD = sd(Var1),
            N = length(Var1),
            Med = median(Var1))
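
ExampleData itself is not shown, so here is a runnable sketch of the piped version on made-up data, assuming dplyr 1.0.0 or later for the `.groups` argument. Unlike the code above, it passes `na.rm = TRUE` to sd() as well, because without it any group containing an NA gets an NA standard deviation; whether that is what you want depends on your data.

```r
library(dplyr)

# Made-up data: two factors and one numeric variable with a missing value.
ExampleData <- data.frame(
  Condition = c("a", "a", "a", "a", "a", "b", "b", "b", "b"),
  Block     = c(1, 1, 2, 2, 2, 1, 1, 2, 2),
  Var1      = c(1, 3, 5, 7, NA, 2, 4, 6, 8)
)

res <- ExampleData %>%
  group_by(Condition, Block) %>%
  summarise(Average = mean(Var1, na.rm = TRUE),
            SD      = sd(Var1, na.rm = TRUE),
            N       = n(),            # row count, including the NA row
            .groups = "drop")         # return an ungrouped tibble
print(res)
```

This gives one row per Condition/Block combination, four in total for this data.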

How to use a for loop to use ddply on multiple columns?

The OP asked for a simple for loop for this transformation. There are many other, more optimized ways to solve this, but to respect the OP's request I have used a for-loop-based solution. I have used dplyr, since plyr is old now.

library(dplyr)
Subject <- c(rep(1, times = 6), rep(2, times = 6))
GroupOfInterest <- c(letters[rep(1:3, times = 4)])
Feature1 <- sample(1:20, 12, replace = TRUE)
Feature2 <- sample(400:500, 12, replace = TRUE)
Feature3 <- sample(1:5, 12, replace = TRUE)
# small change in the way the data.frame is created
df.main <- data.frame(Subject, GroupOfInterest, Feature1, Feature2,
                      Feature3, stringsAsFactors = FALSE)

Feat <- c(colnames(df.main[3:5]))

# Ready with Key columns on which grouping is done
resultdf <- unique(select(df.main, Subject, GroupOfInterest))
#> resultdf
#  Subject GroupOfInterest
#1       1               a
#2       1               b
#3       1               c
#7       2               a
#8       2               b
#9       2               c

# For loop over each feature column
for(q in Feat){
  summean <- paste0('mean(', q, ')')  # expression as a string, e.g. "mean(Feature1)"
  summ_name <- paste0(q)              # name of the column that stores the mean
  # note: summarise_() is deprecated in recent dplyr releases
  df_sum <- df.main %>%
    group_by(Subject, GroupOfInterest) %>%
    summarise_(.dots = setNames(summean, summ_name))
  # merge the new mean column into resultdf
  resultdf <- merge(resultdf, df_sum, by = c("Subject", "GroupOfInterest"))
}

# Final result
#> resultdf
#  Subject GroupOfInterest Feature1 Feature2 Feature3
#1       1               a      6.5    473.0      3.5
#2       1               b      4.5    437.0      2.0
#3       1               c     12.0    415.5      3.5
#4       2               a     10.0    437.5      3.0
#5       2               b      3.0    447.0      4.5
#6       2               c      6.0    462.0      2.5
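
Since summarise_() is deprecated in recent dplyr releases, here is a sketch of the same loop using the `.data` pronoun instead of a pasted expression string, assuming dplyr 1.0.0 or later. The feature values are fixed rather than sampled so the result is reproducible, and only two feature columns are used to keep it short. The last statement shows a loop-free alternative with across() for comparison.

```r
library(dplyr)

# Fixed, made-up values instead of sample() so the result is reproducible.
df.main <- data.frame(
  Subject         = rep(1:2, each = 6),
  GroupOfInterest = rep(letters[1:3], times = 4),
  Feature1        = 1:12,
  Feature2        = seq(10, 120, by = 10),
  stringsAsFactors = FALSE
)
Feat <- c("Feature1", "Feature2")

# Key columns on which the grouping is done.
resultdf <- unique(df.main[, c("Subject", "GroupOfInterest")])

for (q in Feat) {
  df_mean <- df.main %>%
    group_by(Subject, GroupOfInterest) %>%
    # !!q := names the output column; .data[[q]] looks the input column up by name.
    summarise(!!q := mean(.data[[q]]), .groups = "drop")
  # Merge the new mean column into the running result.
  resultdf <- merge(resultdf, df_mean, by = c("Subject", "GroupOfInterest"))
}
print(resultdf)

# Loop-free alternative: one across() call over all feature columns.
no_loop <- df.main %>%
  group_by(Subject, GroupOfInterest) %>%
  summarise(across(all_of(Feat), mean), .groups = "drop")
```

Both `resultdf` and `no_loop` should hold the same group means.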

