ddply for sum by group in R
As pointed out in a comment, you can do multiple operations inside the summarize call. This reduces your code to one line of ddply() and one line of subsetting, which is easy enough with the [ operator:
x <- ddply(data, .(Y), summarize, freq=length(Y), tot=sum(income))
x[x$freq > 3, ]
Y freq tot
3 228122 4 6778
This is also exceptionally easy with the data.table package:
library(data.table)
data.table(data)[, list(freq=length(income), tot=sum(income)), by=Y][freq > 3]
Y freq tot
1: 228122 4 6778
In fact, counting the rows in each group has its own shortcut in data.table, the .N symbol:
data.table(data)[, list(freq=.N, tot=sum(income)), by=Y][freq > 3]
Y freq tot
1: 228122 4 6778
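The question's data is not shown here, so the following is only a minimal sketch on a made-up data frame (the values of Y and income are assumptions for illustration). It shows the same .N plus chained-filter pattern end to end:

```r
library(data.table)

# Hypothetical data: only the group Y == 1 has more than 3 rows
data <- data.frame(
  Y      = c(1, 1, 1, 1, 2, 2, 3),
  income = c(10, 20, 30, 40, 5, 5, 100)
)

# Count rows (.N) and sum income per group, then keep groups with freq > 3
res <- as.data.table(data)[, .(freq = .N, tot = sum(income)), by = Y][freq > 3]
print(res)
```

The second pair of brackets filters the already aggregated table, so freq refers to the column created in the first step.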
Aggregate sum and mean in R with ddply
Another solution uses dplyr. First, apply both aggregate functions to every variable you want aggregated. Then select only the desired function/variable combinations from the result.
library(dplyr)
library(ggplot2)
diamonds %>%
  group_by(cut) %>%
  summarise_each(funs(sum, mean), x:z, price) %>%
  select(cut, matches("[xyz]_sum"), price_mean)
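In dplyr 1.0 and later, summarise_each() and funs() are deprecated in favour of across(). A rough equivalent is sketched below on a small made-up data frame (gems and its values are assumptions, standing in for diamonds so that the example does not need ggplot2):

```r
library(dplyr)

# Hypothetical stand-in for diamonds
gems <- data.frame(
  cut   = c("Fair", "Fair", "Good", "Good"),
  x     = c(1, 2, 3, 4),
  y     = c(2, 3, 4, 5),
  z     = c(3, 4, 5, 6),
  price = c(100, 200, 300, 500)
)

res <- gems %>%
  group_by(cut) %>%
  # apply both functions to every column; results are named <col>_<fn>
  summarise(across(c(x:z, price), list(sum = sum, mean = mean)),
            .groups = "drop") %>%
  # keep only the wanted function/variable combinations
  select(cut, matches("[xyz]_sum"), price_mean)
print(res)
```

Naming the functions in the list is what produces the x_sum and price_mean column names that the select() step relies on.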
How to sum a variable by group
Using aggregate:
aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
Category x
1 First 30
2 Second 5
3 Third 34
In the example above, multiple dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be incorporated via cbind:
aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...
As @thelatemail noted in a comment, aggregate has a formula interface too:
aggregate(Frequency ~ Category, x, sum)
Or, if you want to aggregate multiple columns, you could use the . notation (works for one column too):
aggregate(. ~ Category, x, sum)
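The cbind approach also works on the left-hand side of the formula. The sketch below reuses the answer's x and adds a made-up Metric2 column (its values are an assumption for illustration only):

```r
# The answer's data plus a hypothetical second metric
x <- data.frame(
  Category  = factor(c("First", "First", "First", "Second",
                       "Third", "Third", "Second")),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Metric2   = 1:7
)

# Sum both metrics per Category in one call
res <- aggregate(cbind(Frequency, Metric2) ~ Category, data = x, FUN = sum)
print(res)
```

Unlike the . notation, this names the columns to aggregate explicitly, which helps when the data frame holds other columns that should not be summed.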
Or tapply:
tapply(x$Frequency, x$Category, FUN=sum)
First Second Third
30 5 34
Using this data:
x <- data.frame(Category=factor(c("First", "First", "First", "Second",
"Third", "Third", "Second")),
Frequency=c(10,15,5,2,14,20,3))
Summarize variable for different time periods and by group using ddply
DF <- read.table(text=" Client Q Sales Date
A 2 30 01/01/2014
A 3 24 02/01/2014
A 1 10 03/01/2014
B 4 10 01/01/2014
B 1 20 02/01/2014
B 3 30 03/01/2014", header=TRUE)
library(plyr)
ddply(DF, .(Client), summarise,
      Q = sum(Q),
      `Sales03/01/2014` = Sales[Date=="03/01/2014"],
      Sales = sum(Sales))
# Client Q Sales03/01/2014 Sales
#1 A 6 10 64
#2 B 8 30 60
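Note that the order of evaluation matters here if you want an output column to share its name with the input column Sales: the summed Sales has to come last, because expressions that follow it would see the overwritten value. Also, it is best to avoid names that are not syntactically valid, such as `Sales03/01/2014`.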
Better & faster way to sum & ifelse for a large set of columns in a big data frame using ddply R
We could possibly reduce the time by switching to dplyr. Also, instead of doing the sum and then using ifelse to check and reconvert, this can be done directly by checking whether any value is greater than 0:
library(dplyr)
dummies %>%
  dplyr::select(id, where(is.numeric)) %>%
  dplyr::group_by(id) %>%
  dplyr::summarise(across(everything(), ~ +(any(. > 0, na.rm = TRUE))))
Or using data.table:
library(data.table)
setDT(dummies)[, lapply(.SD, function(x)
+(any(x > 0, na.rm = TRUE))), id, .SDcols = patterns('group')]
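To see why the any() shortcut can stand in for sum-then-ifelse, here is a small base R comparison on made-up data (dummies and its column values are assumptions). The two agree only when the columns are non-negative indicators, which is what the question describes:

```r
# Hypothetical dummy data: two 0/1 indicator columns per id
dummies <- data.frame(
  id     = c(1, 1, 2, 2, 3),
  group1 = c(0, 1, 0, 0, 1),
  group2 = c(0, 0, 0, 0, 1)
)

# Original idea: sum within each id, then recode to 0/1
via_sum <- aggregate(. ~ id, data = dummies,
                     FUN = function(x) ifelse(sum(x) > 0, 1L, 0L))

# Shortcut: any() gives a logical, unary + coerces it to integer
via_any <- aggregate(. ~ id, data = dummies,
                     FUN = function(x) +any(x > 0))

print(via_any)
stopifnot(all(via_sum == via_any))
```

any() can stop at the first positive value, while sum() always has to read the whole column, which is one reason the shortcut may be faster on large groups.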
R sort summarise ddply by group sum
Here is a base R version, where DF is the result of your ddply call:
with(DF, DF[order(-ave(Total, Category, FUN=sum), Category, -Total), ])
produces:
Category City Total
7 Two A 34
8 Two B 2
6 Three D 8
5 Three C 5
4 One D 10
3 One A 1
1 Four B 5
2 Four C 3
The logic is basically the same as David's: calculate the sum of Total for each Category, use that number for all rows in each Category (we do this with ave(..., FUN=sum)), and then sort by that plus some tie breakers to make sure everything comes out as expected.
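The steps can be made visible by computing the ave() result separately. In the sketch below, DF is rebuilt from the printed output above (row names 1 to 8), and group_sum is a helper name introduced here for illustration:

```r
# DF reconstructed from the output shown above
DF <- data.frame(
  Category = c("Four", "Four", "One", "One", "Three", "Three", "Two", "Two"),
  City     = c("B", "C", "A", "D", "C", "D", "A", "B"),
  Total    = c(5, 3, 1, 10, 5, 8, 34, 2)
)

# Per-category sum, repeated on every row of that category
group_sum <- ave(DF$Total, DF$Category, FUN = sum)
print(group_sum)

# Largest category sum first, then Category and descending Total as tie breakers
sorted <- DF[order(-group_sum, DF$Category, -DF$Total), ]
print(sorted)
```

Because ave() returns a vector as long as its input, it can be passed straight to order() without any merge step.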
(How) can I use ddply to summarize a dataframe grouped by two factors?
Just remove the c in the .variables argument, so your code is:
library(plyr)
ddply(ExampleData, .(Condition, Block), summarize, Average=mean(Var1, na.rm=TRUE), SD=sd(Var1),N=length(Var1), Med =median(Var1))
By the way, you might want to switch to using dplyr instead of plyr.
https://blog.rstudio.com/2014/01/17/introducing-dplyr/
If you were to do this in dplyr:
summarize(group_by(ExampleData, Condition, Block), Average=mean(Var1, na.rm=TRUE), SD=sd(Var1),N=length(Var1), Med =median(Var1))
You could also use the pipe, so this becomes:
ExampleData %>%
  group_by(Condition, Block) %>%
  summarise(Average=mean(Var1, na.rm=TRUE),
            SD=sd(Var1),
            N=length(Var1),
            Med =median(Var1))
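One detail worth checking in the calls above: only mean() drops missing values, while sd(), median() and length() still see them. The sketch below, on made-up data (ExampleData's values here are assumptions), treats NA consistently across all four statistics:

```r
library(dplyr)

# Hypothetical data with one missing value in group b / 2
ExampleData <- data.frame(
  Condition = c("a", "a", "a", "a", "b", "b", "b", "b"),
  Block     = c(1, 1, 2, 2, 1, 1, 2, 2),
  Var1      = c(1, 3, 5, 7, 2, 4, 6, NA)
)

res <- ExampleData %>%
  group_by(Condition, Block) %>%
  summarise(Average = mean(Var1, na.rm = TRUE),
            SD      = sd(Var1, na.rm = TRUE),
            N       = sum(!is.na(Var1)),   # count of non-missing values
            Med     = median(Var1, na.rm = TRUE),
            .groups = "drop")
print(res)
```

Whether N should count missing rows depends on what the summary is for; use n() instead if every row should be counted.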
How to use a for loop to use ddply on multiple columns?
The OP asked for a simple for-loop to do this transformation. I understand that there are many other, more optimized ways to solve this, but to respect the OP's wish I used a for-loop-based solution. I have used dplyr, as plyr is old now.
library(dplyr)
Subject <- c(rep(1, times = 6), rep(2, times = 6))
GroupOfInterest <- c(letters[rep(1:3, times = 4)])
Feature1 <- sample(1:20, 12, replace = TRUE)
Feature2 <- sample(400:500, 12, replace = TRUE)
Feature3 <- sample(1:5, 12, replace = TRUE)
#small change in the way data.frame is created
df.main <- data.frame(Subject,GroupOfInterest, Feature1, Feature2,
Feature3, stringsAsFactors = FALSE)
Feat <- c(colnames(df.main[3:5]))
# Ready with Key columns on which grouping is done
resultdf <- unique(select(df.main, Subject, GroupOfInterest))
#> resultdf
# Subject GroupOfInterest
#1 1 a
#2 1 b
#3 1 c
#7 2 a
#8 2 b
#9 2 c
# Loop over each feature column
for (q in Feat) {
  summean <- paste0('mean(', q, ')')  # expression string, e.g. "mean(Feature1)"
  summ_name <- q                      # name of the column that stores the mean
  df_mean <- df.main %>%
    group_by(Subject, GroupOfInterest) %>%
    summarise_(.dots = setNames(summean, summ_name))
  # merge the new mean column into resultdf
  resultdf <- merge(resultdf, df_mean, by = c("Subject", "GroupOfInterest"))
}
# Final result
#> resultdf
# Subject GroupOfInterest Feature1 Feature2 Feature3
#1 1 a 6.5 473.0 3.5
#2 1 b 4.5 437.0 2.0
#3 1 c 12.0 415.5 3.5
#4 2 a 10.0 437.5 3.0
#5 2 b 3.0 447.0 4.5
#6 2 c 6.0 462.0 2.5
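summarise_() is deprecated in current dplyr. If the loop itself is not a hard requirement, the same table can likely be produced in one step with across() and all_of(). This is a sketch only; the seed is an arbitrary choice added here so the random data is reproducible:

```r
library(dplyr)

set.seed(42)  # arbitrary seed, not part of the original answer
df.main <- data.frame(
  Subject         = rep(1:2, each = 6),
  GroupOfInterest = rep(letters[1:3], times = 4),
  Feature1        = sample(1:20, 12, replace = TRUE),
  Feature2        = sample(400:500, 12, replace = TRUE),
  Feature3        = sample(1:5, 12, replace = TRUE),
  stringsAsFactors = FALSE
)
Feat <- colnames(df.main)[3:5]

# Mean of every column named in Feat, per Subject and GroupOfInterest
resultdf <- df.main %>%
  group_by(Subject, GroupOfInterest) %>%
  summarise(across(all_of(Feat), mean), .groups = "drop")
print(resultdf)
```

all_of() takes the character vector of column names directly, so no expression strings or merge steps are needed.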