R use ddply or aggregate
I, too, would recommend data.table here, but since you asked for an aggregate solution, here is one which combines aggregate and merge to get all the columns:
merge(events22, aggregate(saleDate ~ custId, events22, max))
Or just aggregate if you only want the "custId" and "DelivDate" columns:
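aggregate(list(DelivDate = seq_len(nrow(events22))),
          list(custId = events22$custId),
          # x holds the row numbers of one customer; pick the row with the latest saleDate
          function(x) events22[["DelivDate"]][x[which.max(events22[["saleDate"]][x])]])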
Finally, here's an option using sqldf:
library(sqldf)
sqldf("select custId, DelivDate, max(saleDate) `saleDate`
from events22 group by custId")
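To make the merge/aggregate idea concrete, here is a minimal sketch on a made-up events22. The values are invented for illustration; only the column names come from the question. aggregate finds the latest saleDate per custId, and merge then pulls back the full matching rows:

```r
# Hypothetical sample data; only the column names mirror the question.
events22 <- data.frame(
  custId    = c(1, 1, 2, 2, 2),
  saleDate  = as.Date(c("2013-01-05", "2013-03-10", "2013-02-01",
                        "2013-02-20", "2013-01-15")),
  DelivDate = as.Date(c("2013-01-09", "2013-03-14", "2013-02-04",
                        "2013-02-25", "2013-01-20"))
)

# Step 1: latest saleDate per customer (one row per custId).
latest <- aggregate(saleDate ~ custId, events22, max)

# Step 2: merge on the shared columns (custId, saleDate) to recover DelivDate.
result <- merge(events22, latest)
print(result)
```

Note that if a customer has several sales on the same latest date, merge keeps all of them, so the result can contain more than one row for that customer.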
Benchmarks
I'm not a benchmarking or data.table expert, but it surprised me that data.table is not faster here. My suspicion is that the results would be quite different on a larger dataset, for instance your 400k-line one. Anyway, here's some benchmarking code modeled after @mnel's answer here, so you can run some tests on your actual dataset for future reference.
library(rbenchmark)
First, set up your functions for what you want to benchmark.
DDPLY <- function() {
x <- ddply(events22, .(custId), .inform = T,
function(x) {
x[x$saleDate == max(x$saleDate),"DelivDate"]})
}
DATATABLE <- function() { x <- dt[, .SD[which.max(saleDate), ], by = custId] }
AGG1 <- function() {
x <- merge(events22, aggregate(saleDate ~ custId, events22, max)) }
AGG2 <- function() {
x <- aggregate(list(DelivDate = events22$saleDate),
list(custId = events22$custId),
function(x) events22[["DelivDate"]][which.max(x)]) }
SQLDF <- function() {
x <- sqldf("select custId, DelivDate, max(saleDate) `saleDate`
from events22 group by custId") }
DOCALL <- function() {
do.call(rbind,
lapply(split(events22, events22$custId), function(x){
x[which.max(x$saleDate), ]
})
)
}
Second, do the benchmarking.
benchmark(DDPLY(), DATATABLE(), AGG1(), AGG2(), SQLDF(), DOCALL(),
order = "elapsed")[1:5]
# test replications elapsed relative user.self
# 4 AGG2() 100 0.285 1.000 0.284
# 3 AGG1() 100 0.891 3.126 0.896
# 6 DOCALL() 100 1.202 4.218 1.204
# 2 DATATABLE() 100 1.251 4.389 1.248
# 1 DDPLY() 100 1.254 4.400 1.252
# 5 SQLDF() 100 2.109 7.400 2.108
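If you do not have the real data at hand, one way to probe the "larger dataset" suspicion is to simulate a bigger table and time two of the approaches with base R's system.time. This is only a sketch: the data frame big, its size, and the helper names agg_merge and split_bind are made up here, and the timings will depend on your machine.

```r
set.seed(42)
n <- 20000  # hypothetical size; increase towards 400k to mimic the real data

# Simulated data with the same column names as events22.
big <- data.frame(
  custId   = sample(1:2000, n, replace = TRUE),
  saleDate = as.Date("2013-01-01") + sample(0:364, n, replace = TRUE)
)
big$DelivDate <- big$saleDate + sample(1:10, n, replace = TRUE)

# Approach 1: aggregate + merge (keeps ties on the latest saleDate).
agg_merge <- function(d) merge(d, aggregate(saleDate ~ custId, d, max))

# Approach 2: split + which.max + rbind (exactly one row per customer).
split_bind <- function(d) {
  do.call(rbind, lapply(split(d, d$custId), function(x) x[which.max(x$saleDate), ]))
}

t_agg   <- system.time(res_agg   <- agg_merge(big))
t_split <- system.time(res_split <- split_bind(big))
print(rbind(agg_merge = t_agg, split_bind = t_split)[, 1:3])
```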
Aggregate sum and mean in R with ddply
Another solution uses dplyr. First you apply both aggregate functions to every variable you want aggregated. From the resulting variables you then select only the desired function/variable combinations.
library(dplyr)
library(ggplot2)
diamonds %>%
group_by(cut) %>%
summarise_each(funs(sum, mean), x:z, price) %>%
select(cut, matches("[xyz]_sum"), price_mean)
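summarise_each() and funs() are deprecated in current dplyr releases. A sketch of the same result with across(), assuming dplyr 1.0.0 or later and that ggplot2 is installed for the diamonds data, computes only the wanted combinations directly, so no select step is needed:

```r
library(dplyr)
data("diamonds", package = "ggplot2")

res <- diamonds %>%
  group_by(cut) %>%
  summarise(
    # sums of x, y and z, named x_sum, y_sum, z_sum
    across(c(x, y, z), sum, .names = "{.col}_sum"),
    # mean of price only
    price_mean = mean(price)
  )
print(res)
```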
A faster way than applying 'ddply' to aggregate a variable by a function by groups
A more efficient data.table option would be to use the := and set functions. According to ?':='
set is a low-overhead loop-able version of :=. It is particularly useful for repetitively updating rows of certain columns by reference (using a for-loop).
Also, based on ?setorder
setorder (and setorderv) reorders the rows of a data.table based on the columns (and column order) provided. It reorders the table by reference and is therefore very memory efficient.
The code below assigns (:=) by reference, grouped first by 'oid', 'aid' and then by 'oid', 'pid', and does the ordering with setorder, which makes it more efficient. A copy of the original object is made first so that it does not get changed during the assignment.
library(data.table)
t3 <- copy(t)
t3[, aj1.consecutive := (sum(j2.consecutive) -
j2.consecutive)/(.N-1), .(oid, aid)]
setorder(t3, oid, pid)
t3[, apj1.consecutive := (sum(j2.consecutive) -
j2.consecutive)/(.N-1), .(oid, pid)]
setorder(t3, aid, o4.in)
-checking with OP's output
all.equal(t2, t3, check.attributes = FALSE)
[1] TRUE
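The expression (sum(x) - x)/(.N - 1) used above is a leave-one-out group mean: for each row, the mean of all other rows in the same group. A minimal sketch on a made-up table (the names dt, g, v and loo_mean are hypothetical) shows := adding the column by reference and setorder reordering in place:

```r
library(data.table)

# Hypothetical toy table: two groups with 3 and 2 rows.
dt <- data.table(g = c("a", "a", "a", "b", "b"),
                 v = c(1, 2, 3, 10, 20))

# Leave-one-out mean per group, added by reference (no copy of dt is made).
# Groups with a single row would give NaN here, because .N - 1 is 0.
dt[, loo_mean := (sum(v) - v) / (.N - 1), by = g]

# Reorder the rows in place: g ascending, v descending.
setorder(dt, g, -v)
print(dt)
```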
Is there a faster way than applying 'ddply' to aggregate columns by groups with a large dataset?
We can use data.table methods to make it faster:
library(data.table)
df2 <- copy(df)
df3 <- setDT(df2)[order(o4.in)]
df3[, oid.lag.a := shift(oid), by = aid
][, oid.lag.p := shift(oid), by = pid]
df3[, j2.consecutive := fcoalesce(+(oid.lag.a == oid.lag.p), 0L)]
Also, note that some things in the OP's code are unnecessary, e.g. using ifelse to convert a logical to binary. That can simply be as.integer or coercion with +. The second ifelse line can be removed as well by using coalesce:
library(dplyr)
out <- df %>%
arrange(o4.in) %>%
group_by(aid) %>%
mutate(
oid.lag.a = lag(oid)
) %>%
group_by(pid) %>%
mutate(
oid.lag.p = lag(oid)
) %>%
ungroup %>%
mutate(j2.consecutive = coalesce(+(oid.lag.a == oid.lag.p), 0L))
-checking the output from dplyr/data.table
all(out$j2.consecutive == df3$j2.consecutive )
[1] TRUE
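The coercion point can be checked in base R alone. In this small sketch (the vector flag is made up), unary + and as.integer give the same 0/1 integers as the ifelse version, and the remaining NA is then set to 0, which is what coalesce/fcoalesce do in the code above:

```r
flag <- c(TRUE, FALSE, NA, TRUE)

via_ifelse <- ifelse(flag, 1L, 0L)  # the verbose way
via_plus   <- +flag                 # unary plus coerces logical to integer
via_int    <- as.integer(flag)      # explicit coercion

# Replace the NA by 0, the base R counterpart of coalesce(x, 0L).
filled <- via_plus
filled[is.na(filled)] <- 0L
print(filled)
```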
How to use a for loop to use ddply on multiple columns?
The OP asked for a simple for-loop to do this transformation. I understand that there are many other, more optimized ways to solve this, but to respect the OP's wish I have used a for-loop based solution. I have used dplyr, as plyr is old now.
library(dplyr)
Subject <- c(rep(1, times = 6), rep(2, times = 6))
GroupOfInterest <- c(letters[rep(1:3, times = 4)])
Feature1 <- sample(1:20, 12, replace = T)
Feature2 <- sample(400:500, 12, replace = T)
Feature3 <- sample(1:5, 12, replace = T)
#small change in the way data.frame is created
df.main <- data.frame(Subject,GroupOfInterest, Feature1, Feature2,
Feature3, stringsAsFactors = FALSE)
Feat <- c(colnames(df.main[3:5]))
# Ready with Key columns on which grouping is done
resultdf <- unique(select(df.main, Subject, GroupOfInterest))
#> resultdf
# Subject GroupOfInterest
#1 1 a
#2 1 b
#3 1 c
#7 2 a
#8 2 b
#9 2 c
#For loop for each column
for(q in Feat){
summean <- paste0('mean(', q, ')')
summ_name <- paste0(q) # Name of the column that stores the mean
df_sum <- df.main %>%
group_by(Subject, GroupOfInterest) %>%
summarise_(.dots = setNames(summean, summ_name))
# merge the new mean column into resultdf
resultdf <- merge(resultdf, df_sum, by = c("Subject", "GroupOfInterest"))
}
# Final result
#> resultdf
# Subject GroupOfInterest Feature1 Feature2 Feature3
#1 1 a 6.5 473.0 3.5
#2 1 b 4.5 437.0 2.0
#3 1 c 12.0 415.5 3.5
#4 2 a 10.0 437.5 3.0
#5 2 b 3.0 447.0 4.5
#6 2 c 6.0 462.0 2.5
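summarise_() is deprecated in current dplyr. A sketch of the same loop without building expression strings, assuming dplyr 1.0.0 or later, uses the .data pronoun to look up the column named by q. The data below is a deterministic stand-in (made-up values) so the means are easy to verify by hand:

```r
library(dplyr)

# Hypothetical, deterministic stand-in for df.main.
df.main <- data.frame(
  Subject = rep(1:2, each = 6),
  GroupOfInterest = rep(letters[1:3], times = 4),
  Feature1 = 1:12,
  Feature2 = seq(10, 120, by = 10),
  stringsAsFactors = FALSE
)
Feat <- c("Feature1", "Feature2")

# Key columns on which the grouping is done.
resultdf <- unique(df.main[, c("Subject", "GroupOfInterest")])

for (q in Feat) {
  df_mean <- df.main %>%
    group_by(Subject, GroupOfInterest) %>%
    # !!q := names the output column after q; .data[[q]] reads that column
    summarise(!!q := mean(.data[[q]]), .groups = "drop")
  resultdf <- merge(resultdf, df_mean, by = c("Subject", "GroupOfInterest"))
}
print(resultdf)
```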
difference between ddply and aggregate
aggregate calls FUN on each column independently, which is why you get independent means. ddply is going to pass all columns to the function. A quick demonstration of what is being passed in aggregate may be in order:
Some sample data for demonstration:
d <- data.frame(a=1:4, b=5:8, c=c(1,1,2,2))
> d
a b c
1 1 5 1
2 2 6 1
3 3 7 2
4 4 8 2
By using the function print and ignoring the result of the commands aggregate or ddply, we can see what gets passed to the function in each iteration.
aggregate:
tmp <- aggregate(d[1:2], by=list(d$c), print)
[1] 1 2
[1] 3 4
[1] 5 6
[1] 7 8
Note that individual columns are sent to print.
ddply:
tmp <- ddply(d, .(c), print)
a b c
1 1 5 1
2 2 6 1
a b c
3 3 7 2
4 4 8 2
Note that data frames are being sent to print.
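A practical consequence: a summary that needs two columns at once cannot be written as an aggregate FUN, but it works on whole pieces. The sketch below uses base R's split/sapply as a stand-in for ddply (an assumption made to keep the example free of plyr) and computes a mean of a weighted by b within each level of c:

```r
d <- data.frame(a = 1:4, b = 5:8, c = c(1, 1, 2, 2))

# aggregate: FUN sees one column at a time, so only per-column summaries work.
col_means <- aggregate(d[1:2], by = list(c = d$c), FUN = mean)

# Piece-wise: each piece is a data frame, so a and b can be combined.
wm <- sapply(split(d, d$c), function(piece) weighted.mean(piece$a, piece$b))

print(col_means)
print(wm)
```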
How to aggregate using ddply when not all elements of a variable exist on R
We can use the by function:
by(data_sample, data_sample$Person, FUN = function(dat){
sum(dat[dat$Expense_Type == 'j',]$Expense_Value) / sum(dat$Expense_Value)
})
We could also make use of the dplyr package:
library(dplyr)
data_sample %>%
group_by(Person) %>%
summarise(Percent_J = sum(ifelse(Expense_Type == 'j', Expense_Value, 0)) / sum(Expense_Value))
# A tibble: 5 × 2
Person Percent_J
<dbl> <dbl>
1 0 0.0000000
2 123 0.6666667
3 134 0.0000000
4 145 1.0000000
5 321 0.0000000
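For a self-contained check of the by approach, here is a sketch with an invented data_sample (these rows are not the question's data; only the column names are reused). A person without any 'j' expense gets a share of 0, because the sum over an empty selection is 0:

```r
# Hypothetical data; the column names follow the question.
data_sample <- data.frame(
  Person        = c(123, 123, 145, 321),
  Expense_Type  = c("j", "f", "j", "f"),
  Expense_Value = c(20, 10, 5, 8),
  stringsAsFactors = FALSE
)

# Share of type 'j' expenses per person.
share_j <- by(data_sample, data_sample$Person, function(dat) {
  sum(dat$Expense_Value[dat$Expense_Type == "j"]) / sum(dat$Expense_Value)
})

# Flatten the 'by' object into an ordinary data frame.
result <- data.frame(Person    = as.numeric(names(share_j)),
                     Percent_J = as.numeric(share_j))
print(result)
```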
Aggregate over categories that contain NAs with ddply and lapply?
I know the question explicitly requests a ddply()/lapply() solution.
But ... if you are willing to come on over to the dark side, here is a data.table()-based function that should do the trick:
# Convert mydata to a data.table
library(data.table)
dt <- data.table(mydata, key = c("group", "size"))
# Define workhorse function
myfunction <- function(dt, VAR) {
E <- as.name(substitute(VAR))
dt[i = !is.na(eval(E)),
j = {n <- sum(.SD[,someValue])
.SD[, list(sumTest = sum(someValue),
sumTestTotal = n,
share = sum(someValue)/n),
by = VAR]
},
by = key(dt)]
}
# Test it out
s1 <- myfunction(dt, "category")
s2 <- myfunction(dt, "categoryA")
ADDED ON EDIT
Here's how you could run this for a vector of different categorical variables:
catVars <- c("category", "categoryA")
ll <- lapply(catVars,
FUN = function(X) {
do.call(myfunction, list(dt, X))
})
names(ll) <- catVars
lapply(ll, head, 3)
# $category
# group size category sumTest sumTestTotal share
# [1,] A H 2 46 185 0.2486486
# [2,] A H 3 93 185 0.5027027
# [3,] A H 1 46 185 0.2486486
#
# $categoryA
# group size categoryA sumTest sumTestTotal share
# [1,] A H A 79 200 0.395
# [2,] A H X 68 200 0.340
# [3,] A H Z 53 200 0.265
ddply aggregated column names
You can use summarise:
agg_data <- ddply(raw_data, .(id, date, classification), summarise, "no_entries" = nrow(piece))
or you can use length(<column_name>) if nrow(piece) doesn't work. For instance, here's an example that should be runnable by anyone:
ddply(baseball, .(year), summarise, newColumn = nrow(piece))
or
ddply(baseball, .(year), summarise, newColumn = length(year))
EDIT
Or, as Joshua comments, the all-caps version, NROW, does the checking for you.
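The difference between the two is easy to see in base R. In this small sketch (the objects year and piece are made up), nrow returns NULL for a plain vector, while NROW treats the vector as a one-column matrix and returns its length:

```r
year  <- c(1871, 1871, 1872)      # a plain vector, as seen inside summarise
piece <- data.frame(year = year)  # a data frame piece

n_vec_lower <- nrow(year)   # NULL: vectors have no dim attribute
n_vec_upper <- NROW(year)   # 3
n_df_lower  <- nrow(piece)  # 3
n_df_upper  <- NROW(piece)  # 3
```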
(How) can I use ddply to summarize a dataframe grouped by two factors?
Just remove the c in the .variables argument, so your code is:
library(plyr)
ddply(ExampleData, .(Condition, Block), summarize, Average=mean(Var1, na.rm=TRUE), SD=sd(Var1),N=length(Var1), Med =median(Var1))
By the way, you might want to switch to using dplyr instead of plyr.
https://blog.rstudio.com/2014/01/17/introducing-dplyr/
If you were to do this in dplyr:
summarize(group_by(ExampleData, Condition, Block), Average=mean(Var1, na.rm=TRUE), SD=sd(Var1),N=length(Var1), Med =median(Var1))
You could also use the piping so this could be:
ExampleData %>%
group_by(Condition, Block) %>%
summarise(Average=mean(Var1, na.rm=TRUE),
SD=sd(Var1),
N=length(Var1),
Med =median(Var1))