R: Use ddply or aggregate

I, too, would recommend data.table here, but since you asked for an aggregate solution, here is one which combines aggregate and merge to get all the columns:

merge(events22, aggregate(saleDate ~ custId, events22, max))

Or just aggregate if you only want the "custId" and "DelivDate" columns:

aggregate(list(DelivDate = events22$saleDate),
          list(custId = events22$custId),
          function(x) events22[["DelivDate"]][which.max(x)])

Finally, here's an option using sqldf:

library(sqldf)
sqldf("select custId, DelivDate, max(saleDate) `saleDate`
from events22 group by custId")
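
Note that this relies on SQLite's special handling of bare columns alongside min()/max(): in a grouped query that selects max(saleDate), the other selected columns are taken from the row where that maximum was found.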

Benchmarks

I'm not a benchmarking or data.table expert, but it surprised me that data.table is not faster here. My suspicion is that the results would be quite different on a larger dataset, say, your 400k-row one. Anyway, here's some benchmarking code modeled after @mnel's answer so you can run some tests on your actual dataset for future reference.

library(rbenchmark)
library(plyr)        # ddply()
library(data.table)  # data.table syntax
library(sqldf)       # sqldf()

# Assumed setup: a data.table copy of events22 for the DATATABLE() function
dt <- data.table(events22, key = "custId")

First, set up your functions for what you want to benchmark.

DDPLY <- function() {
  x <- ddply(events22, .(custId), .inform = TRUE,
             function(x) {
               x[x$saleDate == max(x$saleDate), "DelivDate"]
             })
}
DATATABLE <- function() { x <- dt[, .SD[which.max(saleDate), ], by = custId] }
AGG1 <- function() {
  x <- merge(events22, aggregate(saleDate ~ custId, events22, max))
}
AGG2 <- function() {
  x <- aggregate(list(DelivDate = events22$saleDate),
                 list(custId = events22$custId),
                 function(x) events22[["DelivDate"]][which.max(x)])
}
SQLDF <- function() {
  x <- sqldf("select custId, DelivDate, max(saleDate) `saleDate`
              from events22 group by custId")
}
DOCALL <- function() {
  do.call(rbind,
          lapply(split(events22, events22$custId), function(x) {
            x[which.max(x$saleDate), ]
          }))
}

Second, do the benchmarking.

benchmark(DDPLY(), DATATABLE(), AGG1(), AGG2(), SQLDF(), DOCALL(),
          order = "elapsed")[1:5]
#          test replications elapsed relative user.self
# 4      AGG2()          100   0.285    1.000     0.284
# 3      AGG1()          100   0.891    3.126     0.896
# 6    DOCALL()          100   1.202    4.218     1.204
# 2 DATATABLE()          100   1.251    4.389     1.248
# 1     DDPLY()          100   1.254    4.400     1.252
# 5     SQLDF()          100   2.109    7.400     2.108

Aggregate sum and mean in R with ddply

Another solution uses dplyr. First apply both aggregate functions to every variable you want aggregated; from the resulting variables, select only the desired function/variable combinations.

library(dplyr)
library(ggplot2)

diamonds %>%
  group_by(cut) %>%
  summarise_each(funs(sum, mean), x:z, price) %>%
  select(cut, matches("[xyz]_sum"), price_mean)
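
Note that summarise_each() and funs() are deprecated in current dplyr. A rough equivalent with across() (dplyr >= 1.0) would be:

diamonds %>%
  group_by(cut) %>%
  summarise(across(c(x, y, z), sum, .names = "{.col}_sum"),
            price_mean = mean(price))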

A faster way than applying 'ddply' to aggregate a variable by a function by groups

A more efficient data.table option would be with := and set functions. According to ?':='

set is a low-overhead loop-able version of :=. It is particularly useful for repetitively updating rows of certain columns by reference (using a for-loop).

Also, based on ?setorder

setorder (and setorderv) reorders the rows of a data.table based on the columns (and column order) provided. It reorders the table by reference and is therefore very memory efficient.

The code below assigns (:=) by reference, grouped by 'oid'/'aid' or 'oid'/'pid', and orders with setorder, making it more efficient. A copy of the original object is made first so that it doesn't get changed while doing the assignment.

library(data.table)
t3 <- copy(t)

t3[, aj1.consecutive := (sum(j2.consecutive) - j2.consecutive)/(.N - 1),
   .(oid, aid)]
setorder(t3, oid, pid)
t3[, apj1.consecutive := (sum(j2.consecutive) - j2.consecutive)/(.N - 1),
   .(oid, pid)]
setorder(t3, aid, o4.in)

Checking against the OP's output:

all.equal(t2, t3, check.attributes = FALSE)
[1] TRUE
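
As an aside, the loop-able set() mentioned in the quoted documentation looks roughly like this; a minimal sketch on a hypothetical data.table DT, not part of the original answer:

library(data.table)
DT <- data.table(a = 1:5, b = 6:10)  # hypothetical example data
# set() updates columns by reference with very little per-call overhead,
# which makes it cheap to call repeatedly inside a for-loop
for (j in c("a", "b")) {
  set(DT, j = j, value = DT[[j]] * 2)
}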

Is there a faster way than applying 'ddply' to aggregate columns by groups with a large dataset?

We can use data.table methods to make it faster

library(data.table)
df2 <- copy(df)
df3 <- setDT(df2)[order(o4.in)]
df3[, oid.lag.a := shift(oid), by = aid
][, oid.lag.p := shift(oid), by = pid]

df3[, j2.consecutive := fcoalesce(+(oid.lag.a == oid.lag.p), 0L)]

Also, note that some steps in the OP's code are unnecessary, e.g. using ifelse to convert a logical to binary; as.integer or coercion with + does the same thing. The second ifelse line can likewise be replaced with coalesce.
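
A quick illustration of that point, on a toy vector:

x <- c(TRUE, NA, FALSE)
ifelse(is.na(x), 0, as.integer(x))  # verbose: 1 0 0
dplyr::coalesce(+x, 0L)             # same result: +x gives 1L NA 0L, coalesce fills the NA

With that, the cleaned-up dplyr version: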

library(dplyr)
out <- df %>%
  arrange(o4.in) %>%
  group_by(aid) %>%
  mutate(oid.lag.a = lag(oid)) %>%
  group_by(pid) %>%
  mutate(oid.lag.p = lag(oid)) %>%
  ungroup %>%
  mutate(j2.consecutive = coalesce(+(oid.lag.a == oid.lag.p), 0))

Checking the output from dplyr/data.table:

all(out$j2.consecutive == df3$j2.consecutive)
[1] TRUE

How to use a for loop to use ddply on multiple columns?

The OP asked for a simple for-loop for this transformation. I understand that there are many more optimized ways to solve this, but to respect the OP's wish I tried a for-loop-based solution. I have used dplyr, as plyr is old now.

library(dplyr)
Subject <- c(rep(1, times = 6), rep(2, times = 6))
GroupOfInterest <- c(letters[rep(1:3, times = 4)])
Feature1 <- sample(1:20, 12, replace = TRUE)
Feature2 <- sample(400:500, 12, replace = TRUE)
Feature3 <- sample(1:5, 12, replace = TRUE)
# small change in the way the data.frame is created
df.main <- data.frame(Subject, GroupOfInterest, Feature1, Feature2,
                      Feature3, stringsAsFactors = FALSE)

Feat <- c(colnames(df.main[3:5]))

# Ready with key columns on which grouping is done
resultdf <- unique(select(df.main, Subject, GroupOfInterest))
#> resultdf
#   Subject GroupOfInterest
# 1       1               a
# 2       1               b
# 3       1               c
# 7       2               a
# 8       2               b
# 9       2               c

# For loop over each feature column
for (q in Feat) {
  summean <- paste0('mean(', q, ')')
  summ_name <- paste0(q)  # name of the column that stores the mean
  df_sum <- df.main %>%
    group_by(Subject, GroupOfInterest) %>%
    summarise_(.dots = setNames(summean, summ_name))
  # merge the new mean column into resultdf
  resultdf <- merge(resultdf, df_sum, by = c("Subject", "GroupOfInterest"))
}

# Final result
#> resultdf
#   Subject GroupOfInterest Feature1 Feature2 Feature3
# 1       1               a      6.5    473.0      3.5
# 2       1               b      4.5    437.0      2.0
# 3       1               c     12.0    415.5      3.5
# 4       2               a     10.0    437.5      3.0
# 5       2               b      3.0    447.0      4.5
# 6       2               c      6.0    462.0      2.5
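
For reference, since summarise_() is deprecated in current dplyr, the same result can be obtained without the loop; a sketch using across():

library(dplyr)
resultdf2 <- df.main %>%
  group_by(Subject, GroupOfInterest) %>%
  summarise(across(all_of(Feat), mean), .groups = "drop")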

Difference between ddply and aggregate

aggregate calls FUN on each column independently, which is why you get independent means. ddply is going to pass all columns to the function. A quick demonstration of what is being passed in aggregate may be in order:

Some sample data for demonstration:

d <- data.frame(a = 1:4, b = 5:8, c = c(1, 1, 2, 2))

> d
  a b c
1 1 5 1
2 2 6 1
3 3 7 2
4 4 8 2

By using the function print and ignoring the result of the commands aggregate or ddply, we can see what gets passed to the function in each iteration.

aggregate:

tmp <- aggregate(d[1:2], by=list(d$c), print)
[1] 1 2
[1] 3 4
[1] 5 6
[1] 7 8

Note that individual columns are sent to print.

ddply:

library(plyr)
tmp <- ddply(d, .(c), print)
  a b c
1 1 5 1
2 2 6 1
  a b c
3 3 7 2
4 4 8 2

Note that data frames are being sent to print.
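
So, to reproduce aggregate's per-column means with ddply, one option (a sketch, not part of the original answer) is to spell the columns out with summarise:

ddply(d, .(c), summarise, a = mean(a), b = mean(b))
#   c   a   b
# 1 1 1.5 5.5
# 2 2 3.5 7.5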

How to aggregate using ddply when not all elements of a variable exist in R

We can use the by function:

by(data_sample, data_sample$Person, FUN = function(dat) {
  sum(dat[dat$Expense_Type == 'j', ]$Expense_Value) / sum(dat$Expense_Value)
})
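
Note that by() returns a list-like object of class "by"; wrap the result in c() or unlist() if you need a plain named vector.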

We could also make use of the dplyr package:

library(dplyr)
data_sample %>%
  group_by(Person) %>%
  summarise(Percent_J = sum(ifelse(Expense_Type == 'j', Expense_Value, 0)) /
              sum(Expense_Value))

# A tibble: 5 × 2
  Person Percent_J
   <dbl>     <dbl>
1      0 0.0000000
2    123 0.6666667
3    134 0.0000000
4    145 1.0000000
5    321 0.0000000

Aggregate over categories that contain NAs with ddply and lapply?

I know the question explicitly requests a ddply()/lapply() solution.

But ... if you are willing to come on over to the dark side, here is a data.table-based function that should do the trick:

# Convert mydata to a data.table
library(data.table)
dt <- data.table(mydata, key = c("group", "size"))

# Define workhorse function
myfunction <- function(dt, VAR) {
  E <- as.name(substitute(VAR))
  dt[i = !is.na(eval(E)),
     j = {n <- sum(.SD[, someValue])
          .SD[, list(sumTest = sum(someValue),
                     sumTestTotal = n,
                     share = sum(someValue)/n),
              by = VAR]
     },
     by = key(dt)]
}

# Test it out
s1 <- myfunction(dt, "category")
s2 <- myfunction(dt, "categoryA")

ADDED ON EDIT

Here's how you could run this for a vector of different categorical variables:

catVars <- c("category", "categoryA")

ll <- lapply(catVars,
             FUN = function(X) {
               do.call(myfunction, list(dt, X))
             })
names(ll) <- catVars

lapply(ll, head, 3)
# $category
#      group size category sumTest sumTestTotal     share
# [1,]     A    H        2      46          185 0.2486486
# [2,]     A    H        3      93          185 0.5027027
# [3,]     A    H        1      46          185 0.2486486
#
# $categoryA
#      group size categoryA sumTest sumTestTotal share
# [1,]     A    H         A      79          200 0.395
# [2,]     A    H         X      68          200 0.340
# [3,]     A    H         Z      53          200 0.265

ddply aggregated column names

You can use summarise:

agg_data <- ddply(raw_data, .(id, date, classification), summarise, "no_entries" = nrow(piece))

or you can use length(<column_name>) if nrow(piece) doesn't work. For instance, here's an example that should be runnable by anyone:

ddply(baseball, .(year), summarise, newColumn = nrow(piece))

or

ddply(baseball, .(year), summarise, newColumn = length(year))

EDIT

Or, as Joshua comments, the all-caps version NROW does the checking for you.
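
A quick illustration of the difference:

nrow(1:5)  # NULL - nrow() only works on matrix-like objects
NROW(1:5)  # 5    - NROW() treats a vector as a one-column matrix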

(How) can I use ddply to summarize a dataframe grouped by two factors?

Just remove the c in the .variables argument, so your code is:

library(plyr)
ddply(ExampleData, .(Condition, Block), summarize,
      Average = mean(Var1, na.rm = TRUE), SD = sd(Var1),
      N = length(Var1), Med = median(Var1))

By the way, you might want to switch to using dplyr instead of plyr.
https://blog.rstudio.com/2014/01/17/introducing-dplyr/

If you were to do this in dplyr:

summarize(group_by(ExampleData, Condition, Block),
          Average = mean(Var1, na.rm = TRUE), SD = sd(Var1),
          N = length(Var1), Med = median(Var1))

You could also use piping, so this could be:

ExampleData %>%
  group_by(Condition, Block) %>%
  summarise(Average = mean(Var1, na.rm = TRUE),
            SD = sd(Var1),
            N = length(Var1),
            Med = median(Var1))

