Why Is Plyr So Slow

Why is plyr so slow?

Why is it so slow? A little research turned up a mailing-list posting from Aug. 2011 in which @hadley, the package author, states:

This is a drawback of the way that ddply always works with data
frames. It will be a bit faster if you use summarise instead of
data.frame (because data.frame is very slow), but I'm still thinking
about how to overcome this fundamental limitation of the ddply
approach.


As for writing efficient plyr code, I didn't know either. After a fair amount of parameter testing and benchmarking, it looks like we can do better.

The summarize() in your command is just a helper function, pure and simple. We can replace it with our own sum function, since it isn't doing anything that isn't already simple, and the .data and .(price) arguments can be made more explicit. The result is

ddply( dd[, 2:3], ~price, function(x) sum( x$volume ) )

summarize() may seem nice, but it just isn't quicker than a simple function call. That makes sense: just compare our little function with the code for summarize. Running your benchmarks with the revised call yields a noticeable gain. Don't take that to mean you've used plyr incorrectly; you haven't. It simply isn't efficient, and nothing you can do with it will make it as fast as the other options.

In my opinion the optimized function still stinks: it isn't clear, it must be mentally parsed, and it remains ridiculously slow compared with data.table (even with a 60% gain).


In the same thread mentioned above, regarding the slowness of plyr, a plyr2 project is mentioned. Since the original answer to the question, the plyr author has released dplyr as plyr's successor. While both plyr and dplyr are billed as data-manipulation tools and your primary stated interest is aggregation, you may still be interested in benchmark results for the new package for comparison, as it has a reworked backend that improves performance.

plyr_Original  <- function(dd) ddply(dd, .(price), summarise, ss = sum(volume))
plyr_Optimized <- function(dd) ddply(dd[, 2:3], ~price, function(x) sum(x$volume))

dplyr <- function(dd) dd %.% group_by(price) %.% summarize( sum(volume) )

data_table <- function(dd) dd[, sum(volume), keyby=price]
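
A side note if you are rerunning this today: the %.% operator used in the dplyr wrapper above was deprecated after dplyr 0.1.x in favour of %>%, so a modern equivalent of that wrapper (same aggregation, only the pipe changes) would look like this:

# Same aggregation as the dplyr wrapper above, written with the current pipe.
dplyr_piped <- function(dd) dd %>% group_by(price) %>% summarise(sum(volume))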

The dataframe package has been removed from CRAN and subsequently from the tests, along with the matrix function versions.

Here are the i = 5, j = 8 benchmark results:

$`obs= 500,000 unique prices= 158,286 reps= 5`
                   test elapsed relative
9      data_table(d.dt)   0.074    1.000
4           dplyr(d.dt)   0.133    1.797
3           dplyr(d.df)   1.832   24.757
6         l.apply(d.df)   5.049   68.230
5         t.apply(d.df)   8.078  109.162
8             agg(d.df)  11.822  159.757
7             b.y(d.df)  48.569  656.338
2  plyr_Optimized(d.df) 148.030 2000.405
1   plyr_Original(d.df) 401.890 5430.946

No doubt the optimizing helped a bit. Take a look at the d.df functions; they just can't compete.

For a little perspective on the slowness of the data.frame structure, here are micro-benchmarks of the aggregation times of data_table and dplyr on a larger test dataset (i = 8, j = 8).

$`obs= 50,000,000 unique prices= 15,836,476 reps= 5`
Unit: seconds
             expr    min     lq median     uq    max neval
 data_table(d.dt)  1.190  1.193  1.198  1.460  1.574    10
      dplyr(d.dt)  2.346  2.434  2.542  2.942  9.856    10
      dplyr(d.df) 66.238 66.688 67.436 69.226 86.641    10
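
For reference, a comparison of that shape can be produced with microbenchmark; a minimal sketch, assuming d.df and d.dt already hold the same test data and reusing the wrapper functions defined above (on a current dplyr, swap %.% for %>% as noted earlier):

library(microbenchmark)
# Ten evaluations of each expression, matching the neval column above.
microbenchmark(
  data_table(d.dt),
  dplyr(d.dt),
  dplyr(d.df),
  times = 10L
)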

The data.frame is still left in the dust. Not only that, but here's the elapsed system.time to populate the data structures with the test data:

`d.df` (data.frame)  3.181 seconds.
`d.dt` (data.table) 0.418 seconds.

Both creation and aggregation of the data.frame are slower than for the data.table.
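
For completeness, here is a hypothetical sketch of how such creation timings can be measured; the generator below is made up purely for illustration (the actual test data comes from the data-generating gist linked at the end of this answer):

# Illustrative only: build the same two columns as a data.frame and as a
# data.table and time the construction of each.
n      <- 5e7
price  <- round(runif(n, 1, 1e6), 2)
volume <- sample(1e4, n, replace = TRUE)

system.time(d.df <- data.frame(price = price, volume = volume))
system.time(d.dt <- data.table(price = price, volume = volume))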

Working with data.frames in R is slower than some alternatives, but, as the benchmarks show, even the built-in R functions blow plyr out of the water. Even managing the data.frame the way dplyr does, which improves upon the built-ins, doesn't give optimal speed; data.table is faster in both creation and aggregation, and it does what it does while still working with/upon data.frames.

In the end...

plyr is slow because of the way it works with and manages data.frames.

[punt:: see the comments to the original question].


## R version 3.0.2 (2013-09-25)
## Platform: x86_64-pc-linux-gnu (64-bit)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] microbenchmark_1.3-0 rbenchmark_1.0.0 xts_0.9-7
## [4] zoo_1.7-11 data.table_1.9.2 dplyr_0.1.2
## [7] plyr_1.8.1 knitr_1.5.22
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 evaluate_0.5.2 formatR_0.10.4 grid_3.0.2
## [5] lattice_0.20-27 Rcpp_0.11.0 reshape2_1.2.2 stringr_0.6.2
## [9] tools_3.0.2

Data-Generating gist .rmd

R Plyr Sapply seems to be really slow

We could use a couple of options to improve the speed.

1. stringi

Functions in the stringi package are usually faster. We can extract the alphanumeric characters using stri_extract_all_regex with an appropriate regex; here I am using [[:alnum:]]{2,}, based on the example shown. Then rbind the list elements (do.call(rbind.data.frame, ...)), change the column names with setNames, convert the data.frame to a data.table (setDT), and paste the 'topic' elements grouped by 'itemID' (toString is a wrapper for paste(., collapse = ', ')).
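
Neither this snippet nor the tidyr one further down is runnable without the original df; here is a hypothetical mock of its single column V1, reverse-engineered from the output shown below (the real data may differ):

# Hypothetical mock input: one character column with "topic itemID" pairs.
df <- data.frame(V1 = c("E11 2286", "ECAT 2286", "M11 2286", "M12 2286",
                        "MCAT 2286", "C24 2287"),
                 stringsAsFactors = FALSE)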

library(stringi)
library(data.table)
setDT(setNames(do.call(rbind.data.frame,
                       stri_extract_all_regex(df$V1, '[[:alnum:]]{2,}')),
               c('topic', 'itemID')))[, list(topic = toString(topic)), itemID]
#   itemID                     topic
#1:   2286 E11, ECAT, M11, M12, MCAT
#2:   2287                       C24

2. dplyr/tidyr

We can use extract from tidyr to split the single column into multiple columns by specifying an appropriate regex, and then paste the 'topic' elements grouped by 'itemID'.

library(dplyr)
library(tidyr)
extract(df, V1, into = c('topic', 'itemID'), '([^ ]+) ([^ ]+).*',
        convert = TRUE) %>%
  group_by(itemID) %>%
  summarise(topic = toString(topic))
#  itemID                     topic
#1   2286 E11, ECAT, M11, M12, MCAT
#2   2287                       C24

plyr in R very slow during merging

If I understood correctly what you're trying to achieve, this should do what you want, pretty quickly and without too much memory overhead.

# toy data
A <- data.frame(
  A  = letters[1:10],
  B  = letters[11:20],
  CC = 1:10
)

ord <- sample(1:10)
B <- data.frame(
  A  = letters[1:10][ord],
  B  = letters[11:20][ord],
  CC = (1:10)[ord]
)

# combining values
A.comb <- paste(A$A, A$B, sep = "-")
B.comb <- paste(B$A, B$B, sep = "-")

# matching
A$DD <- B$CC[match(A.comb, B.comb)]
A

This applies only if the combinations are unique. If they're not, you'll have to take care of that first. Without the data it's quite impossible to know what you're trying to achieve exactly in your complete function, but you should be able to port the logic given here to your own case.
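
As a quick guard for that uniqueness assumption, you can check the combined keys before matching; a small sketch:

# 0 means no duplicated A-B combinations; any other value is the index of the
# first duplicate and should be resolved before relying on match().
anyDuplicated(A.comb)
anyDuplicated(B.comb)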

Why is dplyr slower than plyr for data aggregation?

Using data.table

library(data.table)
START = proc.time()
X3 = as.data.table(X)[X[, .I[which.min(flag)] , by = group]$V1]
proc.time() - START
# user system elapsed
# 0.00 0.02 0.02

Or use order

START = proc.time()
X4 = as.data.table(X)[order(flag), .SD[1L] , by = group]
proc.time() - START
# user system elapsed
# 0.02 0.00 0.01

The corresponding timings for dplyr and plyr using the OP's code are

# dplyr
#   user  system elapsed
#   0.28    0.04    2.68

# plyr
#   user  system elapsed
#   0.01    0.06    0.67

Also, as commented by @Frank, a base R method and its timing:

START = proc.time()
Z = X[order(X$flag),]
X5 = with(Z, Z[tapply(seq(nrow(X)), group, head, 1), ])
proc.time() - START
# user system elapsed
# 0.15 0.03 0.65

I am guessing the slice step is what slows dplyr down.
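
For reference, the slice-based dplyr call in question was presumably something along these lines (a hypothetical reconstruction, since the OP's code is not reproduced here):

library(dplyr)
# Hypothetical: keep the row with the smallest flag within each group.
X %>%
  group_by(group) %>%
  slice(which.min(flag))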

plyr::aaply is very slow compared to nested loops

Why not just use the base apply function?

apply(obs, c(2,3), min)

It's fast, doesn't require loading an additional package and gives the same result, as per:

all.equal(
apply(obs, 2:3, min),
aaply(obs, 2:3, min), check.attributes=FALSE)
#[1] TRUE

Timings with system.time() on a 10 x 1350 x 1280 array:

Loop
# user system elapsed
# 3.79 0.00 3.79

Base apply()
# user system elapsed
# 2.87 0.02 2.89

plyr::aaply()
#Timing stopped at: 122.1 0.04 122.24
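
A hypothetical way to reproduce a comparable test on random data (the 10 x 1350 x 1280 dimensions match the timings above, but the values are made up):

# Build a random array of the stated dimensions and time the base apply() call.
set.seed(42)
obs <- array(rnorm(10 * 1350 * 1280), dim = c(10, 1350, 1280))

system.time(r1 <- apply(obs, 2:3, min))
# library(plyr); system.time(r2 <- aaply(obs, 2:3, min))  # much slower, as shown above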

How to improve speed of R code by replacing complex and slow plyr steps with data.table or dplyr?

Your replication with data.table works for me (except that channel is capitalized). Below are my attempts to replicate the first step of your list with dplyr and with data.table.

# required packages
require(plyr)
require(dplyr)
require(data.table)

sample data

mergeddf2 <- data.frame(df.activ.id = 1:5,
                        channel = 1:8,
                        mainID = 1:40,
                        DateTime = Sys.Date() - 80:1,
                        cat = letters[1:6],
                        effrespflag = rnorm(240),
                        othervar = 1,
                        MarketingChannel = 2)

plyr solution

mergeddf3 <- ddply(mergeddf2, .(df.activ.id, channel), summarize,
                   spotsids = paste(mainID, collapse = ","),
                   spotsdt = paste(DateTime, collapse = ","),
                   spotsinfos = paste(cat, collapse = ","),
                   effrespflags = paste(effrespflag, collapse = ","))

dplyr solution

mergeddf3.dplyr <-
  mergeddf2 %>%
  group_by(df.activ.id, channel) %>%
  summarise_each(funs = funs(paste(., collapse = ",")),
                 mainID, DateTime, cat, effrespflag) %>%
  magrittr::set_colnames(c("df.activ.id", "channel", "spotsids",
                           "spotsdt", "spotsinfos", "effrespflags"))
# check for equality
all.equal(mergeddf3, as.data.frame(mergeddf3.dplyr))
## [1] TRUE

data.table solution

setDT(mergeddf2)
mergeddf3test <- mergeddf2[, list(spotsids = paste(mainID, collapse = ","),
                                  spotsdt = paste(DateTime, collapse = ","),
                                  spotsinfos = paste(cat, collapse = ","),
                                  effrespflags = paste(effrespflag, collapse = ",")),
                           by = list(df.activ.id, channel)]
# check for equality
all.equal(mergeddf3, setDF(setkeyv(mergeddf3test, c("df.activ.id", "channel"))))
## [1] TRUE
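
If you want to see how much each rewrite buys you on the sample data, a quick comparison can be run with microbenchmark; a minimal sketch reusing the objects defined above (plyr::summarize is spelled out to avoid masking by dplyr, and the plyr call gets a plain data.frame copy because mergeddf2 was converted with setDT()):

library(microbenchmark)
mergeddf2.df <- as.data.frame(mergeddf2)  # plain data.frame copy for plyr/dplyr

microbenchmark(
  plyr = ddply(mergeddf2.df, .(df.activ.id, channel), plyr::summarize,
               spotsids     = paste(mainID, collapse = ","),
               spotsdt      = paste(DateTime, collapse = ","),
               spotsinfos   = paste(cat, collapse = ","),
               effrespflags = paste(effrespflag, collapse = ",")),
  dplyr = mergeddf2.df %>%
    group_by(df.activ.id, channel) %>%
    summarise_each(funs = funs(paste(., collapse = ",")),
                   mainID, DateTime, cat, effrespflag),
  data.table = mergeddf2[, list(spotsids     = paste(mainID, collapse = ","),
                                spotsdt      = paste(DateTime, collapse = ","),
                                spotsinfos   = paste(cat, collapse = ","),
                                effrespflags = paste(effrespflag, collapse = ",")),
                         by = list(df.activ.id, channel)],
  times = 10L
)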

Why is my code so slow to execute in R?

Here is a new solution with data.table:

dt <- as.data.table(df)

do_stuff <- function(a, b) {
  sdf <- Hmisc::approxExtrap(a, b, xout = c(1:20))
  sdf$z <- stack(pmax(table(factor(as.integer(a), levels = 1:20)), 1))$values - 1
  sdf <- as.data.frame(t(unlist(sdf[c("y", "z")])))
  sdf
}

df_result <- dt[, do_stuff(a, b), by = prod_id]

And a benchmark with the original:

library(microbenchmark)
library(dplyr)
library(data.table)

microbenchmark(
  "original" = {
    df_result <- data.frame()
    prods <- dplyr::distinct(df, prod_id)$prod_id  # distinct prod_id
    for (j in 1:NROW(prods)) {
      dfj <- filter(df, prod_id == prods[j])
      sdf <- as.data.frame(Hmisc::approxExtrap(dfj$a, dfj$b, xout = c(1:20)))  # extrapolating
      sdf$z <- stack(pmax(table(factor(as.integer(dfj$a), levels = 1:20)), 1))[2:1]$values - 1  # increment if a value was there more than 1 time
      sdf <- select_(sdf, "y", "z")
      sdf <- as.data.frame(t(unlist(sdf)))
      df_result <- rbind(df_result, sdf)
    }
  },
  "new" = {
    dt <- as.data.table(df)

    do_stuff <- function(a, b) {
      sdf <- Hmisc::approxExtrap(a, b, xout = c(1:20))
      sdf$z <- stack(pmax(table(factor(as.integer(a), levels = 1:20)), 1))$values - 1
      sdf <- as.data.frame(t(unlist(sdf[c("y", "z")])))
      sdf
    }

    df_result <- dt[, do_stuff(a, b), by = prod_id]
  }
)

Results:

Unit: milliseconds
     expr       min        lq     mean    median        uq       max neval
 original 20.090200 20.841403 22.63290 21.705137 23.479769 32.535576   100
      new  2.063369  2.279269  2.61532  2.411447  2.538806  9.312241   100

Getting Slow Motion Video while using QML Media Player

I have tried disabling vsync, and it's better but not perfect.

QSurfaceFormat format;
format.setProfile ( QSurfaceFormat::CoreProfile );
format.setRenderableType ( QSurfaceFormat::OpenGLES );
format.setSwapInterval ( 0 );
format.setVersion ( 3, 0 );
QSurfaceFormat::setDefaultFormat ( format );

UPDATE: I found that the reason is the performance of the underlying media service. It seems Qt uses the Windows native player (e.g. DirectShow). One solution is to use a custom GStreamer pipeline powered by the NVIDIA decoder, so I'm going to try that and integrate it in QML.


