How Does One Aggregate and Summarize Data Quickly

How does one aggregate and summarize data quickly?

You should look at the data.table package for faster aggregation operations on large data frames. For your problem, the solution would look like this:

library(data.table)
data_t <- data.table(data_tab)
ans <- data_t[, list(A = sum(count), B = mean(count)), by = 'PID,Time,Site']
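More recent data.table code usually writes the same call with the .() alias and an unquoted list of grouping columns; a minimal sketch, assuming the same data_tab with PID, Time, Site, and count columns:

library(data.table)
data_t <- as.data.table(data_tab)
# .() is an alias for list(); by takes the grouping columns unquoted
ans <- data_t[, .(A = sum(count), B = mean(count)), by = .(PID, Time, Site)]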

Summarize data at different aggregate levels - R and tidyverse

Another alternative:

library(tidyverse)  

iris %>%
  mutate_at("Species", as.character) %>%
  list(group_by(., Species), .) %>%
  map(~ summarize(., mean_s_length = mean(Sepal.Length),
                  max_s_width = max(Sepal.Width))) %>%
  bind_rows() %>%
  replace_na(list(Species = "Overall"))
#> # A tibble: 4 x 3
#>   Species    mean_s_length max_s_width
#>   <chr>              <dbl>       <dbl>
#> 1 setosa              5.01         4.4
#> 2 versicolor          5.94         3.4
#> 3 virginica           6.59         3.8
#> 4 Overall             5.84         4.4
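If the list()/map() step feels opaque, the same result can be sketched more explicitly by computing the per-group and overall summaries separately and binding them; this assumes nothing beyond the iris columns used above:

library(dplyr)

# Per-species summary
per_species <- iris %>%
  group_by(Species) %>%
  summarize(mean_s_length = mean(Sepal.Length),
            max_s_width = max(Sepal.Width)) %>%
  mutate(Species = as.character(Species))

# Overall summary, labelled "Overall"
overall <- iris %>%
  summarize(mean_s_length = mean(Sepal.Length),
            max_s_width = max(Sepal.Width)) %>%
  mutate(Species = "Overall")

bind_rows(per_species, overall)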

Aggregate / summarize multiple variables per group (e.g. sum, mean)


You could also use the reshape2 package for this task:

require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
  year month         x1           x2
1 2000     1  -80.83405 -224.9540159
2 2000     2 -223.76331 -288.2418017
3 2000     3 -188.83930 -481.5601913
4 2000     4 -197.47797 -473.7137420
5 2000     5 -259.07928 -372.4563522
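For comparison, a dplyr sketch of the same year/month sums, assuming df1 has the date, year, month, x1, and x2 columns implied by the output above:

library(dplyr)

df1 %>%
  group_by(year, month) %>%
  summarise(across(c(x1, x2), sum), .groups = "drop")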

Use data.table to count and aggregate / summarize a column

The post you are referring to shows how to apply one aggregation method to several columns. If you want to apply different aggregation methods to different columns, you can do:

dat[, .(count = .N, var = sum(VAR)), by = MNTH]

This results in:

     MNTH count var
1: 201501     4   2
2: 201502     3   0
3: 201503     5   2
4: 201504     4   2

You can also add these values to your existing dataset by updating it by reference:

dat[, `:=` (count = .N, var = sum(VAR)), by = MNTH]

This results in:

> dat
      MNTH VAR count var
 1: 201501   1     4   2
 2: 201501   1     4   2
 3: 201501   0     4   2
 4: 201501   0     4   2
 5: 201502   0     3   0
 6: 201502   0     3   0
 7: 201502   0     3   0
 8: 201503   0     5   2
 9: 201503   0     5   2
10: 201503   1     5   2
11: 201503   1     5   2
12: 201503   0     5   2
13: 201504   1     4   2
14: 201504   0     4   2
15: 201504   1     4   2
16: 201504   0     4   2
For further reading about how to use data.table syntax, see the Getting started guides on the GitHub wiki.

How to use aggregate and summary function to get unique columns in a dataframe?

Since aggregate's simplify argument defaults to TRUE, it simplifies the results of calling the function (here, summary) into a matrix column. You can rebuild a flat data.frame by coercing that matrix column into its own data.frame:

df <- data.frame(Result = c(1, 1, 2, 100, 50, 30, 45, 20, 10, 8),
                 Location = c("Alpha", "Beta", "Gamma", "Alpha", "Beta",
                              "Gamma", "Alpha", "Beta", "Gamma", "Alpha"))

Agg <- aggregate(df$Result, list(df$Location), summary)

data.frame(Location = Agg$Group.1, Agg$x)
#>   Location Min. X1st.Qu. Median     Mean X3rd.Qu. Max.
#> 1    Alpha    1     6.25   26.5 38.50000    58.75  100
#> 2     Beta    1    10.50   20.0 23.66667    35.00   50
#> 3    Gamma    2     6.00   10.0 14.00000    20.00   30
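An equivalent base-R shortcut on the same Agg object is to let do.call() flatten the matrix column for you:

# Spreads the matrix column Agg$x into ordinary columns (x.Min., x.1st.Qu., ...)
do.call(data.frame, Agg)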

Alternatively, dplyr's summarise family of functions handles multiple summary statistics well:

library(dplyr)

df %>% group_by(Location) %>% summarise_all(funs(min, median, max))
#> # A tibble: 3 x 4
#>   Location   min median   max
#>   <fct>    <dbl>  <dbl> <dbl>
#> 1 Alpha       1.   26.5  100.
#> 2 Beta        1.   20.0   50.
#> 3 Gamma       2.   10.0   30.
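Note that funs() is deprecated in current dplyr; a sketch of the same summary with across() (the columns come out named Result_min, Result_median, Result_max) would be:

df %>%
  group_by(Location) %>%
  summarise(across(Result, list(min = min, median = median, max = max)))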

If you really want all of summary, you can use broom::tidy to turn each group's results into a data frame in a list column, which can be unnested:

df %>%
  group_by(Location) %>%
  summarise(x = list(broom::tidy(summary(Result)))) %>%
  tidyr::unnest()
#> # A tibble: 3 x 7
#>   Location minimum    q1 median  mean    q3 maximum
#>   <fct>      <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
#> 1 Alpha         1.  6.25   26.5  38.5  58.8    100.
#> 2 Beta          1. 10.5    20.0  23.7  35.0     50.
#> 3 Gamma         2.  6.00   10.0  14.0  20.0     30.

faster way to create variable that aggregates a column by id

For any kind of aggregation where you want a resulting vector the same length as the input, with the aggregated value repeated within each group, ave() is what you want:

df$perc.total <- ave(df$cand.perc, df$id, FUN = sum)
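A dplyr equivalent of that same-length, by-group sum, assuming the same df with id and cand.perc columns:

library(dplyr)

df <- df %>%
  group_by(id) %>%
  mutate(perc.total = sum(cand.perc)) %>%  # one value per group, recycled to every row
  ungroup()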

Aggregate takes a long time

Have you tried loading your initial table with the data.table package? fread() alone will save a significant amount of time when reading 100 million rows.

DT <- fread("path/to/file.csv")

Then you can aggregate fairly quickly with:

DT[ , AggColumn := sum(time), by = id]
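That := call keeps every row and adds the aggregate alongside it; if you want one row per id instead, a sketch on the same DT:

# Collapse to one row per id with the summed time
agg <- DT[, .(AggColumn = sum(time)), by = id]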

Faster function than aggregate() in R

I'm sure the real data is much larger, but your solution seems on point. As alternatives, I benchmarked a couple of other approaches:

Tidyverse

tidy_fn <- function() {
  rbind(old.data, new.data) %>%
    group_by(id) %>%
    dplyr::summarise_all(function(x) sum(x))
}

plyr and base functions (I know... bad form)

plyr_base_fn <- function() {
  plyr::ldply(Map(function(x) {
    sapply(x[1:3], sum)
  }, rbind(old.data, new.data) %>% split(., .$id)))
}

Your aggregation approach:

agg_fn <- function() {
  aggregate(cbind(x, y, z) ~ id, rbind(old.data, new.data), sum, na.rm = FALSE)
}

Results from two tests:

1000 reps

> microbenchmark(tidy_fn(), agg_fn(), plyr_base_fn(), times = 1000L)
Unit: milliseconds
           expr      min       lq     mean   median       uq       max neval
      tidy_fn() 2.220585 2.386112 2.823122 2.529649 2.775300 13.425573  1000
       agg_fn() 1.668601 1.795527 2.149068 1.895666 2.062904 16.117802  1000
 plyr_base_fn() 1.253772 1.331501 1.567777 1.402464 1.526089  8.396307  1000

5000 reps

> microbenchmark(tidy_fn(), agg_fn(), plyr_base_fn(), times = 5000L)
Unit: milliseconds
           expr      min       lq     mean   median       uq       max neval
      tidy_fn() 2.227752 2.400265 2.696034 2.542617 2.722082  12.46249  5000
       agg_fn() 1.673647 1.792085 2.067232 1.897011 2.019915 301.84694  5000
 plyr_base_fn() 1.247306 1.336010 1.503682 1.411608 1.503290  14.24656  5000
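If the real data is large, a data.table variant may also be worth adding to the benchmark; a sketch, assuming the same old.data/new.data with an id column and the numeric columns to sum:

library(data.table)

dt_fn <- function() {
  # rbindlist + lapply over .SD sums every non-grouping column by id
  rbindlist(list(old.data, new.data))[, lapply(.SD, sum), by = id]
}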

Summarizing count and conditional aggregate functions on the same factor

Assuming that your original dataset is similar to the one you created (i.e. with "NA" as a character string): you could specify na.strings while reading the data with read.table, but NAs should be detected automatically anyway.

The price column is a factor, which needs to be converted to numeric. When you use as.numeric on the character values, all non-numeric elements (i.e. "NA", FALSE) get coerced to NA with a warning.

library(dplyr)
df %>%
  mutate(price = as.numeric(as.character(price))) %>%
  group_by(company, year, product) %>%
  summarise(total.count = n(),
            count = sum(is.na(price)),
            avg.price = mean(price, na.rm = TRUE),
            max.price = max(price, na.rm = TRUE))

data

I am using the same dataset (except the ... row) that was shown.

df <- tbl_df(data.frame(company = c("Acme", "Meca", "Emca", "Acme", "Meca", "Emca"),
                        year = c("2011", "2010", "2009", "2011", "2010", "2013"),
                        product = c("Wrench", "Hammer", "Sonic Screwdriver", "Fairy Dust",
                                    "Kindness", "Helping Hand"),
                        price = c("5.67", "7.12", "12.99", "10.99", "NA", FALSE)))

