How Does One Aggregate and Summarize Data Quickly

How does one aggregate and summarize data quickly?

You should look at the data.table package for faster aggregation operations on large data frames. For your problem, the solution would look like this:

library(data.table)
data_t <- data.table(data_tab)
ans <- data_t[, list(A = sum(count), B = mean(count)), by = 'PID,Time,Site']
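More recent data.table code usually writes the same call with the .() alias and an unquoted list of grouping columns; a minimal sketch, assuming the same data_tab with PID, Time, Site, and count columns:

library(data.table)
data_t <- as.data.table(data_tab)
# .() is an alias for list(); by takes the grouping columns unquoted
ans <- data_t[, .(A = sum(count), B = mean(count)), by = .(PID, Time, Site)]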

Summarize data at different aggregate levels - R and tidyverse

Another alternative:

library(tidyverse)  

iris %>%
  mutate_at("Species", as.character) %>%
  list(group_by(., Species), .) %>%
  map(~ summarize(., mean_s_length = mean(Sepal.Length),
                  max_s_width = max(Sepal.Width))) %>%
  bind_rows() %>%
  replace_na(list(Species = "Overall"))
#> # A tibble: 4 x 3
#>   Species    mean_s_length max_s_width
#>   <chr>              <dbl>       <dbl>
#> 1 setosa              5.01         4.4
#> 2 versicolor          5.94         3.4
#> 3 virginica           6.59         3.8
#> 4 Overall             5.84         4.4
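If the list()/map() step feels opaque, the same result can be sketched more explicitly by computing the per-group and overall summaries separately and binding them; this assumes nothing beyond the iris columns used above:

library(dplyr)

# Per-species summary
per_species <- iris %>%
  group_by(Species) %>%
  summarize(mean_s_length = mean(Sepal.Length),
            max_s_width = max(Sepal.Width)) %>%
  mutate(Species = as.character(Species))

# Overall summary, labelled "Overall"
overall <- iris %>%
  summarize(mean_s_length = mean(Sepal.Length),
            max_s_width = max(Sepal.Width)) %>%
  mutate(Species = "Overall")

bind_rows(per_species, overall)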

Aggregate / summarize multiple variables per group (e.g. sum, mean)


You could also use the reshape2 package for this task:

require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
  year month         x1           x2
1 2000     1  -80.83405 -224.9540159
2 2000     2 -223.76331 -288.2418017
3 2000     3 -188.83930 -481.5601913
4 2000     4 -197.47797 -473.7137420
5 2000     5 -259.07928 -372.4563522
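For comparison, a dplyr sketch of the same year/month sums, assuming df1 has the date, year, month, x1, and x2 columns implied by the output above:

library(dplyr)

df1 %>%
  group_by(year, month) %>%
  summarise(across(c(x1, x2), sum), .groups = "drop")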

Use data.table to count and aggregate / summarize a column

The post you are referring to shows how to apply one aggregation method to several columns. If you want to apply different aggregation methods to different columns, you can do:

dat[, .(count = .N, var = sum(VAR)), by = MNTH]

This results in:

     MNTH count var
1: 201501     4   2
2: 201502     3   0
3: 201503     5   2
4: 201504     4   2

You can also add these values to your existing dataset by updating it by reference:

dat[, `:=` (count = .N, var = sum(VAR)), by = MNTH]

This results in:

> dat
      MNTH VAR count var
 1: 201501   1     4   2
 2: 201501   1     4   2
 3: 201501   0     4   2
 4: 201501   0     4   2
 5: 201502   0     3   0
 6: 201502   0     3   0
 7: 201502   0     3   0
 8: 201503   0     5   2
 9: 201503   0     5   2
10: 201503   1     5   2
11: 201503   1     5   2
12: 201503   0     5   2
13: 201504   1     4   2
14: 201504   0     4   2
15: 201504   1     4   2
16: 201504   0     4   2
For further reading about how to use data.table syntax, see the Getting started guides on the GitHub wiki.

How to use aggregate and summary function to get unique columns in a dataframe?

Since aggregate's simplify argument defaults to TRUE, it simplifies the results of calling the function (here, summary) into a matrix column. You can rebuild a flat data.frame by coercing that matrix column into its own data.frame:

df <- data.frame(Result = c(1, 1, 2, 100, 50, 30, 45, 20, 10, 8),
                 Location = c("Alpha", "Beta", "Gamma", "Alpha", "Beta",
                              "Gamma", "Alpha", "Beta", "Gamma", "Alpha"))

Agg <- aggregate(df$Result, list(df$Location), summary)

data.frame(Location = Agg$Group.1, Agg$x)
#>   Location Min. X1st.Qu. Median     Mean X3rd.Qu. Max.
#> 1    Alpha    1     6.25   26.5 38.50000    58.75  100
#> 2     Beta    1    10.50   20.0 23.66667    35.00   50
#> 3    Gamma    2     6.00   10.0 14.00000    20.00   30
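An equivalent base-R shortcut on the same Agg object is to let do.call() flatten the matrix column for you:

# Spreads the matrix column Agg$x into ordinary columns (x.Min., x.1st.Qu., ...)
do.call(data.frame, Agg)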

Alternatively, dplyr's summarise family of functions handles multiple summary statistics well:

library(dplyr)

df %>% group_by(Location) %>% summarise_all(funs(min, median, max))
#> # A tibble: 3 x 4
#>   Location   min median   max
#>   <fct>    <dbl>  <dbl> <dbl>
#> 1 Alpha       1.   26.5  100.
#> 2 Beta        1.   20.0   50.
#> 3 Gamma       2.   10.0   30.
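Note that funs() is deprecated in current dplyr; a sketch of the same summary with across() (the columns come out named Result_min, Result_median, Result_max) would be:

df %>%
  group_by(Location) %>%
  summarise(across(Result, list(min = min, median = median, max = max)))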

If you really want all of summary, you can use broom::tidy to turn each group's results into a data frame in a list column, which can be unnested:

df %>%
  group_by(Location) %>%
  summarise(x = list(broom::tidy(summary(Result)))) %>%
  tidyr::unnest()
#> # A tibble: 3 x 7
#>   Location minimum    q1 median  mean    q3 maximum
#>   <fct>      <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>
#> 1 Alpha         1.  6.25   26.5  38.5  58.8    100.
#> 2 Beta          1. 10.5    20.0  23.7  35.0     50.
#> 3 Gamma         2.  6.00   10.0  14.0  20.0     30.

faster way to create variable that aggregates a column by id

For any kind of aggregation where you want a resulting vector the same length as the input, with the aggregated value repeated within each group, ave() is what you want:

df$perc.total <- ave(df$cand.perc, df$id, FUN = sum)
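A dplyr equivalent of that same-length, by-group sum, assuming the same df with id and cand.perc columns:

library(dplyr)

df <- df %>%
  group_by(id) %>%
  mutate(perc.total = sum(cand.perc)) %>%  # one value per group, recycled to every row
  ungroup()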

Aggregate takes a long time

Have you tried loading your initial table with the data.table package? fread() alone will save a significant amount of time when reading 100 million rows.

DT <- fread("path/to/file.csv")

Then you can aggregate fairly quickly with:

DT[ , AggColumn := sum(time), by = id]
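That := call keeps every row and adds the aggregate alongside it; if you want one row per id instead, a sketch on the same DT:

# Collapse to one row per id with the summed time
agg <- DT[, .(AggColumn = sum(time)), by = id]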

Faster function than aggregate() in R

I'm sure the real data is much larger, but your solution seems on point. As alternatives, I benchmarked a couple of other approaches:

Tidyverse

tidy_fn <- function() {
  rbind(old.data, new.data) %>%
    group_by(id) %>%
    dplyr::summarise_all(function(x) sum(x))
}

plyr and base functions (I know... bad form)

plyr_base_fn <- function() {
  plyr::ldply(Map(function(x) {
    sapply(x[1:3], sum)
  }, rbind(old.data, new.data) %>% split(., .$id)))
}

Your aggregation approach:

agg_fn <- function() {
  aggregate(cbind(x, y, z) ~ id, rbind(old.data, new.data), sum, na.rm = FALSE)
}

Results from two tests:

1000 reps

> microbenchmark(tidy_fn(), agg_fn(), plyr_base_fn(), times = 1000L)
Unit: milliseconds
           expr      min       lq     mean   median       uq       max neval
      tidy_fn() 2.220585 2.386112 2.823122 2.529649 2.775300 13.425573  1000
       agg_fn() 1.668601 1.795527 2.149068 1.895666 2.062904 16.117802  1000
 plyr_base_fn() 1.253772 1.331501 1.567777 1.402464 1.526089  8.396307  1000

5000 reps

> microbenchmark(tidy_fn(), agg_fn(), plyr_base_fn(), times = 5000L)
Unit: milliseconds
           expr      min       lq     mean   median       uq       max neval
      tidy_fn() 2.227752 2.400265 2.696034 2.542617 2.722082  12.46249  5000
       agg_fn() 1.673647 1.792085 2.067232 1.897011 2.019915 301.84694  5000
 plyr_base_fn() 1.247306 1.336010 1.503682 1.411608 1.503290  14.24656  5000
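If the real data is large, a data.table variant may also be worth adding to the benchmark; a sketch, assuming the same old.data/new.data with an id column and the numeric columns to sum:

library(data.table)

dt_fn <- function() {
  # rbindlist + lapply over .SD sums every non-grouping column by id
  rbindlist(list(old.data, new.data))[, lapply(.SD, sum), by = id]
}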

Summarizing count and conditional aggregate functions on the same factor

Assuming that your original dataset is similar to the one you created (i.e. with "NA" as a character string): you could specify na.strings while reading the data with read.table, but NAs should be detected automatically anyway.

The price column is a factor, which needs to be converted to numeric. When you use as.numeric on the character values, all non-numeric elements (i.e. "NA", FALSE) get coerced to NA with a warning.

library(dplyr)
df %>%
  mutate(price = as.numeric(as.character(price))) %>%
  group_by(company, year, product) %>%
  summarise(total.count = n(),
            count = sum(is.na(price)),
            avg.price = mean(price, na.rm = TRUE),
            max.price = max(price, na.rm = TRUE))

data

I am using the same dataset (except the ... row) that was shown.

df <- tbl_df(data.frame(company = c("Acme", "Meca", "Emca", "Acme", "Meca", "Emca"),
                        year = c("2011", "2010", "2009", "2011", "2010", "2013"),
                        product = c("Wrench", "Hammer", "Sonic Screwdriver", "Fairy Dust",
                                    "Kindness", "Helping Hand"),
                        price = c("5.67", "7.12", "12.99", "10.99", "NA", FALSE)))

