How does one aggregate and summarize data quickly?
You should look at the package data.table
for faster aggregation operations on large data frames. For your problem, the solution would look like:
library(data.table)
data_t = data.table(data_tab)
ans = data_t[,list(A = sum(count), B = mean(count)), by = 'PID,Time,Site']
Summarize data at different aggregate levels - R and tidyverse
Another alternative:
library(tidyverse)
iris %>%
  mutate_at("Species", as.character) %>%
  list(group_by(., Species), .) %>%
  map(~ summarize(., mean_s_length = mean(Sepal.Length),
                  max_s_width = max(Sepal.Width))) %>%
  bind_rows() %>%
  replace_na(list(Species = "Overall"))
#> # A tibble: 4 x 3
#> Species mean_s_length max_s_width
#> <chr> <dbl> <dbl>
#> 1 setosa 5.01 4.4
#> 2 versicolor 5.94 3.4
#> 3 virginica 6.59 3.8
#> 4 Overall 5.84 4.4
Aggregate / summarize multiple variables per group (e.g. sum, mean)
Where is this year() function from?
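year() is not a base R function; it most likely comes from the lubridate package (base R would extract the year with format()). A quick check, assuming lubridate is installed:

```r
library(lubridate)

d <- as.Date("2000-03-15")
year(d)            # 2000 (integer)
format(d, "%Y")    # "2000" -- base R equivalent, returns character
```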
You could also use the reshape2
package for this task:
require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
#   year month         x1           x2
# 1 2000     1  -80.83405 -224.9540159
# 2 2000     2 -223.76331 -288.2418017
# 3 2000     3 -188.83930 -481.5601913
# 4 2000     4 -197.47797 -473.7137420
# 5 2000     5 -259.07928 -372.4563522
Use data.table to count and aggregate / summarize a column
The post you are referring to gives a method on how to apply one aggregation method to several columns. If you want to apply different aggregation methods to different columns, you can do:
dat[, .(count = .N, var = sum(VAR)), by = MNTH]
this results in:
MNTH count var
1: 201501 4 2
2: 201502 3 0
3: 201503 5 2
4: 201504 4 2
You can also add these values to your existing dataset by updating your dataset by reference:
dat[, `:=` (count = .N, var = sum(VAR)), by = MNTH]
this results in:
> dat
MNTH VAR count var
1: 201501 1 4 2
2: 201501 1 4 2
3: 201501 0 4 2
4: 201501 0 4 2
5: 201502 0 3 0
6: 201502 0 3 0
7: 201502 0 3 0
8: 201503 0 5 2
9: 201503 0 5 2
10: 201503 1 5 2
11: 201503 1 5 2
12: 201503 0 5 2
13: 201504 1 4 2
14: 201504 0 4 2
15: 201504 1 4 2
16: 201504 0 4 2
For further reading about how to use data.table syntax, see the Getting started guides on the GitHub wiki.
How to use aggregate and summary function to get unique columns in a dataframe?
Since aggregate's simplify parameter defaults to TRUE, it simplifies the results of calling the function (here, summary) to a matrix. You can reconstruct the data.frame by coercing that matrix column into its own data.frame:
df <- data.frame(Result = c(1, 1, 2, 100, 50, 30, 45, 20, 10, 8),
                 Location = c("Alpha", "Beta", "Gamma", "Alpha", "Beta",
                              "Gamma", "Alpha", "Beta", "Gamma", "Alpha"))
Agg <- aggregate(df$Result, list(df$Location), summary)
data.frame(Location = Agg$Group.1, Agg$x)
#> Location Min. X1st.Qu. Median Mean X3rd.Qu. Max.
#> 1 Alpha 1 6.25 26.5 38.50000 58.75 100
#> 2 Beta 1 10.50 20.0 23.66667 35.00 50
#> 3 Gamma 2 6.00 10.0 14.00000 20.00 30
Alternately, dplyr's summarise family of functions can handle multiple summary statistics well:
library(dplyr)
df %>% group_by(Location) %>% summarise_all(funs(min, median, max))
#> # A tibble: 3 x 4
#> Location min median max
#> <fct> <dbl> <dbl> <dbl>
#> 1 Alpha 1. 26.5 100.
#> 2 Beta 1. 20.0 50.
#> 3 Gamma 2. 10.0 30.
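Note that funs() has since been deprecated (as of dplyr 0.8); in current dplyr the same per-group statistics are written with across(). A sketch using the same df as above:

```r
library(dplyr)

df <- data.frame(Result = c(1, 1, 2, 100, 50, 30, 45, 20, 10, 8),
                 Location = c("Alpha", "Beta", "Gamma", "Alpha", "Beta",
                              "Gamma", "Alpha", "Beta", "Gamma", "Alpha"))

# across() with a named list of functions; result columns are named
# "{column}_{function}" by default, i.e. Result_min, Result_median, Result_max
df %>%
  group_by(Location) %>%
  summarise(across(Result, list(min = min, median = median, max = max)))
```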
If you really want all of summary, you can use broom::tidy to turn each group's results into a data frame in a list column, which can then be unnested:
df %>%
  group_by(Location) %>%
  summarise(x = list(broom::tidy(summary(Result)))) %>%
  tidyr::unnest()
#> # A tibble: 3 x 7
#> Location minimum q1 median mean q3 maximum
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Alpha 1. 6.25 26.5 38.5 58.8 100.
#> 2 Beta 1. 10.5 20.0 23.7 35.0 50.
#> 3 Gamma 2. 6.00 10.0 14.0 20.0 30.
faster way to create variable that aggregates a column by id
For any kind of aggregation where you want a resulting vector the same length as the input, with the group-wise result repeated across each group defined by the grouping vector, ave is what you want.
df$perc.total <- ave(df$cand.perc, df$id, FUN = sum)
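A minimal sketch of what ave does here, using made-up id and cand.perc columns (the names are taken from the one-liner above):

```r
# ave() returns a vector the same length as its input, with the
# group-wise result (here the sum) repeated for every row of the group
df <- data.frame(id        = c(1, 1, 2, 2, 2),
                 cand.perc = c(10, 20, 5, 5, 30))
df$perc.total <- ave(df$cand.perc, df$id, FUN = sum)
df$perc.total
# 30 30 40 40 40
```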
Aggregate takes a long time
Have you loaded your initial table with the data.table library? That alone will save a significant amount of time just reading 100 million rows:
DT <- fread("path/to/file.csv")
Then you can aggregate fairly quickly with:
DT[ , AggColumn := sum(time), by = id]
Faster function than aggregate() in R
I'm sure the real data is much larger, but your solution seems on point. As alternatives, I benchmarked two other approaches:
Tidyverse
tidy_fn <- function(){
rbind(old.data, new.data) %>% group_by(id) %>% dplyr::summarise_all(
function(x)sum(x)
)
}
plyr and base functions (I know... bad form):
plyr_base_fn <- function(){
plyr::ldply(Map(function(x){
sapply(x[1:3],sum)
}, rbind(old.data,new.data) %>% split(., .$id)
))
}
Your aggregation approach:
agg_fn <- function(){
aggregate(cbind(x,y,z)~id, rbind(old.data, new.data), sum, na.rm=F)
}
Results from two tests:
1000 reps:
> microbenchmark(tidy_fn(), agg_fn(), plyr_base_fn(), times = 1000L)
Unit: milliseconds
expr min lq mean median uq max neval
tidy_fn() 2.220585 2.386112 2.823122 2.529649 2.775300 13.425573 1000
agg_fn() 1.668601 1.795527 2.149068 1.895666 2.062904 16.117802 1000
plyr_base_fn() 1.253772 1.331501 1.567777 1.402464 1.526089 8.396307 1000
5000 reps:
> microbenchmark(tidy_fn(), agg_fn(), plyr_base_fn(), times = 5000L)
Unit: milliseconds
expr min lq mean median uq max neval
tidy_fn() 2.227752 2.400265 2.696034 2.542617 2.722082 12.46249 5000
agg_fn() 1.673647 1.792085 2.067232 1.897011 2.019915 301.84694 5000
plyr_base_fn() 1.247306 1.336010 1.503682 1.411608 1.503290 14.24656 5000
Summarizing count and conditional aggregate functions on the same factor
Assuming that your original dataset is similar to the one you created (i.e. with NA stored as the character "NA"): you could specify na.strings while reading the data with read.table, but I guess the NAs would be detected automatically. The price column is a factor, which needs to be converted to numeric class. When you use as.numeric (after as.character), all the non-numeric elements (i.e. "NA", FALSE) get coerced to NA with a warning.
library(dplyr)
df %>%
  mutate(price = as.numeric(as.character(price))) %>%
  group_by(company, year, product) %>%
  summarise(total.count = n(),
            count = sum(is.na(price)),
            avg.price = mean(price, na.rm = TRUE),
            max.price = max(price, na.rm = TRUE))
data
I am using the same dataset (except the ... row) that was shown:
df = tbl_df(data.frame(company = c("Acme", "Meca", "Emca", "Acme", "Meca", "Emca"),
                       year = c("2011", "2010", "2009", "2011", "2010", "2013"),
                       product = c("Wrench", "Hammer", "Sonic Screwdriver",
                                   "Fairy Dust", "Kindness", "Helping Hand"),
                       price = c("5.67", "7.12", "12.99", "10.99", "NA", FALSE)))