Ddply + Summarize for Repeating Same Statistical Function Across Large Number of Columns


You can use numcolwise() to run a summary over all numeric columns.

Here is an example using iris:

library(plyr)
ddply(iris, .(Species), numcolwise(mean))
#      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1     setosa        5.006       3.428        1.462       0.246
# 2 versicolor        5.936       2.770        4.260       1.326
# 3  virginica        6.588       2.974        5.552       2.026

Similarly, there is catcolwise() to summarise over all categorical columns.
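
For instance, here is a minimal sketch on a made-up data frame, counting the distinct values of every categorical column per group. The grouping column is numeric here so catcolwise() skips it, just as numcolwise() skips Species above.

d <- data.frame(grp  = rep(1:2, each = 4),
                type = c("a", "a", "b", "b", "a", "b", "c", "c"),
                flag = c("x", "y", "x", "y", "x", "x", "x", "x"),
                val  = rnorm(8),
                stringsAsFactors = FALSE)

# number of distinct values of each categorical column within each group
ddply(d, .(grp), catcolwise(function(x) length(unique(x))))
#   grp type flag
# 1   1    2    2
# 2   2    3    1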

See ?numcolwise for more help and examples.


EDIT

An alternative approach is to use reshape2 (proposed by @gsk3). It takes a few more keystrokes in this example, but gives you enormous flexibility:

library(reshape2)

miris <- melt(iris, id.vars="Species")
x <- ddply(miris, .(Species, variable), summarize, mean=mean(value))

dcast(x, Species~variable, value.var="mean")
#      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1     setosa        5.006       3.428        1.462       0.246
# 2 versicolor        5.936       2.770        4.260       1.326
# 3  virginica        6.588       2.974        5.552       2.026
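
For example, the melted form makes it easy to compute several statistics per column in one pass, reusing miris from above:

ddply(miris, .(Species, variable), summarize,
      mean = mean(value), sd = sd(value), n = length(value))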

Better & faster way to sum & ifelse for a large set of columns in a big data frame using ddply R

We could reduce the time by switching to dplyr. Also, instead of taking the sum and then using ifelse to check and reconvert, we can directly check whether any value in the group is greater than 0; the unary + in the code below then converts that logical result to 0/1.

library(dplyr)
dummies %>%
    dplyr::select(id, where(is.numeric)) %>%
    dplyr::group_by(id) %>%
    dplyr::summarise(across(everything(), ~ +(any(. > 0, na.rm = TRUE))))

or using data.table

library(data.table)
setDT(dummies)[, lapply(.SD, function(x) +(any(x > 0, na.rm = TRUE))),
               by = id, .SDcols = patterns('group')]

Problems with ddply for splitting a large number of categories in R

I posted a new answer to your original question here: How to assign number of repeats to dataframe based on elements of an identifying vector in R?

That will hopefully help you there and here.

How to use ddply to get weighted-mean of class in dataframe?

You might find what you want in ?summarise. I can replicate your code with summarise as follows:

library(plyr)
set.seed(123)
frame <- data.frame(class = sample(LETTERS[1:5], replace = TRUE),
                    x = rnorm(20), x2 = rnorm(20), weights = rnorm(20))
ddply(frame, .(class), summarise,
      x2 = weighted.mean(x2, weights))

To do this for x as well, just add that line to be passed into the summarise function:

ddply(frame, .(class), summarise,
      x = weighted.mean(x, weights),
      x2 = weighted.mean(x2, weights))

Edit: If you want to do an operation over many columns, use colwise or numcolwise instead of summarise, or do summarise on a melted data frame with the reshape2 package and then cast back to the original form. Here's an example with colwise:

wmean.vars <- c("x", "x2")

ddply(frame, .(class), function(x)
  colwise(weighted.mean, w = x$weights)(x[wmean.vars]))

Finally, if you don't like having to specify wmean.vars, you can also do:

ddply(frame, .(class), function(x)
  numcolwise(weighted.mean, w = x$weights)(x[!colnames(x) %in% "weights"]))

which will compute a weighted-average for every numerical field, excluding the weights themselves.
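
Here is a sketch of the reshape2 melt/summarise/cast route mentioned above for the same weighted means, reusing frame and wmean.vars as defined earlier:

library(reshape2)

# melt the value columns, keeping class and weights as id variables
mframe <- melt(frame, id.vars = c("class", "weights"),
               measure.vars = wmean.vars)

# one weighted mean per class/variable combination, then cast back to wide form
x <- ddply(mframe, .(class, variable), summarise,
           wmean = weighted.mean(value, weights))
dcast(x, class ~ variable, value.var = "wmean")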

Aggregate sum and mean in R with ddply

Another solution using dplyr. First, apply both aggregate functions to every variable you want to aggregate; then, from the resulting variables, select only the desired function/variable combinations.

library(dplyr)
library(ggplot2)

diamonds %>%
  group_by(cut) %>%
  summarise_each(funs(sum, mean), x:z, price) %>%
  select(cut, matches("[xyz]_sum"), price_mean)
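
Note that summarise_each() and funs() are deprecated in current dplyr; a sketch of the same result with across() (dplyr >= 1.0) would be:

diamonds %>%
  group_by(cut) %>%
  summarise(across(c(x, y, z), sum, .names = "{.col}_sum"),
            price_mean = mean(price))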

Use ddply within a function and include variable of interest as an argument

I just moved a couple things around in the example function you gave and showed how to get more than one column back out. Does this do what you want?

myFunction2 <- function(x, y, col){
  z <- ddply(x, y, .fun = function(xx){
    c(mean = mean(xx[, col], na.rm = TRUE),
      max  = max(xx[, col], na.rm = TRUE))
  })
  return(z)
}

myFunction2(mtcars, "cyl", "hp")
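
For reference, with mtcars that call should return the per-cylinder mean and maximum horsepower, approximately:

#   cyl      mean max
# 1   4  82.63636 113
# 2   6 122.28571 175
# 3   8 209.21429 335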

How to use a for loop to use ddply on multiple columns?

The OP asked for a simple for-loop for this transformation. I understand there are many more optimized ways to solve it, but to respect the OP's wishes I used a for-loop based solution. I used dplyr, since plyr is quite dated now.

library(dplyr)
Subject <- c(rep(1, times = 6), rep(2, times = 6))
GroupOfInterest <- c(letters[rep(1:3, times = 4)])
Feature1 <- sample(1:20, 12, replace = TRUE)
Feature2 <- sample(400:500, 12, replace = TRUE)
Feature3 <- sample(1:5, 12, replace = TRUE)
# small change in the way the data.frame is created
df.main <- data.frame(Subject, GroupOfInterest, Feature1, Feature2,
                      Feature3, stringsAsFactors = FALSE)

Feat <- c(colnames(df.main[3:5]))

# key columns on which the grouping is done
resultdf <- unique(select(df.main, Subject, GroupOfInterest))
#> resultdf
#   Subject GroupOfInterest
# 1       1               a
# 2       1               b
# 3       1               c
# 7       2               a
# 8       2               b
# 9       2               c

# for-loop over each feature column
for (q in Feat) {
  summean <- paste0('mean(', q, ')')
  summ_name <- paste0(q)  # name of the column to store the mean
  df_sum <- df.main %>%
    group_by(Subject, GroupOfInterest) %>%
    summarise_(.dots = setNames(summean, summ_name))
  # merge the new mean column into resultdf
  resultdf <- merge(resultdf, df_sum, by = c("Subject", "GroupOfInterest"))
}

# Final result
#> resultdf
#   Subject GroupOfInterest Feature1 Feature2 Feature3
# 1       1               a      6.5    473.0      3.5
# 2       1               b      4.5    437.0      2.0
# 3       1               c     12.0    415.5      3.5
# 4       2               a     10.0    437.5      3.0
# 5       2               b      3.0    447.0      4.5
# 6       2               c      6.0    462.0      2.5
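
For comparison, summarise_() is deprecated in current dplyr; the same per-column means can be computed without the loop using across(), a sketch reusing df.main and Feat from above:

df.main %>%
  group_by(Subject, GroupOfInterest) %>%
  summarise(across(all_of(Feat), mean), .groups = "drop")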

Summarizing multiple columns with dplyr?

In dplyr (>= 1.0.0) you may use across(everything()) in summarise to apply a function to all variables:

library(dplyr)

df %>% group_by(grp) %>% summarise(across(everything(), mean))
#> # A tibble: 3 x 5
#>     grp     a     b     c     d
#>   <int> <dbl> <dbl> <dbl> <dbl>
#> 1     1  3.08  2.98  2.98  2.91
#> 2     2  3.03  3.04  2.97  2.87
#> 3     3  2.85  2.95  2.95  3.06

Alternatively, the purrrlyr package provides the same functionality:

library(purrrlyr)
df %>% slice_rows("grp") %>% dmap(mean)
#> # A tibble: 3 x 5
#>     grp     a     b     c     d
#>   <int> <dbl> <dbl> <dbl> <dbl>
#> 1     1  3.08  2.98  2.98  2.91
#> 2     2  3.03  3.04  2.97  2.87
#> 3     3  2.85  2.95  2.95  3.06

Also don't forget about data.table (use keyby to sort the groups):

library(data.table)
setDT(df)[, lapply(.SD, mean), keyby = grp]
#>    grp        a        b        c        d
#> 1:   1 3.079412 2.979412 2.979412 2.914706
#> 2:   2 3.029126 3.038835 2.967638 2.873786
#> 3:   3 2.854701 2.948718 2.951567 3.062678

Let's try to compare performance.

library(dplyr)
library(purrrlyr)
library(data.table)
library(bench)
set.seed(123)
n <- 10000
df <- data.frame(
  a = sample(1:5, n, replace = TRUE),
  b = sample(1:5, n, replace = TRUE),
  c = sample(1:5, n, replace = TRUE),
  d = sample(1:5, n, replace = TRUE),
  grp = sample(1:3, n, replace = TRUE)
)
dt <- setDT(df)
mark(
  dplyr = df %>% group_by(grp) %>% summarise(across(everything(), mean)),
  purrrlyr = df %>% slice_rows("grp") %>% dmap(mean),
  data.table = dt[, lapply(.SD, mean), keyby = grp],
  check = FALSE
)
#> # A tibble: 3 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dplyr        2.81ms   2.85ms      328.        NA     17.3
#> 2 purrrlyr     7.96ms   8.04ms      123.        NA     24.5
#> 3 data.table 596.33µs 707.91µs     1409.        NA     10.3

Equivalent to ddply(...,transform,...) in data.table

Use backquoted := like this...

DT[ , `:=`( freq = .N , sum = sum(mpg) ) , by=cyl ]
head( DT , 3 )
#     mpg cyl disp  hp drat    wt  qsec vs am gear carb freq   sum
# 1: 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4    7 138.2
# 2: 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4    7 138.2
# 3: 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1   11 293.3
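
For comparison, the plyr call this replaces is presumably something along these lines (it returns a new data frame rather than updating DT by reference):

library(plyr)
ddply(mtcars, .(cyl), transform, freq = length(mpg), sum = sum(mpg))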

Group By Row Value Difference

We can use base R aggregate and group by Year and Month to calculate the difference between the two rows.

abs(aggregate(.~Year + Month, df, diff))

#   Year Month ValueA ValueB ValueC
# 1 2016     1     15     33     18
# 2 2017     2      7      1      3
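
A dplyr sketch of the same idea, assuming the column layout shown above (two rows per Year/Month group):

library(dplyr)
df %>%
  group_by(Year, Month) %>%
  summarise(across(everything(), ~ abs(diff(.))), .groups = "drop")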

