ddply + summarize for repeating same statistical function across large number of columns
You can use numcolwise() to run a summary over all numeric columns. Here is an example using iris:
library(plyr)
ddply(iris, .(Species), numcolwise(mean))
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
Similarly, there is catcolwise() to summarise over all categorical columns. See ?numcolwise for more help and examples.
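As a minimal sketch of the categorical counterpart, the snippet below wraps nlevels with catcolwise() and applies it to iris directly (without ddply). The name n_levels is only an illustrative choice.

```r
library(plyr)

# catcolwise() wraps a function so that it is applied to each discrete
# (factor, character, logical) column of a data frame
n_levels <- catcolwise(nlevels)

# iris has a single discrete column, Species, so the result has one column
res <- n_levels(iris)
res
```

The numeric columns are silently skipped, which mirrors how numcolwise() skips the factor column in the example above.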
EDIT
An alternative approach is to use reshape2 (proposed by @gsk3). This takes more keystrokes in this example, but gives you enormous flexibility:
library(reshape2)
miris <- melt(iris, id.vars="Species")
x <- ddply(miris, .(Species, variable), summarize, mean=mean(value))
dcast(x, Species~variable, value.var="mean")
Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1 setosa 5.006 3.428 1.462 0.246
2 versicolor 5.936 2.770 4.260 1.326
3 virginica 6.588 2.974 5.552 2.026
Better & faster way to sum & ifelse for a large set of columns in a big data frame using ddply R
We could possibly reduce the time by switching to dplyr. Also, instead of computing the sum and then using ifelse to check and reconvert, this can be done directly by checking whether any value is greater than 0:
library(dplyr)
dummies %>%
dplyr::select(id, where(is.numeric)) %>%
dplyr::group_by(id) %>%
dplyr::summarise(across(everything(), ~ +(any(. > 0, na.rm = TRUE))))
or using data.table
library(data.table)
setDT(dummies)[, lapply(.SD, function(x)
+(any(x > 0, na.rm = TRUE))), id, .SDcols = patterns('group')]
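To see that the direct check gives the same answer as the sum-then-ifelse route, here is a small sketch on made-up data. The data frame below is hypothetical (the column names group1 and group2 are assumptions), and the two approaches only agree when the indicator values are non-negative.

```r
library(dplyr)

# Hypothetical toy data: an id plus two indicator columns
dummies <- data.frame(
  id     = c(1, 1, 2, 2, 3),
  group1 = c(0, 1, 0, 0, NA),
  group2 = c(0, 0, 1, 1, 0)
)

# Direct approach: 1 if any value in the group is positive, else 0
res_any <- dummies %>%
  group_by(id) %>%
  summarise(across(starts_with("group"), ~ +(any(.x > 0, na.rm = TRUE))))

# Two-step approach: sum within the group, then recode with ifelse
res_sum <- dummies %>%
  group_by(id) %>%
  summarise(across(starts_with("group"),
                   ~ ifelse(sum(.x, na.rm = TRUE) > 0, 1, 0)))
```

The unary plus converts the logical returned by any() into 0/1, so no second recoding pass is needed.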
Problems with ddply for splitting a large number of categories in R
I posted a new answer to your original question, "How to assign number of repeats to dataframe based on elements of an identifying vector in R?". That will hopefully help you both there and here.
How to use ddply to get weighted-mean of class in dataframe?
You might find what you want in the summarise function (see ?summarise). I can replicate your code with summarise as follows:
library(plyr)
set.seed(123)
frame <- data.frame(class=sample(LETTERS[1:5], replace = TRUE), x=rnorm(20),
x2 = rnorm(20), weights=rnorm(20))
ddply(frame, .(class), summarise,
x2 = weighted.mean(x2, weights))
To do this for x as well, just add another line to be passed into the summarise function:
ddply(frame, .(class), summarise,
x = weighted.mean(x, weights),
x2 = weighted.mean(x2, weights))
Edit: If you want to do an operation over many columns, use colwise or numcolwise instead of summarise, or do summarise on a melted data frame with the reshape2 package, then cast back to the original form. Here's an example using colwise:
wmean.vars <- c("x", "x2")
ddply(frame, .(class), function(x)
colwise(weighted.mean, w = x$weights)(x[wmean.vars]))
Finally, if you don't like having to specify wmean.vars, you can also do:
ddply(frame, .(class), function(x)
numcolwise(weighted.mean, w = x$weights)(x[!colnames(x) %in% "weights"]))
which will compute a weighted average for every numeric field, excluding the weights themselves.
Aggregate sum and mean in R with ddply
Another solution using dplyr. First, apply both aggregate functions to every variable you want aggregated. Then, from the resulting variables, select only the desired function/variable combinations.
library(dplyr)
library(ggplot2)
diamonds %>%
group_by(cut) %>%
summarise_each(funs(sum, mean), x:z, price) %>%
select(cut, matches("[xyz]_sum"), price_mean)
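Note that summarise_each() and funs() are deprecated in recent dplyr releases. A minimal sketch of the same two-step idea with across(), assuming dplyr >= 1.0.0 and the diamonds data from ggplot2, could look like this:

```r
library(dplyr)
library(ggplot2)  # only needed for the diamonds data set

res <- diamonds %>%
  group_by(cut) %>%
  # apply both functions to x, y, z and price; columns are named <col>_<fn>
  summarise(across(c(x:z, price), list(sum = sum, mean = mean))) %>%
  # keep the sums of x, y, z and the mean of price
  select(cut, matches("^[xyz]_sum$"), price_mean)
```

Naming the list elements (sum = sum, mean = mean) is what produces the x_sum and price_mean column names that the select() step relies on.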
Use ddply within a function and include variable of interest as an argument
I just moved a couple things around in the example function you gave and showed how to get more than one column back out. Does this do what you want?
myFunction2 <- function(x, y, col){
z <- ddply(x, y, .fun = function(xx){
c(mean = mean(xx[,col],na.rm=TRUE),
max = max(xx[,col],na.rm=TRUE) ) })
return(z)
}
myFunction2(mtcars, "cyl", "hp")
How to use a for loop to use ddply on multiple columns?
The OP asked for a simple for-loop to perform this transformation. I understand that there are many other, more optimized ways to solve this, but to respect the OP's request I have used a for-loop based solution. I have used dplyr, as plyr is old now.
library(dplyr)
Subject <- c(rep(1, times = 6), rep(2, times = 6))
GroupOfInterest <- c(letters[rep(1:3, times = 4)])
Feature1 <- sample(1:20, 12, replace = TRUE)
Feature2 <- sample(400:500, 12, replace = TRUE)
Feature3 <- sample(1:5, 12, replace = TRUE)
#small change in the way data.frame is created
df.main <- data.frame(Subject,GroupOfInterest, Feature1, Feature2,
Feature3, stringsAsFactors = FALSE)
Feat <- c(colnames(df.main[3:5]))
# Ready with Key columns on which grouping is done
resultdf <- unique(select(df.main, Subject, GroupOfInterest))
#> resultdf
# Subject GroupOfInterest
#1 1 a
#2 1 b
#3 1 c
#7 2 a
#8 2 b
#9 2 c
#For loop for each column
for(q in Feat){
summean <- paste0('mean(', q, ')')
summ_name <- paste0(q) # Name of the column to store the mean
df_sum <- df.main %>%
group_by(Subject, GroupOfInterest) %>%
summarise_(.dots = setNames(summean, summ_name))
# merge the new mean column into resultdf
resultdf <- merge(resultdf, df_sum, by = c("Subject", "GroupOfInterest"))
}
# Final result
#> resultdf
# Subject GroupOfInterest Feature1 Feature2 Feature3
#1 1 a 6.5 473.0 3.5
#2 1 b 4.5 437.0 2.0
#3 1 c 12.0 415.5 3.5
#4 2 a 10.0 437.5 3.0
#5 2 b 3.0 447.0 4.5
#6 2 c 6.0 462.0 2.5
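Since summarise_() is deprecated in current dplyr, here is a hedged sketch of the same loop written with the .data pronoun, followed by a loop-free variant using across(). The seed and the names df_mean and result_across are illustrative assumptions, and dplyr >= 1.0.0 is assumed.

```r
library(dplyr)

set.seed(42)  # arbitrary seed so the toy data is reproducible
df.main <- data.frame(
  Subject         = rep(1:2, each = 6),
  GroupOfInterest = rep(letters[1:3], times = 4),
  Feature1        = sample(1:20, 12, replace = TRUE),
  Feature2        = sample(400:500, 12, replace = TRUE),
  Feature3        = sample(1:5, 12, replace = TRUE),
  stringsAsFactors = FALSE
)
Feat <- colnames(df.main)[3:5]

# Loop version: one grouped mean per feature, merged onto the key columns
resultdf <- unique(df.main[, c("Subject", "GroupOfInterest")])
for (q in Feat) {
  df_mean <- df.main %>%
    group_by(Subject, GroupOfInterest) %>%
    summarise(!!q := mean(.data[[q]]), .groups = "drop")
  resultdf <- merge(resultdf, df_mean, by = c("Subject", "GroupOfInterest"))
}

# Loop-free version: across() applies mean to every column named in Feat
result_across <- df.main %>%
  group_by(Subject, GroupOfInterest) %>%
  summarise(across(all_of(Feat), mean), .groups = "drop")
```

Both results contain one row per Subject/GroupOfInterest pair; the across() form avoids the repeated merge.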
Summarizing multiple columns with dplyr?
In dplyr (>= 1.0.0) you may use across(everything(), ...) in summarise to apply a function to all variables:
library(dplyr)
df %>% group_by(grp) %>% summarise(across(everything(), mean))
#> # A tibble: 3 x 5
#> grp a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.08 2.98 2.98 2.91
#> 2 2 3.03 3.04 2.97 2.87
#> 3 3 2.85 2.95 2.95 3.06
Alternatively, the purrrlyr package provides the same functionality:
library(purrrlyr)
df %>% slice_rows("grp") %>% dmap(mean)
#> # A tibble: 3 x 5
#> grp a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.08 2.98 2.98 2.91
#> 2 2 3.03 3.04 2.97 2.87
#> 3 3 2.85 2.95 2.95 3.06
Also don't forget about data.table (use keyby to sort groups):
library(data.table)
setDT(df)[, lapply(.SD, mean), keyby = grp]
#> grp a b c d
#> 1: 1 3.079412 2.979412 2.979412 2.914706
#> 2: 2 3.029126 3.038835 2.967638 2.873786
#> 3: 3 2.854701 2.948718 2.951567 3.062678
Let's try to compare performance.
library(dplyr)
library(purrrlyr)
library(data.table)
library(bench)
set.seed(123)
n <- 10000
df <- data.frame(
a = sample(1:5, n, replace = TRUE),
b = sample(1:5, n, replace = TRUE),
c = sample(1:5, n, replace = TRUE),
d = sample(1:5, n, replace = TRUE),
grp = sample(1:3, n, replace = TRUE)
)
dt <- setDT(df)
mark(
dplyr = df %>% group_by(grp) %>% summarise(across(everything(), list(mean))),
purrrlyr = df %>% slice_rows("grp") %>% dmap(mean),
data.table = dt[, lapply(.SD, mean), keyby = grp],
check = FALSE
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dplyr 2.81ms 2.85ms 328. NA 17.3
#> 2 purrrlyr 7.96ms 8.04ms 123. NA 24.5
#> 3 data.table 596.33µs 707.91µs 1409. NA 10.3
Equivalent to ddply(...,transform,...) in data.table
Use backquoted := like this:
library(data.table)
DT <- as.data.table(mtcars)  # the output below matches mtcars
DT[ , `:=`( freq = .N , sum = sum(mpg) ) , by=cyl ]
head( DT , 3 )
# mpg cyl disp hp drat wt qsec vs am gear carb freq sum
#1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 7 138.2
#2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 7 138.2
#3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 11 293.3
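For comparison, the sketch below puts a plyr call that would produce the same two per-group columns next to the data.table version. The ddply call is an assumption about what the original code looked like, not taken from the question.

```r
library(plyr)
library(data.table)

# plyr: transform keeps every row and appends the per-group values
res_plyr <- ddply(mtcars, .(cyl), transform,
                  freq = length(mpg), sum = sum(mpg))

# data.table: := adds the same columns by reference, grouped by cyl
DT <- as.data.table(mtcars)
DT[, `:=`(freq = .N, sum = sum(mpg)), by = cyl]
```

The main difference is that ddply returns a new data frame ordered by cyl, while := modifies DT in place and keeps the original row order.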
Group By Row Value Difference
We can use base R aggregate, grouping by Year and Month, to calculate the difference between the two rows in each group.
abs(aggregate(.~Year + Month, df, diff))
# Year Month ValueA ValueB ValueC
#1 2016 1 15 33 18
#2 2017 2 7 1 3
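A self-contained sketch of the call follows. The input values are made up for illustration (they are not the asker's data) and were chosen only so that each Year/Month group has exactly two rows.

```r
# Hypothetical input: two rows per Year/Month combination
df <- data.frame(
  Year   = c(2016, 2016, 2017, 2017),
  Month  = c(1, 1, 2, 2),
  ValueA = c(10, 25, 12, 5),
  ValueB = c(40, 73, 21, 20),
  ValueC = c(30, 12, 8, 11)
)

# diff() returns second row minus first row within each group;
# abs() then drops the sign so only the magnitude of the change remains
res <- abs(aggregate(. ~ Year + Month, df, diff))
res
```

This relies on every group having exactly two rows; with more rows per group, diff() would return several values per column and the result would no longer be a simple table.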