How to replicate a ddply behavior that uses a custom function with dplyr?
As shown in ?do, you can refer to a group with . in your expression. The following will replicate your ddply output:
iris %>% group_by(Species) %>% do(.[1:5, ])
# Source: local data frame [15 x 5]
# Groups: Species
#
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 7.0 3.2 4.7 1.4 versicolor
# 7 6.4 3.2 4.5 1.5 versicolor
# 8 6.9 3.1 4.9 1.5 versicolor
# 9 5.5 2.3 4.0 1.3 versicolor
# 10 6.5 2.8 4.6 1.5 versicolor
# 11 6.3 3.3 6.0 2.5 virginica
# 12 5.8 2.7 5.1 1.9 virginica
# 13 7.1 3.0 5.9 2.1 virginica
# 14 6.3 2.9 5.6 1.8 virginica
# 15 6.5 3.0 5.8 2.2 virginica
More generally, to apply a custom function to groups with dplyr, you can do something like the following (thanks @docendodiscimus):
iris %>% group_by(Species) %>% do(mm(.))
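A minimal sketch of that pattern, assuming a hypothetical stand-in for the custom function (first_rows is made up for illustration, it is not the mm from the question). Any function that takes one group's rows as a data frame and returns a data frame should fit. group_modify() is shown as a newer alternative to do(), assuming a reasonably recent dplyr.

```r
library(dplyr)

# Hypothetical custom function: receives one group's rows as a data frame
# and returns a data frame (here, the first n rows of the group).
first_rows <- function(df, n = 2) head(df, n)

# do(): the dot stands for the current group's data frame.
res_do <- iris %>% group_by(Species) %>% do(first_rows(.))

# group_modify(): .x is the current group's rows without the grouping column,
# which dplyr adds back to the result.
res_gm <- iris %>% group_by(Species) %>% group_modify(~ first_rows(.x))
```

Both calls should return two rows per species; the main visible difference is that group_modify() places the grouping column first.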
Use of ddply + mutate with a custom function?
You're mostly right. ddply indeed breaks your data down into mini data frames based on the grouper, and applies a function to each piece.
With ddply, all the work is done with data frames, so the .fun argument must take a (mini) data frame as input and return a data frame as output.
mutate and summarize are functions that fit this bill (they take and return data frames). You can view their individual help pages, or run them on a data frame outside of ddply to see this, e.g.
mutate(mtcars, mean.mpg = mean(mpg))
summarize(mtcars, mean.mpg = mean(mpg))
If you don't use mutate or summarize, that is, you only use a custom function, then your function also needs to take a (mini) data frame as argument, and return a data frame.
If you do use mutate or summarize, any other functions you pass to ddply aren't used by ddply; they're just passed on to be used by mutate or summarize. And functions used by mutate and summarize act on the columns of the data, not on the entire data.frame. This is why the call names the column explicitly:
ddply(mtcars, "cyl", mutate, mean.mpg = mean(mpg))
Notice that we don't pass mutate a function. We don't say ddply(mtcars, "cyl", mutate, mean). We have to tell it what to take the mean of. In ?mutate, the description of ... is "named parameters giving definitions of new columns", not anything to do with functions. (Is mean() really different from any "custom function"? No.)
Thus it doesn't work with anonymous functions--or functions at all. Pass it an expression! You can define a custom function beforehand.
custom_function <- function(x) {mean(x + runif(length(x)))}
ddply(mtcars, "cyl", mutate, jittered.mean.mpg = custom_function(mpg))
ddply(mtcars, "cyl", summarize, jittered.mean.mpg = custom_function(mpg))
This extends well: you can have functions that take multiple arguments, and you can give them different columns as arguments. But if you're using mutate or summarize, you have to give the other functions arguments; you're not just passing the functions.
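As a small sketch of the multi-argument case (ratio_of_means is a hypothetical helper, invented here for illustration), the function is called inside the expression with two different columns as its arguments:

```r
library(plyr)

# Hypothetical two-argument helper: ratio of the means of two columns.
ratio_of_means <- function(x, y) mean(x) / mean(y)

# summarize evaluates the expression within each cyl group; the helper
# is called with the mpg and wt columns of that group.
res <- ddply(mtcars, "cyl", summarize,
             mpg.per.wt = ratio_of_means(mpg, wt))
```

The result should have one row per value of cyl, with the grouping column followed by mpg.per.wt.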
You seem to want to pass ddply a function that already "knows" which column to take the mean of. For that, I think you'd need to not use mutate or summarize, but you can hack your own version. For summarize-like behavior, return a data.frame with a single value; for mutate-like behavior, return the original data.frame with your extra value cbinded on:
mean.mpg.mutate = function(df) {
  cbind.data.frame(df, mean.mpg = mean(df$mpg))
}
mean.mpg.summarize = function(df) {
  data.frame(mean.mpg = mean(df$mpg))
}
ddply(mtcars, "cyl", mean.mpg.mutate)
ddply(mtcars, "cyl", mean.mpg.summarize)
tl;dr
Why can't I use mutate with a custom function? Is it just that "built-in" functions return some sort of class that ddply can deal with vs. having to kick out a full data.frame and then call out only the column I care about?
Quite the opposite! mutate and summarize take data frames as inputs and kick out data frames as returns. But mutate and summarize are the functions you're passing to ddply, not mean or whatever else.
Mutate and summarize are convenience functions that you'll use 99% of the time you use ddply.
If you don't use mutate/summarize, then your function needs to take and return a data frame.
If you do use mutate/summarize, then you don't pass them functions, you pass them expressions that can be evaluated with your (mini) data frame. If it's mutate, the return should be a vector to be appended to the data (recycled as necessary). If it's summarize, the return should be a single value. You don't pass a function, like mean; you pass an expression, like mean(mpg).
What about dplyr?
This was written before dplyr was a thing, or at least a big thing. dplyr removes a lot of the confusion from this process because it essentially replaces the nesting of ddply with mutate or summarize as arguments with sequential functions: group_by followed by mutate or summarize. The dplyr version of my answer would be
library(dplyr)
group_by(mtcars, cyl) %>%
mutate(mean.mpg = mean(mpg))
With the new column creation passed directly to mutate (or summarize), there isn't confusion about which function does what.
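The same idea carries over to custom functions. As a sketch, reusing the custom_function defined earlier in this answer, the function is simply called on a column inside summarize or mutate:

```r
library(dplyr)

# Same custom function as in the ddply example: a jittered mean.
custom_function <- function(x) mean(x + runif(length(x)))

# One jittered mean per cyl group.
res_summary <- mtcars %>%
  group_by(cyl) %>%
  summarize(jittered.mean.mpg = custom_function(mpg))

# The group's jittered mean repeated on every row of the group.
res_mutate <- mtcars %>%
  group_by(cyl) %>%
  mutate(jittered.mean.mpg = custom_function(mpg))
```

Because runif is random, the exact values change from run to run unless you call set.seed first.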
ddply - dplyr: .fun = summarize with several rows
Check if this works (the output differs from yours because no set.seed was used):
dfx %>% group_by(group) %>% do(data.frame(p=p, stats=quantile(.$age, probs=p)))
Source: local data frame [12 x 3]
Groups: group
group p stats
1 A 0.2 27.68069
2 A 0.4 35.36915
3 A 0.6 39.15223
4 A 0.8 46.41073
5 B 0.2 34.68378
6 B 0.4 37.22358
7 B 0.6 40.76185
8 B 0.8 44.48645
9 C 0.2 33.86023
10 C 0.4 36.30515
11 C 0.6 46.80672
12 C 0.8 52.82140
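For reference, a self-contained sketch of the same computation. The data frame and the probability vector below are made up for illustration (the originals are not shown here), and reframe() is used as an alternative to do(), assuming dplyr 1.1.0 or later:

```r
library(dplyr)

set.seed(42)
# Hypothetical data standing in for dfx: three groups with random ages.
dfx <- data.frame(group = rep(c("A", "B", "C"), each = 10),
                  age = runif(30, min = 20, max = 60))
p <- c(0.2, 0.4, 0.6, 0.8)  # assumed probabilities

# reframe() allows each group to return several rows (one per quantile).
res <- dfx %>%
  group_by(group) %>%
  reframe(p = p, stats = quantile(age, probs = p))
```

The result should have four rows per group, one for each requested quantile.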
ddply with user defined functions
You could try
library(dplyr)
late %>%
group_by(elapsed, CreditGrade)%>%
summarise_each(funs(mean=mean(., na.rm=TRUE),
sd=sd(., na.rm=TRUE) , median=median(., na.rm=TRUE)))
Using the mtcars dataset:
data(mtcars)
mtcars$wt[c(3,5)] <- NA
mtcars %>%
group_by(gear) %>%
summarise_each(funs(mean=mean(., na.rm=TRUE),
sd=sd(., na.rm=TRUE)), wt, qsec)
# gear wt_mean qsec_mean wt_sd qsec_sd
#1 3 3.924929 17.692 0.8546165 1.349916
#2 4 2.643636 18.965 0.6562739 1.613880
#3 5 2.632600 15.640 0.8189254 1.130487
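In current dplyr releases summarise_each() and funs() are deprecated. A sketch of the equivalent call with across(), assuming dplyr 1.0.0 or later (the copy named cars is only there to avoid modifying mtcars itself):

```r
library(dplyr)

cars <- mtcars
cars$wt[c(3, 5)] <- NA  # same missing values as in the example above

# across() applies each named function to each selected column;
# result columns are named <column>_<function>.
res <- cars %>%
  group_by(gear) %>%
  summarise(across(c(wt, qsec),
                   list(mean = ~ mean(.x, na.rm = TRUE),
                        sd = ~ sd(.x, na.rm = TRUE))))
```

Note that the columns come out grouped by variable (wt_mean, wt_sd, qsec_mean, qsec_sd) rather than by function.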
Or using aggregate
do.call(`data.frame`,aggregate(cbind(wt,qsec)~gear,mtcars,
FUN=function(x) c(mean=mean(x, na.rm=TRUE),
sd=sd(x, na.rm=TRUE)), na.action=NULL))
# gear wt.mean wt.sd qsec.mean qsec.sd
#1 3 3.924929 0.8546165 17.692 1.349916
#2 4 2.643636 0.6562739 18.965 1.613880
#3 5 2.632600 0.8189254 15.640 1.130487
Update
Regarding using multiple user-defined functions inside ddply: instead of calling ddply twice, you can use c or cbind inside the first call:
res <- ddply(mtcars, .(cyl,gear), function(mtcars.sub)
c(mean(carTest(mtcars.sub)), sd(carTest(mtcars.sub))))
res
# cyl gear V1 V2
#1 4 3 NaN NA
#2 4 4 81.83333 19.65112
#3 4 5 NaN NA
#4 6 3 105.00000 NA
#5 6 4 NaN NA
#6 6 5 NaN NA
#7 8 3 191.00000 31.95483
#8 8 5 335.00000 NA
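If you want meaningful column names instead of V1 and V2, one option is to return a named vector. A sketch, using a hypothetical stand-in (hp_values) because carTest is not shown here:

```r
library(plyr)

# Hypothetical stand-in for carTest: returns one numeric vector per group.
hp_values <- function(df) df$hp

res <- ddply(mtcars, .(cyl, gear), function(mtcars.sub) {
  vals <- hp_values(mtcars.sub)  # compute once, reuse for both statistics
  c(mean = mean(vals), sd = sd(vals))
})
```

The names of the returned vector become the column names, and groups with a single row still give NA for sd.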
Replicate %>% behavior with |>
To use the left-hand side more than once on the right-hand side when using |>, define a function on the right-hand side. We have defined the functions inline but could define them prior to the pipeline, if desired.
df_2 |>
(\(x) cbind(x, r = lapply(x[c("x.1", "x.4")], sum)))() |>
(\(x) x[order(x$x.1, x$x.4), ])()
giving:
x.1 x.2 x.3 x.4 x.5 x.6 x.7 x.8 x.9 x.10 r.x.1 r.x.4
10 1 3 2 4 3 1 1 4 1 3 24 24
6 2 3 3 1 1 4 2 2 2 1 24 24
7 2 4 4 3 1 2 1 4 3 1 24 24
8 2 1 2 3 2 1 1 3 3 2 24 24
4 2 1 1 4 1 1 3 1 4 2 24 24
1 3 4 1 1 3 4 1 2 2 2 24 24
9 3 3 3 1 3 2 3 4 4 3 24 24
2 3 2 4 2 4 1 4 1 2 2 24 24
5 3 2 1 2 3 3 1 3 2 4 24 24
3 3 2 1 3 2 3 4 3 3 1 24 24
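A sketch of the variant with the functions defined before the pipeline. The small data frame and the function names add_sums and sort_rows are invented for illustration, and the native pipe requires R 4.1 or later:

```r
# Hypothetical stand-in for df_2 with just the two columns used here.
df_2 <- data.frame(x.1 = c(3, 1, 2), x.4 = c(1, 4, 3))

# Each step is an ordinary function of one data frame argument,
# so it can use that argument as many times as needed.
add_sums <- function(x) cbind(x, r = lapply(x[c("x.1", "x.4")], sum))
sort_rows <- function(x) x[order(x$x.1, x$x.4), ]

res <- df_2 |> add_sums() |> sort_rows()
```

Naming the steps keeps the pipeline itself short and avoids the extra parentheses that inline anonymous functions need.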
If the main purpose of this is to use a pipeline without any packages, another pipeline-like approach is the Bizarro pipe (which is not actually a pipe but looks like one).
df_2 ->.;
cbind(., r = lapply(.[c("x.1", "x.4")], sum)) ->.;
.[order(.$x.1, .$x.4), ]
Using ddply inside function (non-standard evaluation)
The plyr package and its ddply function are somewhat outdated and have been superseded by dplyr, tidyr and similar packages (collectively referred to as the tidyverse).
# library(tidyverse)
library(dplyr)
What you are trying to accomplish can be translated like this:
sample_df %>%
group_by(a) %>%
summarize(mean = mean(b), var = var(b))
# # A tibble: 3 × 3
# a mean var
# <int> <dbl> <dbl>
# 1 1 3 0
# 2 2 2 0
# 3 3 1 0
And, for the function approach:
sumfun <- function(df, v) {
df %>%
group_by_(v) %>%
summarize(mean = mean(b), var = var(b))
}
sumfun(sample_df, 'a')
# # A tibble: 3 × 3
# a mean var
# <int> <dbl> <dbl>
# 1 1 3 0
# 2 2 2 0
# 3 3 1 0
Note the trailing _ in group_by_, which is needed inside the function to do standard evaluation. See vignette("nse") for details. Be aware that group_by_ has been deprecated since dplyr 0.7.
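With newer dplyr versions, a similar function can be sketched without the underscore verbs by selecting the grouping column from a string with across(all_of()). The sample_df below is a made-up reconstruction chosen to be consistent with the output shown above, and dplyr 1.0.0 or later is assumed:

```r
library(dplyr)

# Hypothetical data consistent with the printed result above.
sample_df <- data.frame(a = c(1L, 1L, 2L, 2L, 3L, 3L),
                        b = c(3, 3, 2, 2, 1, 1))

sumfun <- function(df, v) {
  df %>%
    group_by(across(all_of(v))) %>%  # v is a column name given as a string
    summarize(mean = mean(b), var = var(b))
}

res <- sumfun(sample_df, "a")
```

all_of() also accepts a character vector, so the same function should work for several grouping columns.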
custom function applied to groups individually R
Even after grouping by 'a', referring to a column as df$... pulls the entire column and ignores the groups. To compute within each group, remove the df$ prefix and use summarise to do the computation:
library(dplyr)
df %>%
group_by(a) %>%
summarise(Prop = sum(z[x %in% 1:3|y %in% 1:3])/sum(z))
If there are many columns, then use if_any
df %>%
group_by(a) %>%
summarise(Prop = sum(z[if_any(c(x, y), ~.x %in% 1:3)])/sum(z))
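To tie this back to custom functions, here is a sketch with the computation wrapped in a helper. The data and the name prop_matching are hypothetical, chosen so the result is easy to check by hand:

```r
library(dplyr)

# Hypothetical data: two groups, two rows each.
df <- data.frame(a = c("g1", "g1", "g2", "g2"),
                 x = c(1, 5, 2, 9),
                 y = c(7, 8, 9, 3),
                 z = c(10, 30, 5, 15))

# Hypothetical helper: share of z that falls in the rows flagged by keep.
prop_matching <- function(z, keep) sum(z[keep]) / sum(z)

# Bare column names are evaluated per group inside summarise.
res <- df %>%
  group_by(a) %>%
  summarise(Prop = prop_matching(z, x %in% 1:3 | y %in% 1:3))
```

For g1 only the first row matches (10 out of 40), and for g2 both rows match, so the helper sees each group's values rather than the whole column.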