How to define a function in dplyr?
With rlang 0.4.0 (released June 2019) or later, you can use double curly brackets {{ }} (aka "curly curly") to make programming with dplyr easier.
# Note: requires rlang 0.4.0 or later
library(dplyr)

get_pivot <- function(data, predictor, target) {
  result <-
    data %>%
    group_by(as.factor({{ predictor }})) %>%
    summarise(sum = sum({{ target }}), total = n()) %>%
    mutate(percentage = sum * 100 / total)
  print(result)
}
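# Edit -- thank you Rui Barradas
# mpg_cat is not a column of the built-in mtcars; it is assumed to be a 0/1 column added in the question
> get_pivot(mtcars, cyl, mpg_cat)
# A tibble: 3 x 4
  `as.factor(cyl)`   sum total percentage
  <fct>            <dbl> <int>      <dbl>
1 4                   11    11      100
2 6                    3     7       42.9
3 8                    0    14        0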
This is required because dplyr and other tidyverse packages use "non-standard evaluation", as some base R functions do, for example lm(mpg ~ factor(am), data = mtcars). This practice often makes "interactive" code shorter, simpler, and easier to read, but at the cost of making programming more complicated. In this case, the {{ }} operator serves to transport the column you specify into the context of the function.
https://www.tidyverse.org/articles/2019/06/rlang-0-4-0/
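To make the evaluation problem concrete, here is a minimal sketch (the function names mean_plain and mean_embraced are made up for illustration). Without embracing, the argument is looked up as an ordinary R variable and fails; with {{ }}, it is looked up as a column of the data:

```r
library(dplyr)

# Without embracing, `col` is evaluated as an ordinary R variable,
# so the column name passed by the caller is not found.
mean_plain <- function(data, col) {
  data %>% summarise(avg = mean(col))
}

# Embracing tells dplyr to look `col` up inside `data`.
mean_embraced <- function(data, col) {
  data %>% summarise(avg = mean({{ col }}))
}

# The plain version fails, so the error is caught and replaced by a message.
plain_result <- tryCatch(
  mean_plain(mtcars, mpg),
  error = function(e) "error: column not found"
)

# The embraced version returns a one-row tibble with the mean of mpg.
embraced_result <- mean_embraced(mtcars, mpg)
```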
Error when using dplyr inside of a function
UPDATE: As of dplyr 0.7.0 you can use tidy eval to accomplish this.
See http://dplyr.tidyverse.org/articles/programming.html for more details.
library(dplyr)

filter_big <- function(spp, LENGTH, WIDTH) {
  LENGTH <- enquo(LENGTH)  # Create quosure
  WIDTH <- enquo(WIDTH)    # Create quosure
  iris %>%
    filter(Species == spp) %>%
    select(!!LENGTH, !!WIDTH) %>%             # Use !! to unquote the quosure
    mutate(sum = (!!LENGTH) + (!!WIDTH)) %>%  # Use !! to unquote the quosure
    filter(sum > 4) %>%
    nrow()
}

> filter_big("virginica", Sepal.Length, Sepal.Width)
[1] 50
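With rlang 0.4.0 or later, the enquo() / !! pair can usually be collapsed into the embrace operator. The following is a sketch of the same function in that style (the name filter_big_embraced is only for illustration); it should behave the same as the version above:

```r
library(dplyr)

# Same logic as filter_big(), but {{ }} replaces enquo() followed by !!.
filter_big_embraced <- function(spp, LENGTH, WIDTH) {
  iris %>%
    filter(Species == spp) %>%
    select({{ LENGTH }}, {{ WIDTH }}) %>%
    mutate(sum = {{ LENGTH }} + {{ WIDTH }}) %>%
    filter(sum > 4) %>%
    nrow()
}

n_rows <- filter_big_embraced("virginica", Sepal.Length, Sepal.Width)
```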
Calling user defined functions from dplyr::mutate
The function does not know which object you want to modify. Pass the period column to the function as an argument and use it like this:
period_to_date <- function(period) {
  lubridate::ymd(stringr::str_c(period, "01"))
  # Can also use
  # as.Date(paste0(period, "01"), "%Y%m%d")
}

tibble_1 %>%
  dplyr::mutate(date = period_to_date(period))

#   period   var_1  var_2 date
#    <dbl>   <dbl>  <dbl> <date>
# 1 201901 -0.476  -0.456 2019-01-01
# 2 201912 -0.645   1.45  2019-12-01
# 3 201902 -0.0939 -0.982 2019-02-01
# 4 201903  0.410   0.954 2019-03-01
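For a self-contained run, here is a sketch that uses the base R alternative mentioned in the comment. The tibble_1 below is a hypothetical stand-in that reproduces only the period column from the output above:

```r
library(dplyr)

# Base R variant of period_to_date(); the column is passed in explicitly.
period_to_date <- function(period) {
  as.Date(paste0(period, "01"), format = "%Y%m%d")
}

# Hypothetical stand-in for tibble_1 (only the period column is reproduced).
tibble_1 <- tibble(period = c(201901, 201912, 201902, 201903))

result <- tibble_1 %>%
  mutate(date = period_to_date(period))
```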
Using Dplyr within a user-defined function to summarise data then plot it
First of all, inside dplyr functions you don't need to index the data frame with expressions like df[, timevar]; just use the variable name. Also note that df[timevar] returns a one-column data frame rather than a vector, so it does not behave like a bare column inside a dplyr verb.
As for the function itself, the problem is one of evaluation.
The structure below works:
library(dplyr)
library(ggplot2)

ConsistencyPlot <- function(df, var1, timevar, lossvar) {
  # Capture the unquoted column names as quosures
  var1 <- enquo(var1)
  timevar <- enquo(timevar)
  lossvar <- enquo(lossvar)
  df1 <- df %>%
    group_by(!!timevar, !!var1) %>%
    summarise(MeanLoss = mean(!!lossvar))
  ggplot(df1, aes(x = !!var1, y = MeanLoss, color = !!timevar, group = !!timevar)) +
    geom_line() +
    geom_point()
}
Note that the parameters are captured with enquo() and then unquoted inside the dplyr and ggplot2 calls with !!. So you can pass the arguments without quoting them:
ConsistencyPlot(df, JudicialOrientation, Year, Loss)
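If the column names arrive as character strings instead of bare names, one option is the .data pronoun. This is only a sketch of the summary step: the function name mean_loss_by and the toy data are invented for illustration, and the plotting part is left out:

```r
library(dplyr)

# Column names are passed as strings and looked up with .data[[ ]].
mean_loss_by <- function(df, var1, timevar, lossvar) {
  df %>%
    group_by(.data[[timevar]], .data[[var1]]) %>%
    summarise(MeanLoss = mean(.data[[lossvar]]), .groups = "drop")
}

# Toy data, invented for this example.
toy <- tibble(
  Year = c(2001, 2001, 2002, 2002),
  JudicialOrientation = c("A", "A", "B", "B"),
  Loss = c(1, 3, 2, 6)
)

out <- mean_loss_by(toy, "JudicialOrientation", "Year", "Loss")
```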
I hope you find it useful.
How do I write a dplyr pipe-friendly function where a new column name is provided from a function argument?
In this case you can just stick with the embrace operator {{ }} for your variables. If you want to create column names dynamically, you still need to use :=. The difference here is that you can use glue-style syntax with the embrace operator to get the name of the symbol. This works with the data provided.
elective_open <- function(.data, name_for_elective, course, tiebreaker) {
  .data %>%
    mutate("{{name_for_elective}}" := ifelse({{tiebreaker}} == max({{tiebreaker}}), 1, 0)) %>%
    mutate("{{name_for_elective}}" := ifelse({{name_for_elective}} == 0, {{course}}[{{name_for_elective}} == 1], "")) %>%
    filter(!({{course}} %in% {{name_for_elective}}))
}
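As a smaller illustration of the naming mechanism on its own, here is a sketch with invented toy data (flag_top, top_course and the scores are all hypothetical). The bare name passed as new_col becomes the name of the new column:

```r
library(dplyr)

# "{{ new_col }}" := ... uses the symbol passed by the caller as the column name.
flag_top <- function(df, new_col, score) {
  df %>%
    mutate("{{ new_col }}" := ifelse({{ score }} == max({{ score }}), 1, 0))
}

# Toy data, invented for this example.
toy <- tibble(course = c("art", "music", "drama"), score = c(1, 3, 2))

out <- flag_top(toy, top_course, score)
```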
Call function from the global environment with implicit dataframe variables (from the calling env?) inside dplyr::summarise or mutate
Up front, I'm generally against writing functions that defeat functional reproducibility, having spent too much time troubleshooting functions that change behavior based on something not passed to them.
However, try this:
library(dplyr)

method_1 <- list(
  any_vs_four_gears = function(data = cur_data()) with(data, any(vs == 1 & gear == 4)),
  any_am_high_hp = function(data = cur_data()) with(data, any(am == 1 & hp > 170)),
  all_combined = function(data = cur_data()) with(data, all(any_vs_four_gears, any_am_high_hp))
)

mtcars %>%
  group_by(carb) %>%
  summarise(
    any_vs_four_gears = method_1$any_vs_four_gears(),
    any_am_high_hp = method_1$any_am_high_hp(),
    all_combined = method_1$all_combined()
  )
# # A tibble: 6 x 4
#    carb any_vs_four_gears any_am_high_hp all_combined
#   <dbl> <lgl>             <lgl>          <lgl>
# 1     1 TRUE              FALSE          FALSE
# 2     2 TRUE              FALSE          FALSE
# 3     3 FALSE             FALSE          FALSE
# 4     4 TRUE              TRUE           TRUE
# 5     6 FALSE             TRUE           FALSE
# 6     8 FALSE             TRUE           FALSE
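This uses the cur_data() function, which is available inside dplyr verbs such as summarise() and mutate(), adds just a little surrounding code (with(data, { ... }), so it is {-expression-friendly), and works "as is". Note that cur_data() is deprecated as of dplyr 1.1.0 in favour of pick().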
The errors are not difficult to interpret:
mtcars %>%
  select(-vs) %>%  # intentionally setting up an error
  group_by(carb) %>%
  summarise(
    any_vs_four_gears = method_1$any_vs_four_gears(),
    any_am_high_hp = method_1$any_am_high_hp(),
    all_combined = method_1$all_combined()
  )
# Error: Problem with `summarise()` column `any_vs_four_gears`.
# i `any_vs_four_gears = method_1$any_vs_four_gears()`.
# x object 'vs' not found
# i The error occurred in group 1: carb = 1.
# Run `rlang::last_error()` to see where the error occurred.
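For comparison, here is a sketch of the more explicit style I prefer, where each function receives the columns it needs as arguments instead of reaching into the current group. The helper names has_vs_four_gears and has_am_high_hp are made up for this example:

```r
library(dplyr)

# Each helper states its inputs explicitly.
has_vs_four_gears <- function(vs, gear) any(vs == 1 & gear == 4)
has_am_high_hp <- function(am, hp) any(am == 1 & hp > 170)

explicit <- mtcars %>%
  group_by(carb) %>%
  summarise(
    any_vs_four_gears = has_vs_four_gears(vs, gear),
    any_am_high_hp = has_am_high_hp(am, hp),
    # Later expressions in summarise() can use the columns created just above.
    all_combined = any_vs_four_gears & any_am_high_hp
  )
```

The trade-off is a little more typing at the call site in exchange for functions whose behaviour depends only on what is passed to them.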
Make plyr::ddply code compatible with dplyr-equivalent custom function
After some discussion I now understand that what is desired is to rewrite this function using dplyr rather than plyr such that for inputs such as those listed in the inputs section below it gives the same result.
dd <- function(data, group, var, fun)
  plyr::ddply(.data = data, .variables = group, var, .fun = fun)
To do that the new function can use group_by with either summarize or group_modify. dd1 below uses the first and dd2 uses the second. Use whichever you prefer.
Note that fun.z, as written, assumes a data frame rather than a tibble (subsetting a single column of a data frame returns a vector, whereas a tibble returns another tibble), so we use as.data.frame to ensure that. Also, plyr returns a data frame, so at the end of dd1 and dd2 we convert the resulting tibble to a data frame to ensure that the result is identical.
dd1 <- function(data, group, var, fun)
  data %>%
    group_by(across(all_of(group))) %>%
    summarize(V1 = fun(as.data.frame(cur_data()), var), .groups = "drop") %>%
    as.data.frame

dd2 <- function(data, group, var, fun)
  data %>%
    group_by(across(all_of(group))) %>%
    group_modify(~ { data.frame(V1 = fun(as.data.frame(.), var)) }) %>%
    ungroup %>%
    as.data.frame
Now test it out:
# inputs - start #
data <- mtcars
trim <- 0
na.rm <- FALSE
var <- "mpg"
group <- c("cyl", "am")
fun.z <- function(x, idx) {
  as.numeric(mean(x[, idx], trim = trim, na.rm = na.rm))
}
# inputs - end #
library(dplyr)
dd.out <- dd(data, group, var, fun.z) # plyr
dd1.out <- dd1(data, group, var, fun.z)
dd2.out <- dd2(data, group, var, fun.z)
identical(dd1.out, dd.out)
## [1] TRUE
identical(dd2.out, dd.out)
## [1] TRUE
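The grouping idiom used in dd1 and dd2 can also be tried in isolation. The sketch below groups by a character vector of column names with across(all_of()); it swaps in a plain mean() via the .data pronoun instead of fun.z, so it is an illustration of the idiom, not a replacement for dd1 (group_cols and value_col are names chosen for this example):

```r
library(dplyr)

group_cols <- c("cyl", "am")  # grouping columns given as strings
value_col <- "mpg"            # value column given as a string

by_group <- mtcars %>%
  group_by(across(all_of(group_cols))) %>%
  summarise(V1 = mean(.data[[value_col]]), .groups = "drop") %>%
  as.data.frame()
```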
How do I resolve dplyr::mutate error Unexpected '='?
How about this:
library(dplyr)
library(glue)
data(mtcars)
dat <- mtcars
mpg_table <- function(df, grouping_var, val) {
  df %>%
    mutate({{grouping_var}} := as.character({{grouping_var}})) %>%
    bind_rows(mutate(., {{grouping_var}} := "all")) %>%
    group_by({{grouping_var}}) %>%
    summarise("{{val}}q25" := quantile({{val}}, probs = .25),
              "{{val}}q50" := quantile({{val}}, probs = .50),
              "{{val}}q75" := quantile({{val}}, probs = .75),
              count = n())
}
mpg_table(dat, cyl, mpg)
#> # A tibble: 4 × 5
#> cyl mpgq25 mpgq50 mpgq75 count
#> <chr> <dbl> <dbl> <dbl> <int>
#> 1 4 22.8 26 30.4 11
#> 2 6 18.6 19.7 21 7
#> 3 8 14.4 15.2 16.2 14
#> 4 all 15.4 19.2 22.8 32
Created on 2022-09-29 by the reprex package (v2.0.1)
The := allows you to pass a variable in as the name of a new variable to be created. I also used the same construct for the variable names for the quantiles. This means that if you pass drat as val, for example, you would get dratq25, dratq50 and dratq75 as the variables in the output.
The other problem you run into is a type problem. The cyl variable is numeric and you're trying to bind it to a data frame whose cyl variable is character. The first step in the code above changes the grouping_var to character to avoid this problem.
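The type problem can be reproduced on its own with two tiny tibbles (num and chr are invented for this sketch). Binding a numeric column to a character column fails, and converting first makes it work:

```r
library(dplyr)

num <- tibble(cyl = c(4, 6))  # numeric column
chr <- tibble(cyl = "all")    # character column

# bind_rows() refuses to combine numeric and character columns.
res <- tryCatch(
  bind_rows(num, chr),
  error = function(e) "incompatible types"
)

# Converting the numeric column to character first avoids the error.
fixed <- bind_rows(mutate(num, cyl = as.character(cyl)), chr)
```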
Passing a user defined function to `dplyr::summarize()` when 'data' is an argument of user defined function
I tried to repair your function and apply it to your data:
library(dplyr)
topht <- function(data, dbh = NULL, ht = NULL, tpa = NULL, n = 40) {
  ## evaluate function parameters in the data environment
  tmp <- data %>% pull({{ dbh }})
  odata <- data[base::order(tmp, decreasing = TRUE), ]
  ht <- odata %>% pull({{ ht }})
  # NOTE: tpa is pulled from `data` (unsorted) while ht comes from `odata`;
  # if tpa varies by row, pulling it from odata keeps the two aligned.
  tpa <- data %>% pull({{ tpa }})
  # creating variables for cumulative trees per acre and cumulative height calculations
  cumtpa <- 0
  cumht <- 0
  outcome <- 0
  for (i in 1:nrow(odata)) {
    if (cumtpa < n) {
      cumtpa <- tpa[i] + cumtpa
      cumht <- (ht[i] * tpa[i]) + cumht
    } else if (cumtpa == n) {
      break
    } else {
      delta <- cumtpa - n
      cumtpa <- cumtpa - delta
      cumht <- cumht - (delta * ht[i])
      break
    }
    if (cumtpa > 0) {
      outcome <- cumht / cumtpa
    } else {
      outcome <- 0
    }
  }
  outcome
}
Now we apply this function to each group:
DF %>%
  group_by(groups) %>%
  group_modify(~ .x %>% summarize(TOP_HT = topht(., dbh = dbh, ht = ht, tpa = tpa, n = 40))) %>%
  ungroup() %>%
  as.data.frame()
We want to apply topht to each group, so we use group_modify (it's like purrr's little sister). This returns
  groups    TOP_HT
1      A  88.75246
2      B 123.01531
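If group_modify is new to you, here is a minimal sketch on the built-in iris data, unrelated to your DF. The function passed to group_modify receives each group's rows as .x and must return a data frame; the grouping column is added back automatically:

```r
library(dplyr)

# One row per species, holding the largest Sepal.Length in that group.
longest <- iris %>%
  group_by(Species) %>%
  group_modify(~ data.frame(max_len = max(.x$Sepal.Length))) %>%
  ungroup() %>%
  as.data.frame()
```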
A few words of explanation:
- Since your function is named topht, you really should not use topht as a variable name (even inside this function). I changed it to outcome.
- outcome should be defined / initialised with some value. I chose 0; NA or something else might also be possible.
- return() at the end of a function is unnecessary. Just use the variable name.
- To evaluate the function's arguments (like dbh = dbh) you need the curly-curly operator. As a reference: https://www.r-bloggers.com/2019/06/curly-curly-the-successor-of-bang-bang/
- Your first if-construction should be packed together into an if - else if - else construction.
- To improve readability, you can use some spacing (see http://adv-r.had.co.nz/Style.html).