How to Define a Function in Dplyr

How to define a function in dplyr?

If you have rlang 0.4.0 (released June 2019) or later, you can use double curly brackets {{ }} (aka "curly curly") to make programming with dplyr easier.

# Note: needs installation of rlang 0.4.0 or later
get_pivot <- function(data, predictor, target) {
  result <-
    data %>%
    group_by(as.factor({{ predictor }})) %>%
    summarise(sum = sum({{ target }}), total = n()) %>%
    mutate(percentage = sum * 100 / total)

  print(result)
}

# Edit -- thank you Rui Barradas
# Note: mpg_cat is not a built-in mtcars column; it is assumed to have been added beforehand
> get_pivot(mtcars, cyl, mpg_cat)
# A tibble: 3 x 4
  `as.factor(cyl)`   sum total percentage
  <fct>            <dbl> <int>      <dbl>
1 4                   11    11      100  
2 6                    3     7       42.9
3 8                    0    14        0

This is required because dplyr and other tidyverse packages use "non-standard evaluation", which you also encounter in some base R functions such as lm(mpg~factor(am),data=mtcars). This practice often makes "interactive" code shorter, simpler, and easier to read, but at the cost of making programming with these functions more complicated. Here, the {{ }} operator transports the column you specify into the context of the function.

https://www.tidyverse.org/articles/2019/06/rlang-0-4-0/
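
To make the point about non-standard evaluation concrete, here is a minimal sketch. The helper names mean_by and mean_by_broken are made up for illustration and are not part of the original answer; it assumes dplyr with rlang 0.4.0 or later.

```r
library(dplyr)

# Hypothetical helper: mean of one column per group, using {{ }}.
mean_by <- function(data, group_col, value_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarise(mean_value = mean({{ value_col }}), .groups = "drop")
}

res_mean_by <- mean_by(mtcars, cyl, mpg)
res_mean_by

# Same function without {{ }}: group_col is evaluated as an ordinary
# R variable, so R looks for an object called `cyl` outside the data
# frame and fails.
mean_by_broken <- function(data, group_col, value_col) {
  data %>%
    group_by(group_col) %>%
    summarise(mean_value = mean(value_col), .groups = "drop")
}

broken <- tryCatch(
  mean_by_broken(mtcars, cyl, mpg),
  error = function(e) "error"
)
broken
```

The first version finds cyl and mpg inside mtcars; the second only works if objects with those names happen to exist in the calling environment.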

Error when using dplyr inside of a function

UPDATE: As of dplyr 0.7.0 you can use tidy eval to accomplish this.

See http://dplyr.tidyverse.org/articles/programming.html for more details.

filter_big <- function(spp, LENGTH, WIDTH) {
  LENGTH <- enquo(LENGTH)                    # Create quosure
  WIDTH  <- enquo(WIDTH)                     # Create quosure

  iris %>% 
    filter(Species == spp) %>% 
    select(!!LENGTH, !!WIDTH) %>%            # Use !! to unquote the quosure
    mutate(sum = (!!LENGTH) + (!!WIDTH)) %>% # Use !! to unquote the quosure
    filter(sum > 4) %>% 
    nrow()
}

filter_big("virginica", Sepal.Length, Sepal.Width)

> filter_big("virginica", Sepal.Length, Sepal.Width)
[1] 50
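
For comparison, the same function can probably be written more compactly with the {{ }} operator, which combines the enquo() and !! steps. This is a sketch that assumes rlang 0.4.0 or later; the name filter_big2 and its argument names are made up here.

```r
library(dplyr)

# Hypothetical rewrite of filter_big using {{ }} instead of enquo() + !!.
filter_big2 <- function(spp, length_col, width_col) {
  iris %>%
    filter(Species == spp) %>%
    select({{ length_col }}, {{ width_col }}) %>%
    mutate(sum = {{ length_col }} + {{ width_col }}) %>%
    filter(sum > 4) %>%
    nrow()
}

n_big <- filter_big2("virginica", Sepal.Length, Sepal.Width)
n_big
```

It should return the same count as filter_big above.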

Calling user defined functions from dplyr::mutate

The function does not know which object you want to modify. Pass the period column to the function as an argument and use it like this:

period_to_date <- function(period) {
  lubridate::ymd(stringr::str_c(period, "01"))
  #Can also use
  #as.Date(paste0(period,"01"), "%Y%m%d")
}

tibble_1 %>% 
  dplyr::mutate(date = period_to_date(period))

#  period   var_1  var_2 date      
#   <dbl>   <dbl>  <dbl> <date>    
#1 201901 -0.476  -0.456 2019-01-01
#2 201912 -0.645   1.45  2019-12-01
#3 201902 -0.0939 -0.982 2019-02-01
#4 201903  0.410   0.954 2019-03-01
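
A self-contained version of the same idea, using the base R alternative from the comment above. The tibble here is a made-up stand-in for tibble_1 with only the period column; the point it illustrates is that mutate() hands the whole column to the function as a vector, so the function has to be vectorised.

```r
library(dplyr)

# Hypothetical stand-in for tibble_1; only the period column matters here.
tibble_1 <- tibble(period = c(201901, 201912, 201902, 201903))

period_to_date <- function(period) {
  # `period` arrives as the whole column (a vector of length 4 here)
  as.Date(paste0(period, "01"), format = "%Y%m%d")
}

with_date <- tibble_1 %>%
  mutate(date = period_to_date(period))
with_date
```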

Using Dplyr within a user-defined function to summarise data then plot it

First of all, inside dplyr functions you don't need to index the data frame, as in df[, timevar], to refer to a variable; just use the variable name. Also note that single-bracket indexing without a comma, as in df[timevar], selects columns and returns a one-column data frame rather than the column itself, so it is not what you want here.
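
A small sketch of the indexing difference, using a made-up data frame (df_demo and its values are invented for illustration):

```r
library(dplyr)

# Made-up data for illustration only.
df_demo <- data.frame(Year = c(2019, 2020), Loss = c(1, 2))
timevar <- "Year"

single_bracket <- df_demo[timevar]    # one-column data frame
double_bracket <- df_demo[[timevar]]  # the column itself, a numeric vector

class(single_bracket)
class(double_bracket)

# Inside dplyr verbs, the bare column name is enough.
mean_loss_demo <- df_demo %>% summarise(MeanLoss = mean(Loss))
mean_loss_demo
```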

As for the function itself, the problem is one of evaluation.

The structure below works:

ConsistencyPlot <- function(df, var1, timevar, lossvar){
  var1 <- enquo(var1)
  timevar <- enquo(timevar)
  lossvar <- enquo(lossvar)

  df1 <- df %>%
    group_by(!!timevar, !!var1) %>%
    summarise(MeanLoss = mean(!!lossvar))

  ggplot(df1, aes(x = !!var1, y = MeanLoss, color = !!timevar, group = !!timevar)) +
    geom_line() +
    geom_point()
}

Note that the parameters are captured with enquo() and then unquoted inside the dplyr and ggplot2 calls with !!. This lets you pass the column names as bare (unquoted) arguments:

ConsistencyPlot(df, JudicialOrientation, Year, Loss)

I hope you find it useful.

How do I write a dplyr pipe-friendly function where a new column name is provided from a function argument?

In this case you can stick to the embrace operator {{ }} for your variables. To create column names dynamically, you still need :=. The difference is that you can use glue-style syntax with the embrace operator inside a string to get the name of the symbol. This works with the data provided.

elective_open <- function(.data, name_for_elective, course, tiebreaker){ 
  .data%>%
    mutate("{{name_for_elective}}" := ifelse({{tiebreaker}}==max({{tiebreaker}}),1,0)) %>%
    mutate("{{name_for_elective}}" := ifelse({{name_for_elective}}==0,{{course}}[{{name_for_elective}}==1],"")) %>%
    filter(!({{course}} %in% {{name_for_elective}}))
}
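
A smaller sketch of the same naming pattern, in case the function above is hard to follow without the original data. The helper flag_max and the column name is_top_mpg are made up for illustration; it assumes an rlang version recent enough to support glue-style names.

```r
library(dplyr)

# Hypothetical helper: flag the rows holding the maximum of `col`,
# storing the result in a column named after `new_name`.
flag_max <- function(.data, new_name, col) {
  .data %>%
    mutate("{{new_name}}" := ifelse({{col}} == max({{col}}), 1, 0))
}

flagged <- flag_max(mtcars, is_top_mpg, mpg)
names(flagged)
```

The new column takes its name from whatever bare symbol is passed as new_name.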

Call function from the global environment with implicit dataframe variables (from the calling env?) inside dplyr::summarise or mutate

Up front, I'm generally against writing functions that defeat functional reproducibility, having spent too much time troubleshooting functions that change behavior based on something not passed to them.

However, try this:

method_1 <- list(
  any_vs_four_gears = function(data = cur_data()) with(data, any(vs == 1 & gear == 4)),
  any_am_high_hp = function(data = cur_data()) with(data, any(am == 1 & hp > 170)),
  all_combined = function(data = cur_data()) with(data, all(any_vs_four_gears, any_am_high_hp))
)

mtcars %>%
  group_by(carb) %>%
  summarise(
    any_vs_four_gears = method_1$any_vs_four_gears(),
    any_am_high_hp = method_1$any_am_high_hp(),
    all_combined = method_1$all_combined()
  )
# # A tibble: 6 x 4
#    carb any_vs_four_gears any_am_high_hp all_combined
#   <dbl> <lgl>             <lgl>          <lgl>       
# 1     1 TRUE              FALSE          FALSE       
# 2     2 TRUE              FALSE          FALSE       
# 3     3 FALSE             FALSE          FALSE       
# 4     4 TRUE              TRUE           TRUE        
# 5     6 FALSE             TRUE           FALSE       
# 6     8 FALSE             TRUE           FALSE

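This uses cur_data(), which dplyr makes available inside verbs such as summarise() and mutate(), and adds only a little surrounding code: with(data, ...), which also accepts a { ... } block. It works "as is". Note that cur_data() was deprecated in dplyr 1.1.0 in favour of pick().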

The errors are not difficult to interpret:

mtcars %>%
  select(-vs) %>%     # intentionally setting up an error
  group_by(carb) %>%
  summarise(
    any_vs_four_gears = method_1$any_vs_four_gears(),
    any_am_high_hp = method_1$any_am_high_hp(),
    all_combined = method_1$all_combined()
  )
# Error: Problem with `summarise()` column `any_vs_four_gears`.
# i `any_vs_four_gears = method_1$any_vs_four_gears()`.
# x object 'vs' not found
# i The error occurred in group 1: carb = 1.
# Run `rlang::last_error()` to see where the error occurred.
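
For comparison with the caveat at the top of this answer, here is a sketch of the explicit alternative, where each function receives the columns it needs as arguments instead of reaching into the current group. The function names has_vs_four_gears and has_am_high_hp are made up for illustration.

```r
library(dplyr)

# Hypothetical helpers that take their inputs explicitly.
has_vs_four_gears <- function(vs, gear) any(vs == 1 & gear == 4)
has_am_high_hp    <- function(am, hp) any(am == 1 & hp > 170)

explicit <- mtcars %>%
  group_by(carb) %>%
  summarise(
    any_vs_four_gears = has_vs_four_gears(vs, gear),
    any_am_high_hp    = has_am_high_hp(am, hp),
    # columns created earlier in the same summarise() can be reused
    all_combined      = all(any_vs_four_gears, any_am_high_hp)
  )
explicit
```

This is more typing at the call site, but the dependencies of each function are visible where it is called.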

Make plyr::ddply code compatible with dplyr-equivalent custom function

After some discussion, I now understand the goal: rewrite this function with dplyr instead of plyr so that, for inputs such as those in the inputs section below, it gives the same result.

dd <- function(data, group, var, fun) 
  plyr::ddply(.data = data, .variables = group, var, .fun = fun)

To do that the new function can use group_by with either summarize or group_modify. dd1 below uses the first and dd2 uses the second. Use whichever you prefer.

Note that fun.z, as written, assumes a data frame rather than a tibble: indexing a single column of a data frame with x[, idx] returns a vector, whereas the same operation on a tibble returns another tibble. We therefore use as.data.frame to ensure a data frame is passed. Also, plyr returns a data frame, so at the end of dd1 and dd2 we convert the tibble produced to a data frame to ensure that the result is identical.

dd1 <- function(data, group, var, fun)
  data %>% 
    group_by(across(all_of(group))) %>%
    summarize(V1 = fun(as.data.frame(cur_data()), var), .groups = "drop") %>%
    as.data.frame

dd2 <- function(data, group, var, fun)
  data %>%
    group_by(across(all_of(group))) %>%
    group_modify(~ { data.frame(V1 = fun(as.data.frame(.), var)) }) %>%
    ungroup %>%
    as.data.frame
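
The data frame versus tibble difference that motivates the as.data.frame calls can be checked directly. A minimal sketch (the variable names df_col and tbl_col are made up):

```r
library(dplyr)

# Base data frame: selecting one column with [, ] drops to a vector.
df_col <- mtcars[, "mpg"]

# Tibble: the same indexing keeps a one-column tibble.
tbl_col <- as_tibble(mtcars)[, "mpg"]

is.numeric(df_col)
is.data.frame(tbl_col)
```

Passing tbl_col to mean() would not give the intended result, which is why fun.z needs a plain data frame.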

Now test it out:

# inputs - start #

data <- mtcars
trim <- 0
na.rm <- FALSE
var <- "mpg"
group <- c("cyl", "am")

fun.z <- function(x, idx) { 
  as.numeric(mean(x[, idx], trim = trim, na.rm = na.rm))
}

# inputs - end #

library(dplyr)

dd.out <- dd(data, group, var, fun.z) # plyr
dd1.out <- dd1(data, group, var, fun.z)
dd2.out <- dd2(data, group, var, fun.z)

identical(dd1.out, dd.out)
## [1] TRUE

identical(dd2.out, dd.out)
## [1] TRUE

How do I resolve dplyr::mutate error Unexpected '='?

How about this:

library(dplyr)
library(glue)
data(mtcars)
dat <- mtcars
mpg_table <- function(df, grouping_var, val) {
  df %>% 
    mutate({{grouping_var}} := as.character({{grouping_var}})) %>% 
    bind_rows(mutate(., {{grouping_var}} := "all")) %>%
    group_by({{grouping_var}}) %>%
    summarise("{{val}}q25" := quantile({{val}}, probs = .25),
              "{{val}}q50" := quantile({{val}}, probs = .50),
              "{{val}}q75" := quantile({{val}}, probs = .75),
              count = n())
}

mpg_table(dat, cyl, mpg)
#> # A tibble: 4 × 5
#>   cyl   mpgq25 mpgq50 mpgq75 count
#>   <chr>  <dbl>  <dbl>  <dbl> <int>
#> 1 4       22.8   26     30.4    11
#> 2 6       18.6   19.7   21       7
#> 3 8       14.4   15.2   16.2    14
#> 4 all     15.4   19.2   22.8    32

Created on 2022-09-29 by the reprex package (v2.0.1)

The := allows you to pass a variable in as the name of a new variable to be created. I also used the same construct for the variable names for the quantiles. This means that if you pass drat as val for example, you would get dratq25, dratq50 and dratq75 as the variables in the output.

The other problem you run into is a type mismatch. The cyl variable is numeric and you're trying to bind it to a data frame whose cyl variable is a character. The first step in the code above converts grouping_var to character to avoid this problem.
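
The type mismatch can be reproduced in isolation. A minimal sketch with made-up two-row data (num_tbl and chr_tbl are invented names):

```r
library(dplyr)

num_tbl <- tibble(cyl = c(4, 6))   # numeric column
chr_tbl <- tibble(cyl = "all")     # character column

# Binding numeric and character columns of the same name fails.
mismatch <- tryCatch(
  bind_rows(num_tbl, chr_tbl),
  error = function(e) "error"
)

# Converting to character first makes the bind succeed.
fixed <- bind_rows(mutate(num_tbl, cyl = as.character(cyl)), chr_tbl)
fixed
```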

Passing a user defined function to `dplyr::summarize()` when 'data' is an argument of user defined function

I tried to repair your function and apply it to your data:

library(dplyr)

topht <- function(data, dbh = NULL, ht = NULL, tpa = NULL, n = 40){ 
  
  ##evaluate function parameters in the data environment
  tmp <- data %>% pull({{ dbh }})
  odata <- data[base::order(tmp, decreasing=TRUE),]
  ht <- odata %>% pull({{ ht }})
  # pull tpa from the sorted data so it stays aligned with ht
  tpa <- odata %>% pull({{ tpa }})
  
  #creating variables for cumulative trees per acre and cumulative height calculations#
  cumtpa <- 0
  cumht <- 0
  outcome <- 0
  
  for(i in seq_len(nrow(odata))) {
    
    if(cumtpa < n){ 
      
      cumtpa <- tpa[i] + cumtpa
      cumht <- (ht[i] * tpa[i]) + cumht
      
    } else if(cumtpa == n){
      
      break
      
    } else  {
      
      delta <- cumtpa - n
      cumtpa <- cumtpa - delta
      cumht <- cumht - (delta*ht[i])
      break
      
    }
    
    if(cumtpa > 0) {
      
      outcome <- cumht / cumtpa
      
    } else {
      
      outcome <- 0
      
    }
    
  }   
  
  outcome
}

Now we apply this function to each group:

DF %>% 
  group_by(groups) %>% 
  group_modify(~ .x %>% summarize(TOP_HT = topht(., dbh = dbh, ht = ht, tpa = tpa, n = 40))) %>% 
  ungroup() %>% 
  as.data.frame()

We want to apply topht to each group, so we use group_modify (it's like purrr's little sister). This returns

  groups    TOP_HT
1      A  88.75246
2      B 123.01531
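
Since DF is not shown here, the group_modify pattern can be tried on a built-in dataset instead. This is only a sketch of the mechanism with a simple summary in place of topht:

```r
library(dplyr)

# group_modify() calls the function once per group; .x is that group's data.
per_group <- mtcars %>%
  group_by(cyl) %>%
  group_modify(~ summarise(.x, mean_mpg = mean(mpg), n = n())) %>%
  ungroup()
per_group
```

The function passed to group_modify must return a data frame, which is why the call to topht above is wrapped in summarize.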

A few words of explanation:

Since your function is named topht, you should not also use topht as a variable name (even inside the function). I changed it to outcome.
outcome should be initialised with some value. I chose 0; NA or another value would also work.
return() at the end of a function is unnecessary. Just end with the variable name.
To evaluate the function's arguments (like dbh = dbh) as columns of the data, you need the curly-curly operator. As a reference: https://www.r-bloggers.com/2019/06/curly-curly-the-successor-of-bang-bang/
Your first if construction should be combined into a single if / else if / else construction.
To improve readability, you can use some spacing (see http://adv-r.had.co.nz/Style.html).
