Using Tidy Eval for Multiple Dplyr Filter Conditions

Using tidy eval for multiple, arbitrary filter conditions

Create a list of calls and splice them in:

library(dplyr)
library(gapminder)

cols <- list("country", "year")
ops <- list("%in%", ">=")
vals <- list(c("Albania", "France"), 2007)

# Assumes LHS is the name of a variable and OP is
# the name of a function
op_call <- function(op, lhs, rhs) {
call(op, sym(lhs), rhs)
}

my_filter <- function(data, cols, ops, vals) {
exprs <- purrr::pmap(list(ops, cols, vals), op_call)
data %>% dplyr::filter(!!!exprs)
}

gapminder %>% my_filter(cols, ops, vals)
#> # A tibble: 2 × 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Albania Europe 2007 76.4 3600523 5937.
#> 2 France Europe 2007 80.7 61083916 30470.

Here we don't have to worry about scoping issues because (a) the column names are assumed to be defined in the data mask, (b) the values are passed by value and inlined in the created calls, and (c) the functions are assumed to be binary operators and these are rarely redefined.

To allow custom user functions, there are two ways we could go about it. First, we could take an environment and create quosures manually with new_quosure():

op_call <- function(op, lhs, rhs, env = caller_env()) {
new_quosure(call(op, sym(lhs), rhs), env)
}

my_filter <- function(data, cols, ops, vals, env = caller_env()) {
exprs <- purrr::pmap(list(ops, cols, vals), op_call, env)
data %>% dplyr::filter(!!!exprs)
}

gapminder %>% my_filter(cols, ops, vals)

local({
my_op <- `%in%`
gapminder %>% my_filter(cols, list("my_op", ">="), vals)
})
#> # A tibble: 2 × 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Albania Europe 2007 76.4 3600523 5937.
#> 2 France Europe 2007 80.7 61083916 30470.

The other way, perhaps simpler, is to allow the call to contain inlined functions. To that end, use rlang::call2() instead of base::call():

op_call <- function(op, lhs, rhs) {
call2(op, sym(lhs), rhs)
}

my_filter <- function(data, cols, ops, vals) {
exprs <- purrr::pmap(list(ops, cols, vals), op_call)
data %>% dplyr::filter(!!!exprs)
}

local({
my_op <- `%in%`
gapminder %>% my_filter(cols, list(my_op, ">="), vals)
})
#> # A tibble: 2 × 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Albania Europe 2007 76.4 3600523 5937.
#> 2 France Europe 2007 80.7 61083916 30470.

The downside of inlining functions is that this will prevent optimisations and transportability to other dplyr backends.

Using tidy eval for multiple dplyr filter conditions

Using the tidyverse, you could re-write that function as

library(dplyr)
library(purrr) # for map2()

my_filter <- function(df, cols, conds){
fp <- map2(cols, conds, function(x, y) quo((!!(as.name(x))) %in% !!y))
filter(df, !!!fp)
}

my_filter(gapminder::gapminder, cols = list("continent", "country"),
conds = list("Europe", c("Albania", "France")))

This is calling the equivalent of

filter(gapminder, continent %in% "Europe", country %in% c("Albania", "France"))

The main reason this works is that you can pass multiple arguments to filter() and they are implicitly combined with &. And map2() is just a tidyverse equivalent for mapply with two objects to iterate.

What's the tidyeval approach of using dplyr::filter with a database when condition is specified as string?

Use rlang::parse_expr.

library(dplyr)

iris_db %>% filter(!!rlang::parse_expr(filter_str)) %>% collect() %>% count()

# n
# <int>
#1 50

It returns the query:

iris_db %>% filter(!!rlang::parse_expr(filter_str)) %>% show_query()

#<SQL>
#SELECT *
#FROM `iris`
#WHERE (`Species` = 'setosa')

Using tidy evaluation to filter in my own function

You can use deparse + substitute to change unquoted arguments to quoted ones.

library(dplyr)

test <- function(df, condition) {
val <- deparse(substitute(condition))
df %>% filter(supp %in% val)
}

test(ToothGrowth, VC)

Dplyr Multiple Lags Tidy Eval?

From this blog post: multiple lags with tidy evaluation by Romain François

library(rlang)
library(tidyverse)

a <- as_tibble(c(1:100))
n_lags <- 3

lags <- function(var, n = 3) {
var <- enquo(var)
indices <- seq_len(n)

# create a list of quosures by looping over `indices`
# then give them names for `mutate` to use later
map(indices, ~ quo(lag(!!var, !!.x))) %>%
set_names(sprintf("L_%02d.%s", indices, "y"))
}

# unquote the list of quosures so that they are evaluated by `mutate`
a %>%
mutate_at(vars(value), funs(!!!lags(value, n_lags)))

#> # A tibble: 100 x 4
#> value L_01.y L_02.y L_03.y
#> <int> <int> <int> <int>
#> 1 1 NA NA NA
#> 2 2 1 NA NA
#> 3 3 2 1 NA
#> 4 4 3 2 1
#> 5 5 4 3 2
#> 6 6 5 4 3
#> 7 7 6 5 4
#> 8 8 7 6 5
#> 9 9 8 7 6
#> 10 10 9 8 7
#> # ... with 90 more rows

Created on 2019-02-15 by the reprex package (v0.2.1.9000)

How to filter a data frame programmatically with dplyr and tidy evaluation?

Try

filter_starwars <- function(...) {
F <- quos(...)
filter(starwars, !!!F)
}

filter_starwars(species == 'Human', homeworld %in% c('Tatooine', 'Alderaan'), height > 175)
# # A tibble: 7 × 13
# name height mass hair_color skin_color eye_color birth_year
# <chr> <int> <dbl> <chr> <chr> <chr> <dbl>
# 1 Darth Vader 202 136 none white yellow 41.9
# 2 Owen Lars 178 120 brown, grey light blue 52.0
# 3 Biggs Darklighter 183 84 black light brown 24.0
# 4 Anakin Skywalker 188 84 blond fair blue 41.9
# 5 Cliegg Lars 183 NA brown fair blue 82.0
# 6 Bail Prestor Organa 191 NA black tan brown 67.0
# 7 Raymus Antilles 188 79 brown light brown NA
# # ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
# # films <list>, vehicles <list>, starships <list>

See https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html. Briefly, quos captures ... as a list, without evaluating the arguments. !!! splices and unquotes the arguments for evaluation in filter().

Join and group_by tidy eval issue

The issue is using str which gets the structure of an object. Assuming that colName is passed as a string, we don't need any wrapping. Inside the function it is converted to symbol with ensym. So, either get the input (assume it is a string) before converting to symbol as a different object or make use of as_string from rlang

 emp_turnover_fun <- function(data, colName, year = "2015") {

# Convert colName to symbol or check if symbol
colName <- ensym(colName)
colName_str <- rlang::as_string(colName) ## converted to string


# Terminations by year and variable in df
term_test <- data %>%
filter(year(DateofTermination) == year) %>%
count(!!(colName)) %>%
clean_names()

# Start employees by var and year
fun_year_job <- paste(year, "-01-01", sep = "")
start_test <- data %>%
select(DateofHire, DateofTermination, !!(colName)) %>%
filter(
DateofHire <= fun_year_job,
DateofTermination > fun_year_job | is.na(DateofTermination)
) %>%
count(!!(colName))

# End employees by year and var
year_pos <- year %>% as.character()
year_num_plus_pos <- as.character(as.numeric(year_pos) + 1)
fun_year2_pos <- paste(year_num_plus_pos, "-01-01", sep = "")

end_test <- data %>%
select(DateofHire, DateofTermination, !!(colName)) %>%
filter(
DateofHire <= fun_year2_pos,
DateofTermination > fun_year2_pos | is.na(DateofTermination)
) %>%
count(!!(colName))

join_turnover_year <- full_join(start_test, end_test,
by = colName_str) %>% # use the string
full_join(y = term_test, by = colName_str) %>% # use the string
setNames(c(colName_str, "Start_Headcount", "End_Headcount",
"Terminations")) %>% # here as well
group_by({{colName}}) %>%
summarise(Turnover = ((Terminations) / (Start_Headcount + End_Headcount)) * 100)

return(join_turnover_year)
}

It is safer to do as_string as opposed to taking the input directly as string i.e. ensym can work with both unquoted or quoted values, thus if we are passing unquoted, then grabbing the input doesn't work i.e. it may need deparse(substitute(colName)). Instead, first convert to symbol and then do the conversion back to string with as_string

How to pass an expression to the filter() verb the tidy way?

We can select the columns of interest and extract the components with ..1, ..2 (or .x, .y - if there are only 2 columns)

library(dplyr)
library(tibble)
library(purrr)
tibble(pick_these = get_these %>%
map(rlang::parse_expr)) %>%
mutate(wat = names(get_these),
goods = list(diamonds)) %>%
mutate(goods = pmap(select(., goods, pick_these), ~ {
..1 %>%
filter(rlang::eval_tidy( ..2))

})) %>%
dplyr::select(goods, wat)

# A tibble: 2 x 2
# goods wat
# <list> <chr>
#1 <tibble [3,903 × 10]> Ideal E
#2 <tibble [307 × 10]> Good J

dplyr filter keep the NAs AND OR conditions

Use | instead of &. With filter, multiple expressions separated by , are taken as &. It is not possible to have a value that is both NA and not equal to 182

library(dplyr)
test_data_1 %>%
filter(is.na(Art) | Art != 182)

-output

# A tibble: 8 × 1
Art
<dbl>
1 188
2 NA
3 NA
4 140
5 NA
6 NA
7 NA
8 NA

The second part of the question is with %in%. We can use | again

test_data_1 %>%
filter(Art %in% c(140,188) | is.na(Art))
# A tibble: 8 × 1
Art
<dbl>
1 188
2 NA
3 NA
4 140
5 NA
6 NA
7 NA
8 NA

NOTE: By default, filter removes the NA elements. In addition, there is no na.rm argument in filter



Related Topics



Leave a reply



Submit