Why Is Using Dplyr Pipe (%>%) Slower Than an Equivalent Non-Pipe Expression, for High-Cardinality Group-By

Why is using the dplyr pipe (%>%) slower than an equivalent non-pipe expression, for a high-cardinality group-by?

What might be a negligible effect in a full real-world application becomes non-negligible when writing one-liners whose timing is dominated by that formerly "negligible" cost. I suspect that if you profile your tests, most of the time will be spent in the summarize clause, so let's microbenchmark something similar to that:

> library(microbenchmark); library(magrittr)
> set.seed(99); z <- sample(10000, 4, TRUE)
> microbenchmark(z %>% unique %>% list, list(unique(z)))
Unit: microseconds
                  expr     min      lq      mean   median      uq     max neval
 z %>% unique %>% list 142.617 144.433 148.06515 145.0265 145.969 297.735   100
       list(unique(z))   9.289   9.988  10.85705  10.5820  11.804  12.642   100

This is doing something a bit different from your code, but it illustrates the point: pipes are slower.

Because pipes have to restructure R's call into the same nested form that an ordinary function evaluation uses, and then evaluate it. So it has to be slower. By how much depends on how fast the functions being called are. Calls to unique and list are pretty fast in R, so the whole difference here is the pipe overhead.
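As a minimal sketch of what that restructuring amounts to (assuming the z and packages from the benchmark above; note that more recent magrittr releases have reduced this overhead considerably):

# Both forms compute the same thing...
identical(z %>% unique %>% list, list(unique(z)))
#> [1] TRUE

# ...but %>% takes the unevaluated chain apart and rebuilds the nested call at
# run time before evaluating it, which is where the measured overhead comes from.
# For comparison, the native |> pipe (R >= 4.1) does this rewriting at parse
# time, so the call that actually gets evaluated is literally the nested form:
quote(z |> unique() |> list())
#> list(unique(z))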

Profiling expressions like this showed me that most of the time is spent in the pipe functions:

                 total.time total.pct self.time self.pct
"microbenchmark"      16.84     98.71      1.22     7.15
"%>%"                 15.50     90.86      1.22     7.15
"eval"                 5.72     33.53      1.18     6.92
"split_chain"          5.60     32.83      1.92    11.25
"lapply"               5.00     29.31      0.62     3.63
"FUN"                  4.30     25.21      0.24     1.41
..... stuff .....

then somewhere down in about 15th place the real work gets done:

"as.list"                      1.40      8.13      0.66     3.83
"unique" 1.38 8.01 0.88 5.11
"rev" 1.26 7.32 0.90 5.23

Whereas if you just call the functions as Chambers intended, R gets straight down to it:

                 total.time total.pct self.time self.pct
"microbenchmark"       2.30     96.64      1.04    43.70
"unique"               1.12     47.06      0.38    15.97
"unique.default"       0.74     31.09      0.64    26.89
"is.factor"            0.10      4.20      0.10     4.20

Hence the oft-quoted recommendation that pipes are fine on the command line, where your brain thinks in chains, but not in functions that might be time-critical. In practice this overhead will probably get wiped out by a single call to glm with a few hundred data points, but that's another story....
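To tie this back to the high-cardinality group-by in the question, a rough sketch (the data frame below is made up): summarise() evaluates its expression once per group, so a per-call pipe overhead like the one measured above is paid once per group rather than once overall.

library(dplyr)

dat <- data.frame(g = sample(1e4, 1e5, TRUE), z = sample(10, 1e5, TRUE))

# ~10,000 groups, so the piped summarise pays the pipe overhead ~10,000 times
system.time(dat %>% group_by(g) %>% summarise(u = list(unique(z))))
system.time(dat %>% group_by(g) %>% summarise(u = z %>% unique %>% list))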

How to write an efficient wrapper for data wrangling, allowing any wrapped step to be turned off when calling the wrapper

Staying with %>%, you could create a functional sequence:

library(magrittr)
library(dplyr)
library(tidyr)

# convert_true_false_to_1_0() is assumed to be a helper defined elsewhere
my_wrangling_wrapper =
  . %>%
  janitor::clean_names() %>%
  mutate(across(everything(), convert_true_false_to_1_0)) %>%
  mutate(across(everything(), tolower)) %>%
  pivot_wider(names_from = condition, values_from = score) %>%
  drop_na()

As this sequence behaves like a list, you can decide which steps to use by selecting the elements:

clean_names       = TRUE
convert_tf_to_1_0 = TRUE
convert_to_lower  = FALSE
pivot_widr        = FALSE
drp_na            = TRUE

my_wrangling_wrapper[c(clean_names,
                       convert_tf_to_1_0,
                       convert_to_lower,
                       pivot_widr,
                       drp_na)]

#Functional sequence with the following components:
#
# 1. janitor::clean_names(.)
# 2. mutate(., across(everything(), convert_true_false_to_1_0))
# 3. drop_na(.)

df %>% my_wrangling_wrapper[c(clean_names,
                              convert_tf_to_1_0,
                              convert_to_lower,
                              pivot_widr,
                              drp_na)]()

#  id is_male weight hash_numb score
#1  1       1     51     Zm1Xx   343
#2  3       1     99     Xc2rm   703
#3  6       0     62     2r2cP   243
#4 12       0     84     llI0f   297
#5 16       0     72     AO76M   475
#6 18       0     63     zGJmW   376
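One way to package this as a single callable wrapper (a sketch; the function name and default values below are mine, not from the post) is a function whose arguments toggle the individual steps:

my_wrangler <- function(dat,
                        clean_names       = TRUE,
                        convert_tf_to_1_0 = TRUE,
                        convert_to_lower  = TRUE,
                        pivot_widr        = TRUE,
                        drp_na            = TRUE) {
  # subset the functional sequence to the requested steps, then apply it
  steps <- my_wrangling_wrapper[c(clean_names, convert_tf_to_1_0,
                                  convert_to_lower, pivot_widr, drp_na)]
  steps(dat)
}

# e.g. skip the lower-casing and pivoting steps:
# df %>% my_wrangler(convert_to_lower = FALSE, pivot_widr = FALSE)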

Without %>%, you could use the equivalent freduce solution:

clean_names       <- function(x) janitor::clean_names(x)

convert_tf_to_1_0 <- function(x) mutate(x, across(everything(),
                                                  convert_true_false_to_1_0))

convert_to_lower  <- function(x) mutate(x, across(everything(), tolower))

pivot_widr        <- function(x) pivot_wider(x, names_from = condition,
                                             values_from = score)

drp_na            <- function(x) drop_na(x)

my_wrangling_list <- list(clean_names, convert_tf_to_1_0, drp_na)
magrittr::freduce(df, my_wrangling_list)

Or with %>% and freduce:

df %>% freduce(my_wrangling_list)
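The same toggling works here too; a sketch (the flag vector name below is mine, chosen so it doesn't clash with the function names above):

use_steps <- c(clean_names = TRUE, convert_tf_to_1_0 = TRUE,
               convert_to_lower = FALSE, pivot_widr = FALSE, drp_na = TRUE)

all_steps <- list(clean_names, convert_tf_to_1_0, convert_to_lower,
                  pivot_widr, drp_na)

# logical subsetting keeps only the steps flagged TRUE
df %>% freduce(all_steps[use_steps])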

I wouldn't be too concerned by the piping overhead; see this answer in the link you referenced: when comparing milliseconds, piping has an impact, but for bigger calculations the piping overhead becomes negligible.

R changing Data Frame values based on a secondary Data Frame

The following solution uses only vectorized logic and reuses the lookup table you already made. I think it'll be even faster than Rui's solution.

library(dplyr)

x <- data.frame(var1 = c("AA","BB","CC","DD"),
                var2 = c("--","AA","AA","--"),
                val1 = c(1,2,1,4),
                val2 = c(5,5,7,8))

# rows whose var2 is "--" act as the lookup table
lookup.df <- x[x[,'var2'] == "--", ]

# blank out val1/val2 wherever var2 refers to another row's var1 and the
# value also appears in the lookup table
x[x[,'var2'] %in% x[,'var1'] & x[,'val1'] %in% lookup.df[,'val1'], 'val1'] <- NA
x[x[,'var2'] %in% x[,'var1'] & x[,'val2'] %in% lookup.df[,'val2'], 'val2'] <- NA

x
#>   var1 var2 val1 val2
#> 1   AA   --    1    5
#> 2   BB   AA    2   NA
#> 3   CC   AA   NA    7
#> 4   DD   --    4    8

EDIT:

It might or might not be faster; here is a benchmark on the example data:

set.seed(4)
microbenchmark::microbenchmark(na.replace.orig(x), na.replace.1(x), na.replace.2(x), times = 50)
#> Unit: microseconds
#>                expr     min      lq     mean   median      uq      max neval
#>  na.replace.orig(x) 184.348 192.410 348.4430 202.1615 223.375 6206.546    50
#>     na.replace.1(x)  68.127  86.621 281.3503  89.8715  93.381 9693.029    50
#>     na.replace.2(x)  95.885 105.858 210.7638 113.2060 118.668 4993.849    50

OP, you'll need to test it on your own dataset to see how the approaches scale at larger data frame sizes.

Edit 2: I implemented Rui's suggestion for the lookup table. The three ways of building it, in order from slowest to fastest in the benchmark:

lookup.df <- x %>% filter(var2 == "--")
lookup.df <- filter(x, var2 == "--")
lookup.df <- x[x[,'var2'] == "--", ]
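If you want to check this on your own (larger) data, here is a sketch of the comparison, with x as defined above:

microbenchmark::microbenchmark(
  pipe_filter  = x %>% filter(var2 == "--"),
  plain_filter = filter(x, var2 == "--"),
  base_subset  = x[x[,'var2'] == "--", ],
  times = 50
)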

