Why Is Using Dplyr Pipe (%>%) Slower Than an Equivalent Non-Pipe Expression, for High-Cardinality Group-By

Why is using the dplyr pipe (%>%) slower than an equivalent non-pipe expression, for a high-cardinality group-by?

What might be a negligible effect in a full real-world application becomes non-negligible when writing one-liners whose timing is dominated by that formerly "negligible" cost. I suspect that if you profile your tests, most of the time will be spent in the summarize clause, so let's microbenchmark something similar to that:

> library(microbenchmark); library(magrittr)
> set.seed(99); z <- sample(10000, 4, TRUE)
> microbenchmark(z %>% unique %>% list, list(unique(z)))
Unit: microseconds
                  expr     min      lq      mean   median      uq     max neval
 z %>% unique %>% list 142.617 144.433 148.06515 145.0265 145.969 297.735   100
       list(unique(z))   9.289   9.988  10.85705  10.5820  11.804  12.642   100

This is doing something a bit different from your code, but it illustrates the point: pipes are slower.

Because pipes have to restructure R's call into the same nested form that an ordinary function evaluation uses, and then evaluate it. So it has to be slower. By how much depends on how fast the functions being called are. Calls to unique and list are pretty fast in R, so the whole difference here is the pipe overhead.
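As a minimal sketch of what that restructuring amounts to (assuming the z and packages from the benchmark above; note that more recent magrittr releases have reduced this overhead considerably):

# Both forms compute the same thing...
identical(z %>% unique %>% list, list(unique(z)))
#> [1] TRUE

# ...but %>% takes the unevaluated chain apart and rebuilds the nested call at
# run time before evaluating it, which is where the measured overhead comes from.
# For comparison, the native |> pipe (R >= 4.1) does this rewriting at parse
# time, so the call that actually gets evaluated is literally the nested form:
quote(z |> unique() |> list())
#> list(unique(z))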

Profiling expressions like this showed me that most of the time is spent in the pipe functions:

                 total.time total.pct self.time self.pct
"microbenchmark"      16.84     98.71      1.22     7.15
"%>%"                 15.50     90.86      1.22     7.15
"eval"                 5.72     33.53      1.18     6.92
"split_chain"          5.60     32.83      1.92    11.25
"lapply"               5.00     29.31      0.62     3.63
"FUN"                  4.30     25.21      0.24     1.41
..... stuff .....

then somewhere down in about 15th place the real work gets done:

"as.list"                      1.40      8.13      0.66     3.83
"unique" 1.38 8.01 0.88 5.11
"rev" 1.26 7.32 0.90 5.23

Whereas if you just call the functions as Chambers intended, R gets straight down to it:

                 total.time total.pct self.time self.pct
"microbenchmark"       2.30     96.64      1.04    43.70
"unique"               1.12     47.06      0.38    15.97
"unique.default"       0.74     31.09      0.64    26.89
"is.factor"            0.10      4.20      0.10     4.20

Hence the oft-quoted recommendation that pipes are fine on the command line, where your brain thinks in chains, but not in functions that might be time-critical. In practice this overhead will probably get wiped out by a single call to glm with a few hundred data points, but that's another story....
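To tie this back to the high-cardinality group-by in the question, a rough sketch (the data frame below is made up): summarise() evaluates its expression once per group, so a per-call pipe overhead like the one measured above is paid once per group rather than once overall.

library(dplyr)

dat <- data.frame(g = sample(1e4, 1e5, TRUE), z = sample(10, 1e5, TRUE))

# ~10,000 groups, so the piped summarise pays the pipe overhead ~10,000 times
system.time(dat %>% group_by(g) %>% summarise(u = list(unique(z))))
system.time(dat %>% group_by(g) %>% summarise(u = z %>% unique %>% list))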

How to write an efficient wrapper for data wrangling, allowing any wrapped step to be turned off when calling the wrapper

Staying with %>%, you could create a functional sequence:

library(magrittr)
library(dplyr)
library(tidyr)

# convert_true_false_to_1_0() is assumed to be a helper defined elsewhere
my_wrangling_wrapper =
  . %>%
  janitor::clean_names() %>%
  mutate(across(everything(), convert_true_false_to_1_0)) %>%
  mutate(across(everything(), tolower)) %>%
  pivot_wider(names_from = condition, values_from = score) %>%
  drop_na()

As this sequence behaves like a list, you can decide which steps to use by selecting the elements:

clean_names       = TRUE
convert_tf_to_1_0 = TRUE
convert_to_lower  = FALSE
pivot_widr        = FALSE
drp_na            = TRUE

my_wrangling_wrapper[c(clean_names,
                       convert_tf_to_1_0,
                       convert_to_lower,
                       pivot_widr,
                       drp_na)]

#Functional sequence with the following components:
#
# 1. janitor::clean_names(.)
# 2. mutate(., across(everything(), convert_true_false_to_1_0))
# 3. drop_na(.)

df %>% my_wrangling_wrapper[c(clean_names,
                              convert_tf_to_1_0,
                              convert_to_lower,
                              pivot_widr,
                              drp_na)]()

#  id is_male weight hash_numb score
#1  1       1     51     Zm1Xx   343
#2  3       1     99     Xc2rm   703
#3  6       0     62     2r2cP   243
#4 12       0     84     llI0f   297
#5 16       0     72     AO76M   475
#6 18       0     63     zGJmW   376
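One way to package this as a single callable wrapper (a sketch; the function name and default values below are mine, not from the post) is a function whose arguments toggle the individual steps:

my_wrangler <- function(dat,
                        clean_names       = TRUE,
                        convert_tf_to_1_0 = TRUE,
                        convert_to_lower  = TRUE,
                        pivot_widr        = TRUE,
                        drp_na            = TRUE) {
  # subset the functional sequence to the requested steps, then apply it
  steps <- my_wrangling_wrapper[c(clean_names, convert_tf_to_1_0,
                                  convert_to_lower, pivot_widr, drp_na)]
  steps(dat)
}

# e.g. skip the lower-casing and pivoting steps:
# df %>% my_wrangler(convert_to_lower = FALSE, pivot_widr = FALSE)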

Without %>%, you could use the equivalent freduce solution:

clean_names       <- function(x) janitor::clean_names(x)

convert_tf_to_1_0 <- function(x) mutate(x, across(everything(),
                                                  convert_true_false_to_1_0))

convert_to_lower  <- function(x) mutate(x, across(everything(), tolower))

pivot_widr        <- function(x) pivot_wider(x, names_from = condition,
                                             values_from = score)

drp_na            <- function(x) drop_na(x)

my_wrangling_list <- list(clean_names, convert_tf_to_1_0, drp_na)
magrittr::freduce(df, my_wrangling_list)

Or with %>% and freduce:

df %>% freduce(my_wrangling_list)
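The same toggling works here too; a sketch (the flag vector name below is mine, chosen so it doesn't clash with the function names above):

use_steps <- c(clean_names = TRUE, convert_tf_to_1_0 = TRUE,
               convert_to_lower = FALSE, pivot_widr = FALSE, drp_na = TRUE)

all_steps <- list(clean_names, convert_tf_to_1_0, convert_to_lower,
                  pivot_widr, drp_na)

# logical subsetting keeps only the steps flagged TRUE
df %>% freduce(all_steps[use_steps])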

I wouldn't be too concerned by the piping overhead; see this answer in the link you referenced: when comparing milliseconds, piping has an impact, but for bigger calculations the piping overhead becomes negligible.

R changing Data Frame values based on a secondary Data Frame

The following solution uses only vectorized logic and reuses the lookup table you already made. I think it'll be even faster than Rui's solution.

library(dplyr)

x <- data.frame(var1 = c("AA","BB","CC","DD"),
                var2 = c("--","AA","AA","--"),
                val1 = c(1,2,1,4),
                val2 = c(5,5,7,8))

# rows whose var2 is "--" act as the lookup table
lookup.df <- x[x[,'var2'] == "--", ]

# blank out val1/val2 wherever var2 refers to another row's var1 and the
# value also appears in the lookup table
x[x[,'var2'] %in% x[,'var1'] & x[,'val1'] %in% lookup.df[,'val1'], 'val1'] <- NA
x[x[,'var2'] %in% x[,'var1'] & x[,'val2'] %in% lookup.df[,'val2'], 'val2'] <- NA

x
#>   var1 var2 val1 val2
#> 1   AA   --    1    5
#> 2   BB   AA    2   NA
#> 3   CC   AA   NA    7
#> 4   DD   --    4    8

EDIT:

It might or might not be faster; here is a benchmark on the example data:

set.seed(4)
microbenchmark::microbenchmark(na.replace.orig(x), na.replace.1(x), na.replace.2(x), times = 50)
#> Unit: microseconds
#>                expr     min      lq     mean   median      uq      max neval
#>  na.replace.orig(x) 184.348 192.410 348.4430 202.1615 223.375 6206.546    50
#>     na.replace.1(x)  68.127  86.621 281.3503  89.8715  93.381 9693.029    50
#>     na.replace.2(x)  95.885 105.858 210.7638 113.2060 118.668 4993.849    50

OP, you'll need to test it on your own dataset to see how the approaches scale at larger data frame sizes.

Edit 2: I implemented Rui's suggestion for the lookup table. The three ways of building it, in order from slowest to fastest in the benchmark:

lookup.df <- x %>% filter(var2 == "--")
lookup.df <- filter(x, var2 == "--")
lookup.df <- x[x[,'var2'] == "--", ]
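If you want to check this on your own (larger) data, here is a sketch of the comparison, with x as defined above:

microbenchmark::microbenchmark(
  pipe_filter  = x %>% filter(var2 == "--"),
  plain_filter = filter(x, var2 == "--"),
  base_subset  = x[x[,'var2'] == "--", ],
  times = 50
)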

