Comparison between dplyr::do and purrr::map: what are the advantages?

Why use purrr::map instead of lapply?

If the only function you're using from purrr is map(), then no, the
advantages are not substantial. As Rich Pauloo points out, the main
advantage of map() is the helpers which allow you to write compact
code for common special cases:

  • ~ . + 1 is equivalent to function(x) x + 1 (and \(x) x + 1 in R-4.1 and newer)

  • list("x", 1) is equivalent to function(x) x[["x"]][[1]]. These
    helpers are a bit more general than [[ - see ?pluck for details.
    For data
    rectangling, the
    .default argument is particularly helpful.
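
A quick sketch of both shortcuts (the toy list below is my own, not from the original answer):

library(purrr)

map(list(1, 2, 3), ~ . + 1)           # same as map(list(1, 2, 3), function(x) x + 1)

l <- list(list(x = list(1, 2)), list(x = list(3)), list())
map(l, list("x", 1))                  # function(x) x[["x"]][[1]]; NULL where the element is missing
map(l, pluck, "x", 1, .default = NA)  # the same extraction, but with a default instead of NULL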

But most of the time you're not using a single *apply()/map()
function, you're using a bunch of them, and the advantage of purrr is
much greater consistency between the functions. For example:

  • The first argument to lapply() is the data; the first argument to
    mapply() is the function. The first argument to all map functions
    is always the data.

  • With vapply(), sapply(), and mapply() you can choose to
    suppress names on the output with USE.NAMES = FALSE; but
    lapply() doesn't have that argument.

  • There's no consistent way to pass constant arguments on to the
    mapper function. Most functions use ..., but mapply() uses
    MoreArgs (which you'd expect to be called MORE.ARGS), and
    Map(), Filter() and Reduce() expect you to create a new
    anonymous function. In the map functions, constant arguments
    always come after the function name (see the sketch after this
    list).

  • Almost every purrr function is type stable: you can predict the
    output type exclusively from the function name. This is not true for
    sapply() or mapply(). Yes, there is vapply(); but there's no
    equivalent for mapply().
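
To make the contrast concrete, here is a small sketch on toy data of my own (not from the original answer), showing the constant-argument and type-stability points side by side:

library(purrr)

x <- list(1:3, 4:9)
w <- list(rep(1, 3), rep(2, 6))

# Constant arguments: mapply() needs MoreArgs, Map() needs a new anonymous
# function, map2() simply takes them after the function:
mapply(weighted.mean, x, w, MoreArgs = list(na.rm = TRUE))
Map(function(x, w) weighted.mean(x, w, na.rm = TRUE), x, w)
map2_dbl(x, w, weighted.mean, na.rm = TRUE)

# Type stability: sapply() returns a list here, map_dbl() a double vector:
sapply(list(), mean)   # list()
map_dbl(list(), mean)  # numeric(0)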

You may think that all of these minor distinctions are not important
(just as some people think that there's no advantage to stringr over
base R regular expressions), but in my experience they cause unnecessary
friction when programming (the differing argument orders always used to
trip me up), and they make functional programming techniques harder to
learn because as well as the big ideas, you also have to learn a bunch
of incidental details.

Purrr also fills in some handy map variants that are absent from base R:

  • modify() preserves the type of the data using [[<- to modify "in
    place". In conjunction with the _if variant this allows for (IMO
    beautiful) code like modify_if(df, is.factor, as.character)

  • map2() allows you to map simultaneously over x and y. This
    makes it easier to express ideas like
    map2(models, datasets, predict)

  • imap() allows you to map simultaneously over x and its indices
    (either names or positions). This makes it easy to (e.g.) load all
    csv files in a directory, adding a filename column to each:

    dir(pattern = "\\.csv$") %>%
      set_names() %>%
      map(read.csv) %>%
      imap(~ transform(.x, filename = .y))

  • walk() returns its input invisibly and is useful when you're
    calling a function for its side effects (e.g. writing files to
    disk); see the sketch just below.
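
As a minimal sketch of walk() (the data frames and temp-file paths below are my own illustration):

library(purrr)

dfs   <- list(mtcars = mtcars, iris = iris)
paths <- file.path(tempdir(), paste0(names(dfs), ".csv"))

# Called purely for the side effect of writing files; the input is
# returned invisibly, so a pipeline could continue after this step.
walk2(dfs, paths, write.csv)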

Not to mention the other helpers like safely() and partial().
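
For instance (a rough sketch with toy inputs of my own):

library(purrr)

# safely() wraps a function so it returns list(result = , error = ) instead of aborting:
safe_log <- safely(log)
map(list(10, "oops"), safe_log)

# partial() pre-fills arguments:
mean_na <- partial(mean, na.rm = TRUE)
map_dbl(list(c(1, NA, 3), 4:6), mean_na)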

Personally, I find that when I use purrr, I can write functional code
with less friction and greater ease; it decreases the gap between
thinking up an idea and implementing it. But your mileage may vary;
there's no need to use purrr unless it actually helps you.

Microbenchmarks

Yes, map() is slightly slower than lapply(). But the cost of using
map() or lapply() is driven by what you're mapping, not the overhead
of performing the loop. The microbenchmark below suggests that the cost
of map() compared to lapply() is around 40 ns per element, which
seems unlikely to materially impact most R code.

library(purrr)
n <- 1e4
x <- 1:n
f <- function(x) NULL

mb <- microbenchmark::microbenchmark(
  lapply = lapply(x, f),
  map = map(x, f)
)
summary(mb, unit = "ns")$median / n
#> [1] 490.343 546.880

What distinguishes dplyr::pull from purrr::pluck and magrittr::extract2?

When you "should" use a function is really a matter of personal preference. Which function expresses your intention most clearly. There are differences between them. For example, pluck works better when you want to do multiple extractions. From help file:

 accessor(x[[1]])$foo 
# is the same as
pluck(x, 1, accessor, "foo")

so while it can be used to just extract a column, it's useful when you have more deeply nested structures or you want to compose with an accessor function.
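
A small illustration (the nested list is a hypothetical example of mine):

library(purrr)

x <- list(list(foo = list(bar = 1:3)))
pluck(x, 1, "foo", "bar", sum)            # mixes positions, names and an accessor: 6
pluck(x, 1, "foo", "baz", .default = NA)  # a missing element gives NA instead of an error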

The pull function is meant to blend in with the rest of the dplyr functions. It can take the name of a column using any of the ways you can with other functions in the package. For example it will work with !! style expansion where, say, extract2 will not.

irispull <- function(x) {
  iris %>% pull(!!enquo(x))
}
irispull(Sepal.Length)

And extract2 is nothing more than a "more readable" wrapper for the base function [[. In fact it's defined as .Primitive("[[") so it expects column names as character or column indexes as integers.
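
For completeness, a quick sketch of extract2() (my own example, using the built-in iris data):

library(magrittr)

iris %>% extract2("Sepal.Length") %>% head()  # same as iris[["Sepal.Length"]]
iris %>% extract2(1) %>% head()               # or by integer position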

Why is `furrr::future_map_int()` slower than `purrr::map_int()` when I use `dplyr::mutate()`?

As I argued in the comments to the original post, my suspicion is that there is an overhead caused by distributing the very large dataset to the workers.

To substantiate my suspicion, I used the same code as the OP with a single modification: I added a delay of 0.000001 seconds to the mapped function. The results were: purrr --> 192.45 sec and furrr --> 44.707 sec (8 workers). The time taken by furrr was only about 1/4 of the time taken by purrr -- very far from 1/8!

My code is below, as requested by the OP:

library(stringi)
library(rrapply)
library(tibble)

simulate_data <- function(nrows) {
  split_func <- function(x, n) {
    unname(split(x, rep_len(1:n, length(x))))
  }

  randomly_subset_vec <- function(x) {
    sample(x, sample(length(x), 1))
  }

  tibble::tibble(
    col_a = rrapply(
      object = split_func(
        x = setNames(1:(nrows * 5),
                     stringi::stri_rand_strings(nrows * 5, 2)),
        n = nrows
      ),
      f = randomly_subset_vec
    ),
    col_b = runif(nrows)
  )
}

set.seed(2021)

my_data <- simulate_data(3e6) # takes about 1 minute to run on my machine

my_data

library(dplyr, warn.conflicts = T)
library(purrr)
library(furrr)
library(tictoc)

# first with purrr:
##################

######## ----> DELAY <---- ########
f <- function(x) {Sys.sleep(0.000001); length(x)}

tic()
my_data %>%
  mutate(length_col_a = map_int(.x = col_a, .f = ~ f(.x)))
toc()

plan(multisession, workers = 8)

tic()
my_data %>%
  mutate(length_col_a = future_map_int(col_a, f))
toc()
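
One rough way to gauge that transfer overhead (my addition, not part of the original benchmark) is to check how much data has to be serialized and shipped to the workers:

# col_a is what each worker needs a chunk of; its in-memory size hints at
# how expensive the distribution step is.
format(object.size(my_data$col_a), units = "MB")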

Using purrr to help transform a large data file

I agree with @det that rowwise() isn't the way to go. I think the pmin function is probably best suited to this task, e.g.

data <- transform(data, earliest_date = pmin(date1, date2, date3, date4, date5, na.rm = TRUE))

Benchmarking (updated to include a data.table solution):

library(tidyverse)
library(lubridate)
library(data.table)   # needed for setDT() in dt_func() below

set.seed(101)

data <- tibble(
  date1 = sample(seq(ymd('2021-03-20'), ymd('2021-05-20'), by = 'day'),
                 100, replace = TRUE),
  date2 = sample(seq(ymd('2021-03-20'), ymd('2021-05-20'), by = 'day'),
                 100, replace = TRUE),
  date3 = sample(seq(ymd('2021-03-20'), ymd('2021-05-20'), by = 'day'),
                 100, replace = TRUE),
  date4 = sample(seq(ymd('2021-03-20'), ymd('2021-05-20'), by = 'day'),
                 100, replace = TRUE),
  date5 = sample(seq(ymd('2021-03-20'), ymd('2021-05-20'), by = 'day'),
                 100, replace = TRUE)
)

rowwise_func <- function(data){
  data %>%
    rowwise() %>%
    mutate(earliest_date = min(c(date1, date2, date3, date4, date5),
                               na.rm = TRUE)) %>%
    ungroup()
}

pmap_func <- function(data){
  data %>%
    mutate(try_again = pmap(list(date1, date2, date3, date4, date5),
                            min, na.rm = TRUE))
}

det_func1 <- function(data){
  data %>%
    mutate(min_date = pmap_dbl(select(., matches("^date")), min) %>%
             as.Date(origin = "1970-01-01"))
}

det_faster <- function(data){
  data[["min_date"]] <- data %>%
    mutate(across(where(is.Date), as.integer)) %>%
    as.matrix() %>%
    apply(1, function(x) x[which.min(x)]) %>%
    as.Date(origin = "1970-01-01")
}

transform_func <- function(data){
  as_tibble(transform(data, earliest_date = pmin(date1, date2, date3, date4, date5,
                                                 na.rm = TRUE)))
}

dt_func <- function(data){
  setDT(data)
  data[, earliest_date := pmin(date1, date2, date3, date4, date5, na.rm = TRUE)]
}

times <- microbenchmark::microbenchmark(
  rowwise_func(data), pmap_func(data), det_func1(data),
  det_faster(data), transform_func(data), dt_func(data)
)
autoplot(times)

data2 <- transform_func(data)
data3 <- rowwise_func(data)
identical(data2, data3)
#> TRUE

[Plot: autoplot() of the microbenchmark timings]

Unit: microseconds
                 expr      min        lq      mean    median        uq        max neval cld
   rowwise_func(data) 6764.693 6919.6720 7375.0418 7066.6220 7271.5850  16290.696   100  ab
      pmap_func(data) 3994.973 4150.1360 9425.3880 4252.9850 4437.2950 491030.248   100   b
      det_func1(data) 5576.240 5724.6820 6249.7573 5845.3305 5985.5940  15106.741   100  ab
     det_faster(data) 3182.016 3305.3525 3556.8628 3362.8720 3444.0505  12771.952   100  ab
 transform_func(data)  564.194  624.1055  697.5630  680.1130  718.7975   1513.184   100   a
        dt_func(data)  650.611  723.7235  956.7916  759.3355  782.0565  10806.902   100   a

So, based on the functions I used above, the transform + pmin method was ~ 10X faster than the rowwise method.

purrr map a t.test onto a split df

Especially when dealing with pipes that require multiple inputs (we don't have Haskell's Arrows here), I find it easier to reason by types/signatures first, then encapsulate logic in functions (which you can unit test), then write a concise chain.

In this case you want to compare all possible pairs of vectors, so I would set a goal of writing a function that takes a pair (i.e. a list of 2) of vectors and returns the 2-way t.test of them.

Once you've done this, you just need some glue. So the plan is:

  1. Write a function that takes a list of two vectors and performs the 2-way t-test.
  2. Write a function/pipe that fetches the vectors from mtcars (easy).
  3. Map the above over the list of pairs.

It's important to have this plan before writing any code. Things are somewhat obscured by the fact that R is not strongly typed, but this way you reason about "types" first, implementation second.

Step 1

t.test takes dots, so we use purrr::lift to have it take a list. Since we don't want to match on the names of the elements of the list, we use .unnamed = TRUE. Also we make it extra clear we're using the t.test function with arity 2 (though this extra step is not needed for the code to work).

t.test2 <- function(x, y) t.test(x, y)
liftedTT <- lift(t.test2, .unnamed = TRUE)

Step 2

Wrap the function we got in step 1 into a functional chain that takes a simple pair (here I use indexes, it should be easy to use cyl factor levels, but I don't have time to figure it out).

doTT <- function(pair) {
  mtcars %>%
    split(as.character(.$cyl)) %>%
    map(~ select(., mpg)) %>%
    magrittr::extract(pair) %>%   # magrittr's pipe-friendly alias for `[`
    liftedTT %>%
    broom::tidy()
}

Step 3

Now that we have all our lego pieces ready, composition is trivial.

1:length(unique(mtcars$cyl)) %>%
  combn(2) %>%
  as.data.frame() %>%
  as.list() %>%
  map(~ doTT(.))

$V1
  estimate estimate1 estimate2 statistic      p.value parameter conf.low conf.high
1 6.920779  26.66364  19.74286  4.719059 0.0004048495  12.95598 3.751376  10.09018

$V2
  estimate estimate1 estimate2 statistic      p.value parameter conf.low conf.high
1 11.56364  26.66364      15.1  7.596664 1.641348e-06  14.96675 8.318518  14.80876

$V3
  estimate estimate1 estimate2 statistic      p.value parameter conf.low conf.high
1 4.642857  19.74286      15.1  5.291135 4.540355e-05  18.50248 2.802925  6.482789

There's quite a bit here to clean up, mainly using factor levels and preserving them in the output (and not using globals in the second function) but I think the core of what you wanted is here. The trick not to get lost, in my experience, is to work from the inside out.

Add multiple output variables using purrr and a predefined function

The best approach I've found (which is still not terribly elegant) is to pipe into bind_cols. To get pmap_dfr to work correctly, the function should return a named list (which may or may not be a data frame):

library(tidyverse)

x <- data.frame(a = 1:3, b = 2:4)
mult <- function(a,b,n) as.list(set_names((a + b) * n, paste0('new', n)))

x %>% bind_cols(pmap_dfr(., mult, n = 1:2))
#>   a b new1 new2
#> 1 1 2    3    6
#> 2 2 3    5   10
#> 3 3 4    7   14

To avoid changing the definition of mult, you can wrap it in an anonymous function:

mult <- function(a,b,n) (a + b) * n

x %>% bind_cols(pmap_dfr(
  .,
  ~ as.list(set_names(
      mult(...),
      paste0('new', 1:2)
    )),
  n = 1:2
))
#>   a b new1 new2
#> 1 1 2    3    6
#> 2 2 3    5   10
#> 3 3 4    7   14

In this particular case, it's not actually necessary to iterate over rows, though, because you can vectorize the inputs from x and instead iterate over n. The advantage is that there are usually far more rows than values of n, so the number of iterations will be [potentially much] lower. To be clear, whether such an approach is possible depends on which of the function's parameters can accept vector arguments.

mult still needs to be called on the variables of x. The simplest way to do this is to pass them explicitly:

x %>% bind_cols(map_dfc(1:2, ~mult(x$a, x$b, .x)))
#>   a b V1 V2
#> 1 1 2  3  6
#> 2 2 3  5 10
#> 3 3 4  7 14

...but this loses the benefit of pmap that named variables will automatically get passed to the correct parameter. You can get that back by using purrr::lift, which is an adverb that changes the domain of a function so it accepts a list by wrapping it in do.call. The returned function can be called on x and the value of n for that iteration:

x %>% bind_cols(map_dfc(1:2, ~lift(mult)(x, n = .x)))

This is equivalent to

x %>% bind_cols(map_dfc(1:2, ~invoke(mult, x, n = .x)))

but the advantage of the former is that it returns a function which can be partially applied on x so it only has an n parameter left, and thus requires no explicit references to x and so pipes better:

x %>% bind_cols(map_dfc(1:2, partial(lift(mult), .)))

All return the same thing. Names can be fixed after the fact with %>% set_names(~sub('^V(\\d+)$', 'new\\1', .x)), if you like.


