Comparison between dplyr::do and purrr::map: what are the advantages?

Why use purrr::map instead of lapply?

If the only function you're using from purrr is map(), then no, the
advantages are not substantial. As Rich Pauloo points out, the main
advantage of map() is the helpers which allow you to write compact
code for common special cases:

  • ~ . + 1 is equivalent to function(x) x + 1 (and \(x) x + 1 in R-4.1 and newer)

  • list("x", 1) is equivalent to function(x) x[["x"]][[1]]. These
    helpers are a bit more general than [[ - see ?pluck for details.
    For data
    rectangling, the
    .default argument is particularly helpful.
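
A quick sketch of both shortcuts (the toy list below is my own, not from the original answer):

library(purrr)

map(list(1, 2, 3), ~ . + 1)           # same as map(list(1, 2, 3), function(x) x + 1)

l <- list(list(x = list(1, 2)), list(x = list(3)), list())
map(l, list("x", 1))                  # function(x) x[["x"]][[1]]; NULL where the element is missing
map(l, pluck, "x", 1, .default = NA)  # the same extraction, but with a default instead of NULL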

But most of the time you're not using a single *apply()/map()
function, you're using a bunch of them, and the advantage of purrr is
much greater consistency between the functions. For example:

  • The first argument to lapply() is the data; the first argument to
    mapply() is the function. The first argument to all map functions
    is always the data.

  • With vapply(), sapply(), and mapply() you can choose to
    suppress names on the output with USE.NAMES = FALSE; but
    lapply() doesn't have that argument.

  • There's no consistent way to pass constant arguments on to the
    mapper function. Most functions use ..., but mapply() uses
    MoreArgs (which you'd expect to be called MORE.ARGS), and
    Map(), Filter() and Reduce() expect you to create a new
    anonymous function. In the map functions, constant arguments
    always come after the function name (see the sketch after this
    list).

  • Almost every purrr function is type stable: you can predict the
    output type exclusively from the function name. This is not true for
    sapply() or mapply(). Yes, there is vapply(); but there's no
    equivalent for mapply().
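
To make the contrast concrete, here is a small sketch on toy data of my own (not from the original answer), showing the constant-argument and type-stability points side by side:

library(purrr)

x <- list(1:3, 4:9)
w <- list(rep(1, 3), rep(2, 6))

# Constant arguments: mapply() needs MoreArgs, Map() needs a new anonymous
# function, map2() simply takes them after the function:
mapply(weighted.mean, x, w, MoreArgs = list(na.rm = TRUE))
Map(function(x, w) weighted.mean(x, w, na.rm = TRUE), x, w)
map2_dbl(x, w, weighted.mean, na.rm = TRUE)

# Type stability: sapply() returns a list here, map_dbl() a double vector:
sapply(list(), mean)   # list()
map_dbl(list(), mean)  # numeric(0)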

You may think that all of these minor distinctions are not important
(just as some people think that there's no advantage to stringr over
base R regular expressions), but in my experience they cause unnecessary
friction when programming (the differing argument orders always used to
trip me up), and they make functional programming techniques harder to
learn because as well as the big ideas, you also have to learn a bunch
of incidental details.

Purrr also fills in some handy map variants that are absent from base R:

  • modify() preserves the type of the data using [[<- to modify "in
    place". In conjunction with the _if variant this allows for (IMO
    beautiful) code like modify_if(df, is.factor, as.character)

  • map2() allows you to map simultaneously over x and y. This
    makes it easier to express ideas like
    map2(models, datasets, predict)

  • imap() allows you to map simultaneously over x and its indices
    (either names or positions). This makes it easy to (e.g.) load all
    csv files in a directory, adding a filename column to each:

    dir(pattern = "\\.csv$") %>%
      set_names() %>%
      map(read.csv) %>%
      imap(~ transform(.x, filename = .y))

  • walk() returns its input invisibly and is useful when you're
    calling a function for its side effects (e.g. writing files to
    disk); see the sketch just below.
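
As a minimal sketch of walk() (the data frames and temp-file paths below are my own illustration):

library(purrr)

dfs   <- list(mtcars = mtcars, iris = iris)
paths <- file.path(tempdir(), paste0(names(dfs), ".csv"))

# Called purely for the side effect of writing files; the input is
# returned invisibly, so a pipeline could continue after this step.
walk2(dfs, paths, write.csv)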

Not to mention the other helpers like safely() and partial().
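
For instance (a rough sketch with toy inputs of my own):

library(purrr)

# safely() wraps a function so it returns list(result = , error = ) instead of aborting:
safe_log <- safely(log)
map(list(10, "oops"), safe_log)

# partial() pre-fills arguments:
mean_na <- partial(mean, na.rm = TRUE)
map_dbl(list(c(1, NA, 3), 4:6), mean_na)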

Personally, I find that when I use purrr, I can write functional code
with less friction and greater ease; it decreases the gap between
thinking up an idea and implementing it. But your mileage may vary;
there's no need to use purrr unless it actually helps you.

Microbenchmarks

Yes, map() is slightly slower than lapply(). But the cost of using
map() or lapply() is driven by what you're mapping, not the overhead
of performing the loop. The microbenchmark below suggests that the cost
of map() compared to lapply() is around 40 ns per element, which
seems unlikely to materially impact most R code.

library(purrr)
n <- 1e4
x <- 1:n
f <- function(x) NULL

mb <- microbenchmark::microbenchmark(
  lapply = lapply(x, f),
  map = map(x, f)
)
summary(mb, unit = "ns")$median / n
#> [1] 490.343 546.880

What distinguishes dplyr::pull from purrr::pluck and magrittr::extract2?

When you "should" use a function is really a matter of personal preference. Which function expresses your intention most clearly. There are differences between them. For example, pluck works better when you want to do multiple extractions. From help file:

 accessor(x[[1]])$foo 
# is the same as
pluck(x, 1, accessor, "foo")

so while it can be used to just extract a column, it's useful when you have more deeply nested structures or you want to compose with an accessor function.
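
A small illustration (the nested list is a hypothetical example of mine):

library(purrr)

x <- list(list(foo = list(bar = 1:3)))
pluck(x, 1, "foo", "bar", sum)            # mixes positions, names and an accessor: 6
pluck(x, 1, "foo", "baz", .default = NA)  # a missing element gives NA instead of an error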

The pull function is meant to blend in with the rest of the dplyr functions. It can take the name of a column using any of the ways you can with other functions in the package. For example it will work with !! style expansion where, say, extract2 will not.

irispull <- function(x) {
  iris %>% pull(!!enquo(x))
}
irispull(Sepal.Length)

And extract2 is nothing more than a "more readable" wrapper for the base function [[. In fact it's defined as .Primitive("[[") so it expects column names as character or column indexes as integers.
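
For completeness, a quick sketch of extract2() (my own example, using the built-in iris data):

library(magrittr)

iris %>% extract2("Sepal.Length") %>% head()  # same as iris[["Sepal.Length"]]
iris %>% extract2(1) %>% head()               # or by integer position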

Why is `furrr::future_map_int()` slower than `purrr::map_int()` when I use `dplyr::mutate()`?

As I argued in the comments to the original post, my suspicion is that there is an overhead caused by distributing the very large dataset to the workers.

To substantiate my suspicion, I used the same code as the OP with a single modification: I added a delay of 0.000001 seconds to the mapped function. The results were: purrr --> 192.45 sec and furrr --> 44.707 sec (8 workers). The time taken by furrr was only about 1/4 of the time taken by purrr -- very far from 1/8!

My code is below, as requested by the OP:

library(stringi)
library(rrapply)
library(tibble)

simulate_data <- function(nrows) {
  split_func <- function(x, n) {
    unname(split(x, rep_len(1:n, length(x))))
  }

  randomly_subset_vec <- function(x) {
    sample(x, sample(length(x), 1))
  }

  tibble::tibble(
    col_a = rrapply(
      object = split_func(
        x = setNames(1:(nrows * 5),
                     stringi::stri_rand_strings(nrows * 5, 2)),
        n = nrows
      ),
      f = randomly_subset_vec
    ),
    col_b = runif(nrows)
  )
}

set.seed(2021)

my_data <- simulate_data(3e6) # takes about 1 minute to run on my machine

my_data

library(dplyr, warn.conflicts = T)
library(purrr)
library(furrr)
library(tictoc)

# first with purrr:
##################

######## ----> DELAY <---- ########
f <- function(x) {Sys.sleep(0.000001); length(x)}

tic()
my_data %>%
  mutate(length_col_a = map_int(.x = col_a, .f = ~ f(.x)))
toc()

plan(multisession, workers = 8)

tic()
my_data %>%
  mutate(length_col_a = future_map_int(col_a, f))
toc()
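
One rough way to gauge that transfer overhead (my addition, not part of the original benchmark) is to check how much data has to be serialized and shipped to the workers:

# col_a is what each worker needs a chunk of; its in-memory size hints at
# how expensive the distribution step is.
format(object.size(my_data$col_a), units = "MB")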

Using purrr to help transform a large data file

I agree with @det that rowwise() isn't the way to go. I think the pmin function is probably best suited to this task, e.g.

data <- transform(data, earliest_date = pmin(date1, date2, date3, date4, date5, na.rm = TRUE))

Benchmarking (updated to include a data.table solution):

library(tidyverse)
library(lubridate)
library(data.table)   # needed for setDT() in dt_func() below

set.seed(101)

data <- tibble(
  date1 = sample(seq(ymd('2021-03-20'), ymd('2021-05-20'), by = 'day'),
                 100, replace = TRUE),
  date2 = sample(seq(ymd('2021-03-20'), ymd('2021-05-20'), by = 'day'),
                 100, replace = TRUE),
  date3 = sample(seq(ymd('2021-03-20'), ymd('2021-05-20'), by = 'day'),
                 100, replace = TRUE),
  date4 = sample(seq(ymd('2021-03-20'), ymd('2021-05-20'), by = 'day'),
                 100, replace = TRUE),
  date5 = sample(seq(ymd('2021-03-20'), ymd('2021-05-20'), by = 'day'),
                 100, replace = TRUE)
)

rowwise_func <- function(data){
  data %>%
    rowwise() %>%
    mutate(earliest_date = min(c(date1, date2, date3, date4, date5),
                               na.rm = TRUE)) %>%
    ungroup()
}

pmap_func <- function(data){
  data %>%
    mutate(try_again = pmap(list(date1, date2, date3, date4, date5),
                            min, na.rm = TRUE))
}

det_func1 <- function(data){
  data %>%
    mutate(min_date = pmap_dbl(select(., matches("^date")), min) %>%
             as.Date(origin = "1970-01-01"))
}

det_faster <- function(data){
  data[["min_date"]] <- data %>%
    mutate(across(where(is.Date), as.integer)) %>%
    as.matrix() %>%
    apply(1, function(x) x[which.min(x)]) %>%
    as.Date(origin = "1970-01-01")
}

transform_func <- function(data){
  as_tibble(transform(data, earliest_date = pmin(date1, date2, date3, date4, date5,
                                                 na.rm = TRUE)))
}

dt_func <- function(data){
  setDT(data)
  data[, earliest_date := pmin(date1, date2, date3, date4, date5, na.rm = TRUE)]
}

times <- microbenchmark::microbenchmark(
  rowwise_func(data), pmap_func(data), det_func1(data),
  det_faster(data), transform_func(data), dt_func(data)
)
autoplot(times)

data2 <- transform_func(data)
data3 <- rowwise_func(data)
identical(data2, data3)
#> TRUE

[Plot: autoplot() of the microbenchmark timings]

Unit: microseconds
                 expr      min        lq      mean    median        uq        max neval cld
   rowwise_func(data) 6764.693 6919.6720 7375.0418 7066.6220 7271.5850  16290.696   100  ab
      pmap_func(data) 3994.973 4150.1360 9425.3880 4252.9850 4437.2950 491030.248   100   b
      det_func1(data) 5576.240 5724.6820 6249.7573 5845.3305 5985.5940  15106.741   100  ab
     det_faster(data) 3182.016 3305.3525 3556.8628 3362.8720 3444.0505  12771.952   100  ab
 transform_func(data)  564.194  624.1055  697.5630  680.1130  718.7975   1513.184   100   a
        dt_func(data)  650.611  723.7235  956.7916  759.3355  782.0565  10806.902   100   a

So, based on the functions I used above, the transform + pmin method was ~ 10X faster than the rowwise method.

purrr map a t.test onto a split df

Especially when dealing with pipes that require multiple inputs (we don't have Haskell's Arrows here), I find it easier to reason by types/signatures first, then encapsulate logic in functions (which you can unit test), then write a concise chain.

In this case you want to compare all possible pairs of vectors, so I would set a goal of writing a function that takes a pair (i.e. a list of 2) of vectors and returns the 2-way t.test of them.

Once you've done this, you just need some glue. So the plan is:

  1. Write a function that takes a list of two vectors and performs the 2-way t-test.
  2. Write a function/pipe that fetches the vectors from mtcars (easy).
  3. Map the above over the list of pairs.

It's important to have this plan before writing any code. Things are somewhat obscured by the fact that R is not strongly typed, but this way you reason about "types" first, implementation second.

Step 1

t.test takes dots, so we use purrr::lift to have it take a list. Since we don't want to match on the names of the elements of the list, we use .unnamed = TRUE. Also we make it extra clear we're using the t.test function with arity 2 (though this extra step is not needed for the code to work).

t.test2 <- function(x, y) t.test(x, y)
liftedTT <- lift(t.test2, .unnamed = TRUE)

Step 2

Wrap the function we got in step 1 into a functional chain that takes a simple pair (here I use indexes, it should be easy to use cyl factor levels, but I don't have time to figure it out).

doTT <- function(pair) {
  mtcars %>%
    split(as.character(.$cyl)) %>%
    map(~ select(., mpg)) %>%
    magrittr::extract(pair) %>%   # magrittr's pipe-friendly alias for `[`
    liftedTT %>%
    broom::tidy()
}

Step 3

Now that we have all our lego pieces ready, composition is trivial.

1:length(unique(mtcars$cyl)) %>%
  combn(2) %>%
  as.data.frame() %>%
  as.list() %>%
  map(~ doTT(.))

$V1
  estimate estimate1 estimate2 statistic      p.value parameter conf.low conf.high
1 6.920779  26.66364  19.74286  4.719059 0.0004048495  12.95598 3.751376  10.09018

$V2
  estimate estimate1 estimate2 statistic      p.value parameter conf.low conf.high
1 11.56364  26.66364      15.1  7.596664 1.641348e-06  14.96675 8.318518  14.80876

$V3
  estimate estimate1 estimate2 statistic      p.value parameter conf.low conf.high
1 4.642857  19.74286      15.1  5.291135 4.540355e-05  18.50248 2.802925  6.482789

There's quite a bit here to clean up, mainly using factor levels and preserving them in the output (and not using globals in the second function) but I think the core of what you wanted is here. The trick not to get lost, in my experience, is to work from the inside out.

Add multiple output variables using purrr and a predefined function

The best approach I've found (which is still not terribly elegant) is to pipe into bind_cols. To get pmap_dfr to work correctly, the function should return a named list (which may or may not be a data frame):

library(tidyverse)

x <- data.frame(a = 1:3, b = 2:4)
mult <- function(a,b,n) as.list(set_names((a + b) * n, paste0('new', n)))

x %>% bind_cols(pmap_dfr(., mult, n = 1:2))
#>   a b new1 new2
#> 1 1 2    3    6
#> 2 2 3    5   10
#> 3 3 4    7   14

To avoid changing the definition of mult, you can wrap it in an anonymous function:

mult <- function(a,b,n) (a + b) * n

x %>% bind_cols(pmap_dfr(
  .,
  ~ as.list(set_names(
      mult(...),
      paste0('new', 1:2)
    )),
  n = 1:2
))
#>   a b new1 new2
#> 1 1 2    3    6
#> 2 2 3    5   10
#> 3 3 4    7   14

In this particular case, it's not actually necessary to iterate over rows, though, because you can vectorize the inputs from x and instead iterate over n. The advantage is that there are usually far more rows than values of n, so the number of iterations will be [potentially much] lower. To be clear, whether such an approach is possible depends on which of the function's parameters can accept vector arguments.

mult still needs to be called on the variables of x. The simplest way to do this is to pass them explicitly:

x %>% bind_cols(map_dfc(1:2, ~mult(x$a, x$b, .x)))
#>   a b V1 V2
#> 1 1 2  3  6
#> 2 2 3  5 10
#> 3 3 4  7 14

...but this loses the benefit of pmap that named variables will automatically get passed to the correct parameter. You can get that back by using purrr::lift, which is an adverb that changes the domain of a function so it accepts a list by wrapping it in do.call. The returned function can be called on x and the value of n for that iteration:

x %>% bind_cols(map_dfc(1:2, ~lift(mult)(x, n = .x)))

This is equivalent to

x %>% bind_cols(map_dfc(1:2, ~invoke(mult, x, n = .x)))

but the advantage of the former is that it returns a function which can be partially applied on x so it only has an n parameter left, and thus requires no explicit references to x and so pipes better:

x %>% bind_cols(map_dfc(1:2, partial(lift(mult), .)))

All return the same thing. Names can be fixed after the fact with %>% set_names(~sub('^V(\\d+)$', 'new\\1', .x)), if you like.


