Why Is 'Vapply' Safer Than 'Sapply'

As has already been noted, vapply does two things:

  • Offers a slight speed improvement.
  • Improves consistency by providing limited return type checks.

The second point is the greater advantage: it surfaces an error at the call that produced the unexpected result, rather than somewhere downstream, and so leads to more robust code. The same check could be done separately by following sapply with stopifnot to confirm that the return values are what you expected. vapply is a little easier, though more limited, since custom checking code could also test that values fall within bounds, and so on.
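As a rough sketch of the "sapply followed by stopifnot" approach mentioned above: the input list and the find_d helper below are made up for illustration, and the last check shows the kind of custom test that vapply cannot express on its own.

```r
# Hypothetical input: each element is expected to contain exactly one "d".
input <- list(letters[1:5], letters[3:12], letters[c(5, 2, 4, 7, 1)])
find_d <- function(x) x[x == "d"]

# sapply() followed by manual checks on the simplified result.
res <- sapply(input, find_d)
stopifnot(
  is.character(res),             # simplification succeeded, so not a list
  length(res) == length(input),  # one value per input element
  all(res == "d")                # a custom check on the values themselves
)

# vapply() covers the first two checks in a single call.
res2 <- vapply(input, find_d, character(1))
identical(res, res2)
```

The manual version is more flexible, but the checks are easy to forget; with vapply the type and length check is part of the call itself.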

Here's an example of vapply ensuring your result is as expected. This parallels something I was just working on while PDF scraping, where findD would use a regex to match a pattern in raw text data. For example, I'd have a list that was split by entity, and a regex to match addresses within each entity. Occasionally the PDF had been converted out of order and there would be two addresses for one entity, which caused badness.

> input1 <- list( letters[1:5], letters[3:12], letters[c(5,2,4,7,1)] )
> input2 <- list( letters[1:5], letters[3:12], letters[c(2,5,4,7,15,4)] )
> findD <- function(x) x[x=="d"]
> sapply(input1, findD )
[1] "d" "d" "d"
> sapply(input2, findD )
[[1]]
[1] "d"

[[2]]
[1] "d"

[[3]]
[1] "d" "d"

> vapply(input1, findD, "" )
[1] "d" "d" "d"
> vapply(input2, findD, "" )
Error in vapply(input2, findD, "") : values must be length 1,
but FUN(X[[3]]) result is length 2

Because there are two d's in the third element of input2, vapply produces an error. sapply, by contrast, silently changes the class of the output from a character vector to a list, which could break code downstream.

As I tell my students, part of becoming a programmer is changing your mindset from "errors are annoying" to "errors are my friend."

Zero length inputs

One related point is that if the input length is zero, sapply will always return an empty list, regardless of the input type. Compare:

sapply(1:5, identity)
## [1] 1 2 3 4 5
sapply(integer(), identity)
## list()
vapply(1:5, identity, integer(1))
## [1] 1 2 3 4 5
vapply(integer(), identity, integer(1))
## integer(0)

With vapply, you are guaranteed to have a particular type of output, so you don't need to write extra checks for zero length inputs.
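To illustrate why this matters downstream, here is a small sketch with two hypothetical helpers (total_chars_s and total_chars_v are invented names) that differ only in which function they call. The sapply version fails on empty input, because sum() cannot add up a list.

```r
# Hypothetical helpers: total number of characters in a character vector.
total_chars_s <- function(x) sum(sapply(x, nchar))
total_chars_v <- function(x) sum(vapply(x, nchar, integer(1)))

# Both agree on non-empty input.
total_chars_s(c("ab", "cde"))
total_chars_v(c("ab", "cde"))

# On empty input, vapply() still returns an integer vector, so sum() works.
empty_v <- total_chars_v(character())

# sapply() returns an empty list here, and sum() raises an error on a list.
empty_s <- tryCatch(
  total_chars_s(character()),
  error = function(e) "error"
)
```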

Benchmarks

vapply can be a bit faster because it already knows what format it should be expecting the results in.

input1.long <- rep(input1,10000)

library(microbenchmark)
m <- microbenchmark(
  sapply(input1.long, findD),
  vapply(input1.long, findD, "")
)
library(ggplot2) # the autoplot method for microbenchmark objects now ships with microbenchmark itself, so taRifx is no longer needed
autoplot(m)

Using vapply instead of sapply

Question 1:

The error with character(2) arises because the character vector "integer" has length 1, so it rightly fails the consistency check against the expected result, a character vector of length 2.

Question 2:

vapply() is there as a safer version of sapply(): it makes sure you only get back what you expect from each application of FUN. It is also safer because the shape of its output is predictable. You don't get a vector or a matrix or a list depending on the data; you get a vector when each call returns a length-1 value, and a matrix or array otherwise.
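The shape rule above can be sketched with a couple of toy calls. The functions and inputs are invented for illustration; the point is only that FUN.VALUE fixes the shape of the result in advance.

```r
# Length-1 FUN.VALUE: the result is a plain vector, one value per element.
squares <- vapply(1:3, function(i) i^2, numeric(1))

# Length-2 FUN.VALUE: the result is a matrix with one column per element.
pairs <- vapply(
  1:3,
  function(i) c(value = i, square = i^2),
  c(value = 0, square = 0)
)
dim(pairs)

# sapply() decides the shape from the data: here the results have
# different lengths, so it falls back to a list.
ragged <- sapply(1:3, function(i) seq_len(i))
is.list(ragged)
```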

In the specific example you give, you can't use vapply() directly, because what class() returns isn't consistent: its length varies from column to column. You have to know or expect certain output, and vapply() fails if the output from a call to FUN doesn't match what it expects.

In this instance, I suppose you could do

df2 <- data.frame(x = 1:10, y = Sys.time() + 1:10)
vapply(df2, FUN = function(x) paste(class(x), collapse = "; "),
       FUN.VALUE = character(1))

> vapply(df2, FUN = function(x) paste(class(x), collapse = "; "),
+        FUN.VALUE = character(1))
                x                 y
        "integer" "POSIXct; POSIXt"

but whether that is useful to you or not is a different matter.

Really, using vapply() comes down to knowing what to expect from FUN and wanting to only ever get that output. If you don't know or can't control it, you are probably better off with lapply().
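A short sketch of that trade-off, reusing the df2 data frame from above: calling class() directly violates a character(1) template, while lapply() simply keeps whatever each call returns.

```r
df2 <- data.frame(x = 1:10, y = Sys.time() + 1:10)

# class() returns one string for x but two for y, so vapply() stops with an
# error; the message is captured here instead of halting the script.
msg <- tryCatch(
  vapply(df2, class, character(1)),
  error = function(e) conditionMessage(e)
)

# lapply() makes no promise about shape, so varying lengths are fine.
classes <- lapply(df2, class)
lengths(classes)
```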

Why are sapply() and options() undesirable?

If you look at the header for that function,

function(fun = default_undesirable_functions)

you see that it records its choices in default_undesirable_functions, and if you look at that object, you'll see:

...
$options
[1] "use withr::with_options()"
...
$sapply
[1] "use vapply() or lapply()"
...

From the alternatives, you can guess at why the author thinks those functions are "undesirable":

  • options() is bad because it has global side effects. The withr::with_options() alternative keeps any changes to the options local.
  • sapply() is bad because vapply() is safer (as documented in ?sapply).
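To make the first point concrete without depending on withr, here is a base R sketch of the same idea: change an option inside a function and restore it on exit. The format_short helper is hypothetical, and this is not how withr itself is implemented; it only illustrates keeping an option change local.

```r
# Hypothetical helper that formats a number with a temporary 'digits' setting.
format_short <- function(x) {
  old <- options(digits = 3)  # options() returns the previous values
  on.exit(options(old))       # restore them when the function exits
  format(x)
}

before <- getOption("digits")
short <- format_short(pi)

# The global option is unchanged after the call.
identical(getOption("digits"), before)
```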

How to convert sapply code into vapply code

Replace sapply with:

vapply(df, is.factor, logical(1))
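More generally, the conversion comes down to adding a third argument that describes one result of the function. A minimal sketch, assuming a small made-up data frame df:

```r
# Hypothetical data frame for illustration.
df <- data.frame(
  a = 1:3,
  b = factor(c("x", "y", "x")),
  c = c("p", "q", "r"),
  stringsAsFactors = FALSE
)

# Function returns a single TRUE/FALSE: use logical(1).
is_fac <- vapply(df, is.factor, logical(1))

# Function returns a single integer: use integer(1).
n_unique <- vapply(df, function(col) length(unique(col)), integer(1))

# Function returns a single string: use character(1).
first_class <- vapply(df, function(col) class(col)[1], character(1))
```

If the function cannot be described by one fixed type and length, that is usually a sign that lapply() is the better replacement.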

Using sapply / vapply for read_html

In short, if you want the result to be a list, as you do here, use lapply instead of sapply. sapply is a wrapper around lapply that tries to simplify the result to a vector, matrix or array.

The same argument applies against vapply: as duly mentioned in the comments, it should be used only when each result simplifies to a value of known type and length.

So the neatest solution in this case is:

docsFor <- lapply(urls, read_html)

R sapply vs apply vs lapply + as.data.frame

Both sapply and apply convert the result to matrices. as.data.frame(lapply(...)) is a safe way to loop over data frame columns.

as.data.frame(
  lapply(
    df1,
    function(column)
    {
      if(inherits(column, "Date"))
      {
        pmin(column, Sys.Date())
      } else column
    }
  )
)
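To try the lapply approach, here is a self-contained sketch. The df1 below is a made-up stand-in (an id column and a Date column named when), since the original data is not shown; the sapply call is included only to show the conversion to a matrix.

```r
# Hypothetical data: one date in the past, one today, one in the future.
df1 <- data.frame(
  id = 1:3,
  when = Sys.Date() + c(-1, 0, 5)
)

# Cap every Date column at today's date; leave other columns untouched.
capped <- as.data.frame(
  lapply(df1, function(column) {
    if (inherits(column, "Date")) pmin(column, Sys.Date()) else column
  })
)
class(capped$when)

# sapply() simplifies the columns to a numeric matrix, dropping the Date class.
as_matrix <- sapply(df1, function(column) column)
is.matrix(as_matrix)
```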

The same operation is a little cleaner with ddply from plyr.

library(plyr)
ddply(
  df1,
  .(id),
  colwise(
    function(column)
    {
      if(inherits(column, "Date"))
      {
        pmin(column, Sys.Date())
      } else column
    }
  )
)

