Why is `vapply` safer than `sapply`?
As has already been noted, vapply does two things:
- Slight speed improvement
- Improves consistency by providing limited return type checks.
The second point is the greater advantage, as it helps catch errors before they happen and leads to more robust code. This return value checking could be done separately by using sapply followed by stopifnot to make sure that the return values are consistent with what you expected, but vapply is a little easier (if more limited, since custom error-checking code could also check for values within bounds, etc.).
Here's an example of vapply ensuring your result is as expected. This parallels something I was just working on while PDF scraping, where findD would use a regex to match a pattern in raw text data (e.g. I'd have a list that was split by entity, and a regex to match addresses within each entity. Occasionally the PDF had been converted out of order and there would be two addresses for an entity, which caused badness).
> input1 <- list( letters[1:5], letters[3:12], letters[c(5,2,4,7,1)] )
> input2 <- list( letters[1:5], letters[3:12], letters[c(2,5,4,7,15,4)] )
> findD <- function(x) x[x=="d"]
> sapply(input1, findD )
[1] "d" "d" "d"
> sapply(input2, findD )
[[1]]
[1] "d"
[[2]]
[1] "d"
[[3]]
[1] "d" "d"
> vapply(input1, findD, "" )
[1] "d" "d" "d"
> vapply(input2, findD, "" )
Error in vapply(input2, findD, "") : values must be length 1,
but FUN(X[[3]]) result is length 2
Because there are two d's in the third element of input2, vapply produces an error. sapply, on the other hand, silently changes the class of the output from a character vector to a list, which could break code downstream.
As I tell my students, part of becoming a programmer is changing your mindset from "errors are annoying" to "errors are my friend."
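The sapply-plus-stopifnot alternative mentioned above could look roughly like the following. This is only a sketch: it redefines input2 and findD from the example so that it runs on its own, and the particular checks chosen are an assumption about what "as expected" means here (a character vector with one value per input element).

```r
# Same data and function as in the example above
input2 <- list(letters[1:5], letters[3:12], letters[c(2, 5, 4, 7, 15, 4)])
findD <- function(x) x[x == "d"]

res <- sapply(input2, findD)

# Manual checks standing in for vapply()'s FUN.VALUE argument.
# tryCatch() is only here so the script keeps running and we can look at the message.
check <- tryCatch(
  {
    stopifnot(is.character(res), length(res) == length(input2))
    "ok"
  },
  error = function(e) conditionMessage(e)
)

check
# the first check fails because res is a list, not a character vector
```

The upside of this route is that further conditions (bounds, allowed values) can be added to the stopifnot call; the downside is that you have to remember to write it every time.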
Zero length inputs
One related point is that if the input length is zero, sapply will always return an empty list, regardless of the input type. Compare:
sapply(1:5, identity)
## [1] 1 2 3 4 5
sapply(integer(), identity)
## list()
vapply(1:5, identity, integer(1))
## [1] 1 2 3 4 5
vapply(integer(), identity, integer(1))
## integer(0)
With vapply, you are guaranteed to have a particular type of output, so you don't need to write extra checks for zero length inputs.
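To make the consequence concrete, here is a small sketch with two hypothetical helper functions (count_lengths_s and count_lengths_v are made-up names for illustration) that differ only in which apply function they use. With an empty input, only the vapply version hands back something that numeric code can keep working with.

```r
# Hypothetical helpers: element lengths of a list, computed two ways
count_lengths_s <- function(x) sapply(x, length)
count_lengths_v <- function(x) vapply(x, length, integer(1))

empty <- list()

s_empty <- count_lengths_s(empty)
v_empty <- count_lengths_v(empty)

is.list(s_empty)     # TRUE: an empty list, not a number vector
is.integer(v_empty)  # TRUE: integer(0)
sum(v_empty)         # 0
```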
Benchmarks
vapply can be a bit faster because it already knows what format it should be expecting the results in.
input1.long <- rep(input1,10000)
library(microbenchmark)
m <- microbenchmark(
sapply(input1.long, findD ),
vapply(input1.long, findD, "" )
)
library(ggplot2)
# autoplot() for microbenchmark objects now ships with the microbenchmark package itself (it needs ggplot2);
# older versions needed library(taRifx) for this
autoplot(m)
Using vapply instead of sapply
Question 1:
The error with character(2) is because the character vector "integer" is only of length 1 and rightly fails the consistency check against the expected result of a character vector of length 2.
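The failure can be reproduced with a small data frame. The df below is a hypothetical stand-in, since the original question's data is not shown here; tryCatch is used only to capture the error message instead of stopping.

```r
# Hypothetical data frame; class() returns a single string for each of these columns
df <- data.frame(x = 1:10, y = letters[1:10])

# Asking for two strings per column fails the length check
msg <- tryCatch(
  vapply(df, class, character(2)),
  error = function(e) conditionMessage(e)
)
msg

# Asking for one string per column matches what class() returns here
ok <- vapply(df, class, character(1))
ok
```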
Question 2:
vapply() is there as a safer version of sapply(), as it makes sure you only get back what you expect from each application of FUN. It is also safer, I guess, because the shape of the output from vapply() is predictable: you don't get a vector, a matrix or a list depending on what FUN happened to return. You get a vector when each call returns a length-1 value, and an array otherwise.
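A quick illustration of the two output shapes, using made-up toy functions rather than anything from the question:

```r
# FUN returns one number per call: the result is a plain numeric vector
sq <- vapply(1:3, function(i) i^2, numeric(1))
sq
# [1] 1 4 9

# FUN returns two numbers per call: the result is a 2 x 3 matrix,
# one column per element of the input
both <- vapply(1:3, function(i) c(i, i^2), numeric(2))
dim(both)
# [1] 2 3
```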
In the specific example you give, you can't use vapply() as what is returned by class isn't consistent. You have to know or expect certain output and vapply() fails if the output from a call to FUN doesn't match what it expects.
In this instance, I suppose you could do
df2 <- data.frame(x = 1:10, y = Sys.time() + 1:10)
vapply(df2, FUN = function(x) paste(class(x), collapse = "; "),
FUN.VALUE = character(1))
> vapply(df2, FUN = function(x) paste(class(x), collapse = "; "),
+ FUN.VALUE = character(1))
x y
"integer" "POSIXct; POSIXt"
but whether that is useful to you or not is a different matter.
Really, using vapply() comes down to knowing what to expect from FUN and wanting to only ever get that output. If you don't know or can't control it, you are probably better off with lapply().
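For comparison, the lapply() route on the same df2 might look like this. It is a sketch that simply keeps whatever class() returns for each column, at the cost of getting a list back rather than a vector.

```r
df2 <- data.frame(x = 1:10, y = Sys.time() + 1:10)

# lapply() always returns a list, so differing result lengths are fine
classes <- lapply(df2, class)
classes$y
# [1] "POSIXct" "POSIXt"

# The per-column lengths show why character(1) or character(2) alone could not fit both
lengths(classes)
```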
Why are sapply() and options() undesirable ?
If you look at the header for that function,
function(fun = default_undesirable_functions)
you see that it records its choices in default_undesirable_functions, and if you look at that object, you'll see:
...
$options
[1] "use withr::with_options()"
...
$sapply
[1] "use vapply() or lapply()"
...
From the alternatives, you can guess at why the author thinks those functions are "undesirable":
- options() is bad because it has global side effects. The withr::with_options() alternative keeps any changes to the options local.
- sapply() is bad because vapply() is safer (as documented in ?sapply).
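The "global side effects" point can be seen in a few lines of base R. The two functions below are hypothetical examples; show_pi_local restores the option by hand with on.exit(), which is offered here as a base-R sketch of the same idea rather than as how withr is implemented.

```r
# Leaky version: the changed option stays in effect after the function returns
show_pi_global <- function() {
  options(digits = 3)
  format(pi)
}

# Local version: remember the old value and restore it when the function exits
show_pi_local <- function() {
  old <- options(digits = 3)
  on.exit(options(old), add = TRUE)
  format(pi)
}

old <- options(digits = 7)                  # start from a known state

local_result <- show_pi_local()             # "3.14"
digits_after_local <- getOption("digits")   # still 7

global_result <- show_pi_global()           # "3.14"
digits_after_global <- getOption("digits")  # now 3: the change leaked

options(old)                                # clean up after the demonstration
```

With withr installed, the local version reduces to withr::with_options(list(digits = 3), format(pi)).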
How to convert sapply code into vapply code
Replace the sapply() call with:
vapply(df, is.factor, logical(1))
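More generally, the conversion consists of adding a third argument that describes one result of the function. A sketch with a hypothetical df (the question's own data frame is not shown here):

```r
# Hypothetical data frame with one factor column
df <- data.frame(
  n = 1:3,
  f = factor(c("a", "b", "a")),
  s = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

is_fac_s <- sapply(df, is.factor)
is_fac_v <- vapply(df, is.factor, logical(1))
identical(is_fac_s, is_fac_v)
# [1] TRUE

# The template changes with what the function returns per column:
n_unique <- vapply(df, function(col) length(unique(col)), integer(1))   # one integer each
first_cl <- vapply(df, function(col) class(col)[1], character(1))       # one string each
```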
Using sapply / vapply for read_html
In short, if you want your return value to be a list, which is your case, then use lapply instead of sapply, which is a wrapper around lapply that tries to simplify its result to a vector, matrix or array.
The same argument applies against vapply, since it should be used, as duly mentioned in the comments, only for simplified objects.
So, the neatest solution in this case is:
docsFor <- lapply(urls, read_html)
R sapply vs apply vs lapply + as.data.frame
Both sapply and apply convert the result to matrices. as.data.frame(lapply(...)) is a safe way to loop over data frame columns.
as.data.frame(
lapply(
df1,
function(column)
{
if(inherits(column, "Date"))
{
pmin(column, Sys.Date())
} else column
}
)
)
It's a little cleaner with ddply from plyr.
library(plyr)
ddply(
df1,
.(id),
colwise(
function(column)
{
if(inherits(column, "Date"))
{
pmin(column, Sys.Date())
} else column
}
)
)
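To see what the conversion to a matrix costs in the base-R version, here is a sketch with a hypothetical df1 (an id column plus a Date column called when; the real df1 is not shown in the question). The helper name cap_dates is also made up for the illustration.

```r
# Hypothetical data: one date in the past, today, and one in the future
df1 <- data.frame(id = 1:3, when = Sys.Date() + c(-1, 0, 5))

# Cap Date columns at today, leave other columns alone
cap_dates <- function(column) {
  if (inherits(column, "Date")) pmin(column, Sys.Date()) else column
}

# sapply() simplifies to a numeric matrix, so the Date class is lost
as_matrix <- sapply(df1, cap_dates)
is.matrix(as_matrix)

# as.data.frame(lapply()) keeps each column's own class
capped <- as.data.frame(lapply(df1, cap_dates))
class(capped$when)
# [1] "Date"
```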