Why is `unlist(lapply)` faster than `sapply`?
In addition to running lapply, sapply runs simplify2array to try to fit the output into an array. To figure out whether that is possible, the function needs to check if all the individual outputs have the same length; this is done via a costly unique(lapply(..., length)), which accounts for most of the time difference you were seeing:
b <- lapply(JSON, "[[", "a")
microbenchmark(lapply(JSON, "[[", "a"),
unlist(b),
unique(lapply(b, length)),
sapply(JSON, "[[", "a"),
sapply(JSON, "[[", "a", simplify = FALSE),
unlist(lapply(JSON, "[[", "a"))
)
# Unit: microseconds
# expr min lq median uq max neval
# lapply(JSON, "[[", "a") 14809.151 15384.358 15774.26 16905.226 24944.863 100
# unlist(b) 920.047 1043.719 1158.62 1223.091 8056.231 100
# unique(lapply(b, length)) 10778.065 11060.452 11456.11 12581.414 19717.740 100
# sapply(JSON, "[[", "a") 24827.206 25685.535 26656.88 30519.556 93195.751 100
# sapply(JSON, "[[", "a", simplify = FALSE) 14283.541 14922.780 15526.42 16654.058 26865.022 100
# unlist(lapply(JSON, "[[", "a")) 15334.026 16133.146 16607.12 18476.182 30080.544 100
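If you know the result's type and length up front, vapply avoids this check altogether, because the caller declares the expected shape. A minimal sketch (the JSON object below is a hypothetical stand-in for the original data, which isn't shown):

```r
# Hypothetical stand-in for the original JSON list: each element has a field "a"
JSON <- replicate(1000, list(a = runif(1)), simplify = FALSE)

# vapply needs no length check: the template numeric(1) already
# promises that every result is a length-1 double
res <- vapply(JSON, "[[", numeric(1), "a")

identical(res, unlist(lapply(JSON, "[[", "a")))
# [1] TRUE
```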
R: sapply / lapply Different Behaviour with Names
Look carefully at this:
lapply(cc, ff)
#> [[1]]
#> [[1]]$myname
#> [1] "1"
#>
#>
#> [[2]]
#> [[2]]$myname
#> [1] "2"
The output of lapply itself doesn't have names. Look:
a <- lapply(cc, ff)
names(a)
#> NULL
The output of the lapply is actually an unnamed list. Each element of a is a named list.
names(a[[1]])
#> [1] "myname"
names(a[[2]])
#> [1] "myname"
So in fact, USE.NAMES will apply, and sapply will assign the contents of cc as names for the output of the lapply, for which sapply is a thin wrapper, as stated in the documentation. It's quite straightforward to follow the code through:
sapply
#> function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
#> {
#> FUN <- match.fun(FUN)
#> answer <- lapply(X = X, FUN = FUN, ...)
#> if (USE.NAMES && is.character(X) && is.null(names(answer)))
#> names(answer) <- X
#> if (!isFALSE(simplify) && length(answer))
#> simplify2array(answer, higher = (simplify == "array"))
#> else answer
#> }
#> <bytecode: 0x036ae7a8>
#> <environment: namespace:base>
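To see the USE.NAMES branch in action, here is a small reconstruction (cc and ff are assumptions based on the output above: a character vector and a function returning a one-element named list):

```r
# Assumed reconstruction of the question's objects
cc <- c("1", "2")
ff <- function(x) list(myname = x)

names(lapply(cc, ff))
# NULL                  -- lapply never names its result

names(sapply(cc, ff, simplify = FALSE))
# [1] "1" "2"           -- USE.NAMES copied cc over as names
```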
Why is apply() method slower than a for loop in R?
As Chase said: Use the power of vectorization. You're comparing two bad solutions here.
To clarify why your apply solution is slower:
Within the for loop, you actually use the vectorized indices of the matrix, meaning there is no type conversion going on. I'm glossing over the details a bit here, but basically the internal calculation more or less ignores the dimensions: they're just kept as an attribute and returned with the vector representing the matrix. To illustrate:
> x <- 1:10
> attr(x,"dim") <- c(5,2)
> y <- matrix(1:10,ncol=2)
> all.equal(x,y)
[1] TRUE
Now, when you use the apply, the matrix is split up internally in 100,000 row vectors, every row vector (i.e. a single number) is put through the function, and in the end the result is combined into an appropriate form. The apply function reckons a vector is best in this case, and thus has to concatenate the results of all rows. This takes time.
Also, the sapply function first uses as.vector(unlist(...)) to convert anything to a vector, and at the end tries to simplify the answer into a suitable form. This also takes time, hence sapply might be slower here as well. Yet, it isn't on my machine.
If apply were a solution here (and it isn't), you could compare:
> system.time(loop_million <- mash(million))
user system elapsed
0.75 0.00 0.75
> system.time(sapply_million <- matrix(unlist(sapply(million,squish,simplify=F))))
user system elapsed
0.25 0.00 0.25
> system.time(sapply2_million <- matrix(sapply(million,squish)))
user system elapsed
0.34 0.00 0.34
> all.equal(loop_million,sapply_million)
[1] TRUE
> all.equal(loop_million,sapply2_million)
[1] TRUE
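mash and squish come from the original question and aren't shown here; as a rough stand-in, assume the per-element operation clamps values into [0, 1]. The vectorized pmin/pmax version does the same work on the whole matrix at once:

```r
# Stand-in data and operation (hypothetical: squish is assumed here
# to clamp each value into [0, 1])
million <- matrix(runif(1e5, -1, 2), ncol = 1)
clamp   <- function(x) max(min(x, 1), 0)   # scalar version, as apply would use

by_row     <- apply(million, 1, clamp)     # one function call per row
vectorized <- pmax(pmin(million, 1), 0)    # a handful of calls in total

all.equal(by_row, as.vector(vectorized))
# [1] TRUE
```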
R Fast nested list iteration
Try:
unlist(lapply(d1, function(x) x[["x"]][which.max(x[["Freq"]])]))
As @jay.sf suggests, you can also use $ instead of [[:
unlist(lapply(d1, function(x) x$x[which.max(x$Freq)]))
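A small reproducible check, assuming d1 is a list of data frames with columns x and Freq (the original data isn't shown, so this toy version is an assumption):

```r
# Hypothetical toy version of d1
d1 <- list(data.frame(x = c("a", "b"), Freq = c(2, 5),
                      stringsAsFactors = FALSE),
           data.frame(x = c("c", "d"), Freq = c(7, 1),
                      stringsAsFactors = FALSE))

# For each data frame, pick the x with the highest Freq
unlist(lapply(d1, function(x) x$x[which.max(x$Freq)]))
# [1] "b" "c"
```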
R for loop faster than sapply
UPDATE2 & potential answer:
I have now simplified fx.test4 as follows, and it is now equivalent in speed to the for loop. Therefore, it was the extra conversion steps that made the lapply solution slower, as @John pointed out. In addition, maybe the assumption that *apply HAD to be faster was faulty, as discussed by @Ari B. Friedman and @SimonO101. Thank you all!
fx.test5<-function(vc)
{
L<-strsplit(vc, split=",")
m.res<-t(sapply(seq_along(L), function(X){sort(c(as.numeric(L[[X]]),rep(0,3)),decreasing=TRUE)[1:3]}))
return(m.res)
}
fx.test5(vc)
[,1] [,2] [,3]
[1,] 129 129 120
[2,] 103 67 67
[3,] 4 3 3
[4,] 4 3 1
[5,] 0 0 0
[6,] 5 0 0
[7,] 99 1 1
[8,] 52 44 40
[9,] 20 19 19
[10,] 135 97 96
system.time(fx.test5(vc))
user system elapsed
0.001 0.000 0.001
UPDATE3: Indeed, on a longer example, the *apply function is faster (by a hair).
system.time(fx.test3(vc2))
# user system elapsed
# 3.596 0.006 3.601
system.time(fx.test5(vc2))
# user system elapsed
# 3.355 0.006 3.359
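For completeness, a vapply variant along the same lines (an untimed sketch): by declaring the output shape numeric(3), it skips sapply's simplification check entirely.

```r
# Same interface as fx.test5; numeric(3) declares each result's shape up front
fx.test6 <- function(vc) {
  L <- strsplit(vc, split = ",")
  t(vapply(L, function(x) {
    sort(c(as.numeric(x), rep(0, 3)), decreasing = TRUE)[1:3]
  }, numeric(3)))
}

fx.test6(c("1,2,3", "10,5"))
#      [,1] [,2] [,3]
# [1,]    3    2    1
# [2,]   10    5    0
```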
What are the performance differences between for-loops and the apply family of functions?
First of all, it is an already long-debunked myth that for loops are any slower than lapply. The for loops in R have been made a lot more performant and are currently at least as fast as lapply.
That said, you have to rethink your use of lapply here. Your implementation demands assigning to the global environment, because your code requires you to update the weight during the loop. And that is a valid reason to not consider lapply.
lapply is a function you should use for its return value and its lack of side effects: it combines the results in a list automatically and doesn't mess with the environment you work in, contrary to a for loop. The same goes for replicate. See also this question:
Is R's apply family more than syntactic sugar?
The reason your lapply solution is far slower is that your way of using it creates a lot more overhead.
- replicate is nothing else but sapply internally, so you actually combine sapply and lapply to implement your double loop. sapply creates extra overhead because it has to test whether or not the result can be simplified. So a for loop will actually be faster than using replicate.
- Inside your lapply anonymous function, you have to access the dataframe for both x and y for every observation. This means that, contrary to your for loop, e.g. the function $ has to be called every time.
- Because you use these high-end functions, your 'lapply' solution calls 49 functions, compared to your for solution, which only calls 26. These extra functions for the lapply solution include calls to functions like match, structure, [[, names, %in%, sys.call, duplicated, ... all functions not needed by your for loop, as that one doesn't do any of these checks.
If you want to see where this extra overhead comes from, look at the internal code of replicate, unlist, sapply and simplify2array.

You can use the following code to get a better idea of where you lose performance with the lapply. Run it line by line!
Rprof(interval = 0.0001)
f()
Rprof(NULL)
fprof <- summaryRprof()$by.self
Rprof(interval = 0.0001)
perceptron(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10)
Rprof(NULL)
perprof <- summaryRprof()$by.self
fprof$Fun <- rownames(fprof)
perprof$Fun <- rownames(perprof)
Selftime <- merge(fprof, perprof,
all = TRUE,
by = 'Fun',
suffixes = c(".lapply",".for"))
sum(!is.na(Selftime$self.time.lapply))
sum(!is.na(Selftime$self.time.for))
Selftime[order(Selftime$self.time.lapply, decreasing = TRUE),
c("Fun","self.time.lapply","self.time.for")]
Selftime[is.na(Selftime$self.time.for),]
R Looking for faster alternative for sapply()
Depending on your needs, you might want to consider alternative packages (while ngram proclaims to be fast). The fastest alternative here (while ng = 1) is to split the string and count the unique words.
stringi_get_unigrams <- function(text)
lengths(lapply(stri_split(text, fixed = " "), unique))
system.time(res3 <- stringi_get_unigrams(df$text))
# user system elapsed
# 0.84 0.00 0.86
If you need something more general (e.g. ng != 1), you'd have to compare all pairwise combinations of words, which is a bit more involved.
stringi_get_duograms <- function(text){
splits <- stri_split(text, fixed = " ")
comp <- function(x)
nrow(unique(matrix(c(x[-1], x[-length(x)]), ncol = 2)))
res <- sapply(splits, comp)
res[res == 0] <- NA_integer_
res
}
system.time(res <- stringi_get_duograms(df$text))
# user system elapsed
# 5.94 0.02 5.93
Here we have the added benefit of not crashing when there are no matching word combinations in the corpus for the specific words.
Times on my CPU
system.time({
res <- get_unigrams(df$text)
})
# user system elapsed
# 12.72 0.16 12.94
An alternative parallel implementation:
get_unigrams_par <- function(text) {
require(purrr)
require(ngram)
sapply(text, function(text)
ngram(text, n = 1) %>% get.ngrams() %>% length()
)
}
cl <- parallel::makeCluster(nc <- parallel::detectCores())
print(nc)
# [1] 12
system.time(
res2 <- unname(unlist(parallel::parLapply(cl,
split(df$text,
sort(1:nrow(df)%%nc)),
get_unigrams_par)))
)
# user system elapsed
# 0.20 0.11 2.95
parallel::stopCluster(cl)
And just to check that all results are identical:
identical(unname(res), res2)
# TRUE
identical(res2, res3)
# TRUE
Edit:
Of course there's nothing stopping us from combining parallelization with any result above:
cl <- parallel::makeCluster(nc <- parallel::detectCores())
parallel::clusterEvalQ(cl, library(stringi))
system.time(
res4 <- unname(unlist(parallel::parLapply(cl,
split(df$text,
sort(1:nrow(df)%%nc)),
stringi_get_unigrams)))
)
# user system elapsed
# 0.01 0.16 0.27
parallel::stopCluster(cl)
How to correctly use lists?
Just to address the last part of your question, since that really points out the difference between a list and a vector in R:
Why do these two expressions not return the same result?
x = list(1, 2, 3, 4); x2 = list(1:4)
A list can contain any other class as each element. So you can have a list where the first element is a character vector, the second is a data frame, etc. In this case, you have created two different lists. x has four vectors, each of length 1. x2 has one vector of length 4:
> length(x[[1]])
[1] 1
> length(x2[[1]])
[1] 4
So these are completely different lists.
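One way to see both the difference and the connection: flattening with unlist yields the same values but not the same storage types, since list(1, 2, 3, 4) holds doubles while 1:4 is an integer vector.

```r
x <- list(1, 2, 3, 4); x2 <- list(1:4)

all.equal(unlist(x), unlist(x2))   # TRUE: same values
identical(unlist(x), unlist(x2))   # FALSE: double vs. integer storage
```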
R lists are very much like a hash map data structure in that each index value can be associated with any object. Here's a simple example of a list that contains 3 different classes (including a function):
> complicated.list <- list("a"=1:4, "b"=1:3, "c"=matrix(1:4, nrow=2), "d"=search)
> lapply(complicated.list, class)
$a
[1] "integer"
$b
[1] "integer"
$c
[1] "matrix"
$d
[1] "function"
Given that the last element is the search function, I can call it like so:
> complicated.list[["d"]]()
[1] ".GlobalEnv" ...
As a final comment on this: it should be noted that a data.frame is really a list (from the data.frame documentation):
A data frame is a list of variables of the same number of rows with unique row names, given class ‘"data.frame"’
That's why columns in a data.frame can have different data types, while columns in a matrix cannot. As an example, here I try to create a matrix with numbers and characters:
> a <- 1:4
> class(a)
[1] "integer"
> b <- c("a","b","c","d")
> d <- cbind(a, b)
> d
a b
[1,] "1" "a"
[2,] "2" "b"
[3,] "3" "c"
[4,] "4" "d"
> class(d[,1])
[1] "character"
Note how I cannot change the data type in the first column to numeric because the second column has characters:
> d[,1] <- as.numeric(d[,1])
> class(d[,1])
[1] "character"
How to coerce a list object to type 'double'
If you want to convert all elements of a to a single numeric vector and length(a) is greater than 1 (OK, even if it is of length 1), you could unlist the object first and then convert:
as.numeric(unlist(a))
# [1] 10 38 66 101 129 185 283 374
Bear in mind that there aren't any quality controls here. Also, X$Days is a mighty odd name.