Why is `unlist(lapply)` faster than `sapply`?

Why is `unlist(lapply)` faster than `sapply`?

In addition to running `lapply`, `sapply` runs `simplify2array` to try to fit the output into an array. To figure out whether that is possible, the function needs to check whether all the individual outputs have the same length: this is done via a costly `unique(lapply(..., length))`, which accounts for most of the time difference you were seeing:

library(microbenchmark)

b <- lapply(JSON, "[[", "a")

microbenchmark(
  lapply(JSON, "[[", "a"),
  unlist(b),
  unique(lapply(b, length)),
  sapply(JSON, "[[", "a"),
  sapply(JSON, "[[", "a", simplify = FALSE),
  unlist(lapply(JSON, "[[", "a"))
)

# Unit: microseconds
#                                       expr       min        lq   median        uq       max neval
#                    lapply(JSON, "[[", "a") 14809.151 15384.358 15774.26 16905.226 24944.863   100
#                                  unlist(b)   920.047  1043.719  1158.62  1223.091  8056.231   100
#                  unique(lapply(b, length)) 10778.065 11060.452 11456.11 12581.414 19717.740   100
#                    sapply(JSON, "[[", "a") 24827.206 25685.535 26656.88 30519.556 93195.751   100
# sapply(JSON, "[[", "a", simplify = FALSE)  14283.541 14922.780 15526.42 16654.058 26865.022   100
#            unlist(lapply(JSON, "[[", "a")) 15334.026 16133.146 16607.12 18476.182 30080.544   100
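
To make that relationship concrete, here is a small sketch reusing `JSON` and `b` from the benchmark above (assuming, as in the question, that `JSON` is a list rather than a character vector): switching off simplification makes `sapply` return exactly what `lapply` returns, and the default simplification is precisely the `simplify2array` step timed above.

# Sketch, reusing JSON and b from above; JSON is assumed to be a list,
# so USE.NAMES does not add names to the result.
identical(sapply(JSON, "[[", "a", simplify = FALSE), b)
# [1] TRUE
identical(sapply(JSON, "[[", "a"), simplify2array(b, higher = FALSE))
# [1] TRUE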

R: sapply / lapply Different Behaviour with Names

Look carefully at this:

lapply(cc, ff)
#> [[1]]
#> [[1]]$myname
#> [1] "1"
#>
#>
#> [[2]]
#> [[2]]$myname
#> [1] "2"

The output of lapply itself doesn't have names. Look:

a <- lapply(cc, ff)
names(a)
#> NULL

The output of lapply is actually an unnamed list; each element of a, however, is a named list.

names(a[[1]])
#> [1] "myname"
names(a[[2]])
#> [1] "myname"

So USE.NAMES does in fact apply: sapply assigns the contents of cc as names for the output of the lapply call for which sapply is a thin wrapper, as stated in the documentation. It's quite straightforward to follow the code through:

sapply
#> function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
#> {
#>     FUN <- match.fun(FUN)
#>     answer <- lapply(X = X, FUN = FUN, ...)
#>     if (USE.NAMES && is.character(X) && is.null(names(answer)))
#>         names(answer) <- X
#>     if (!isFALSE(simplify) && length(answer))
#>         simplify2array(answer, higher = (simplify == "array"))
#>     else answer
#> }
#> <bytecode: 0x036ae7a8>
#> <environment: namespace:base>
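
To see that naming behaviour in action, here is a minimal reconstruction (the question's `cc` and `ff` are not shown above, so these definitions are assumptions chosen to match the printed output): with a character vector cc, USE.NAMES kicks in and sapply copies cc into the names before simplifying, while lapply leaves the result unnamed.

# Assumed reconstructions of the question's objects:
cc <- c("1", "2")
ff <- function(v) list(myname = v)

# lapply: the outer list stays unnamed
names(lapply(cc, ff))
#> NULL

# sapply: names(answer) <- cc, then unlist() pastes outer and inner names together
names(sapply(cc, ff))
#> [1] "1.myname" "2.myname"

# with USE.NAMES = FALSE only the inner names remain
names(sapply(cc, ff, USE.NAMES = FALSE))
#> [1] "myname" "myname"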

Why is apply() method slower than a for loop in R?

As Chase said: Use the power of vectorization. You're comparing two bad solutions here.

To clarify why your apply solution is slower:

Within the for loop, you actually use the vectorized indices of the matrix, meaning there is no type conversion going on. I'm glossing over the details a bit here, but basically the internal calculation more or less ignores the dimensions: they're just kept as an attribute and returned with the vector representing the matrix. To illustrate:

> x <- 1:10
> attr(x,"dim") <- c(5,2)
> y <- matrix(1:10,ncol=2)
> all.equal(x,y)
[1] TRUE

Now, when you use apply, the matrix is split up internally into 100,000 row vectors, every row vector (i.e. a single number) is put through the function, and in the end the results are combined into an appropriate form. The apply function reckons a vector is best in this case, and thus has to concatenate the results of all the rows. This takes time.

The sapply function likewise first uses as.vector(unlist(...)) to convert everything to a vector, and at the end tries to simplify the answer into a suitable form. That also takes time, so sapply might be slower here as well. Yet, it's not on my machine.
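
A small toy illustration of that splitting (my own example, not the poster's million matrix): apply() calls the function once per row, whereas a vectorized equivalent handles the whole matrix in a single call, which is where the real speedup lies.

> m <- matrix(1:10, ncol = 2)
> apply(m, 1, sum)   # five separate calls to sum(), one per row
[1]  7  9 11 13 15
> rowSums(m)         # one vectorized call, same result
[1]  7  9 11 13 15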

If apply were a solution here (and it isn't), you could compare:

> system.time(loop_million <- mash(million))
user system elapsed
0.75 0.00 0.75
> system.time(sapply_million <- matrix(unlist(sapply(million,squish,simplify=F))))
user system elapsed
0.25 0.00 0.25
> system.time(sapply2_million <- matrix(sapply(million,squish)))
user system elapsed
0.34 0.00 0.34
> all.equal(loop_million,sapply_million)
[1] TRUE
> all.equal(loop_million,sapply2_million)
[1] TRUE

R Fast nested list iteration

Try:

unlist(lapply(d1, function(x) x[["x"]][which.max(x[["Freq"]])]))

As @jay.sf suggests, you may also prefer $ over [[:

unlist(lapply(d1, function(x) x$x[which.max(x$Freq)]))
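
For a self-contained illustration, here is an assumed toy version of d1 (the original isn't shown; it is taken to be a list of data frames with columns x and Freq, on R >= 4.0 so the strings stay character):

d1 <- list(data.frame(x = c("a", "b"), Freq = c(1, 2)),
           data.frame(x = c("c", "d"), Freq = c(3, 1)))
unlist(lapply(d1, function(x) x$x[which.max(x$Freq)]))
# [1] "b" "c"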

R for loop faster than sapply

UPDATE2 & potential answer:

I have now simplified fx.test4 as follows, and it is now equivalent in speed to the for loop. So it was the extra conversion steps that made the lapply solution slower, as @John pointed out. In addition, the assumption that *apply HAD to be faster was perhaps faulty, as discussed by @Ari B. Friedman and @SimonO101. Thank you all!

fx.test5 <- function(vc) {
  L <- strsplit(vc, split = ",")
  m.res <- t(sapply(seq_along(L), function(X) {
    sort(c(as.numeric(L[[X]]), rep(0, 3)), decreasing = TRUE)[1:3]
  }))
  return(m.res)
}

fx.test5(vc)
[,1] [,2] [,3]
[1,] 129 129 120
[2,] 103 67 67
[3,] 4 3 3
[4,] 4 3 1
[5,] 0 0 0
[6,] 5 0 0
[7,] 99 1 1
[8,] 52 44 40
[9,] 20 19 19
[10,] 135 97 96

system.time(fx.test5(vc))
user system elapsed
0.001 0.000 0.001

UPDATE3: Indeed, on a longer example, the *apply function is faster (by a hair).

system.time(fx.test3(vc2))
# user system elapsed
# 3.596 0.006 3.601
system.time(fx.test5(vc2))
# user system elapsed
# 3.355 0.006 3.359

What are the performance differences between for-loops and the apply family of functions?

First of all, it is a long-debunked myth that for loops are any slower than lapply. The for loops in R have been made a lot more performant and are currently at least as fast as lapply.

That said, you have to rethink your use of lapply here. Your implementation demands assigning to the global environment, because your code requires you to update the weight during the loop. And that is a valid reason to not consider lapply.

lapply is a function you should use for its return value and its lack of side effects: it combines the results in a list automatically and doesn't mess with the environment you work in, contrary to a for loop. The same goes for replicate. See also this question:

Is R's apply family more than syntactic sugar?

The reason your lapply solution is far slower is that your way of using it creates a lot more overhead.

  • replicate is nothing else but sapply internally, so you actually combine sapply and lapply to implement your double loop. sapply creates extra overhead because it has to test whether or not the result can be simplified. So a for loop will actually be faster than using replicate (see the snippet right after this list).
  • Inside your lapply anonymous function, you have to access the data frame for both x and y for every observation. This means that, contrary to your for loop, the function $ (for example) has to be called every time.
  • Because you use these higher-level functions, your 'lapply' solution calls 49 functions, compared to your for solution, which only calls 26. The extra functions for the lapply solution include calls to match, structure, [[, names, %in%, sys.call, duplicated, ...
    None of these are needed by your for loop, as it doesn't do any of these checks.
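
You can check the first point directly: the body of replicate is nothing more than a call to sapply.

body(replicate)
#> sapply(integer(n), eval.parent(substitute(function(...) expr)),
#>     simplify = simplify)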

If you want to see where this extra overhead comes from, look at the internal code of replicate, unlist, sapply and simplify2array.

You can use the following code to get a better idea of where you lose performance with the lapply solution. Run it line by line!

Rprof(interval = 0.0001)
f()
Rprof(NULL)
fprof <- summaryRprof()$by.self

Rprof(interval = 0.0001)
perceptron(as.matrix(irissubdf[1:2]), irissubdf$y, 1, 10)
Rprof(NULL)
perprof <- summaryRprof()$by.self

fprof$Fun <- rownames(fprof)
perprof$Fun <- rownames(perprof)

Selftime <- merge(fprof, perprof,
                  all = TRUE,
                  by = 'Fun',
                  suffixes = c(".lapply", ".for"))

sum(!is.na(Selftime$self.time.lapply))
sum(!is.na(Selftime$self.time.for))
Selftime[order(Selftime$self.time.lapply, decreasing = TRUE),
         c("Fun", "self.time.lapply", "self.time.for")]

Selftime[is.na(Selftime$self.time.for), ]

R Looking for faster alternative for sapply()

Depending on your needs you might want to consider alternative packages (even though ngram proclaims to be fast). The fastest alternative here (when ng = 1) is to split the strings and count the unique words.

library(stringi)

stringi_get_unigrams <- function(text)
  lengths(lapply(stri_split(text, fixed = " "), unique))

system.time(res3 <- stringi_get_unigrams(df$text))
# user system elapsed
# 0.84 0.00 0.86

If you want something more complex (e.g. ng != 1) you'd have to compare all pairs of adjacent words, which is a bit more involved.

stringi_get_duograms <- function(text) {
  splits <- stri_split(text, fixed = " ")
  comp <- function(x)
    nrow(unique(matrix(c(x[-1], x[-length(x)]), ncol = 2)))
  res <- sapply(splits, comp)
  res[res == 0] <- NA_integer_
  res
}
system.time(res <- stringi_get_duograms(df$text))
# user system elapsed
# 5.94 0.02 5.93

Here we have the added benefit of not crashing when a document in the corpus contains no word combinations at all.
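
To see what comp() actually counts, here is a tiny hand-made example (my own toy input): the string "a b a b" contains the adjacent pairs (a,b), (b,a) and (a,b) again, i.e. two unique duograms.

x <- c("a", "b", "a", "b")
nrow(unique(matrix(c(x[-1], x[-length(x)]), ncol = 2)))
# [1] 2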

Times on my CPU

system.time({
  res <- get_unigrams(df$text)
})
# user system elapsed
# 12.72 0.16 12.94

An alternative parallel implementation:

get_unigrams_par <- function(text) {
  require(purrr)
  require(ngram)
  sapply(text, function(text)
    ngram(text, n = 1) %>% get.ngrams() %>% length()
  )
}

cl <- parallel::makeCluster(nc <- parallel::detectCores())
print(nc)
# [1] 12

system.time(
  res2 <- unname(unlist(parallel::parLapply(cl,
                                            split(df$text,
                                                  sort(1:nrow(df) %% nc)),
                                            get_unigrams_par)))
)
# user system elapsed
# 0.20 0.11 2.95
parallel::stopCluster(cl)

And just to check that all results are identical:

identical(unname(res), res2)
# TRUE
identical(res2, res3)
# TRUE

Edit:

Of course there's nothing stopping us from combining parallelization with any result above:

cl <- parallel::makeCluster(nc <- parallel::detectCores())
parallel::clusterEvalQ(cl, library(stringi))
system.time(
  res4 <- unname(unlist(parallel::parLapply(cl,
                                            split(df$text,
                                                  sort(1:nrow(df) %% nc)),
                                            stringi_get_unigrams)))
)
# user system elapsed
# 0.01 0.16 0.27
parallel::stopCluster(cl)

How to correctly use lists?

Just to address the last part of your question, since that really points out the difference between a list and vector in R:

Why do these two expressions not return the same result?

x = list(1, 2, 3, 4); x2 = list(1:4)

A list can contain an object of any class in each element. So you can have a list where the first element is a character vector, the second is a data frame, etc. In this case, you have created two different lists: x has four vectors, each of length 1, while x2 has one vector of length 4:

> length(x[[1]])
[1] 1
> length(x2[[1]])
[1] 4

So these are completely different lists.

R lists are very much like a hash map data structure in that each index value can be associated with any object. Here's a simple example of a list that contains 3 different classes (including a function):

> complicated.list <- list("a"=1:4, "b"=1:3, "c"=matrix(1:4, nrow=2), "d"=search)
> lapply(complicated.list, class)
$a
[1] "integer"
$b
[1] "integer"
$c
[1] "matrix"
$d
[1] "function"

Given that the last element is the search function, I can call it like so:

> complicated.list[["d"]]()
[1] ".GlobalEnv" ...

As a final comment on this: it should be noted that a data.frame is really a list (from the data.frame documentation):

A data frame is a list of variables of the same number of rows with unique row names, given class ‘"data.frame"’

That's why columns in a data.frame can have different data types, while columns in a matrix cannot. As an example, here I try to create a matrix with numbers and characters:

> a <- 1:4
> class(a)
[1] "integer"
> b <- c("a","b","c","d")
> d <- cbind(a, b)
> d
a b
[1,] "1" "a"
[2,] "2" "b"
[3,] "3" "c"
[4,] "4" "d"
> class(d[,1])
[1] "character"

Note how I cannot change the data type in the first column to numeric because the second column has characters:

> d[,1] <- as.numeric(d[,1])
> class(d[,1])
[1] "character"

How to coerce a list object to type 'double'

If you want to convert all elements of a to a single numeric vector and length(a) is greater than 1 (OK, even if it is of length 1), you could unlist the object first and then convert.

as.numeric(unlist(a))
# [1] 10 38 66 101 129 185 283 374

Bear in mind that there aren't any quality controls here. Also, X$Days is a mighty odd name.


