## Is the *apply family really not vectorized?

First of all, in your example you make tests on a "data.frame" which is not fair for `colMeans`

, `apply`

and `"[.data.frame"`

since they have an overhead:

`system.time(as.matrix(m)) #called by `colMeans` and `apply``

# user system elapsed

# 1.03 0.00 1.05

system.time(for(i in 1:ncol(m)) m[, i]) #in the `for` loop

# user system elapsed

# 12.93 0.01 13.07

On a matrix, the picture is a bit different:

`mm = as.matrix(m)`

system.time(colMeans(mm))

# user system elapsed

# 0.01 0.00 0.01

system.time(apply(mm, 2, mean))

# user system elapsed

# 1.48 0.03 1.53

system.time(for(i in 1:ncol(mm)) mean(mm[, i]))

# user system elapsed

# 1.22 0.00 1.21

Regading the main part of the question, the main difference between `lapply`

/`mapply`

/etc and straightforward R-loops is where the looping is done. As Roland notes, both C and R loops need to evaluate an R function in each iteration which is the most costly. The really fast C functions are those that do everything in C, so, I guess, this should be what "vectorised" is about?

An example where we find the mean in each of a "list"s elements:

*( EDIT May 11 '16 : I believe the example with finding the "mean" is not a good setup for the differences between evaluating an R function iteratively and compiled code, (1) because of the particularity of R's mean algorithm on "numeric"s over a simple sum(x) / length(x) and (2) it should make more sense to test on "list"s with length(x) >> lengths(x). So, the "mean" example is moved to the end and replaced with another.)*

As a simple example we could consider the finding of the opposite of each `length == 1`

element of a "list":

In a `tmp.c`

file:

`#include <R.h>`

#define USE_RINTERNALS

#include <Rinternals.h>

#include <Rdefines.h>

/* call a C function inside another */

double oppC(double x) { return(ISNAN(x) ? NA_REAL : -x); }

SEXP sapply_oppC(SEXP x)

{

SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x)));

for(int i = 0; i < LENGTH(x); i++)

REAL(ans)[i] = oppC(REAL(VECTOR_ELT(x, i))[0]);

UNPROTECT(1);

return(ans);

}

/* call an R function inside a C function;

* will be used with 'f' as a closure and as a builtin */

SEXP sapply_oppR(SEXP x, SEXP f)

{

SEXP call = PROTECT(allocVector(LANGSXP, 2));

SETCAR(call, install(CHAR(STRING_ELT(f, 0))));

SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x)));

for(int i = 0; i < LENGTH(x); i++) {

SETCADR(call, VECTOR_ELT(x, i));

REAL(ans)[i] = REAL(eval(call, R_GlobalEnv))[0];

}

UNPROTECT(2);

return(ans);

}

And in R side:

`system("R CMD SHLIB /home/~/tmp.c")`

dyn.load("/home/~/tmp.so")

with data:

`set.seed(007)`

myls = rep_len(as.list(c(NA, runif(3))), 1e7)

#a closure wrapper of `-`

oppR = function(x) -x

for_oppR = compiler::cmpfun(function(x, f)

{

f = match.fun(f)

ans = numeric(length(x))

for(i in seq_along(x)) ans[[i]] = f(x[[i]])

return(ans)

})

Benchmarking:

`#call a C function iteratively`

system.time({ sapplyC = .Call("sapply_oppC", myls) })

# user system elapsed

# 0.048 0.000 0.047

#evaluate an R closure iteratively

system.time({ sapplyRC = .Call("sapply_oppR", myls, "oppR") })

# user system elapsed

# 3.348 0.000 3.358

#evaluate an R builtin iteratively

system.time({ sapplyRCprim = .Call("sapply_oppR", myls, "-") })

# user system elapsed

# 0.652 0.000 0.653

#loop with a R closure

system.time({ forR = for_oppR(myls, "oppR") })

# user system elapsed

# 4.396 0.000 4.409

#loop with an R builtin

system.time({ forRprim = for_oppR(myls, "-") })

# user system elapsed

# 1.908 0.000 1.913

#for reference and testing

system.time({ sapplyR = unlist(lapply(myls, oppR)) })

# user system elapsed

# 7.080 0.068 7.170

system.time({ sapplyRprim = unlist(lapply(myls, `-`)) })

# user system elapsed

# 3.524 0.064 3.598

all.equal(sapplyR, sapplyRprim)

#[1] TRUE

all.equal(sapplyR, sapplyC)

#[1] TRUE

all.equal(sapplyR, sapplyRC)

#[1] TRUE

all.equal(sapplyR, sapplyRCprim)

#[1] TRUE

all.equal(sapplyR, forR)

#[1] TRUE

all.equal(sapplyR, forRprim)

#[1] TRUE

*(Follows the original example of mean finding):*

`#all computations in C`

all_C = inline::cfunction(sig = c(R_ls = "list"), body = '

SEXP tmp, ans;

PROTECT(ans = allocVector(REALSXP, LENGTH(R_ls)));

double *ptmp, *pans = REAL(ans);

for(int i = 0; i < LENGTH(R_ls); i++) {

pans[i] = 0.0;

PROTECT(tmp = coerceVector(VECTOR_ELT(R_ls, i), REALSXP));

ptmp = REAL(tmp);

for(int j = 0; j < LENGTH(tmp); j++) pans[i] += ptmp[j];

pans[i] /= LENGTH(tmp);

UNPROTECT(1);

}

UNPROTECT(1);

return(ans);

')

#a very simple `lapply(x, mean)`

C_and_R = inline::cfunction(sig = c(R_ls = "list"), body = '

SEXP call, ans, ret;

PROTECT(call = allocList(2));

SET_TYPEOF(call, LANGSXP);

SETCAR(call, install("mean"));

PROTECT(ans = allocVector(VECSXP, LENGTH(R_ls)));

PROTECT(ret = allocVector(REALSXP, LENGTH(ans)));

for(int i = 0; i < LENGTH(R_ls); i++) {

SETCADR(call, VECTOR_ELT(R_ls, i));

SET_VECTOR_ELT(ans, i, eval(call, R_GlobalEnv));

}

double *pret = REAL(ret);

for(int i = 0; i < LENGTH(ans); i++) pret[i] = REAL(VECTOR_ELT(ans, i))[0];

UNPROTECT(3);

return(ret);

')

R_lapply = function(x) unlist(lapply(x, mean))

R_loop = function(x)

{

ans = numeric(length(x))

for(i in seq_along(x)) ans[i] = mean(x[[i]])

return(ans)

}

R_loopcmp = compiler::cmpfun(R_loop)

set.seed(007); myls = replicate(1e4, runif(1e3), simplify = FALSE)

all.equal(all_C(myls), C_and_R(myls))

#[1] TRUE

all.equal(all_C(myls), R_lapply(myls))

#[1] TRUE

all.equal(all_C(myls), R_loop(myls))

#[1] TRUE

all.equal(all_C(myls), R_loopcmp(myls))

#[1] TRUE

microbenchmark::microbenchmark(all_C(myls),

C_and_R(myls),

R_lapply(myls),

R_loop(myls),

R_loopcmp(myls),

times = 15)

#Unit: milliseconds

# expr min lq median uq max neval

# all_C(myls) 37.29183 38.19107 38.69359 39.58083 41.3861 15

# C_and_R(myls) 117.21457 123.22044 124.58148 130.85513 169.6822 15

# R_lapply(myls) 98.48009 103.80717 106.55519 109.54890 116.3150 15

# R_loop(myls) 122.40367 130.85061 132.61378 138.53664 178.5128 15

# R_loopcmp(myls) 105.63228 111.38340 112.16781 115.68909 128.1976 15

## Vectorize() vs apply()

`Vectorize`

is just a wrapper for `mapply`

. It just builds you an `mapply`

loop for whatever function you feed it. Thus there are often easier things do to than `Vectorize()`

it and the explicit `*apply`

solutions end up being computationally equivalent or perhaps superior.

Also, for your specific example, you've heard of `mget`

, right?

## Vectorization using apply family of functions in R

You can easily re-write your loop using `sapply`

instead of `for...`

, although, as bzki commented, this alone will not speed-up your code:

`# sapply version:`

fitted_value = sapply(seq_len(nrow(wdbc)),function(i) {

# put all the gubbins in here

# ...

return (x %*% co.data)

})

However, if you have multiple cores available on your computer, or - even better - access to a server with many processors, then an `sapply`

loop can easily be parallelized using `parSapply`

from the 'parallel' package, as shown in this example:

`# slow sapply loop (takes 12s):`

data=123

answer = sapply(1:12,function(i) {

Sys.sleep(1)

return(data+i)

})

# faster parallel version (takes 4s on my laptop with 4 cores):

library(parallel)

mycluster=makeCluster(detectCores()-1) # leave 1 core available for system

data=123

clusterExport(mycluster,"data") # specify variable(s) that should be available to parallel function

answer = parSapply(mycluster,1:12,function(i) {

Sys.sleep(1)

return(data+i)

})

stopCluster(mycluster)

## Calling vectorized functions within vectorized functions

Somehow I feel you're overcomplicating a few things.

I've taken a stab at accomplishing the same output you are describing. Let me know whether the following works for you:

`library(dplyr)`

library(tidyr)

library(purrr)

score <- function(battery) {

battery %>%

pivot_longer(-PID, names_to = 'response_id', values_to = 'response_value') %>%

mutate(

test_name = str_extract(response_id, '^[^_]+_[^_]+(?=_)'),

QID = as.integer(str_extract(response_id, '(?<=QID)\\d+(?=_)'))

) %>%

filter(test_name %in% ls(envir = .GlobalEnv)) %>%

split(f = .$test_name) %>%

imap(.f = function(test_results, test_name){

test_results %>%

left_join(get(test_name), by = 'QID') %>%

filter(!is.na(CorrectResponse)) %>%

mutate(

is_correct = as.integer(response_value == CorrectResponse)

)

}) %>%

do.call(bind_rows, .) %>%

select(PID, response_id, is_correct) %>%

spread(key = response_id, value = is_correct)

}

This is essentially doing the following:

- pivot the response columns into a rowwise representation with
`pivot_longer`

, leaving the`PID`

column in place - extract the
`test_name`

and`QID`

, which I see you need for scoring - filter for only tests where we have the responses loaded
- split the dataframe into a list, so we can ...
- ... left join the correct response df onto each chunk, then score the test
- rejoin the dataframes into once
- select only the
`PID`

column, the original column name and our score - spread those out again into a column format

Tada :)

## Use sapply or Vector?

Fundamentally, `sapply`

, and similarly its siblings of the *apply* family, are loops to build a vector/matrix, or list from a multiple-item object. See this canonical answer on subject: Is the "*apply" family really not vectorized?. However, some operations are vectorized (i.e., loops are run at machine level such as in C or Fortran) and can receive a vector or list and operate in very quick runtime.

Almost always, the non-looped version will run faster. Below shows timings for a much larger sequence input.

`system.time({sapply(1:300000, function(i) dnorm(i,0,1))})`

# user system elapsed

# 1.097 0.026 1.169

system.time({dnorm(1:300000,0,1)})

# user system elapsed

# 0.006 0.001 0.007

As you found out `dnorm`

is such a vectorized function. Many R functions can accept vectors or lists to return equal length outputs including `paste`

, `lengths`

, `toupper`

, `[`

, `file.*`

family, `as.*`

family, `grep`

family. However, more complex, multi-layered operations require iterative calls to return single objects as you found out with `optim`

. Other non-vectorized methods include `read.csv`

, `write.csv`

, `merge`

, `lm`

, `glm`

, and `summary`

. With these such methods, the *apply* family can then iteratively call them and bind all elements into a singular object such as vector/matrix or list.

`kappa <- seq(1,7)`

sapply(kappa, function(i) optimize(function(x) (i^x^2+5*x+6), c(-10,10))$minimum)

# [1] -9.9999263 -1.2407389 -0.9122106 -0.7784485 -0.7022782 -0.6517733 -0.6151620

## Using Apply or Vectorize to apply custom function to a dataframe

For the particular task requested it could be

`celebrities$newcol <- with(celebrities, age + income)`

The `+`

function is inherently vectorized. Using `apply`

with `sum`

is inefficient. Using `apply`

could have been greatly simplified by omitting the first column because that would avoid the coercion to a character matrix caused by the first column.

` celebrities$newcol <- apply(celebrities[-1], function(x) sum(x) )`

That way you would avoid coercing the vectors to "character" and then needing to coerce back the formerly-numeric columns to `numeric`

. Using `sum`

inside apply does get around the fact that sum is not vectorized, but it's an example of inefficient R coding.

You get automatic vectorization if the "inner" algorithm can be constructed completely from vectorized functions: the Math and Ops groups being the usual components. See `?Ops`

. Otherwise, you may need to use `mapply`

or `Vectorize`

.

## Is R's apply family more than syntactic sugar?

The `apply`

functions in R don't provide improved performance over other looping functions (e.g. `for`

). One exception to this is `lapply`

which can be a little faster because it does more work in C code than in R (see this question for an example of this).

But in general, the rule is that ** you should use an apply function for clarity, not for performance**.

I would add to this that ** apply functions have no side effects**, which is an important distinction when it comes to functional programming with R. This can be overridden by using

`assign`

or `<<-`

, but that can be very dangerous. Side effects also make a program harder to understand since a variable's state depends on the history.*Edit:*

Just to emphasize this with a trivial example that recursively calculates the Fibonacci sequence; this could be run multiple times to get an accurate measure, but the point is that none of the methods have significantly different performance:

`> fibo <- function(n) {`

+ if ( n < 2 ) n

+ else fibo(n-1) + fibo(n-2)

+ }

> system.time(for(i in 0:26) fibo(i))

user system elapsed

7.48 0.00 7.52

> system.time(sapply(0:26, fibo))

user system elapsed

7.50 0.00 7.54

> system.time(lapply(0:26, fibo))

user system elapsed

7.48 0.04 7.54

> library(plyr)

> system.time(ldply(0:26, fibo))

user system elapsed

7.52 0.00 7.58

*Edit 2:*

Regarding the usage of parallel packages for R (e.g. rpvm, rmpi, snow), these do generally provide `apply`

family functions (even the `foreach`

package is essentially equivalent, despite the name). Here's a simple example of the `sapply`

function in `snow`

:

`library(snow)`

cl <- makeSOCKcluster(c("localhost","localhost"))

parSapply(cl, 1:20, get("+"), 3)

This example uses a socket cluster, for which no additional software needs to be installed; otherwise you will need something like PVM or MPI (see Tierney's clustering page). `snow`

has the following apply functions:

`parLapply(cl, x, fun, ...)`

parSapply(cl, X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)

parApply(cl, X, MARGIN, FUN, ...)

parRapply(cl, x, fun, ...)

parCapply(cl, x, fun, ...)

It makes sense that `apply`

functions should be used for parallel execution since they *have no side effects*. When you change a variable value within a

`for`

loop, it is globally set. On the other hand, all `apply`

functions can safely be used in parallel because changes are local to the function call (unless you try to use `assign`

or `<<-`

, in which case you can introduce side effects). Needless to say, it's critical to be careful about local vs. global variables, especially when dealing with parallel execution.*Edit:*

Here's a trivial example to demonstrate the difference between `for`

and `*apply`

so far as side effects are concerned:

`> df <- 1:10`

> # *apply example

> lapply(2:3, function(i) df <- df * i)

> df

[1] 1 2 3 4 5 6 7 8 9 10

> # for loop example

> for(i in 2:3) df <- df * i

> df

[1] 6 12 18 24 30 36 42 48 54 60

Note how the `df`

in the parent environment is altered by `for`

but not `*apply`

.

## Loop not vectorized due to reason '1300'

Basically the body of the loop is so simple that it's more efficient to compile it as it is rather than vectorize it as the runtime cost of the vectorization would be greater than executing the code as it is.

There's really no point in trying to force it, as the compiler is telling you that the vectorized version would be less efficient that the non-vectorized version. If you add more computations to the loop the compiler may choose to vectorize it.

### Related Topics

How to Add Row and Column to a Dataframe of Different Length

Relative Frequencies/Proportions With Dplyr

Plot Two Graphs in Same Plot in R

How to Use R'S Ellipsis Feature When Writing Your Own Function

How to Create a Lag Variable Within Each Group

Data.Table VS Dplyr: Can One Do Something Well the Other Can't or Does Poorly

Convert Data.Frame Columns from Factors to Characters

How to Use a Variable to Specify Column Name in Ggplot

Collapse Text by Group in Data Frame

Test If a Vector Contains a Given Element

Counting the Number of Elements With the Values of X in a Vector

Difference Between Require() and Library()

Formatting Decimal Places in R

Select/Assign to Data.Table When Variable Names Are Stored in a Character Vector

How to Get Summary Statistics by Group

Use First Row Data as Column Names in R

Merge 2 Data Frames in a Loop for Each Column in One of Them