Does the Term "Vectorization" Mean Different Things in Different Contexts


"Vectorization" in R, is a vector processing in R's interpreter's view. Take the function cumsum as an example. On entry, R interpreter sees that a vector x is passed into this function. However, the work is then passed to C language that R interpreter can not analyze / track. While C is doing work, R is just waiting. By the time that R's interpreter comes back to work, a vector has been processed. So in R's view, it has issued a single instruction but processed a vector. This is an analogy to the concept of SIMD - "single instruction, multiple data".

It is not only functions like cumsum, which take a vector and return a vector, that count as "vectorized" in R; functions like sum, which take a vector and return a scalar, are "vectorized" as well.

Simply put: whenever R calls compiled code for a loop, it is "vectorization". If you wonder why this kind of "vectorization" is useful: a loop written in a compiled language is faster than a loop written in an interpreted language. A C loop is translated to machine language that the CPU understands directly, whereas an R loop needs the interpreter's help to be read, iteration by iteration, before the CPU can execute it. It is like a conversation: if you know Chinese, you can respond to someone speaking Chinese right away; otherwise you need a translator to render each sentence into English, you respond in English, and the translator turns your reply back into Chinese, sentence by sentence. The effectiveness of communication is greatly reduced.

x <- runif(1e+7)

## R loop
system.time({
  sumx <- 0
  for (x0 in x) sumx <- sumx + x0
  sumx
})
# user system elapsed
# 1.388 0.000 1.347

## C loop
system.time(sum(x))
# user system elapsed
# 0.032 0.000 0.030

Be aware that "vectorization" in R is just an analogy to SIMD but not a real one. A real SIMD uses CPU's vector registers for computations hence is a true parallel computing via data parallelism. R is not a language where you can program CPU registers; you have to write compiled code or assembly code for that purpose.

R's "vectorization" does not care how a loop written in a compiled language is really executed; after all that is beyond R's interpreter's knowledge. Regarding whether these compiled code will be executed with SIMD, read Does R leverage SIMD when doing vectorized calculations?


More on "vectorization" in R

I am not a Julia user, but Bogumił Kamiński has demonstrated an impressive feature of that language: loop fusion. Julia can do this because, as he points out, "vectorization in Julia is implemented in Julia", not outside the language.

This reveals a downside of R's vectorization: speed often comes at the price of memory usage. I am not saying that Julia won't have this problem (I don't use it, so I don't know), but it is definitely true for R.

Here is an example: Fastest way to compute row-wise dot products between two skinny tall matrices in R. rowSums(A * B) is a "vectorization" in R, as both "*" and rowSums are coded in C as a loop. However, R cannot fuse them into a single C loop; it has to materialize the temporary matrix C = A * B in RAM first.
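
To make the hidden temporary explicit, here is a sketch (the matrix sizes are made up):

A <- matrix(runif(2e6), nrow = 1e5)  ## a skinny tall matrix
B <- matrix(runif(2e6), nrow = 1e5)

C <- A * B       ## 1st C loop: a full temporary matrix as large as A
d <- rowSums(C)  ## 2nd C loop: reduce the temporary to a vector

## a fused loop would accumulate d directly and never allocate C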

Another example is R's recycling rule, and any computation relying on it. For example, when you add a scalar a to a matrix A by A + a, what conceptually happens is that a is first recycled into a matrix B with the same dimensions as A, i.e., B <- matrix(a, nrow(A), ncol(A)), and then an addition between two matrices is computed: A + B. Clearly the generation of the temporary matrix B is undesirable, but you can't do better unless you write your own C function for A + a and call it from R. This is described as "such a fusion is possible only if explicitly implemented" in Bogumił Kamiński's answer.
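
Spelled out in code (a toy example of mine):

A <- matrix(1:6, nrow = 2)
a <- 10

B <- matrix(a, nrow(A), ncol(A))  ## the conceptual temporary
identical(A + a, A + B)
# [1] TRUE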

To deal with the memory effects of many temporary results, R has a sophisticated mechanism called "garbage collection". It helps, but memory can still explode if you generate some really big temporary result somewhere in your code. A good example is the function outer. I have written many answers using this function, but it is particularly memory-unfriendly.
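
For a sense of scale (back-of-the-envelope numbers of mine):

x <- runif(1e4)
## outer(x, x) materializes a 1e4 x 1e4 double matrix in one temporary:
## 8 bytes * 1e4 * 1e4 = 8e8 bytes, i.e., roughly 800 MB for one line of code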

I might have been off-topic in this edit, as I have begun to discuss the side effects of "vectorization". Use it with care.

  • Keep memory usage in mind; there might be a more memory-efficient vectorized implementation. For example, as mentioned in the linked thread on row-wise dot products between two matrices, c(crossprod(x, y)) is better than sum(x * y) (see the sketch after this list).
  • Be prepared to use CRAN R packages that contain compiled code. If you find the existing vectorized functions in R too limited for your task, explore CRAN for packages that can do it. You can post a question about your coding bottleneck on Stack Overflow, and somebody may point you to the right function in the right package.
  • Be happy to write your own compiled code.
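
Here is that dot-product example as a sketch (made-up vectors):

x <- runif(1e7); y <- runif(1e7)

## allocates a length-1e7 temporary for x * y, then reduces it
sum(x * y)

## a single BLAS call with no big temporary; the 1 x 1 matrix result
## is dropped to a scalar by c()
c(crossprod(x, y))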

Why does vectorizing this simple R loop give a different result?

warm-up

As a warm-up, consider two simpler examples.

## example 1
x <- 1:11
for (i in 1:10) x[i] <- x[i + 1]
x
# [1] 2 3 4 5 6 7 8 9 10 11 11

x <- 1:11
x[1:10] <- x[2:11]
x
# [1] 2 3 4 5 6 7 8 9 10 11 11

## example 2
x <- 1:11
for (i in 1:10) x[i + 1] <- x[i]
x
# [1] 1 1 1 1 1 1 1 1 1 1 1

x <- 1:11
x[2:11] <- x[1:10]
x
# [1] 1 1 2 3 4 5 6 7 8 9 10

"Vectorization" is successful in the 1st example but not the 2nd. Why?

Here is a prudent analysis. "Vectorization" starts with loop unrolling, then executes several instructions in parallel. Whether a loop can be "vectorized" depends on the data dependency carried by the loop.

Unrolling the loop in example 1 gives

x[1]  <- x[2]
x[2]  <- x[3]
x[3]  <- x[4]
x[4]  <- x[5]
x[5]  <- x[6]
x[6]  <- x[7]
x[7]  <- x[8]
x[8]  <- x[9]
x[9]  <- x[10]
x[10] <- x[11]

Executing these instructions one by one and executing them simultaneously give identical results. So this loop can be "vectorized".

The loop in example 2 is

x[2]  <- x[1]
x[3]  <- x[2]
x[4]  <- x[3]
x[5]  <- x[4]
x[6]  <- x[5]
x[7]  <- x[6]
x[8]  <- x[7]
x[9]  <- x[8]
x[10] <- x[9]
x[11] <- x[10]

Unfortunately, executing these instructions one by one and executing them simultaneously would not give identical results. For example, when executing them one by one, x[2] is modified by the 1st instruction, and this modified value is then passed to x[3] by the 2nd instruction, so x[3] ends up with the same value as x[1]. In parallel execution, however, x[3] equals the original x[2]. As a result, this loop cannot be "vectorized".

In "vectorization" theory,

  • Example 1 has a "write-after-read" dependency in data: x[i] is modified after it is read;
  • Example 2 has a "read-after-write" dependency in data: x[i] is read after it is modified.

A loop with "write-after-read" data dependency can be "vectorized", while a loop with "read-after-write" data dependency can not.


in depth

Perhaps many people are confused by now. Is "vectorization" a form of "parallel processing"?

Yes. In the 1960s, when people wondered what kind of parallel-processing computer should be designed for high-performance computing, Flynn classified the design ideas into 4 types. The "SIMD" category (single instruction, multiple data) is the one behind "vectorization", and a computer with SIMD capability is called a "vector processor" or "array processor".

In the 1960s there were not many programming languages. People wrote assembly (and later FORTRAN, once a compiler had been invented) to program CPU registers directly. A SIMD computer is able to load multiple data into a vector register with a single instruction and do the same arithmetic on those data at the same time, so data processing is truly parallel. Consider our example 1 again. Suppose a vector register can hold two vector elements; then the loop can be executed with 5 iterations using vector processing rather than the 10 iterations of scalar processing.

reg <- x[2:3]    ## load vector register
x[1:2] <- reg    ## store vector register
-------------
reg <- x[4:5]    ## load vector register
x[3:4] <- reg    ## store vector register
-------------
reg <- x[6:7]    ## load vector register
x[5:6] <- reg    ## store vector register
-------------
reg <- x[8:9]    ## load vector register
x[7:8] <- reg    ## store vector register
-------------
reg <- x[10:11]  ## load vector register
x[9:10] <- reg   ## store vector register

Today there are many programming languages, R among them, and "vectorization" no longer unambiguously refers to SIMD. R is not a language in which we can program CPU registers; the "vectorization" in R is just an analogy to SIMD. In a previous Q & A, Does the term "vectorization" mean different things in different contexts?, I have tried to explain this. The following map illustrates how the analogy is made:

single (assembly) instruction    -> single R instruction
CPU vector registers             -> temporary vectors
parallel processing in registers -> C/C++/FORTRAN loops with temporary vectors

So, the R "vectorization" of the loop in example 1 is something like

## the C-level loop is implemented by function "["
tmp <- x[2:11] ## load data into a temporary vector
x[1:10] <- tmp ## fill temporary vector into x

Most of the time we just do

x[1:10] <- x[2:11]

without explicitly assigning the temporary vector to a variable. The temporary memory block created is not pointed to by any R variable, and is therefore subject to garbage collection.


a complete picture

In the above, "vectorization" is not introduced with the simplest example. Very often, "vectorization" is introduced with something like

a[1] <- b[1] + c[1]
a[2] <- b[2] + c[2]
a[3] <- b[3] + c[3]
a[4] <- b[4] + c[4]

where a, b and c are not aliased in memory, that is, the memory blocks storing vectors a, b and c do not overlap. This is an ideal case, as no memory aliasing implies no data dependency.

Apart from "data dependency", there is also "control dependency", that is, dealing with "if ... else ..." in "vectorization". However, for time and space reason I will not elaborate on this issue.


back to the example in the question

Now it is time to investigate the loop in the question.

set.seed(0)
x <- round(runif(10), 2)
sig <- sample.int(10)
# [1] 1 2 9 5 3 4 8 6 7 10
for (i in seq_along(sig)) x[i] <- x[sig[i]]

Unrolling the loop gives

x[1]  <- x[1]
x[2]  <- x[2]
x[3]  <- x[9]  ## 3rd instruction
x[4]  <- x[5]
x[5]  <- x[3]  ## 5th instruction
x[6]  <- x[4]
x[7]  <- x[8]
x[8]  <- x[6]
x[9]  <- x[7]
x[10] <- x[10]

There is "read-after-write" data dependency between the 3rd and the 5th instruction, so the loop can not be "vectorized" (see Remark 1).

Well then, what does x[] <- x[sig] do? Let's first explicitly write out the temporary vector:

tmp <- x[sig]
x[] <- tmp

Since "[" is called twice, there are actually two C-level loops behind this "vectorized" code:

tmp[1]  <- x[1]
tmp[2]  <- x[2]
tmp[3]  <- x[9]
tmp[4]  <- x[5]
tmp[5]  <- x[3]
tmp[6]  <- x[4]
tmp[7]  <- x[8]
tmp[8]  <- x[6]
tmp[9]  <- x[7]
tmp[10] <- x[10]

x[1] <- tmp[1]
x[2] <- tmp[2]
x[3] <- tmp[3]
x[4] <- tmp[4]
x[5] <- tmp[5]
x[6] <- tmp[6]
x[7] <- tmp[7]
x[8] <- tmp[8]
x[9] <- tmp[9]
x[10] <- tmp[10]

So x[] <- x[sig] is equivalent to

for (i in 1:10) tmp[i] <- x[sig[i]]
for (i in 1:10) x[i] <- tmp[i]
rm(tmp); gc()

which is not at all the original loop given in the question.
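
As a quick check, rerun the setup and compare the two versions; they indeed disagree:

set.seed(0)
x <- round(runif(10), 2)
sig <- sample.int(10)

x_loop <- x; x_vec <- x
for (i in seq_along(sig)) x_loop[i] <- x_loop[sig[i]]
x_vec[] <- x_vec[sig]

identical(x_loop, x_vec)
# [1] FALSE  ## the read-after-write clash between the 3rd and 5th instructions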


Remark 1

If implementing the loop in Rcpp is seen as a "vectorization", then let it be. But there is no chance to further "vectorize" that C / C++ loop with SIMD.
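
For completeness, here is what such an Rcpp version might look like (a minimal sketch of mine, not from the original answer; the C++ loop stays serial):

Rcpp::cppFunction('
NumericVector perm_loop(NumericVector x, IntegerVector sig) {
    // the loop from the question; sig is 1-based, C++ indexing is 0-based
    // note: x is modified in place, just like the R loop
    for (int i = 0; i < sig.size(); i++)
        x[i] = x[sig[i] - 1];
    return x;
}
')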


Remark 2

This Q & A is motivated by this Q & A. The OP originally presented a loop

for (i in 1:num) {
  for (j in 1:num) {
    mat[i, j] <- mat[i, mat[j, "rm"]]
  }
}

It is tempting to "vectorize" it as

mat[1:num, 1:num] <- mat[1:num, mat[1:num, "rm"]]

but it is potentially wrong. Later the OP changed the loop to

for (i in 1:num) {
  for (j in 1:num) {
    mat[i, j] <- mat[i, 1 + num + mat[j, "rm"]]
  }
}

which eliminates the memory aliasing issue, because the columns to be replaced are the first num columns, while the columns to be looked up are after the first num columns.
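
With the aliasing gone, the tempting one-liner from above should now be valid (my inference from this reasoning, assuming mat and num as in the linked question):

mat[1:num, 1:num] <- mat[1:num, 1 + num + mat[1:num, "rm"]]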


Remark 3

I got some comments regarding whether the loop in the question is making an "in-place" modification of x. Yes, it is. We can use tracemem:

set.seed(0)
x <- round(runif(10), 2)
sig <- sample.int(10)
tracemem(x)
#[1] "<0x28f7340>"
for (i in seq_along(sig)) x[i] <- x[sig[i]]
tracemem(x)
#[1] "<0x28f7340>"

My R session has allocated a memory block, pointed to by the address <0x28f7340>, for x; you may see a different value when you run the code. However, the output of tracemem does not change after the loop, which means that no copy of x is made. So the loop is indeed doing "in-place" modification without using extra memory.

However, the loop is not doing an "in-place" permutation. "In-place" permutation is a more complicated operation: not only do elements of x need to be swapped along the loop, elements of sig also need to be swapped (and in the end, sig would be 1:10).
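
For the curious, here is one way to do a genuine in-place permutation, by following the cycles of sig (a sketch of mine, not from the original answer):

set.seed(0)
x <- round(runif(10), 2)
sig <- sample.int(10)
target <- x[sig]  ## only for checking at the end

for (i in seq_along(sig)) {
  if (sig[i] == i) next
  tmp <- x[i]            ## save the value overwritten at the cycle head
  j <- i
  while (sig[j] != i) {  ## walk the cycle that contains i
    x[j] <- x[sig[j]]
    k <- sig[j]; sig[j] <- j; j <- k
  }
  x[j] <- tmp            ## close the cycle
  sig[j] <- j
}

identical(x, target)  # TRUE
identical(sig, 1:10)  # TRUE: sig has been sorted along the way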

What is vectorization?

Many CPUs have "vector" or "SIMD" instruction sets which apply the same operation simultaneously to two, four, or more pieces of data. Modern x86 chips have the SSE instructions, many PPC chips have the "Altivec" instructions, and even some ARM chips have a vector instruction set, called NEON.

"Vectorization" (simplified) is the process of rewriting a loop so that instead of processing a single element of an array N times, it processes (say) 4 elements of the array simultaneously N/4 times.

I chose 4 because it's what modern hardware is most likely to directly support for 32-bit floats or ints.


The difference between vectorization and loop unrolling:
Consider the following very simple loop that adds the elements of two arrays and stores the results to a third array.

for (int i=0; i<16; ++i)
    C[i] = A[i] + B[i];

Unrolling this loop would transform it into something like this:

for (int i=0; i<16; i+=4) {
    C[i]   = A[i]   + B[i];
    C[i+1] = A[i+1] + B[i+1];
    C[i+2] = A[i+2] + B[i+2];
    C[i+3] = A[i+3] + B[i+3];
}

Vectorizing it, on the other hand, produces something like this:

for (int i=0; i<16; i+=4)
    addFourThingsAtOnceAndStoreResult(&C[i], &A[i], &B[i]);

Where "addFourThingsAtOnceAndStoreResult" is a placeholder for whatever intrinsic(s) your compiler uses to specify vector instructions.



Terminology:

Note that most modern ahead-of-time compilers are able to auto-vectorize very simple loops like this. Auto-vectorization can often be enabled via a compile option (it is on by default at full optimization in modern C and C++ compilers, like gcc -O3 -march=native). OpenMP's #pragma omp simd is sometimes helpful as a hint to the compiler, especially for "reduction" loops like summing an FP array, where vectorization requires pretending that FP math is associative.

More complex algorithms still require help from the programmer to generate good vector code; we call this manual vectorization, often with intrinsics like x86 _mm_add_ps that map to a single machine instruction, as in SIMD prefix sum on Intel cpu or How to count character occurrences using SIMD. SIMD is even useful for short, non-looping problems, like Most insanely fastest way to convert 9 char digits into an int or unsigned int or How to convert a binary integer number to a hex string?

The term "vectorization" is also used to describe a higher level software transformation where you might just abstract away the loop altogether and just describe operating on arrays instead of the elements that comprise them. e.g. writing C = A + B in some language that allows that when those are arrays or matrices, unlike C or C++. In lower-level languages like that, you could describe calling BLAS or Eigen library functions instead of manually writing loops as a vectorized programming style. Some other answers on this question focus on that meaning of vectorization, and higher-level languages.

NumPy vectorization without the use of numpy.vectorize

"vectorization" can mean be different things depending on context. Use of low level C code with BLAS or SIMD is just one.

In physics 101, a vector represents a point or a velocity whose numeric representation can vary with the coordinate system. Thus I think of "vectorization", broadly speaking, as performing math operations on the "whole" array, without explicit control over the numeric elements.

numpy basically adds an ndarray class to Python. It has a large number of methods (and operators and ufuncs) that do indexing and math in compiled code (not necessarily using processor-specific SIMD). The big gain in speed, relative to Python-level iteration, is the use of compiled code optimized for the ndarray data structure. Python-level iteration (interpreted code) over arrays is actually slower than over equivalent lists.

I don't think numpy formally defines "vectorization". There isn't a "vector" class, and I haven't searched the documentation for the term. Here, and possibly on other forums, it just means writing code that makes optimal use of ndarray methods. Generally that means avoiding Python-level iteration.

np.vectorize is a tool for applying arrays to functions that only accept scalar inputs. It does not compile or otherwise "look inside" that function. But it does accept and apply arguments in a fully broadcasted sense, as in:

In [162]: vfunc(np.arange(3)[:,None],np.arange(4))
Out[162]:
array([[0, 1, 2, 3],
       [1, 2, 3, 4],
       [2, 1, 4, 5]])

Speed-wise, np.vectorize is slower than the equivalent list comprehension, at least for smaller sample cases. Recent testing shows that it scales better, so for large inputs it may be better. But still, the performance is nothing like your myfunc_2.

myfunc is not "vectorized" simply because expressions like if a > b do not work with arrays.

np.where(a > b, a - b, a + b) is "vectorized" because all arguments to the where work with arrays, and where itself uses them with full broadcasting powers.

In [163]: a,b = np.arange(3)[:,None], np.arange(4)
In [164]: np.where(a>b, a-b, a+b)
Out[164]:
array([[0, 1, 2, 3],
       [1, 2, 3, 4],
       [2, 1, 4, 5]])

myfunc_2 is "vectorized", at least in a:

In [168]: myfunc_2(a,2)
Out[168]:
array([[4],
       [1],
       [2]])

It does not work when b is an array; it's trickier to match the a[cond] shape with anything but a scalar:

In [169]: myfunc_2(a,b)
Traceback (most recent call last):
Input In [169] in <cell line: 1>
myfunc_2(a,b)
Input In [159] in myfunc_2
a[cond] -= b
IndexError: boolean index did not match indexed array along dimension 1; dimension is 1 but corresponding boolean dimension is 4

===

  • What are the requirements to say a function is "vectorized" if not converted via numpy.vectorize? Just broadcastable in its entirety?

In your examples, my_func is not "vectorized" because it only works with scalars. vfunc is fully "vectorized", but not faster. where is also "vectorized" and (probably) faster, though this may be scale-dependent. my_func2 is only "vectorized" in a.

  • How does NumPy determine if a function is "vectorized"/broadcastable?

numpy doesn't determine anything like this. numpy provides an ndarray class with many methods, and it's the use of those methods that makes a block of code "vectorized".

  • Why isn't this form of vectorization documented anywhere? (i.e., why doesn't NumPy have a "How to write a vectorized function" page?)

Keep in mind the distinction between "vectorization" as a performance strategy, and the basic idea of operating on whole arrays.

Why is vectorization faster in general than loops?

Vectorization (as the term is normally used) refers to SIMD (single instruction, multiple data) operation.

That means, in essence, that one instruction carries out the same operation on a number of operands in parallel. For example, consider multiplying a vector of size N by a scalar, and let M be the number of operands of that size the CPU can operate on simultaneously. The number of instructions it then needs to execute is approximately N/M, whereas with purely scalar operations it would have to carry out N operations.

For example, Intel's current AVX 2 instruction set uses 256-bit registers. These can be used to hold (and operate on) a set of 4 operands of 64 bits apiece, or 8 operands of 32 bits apiece.

So, assuming you're dealing with 32-bit, single-precision real numbers, a single instruction can do 8 operations (multiplications, in your case) at once, so you can finish N multiplications using only N/8 multiplication instructions. In theory, this should allow the operation to finish about 8 times as fast as executing one instruction at a time would.

Of course, the exact benefit depends on how many operands each instruction supports. Intel's first attempts only supported 64-bit registers, so to operate on 8 items at once, those items could only be 8 bits apiece. They currently support 256-bit registers, and have announced support for 512-bit (they may even have shipped that in a few high-end processors, but not in normal consumer processors, at least yet). Making good use of this capability can also be non-trivial, to put it mildly. Scheduling instructions so you actually have N operands available, in the right places at the right times, is not necessarily an easy task (at all).

To put things in perspective, the (now ancient) Cray 1 gained a lot of its speed exactly this way. Its vector unit operated on vector registers of 64 elements of 64 bits apiece, so it could do 64 double-precision operations per clock cycle. On optimally vectorized code, it was much closer to the speed of a current CPU than you might expect based solely on its (much lower) clock speed. Taking full advantage of that wasn't always easy, though (and still isn't).

Keep in mind, however, that vectorization is not the only way a CPU can carry out operations in parallel. There is also instruction-level parallelism, which allows a single CPU (or a single core) to execute more than one instruction at a time. Most modern CPUs include hardware to (theoretically) execute up to around 4 instructions per clock cycle [1] if the instructions are a mix of loads, stores, and ALU operations. They can fairly routinely execute close to 2 instructions per clock on average, or more in well-tuned loops when memory isn't a bottleneck.

Then, of course, there's multi-threading--running multiple streams of instructions on (at least logically) separate processors/cores.

So, a modern CPU might have, say, 4 cores, each of which can execute 2 vector multiplies per clock, and each of those instructions can operate on 8 operands. So, at least in theory, it can be carrying out 4 * 2 * 8 = 64 operations per clock.

Some instructions have better or worse throughput than others. For example, FP add throughput is lower than FMA or multiply throughput on Intel before Skylake (1 vector per clock instead of 2). But boolean logic like AND or XOR has a throughput of 3 vectors per clock; it doesn't take many transistors to build an AND/XOR/OR execution unit, so CPUs replicate them. When using high-throughput instructions, bottlenecks on the total pipeline width (the front-end that decodes and issues into the out-of-order part of the core) are common, rather than bottlenecks on a specific execution unit.


  [1] But over time, CPUs tend to have more resources available, so this number rises.

Is the *apply family really not vectorized?

First of all, in your example you run tests on a "data.frame", which is not fair to colMeans, apply and "[.data.frame", since they carry an overhead:

system.time(as.matrix(m))  #called by `colMeans` and `apply`
# user system elapsed
# 1.03 0.00 1.05
system.time(for(i in 1:ncol(m)) m[, i]) #in the `for` loop
# user system elapsed
# 12.93 0.01 13.07

On a matrix, the picture is a bit different:

mm = as.matrix(m)
system.time(colMeans(mm))
# user system elapsed
# 0.01 0.00 0.01
system.time(apply(mm, 2, mean))
# user system elapsed
# 1.48 0.03 1.53
system.time(for(i in 1:ncol(mm)) mean(mm[, i]))
# user system elapsed
# 1.22 0.00 1.21

Regarding the main part of the question: the main difference between lapply/mapply/etc. and straightforward R loops is where the looping is done. As Roland notes, both the C and the R loop need to evaluate an R function in each iteration, which is the most costly part. The really fast C functions are those that do everything in C, so, I guess, this should be what "vectorised" is about?

An example where we find the mean of each element of a "list":

(EDIT May 11 '16: I believe the example of finding the "mean" is not a good setup for showing the differences between evaluating an R function iteratively and compiled code, (1) because of the particularities of R's mean algorithm on "numeric"s compared with a simple sum(x) / length(x), and (2) because it makes more sense to test on "list"s with length(x) >> lengths(x). So, the "mean" example is moved to the end and replaced with another.)
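
On point (1): for "numeric" input, R's mean uses extended-precision accumulation plus a second correction pass, so it does more work than a naive one-liner:

x <- runif(1e6)
sum(x) / length(x)  ## one C-level pass over x, no correction
mean(x)             ## long-double accumulation plus a second refinement pass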

As a simple example, we could consider finding the opposite of each length == 1 element of a "list":

In a tmp.c file:

#include <R.h>
#define USE_RINTERNALS
#include <Rinternals.h>
#include <Rdefines.h>

/* call a C function inside another */
double oppC(double x) { return(ISNAN(x) ? NA_REAL : -x); }
SEXP sapply_oppC(SEXP x)
{
    SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x)));
    for(int i = 0; i < LENGTH(x); i++)
        REAL(ans)[i] = oppC(REAL(VECTOR_ELT(x, i))[0]);

    UNPROTECT(1);
    return(ans);
}

/* call an R function inside a C function;
 * will be used with 'f' as a closure and as a builtin */
SEXP sapply_oppR(SEXP x, SEXP f)
{
    SEXP call = PROTECT(allocVector(LANGSXP, 2));
    SETCAR(call, install(CHAR(STRING_ELT(f, 0))));

    SEXP ans = PROTECT(allocVector(REALSXP, LENGTH(x)));
    for(int i = 0; i < LENGTH(x); i++) {
        SETCADR(call, VECTOR_ELT(x, i));
        REAL(ans)[i] = REAL(eval(call, R_GlobalEnv))[0];
    }

    UNPROTECT(2);
    return(ans);
}

And in R side:

system("R CMD SHLIB /home/~/tmp.c")
dyn.load("/home/~/tmp.so")

with data:

set.seed(007)
myls = rep_len(as.list(c(NA, runif(3))), 1e7)

#a closure wrapper of `-`
oppR = function(x) -x

for_oppR = compiler::cmpfun(function(x, f)
{
  f = match.fun(f)
  ans = numeric(length(x))
  for(i in seq_along(x)) ans[[i]] = f(x[[i]])
  return(ans)
})

Benchmarking:

#call a C function iteratively
system.time({ sapplyC = .Call("sapply_oppC", myls) })
# user system elapsed
# 0.048 0.000 0.047

#evaluate an R closure iteratively
system.time({ sapplyRC = .Call("sapply_oppR", myls, "oppR") })
# user system elapsed
# 3.348 0.000 3.358

#evaluate an R builtin iteratively
system.time({ sapplyRCprim = .Call("sapply_oppR", myls, "-") })
# user system elapsed
# 0.652 0.000 0.653

#loop with a R closure
system.time({ forR = for_oppR(myls, "oppR") })
# user system elapsed
# 4.396 0.000 4.409

#loop with an R builtin
system.time({ forRprim = for_oppR(myls, "-") })
# user system elapsed
# 1.908 0.000 1.913

#for reference and testing
system.time({ sapplyR = unlist(lapply(myls, oppR)) })
# user system elapsed
# 7.080 0.068 7.170
system.time({ sapplyRprim = unlist(lapply(myls, `-`)) })
# user system elapsed
# 3.524 0.064 3.598

all.equal(sapplyR, sapplyRprim)
#[1] TRUE
all.equal(sapplyR, sapplyC)
#[1] TRUE
all.equal(sapplyR, sapplyRC)
#[1] TRUE
all.equal(sapplyR, sapplyRCprim)
#[1] TRUE
all.equal(sapplyR, forR)
#[1] TRUE
all.equal(sapplyR, forRprim)
#[1] TRUE

(The original mean-finding example follows:)

#all computations in C
all_C = inline::cfunction(sig = c(R_ls = "list"), body = '
    SEXP tmp, ans;
    PROTECT(ans = allocVector(REALSXP, LENGTH(R_ls)));

    double *ptmp, *pans = REAL(ans);

    for(int i = 0; i < LENGTH(R_ls); i++) {
        pans[i] = 0.0;

        PROTECT(tmp = coerceVector(VECTOR_ELT(R_ls, i), REALSXP));
        ptmp = REAL(tmp);

        for(int j = 0; j < LENGTH(tmp); j++) pans[i] += ptmp[j];

        pans[i] /= LENGTH(tmp);  /* sum -> mean */

        UNPROTECT(1);
    }

    UNPROTECT(1);
    return(ans);
')

