Does R Leverage SIMD When Doing Vectorized Calculations?

Does R leverage SIMD when doing vectorized calculations?

I just got a "good answer" badge two years after my initial answer. Thanks for acknowledging the quality of this answer. In return, I will substantially enrich the original content. This answer / article is now intended for any R user who wants to try out SIMD. It has the following sections:

  • Background: what is SIMD?
  • Summary: how can we leverage SIMD for our assembly or compiled code?
  • R side: how can R possibly use SIMD?
  • Performance: is vector code always faster than scalar code?
  • Writing R extensions: write compiled code with OpenMP SIMD

Background: what is SIMD?

Many R programmers may not know SIMD if they don't write assembly or compiled code.

SIMD (single instruction, multiple data) is a data-level parallel processing technology with a very long history. Before personal computers were around, SIMD unambiguously referred to the vector processing in a vector processor, and was the main route to high-performance computing. When personal computers later came onto the market, they didn't have any features resembling those of a vector processor. However, as the demand for processing multi-media data grew higher and higher, they began to have vector registers and corresponding sets of vector instructions to use those registers for vector data load, vector data arithmetic and vector data store. The capacity of vector registers keeps getting bigger, and the functionality of vector instruction sets is increasingly versatile. Today, they can do streaming load / store, strided load / store, scattered load / store, vector element shuffling, vector arithmetic (including fused arithmetic like fused multiply-add), vector logical operations, masking, etc. So they look more and more like the mini vector processors of the old days.

Although SIMD has been with personal computers for nearly two decades, many programmers are unaware of it. Instead, many are familiar with thread-level parallelism like multi-core computing (which can be referred to as MIMD). So if you are new to SIMD, I highly recommend watching the YouTube video Utilizing the other 80% of your system's performance: Starting with Vectorization by Ulrich Drepper from Red Hat.

Since vector instruction sets are extensions to the original architecture's instruction set, you have to invest extra effort to use them. If you write assembly code, you can call these instructions straight away. If you write compiled code like C, C++ and Fortran, you have to write inline assembly or use vector intrinsics. A vector intrinsic looks like a function, but it is in fact an inline assembly mapping to a vector assembly instruction. These intrinsics (or "functions") are not part of the standard libraries of a compiled language; they are provided by the architecture / machine. To use them, we need to include the appropriate header files and pass compiler-specific flags.
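
To give a flavour of what intrinsics look like, here is a minimal sketch of element-wise addition of two double arrays with SSE2 intrinsics (the function name is made up for illustration). It assumes an x86-64 machine, that n is a multiple of 2, and a compile line like gcc -msse2 -O2:

#include <stddef.h>
#include <emmintrin.h>   /* SSE2 intrinsics live in this header */

void add_sse2(size_t n, const double *x, const double *y, double *z) {
  for (size_t i = 0; i < n; i += 2) {
    __m128d vx = _mm_loadu_pd(x + i);            /* load 2 doubles from x */
    __m128d vy = _mm_loadu_pd(y + i);            /* load 2 doubles from y */
    _mm_storeu_pd(z + i, _mm_add_pd(vx, vy));    /* store x[i:i+1] + y[i:i+1] */
  }
}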

Let's first define the following for ease of later discussions:

  1. Writing assembly or compiled code that does not use vector instruction sets is called "writing scalar code";
  2. Writing assembly or compiled code that uses vector instruction sets is called "writing vector code".

So these two paths are "write A, get A" and "write B, get B". However, compilers are getting stronger and there is another "write A, get B" path:


  • Compilers have the power to translate your scalar compiled code into vector assembly code, a compiler optimization called "auto-vectorization".

Some compilers, like GCC, consider auto-vectorization part of the highest optimization level and enable it with the flag -O3, while more aggressive compilers like ICC (Intel C++ compiler) and clang enable it at -O2. Auto-vectorization can also be controlled directly by specific flags; for example, GCC has -ftree-vectorize. When exploiting auto-vectorization, it is advised to further hint the compiler to tailor the vector assembly code for your machine. For example, for GCC we may pass -march=native and for ICC we use -xHost. This makes sense, because even within the x86-64 architecture family, later microarchitectures come with more vector instruction sets. For example, Sandy Bridge supports vector instruction sets up to AVX, Haswell further supports AVX2 and FMA3, and Skylake further supports AVX-512. Without -march=native, GCC only generates vector instructions using instruction sets up to SSE2, which is a much smaller subset common to all x86-64 machines.
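
As a small illustration, the following scalar loop (a made-up example) is a typical candidate for auto-vectorization: compiling it with something like gcc -O2 -ftree-vectorize -march=native (or simply -O3 -march=native) lets GCC emit vector instructions, and adding -fopt-info-vec reports which loops were vectorized:

#include <stddef.h>

/* y <- y + a * x, written as plain scalar code; the compiler vectorizes the loop */
void daxpy_scalar(size_t n, double a, const double *x, double *y) {
  for (size_t i = 0; i < n; i++)
    y[i] += a * x[i];
}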


Summary: how can we leverage SIMD for our assembly or compiled code?

There are five ways to implement SIMD:

  1. Writing machine-specific vector assembly code directly. For example, on x86-64 we use SSE / AVX instruction sets, and on ARM architectures we use NEON instruction sets.

    • Pros: Code can be hand-tuned for better performance;
    • Cons: We have to write different versions of assembly code for different machines.
  2. Writing vector compiled code by using machine-specific vector intrinsics and compiling it with compiler-specific flags. For example, on x86-64 we use SSE / AVX intrinsics, and for GCC we set -msse2, -mavx, etc. (or simply -march=native for auto-detection). A variant of this option is to write compiler-specific inline assembly; for example, an introduction to GCC's inline assembly can be found here.

    • Pros: Writing compiled code is easier than writing assembly code, and the code is more readable hence easier to maintain;
    • Cons: We have to write different versions of code for different machines, and adapt Makefile for different compilers.
  3. Writing vector compiled code by using compiler-specific vector extensions. Some compilers have defined their own vector data type. For example, GCC's vector extensions can be found here;

    • Pros: We don't need to worry about the difference across architectures, as compiler can generate machine-specific assembly;
    • Cons: We have to write different versions of code for different compilers, and adapt Makefile likewise.
  4. Writing scalar compiled code and using compiler-specific flags for auto-vectorization. Optionally we can insert compiler-specific pragmas alongside our compiled code to give compilers more hints on, say, data alignment, loop unrolling depth, etc.

    • Pros: Writing scalar compiled code is easier than writing vector compiled code, and it is more readable to a broad audience.
    • Cons: We have to adapt the Makefile for different compilers, and in case we have used pragmas, they also need to be versioned.
  5. Writing scalar compiled code and inserting the OpenMP pragma #pragma omp simd (requires OpenMP 4.0+); a minimal sketch is given right after this list.

    • Pros: Same as option 4, and additionally, we can use a single version of pragmas since many mainstream compilers support the OpenMP standard;
    • Cons: We have to adapt the Makefile for different compilers, as they may have different flags to enable OpenMP and machine-specific tuning.
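
Here is a minimal sketch of option 5 (a made-up dot-product function, just for illustration): scalar C code plus a single OpenMP SIMD pragma. The pragma itself is portable across compilers; only the flag that enables OpenMP (e.g. -fopenmp for GCC, -qopenmp for ICC) is compiler-specific, which is exactly why the Makefile still needs adapting:

#include <stddef.h>

double dot(size_t n, const double *x, const double *y) {
  double s = 0.0;
  #pragma omp simd reduction(+:s)   /* ask the compiler to vectorize this loop */
  for (size_t i = 0; i < n; i++)
    s += x[i] * y[i];
  return s;
}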

From top to bottom, programmers progressively do less and compilers do increasingly more. Implementing SIMD is interesting, but this article unfortunately does not have room for decent coverage with examples. I will instead provide the most informative references I have found.

  • For options 1 and 2 on x86-64, the SSE / AVX intrinsics guide is definitely the best reference card, but not the right place to start learning these instructions. Where to start is individual-specific. I picked up intrinsics / assembly from BLISLab when I tried to write my own high-performance DGEMM (to be introduced later). After digesting the example code over there I started practising, and posted a few questions on StackOverflow or CodeReview when I got stuck.

  • For option 4, a good explanation is given by A guide to auto-vectorization with Intel C++ compilers. Although the manual is for ICC, the principles of how auto-vectorization works apply to GCC as well. The official website for GCC's auto-vectorization is rather outdated, and this presentation is more useful: GCC auto-vectorization.

  • For option 5, there is a very good technical report by Oak Ridge National Laboratory: Effective Vectorization with OpenMP 4.5.

In terms of portability,

  • Options 1 to 3 are not easily portable, because the version of vector code depends on machine and / or compiler;

  • Option 4 is much better as we get rid of the machine dependency, but we still have a problem with compiler dependency;

  • Option 5 is very close to portable, as adapting the Makefile is much easier than adapting code.

In terms of performance, it is conventionally believed that option 1 is the best, and that performance degrades as we move downward. However, compilers are getting better, and newer machines have hardware improvements (for example, the performance penalty for unaligned vector load is smaller). So the outlook for auto-vectorization is very positive. As part of my own DGEMM case study, I found that on an Intel Xeon E5-2650 v2 workstation with a peak performance of 18 GFLOPs per CPU, GCC's auto-vectorization attained 14 ~ 15 GFLOPs, which is rather impressive.


R side: how can R possibly use SIMD?

R can only use SIMD by calling compiled code that exploits SIMD. Compiled code in R has three sources:

  1. R packages with "base" priority, like base, stats, utils, etc, that come with R's source code;
  2. Other packages on CRAN that require compilation;
  3. External scientific libraries like BLAS and LAPACK.

Since R itself is portable across architectures, platforms and operating systems, and CRAN policy expects an R package to be equally portable, compiled code from sources 1 and 2 cannot be written in assembly or in compiler-dependent compiled code, ruling out options 1 to 3 for SIMD implementation. Auto-vectorization is the only chance left for R to leverage SIMD.

If you have built R with the compiler's auto-vectorization enabled, compiled code from source 1 can exploit SIMD. In fact, although R is written in a portable way, you can tune it for your machine when building it. For example, I would do icc -xHost -O2 with ICC or gcc -march=native -O2 -ftree-vectorize -ffast-math with GCC. Flags are set at R's build time and recorded in RHOME/etc/Makeconf (on Linux). Usually people just do a quick build, so flag configurations are decided automatically. The result can differ depending on your machine and your default compiler. On a Linux machine with GCC, the optimization flag is often automatically set to -O2, hence auto-vectorization is off; on a Mac OS X machine with clang, by contrast, auto-vectorization is on at -O2. So I suggest checking your Makeconf.

Flags in Makeconf are used when you run R CMD SHLIB, which is invoked by R CMD INSTALL or install.packages when installing CRAN packages that need compilation. By default, if Makeconf says that auto-vectorization is off, compiled code from source 2 cannot leverage SIMD. However, it is possible to override Makeconf flags by providing a user Makevars file (like ~/.R/Makevars on Linux), so that R CMD SHLIB takes these flags and auto-vectorizes compiled code from source 2.

BLAS and LAPACK are not part of the R project or the CRAN mirror. R simply takes them as they are, and does not even check whether a given library is a valid one! For example, on Linux if you alias your BLAS library to an arbitrary library foo.so, R will "stupidly" load foo.so instead on its startup and cause you trouble! The loose relationship between R and BLAS makes it easy to link different versions of BLAS to R, so that benchmarking different BLAS libraries in R becomes straightforward (of course, you have to restart R after you update the linkage). For Linux users with root privilege, switching between different BLAS libraries is recommended via sudo update-alternatives --config. If you don't have root privilege, this thread on StackOverflow will help you: Without root access, run R with tuned BLAS when it is linked with reference BLAS.

In case you don't know what BLAS is, here is a brief introduction. BLAS originally referred to a coding standard for vector-vector, matrix-vector and matrix-matrix computations in scientific computing. For example, it was recommended that a general matrix-matrix multiplication should take the form C <- beta * C + alpha * op(A) %*% op(B), known as DGEMM. Note that this operation is more than just C <- A %*% B, and the motivation for this design was to maximize code reuse. For example, C <- C + A %*% B, C <- 2 * C + t(A) %*% B, etc can all be computed with DGEMM. A model implementation in FORTRAN 77 is provided with the standard as a reference, and this model library is commonly known as the reference BLAS library. The reference library is deliberately left as-is; it is there to motivate people to tune its performance for specific machines. BLAS optimization is actually a very difficult job. At the end of the optimization, everything changes except the user interface, i.e., everything inside a BLAS function is changed, except that you still call it in the same way. The various optimized BLAS libraries are known as tuned BLAS libraries, and include for example ATLAS, OpenBLAS and Intel MKL. All tuned BLAS libraries exploit SIMD as part of their optimization. An optimized BLAS library is remarkably faster than the reference one, and the performance gap keeps growing wider on newer machines.
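
To make the DGEMM interface concrete, here is a small sketch using the CBLAS C interface (header and link names such as -lopenblas vary across BLAS distributions, and the wrapper function here is made up for illustration) showing how a single DGEMM call covers several of the R-level operations above:

#include <cblas.h>

/* C <- C + A %*% B for n-by-n column-major matrices:
   alpha = 1, beta = 1, no transposition.
   C <- 2 * C + t(A) %*% B would be the same call with CblasTrans
   for A and beta = 2.0. */
void update_C(int n, const double *A, const double *B, double *C) {
  cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
              n, n, n, 1.0, A, n, B, n, 1.0, C, n);
}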

R relies on BLAS. For example, the matrix-matrix multiplication operator "%*%" in R calls DGEMM, and the functions crossprod and tcrossprod are also mapped to DGEMM. BLAS lies at the centre of scientific computing; without BLAS, R would largely be broken, which is why linking an optimized BLAS library to R is so strongly advocated. It used to be difficult to check which BLAS library is linked to R (as this can be obscured by aliasing), but since R 3.4.0 this is no longer the case: sessionInfo() shows the full paths to the library or executable files providing the BLAS / LAPACK implementations currently in use (not available on Windows).

LAPACK is a more advanced scientific library built on top of BLAS. R relies on LAPACK for various matrix factorizations. For example, qr(, pivot = TRUE), chol, svd and eigen in R are mapped to LAPACK for QR factorization, Cholesky factorization, singular value decomposition and eigen decomposition. Note that all tuned BLAS libraries include a clone of LAPACK, so if R is linked to a tuned BLAS library, sessionInfo() will show both libraries coming from the same path; by contrast, if R is linked to the reference BLAS library, sessionInfo() will show two different paths for BLAS and LAPACK. There have been plenty of questions tagged r regarding the drastic performance difference of matrix multiplication across platforms, like Large performance differences between OS for matrix computation. In fact, if you just look at the output of sessionInfo(), you get an immediate clue that R is linked to a tuned BLAS on the first platform and to the reference BLAS on the second.


Performance: is vector code always faster than scalar code?

Vector code looks fast, but it may not be realistically faster than scalar code. Here is a case study: Why is this SIMD multiplication not faster than non-SIMD multiplication?. And what a coincidence: the vector operation examined there is exactly the one the OP here took as an example, the Hadamard product. People often forget that the processing speed of the CPU is not the deciding factor for practical performance. If data cannot be transported from memory to the CPU as fast as the CPU requests it, the CPU just sits there waiting most of the time. The Hadamard product falls exactly into this situation: for every multiplication, two values must be loaded from memory and one stored back, so it is a memory-bound operation. The processing power of a CPU can only be realized when substantially more arithmetic is done than data movement. The classic matrix-matrix multiplication in BLAS belongs to this case, and this explains why the SIMD implementation in a tuned BLAS library is so rewarding.
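
The point about arithmetic intensity can be read off the loop itself. In the sketch below (a made-up function for illustration), each iteration performs one multiplication but three memory operations, so the memory bus, not the vector unit, sets the speed limit:

#include <stddef.h>

/* Hadamard (element-wise) product: 1 flop per 2 loads + 1 store,
   hence memory-bound; SIMD barely helps once the data exceed cache. */
void hadamard(size_t n, const double *x, const double *y, double *z) {
  for (size_t i = 0; i < n; i++)
    z[i] = x[i] * y[i];
}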

In light of this, I don't think you need to worry that much if you did not build your R software with compiler auto-vectorization turned on. It is hard to know whether R will really be faster.


Writing R extensions: write compiled code with OpenMP SIMD

If you decide to contribute to CRAN by writing your own R package, you could consider SIMD option 5, OpenMP SIMD, if some section of your compiled code can benefit from SIMD. The reason for not choosing option 4 is that when you write a distributable package, you have no idea what compiler will be used by an end user. So there is no way you can write compiler-specific code and get it published on CRAN.

As we pointed out earlier in the list of SIMD options, using OpenMP SIMD requires adapting the Makefile. In fact, R makes this very easy for you. You never need to write a Makefile for an R package; all you need is a Makevars file. When your package is compiled, the compiler flags specified in your package Makevars and in the RHOME/etc/Makeconf on the user's machine are passed to R CMD SHLIB. Although you don't know what compiler the user might be using, RHOME/etc/Makeconf does! All you need to do is specify in your package Makevars that you want OpenMP support, for example by setting PKG_CFLAGS = $(SHLIB_OPENMP_CFLAGS) and PKG_LIBS = $(SHLIB_OPENMP_CFLAGS) for C code.

The only thing you can't do in your package Makevars is give hints for machine-specific tuning. You may instead advise your package users to do the following:

  • If the RHOME/etc/Makeconf on the user's machine already has such a tuning configuration (that is, the user configured these flags when they built R), your compiled code should be transformed into tuned vector assembly code and there is nothing further to do;
  • Otherwise, you need to advise users to edit their personal Makevars file (like ~/.R/Makevars on Linux). You could produce a table (maybe in your package vignette or documentation) of which tuning flags should be set for which compilers, say -xHost for ICC and -march=native for GCC.

Does the term vectorization mean different things in different contexts?

"Vectorization" in R, is a vector processing in R's interpreter's view. Take the function cumsum as an example. On entry, R interpreter sees that a vector x is passed into this function. However, the work is then passed to C language that R interpreter can not analyze / track. While C is doing work, R is just waiting. By the time that R's interpreter comes back to work, a vector has been processed. So in R's view, it has issued a single instruction but processed a vector. This is an analogy to the concept of SIMD - "single instruction, multiple data".

It is not only functions like cumsum, which take a vector and return a vector, that count as "vectorization" in R; functions like sum, which take a vector and return a scalar, are also "vectorized".

Simply put: whenever R calls compiled code for a loop, it is "vectorization". If you wonder why this kind of "vectorization" is useful, it is because a loop written in a compiled language is faster than a loop written in an interpreted language. The C loop is translated to machine language that the CPU can understand. However, if the CPU wants to execute an R loop, it needs the R interpreter's help to read it, iteration by iteration. This is like: if you know Chinese (a very hard human language), you can respond directly to someone speaking Chinese to you; otherwise, you need a translator to first translate the Chinese to you, sentence by sentence, into English, then you respond in English, and the translator turns it back into Chinese, sentence by sentence. The effectiveness of the communication is greatly reduced.

x <- runif(1e+7)

## R loop
system.time({
  sumx <- 0
  for (x0 in x) sumx <- sumx + x0
  sumx
})
#   user  system elapsed
#  1.388   0.000   1.347

## C loop
system.time(sum(x))
#   user  system elapsed
#  0.032   0.000   0.030

Be aware that "vectorization" in R is just an analogy to SIMD, not the real thing. Real SIMD uses the CPU's vector registers for computation, and hence is true parallel computing through data parallelism. R is not a language in which you can program CPU registers; you have to write compiled code or assembly code for that purpose.

R's "vectorization" does not care how a loop written in a compiled language is really executed; after all that is beyond R's interpreter's knowledge. Regarding whether these compiled code will be executed with SIMD, read Does R leverage SIMD when doing vectorized calculations?


More on "vectorization" in R

I am not a Julia user, but Bogumił Kamiński has demonstrated an impressive feature of that language: loop fusion. Julia can do this, because, as he points out, "vectorization in Julia is implemented in Julia", not outside the language.

This reveals a downside of R's vectorization: speed often comes at the price of memory usage. I am not saying that Julia won't have this problem (as I don't use it, I don't know), but this is definitely true for R.

Here is an example: Fastest way to compute row-wise dot products between two skinny tall matrices in R. rowSums(A * B) is a "vectorization" in R, as both "*" and rowSums are coded in C as loops. However, R cannot fuse them into a single C loop, so the temporary matrix C = A * B has to be materialized in RAM.
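
For comparison, here is a sketch, with made-up function and argument names, of what a hand-fused C loop could look like for the row-wise dot products of two n-by-k column-major matrices: the products are accumulated on the fly and no n-by-k temporary is ever allocated:

#include <stddef.h>

void rowwise_dot(size_t n, size_t k, const double *A, const double *B, double *out) {
  for (size_t i = 0; i < n; i++) out[i] = 0.0;
  for (size_t j = 0; j < k; j++)        /* walk down each column, contiguous in memory */
    for (size_t i = 0; i < n; i++)
      out[i] += A[i + j * n] * B[i + j * n];
}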

Another example is R's recycling rule, or any computation relying on such a rule. For example, when you add a scalar a to a matrix A by A + a, you can think of a as first being replicated into a matrix B with the same dimension as A, i.e., B <- matrix(a, nrow(A), ncol(A)), and then an addition between two matrices being computed: A + B. Generating such a temporary matrix B is clearly undesirable, but sorry, you often can't do better unless you write your own C function for A + a and call it in R. This is described as "such a fusion is possible only if explicitly implemented" in Bogumił Kamiński's answer.
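
As a sketch of that last suggestion (names made up for illustration), the hand-written C version is a single pass over the data with no intermediate matrix at all, since an R matrix is just a vector with a dim attribute:

#include <stddef.h>

/* out <- A + a for a matrix with n = nrow * ncol elements */
void mat_plus_scalar(size_t n, const double *A, double a, double *out) {
  for (size_t i = 0; i < n; i++)
    out[i] = A[i] + a;
}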

To deal with the memory effects of many temporary results, R has a sophisticated mechanism called "garbage collection". It helps, but memory can still explode if you generate some really big temporary result somewhere in your code. A good example is the function outer. I have written many answers using this function, but it is particularly memory-unfriendly.

I might have gone a little off-topic in this edit, as I have begun to discuss the side effects of "vectorization". Use it with care:

  • Keep memory usage in mind; there might be a more memory-efficient vectorized implementation. For example, as mentioned in the linked thread on row-wise dot products between two matrices, c(crossprod(x, y)) is better than sum(x * y).
  • Be prepared to use CRAN packages that have compiled code. If you find the existing vectorized functions in R too limited for your task, explore CRAN for possible packages that can do it. You can ask a question about your coding bottleneck on Stack Overflow, and somebody may point you to the right function in the right package.
  • Be happy to write your own compiled code.

Vectorize min() for matrix

Just use temp_mat <- pmin(temp_mat, 1). See ?pmin for more on the use of parallel minima.

Example:

set.seed(0); A <- matrix(sample(1:3, 25, replace = T), 5)
#> A
#     [,1] [,2] [,3] [,4] [,5]
#[1,]    3    1    1    3    3
#[2,]    1    3    1    2    3
#[3,]    2    3    1    3    1
#[4,]    2    2    3    3    2
#[5,]    3    2    2    2    1
B <- pmin(A, 2)
#> B
#     [,1] [,2] [,3] [,4] [,5]
#[1,]    2    1    1    2    2
#[2,]    1    2    1    2    2
#[3,]    2    2    1    2    1
#[4,]    2    2    2    2    2
#[5,]    2    2    2    2    1


Update

Since you have a background in computational science, I would like to provide some more information.

pmin is fast, but it is far from high performance. Its prefix "parallel" only suggests that it works element-wise. The meaning of "vectorization" in R is not the same as "SIMD vectorization" in HPC. R is an interpreted language, so "vectorization" in R means opting for a C-level loop rather than an R-level loop. Therefore, pmin is just coded with a trivial C loop.

Real high-performance computing should benefit from SIMD vectorization. I believe you know SSE / AVX intrinsics. So if you write simple C code using _mm_min_pd from SSE2, you will get roughly a 2x speedup over pmin; if you use _mm256_min_pd from AVX, you will get roughly a 4x speedup over pmin.

Unfortunately, R itself cannot do any SIMD. I have an answer to the post Does R leverage SIMD when doing vectorized calculations? regarding this issue. For your question, even if you link your R to an HPC BLAS, pmin will not benefit from SIMD, simply because pmin does not involve any BLAS operation. So a better bet is to write the compiled code yourself.
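
For a flavour of what that compiled code could look like, here is a minimal sketch of the AVX idea above (a made-up function, not the code from that answer): capping every element of a double array at a scalar, assuming an AVX-capable x86-64 CPU, n a multiple of 4, and a compile line like gcc -mavx -O2:

#include <stddef.h>
#include <immintrin.h>   /* AVX intrinsics */

void pmin_avx(size_t n, const double *x, double cap, double *z) {
  __m256d vcap = _mm256_set1_pd(cap);                 /* broadcast the cap to 4 lanes */
  for (size_t i = 0; i < n; i += 4) {
    __m256d vx = _mm256_loadu_pd(x + i);              /* load 4 doubles */
    _mm256_storeu_pd(z + i, _mm256_min_pd(vx, vcap)); /* element-wise min, then store */
  }
}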

Multithreaded & SIMD vectorized Mandelbrot in R using Rcpp & OpenMP

Do not use OpenMP with Rcpp's *Vector or *Matrix objects, as they are thin wrappers around SEXP objects whose functions and memory allocation are single-threaded, while OpenMP is a multi-threaded approach.

This is why the code is crashing.

One way to get around this limitation is to use a non-R data structure to store the results. Any of the following will be sufficient: arma::mat, Eigen::MatrixXd or std::vector<T>... As I favor Armadillo, I will change the res matrix from Rcpp::NumericMatrix to arma::mat. Thus, the following will execute your code in parallel:

#include <RcppArmadillo.h> // Note the changed include and new attribute
// [[Rcpp::depends(RcppArmadillo)]]

// Avoid including the OpenMP header if OpenMP is not on the system
#ifdef _OPENMP
#include <omp.h>
#endif
// [[Rcpp::plugins(openmp)]]

// Note the changed return type
// [[Rcpp::export]]
arma::mat mandelRcpp(const double x_min, const double x_max,
                     const double y_min, const double y_max,
                     const int res_x, const int res_y, const int nb_iter) {
  arma::mat ret(res_y, res_x); // note the change; rows are indexed by r (y), columns by c (x)
  double x_step = (x_max - x_min) / res_x;
  double y_step = (y_max - y_min) / res_y;

  // loop indices are declared inside the loops so each thread gets its own private copies
  #pragma omp parallel for shared(ret)
  for (int r = 0; r < res_y; r++) {
    for (int c = 0; c < res_x; c++) {
      double zx = 0.0, zy = 0.0, new_zx;
      double cx = x_min + c * x_step, cy = y_min + r * y_step;
      int n = 0;
      for (; (zx * zx + zy * zy < 4.0) && (n < nb_iter); n++) {
        new_zx = zx * zx - zy * zy + cx;
        zy = 2.0 * zx * zy + cy;
        zx = new_zx;
      }

      if (n == nb_iter) {
        n = 0;
      }

      ret(r, c) = n;
    }
  }

  return ret;
}

With the test code (note y and x were not defined, thus I assumed y = ylims and x = xlims) we have:

xlims = ylims = c(-2.0, 2.0)

x_res = y_res = 400L
nb_iter = 256L

system.time(m <-
              mandelRcpp(xlims[[1]], xlims[[2]],
                         ylims[[1]], ylims[[2]],
                         x_res, y_res, nb_iter))

rainbow = c(
  rgb(0.47, 0.11, 0.53),
  rgb(0.27, 0.18, 0.73),
  rgb(0.25, 0.39, 0.81),
  rgb(0.30, 0.57, 0.75),
  rgb(0.39, 0.67, 0.60),
  rgb(0.51, 0.73, 0.44),
  rgb(0.67, 0.74, 0.32),
  rgb(0.81, 0.71, 0.26),
  rgb(0.89, 0.60, 0.22),
  rgb(0.89, 0.39, 0.18),
  rgb(0.86, 0.13, 0.13)
)

cols = c(colorRampPalette(rainbow)(100),
         rev(colorRampPalette(rainbow)(100)),
         "black") # palette
par(mar = c(0, 0, 0, 0))

image(m,
      col = cols,
      asp = diff(range(ylims)) / diff(range(xlims)),
      axes = F)

This produces the rendered Mandelbrot set image.

256-bit vectorization via OpenMP SIMD prevents compiler's optimization (say function inlining)?

I desperately needed to resolve this issue, because in my real C project, without a template trick for the automatic generation of different function versions (simply called "versioning" hereafter), I would need to write a total of 1400 lines of code for 9 different versions, instead of just 200 lines for a single template.

I was able to find a way out, and am now posting a solution using the toy example in the question.


I planned to use an inline function sum_template for the versioning. If successful, the versioning happens at compile time, when the compiler performs optimization. However, the OpenMP pragma turns out to defeat this compile-time versioning. The remaining option is to do the versioning at the pre-processing stage, using macros only.

To get rid of the inline function sum_template, I manually inline it in the macro macro_define_sum:

#include <stdlib.h>

// j can be 0 or 1.
// A '#pragma' directive cannot be produced by macro expansion, so the C99
// _Pragma operator is used to inject the OpenMP SIMD directive instead.
#define macro_define_sum(FUN, j) \
void FUN (size_t n, double *A, double *c) { \
  if (n == 0) return; \
  size_t i; \
  double *a = A, *b = A + n; \
  double c0 = 0.0, c1 = 0.0; \
  _Pragma("omp simd reduction (+: c0, c1) aligned (a, b: 32)") \
  for (i = 0; i < n; i++) { \
    c0 += a[i]; \
    if (j > 0) c1 += b[i]; \
    } \
  c[0] = c0; \
  if (j > 0) c[1] = c1; \
  }

// generate the two versions (illustrative names)
macro_define_sum(sum_j0, 0)
macro_define_sum(sum_j1, 1)

