How to Define a Vectorized Function in R

How to define a vectorized function in R

A loop at the R level is not vectorized. An R loop will be calling the same R code for each element of a vector, which will be inefficient. Vectorized functions usually refer to those that take a vector and operate on the entire vector in an efficient way. Ultimately this will involve some form of loop, but as that loop is being performed in a low-level language such as C it can be highly efficient and tailored to the particular task.

Consider this silly function that adds the elements of two vectors pairwise:

sillyplus <- function(x, y) {
    out <- numeric(length = length(x))
    for(i in seq_along(x)) {
        out[i] <- x[i] + y[i]
    }
    out
}

It gives the right result

R> sillyplus(1:10, 1:10)
 [1]  2  4  6  8 10 12 14 16 18 20

and is vectorised in the sense that it can operate on entire vectors at once, but it is not vectorised in the sense I describe above because it is exceptionally inefficient. + is vectorised at the C level in R so we really only need 1:10 + 1:10, not an explicit loop in R.
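As a rough check of the point above, the sketch below compares the explicit loop with the built-in + on larger inputs. The vector length and the use of runif() are arbitrary choices for illustration; timings will vary by machine, so none are shown.

```r
# Same loop-based function as above, repeated so the snippet is self-contained
sillyplus <- function(x, y) {
    out <- numeric(length = length(x))
    for (i in seq_along(x)) {
        out[i] <- x[i] + y[i]
    }
    out
}

set.seed(1)
x <- runif(1e5)   # arbitrary test data (assumption, not from the original)
y <- runif(1e5)

# Both approaches produce exactly the same numbers
same <- identical(sillyplus(x, y), x + y)
same

# Compare the cost of the R-level loop with the C-level loop behind `+`
system.time(sillyplus(x, y))
system.time(x + y)
```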

The usual way to write a vectorised function is to use existing R functions that are already vectorised. If you want to start from scratch and the thing you want to do with the function doesn't exist as a vectorised function in R (odd, but possible) then you will need to get your hands dirty and write the guts of the function in C and prepare a little wrapper in R to call the C function you wrote with the vector of data you want it to work on. There are ways with functions like Vectorize() to fake vectorisation for R functions that are not vectorised.
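A minimal sketch of the Vectorize() route mentioned above. The function describe_sign() is a made-up example, not something from the original question. Vectorize() is a wrapper around mapply(), so the result accepts vectors but the looping still happens at the R level; it is a convenience, not a speed-up.

```r
# A scalar-only function: `if` inspects a single value (hypothetical example)
describe_sign <- function(x) {
    if (x > 0) "positive" else if (x < 0) "negative" else "zero"
}

# Vectorize() returns a wrapper that calls describe_sign() once per element
v_describe_sign <- Vectorize(describe_sign)

signs <- v_describe_sign(c(-2, 0, 3))
signs
# [1] "negative" "zero"     "positive"
```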

C is not the only option here: FORTRAN is a possibility, as is C++, and thanks to Dirk Eddelbuettel & Romain Francois the latter is much easier to do now with the Rcpp package.

R how to vectorize a function with multiple if else conditions

Here is a vectorized way. It creates logical vectors i1, i2, i3 and i4 corresponding to the 4 conditions. Then it assigns the new values to the positions indexed by them.

library(truncnorm)  # provides rtruncnorm()

Trial_func2 <- function(df1){
  i1 <- df1[["Obs_Type"]] == 1
  i2 <- df1[["Obs_Type"]] == 2
  i3 <- df1[["Obs_Type"]] == 3
  i4 <- df1[["Obs_Type"]] == 4

  #If Type == 1; then a=-Inf, b = Upper_Bound
  df1[i1, "draw_value"] <- rtruncnorm(sum(i1), a =-Inf, 
                                      b = df1[i1, "Upper_bound"], 
                                      mean = df1[i1, "mean"], sd = 1)
  #If Type == 2; then a=-10, b = Upper_Bound
  df1[i2, "draw_value"] <- rtruncnorm(sum(i2), a = -10,
                                      b = df1[i2 , "Upper_bound"],
                                      mean = df1[i2, "mean"], sd = 1)
  #If Type == 3; then a=Lower_bound, b = Inf
  df1[i3,"draw_value"] <- rtruncnorm(sum(i3), 
                                     a = df1[i3, "Lower_bound"],
                                     b = Inf, mean = df1[i3, "mean"], 
                                     sd = 1)
  #If Type == 4; then a=Lower_bound, b = 10
  df1[i4, "draw_value"] <- rtruncnorm(sum(i4), 
                                      a = df1[i4, "Lower_bound"],
                                      b = 10,
                                      mean = df1[i4,"mean"],
                                      sd = 1)
  df1
}
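The same pattern, one logical index per condition followed by a subset assignment, can be tried on a toy data frame without the truncnorm dependency. The column names and the arithmetic below are invented for illustration only.

```r
# Toy data: `type` selects which rule applies to each row (hypothetical columns)
df <- data.frame(type = c(1, 2, 1, 3), base = c(10, 20, 30, 40))
df$value <- NA_real_

# One logical vector per condition
i1 <- df$type == 1
i2 <- df$type == 2

# Each assignment handles all matching rows in a single vectorized call
df[i1, "value"] <- df[i1, "base"] * 2
df[i2, "value"] <- df[i2, "base"] + 1

df$value
# [1] 20 21 60 NA
```

Rows that match none of the conditions (type 3 here) keep their initial NA.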

In the speed test I have named @Dave2e's answer Trial_func3.

library(microbenchmark)

mbm <- microbenchmark(
  loop = Trial_func(df1 = df1),
  vect = Trial_func2(df1 = df1),
  cwhen = Trial_func3(df1 = df1),
  times = 10)

print(mbm, order = "median")
#Unit: milliseconds
#  expr         min          lq       mean      median          uq         max neval cld
#  vect    4.349444    4.371169    4.40920    4.401384    4.450024    4.487453    10  a 
# cwhen   13.458946   13.484247   14.16045   13.528792   13.787951   19.363104    10  a 
#  loop 2125.665690 2138.792497 2211.20887 2157.185408 2201.391083 2453.658767    10   b

How do I know a function or an operation in R is vectorized?

Vectorization in R basically means that any looping is moved to a faster, compiled language such as C or FORTRAN. For that to occur the vector(s) in question should be "atomic", i.e. "flat" and homogeneous, and the vector type, which you can check with typeof(), should make sense for the operation(s) being performed. If both conditions hold, the operation can be vectorized.

You can check if a vector is atomic using is.atomic(). The other kind of vector, which does not lend itself to vectorization, is called "recursive"; you can check for it using is.recursive(). Recursive objects can contain other objects of any type, i.e. they can be heterogeneous. Lists and data frames are recursive.

Try something like the following to gain some insight into atomic vs. recursive:

# Atomic:
1
1:3
c("a", "b", "c")
c(T, F, T)

# Recursive:
list(nums = 1:3, letts = c("a", "b", "c"), logics = c(T, F, T))
data.frame(nums = 1:3, letts = c("a", "b", "c"), logics = c(T, F, T))

# Vectors can be atomic or recursive:
is.vector(1:9) # TRUE
is.atomic(1:9) # TRUE
is.recursive(1:9) # FALSE

is.vector(list(nums = 1:9, chars = "x")) # TRUE
is.atomic(list(1:9)) # FALSE
is.recursive(list(1:9)) # TRUE

# Matrices are atomic, data frames are recursive:
is.vector(matrix(1:9, 3)) # FALSE
is.atomic(matrix(1:9, 3)) # TRUE
is.recursive(matrix(1:9, 3)) # FALSE

is.vector(as.data.frame(matrix(1:9, 3))) # FALSE
is.atomic(as.data.frame(matrix(1:9, 3))) # FALSE
is.recursive(as.data.frame(matrix(1:9, 3))) # TRUE

I think you can assume that many, if not most, of the R functions that you use most frequently are vectorized. I don't think there is any way to check this other than by looking at the documentation or the function internals. Whenever you think about writing a for loop to do simple element-wise operations, think about how to do it using vectorization. With enough practice it will become second nature to you. For more details I can recommend this blog post from Noam Ross.
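Besides reading the documentation, a quick empirical check is to compare calling a function on a whole vector with applying it element by element. The functions below are hypothetical examples; this is only a sanity check, not a proof that a function is implemented efficiently.

```r
x <- c(-1, 4, 9)

# Built only from vectorized operators, so it works on the whole vector
f_vec <- function(x) x^2 + 1
agrees <- identical(f_vec(x), sapply(x, f_vec))
agrees
# [1] TRUE

# Uses `if`, which inspects a single value. A length-3 condition is an
# error in R 4.2.0 and later; earlier versions warned and used only the
# first element.
f_scalar <- function(x) if (x > 0) sqrt(x) else 0
outcome <- tryCatch(
    { f_scalar(x); "ran silently" },
    warning = function(w) "warning",
    error   = function(e) "error"
)
outcome

# A vectorized rewrite of f_scalar using pmax() instead of `if`
f_fixed <- function(x) sqrt(pmax(x, 0))
f_fixed(x)
# [1] 0 2 3
```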

How to vectorize a function in R

First and foremost - case specific optimization - remove the cases where nFast > nSlow as it doesn't make sense technically.

Secondly - you are creating objects and copying them over and over again. This is very expensive.

Thirdly - you can code this better perhaps by creating a matrix of signals in one loop and doing rest of the operations in vectorized manner.

I would code what you are doing something like this.

Please read help pages of mapply, do.call, merge and sapply if you don't understand.

require(quantmod)
getSymbols("LUNA")

#Choose the Adjusted Close of a Symbol
stock <- Ad(LUNA)

# I want to create a table with all possible combinations from the ranges below
i = c(2:50)
k = c(4:50)
j = c(2:50)

# stores possible combinations into z
z <- expand.grid(i,k,j)

IMO: this is where your first optimization should be. Remove the cases where i >= k, i.e. keep only the rows where nFast < nSlow.

z <- z[z[,1]<z[,2], ]

It reduces the number of cases from 112847 to 57575
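Those two counts can be checked without building the full grid. The sketch below only redoes the arithmetic with the same ranges for i, k and j.

```r
i <- 2:50
k <- 4:50
j <- 2:50

# Every combination of i, k and j
n_all <- length(i) * length(k) * length(j)
n_all
# [1] 112847

# Number of (i, k) pairs with i < k, times the number of j values
n_kept <- sum(outer(i, k, "<")) * length(j)
n_kept
# [1] 57575
```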

#Calculate only once. No need to calculate this in every iteration.
stockret <- ROC(stock)

getStratRet <- function(nFast, nSlow, nSig, stock, stockret) {
    x  <- MACD((stock), nFast=nFast, nSlow=nSlow, nSig=nSig, maType="EMA")
    x <- na.omit(x)
    sig <- Lag(ifelse((x$macd <= x$signal),-1, 0)) + Lag(ifelse((x$macd >= x$signal),1, 0))
    return(na.omit(stockret * sig))
}

# SIMPLIFY = FALSE guarantees a list, which is what do.call(merge, ...) expects
RETURNSLIST <- do.call(merge, mapply(FUN = getStratRet, nFast = z[,1], nSlow = z[,2], nSig = z[,3], MoreArgs = list(stock = stock, stockret = stockret), SIMPLIFY = FALSE))

getAnnualSharpe <- function(ret) {
    ret <- na.omit(ret)
    return ((mean(ret)/sd(ret)) * sqrt(252))
}

SHARPELIST <- sapply(RETURNSLIST, FUN = getAnnualSharpe)

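Results will be as below. Column n of RETURNSLIST (and element n of SHARPELIST) corresponds to row n of z, so mapping each result back to its combination of i, k and j is straightforward.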

head(RETURNSLIST[, 1:3])
##            LUNA.Adjusted LUNA.Adjusted.1 LUNA.Adjusted.2
## 2007-01-10   0.012739026    -0.012739026               0
## 2007-01-11  -0.051959739     0.051959739               0
## 2007-01-12  -0.007968170    -0.007968170               0
## 2007-01-16  -0.007905180    -0.007905180               0
## 2007-01-17  -0.005235614    -0.005235614               0
## 2007-01-18   0.028315920    -0.028315920               0

SHARPELIST
##   LUNA.Adjusted LUNA.Adjusted.1 LUNA.Adjusted.2 LUNA.Adjusted.3 LUNA.Adjusted.4 LUNA.Adjusted.5 LUNA.Adjusted.6 
##      0.04939150     -0.07428392             NaN      0.02626382     -0.06789803     -0.22584987     -0.07305477 
## LUNA.Adjusted.7 LUNA.Adjusted.8 LUNA.Adjusted.9 
##     -0.05831643     -0.08864845     -0.08221986 

system.time({
  RETURNSLIST <- do.call(merge, mapply(FUN = getStratRet, nFast = z[1:100,1], nSlow = z[1:100,2], nSig = z[1:100,3], MoreArgs = list(stock = stock, stockret = stockret), SIMPLIFY = FALSE))
  SHARPELIST <- sapply(RETURNSLIST, FUN = getAnnualSharpe)
})
   user  system elapsed 
   2.28    0.00    2.29
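For readers unfamiliar with the mapply / do.call combination used above, here is a small base R sketch of the same shape. It uses cbind instead of merge and a made-up function scale_shift(), so it runs without quantmod: mapply walks over the parameter rows, MoreArgs passes the shared data unchanged, and do.call binds the resulting list into one object.

```r
# Hypothetical stand-in for getStratRet(): one result vector per parameter set
scale_shift <- function(a, b, x) a * x + b

params <- data.frame(a = c(1, 2, 3), b = c(0, 10, 100))
x <- 1:4   # shared input, passed once through MoreArgs

# SIMPLIFY = FALSE keeps the results as a list, one element per row of params
cols <- mapply(FUN = scale_shift, a = params$a, b = params$b,
               MoreArgs = list(x = x), SIMPLIFY = FALSE)

# Bind the list elements into the columns of one matrix
result <- do.call(cbind, cols)
dim(result)
# [1] 4 3

# Summarise each column, analogous to computing one Sharpe ratio per strategy
col_means <- colMeans(result)
col_means
# [1]   2.5  15.0 107.5
```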

Vectorized functions in R's data.table

You could use Map/mapply :

library(data.table)
dt[, weeks_for_filter_table := mapply(get_weeks, START, END)]
dt

#   ID      START        END     weeks_for_filter_table
#1:  1 2020-01-01 2020-01-15 2020 W01,2020 W02,2020 W03
#2:  2 2020-03-01 2020-03-12 2020 W09,2020 W10,2020 W11
#3:  3 2020-03-14 2020-03-26 2020 W11,2020 W12,2020 W13
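The definition of get_weeks() is not shown here. The sketch below uses a guessed version that lists the ISO weeks touched by a date range, to illustrate why mapply is needed: seq() accepts only a single start and end, so the function has to be called once per row. It uses a plain data frame instead of data.table to stay dependency-free.

```r
# Hypothetical helper: an assumption about what get_weeks() might look like
get_weeks <- function(start, end) {
  days <- seq(as.Date(start), as.Date(end), by = "day")
  # %V is the ISO 8601 week number; unique() keeps one label per week
  paste(unique(format(days, "%Y W%V")), collapse = ",")
}

df <- data.frame(
  ID    = 1:3,
  START = as.Date(c("2020-01-01", "2020-03-01", "2020-03-14")),
  END   = as.Date(c("2020-01-15", "2020-03-12", "2020-03-26"))
)

# mapply pairs up START[i] with END[i] and calls get_weeks() once per row
df$weeks <- mapply(get_weeks, df$START, df$END)
df$weeks
# [1] "2020 W01,2020 W02,2020 W03" "2020 W09,2020 W10,2020 W11"
# [3] "2020 W11,2020 W12,2020 W13"
```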

How do I vectorize this is_prime function in R?

Half Vectorized

It is possible to vectorize some of the function by dealing with even numbers (and a few other numbers) in a vectorized fashion. The rest is taken care of using vapply.

helper <- function(x) {
  for (k in seq(3, round(sqrt(x)) + 1, 2)) {
    if (x %% k == 0)
      return(FALSE)
  }
  return(TRUE)
}
is.prime <- function(v) {
  out <- rep(TRUE, length(v))
  out[v %% 2 == 0 | v %in% c(1)] <- FALSE
  out[v %in% c(2, 3, 5)] <- TRUE
  # Candidates still marked TRUE: odd numbers above 5 (use vectorized &, not &&)
  indices <- which(v > 5 & out)
  out[indices] <- vapply(v[indices], helper, logical(1))
  return(out)
}
is.prime(c(17,5,10,22,109,55))
# [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE

Full Vectorized

If performance is at stake, you might consider using `Rcpp`:

c++ file

#include <Rcpp.h>
#include <math.h>
using namespace Rcpp;

bool is_prime(int n) {
  if ((n == 2) || (n == 3) || (n == 5)) {
    return true;
  }
  if ((n % 2 == 0) || (n == 1)) {
    return false;
  }
  int i = 3;
  while (i < round(sqrt(n)) + 1) {
    if (n % i == 0) {
      return false;
    }
    i += 2;
  }
  return true;
}

// [[Rcpp::export]]
LogicalVector is_prime(IntegerVector v) {
  int n = v.length();
  LogicalVector out = LogicalVector(n);
  for (int i = 0; i < n; i++) {
    out[i] = is_prime(v[i]);
  }
  return out;
}

R File

library(Rcpp)
sourceCpp('prime_fun.cpp')  # if cpp file in same dir
is_prime(c(17,5,10,22,109,55))
# [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE

Vectorized function for dplyr::mutate()

any() always returns a single logical value, so a function built on it is not vectorized. Instead, collapse your favorite_cars patterns into a single regex string of length 1.

is_favorite <- function(x) {
  stringr::str_detect(x, paste0(favorite_cars, collapse = "|"))
  #Will also work with base R grepl
  #grepl(paste0(favorite_cars, collapse = "|"), x)
}

and then use :

library(dplyr)
mtcars %>%  mutate(fav_car = is_favorite(car))

#                   car  mpg cyl  disp  hp drat   wt qsec vs am gear carb fav_car
#1            Mazda RX4 21.0   6 160.0 110 3.90 2.62 16.5  0  1    4    4   FALSE
#2        Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.88 17.0  0  1    4    4   FALSE
#3           Datsun 710 22.8   4 108.0  93 3.85 2.32 18.6  1  1    4    1   FALSE
#4       Hornet 4 Drive 21.4   6 258.0 110 3.08 3.21 19.4  1  0    3    1   FALSE
#5    Hornet Sportabout 18.7   8 360.0 175 3.15 3.44 17.0  0  0    3    2   FALSE
#6              Valiant 18.1   6 225.0 105 2.76 3.46 20.2  1  0    3    1   FALSE
#7           Duster 360 14.3   8 360.0 245 3.21 3.57 15.8  0  0    3    4   FALSE
#8            Merc 240D 24.4   4 146.7  62 3.69 3.19 20.0  1  0    4    2    TRUE
#9             Merc 230 22.8   4 140.8  95 3.92 3.15 22.9  1  0    4    2    TRUE
#10            Merc 280 19.2   6 167.6 123 3.92 3.44 18.3  1  0    4    4    TRUE
#11           Merc 280C 17.8   6 167.6 123 3.92 3.44 18.9  1  0    4    4    TRUE
#...
#...

where the pattern that we are looking for becomes

paste0(favorite_cars, collapse = "|")
#[1] "^Merc|Firebird$"
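A base R sketch of the difference, using the row names of the built-in mtcars data as the car names. The value of favorite_cars is inferred from the collapsed pattern printed above and is an assumption.

```r
# Assumed patterns, inferred from the collapsed pattern shown above
favorite_cars <- c("^Merc", "Firebird$")
cars <- rownames(mtcars)

# any() collapses everything to one value, which is useless as a new column
single <- any(grepl(favorite_cars[1], cars))
length(single)
# [1] 1

# One combined pattern gives one logical per car, as mutate() needs
pattern <- paste0(favorite_cars, collapse = "|")
fav <- grepl(pattern, cars)
length(fav)
# [1] 32
cars[fav]
```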
