Does Ifelse Really Calculate Both of Its Vectors Every Time? Is It Slow

Yes, with one exception.

ifelse calculates both its yes value and its no value, except when the test condition is either all TRUE or all FALSE.

We can see this by generating random numbers and counting how many are actually generated. We do that by resetting the seed before each call and then checking where the next random number falls in the sequence.

# TEST CONDITION, ALL TRUE
set.seed(1)
dump  <- ifelse(rep(TRUE, 200), rnorm(200), rnorm(200))
next.random.number.after.all.true <- rnorm(1)

# TEST CONDITION, ALL FALSE
set.seed(1)
dump  <- ifelse(rep(FALSE, 200), rnorm(200), rnorm(200))
next.random.number.after.all.false <- rnorm(1)

# TEST CONDITION, MIXED
set.seed(1)
dump   <- ifelse(c(FALSE, rep(TRUE, 199)), rnorm(200), rnorm(200))
next.random.number.after.some.TRUE.some.FALSE <- rnorm(1)

# RESET THE SEED, GENERATE SEVERAL RANDOM NUMBERS TO SEARCH FOR A MATCH
set.seed(1)
r.1000 <- rnorm(1000)


cat("Quantity of random numbers generated during the `ifelse` statement when:", 
    "\n\tAll True  ", which(r.1000 == next.random.number.after.all.true) - 1,
    "\n\tAll False ", which(r.1000 == next.random.number.after.all.false) - 1,
    "\n\tMixed T/F ", which(r.1000 == next.random.number.after.some.TRUE.some.FALSE) - 1 
  )

Gives the following output:

Quantity of random numbers generated during the `ifelse` statement when: 
  All True   200 
  All False  200 
  Mixed T/F  400   <~~ Notice TWICE AS MANY numbers were
                       generated when `test` had both
                       T & F values present

We can also see it in the source code itself:

.
.
if (any(test[!nas]))    
    ans[test & !nas] <- rep(yes, length.out = length(ans))[test &   # <~~~~ This line and the one below
        !nas]
if (any(!test[!nas])) 
    ans[!test & !nas] <- rep(no, length.out = length(ans))[!test &  # <~~~~ ... are the culprits
        !nas]
.
.

Notice that yes and no are computed only if there is some non-NA value of test that is TRUE or FALSE (respectively).

At that point -- and this is the important part when it comes to efficiency -- the entirety of each vector is computed.
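
The same behaviour can be observed without random numbers. The following is a minimal sketch; the helpers make_yes() and make_no() and the calls log are hypothetical names introduced only for illustration. Each helper records that it was called, so the log shows which arguments ifelse actually evaluated.

```r
# Hypothetical helpers: each one records its name when it is evaluated.
calls <- character(0)
make_yes <- function(n) { calls <<- c(calls, "yes"); rep(1, n) }
make_no  <- function(n) { calls <<- c(calls, "no");  rep(0, n) }

# All TRUE: only `yes` is evaluated.
calls <- character(0)
invisible(ifelse(rep(TRUE, 5), make_yes(5), make_no(5)))
calls_all_true <- calls

# Mixed: both are evaluated, each over the full length of `test`.
calls <- character(0)
invisible(ifelse(c(FALSE, rep(TRUE, 4)), make_yes(5), make_no(5)))
calls_mixed <- calls

print(calls_all_true)
print(calls_mixed)
```

Even though only one element of test is FALSE in the mixed case, make_no() still has to build all five values before ifelse picks one of them.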

Ok, but is it slower?

Let's see if we can test it:

library(microbenchmark)

# Create some sample data
  N <- 1e4
  set.seed(1)
  X <- sample(c(seq(100), rep(NA, 100)), N, TRUE)
  Y <- ifelse(is.na(X), rnorm(X), NA)  # Y has reverse NA/not-NA setup than X

These two statements generate the same results

yesifelse <- quote(sort(ifelse(is.na(X), Y+17, X-17 ) ))
noiflese  <- quote(sort(c(Y[is.na(X)]+17, X[is.na(Y)]-17)))

identical(eval(yesifelse), eval(noiflese))
# [1] TRUE

but one is more than twice as fast as the other:

microbenchmark(eval(yesifelse), eval(noiflese), times=50L)

N = 1,000
Unit: milliseconds
            expr      min       lq   median       uq      max neval
 eval(yesifelse) 2.286621 2.348590 2.411776 2.537604 10.05973    50
  eval(noiflese) 1.088669 1.093864 1.122075 1.149558 61.23110    50

N = 10,000
Unit: milliseconds
            expr      min       lq   median       uq      max neval
 eval(yesifelse) 30.32039 36.19569 38.50461 40.84996 98.77294    50
  eval(noiflese) 12.70274 13.58295 14.38579 20.03587 21.68665    50

Is ifelse() in R efficient for determining which function to call on a large vector?

The way you've coded it does worse than ifelse, but as suggested in the warning section of ?ifelse it's possible to do better. With your simple functions, x^2 and x / 2, the test3() function below is faster: about 2 to 3 times faster than ifelse and about 30 times faster than test2(). With more computationally intensive functions (but still vectorized!) the margin might be bigger.

The speed gain is (I think) mostly due to two sources:

ifelse does input checking and error handling that test3() skips. ifelse is more general and more flexible (test3() is hardcoded to only return a numeric vector).
As demonstrated at Does ifelse really calculate both of its vectors every time? Is it slow?, ifelse will calculate its entire TRUE response vector as long as there is at least 1 TRUE value of the test, and similarly for its FALSE. test3() bypasses the extra calculations by creating TRUE and FALSE sub-vectors.

I've modified your test1() and test2() to simplify a bit, pulling out the data simulation (since that's not what we want to test). I added test3 that uses logical subsets. I also drastically reduced the size of the test vector so it runs reasonably quickly.

set.seed(47)
x <- sample(1:1e6, 1e4, replace = TRUE)

test1 <- function(x) {
  ifelse(x %% 2 == 0, x**2, x/2)
}

test2 <- function(x) {
  y <- numeric(length(x))
  for (i in seq_along(x)) {
    if (x[i] %% 2 == 0) {
      y[i] <- x[i]**2
    } else {
      y[i] <- x[i]/2
    }
  }
  return(y)
}

test3 <- function(x) {
    y = numeric(length(x))
    cond = x %% 2 == 0
    y[cond] = x[cond] ^ 2
    y[!cond] = x[!cond] / 2
    return(y)
}

identical(test1(x), test2(x))
# TRUE
identical(test1(x), test3(x))
# TRUE
microbenchmark::microbenchmark(test1(x), test2(x), test3(x), times = 1000)
# Unit: microseconds
#      expr       min         lq       mean     median        uq        max neval cld
#  test1(x)  1563.270  1642.3540  1701.3877  1669.2180  1697.894   3159.743  1000  b 
#  test2(x) 17909.833 18788.9635 23682.1516 19882.8600 20679.436 116206.536  1000   c
#  test3(x)   627.241   668.7445   691.8433   680.6675   696.061   1340.507  1000 a

Is `if` faster than ifelse?

This is more of an extended comment building on Roman's answer, but I need the code utilities to expound:

Roman is correct that if is faster than ifelse, but I am under the impression that the speed boost of if isn't particularly interesting since it isn't something that can easily be harnessed through vectorization. That is to say, if is only advantageous over ifelse when the cond/test argument is of length 1.
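
To make the length-1 case concrete, here is a minimal sketch (the variable names are mine, chosen for illustration). It shows that `if` evaluates only the branch it selects, which is the property that cannot be carried over directly to a whole vector.

```r
# `if` evaluates only the selected branch, so the stop() is never reached.
res_if <- if (TRUE) "chosen" else stop("never evaluated")

# With an all-TRUE test, ifelse also leaves its `no` argument unevaluated.
res_ifelse <- ifelse(TRUE, "chosen", stop("never evaluated"))

print(res_if)
print(res_ifelse)
```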

Consider the following functions, which are admittedly weak attempts at vectorizing if without the side effect of evaluating both the yes and no conditions as ifelse does.

ifelse2 <- function(test, yes, no){
 result <- rep(NA, length(test))
 for (i in seq_along(test)){
   result[i] <- `if`(test[i], yes[i], no[i])
 }
 result
}

ifelse2a <- function(test, yes, no){
  sapply(seq_along(test),
         function(i) `if`(test[i], yes[i], no[i]))
}

ifelse3 <- function(test, yes, no){
  result <- rep(NA, length(test))
  logic <- test
  result[logic] <- yes[logic]
  result[!logic] <- no[!logic]
  result
}


set.seed(pi)
x <- rnorm(1000)

library(microbenchmark)
microbenchmark(
  standard = ifelse(x < 0, x^2, x),
  modified = ifelse2(x < 0, x^2, x),
  modified_apply = ifelse2a(x < 0, x^2, x),
  third = ifelse3(x < 0, x^2, x),
  fourth = c(x, x^2)[1L + ( x < 0 )],
  fourth_modified = c(x, x^2)[seq_along(x) + length(x) * (x < 0)]
)

Unit: microseconds
            expr     min      lq      mean  median       uq      max neval cld
        standard  52.198  56.011  97.54633  58.357  68.7675 1707.291   100 ab 
        modified  91.787  93.254 131.34023  94.133  98.3850 3601.967   100  b 
  modified_apply 645.146 653.797 718.20309 661.568 676.0840 3703.138   100   c
           third  20.528  22.873  76.29753  25.513  27.4190 3294.350   100 ab 
          fourth  15.249  16.129  19.10237  16.715  20.9675   43.695   100 a  
 fourth_modified  19.061  19.941  22.66834  20.528  22.4335   40.468   100 a

SOME EDITS: Thanks to Frank and Richard Scriven for noticing my shortcomings.

As you can see, breaking up the vector so that it is suitable to pass to if is time-consuming and ends up being slower than just running ifelse (which is probably why no one has bothered to implement my solution).

If you're really desperate for an increase in speed, you can use the ifelse3 approach above. Or better yet, Frank's less obvious* but brilliant solution.

* By 'less obvious' I mean it took me two seconds to realize what he did. And per nicola's comment below, please note that this works only when yes and no have length 1; otherwise you'll want to stick with ifelse3.
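
A small sketch of why the length-1 restriction matters, using a toy vector of my own choosing. With scalar yes and no, indexing a two-element vector is enough; with vector yes and no, the index has to be offset into the concatenated vector, as in fourth_modified in the benchmark.

```r
x <- c(-1, 2, -3)

# Scalar yes/no: indexing a two-element vector is enough.
scalar_case <- c("non-negative", "negative")[1L + (x < 0)]

# Vector yes/no: the plain 1L + test index only ever picks element 1 or 2,
# so an offset into the concatenated vector is needed instead.
yes <- x^2
no  <- x
naive  <- c(no, yes)[1L + (x < 0)]
offset <- c(no, yes)[seq_along(x) + length(x) * (x < 0)]

print(scalar_case)
print(naive)    # does not match ifelse(x < 0, x^2, x)
print(offset)   # matches ifelse(x < 0, x^2, x)
```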

Speeding up ifelse() without writing C/C++?

I have encountered this before. We don't have to use ifelse() all the time. If you look at how ifelse is written, by typing ifelse in your R console, you can see that the function is written in R itself and does various checks, which is really inefficient.

Instead of using ifelse(), we can do this:

getScore <- function(history, similarities) {
  ######## old code #######
  # nh <- ifelse(similarities < 0, 6 - history, history)
  ######## old code #######
  ######## new code #######
  nh <- history
  ind <- similarities < 0
  nh[ind] <- 6 - nh[ind]
  ######## new code #######
  x <- nh * abs(similarities) 
  contados <- !is.na(history)
  sum(x, na.rm=TRUE) / sum(abs(similarities[contados]), na.rm = TRUE)
  }

And then let's check profiling result again:

Rprof("foo.out")
for (i in (1:10)) getScore(history, similarities)
Rprof(NULL)
summaryRprof("foo.out")

# $by.total
#            total.time total.pct self.time self.pct
# "getScore"       2.10    100.00      0.88    41.90
# "abs"            0.32     15.24      0.32    15.24
# "*"              0.26     12.38      0.26    12.38
# "sum"            0.26     12.38      0.26    12.38
# "<"              0.14      6.67      0.14     6.67
# "-"              0.14      6.67      0.14     6.67
# "!"              0.06      2.86      0.06     2.86
# "is.na"          0.04      1.90      0.04     1.90

# $sample.interval
# [1] 0.02

# $sampling.time
# [1] 2.1

We have a 2+ times boost in performance. Furthermore, the profile is more like a flat profile, without any single part dominating execution time.

In R, vector indexing, reading, and writing run at the speed of C code, so use vectors whenever you can.
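
As a quick check that the copy-and-overwrite idiom gives the same answer as ifelse, here is a minimal sketch on toy inputs. The values are made up for illustration, and I am assuming similarities contains no missing values.

```r
# Toy inputs (hypothetical; no missing values in `similarities`).
history      <- c(1, 5, 3, NA, 2)
similarities <- c(-0.5, 0.2, -0.1, 0.3, 0.9)

# ifelse version
nh_ifelse <- ifelse(similarities < 0, 6 - history, history)

# Copy, then overwrite only the elements that need to change
nh  <- history
ind <- similarities < 0
nh[ind] <- 6 - nh[ind]

print(identical(nh, nh_ifelse))
```

Only the elements selected by ind are recomputed; the rest of history is reused as it is.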

Testing @Matthew's answer

mat_getScore <- function(history, similarities) {
  ######## old code #######
  # nh <- ifelse(similarities < 0, 6 - history, history)
  ######## old code #######
  ######## new code #######
  ind <- similarities < 0
  nh <- ind*(6-history) + (!ind)*history
  ######## new code #######
  x <- nh * abs(similarities) 
  contados <- !is.na(history)
  sum(x, na.rm=TRUE) / sum(abs(similarities[contados]), na.rm = TRUE)
  }

Rprof("foo.out")
for (i in (1:10)) mat_getScore(history, similarities)
Rprof(NULL)
summaryRprof("foo.out")

# $by.total
#                total.time total.pct self.time self.pct
# "mat_getScore"       2.60    100.00      0.24     9.23
# "*"                  0.76     29.23      0.76    29.23
# "!"                  0.40     15.38      0.40    15.38
# "-"                  0.34     13.08      0.34    13.08
# "+"                  0.26     10.00      0.26    10.00
# "abs"                0.20      7.69      0.20     7.69
# "sum"                0.18      6.92      0.18     6.92
# "<"                  0.16      6.15      0.16     6.15
# "is.na"              0.06      2.31      0.06     2.31

# $sample.interval
# [1] 0.02

# $sampling.time
# [1] 2.6

Ah? Slower?

The full profiling result shows that this approach spends more time on floating-point multiplication "*", and the logical not "!" seems pretty expensive, while my approach requires floating-point addition / subtraction only.

Well, the result might also be architecture dependent. I am testing on Intel Nehalem (Intel Core 2 Duo) at the moment, so benchmarks of the two approaches on various platforms are welcome.

Remark

All profiling uses the OP's data from the question.

How ifelse (in data.table) works

This is only indirectly related to data.table; the core issue is that ifelse is designed for use like:

ifelse(test, yes, no)

where test, yes, and no all have the same length -- the output will be the same length as test, and all the elements corresponding to where test is TRUE will be the corresponding element from yes, and similarly for where test is FALSE.
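
A minimal illustration of that element-by-element selection, with toy vectors chosen only for this example:

```r
test <- c(TRUE, FALSE, TRUE)
yes  <- c("a", "b", "c")
no   <- c("x", "y", "z")

# Position i of the result comes from yes[i] if test[i] is TRUE, else no[i].
out <- ifelse(test, yes, no)
print(out)   # "a" "y" "c"
```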

When test is a scalar and yes or no are vectors, as in your case, you have to look at what ifelse is doing to understand what's going on:

Relevant source:

if (any(test[ok])) #is any element of `test` `TRUE`?
        ans[test & ok] <- rep(yes, length.out = length(ans))[test & 
            ok]

What is rep(c(1, 2), length.out = 1)? It's just 1 -- the second element is truncated.

That's what's happened here -- the value of ifelse is only the first element of paste0(1:.N, "_", col2). When passed to `:=`, this single element is recycled.

When your logical condition is a scalar, you should use if, not ifelse. I'll also add that I do my damnedest to avoid using ifelse in general because it's slow.
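
The difference is easy to see on a toy example (the values here are mine, not from the question): with a scalar test, ifelse returns a result of length 1, while if returns the chosen value whole.

```r
yes <- c(1, 2)

truncated <- ifelse(TRUE, yes, 3)   # output has the length of `test`, i.e. 1
whole     <- if (TRUE) yes else 3   # `if` returns the chosen value as-is

print(truncated)
print(whole)
```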

ifelse over each element of a vector

ifelse isn't a function of one vector, it is a function of 3 vectors of the same length. The first vector, called test, is a boolean, the second vector yes and third vector no give the elements in the result, chosen item-by-item based on the test value.

A sample of size = 1 is a different size than test (unless the length of test is 1), so it will be recycled by ifelse (see note below). Instead, draw samples of the same size as test from the start:

ifelse(
   test = (y == 1),
   yes = sample(x = c(1, 3, 5, 7, 9), size = length(y), replace = TRUE),
   no = sample(x = c(0, 2, 4, 6, 8), size = length(y), replace = TRUE)
)

The vectors don't actually have to be of the same length. The help page ?ifelse explains: "If yes or no are too short, their elements are recycled." This is the behavior you observed with "It generates once for each case, and returns that very random number always.".
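
To see the recycling side by side with the fix, here is a small sketch. The vector y is a made-up example; the exact numbers drawn depend on the seed and R version, so no output is shown.

```r
set.seed(1)
y <- c(1, 0, 1, 1, 0, 1)

# size = 1: one draw per branch, recycled to every matching position
recycled <- ifelse(y == 1,
                   sample(c(1, 3, 5, 7, 9), size = 1),
                   sample(c(0, 2, 4, 6, 8), size = 1))

# size = length(y): a separate draw is available for each position
per_element <- ifelse(y == 1,
                      sample(c(1, 3, 5, 7, 9), size = length(y), replace = TRUE),
                      sample(c(0, 2, 4, 6, 8), size = length(y), replace = TRUE))

print(recycled)
print(per_element)
```

In recycled, every position where y == 1 holds the same odd number and every other position holds the same even number; per_element can differ from position to position.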

Group-filling maximum is slow with missing values

Using if(){} we can bypass the max calculation if the entire vector is NA. This is a massive speed-up:

fmax = function(x, na.rm = TRUE) {
  if(all(is.na(x))) return(x[1])
  return(max(x, na.rm = na.rm))
}

system.time(df %>%
  group_by(group) %>%
  mutate(maxval = fmax(val)))
# user  system elapsed 
# 0.20    0.01    0.22
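
Outside of dplyr, the short-circuit in fmax can be checked on two toy vectors. This is only a sketch that repeats the definition so it runs on its own; the input values are made up.

```r
fmax <- function(x, na.rm = TRUE) {
  if (all(is.na(x))) return(x[1])   # nothing to compare: return NA directly
  max(x, na.rm = na.rm)
}

all_missing  <- fmax(c(NA_real_, NA_real_))
some_missing <- fmax(c(1, NA, 3))

print(all_missing)
print(some_missing)
```

For the all-NA input the function returns NA without calling max at all, which is where the time is saved for groups that contain only missing values.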

Does dplyr::if_else evaluate both TRUE and FALSE at the same time?

The issue arises because there are groups for which `which(value)` returns nothing, and taking min() of an empty input returns Inf with a warning:

min(NULL)
#[1] Inf

Warning message: In min(NULL) : no non-missing arguments to min;
returning Inf

One option is to subset the which() output with [1], which returns NA when nothing matches:

mydf %>%
   group_by(group) %>%
   mutate(max_value = if_else(all(!value), max(index), index[which(value)[1]]))
# A tibble: 15 x 4
# Groups:   group [3]
#   value group index max_value
#   <lgl> <fct> <int>     <int>
# 1 FALSE a         1         2
# 2 TRUE  a         2         2
# 3 FALSE a         3         2
# 4 FALSE a         4         2
# 5 TRUE  a         5         2
# 6 FALSE b         1         4
# 7 FALSE b         2         4
# 8 FALSE b         3         4
# 9 TRUE  b         4         4
#10 TRUE  b         5         4
#11 FALSE c         1         5
#12 FALSE c         2         5
#13 FALSE c         3         5
#14 FALSE c         4         5
#15 FALSE c         5         5
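
The [1] indexing used above can be examined in isolation. This is a minimal base R sketch for a group with no TRUE values, such as group c; the object names are mine.

```r
index <- 1:5
value <- c(FALSE, FALSE, FALSE, FALSE, FALSE)

empty_match <- which(value)        # integer(0): nothing is TRUE
first_match <- which(value)[1]     # NA: first element of an empty vector
picked      <- index[first_match]  # NA, a length-1 result

print(length(empty_match))
print(first_match)
print(picked)
```

Because picked has length 1 rather than length 0, the expression stays usable for groups that have no TRUE values.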

Also, in this case, as we are returning a single element, if/else would be more appropriate

mydf %>%
    group_by(group) %>%
    mutate(max_value = if(all(!value)) max(index) else index[which(value)[1]])
# A tibble: 15 x 4
# Groups:   group [3]
#   value group index max_value
#   <lgl> <fct> <int>     <int>
# 1 FALSE a         1         2
# 2 TRUE  a         2         2
# 3 FALSE a         3         2
# 4 FALSE a         4         2
# 5 TRUE  a         5         2
# 6 FALSE b         1         4
# 7 FALSE b         2         4
# 8 FALSE b         3         4
# 9 TRUE  b         4         4
#10 TRUE  b         5         4
#11 FALSE c         1         5
#12 FALSE c         2         5
#13 FALSE c         3         5
#14 FALSE c         4         5
#15 FALSE c         5         5

Optimizing ifelse on a large data frame

There has been some discussion about how ifelse is not the best option for code where speed is an important factor. You might instead try:

df$Mean.Result1 <- c("", "Equal")[(df$A > 0.05 & df$B > 0.05)+1]

To see what's going on here, let's break down the command. df$A > 0.05 & df$B > 0.05 returns TRUE if both A and B exceed 0.05, and FALSE otherwise. Therefore, (df$A > 0.05 & df$B > 0.05)+1 returns 2 if both A and B exceed 0.05 and 1 otherwise. These are used as indices into the vector c("", "Equal"), so we get "Equal" when both exceed 0.05 and "" otherwise.
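
The intermediate steps can be printed on a three-row toy data frame (toy_df and its values are hypothetical, chosen so that only the last row passes both tests):

```r
toy_df <- data.frame(A = c(0.01, 0.50, 0.90), B = c(0.30, 0.02, 0.70))

cond   <- toy_df$A > 0.05 & toy_df$B > 0.05   # FALSE FALSE TRUE
idx    <- cond + 1                            # 1 1 2
labels <- c("", "Equal")[idx]                 # "" "" "Equal"

print(idx)
print(labels)
```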

Here's a comparison on a data frame with 1 million rows:

# Build dataset and functions
set.seed(144)
big.df <- data.frame(A = runif(1000000), B = runif(1000000))
OP <- function(df) {
  df$Mean.Result1 <- ifelse(df$A > 0.05 & df$B > 0.05, "Equal", "")
  df
}
josilber <- function(df) {
  df$Mean.Result1 <- c("", "Equal")[(df$A > 0.05 & df$B > 0.05)+1]
  df
}
all.equal(OP(big.df), josilber(big.df))
# [1] TRUE

# Benchmark
library(microbenchmark)
microbenchmark(OP(big.df), josilber(big.df))
# Unit: milliseconds
#              expr      min        lq      mean    median        uq      max neval
#        OP(big.df) 299.6265 311.56167 352.26841 318.51825 348.09461 540.0971   100
#  josilber(big.df)  40.4256  48.66967  60.72864  53.18471  59.72079 267.3886   100

The approach with vector indexing is about 6x faster in median runtime.
