Does Ifelse Really Calculate Both of Its Vectors Every Time? Is It Slow

Does ifelse really calculate both of its vectors every time? Is it slow?

Yes. (With exception)

ifelse evaluates both its yes value and its no value, except when the test condition is either all TRUE or all FALSE.

We can see this by generating random numbers and observing how many are actually generated (by resetting the seed and searching for a match).

# TEST CONDITION, ALL TRUE
set.seed(1)
dump <- ifelse(rep(TRUE, 200), rnorm(200), rnorm(200))
next.random.number.after.all.true <- rnorm(1)

# TEST CONDITION, ALL FALSE
set.seed(1)
dump <- ifelse(rep(FALSE, 200), rnorm(200), rnorm(200))
next.random.number.after.all.false <- rnorm(1)

# TEST CONDITION, MIXED
set.seed(1)
dump <- ifelse(c(FALSE, rep(TRUE, 199)), rnorm(200), rnorm(200))
next.random.number.after.some.TRUE.some.FALSE <- rnorm(1)

# RESET THE SEED, GENERATE SEVERAL RANDOM NUMBERS TO SEARCH FOR A MATCH
set.seed(1)
r.1000 <- rnorm(1000)


cat("Quantity of random numbers generated during the `ifelse` statement when:",
"\n\tAll True ", which(r.1000 == next.random.number.after.all.true) - 1,
"\n\tAll False ", which(r.1000 == next.random.number.after.all.false) - 1,
"\n\tMixed T/F ", which(r.1000 == next.random.number.after.some.TRUE.some.FALSE) - 1
)

Gives the following output:

Quantity of random numbers generated during the `ifelse` statement when: 
All True 200
All False 200
Mixed T/F 400 <~~ Notice TWICE AS MANY numbers were
generated when `test` had both
T & F values present

We can also see it in the source code itself:

.
.
if (any(test[!nas]))
    ans[test & !nas] <- rep(yes, length.out = length(ans))[test &   # <~~~~ This line and the one below
        !nas]
if (any(!test[!nas]))
    ans[!test & !nas] <- rep(no, length.out = length(ans))[!test &  # <~~~~ ... are the culprits
        !nas]
.
.

Notice that yes and no are computed only if there is some non-NA value of test that is TRUE or FALSE (respectively).

At that point -- and this is the important part when it comes to efficiency -- the entirety of each vector is computed.
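The same behaviour can be sketched without random numbers. The helper track() below is hypothetical (it is not part of R); it records a label whenever the argument it wraps is actually evaluated.

```r
# track() is a hypothetical helper: it logs a label, then returns its value.
evaluated <- character(0)
track <- function(label, value) {
  evaluated <<- c(evaluated, label)
  value
}

x <- c(1, -2, 3)

# Mixed test: both the yes and the no argument are forced.
evaluated <- character(0)
mixed_result <- ifelse(x > 0, track("yes", x * 10), track("no", x * -1))
mixed_log <- evaluated

# All-TRUE test: the no argument is never forced.
evaluated <- character(0)
all_true_result <- ifelse(x != 0, track("yes", x * 10), track("no", x * -1))
all_true_log <- evaluated
```

Here mixed_log should hold both labels while all_true_log holds only "yes". In both cases the whole of x * 10 is computed, even for positions where the test is FALSE.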


Ok, but is it slower?

Let's see if we can test it:

library(microbenchmark)

# Create some sample data
N <- 1e4
set.seed(1)
X <- sample(c(seq(100), rep(NA, 100)), N, TRUE)
Y <- ifelse(is.na(X), rnorm(X), NA) # Y has reverse NA/not-NA setup than X

These two statements generate the same results

yesifelse <- quote(sort(ifelse(is.na(X), Y+17, X-17 ) ))
noiflese <- quote(sort(c(Y[is.na(X)]+17, X[is.na(Y)]-17)))

identical(eval(yesifelse), eval(noiflese))
# [1] TRUE

but one is twice as fast as the other

microbenchmark(eval(yesifelse), eval(noiflese), times=50L)

N = 1,000
Unit: milliseconds
expr min lq median uq max neval
eval(yesifelse) 2.286621 2.348590 2.411776 2.537604 10.05973 50
eval(noiflese) 1.088669 1.093864 1.122075 1.149558 61.23110 50

N = 10,000
Unit: milliseconds
expr min lq median uq max neval
eval(yesifelse) 30.32039 36.19569 38.50461 40.84996 98.77294 50
eval(noiflese) 12.70274 13.58295 14.38579 20.03587 21.68665 50

Is ifelse() in R efficient for determining which function to call on a large vector?

The way you've coded it does worse than ifelse, but as suggested in the warning section of ?ifelse, it's possible to do better. With your simple functions, x^2 and x / 2, the test3() function below is about 2 to 3 times faster than ifelse and about 30 times faster than test2(). With more computationally intensive functions (but still vectorized!) the margin might be bigger.

The speed gain is (I think) mostly due to two sources:

  1. ifelse does input checking and error handling that test3() skips. ifelse is more general and more flexible, whereas test3() is hardcoded to only return a numeric vector.
  2. As demonstrated at Does ifelse really calculate both of its vectors every time? Is it slow?, ifelse will calculate its entire TRUE response vector as long as there is at least 1 TRUE value of the test, and similarly for its FALSE. test3() bypasses the extra calculations by creating TRUE and FALSE sub-vectors.

I've modified your test1() and test2() to simplify a bit, pulling out the data simulation (since that's not what we want to test). I added test3 that uses logical subsets. I also drastically reduced the size of the test vector so it runs reasonably quickly.

set.seed(47)
x <- sample(1:1e6, 1e4, replace = TRUE)

test1 <- function(x) {
  ifelse(x %% 2 == 0, x**2, x/2)
}

test2 <- function(x) {
  y <- numeric(length(x))
  for (i in seq_along(x)) {
    if (x[i] %% 2 == 0) {
      y[i] <- x[i]**2
    } else {
      y[i] <- x[i]/2
    }
  }
  return(y)
}

test3 <- function(x) {
  y = numeric(length(x))
  cond = x %% 2 == 0
  y[cond] = x[cond] ^ 2
  y[!cond] = x[!cond] / 2
  return(y)
}

identical(test1(x), test2(x))
# TRUE
identical(test1(x), test3(x))
# TRUE
microbenchmark::microbenchmark(test1(x), test2(x), test3(x), times = 1000)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# test1(x) 1563.270 1642.3540 1701.3877 1669.2180 1697.894 3159.743 1000 b
# test2(x) 17909.833 18788.9635 23682.1516 19882.8600 20679.436 116206.536 1000 c
# test3(x) 627.241 668.7445 691.8433 680.6675 696.061 1340.507 1000 a

Is `if` faster than ifelse?

This is more of an extended comment building on Roman's answer, but I need the code utilities to expound:

Roman is correct that if is faster than ifelse, but I am under the impression that the speed boost of if isn't particularly interesting since it isn't something that can easily be harnessed through vectorization. That is to say, if is only advantageous over ifelse when the cond/test argument is of length 1.
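A small sketch of that difference, using stop() as a stand-in for a branch that must not run. With `if`, only the selected branch is evaluated. With a mixed vector test, ifelse forces both arguments.

```r
# `if` evaluates only the selected branch, so the stop() below never runs.
scalar_result <- if (TRUE) "kept" else stop("never evaluated")

# With a mixed vector test, ifelse() forces both arguments, so the same
# kind of stop() is triggered as soon as any element of the test is FALSE.
vector_result <- tryCatch(
  ifelse(c(TRUE, FALSE), "kept", stop("no branch was evaluated")),
  error = function(e) conditionMessage(e)
)
```

The tryCatch() wrapper is only there so the sketch runs to completion; vector_result ends up holding the error message rather than a vector of results.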

Consider the following function which is an admittedly weak attempt at vectorizing if without having the side effect of evaluating both the yes and no conditions as ifelse does.

ifelse2 <- function(test, yes, no){
  result <- rep(NA, length(test))
  for (i in seq_along(test)){
    result[i] <- `if`(test[i], yes[i], no[i])
  }
  result
}

ifelse2a <- function(test, yes, no){
  sapply(seq_along(test),
         function(i) `if`(test[i], yes[i], no[i]))
}

ifelse3 <- function(test, yes, no){
  result <- rep(NA, length(test))
  logic <- test
  result[logic] <- yes[logic]
  result[!logic] <- no[!logic]
  result
}


set.seed(pi)
x <- rnorm(1000)

library(microbenchmark)
microbenchmark(
standard = ifelse(x < 0, x^2, x),
modified = ifelse2(x < 0, x^2, x),
modified_apply = ifelse2a(x < 0, x^2, x),
third = ifelse3(x < 0, x^2, x),
fourth = c(x, x^2)[1L + ( x < 0 )],
fourth_modified = c(x, x^2)[seq_along(x) + length(x) * (x < 0)]
)

Unit: microseconds
expr min lq mean median uq max neval cld
standard 52.198 56.011 97.54633 58.357 68.7675 1707.291 100 ab
modified 91.787 93.254 131.34023 94.133 98.3850 3601.967 100 b
modified_apply 645.146 653.797 718.20309 661.568 676.0840 3703.138 100 c
third 20.528 22.873 76.29753 25.513 27.4190 3294.350 100 ab
fourth 15.249 16.129 19.10237 16.715 20.9675 43.695 100 a
fourth_modified 19.061 19.941 22.66834 20.528 22.4335 40.468 100 a

SOME EDITS: Thanks to Frank and Richard Scriven for noticing my shortcomings.

As you can see, the process of breaking up the vector to be suitable to pass to if is a time consuming process and ends up being slower than just running ifelse (which is probably why no one has bothered to implement my solution).

If you're really desperate for an increase in speed, you can use the ifelse3 approach above. Or better yet, Frank's less obvious* but brilliant solution.

  * By 'less obvious' I mean it took me two seconds to realize what he did. And per nicola's comment below, please note that this works only when yes and no have length 1; otherwise you'll want to stick with ifelse3.
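To make the indexing trick concrete, here is a minimal sketch on a three-element vector. The variable names (pool, idx, picked, scalar_pick) are illustrative only.

```r
x <- c(-2, 1, -3)
test <- x < 0
yes <- x^2
no <- x

# Stack no and yes end to end; an offset of length(x) jumps to the yes half.
pool <- c(no, yes)
idx <- seq_along(x) + length(x) * test
picked <- pool[idx]

# The shorter form only chooses between element 1 and element 2 of the pool,
# so it is safe only when yes and no are single values.
scalar_pick <- c("non-negative", "negative")[1L + test]
```

picked should match ifelse(test, yes, no) here. Applying the shorter 1L + test form to c(x, x^2) would instead return only x[1] or x[2] at every position.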

Speeding up ifelse() without writing C/C++?

I have encountered this before. We don't have to use ifelse() all the time. If you look at how ifelse is written, by typing "ifelse" in your R console, you can see that the function is written in R itself and does various checks, which makes it relatively inefficient.

Instead of using ifelse(), we can do this:

getScore <- function(history, similarities) {
######## old code #######
# nh <- ifelse(similarities < 0, 6 - history, history)
######## old code #######
######## new code #######
nh <- history
ind <- similarities < 0
nh[ind] <- 6 - nh[ind]
######## new code #######
x <- nh * abs(similarities)
contados <- !is.na(history)
sum(x, na.rm=TRUE) / sum(abs(similarities[contados]), na.rm = TRUE)
}

And then let's check profiling result again:

Rprof("foo.out")
for (i in (1:10)) getScore(history, similarities)
Rprof(NULL)
summaryRprof("foo.out")

# $by.total
# total.time total.pct self.time self.pct
# "getScore" 2.10 100.00 0.88 41.90
# "abs" 0.32 15.24 0.32 15.24
# "*" 0.26 12.38 0.26 12.38
# "sum" 0.26 12.38 0.26 12.38
# "<" 0.14 6.67 0.14 6.67
# "-" 0.14 6.67 0.14 6.67
# "!" 0.06 2.86 0.06 2.86
# "is.na" 0.04 1.90 0.04 1.90

# $sample.interval
# [1] 0.02

# $sampling.time
# [1] 2.1

We have a 2+ times boost in performance. Furthermore, the profile is more like a flat profile, without any single part dominating execution time.

In R, vector indexing, reading, and writing run at the speed of C code, so use vector operations whenever you can.
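One caveat with the subset-assignment pattern, sketched below on made-up data (the values are assumptions, not the OP's): if the logical index contains NA and the replacement has more than one element, R refuses the assignment. Wrapping the index in which() is one way around that, with the side effect that NA positions keep their original value instead of becoming NA as they would with ifelse.

```r
# Made-up inputs for illustration; similarities contains one NA.
history <- c(1, 2, 3, 4)
similarities <- c(-0.5, NA, 0.2, -0.1)

ind <- similarities < 0

# Direct logical assignment fails when the index contains NA
# and the replacement has more than one element.
direct <- tryCatch({
  nh <- history
  nh[ind] <- 6 - nh[ind]
  nh
}, error = function(e) "error")

# which() drops the NA positions, so those elements keep their old value.
nh <- history
pos <- which(ind)
nh[pos] <- 6 - nh[pos]
```

Whether keeping the old value at NA positions is acceptable depends on the downstream calculation; in getScore the later na.rm = TRUE sums would treat the two cases differently.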


Testing @Matthew's answer

mat_getScore <- function(history, similarities) {
######## old code #######
# nh <- ifelse(similarities < 0, 6 - history, history)
######## old code #######
######## new code #######
ind <- similarities < 0
nh <- ind*(6-history) + (!ind)*history
######## new code #######
x <- nh * abs(similarities)
contados <- !is.na(history)
sum(x, na.rm=TRUE) / sum(abs(similarities[contados]), na.rm = TRUE)
}

Rprof("foo.out")
for (i in (1:10)) mat_getScore(history, similarities)
Rprof(NULL)
summaryRprof("foo.out")

# $by.total
# total.time total.pct self.time self.pct
# "mat_getScore" 2.60 100.00 0.24 9.23
# "*" 0.76 29.23 0.76 29.23
# "!" 0.40 15.38 0.40 15.38
# "-" 0.34 13.08 0.34 13.08
# "+" 0.26 10.00 0.26 10.00
# "abs" 0.20 7.69 0.20 7.69
# "sum" 0.18 6.92 0.18 6.92
# "<" 0.16 6.15 0.16 6.15
# "is.na" 0.06 2.31 0.06 2.31

# $sample.interval
# [1] 0.02

# $sampling.time
# [1] 2.6

Ah? Slower?

The full profiling result shows that this approach spends more time on floating point multiplication "*", and the logical not "!" seems pretty expensive, whereas my approach requires only floating point addition / subtraction.

Well, the result might also be architecture dependent. I am testing on Intel Nehalem (Intel Core 2 Duo) at the moment, so benchmarks of the two approaches on other platforms are welcome.


Remark

All profiling uses the OP's data from the question.

how ifelse (in data.table) works

This is only indirectly related to data.table; the core issue is that ifelse is designed for use like:

ifelse(test, yes, no)

where test, yes, and no all have the same length -- the output will be the same length as test, and all the elements corresponding to where test is TRUE will be the corresponding element from yes, and similarly for where test is FALSE.

When test is a scalar and yes or no are vectors, as in your case, you have to look at what ifelse is doing to understand what's going on:

Relevant source:

if (any(test[ok])) #is any element of `test` `TRUE`?
ans[test & ok] <- rep(yes, length.out = length(ans))[test &
ok]

What is rep(c(1, 2), length.out = 1)? It's just 1 -- the second element is truncated.

That's what's happened here -- the value of ifelse is only the first element of paste0(1:.N, "_", col2). When passed to `:=`, this single element is recycled.

When your logical condition is a scalar, you should use if, not ifelse. I'll also add that I do my damnedest to avoid using ifelse in general because it's slow.
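A minimal sketch of the truncation, using a short character vector as a stand-in for paste0(1:.N, "_", col2) (the stand-in values are assumptions):

```r
# Stand-in for paste0(1:.N, "_", col2) with three rows.
yes <- paste0(1:3, "_", "a")

# Scalar test with ifelse(): the result has the length of the test, i.e. 1.
from_ifelse <- ifelse(TRUE, yes, "other")

# Scalar test with if: the whole yes vector is returned.
from_if <- if (TRUE) yes else "other"
```

from_ifelse holds only the first element of yes, while from_if keeps all three.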

ifelse over each element of a vector

ifelse isn't a function of one vector, it is a function of 3 vectors of the same length. The first vector, called test, is a boolean, the second vector yes and third vector no give the elements in the result, chosen item-by-item based on the test value.

A sample of size = 1 is a different size than test (unless the length of test is 1), so it will be recycled by ifelse (see note below). Instead, draw samples of the same size as test from the start:

ifelse(
test = (y == 1),
yes = sample(x = c(1, 3, 5, 7, 9), size = length(y), replace = TRUE),
no = sample(x = c(0, 2, 4, 6, 8), size = length(y), replace = TRUE)
)

The vectors don't actually have to be of the same length. The help page ?ifelse explains: "If yes or no are too short, their elements are recycled." This is the behavior you observed with "It generates once for each case, and returns that very random number always.".
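The recycling can be seen side by side in a short sketch. The vector y here is made up for illustration.

```r
set.seed(1)
y <- c(1, 0, 1, 1, 0, 1)  # made-up test data

# size = 1: one draw per branch, recycled to every matching position.
recycled <- ifelse(y == 1,
                   sample(c(1, 3, 5, 7, 9), size = 1),
                   sample(c(0, 2, 4, 6, 8), size = 1))

# size = length(y): a separate draw is available for each position.
independent <- ifelse(y == 1,
                      sample(c(1, 3, 5, 7, 9), size = length(y), replace = TRUE),
                      sample(c(0, 2, 4, 6, 8), size = length(y), replace = TRUE))
```

In recycled, every position where y == 1 shares one odd number and every other position shares one even number. In independent, the values can differ from position to position.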

Group-filling maximum is slow with missing values

Using if(){} we can bypass the max calculation if the entire vector is NA. This is a massive speed-up:

fmax = function(x, na.rm = TRUE) {
  if(all(is.na(x))) return(x[1])
  return(max(x, na.rm = na.rm))
}

system.time(df %>%
  group_by(group) %>%
  mutate(maxval = fmax(val)))
# user system elapsed
# 0.20 0.01 0.22

Does dplyr::if_else evaluate both TRUE and FALSE at the same time?

The issue arises in groups that have no TRUE values: there which(value) returns nothing, and min() of an empty input gives Inf with a warning:

min(NULL)
#[1] Inf

Warning message: In min(NULL) : no non-missing arguments to min;
returning Inf


One option is to subset the which() output with [1], which returns NA when there is no TRUE value:

mydf %>%
group_by(group) %>%
mutate(max_value = if_else(all(!value), max(index), index[which(value)[1]]))
# A tibble: 15 x 4
# Groups: group [3]
# value group index max_value
# <lgl> <fct> <int> <int>
# 1 FALSE a 1 2
# 2 TRUE a 2 2
# 3 FALSE a 3 2
# 4 FALSE a 4 2
# 5 TRUE a 5 2
# 6 FALSE b 1 4
# 7 FALSE b 2 4
# 8 FALSE b 3 4
# 9 TRUE b 4 4
#10 TRUE b 5 4
#11 FALSE c 1 5
#12 FALSE c 2 5
#13 FALSE c 3 5
#14 FALSE c 4 5
#15 FALSE c 5 5
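The effect of the [1] subset can be checked in base R on a single group. The vectors below are made up to mirror groups a and c above.

```r
index <- 1:5
none_true <- c(FALSE, FALSE, FALSE, FALSE, FALSE)  # like group c
some_true <- c(FALSE, TRUE, FALSE, FALSE, TRUE)    # like group a

# min() of an empty which() result is Inf, normally with a warning.
empty_min <- suppressWarnings(min(which(none_true)))

# Subsetting the empty result with [1] gives NA, so the lookup gives NA too.
first_none <- index[which(none_true)[1]]

# With at least one TRUE, [1] picks the first matching position.
first_some <- index[which(some_true)[1]]
```

Because if_else evaluates both branches for every group, the branch that is not selected still has to return something usable, and NA is acceptable where Inf with a warning is not.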

Also, in this case, as we are returning a single element, if/else would be more appropriate

mydf %>%
group_by(group) %>%
mutate(max_value = if(all(!value)) max(index) else index[which(value)[1]])
# A tibble: 15 x 4
# Groups: group [3]
# value group index max_value
# <lgl> <fct> <int> <int>
# 1 FALSE a 1 2
# 2 TRUE a 2 2
# 3 FALSE a 3 2
# 4 FALSE a 4 2
# 5 TRUE a 5 2
# 6 FALSE b 1 4
# 7 FALSE b 2 4
# 8 FALSE b 3 4
# 9 TRUE b 4 4
#10 TRUE b 5 4
#11 FALSE c 1 5
#12 FALSE c 2 5
#13 FALSE c 3 5
#14 FALSE c 4 5
#15 FALSE c 5 5

Optimizing ifelse on a large data frame

There has been some discussion about how ifelse is not the best option for code where speed is an important factor. You might instead try:

df$Mean.Result1 <- c("", "Equal")[(df$A > 0.05 & df$B > 0.05)+1]

To see what's going on here, let's break down the command. df$A > 0.05 & df$B > 0.05 returns TRUE if both A and B exceed 0.05, and FALSE otherwise. Therefore, (df$A > 0.05 & df$B > 0.05)+1 returns 2 if both A and B exceed 0.05 and 1 otherwise. These are used as indices into the vector c("", "Equal"), so we get "Equal" when both exceed 0.05 and "" otherwise.
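The breakdown can be followed step by step on a tiny data frame. The values are made up, and one NA is included to show that it carries through as NA.

```r
# Made-up data; the last row has a missing A.
df <- data.frame(A = c(0.50, 0.01, 0.90, NA),
                 B = c(0.20, 0.30, 0.02, 0.70))

cond <- df$A > 0.05 & df$B > 0.05   # TRUE FALSE FALSE NA
idx <- cond + 1                     # 2 1 1 NA
labels <- c("", "Equal")[idx]
```

labels comes out the same as ifelse(cond, "Equal", "") for this input, including the NA in the last position.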

Here's a comparison on a data frame with 1 million rows:

# Build dataset and functions
set.seed(144)
big.df <- data.frame(A = runif(1000000), B = runif(1000000))
OP <- function(df) {
  df$Mean.Result1 <- ifelse(df$A > 0.05 & df$B > 0.05, "Equal", "")
  df
}
josilber <- function(df) {
  df$Mean.Result1 <- c("", "Equal")[(df$A > 0.05 & df$B > 0.05)+1]
  df
}
all.equal(OP(big.df), josilber(big.df))
# [1] TRUE

# Benchmark
library(microbenchmark)
microbenchmark(OP(big.df), josilber(big.df))
# Unit: milliseconds
# expr min lq mean median uq max neval
# OP(big.df) 299.6265 311.56167 352.26841 318.51825 348.09461 540.0971 100
# josilber(big.df) 40.4256 48.66967 60.72864 53.18471 59.72079 267.3886 100

The approach with vector indexing is about 6x faster in median runtime.


