Does ifelse really calculate both of its vectors every time? Is it slow?
Yes, with one exception.
`ifelse` calculates both its `yes` value and its `no` value, except when the test condition is either all TRUE or all FALSE.
We can see this by generating random numbers and observing how many are actually generated (by resetting the seed and checking which random number comes next).
# TEST CONDITION, ALL TRUE
set.seed(1)
dump <- ifelse(rep(TRUE, 200), rnorm(200), rnorm(200))
next.random.number.after.all.true <- rnorm(1)
# TEST CONDITION, ALL FALSE
set.seed(1)
dump <- ifelse(rep(FALSE, 200), rnorm(200), rnorm(200))
next.random.number.after.all.false <- rnorm(1)
# TEST CONDITION, MIXED
set.seed(1)
dump <- ifelse(c(FALSE, rep(TRUE, 199)), rnorm(200), rnorm(200))
next.random.number.after.some.TRUE.some.FALSE <- rnorm(1)
# RESET THE SEED, GENERATE SEVERAL RANDOM NUMBERS TO SEARCH FOR A MATCH
set.seed(1)
r.1000 <- rnorm(1000)
cat("Quantity of random numbers generated during the `ifelse` statement when:",
"\n\tAll True ", which(r.1000 == next.random.number.after.all.true) - 1,
"\n\tAll False ", which(r.1000 == next.random.number.after.all.false) - 1,
"\n\tMixed T/F ", which(r.1000 == next.random.number.after.some.TRUE.some.FALSE) - 1
)
Gives the following output:
Quantity of random numbers generated during the `ifelse` statement when:
    All True   200
    All False  200
    Mixed T/F  400   <~~ Notice TWICE AS MANY numbers were
                         generated when `test` had both
                         T & F values present
We can also see it in the source code itself:
.
.
if (any(test[!nas]))
ans[test & !nas] <- rep(yes, length.out = length(ans))[test & # <~~~~ This line and the one below
!nas]
if (any(!test[!nas]))
ans[!test & !nas] <- rep(no, length.out = length(ans))[!test & # <~~~~ ... are the culprits
!nas]
.
.
Notice that `yes` and `no` are computed only if there is some non-NA value of `test` that is TRUE or FALSE (respectively).
At that point (and this is the important part when it comes to efficiency) the entirety of each vector is computed.
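A minimal sketch of the same point, using a side-effect log instead of the random number stream. The helper names (`make_yes`, `make_no`, `evaluated`) are made up for illustration; only `ifelse` itself is base R.

```r
# Record which branch arguments ifelse() actually evaluates.
evaluated <- new.env()
assign("log", character(0), envir = evaluated)

make_yes <- function(n) {
  assign("log", c(get("log", envir = evaluated), "yes"), envir = evaluated)
  seq_len(n)
}
make_no <- function(n) {
  assign("log", c(get("log", envir = evaluated), "no"), envir = evaluated)
  -seq_len(n)
}

# All TRUE: the `no` argument is never forced.
res_all_true <- ifelse(rep(TRUE, 4), make_yes(4), make_no(4))
log_all_true <- get("log", envir = evaluated)

# A single FALSE among the TRUEs: both full-length vectors get built.
assign("log", character(0), envir = evaluated)
res_mixed <- ifelse(c(FALSE, TRUE, TRUE, TRUE), make_yes(4), make_no(4))
log_mixed <- get("log", envir = evaluated)
```

Here `log_all_true` should contain only "yes", while `log_mixed` should contain both "yes" and "no", even though only one element of the `no` vector is used.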
Ok, but is it slower?
Let's see if we can test it:
library(microbenchmark)
# Create some sample data
N <- 1e4
set.seed(1)
X <- sample(c(seq(100), rep(NA, 100)), N, TRUE)
Y <- ifelse(is.na(X), rnorm(X), NA) # Y has the reverse NA/not-NA pattern of X
These two statements generate the same results
yesifelse <- quote(sort(ifelse(is.na(X), Y+17, X-17 ) ))
noiflese <- quote(sort(c(Y[is.na(X)]+17, X[is.na(Y)]-17)))
identical(eval(yesifelse), eval(noiflese))
# [1] TRUE
but one is twice as fast as the other
microbenchmark(eval(yesifelse), eval(noiflese), times=50L)
N = 1,000
Unit: milliseconds
            expr      min       lq   median       uq      max neval
 eval(yesifelse) 2.286621 2.348590 2.411776 2.537604 10.05973    50
  eval(noiflese) 1.088669 1.093864 1.122075 1.149558 61.23110    50
N = 10,000
Unit: milliseconds
            expr      min       lq   median       uq      max neval
 eval(yesifelse) 30.32039 36.19569 38.50461 40.84996 98.77294    50
  eval(noiflese) 12.70274 13.58295 14.38579 20.03587 21.68665    50
Is ifelse() in R efficient for determining which function to call on a large vector?
The way you've coded it does worse than `ifelse`, but as suggested in the warning section of `?ifelse`, it's possible to do better. With your simple functions, `x^2` and `x / 2`, the `test3()` function below is faster: about 2 to 3 times faster than `ifelse` and 30 times faster than `test2()`. With more computationally intensive functions (but still vectorized!) the margin might be bigger.
The speed gain is (I think) mostly due to two sources:
- `ifelse` does input checking and error handling that `test3()` skips. `ifelse` is more general and more flexible (`test3()` is hardcoded to only return a `numeric` vector).
- As demonstrated at "Does ifelse really calculate both of its vectors every time? Is it slow?", `ifelse` will calculate its entire TRUE response vector as long as there is at least 1 TRUE value of the test, and similarly for its FALSE. `test3()` bypasses the extra calculations by creating TRUE and FALSE sub-vectors.
I've modified your `test1()` and `test2()` to simplify a bit, pulling out the data simulation (since that's not what we want to test). I added `test3()`, which uses logical subsets. I also drastically reduced the size of the test vector so it runs reasonably quickly.
set.seed(47)
x <- sample(1:1e6, 1e4, replace = TRUE)
test1 <- function(x) {
ifelse(x %% 2 == 0, x**2, x/2)
}
test2 <- function(x) {
y <- numeric(length(x))
for (i in seq_along(x)) {
if (x[i] %% 2 == 0) {
y[i] <- x[i]**2
} else {
y[i] <- x[i]/2
}
}
return(y)
}
test3 <- function(x) {
y = numeric(length(x))
cond = x %% 2 == 0
y[cond] = x[cond] ^ 2
y[!cond] = x[!cond] / 2
return(y)
}
identical(test1(x), test2(x))
# TRUE
identical(test1(x), test3(x))
# TRUE
microbenchmark::microbenchmark(test1(x), test2(x), test3(x), times = 1000)
# Unit: microseconds
#      expr       min         lq       mean     median        uq        max neval cld
#  test1(x)  1563.270  1642.3540  1701.3877  1669.2180  1697.894   3159.743  1000  b
#  test2(x) 17909.833 18788.9635 23682.1516 19882.8600 20679.436 116206.536  1000   c
#  test3(x)   627.241   668.7445   691.8433   680.6675   696.061   1340.507  1000 a
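The subsetting idea behind `test3()` can be written once and reused. The following is only a sketch: `apply_by_condition` is a hypothetical helper name, it assumes a numeric result, and it uses `which()` so that NA values in the test are left as NA rather than causing a subscript error.

```r
# Hypothetical helper: apply yes_fun / no_fun only to the elements that need them.
apply_by_condition <- function(x, test, yes_fun, no_fun) {
  out <- rep(NA_real_, length(x))      # assumption: numeric output
  is_true  <- which(test)              # which() drops NA positions
  is_false <- which(!test)
  if (length(is_true))  out[is_true]  <- yes_fun(x[is_true])
  if (length(is_false)) out[is_false] <- no_fun(x[is_false])
  out
}

x <- c(2, 3, NA, 8)
res <- apply_by_condition(x, x %% 2 == 0,
                          yes_fun = function(v) v^2,
                          no_fun  = function(v) v / 2)
ref <- ifelse(x %% 2 == 0, x^2, x / 2)
```

On this small input `res` and `ref` should agree, including the NA in the third position. The difference is that each function only ever sees the elements it is responsible for.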
Is `if` faster than ifelse?
This is more of an extended comment building on Roman's answer, but I need the code utilities to expound:
Roman is correct that `if` is faster than `ifelse`, but I am under the impression that the speed boost of `if` isn't particularly interesting since it isn't something that can easily be harnessed through vectorization. That is to say, `if` is only advantageous over `ifelse` when the `cond`/`test` argument is of length 1.
Consider the following functions, which are admittedly weak attempts at vectorizing `if` without the side effect of evaluating both the `yes` and `no` arguments as `ifelse` does.
ifelse2 <- function(test, yes, no){
result <- rep(NA, length(test))
for (i in seq_along(test)){
result[i] <- `if`(test[i], yes[i], no[i])
}
result
}
ifelse2a <- function(test, yes, no){
sapply(seq_along(test),
function(i) `if`(test[i], yes[i], no[i]))
}
ifelse3 <- function(test, yes, no){
result <- rep(NA, length(test))
logic <- test
result[logic] <- yes[logic]
result[!logic] <- no[!logic]
result
}
set.seed(pi)
x <- rnorm(1000)
library(microbenchmark)
microbenchmark(
standard = ifelse(x < 0, x^2, x),
modified = ifelse2(x < 0, x^2, x),
modified_apply = ifelse2a(x < 0, x^2, x),
third = ifelse3(x < 0, x^2, x),
fourth = c(x, x^2)[1L + ( x < 0 )],
fourth_modified = c(x, x^2)[seq_along(x) + length(x) * (x < 0)]
)
Unit: microseconds
            expr     min      lq      mean  median       uq      max neval cld
        standard  52.198  56.011  97.54633  58.357  68.7675 1707.291   100  ab
        modified  91.787  93.254 131.34023  94.133  98.3850 3601.967   100   b
  modified_apply 645.146 653.797 718.20309 661.568 676.0840 3703.138   100    c
           third  20.528  22.873  76.29753  25.513  27.4190 3294.350   100  ab
          fourth  15.249  16.129  19.10237  16.715  20.9675   43.695   100  a
 fourth_modified  19.061  19.941  22.66834  20.528  22.4335   40.468   100  a
SOME EDITS: Thanks to Frank and Richard Scriven for noticing my shortcomings.
As you can see, breaking up the vector so that it can be passed to `if` is time-consuming and ends up being slower than just running `ifelse` (which is probably why no one has bothered to implement my solution).
If you're really desperate for an increase in speed, you can use the `ifelse3` approach above. Or better yet, Frank's less obvious* but brilliant solution.
* By 'less obvious' I mean it took me two seconds to realize what he did. And per nicola's comment below, please note that this works only when `yes` and `no` have length 1; otherwise you'll want to stick with `ifelse3`.
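To make the length-1 caveat concrete, here is a small sketch on a made-up four-element vector. It uses the same indexing expressions as the `fourth` and `fourth_modified` entries in the benchmark; the variable names are only for illustration.

```r
x <- c(-2, 3, -4, 5)
test <- x < 0

# Length-1 yes/no: indexing a two-element lookup vector works as intended.
labels <- c("non-negative", "negative")[1L + test]

# Vector yes/no: 1L + test only ever selects positions 1 and 2 of c(x, x^2),
# so the result is built from x[1] and x[2] alone.
naive <- c(x, x^2)[1L + test]

# Offsetting by position picks element i from x, or element i from x^2.
offset <- c(x, x^2)[seq_along(x) + length(x) * test]

reference <- ifelse(test, x^2, x)
```

On this input `offset` should match `reference`, while `naive` does not.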
Speeding up ifelse() without writing C/C++?
I have encountered this before. We don't have to use `ifelse()` all the time. If you look at how `ifelse` is written, by typing `ifelse` in your R console, you can see that the function is written in R and does various checks, which is really inefficient.
Instead of using `ifelse()`, we can do this:
getScore <- function(history, similarities) {
######## old code #######
# nh <- ifelse(similarities < 0, 6 - history, history)
######## old code #######
######## new code #######
nh <- history
ind <- similarities < 0
nh[ind] <- 6 - nh[ind]
######## new code #######
x <- nh * abs(similarities)
contados <- !is.na(history)
sum(x, na.rm=TRUE) / sum(abs(similarities[contados]), na.rm = TRUE)
}
And then let's check the profiling result again:
Rprof("foo.out")
for (i in (1:10)) getScore(history, similarities)
Rprof(NULL)
summaryRprof("foo.out")
# $by.total
#            total.time total.pct self.time self.pct
# "getScore"       2.10    100.00      0.88    41.90
# "abs"            0.32     15.24      0.32    15.24
# "*"              0.26     12.38      0.26    12.38
# "sum"            0.26     12.38      0.26    12.38
# "<"              0.14      6.67      0.14     6.67
# "-"              0.14      6.67      0.14     6.67
# "!"              0.06      2.86      0.06     2.86
# "is.na"          0.04      1.90      0.04     1.90
# $sample.interval
# [1] 0.02
# $sampling.time
# [1] 2.1
We have a 2+ times boost in performance. Furthermore, the profile is flatter, without any single part dominating execution time.
In R, vector indexing, reading, and writing run at the speed of C code, so use a vector whenever you can.
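As a quick sanity check that the in-place replacement gives the same values as the `ifelse` line it replaces, here is a sketch on tiny made-up vectors (the OP's real `history` and `similarities` are not reproduced here). It assumes `similarities` contains no NA: an NA in a logical subscript would make the assignment fail, and would need handling, for example with `which()`.

```r
# Made-up stand-ins for the OP's data.
history      <- c(1, 5, NA, 3)
similarities <- c(0.5, -0.2, 0.9, -0.7)   # assumption: no NA values here

# Original formulation.
nh_ifelse <- ifelse(similarities < 0, 6 - history, history)

# In-place replacement: only the selected elements are recomputed.
nh <- history
ind <- similarities < 0
nh[ind] <- 6 - nh[ind]
```

Both `nh_ifelse` and `nh` should hold the same values, with the NA in `history` carried through unchanged.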
Testing @Matthew's answer
mat_getScore <- function(history, similarities) {
######## old code #######
# nh <- ifelse(similarities < 0, 6 - history, history)
######## old code #######
######## new code #######
ind <- similarities < 0
nh <- ind*(6-history) + (!ind)*history
######## new code #######
x <- nh * abs(similarities)
contados <- !is.na(history)
sum(x, na.rm=TRUE) / sum(abs(similarities[contados]), na.rm = TRUE)
}
Rprof("foo.out")
for (i in (1:10)) mat_getScore(history, similarities)
Rprof(NULL)
summaryRprof("foo.out")
# $by.total
#                total.time total.pct self.time self.pct
# "mat_getScore"       2.60    100.00      0.24     9.23
# "*"                  0.76     29.23      0.76    29.23
# "!"                  0.40     15.38      0.40    15.38
# "-"                  0.34     13.08      0.34    13.08
# "+"                  0.26     10.00      0.26    10.00
# "abs"                0.20      7.69      0.20     7.69
# "sum"                0.18      6.92      0.18     6.92
# "<"                  0.16      6.15      0.16     6.15
# "is.na"              0.06      2.31      0.06     2.31
# $sample.interval
# [1] 0.02
# $sampling.time
# [1] 2.6
Ah? Slower?
The full profiling result shows that this approach spends more time on floating point multiplication ("*"), and the logical not ("!") seems pretty expensive, while my approach requires floating point addition / subtraction only.
The result might also be architecture dependent. I am testing on Intel Nehalem (Intel Core 2 Duo) at the moment, so benchmarks comparing the two approaches on various platforms are welcome.
Remark
All profiling uses the OP's data in the question.
how ifelse (in data.table) works
This is only related to `data.table` by proxy; at its core, `ifelse` is designed for use like:
ifelse(test, yes, no)
where `test`, `yes`, and `no` all have the same length. The output will be the same length as `test`; every element where `test` is TRUE is the corresponding element from `yes`, and similarly every element where `test` is FALSE comes from `no`.
When `test` is a scalar and `yes` or `no` are vectors, as in your case, you have to look at what `ifelse` is doing to understand what's going on:
Relevant source:
if (any(test[ok])) #is any element of `test` `TRUE`?
ans[test & ok] <- rep(yes, length.out = length(ans))[test &
ok]
What is `rep(c(1, 2), length.out = 1)`? It's just 1: the second element is truncated.
That's what's happened here: the value of `ifelse` is only the first element of `paste0(1:.N, "_", col2)`. When passed to `:=`, this single element is recycled.
When your logical condition is a scalar, you should use `if`, not `ifelse`. I'll also add that I do my damnedest to avoid using `ifelse` in general because it's slow.
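The truncation is easy to reproduce outside `data.table`. A minimal sketch with made-up values (the character vector stands in for the `paste0(...)` result):

```r
# rep() with length.out = 1 keeps only the first element.
truncated <- rep(c(1, 2), length.out = 1)

# Scalar test, vector `yes`: ifelse() returns something the length of `test`.
candidates <- c("1_a", "2_b", "3_c")          # made-up stand-in values
from_ifelse <- ifelse(TRUE, candidates, "none")

# Plain if/else returns whichever branch is chosen, at its full length.
from_if <- if (TRUE) candidates else "none"
```

`from_ifelse` has length 1 (just the first candidate), whereas `from_if` keeps all three elements.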
ifelse over each element of a vector
`ifelse` isn't a function of one vector; it is a function of 3 vectors of the same length. The first vector, called `test`, is a boolean; the second vector `yes` and third vector `no` give the elements in the result, chosen item-by-item based on the `test` value.
A sample of `size = 1` is a different size than `test` (unless the length of `test` is 1), so it will be recycled by `ifelse` (see note below). Instead, draw samples of the same size as `test` from the start:
ifelse(
test = (y == 1),
yes = sample(x = c(1, 3, 5, 7, 9), size = length(y), replace = TRUE),
no = sample(x = c(0, 2, 4, 6, 8), size = length(y), replace = TRUE)
)
The vectors don't actually have to be of the same length. The help page `?ifelse` explains: "If `yes` or `no` are too short, their elements are recycled." This is the behavior you observed with "It generates once for each case, and returns that very random number always."
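A small sketch contrasting the two behaviours. The vector `y` and the seed are made up for illustration; the point is only that a length-1 sample is recycled, while a full-length sample gives an independent draw per element.

```r
set.seed(1)
y <- c(1, 0, 1, 1, 0, 1)

# size = 1: one draw per branch, recycled across every matching element.
recycled <- ifelse(y == 1,
                   sample(c(1, 3, 5, 7, 9), size = 1),
                   sample(c(0, 2, 4, 6, 8), size = 1))

# size = length(y): one draw per element, then ifelse() picks position by position.
per_element <- ifelse(y == 1,
                      sample(c(1, 3, 5, 7, 9), size = length(y), replace = TRUE),
                      sample(c(0, 2, 4, 6, 8), size = length(y), replace = TRUE))
```

In `recycled`, all positions where `y == 1` share a single odd number and all other positions share a single even number. In `per_element`, each position is drawn separately.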
Group-filling maximum is slow with missing values
Using `if(){}` we can bypass the `max` calculation if the entire vector is NA. This is a massive speed-up:
fmax = function(x, na.rm = TRUE) {
if(all(is.na(x))) return(x[1])
return(max(x, na.rm = na.rm))
}
system.time(df %>%
group_by(group) %>%
mutate(maxval = fmax(val)))
# user system elapsed
# 0.20 0.01 0.22
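For reference, here is how the short-circuit behaves on plain vectors, without the grouped data frame. The function is repeated so the sketch is self-contained; the input vectors are made up.

```r
fmax <- function(x, na.rm = TRUE) {
  if (all(is.na(x))) return(x[1])   # all missing: return NA without calling max()
  max(x, na.rm = na.rm)
}

all_missing  <- c(NA_real_, NA_real_, NA_real_)
some_missing <- c(1, NA, 3)

res_all_missing  <- fmax(all_missing)    # NA, no warning
res_some_missing <- fmax(some_missing)   # largest non-missing value

# For comparison, plain max() on an all-NA vector warns and returns -Inf.
plain_max <- suppressWarnings(max(all_missing, na.rm = TRUE))
```

Besides skipping work, the early return also keeps all-NA groups as NA instead of turning them into `-Inf`.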
Does dplyr::if_else evaluate both TRUE and FALSE at the same time?
The issue arises because some groups have no TRUE values, so `which(value)` returns an empty vector, and `min()` of an empty input returns Inf with a warning:
min(NULL)
#[1] Inf
Warning message:
In min(NULL) : no non-missing arguments to min; returning Inf
One option is to subset the `which` output with `[1]`, which returns NA when there is no match:
mydf %>%
group_by(group) %>%
mutate(max_value = if_else(all(!value), max(index), index[which(value)[1]]))
# A tibble: 15 x 4
# Groups: group [3]
# value group index max_value
# <lgl> <fct> <int> <int>
# 1 FALSE a 1 2
# 2 TRUE a 2 2
# 3 FALSE a 3 2
# 4 FALSE a 4 2
# 5 TRUE a 5 2
# 6 FALSE b 1 4
# 7 FALSE b 2 4
# 8 FALSE b 3 4
# 9 TRUE b 4 4
#10 TRUE b 5 4
#11 FALSE c 1 5
#12 FALSE c 2 5
#13 FALSE c 3 5
#14 FALSE c 4 5
#15 FALSE c 5 5
Also, in this case, as we are returning a single element, `if`/`else` would be more appropriate:
mydf %>%
group_by(group) %>%
mutate(max_value = if(all(!value)) max(index) else index[which(value)[1]])
# A tibble: 15 x 4
# Groups: group [3]
# value group index max_value
# <lgl> <fct> <int> <int>
# 1 FALSE a 1 2
# 2 TRUE a 2 2
# 3 FALSE a 3 2
# 4 FALSE a 4 2
# 5 TRUE a 5 2
# 6 FALSE b 1 4
# 7 FALSE b 2 4
# 8 FALSE b 3 4
# 9 TRUE b 4 4
#10 TRUE b 5 4
#11 FALSE c 1 5
#12 FALSE c 2 5
#13 FALSE c 3 5
#14 FALSE c 4 5
#15 FALSE c 5 5
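The pieces this fix relies on can be checked on a single group outside of dplyr. A sketch with made-up vectors mirroring groups "c" and "a" above; `pick` is a hypothetical helper name:

```r
index <- 1:5

# A group with no TRUE values (like group "c").
value_none <- rep(FALSE, 5)
hits       <- which(value_none)   # empty integer vector
first_hit  <- hits[1]             # subsetting an empty vector with [1] gives NA
picked     <- index[first_hit]    # indexing with NA gives NA, not an error

# The scalar if/else only evaluates the branch it needs.
pick <- function(value, index) {
  if (all(!value)) max(index) else index[which(value)[1]]
}

res_none <- pick(value_none, index)                            # falls back to max(index)
res_some <- pick(c(FALSE, TRUE, FALSE, FALSE, TRUE), index)    # first TRUE position
```

With `if`/`else`, the `index[which(value)[1]]` branch is never evaluated for the all-FALSE group, so it does not matter that it would be NA there.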
Optimizing ifelse on a large data frame
There has been some discussion about how `ifelse` is not the best option for code where speed is an important factor. You might instead try:
df$Mean.Result1 <- c("", "Equal")[(df$A > 0.05 & df$B > 0.05)+1]
To see what's going on here, let's break down the command. `df$A > 0.05 & df$B > 0.05` returns TRUE if both A and B exceed 0.05, and FALSE otherwise. Therefore, `(df$A > 0.05 & df$B > 0.05)+1` returns 2 if both A and B exceed 0.05 and 1 otherwise. These are used as indices into the vector `c("", "Equal")`, so we get "Equal" when both exceed 0.05 and "" otherwise.
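The intermediate steps can be inspected on a three-row example. The values below are made up so that each row exercises a different case:

```r
A <- c(0.01, 0.50, 0.90)
B <- c(0.20, 0.03, 0.60)

both_exceed <- A > 0.05 & B > 0.05     # logical: FALSE FALSE TRUE
idx         <- both_exceed + 1         # TRUE/FALSE coerced to 1/0, then shifted to 2/1
labels      <- c("", "Equal")[idx]     # look up position 1 or 2

reference <- ifelse(both_exceed, "Equal", "")
```

Only the third row has both values above 0.05, so only the third label is "Equal", matching what `ifelse` gives.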
Here's a comparison on a data frame with 1 million rows:
# Build dataset and functions
set.seed(144)
big.df <- data.frame(A = runif(1000000), B = runif(1000000))
OP <- function(df) {
df$Mean.Result1 <- ifelse(df$A > 0.05 & df$B > 0.05, "Equal", "")
df
}
josilber <- function(df) {
df$Mean.Result1 <- c("", "Equal")[(df$A > 0.05 & df$B > 0.05)+1]
df
}
all.equal(OP(big.df), josilber(big.df))
# [1] TRUE
# Benchmark
library(microbenchmark)
microbenchmark(OP(big.df), josilber(big.df))
# Unit: milliseconds
#              expr      min        lq      mean    median        uq      max neval
#        OP(big.df) 299.6265 311.56167 352.26841 318.51825 348.09461 540.0971   100
#  josilber(big.df)  40.4256  48.66967  60.72864  53.18471  59.72079 267.3886   100
The approach with vector indexing is about 6x faster in median runtime.