Is Ifelse Ever Appropriate in a Non-Vectorized Situation and Vice-Versa

is ifelse ever appropriate in a non-vectorized situation and vice-versa?

First, ifelse does NOT always evaluate both expressions - it evaluates both only when the test vector contains both TRUE and FALSE elements.

ifelse(TRUE, 'foo', stop('bar')) # "foo"

And in my opinion:

ifelse should not be used in a non-vectorized situation. It is always slower and more error prone to use ifelse over if / else:

# This is fairly common if/else code
if (length(letters) > 0) letters else LETTERS

# But this "equivalent" code will yield a very different result - TRY IT!
ifelse(length(letters) > 0, letters, LETTERS)
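If you don't have R at hand: because the test length(letters) > 0 has length one, the ifelse result takes the shape of the test, so you get back only the first element of letters:

```r
# if/else returns the whole vector of 26 letters...
if (length(letters) > 0) letters else LETTERS

# ...but ifelse shapes its result like the length-1 test,
# so it returns just the first element of letters
ifelse(length(letters) > 0, letters, LETTERS)
#> [1] "a"
```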

In vectorized situations though, ifelse can be a good choice - but beware that the length and attributes of the result might not be what you expect (as above, and I consider ifelse broken in that respect).

Here's an example: tst is of length 5 and has a class. I'd expect the result to be of length 10 and have no class, but that isn't what happens - it gets an incompatible class and length 5!

# a logical vector of class 'mybool'
tst <- structure(1:5 %% 2 > 0, class = 'mybool')

# produces a numeric vector of class 'mybool'!
ifelse(tst, 101:110, 201:210)
#[1] 101 202 103 204 105
#attr(,"class")
#[1] "mybool"

Why would I expect the length to be 10? Because most functions in R "cycle" the shorter vector to match the longer:

1:5 + 1:10 # returns a vector of length 10.

...But ifelse only cycles the yes/no arguments to match the length of the tst argument.
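For example, with a length-2 test and length-10 yes/no arguments, the result still has length 2:

```r
# The result is shaped like the test (length 2), not like yes/no (length 10)
ifelse(c(TRUE, FALSE), 1:10, 101:110)
#> [1]   1 102
```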

Why would I expect the class (and other attributes) to not be copied from the test object? Because <, which returns a logical vector, does not copy class and attributes from its (typically numeric) arguments. It doesn't do that because doing so would typically be very wrong.

1:5 < structure(1:10, class='mynum') # returns a logical vector without class

Finally, can it be more efficient to "do it yourself"? Well, it seems that ifelse is not a primitive like if, and it needs some special code to handle NA. If you don't have NAs, it can be faster to do it yourself.

tst <- 1:1e7 %%2 == 0
a <- rep(1, 1e7)
b <- rep(2, 1e7)
system.time( r1 <- ifelse(tst, a, b) ) # 2.58 sec

# If we know that a and b are of the same length as tst, and that
# tst doesn't have NAs, then we can do like this:
system.time( { r2 <- b; r2[tst] <- a[tst]; r2 } ) # 0.46 secs

identical(r1, r2) # TRUE

ifelse not working properly

The key is in ?ifelse:

‘ifelse’ returns a value with the same shape as ‘test’

(emphasis added). Since is.null(rcontrol) is a 1-element logical vector, what you get back is a 1-element thing (in this case the first element of rcontrol).

You are looking for either:

if (is.null(rcontrol)) { rcontrol <- rpart.control(cp = 0.001, minbucket = 100, minsplit = 5) }

or

rcontrol <- if (is.null(rcontrol)) [...] else rcontrol

(in this case the first idiom seems more appropriate since you don't do anything to rcontrol if the test is false)

Create data.frame conditional on another df without for loop

(a == 0) * mean(b$v1) + t(t(a) * c(tapply(b$v1, b$v2, mean)))

Run in pieces to understand what's happening. Also, note that this assumes ordered names in a (and 0's and 1's as entries in it, as per OP).

An alternative to a bunch of t's as above is using mapply (this assumes a is a data.frame or data.table and not a matrix, while the above doesn't care):

(a == 0) * mean(b$v1) + mapply(`*`, a, tapply(b$v1, b$v2, mean))
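The OP's data isn't shown, so here is a made-up b (values v1 grouped by v2) and a 0/1 indicator data.frame a with one column per group, just to show that the two expressions agree:

```r
b <- data.frame(v1 = c(1, 2, 3, 4), v2 = c("x", "x", "y", "y"))
a <- data.frame(x = c(1, 0), y = c(0, 1))

# group means: x -> 1.5, y -> 3.5; overall mean -> 2.5
r1 <- (a == 0) * mean(b$v1) + t(t(a) * c(tapply(b$v1, b$v2, mean)))
r2 <- (a == 0) * mean(b$v1) + mapply(`*`, a, tapply(b$v1, b$v2, mean))

# 1-entries become their group mean, 0-entries the overall mean
all(r1 == r2)
#> [1] TRUE
```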

Advantage of switch over if-else statement

Use switch.

In the worst case the compiler will generate the same code as an if/else chain, so you don't lose anything. If in doubt, put the most common cases first in the switch statement.

In the best case the optimizer may find a better way to generate the code. Common things a compiler does are building a binary decision tree (which saves compares and jumps in the average case) or simply building a jump table (which works without any compares at all).

Break out of the inner loop

The problem is that the inner loop only breaks if the first row of TRS matches. To make your code work, you'd have to do something like this:

a_2011<- data.frame(c("10N11W11", "10N11W11", "10N12W7", "10N13W22" , "10N14W1"))
TRS <- data.frame(c("10N12W7","10N13W22","10N14W1", "10N15W33"))

for (i in 1:nrow(a_2011)) {
  flag <- 0
  for (j in 1:nrow(TRS)) {
    if (as.character(a_2011[i, 1]) == as.character(TRS[j, 1])) {
      flag <- 1
      break
    }
  }
  a_2011$City[i] <- flag
}

You can remove the need for the inner loop like this:

a_2011<- data.frame(c("10N11W11", "10N11W11", "10N12W7", "10N13W22" , "10N14W1"))
TRS <- data.frame(c("10N12W7","10N13W22","10N14W1", "10N15W33"))

for (i in 1:nrow(a_2011)) {
  flag <- any(as.character(a_2011[i, 1]) == as.character(TRS[, 1]))
  a_2011$City[i] <- as.numeric(flag)
}

...And then to simplify it further, you can remove the outer loop too:

a_2011<- data.frame(c("10N11W11", "10N11W11", "10N12W7", "10N13W22" , "10N14W1"))
TRS <- data.frame(c("10N12W7","10N13W22","10N14W1", "10N15W33"))

a_2011$City <- as.numeric(a_2011[[1]] %in% TRS[[1]])

What are the advantages of the apply functions? When are they better to use than for loops, and when are they not?

There are several reasons why one might prefer an apply family function over a for loop, or vice-versa.

Firstly, for(), apply(), and sapply() will generally be about as quick as each other if used correctly. lapply() does more of its operating in compiled code within the R internals than the others, so it can be faster than those functions. The speed advantage appears greatest when the act of "looping" over the data is a significant part of the compute time; in many general day-to-day uses you are unlikely to gain much from the inherently quicker lapply(). In the end, all of these will be calling R functions, so those functions still need to be interpreted and then run.

for() loops can often be easier to implement, especially if you come from a programming background where loops are prevalent. Working in a loop may be more natural than forcing the iterative computation into one of the apply family of functions. However, to use for() loops properly, you need to do some extra work to set up storage and manage plugging the output of the loop back together again. The apply functions do this for you automagically. E.g.:

IN <- runif(10)
OUT <- logical(length = length(IN))
for(i in seq_along(IN)) {
  OUT[i] <- IN[i] > 0.5
}

that is a silly example as > is a vectorised operator but I wanted something to make a point, namely that you have to manage the output. The main thing is that with for() loops, you always allocate sufficient storage to hold the outputs before you start the loop. If you don't know how much storage you will need, then allocate a reasonable chunk of storage, and then in the loop check if you have exhausted that storage, and bolt on another big chunk of storage.
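A sketch of that allocate-and-grow pattern, for when the final length isn't known up front (the chunk size of 1000 is arbitrary):

```r
chunk <- 1000
out <- numeric(chunk)   # allocate a reasonable initial chunk
n <- 0                  # number of results stored so far

for (x in runif(2500)) {
  n <- n + 1
  if (n > length(out)) {
    out <- c(out, numeric(chunk))  # storage exhausted: bolt on another chunk
  }
  out[n] <- x^2
}

out <- out[seq_len(n)]  # trim the unused tail
length(out)
#> [1] 2500
```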

The main reason, in my mind, for using one of the apply family of functions is for more elegant, readable code. Rather than managing the output storage and setting up the loop (as shown above) we can let R handle that and succinctly ask R to run a function on subsets of our data. Speed usually does not enter into the decision, for me at least. I use the function that suits the situation best and will result in simple, easy to understand code, because I'm far more likely to waste more time than I save by always choosing the fastest function if I can't remember what the code is doing a day or a week or more later!

The apply family lend themselves to scalar or vector operations. A for() loop will often lend itself to doing multiple iterated operations using the same index i. For example, I have written code that uses for() loops to do k-fold or bootstrap cross-validation on objects. I probably would never entertain doing that with one of the apply family as each CV iteration needs multiple operations, access to lots of objects in the current frame, and fills in several output objects that hold the output of the iterations.

As to the last point, about why lapply() can possibly be faster than for() or apply(), you need to realise that the "loop" can be performed in interpreted R code or in compiled code. Yes, both will still be calling R functions that need to be interpreted, but if the looping and calling are done directly from compiled C code (as in lapply()), then that is where the performance gain can come from, compared with, say, apply(), which boils down to a for() loop in actual R code. See the source for apply() to see that it is a wrapper around a for() loop, and then look at the code for lapply(), which is:

> lapply
function (X, FUN, ...)
{
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X))
        X <- as.list(X)
    .Internal(lapply(X, FUN))
}
<environment: namespace:base>

and you should see why there can be a difference in speed between lapply() and for() and the other apply family functions. The .Internal() is one of R's ways of calling compiled C code used by R itself. Apart from a manipulation, and a sanity check on FUN, the entire computation is done in C, calling the R function FUN. Compare that with the source for apply().

pandas loc vs. iloc vs. at vs. iat?

loc: label-based - it works on the index (row/column labels)

iloc: integer position-based

at: gets a scalar value; it's a very fast loc

iat: gets a scalar value; it's a very fast iloc

Also,

at and iat are meant to access a scalar, that is, a single element in the dataframe, while loc and iloc are meant to access several elements at the same time, potentially to perform vectorized operations.

http://pyciencia.blogspot.com/2015/05/obtener-y-filtrar-datos-de-un-dataframe.html
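A quick illustration (the column names and index labels here are made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30], "b": [40, 50, 60]},
                  index=["x", "y", "z"])

df.loc["y", "b"]   # label-based                    -> 50
df.iloc[1, 1]      # integer position-based         -> 50
df.at["y", "b"]    # fast scalar access by label    -> 50
df.iat[1, 1]       # fast scalar access by position -> 50

# loc/iloc can also select several elements at once:
df.loc[["x", "z"], "a"]   # a Series with values 10 and 30
```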

Detect sign changes in Pandas Dataframe

Let us try

import numpy as np 
np.sign(data).diff().ne(0)
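For example, with a hypothetical Series (note that the first element compares True because diff() produces NaN there, and NaN != 0):

```python
import numpy as np
import pandas as pd

data = pd.Series([3, 2, -1, -5, 4, 4])

# sign is 1, 1, -1, -1, 1, 1; diff() is nonzero exactly where it flips
changes = np.sign(data).diff().ne(0)
print(changes.tolist())
# [True, False, True, False, True, False]
```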

How do I write a function in R to do calculations on a record?

Vectorization is one of the most fundamental (and unusual) things you'll need to get used to in R. Many (most?) R operations are vectorized. But a few things aren't - and if(){}else{} is one of the non-vectorized things. It's used for control flow (whether or not to run a code block) not for vector operations. ifelse() is a separate function that is used for vectors, where the first argument is a "test", and the 2nd and 3rd arguments are the "if yes" and "if no" results. The test is a vector, and the returned value is the appropriate yes/no result for each item in test. The result will be the same length as the test.

So we would write your IsPretty function like this:

IsPretty <- function(PetalWidth){
  return(ifelse(PetalWidth > 0.3, "Y", "N"))
}

df <- iris
df$Pretty = IsPretty(df$Petal.Width)

Contrast this with an if(){...}else{...} block, where the test condition has length one and arbitrary code can be run in the ... - it may return a bigger result than the test, or a smaller one, or no result at all, and it might modify other objects. You can do anything inside if(){}else{}, but the test condition must have length 1.

You could use your IsPretty function one row at a time - it will work fine for any one row. So we could put it in a loop as below, checking one row at a time, giving if() one test at a time, and assigning results one at a time. But R is optimized for vectorization, and this approach will be noticeably slower; it's a bad habit.

IsPrettyIf <- function(PetalWidth){
  if (PetalWidth > 0.3) return("Y")
  return("N")
}

for(i in 1:nrow(df)) {
  df$PrettyLoop[i] = IsPrettyIf(df$Petal.Width[i])
}

A benchmark below shows that the vectorized version is 50x faster. This is such a simple case and such small data that it doesn't much matter, but on larger data, or with more complex operations the difference between vectorized and non-vectorized code can be minutes vs days.

microbenchmark::microbenchmark(
  loop = {
    for(i in 1:nrow(df)) {
      df$PrettyLoop[i] = IsPrettyIf(df$Petal.Width[i])
    }
  },
  vectorized = {
    df$Pretty = IsPretty(df$Petal.Width)
  }
)
Unit: microseconds
       expr    min     lq     mean median      uq     max neval
       loop 3898.9 4365.6 5880.623 5442.3 7041.10 11344.6   100
 vectorized   47.7   59.6  112.288   67.4   83.85  1819.4   100

This is a common bump for R learners - you can find many questions on Stack Overflow where people are using if(){}else{} when they need ifelse() or vice versa. Why can't ifelse return vectors? is a FAQ coming from the opposite side of the problem.



What goes on in your attempt?

df <- iris

## The condition has length equal to the number of rows in the data frame
df$Petal.Width > 0.3
#> [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
#> [13] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## ... truncated

## R warns us that only the first value (which happens to be FALSE) is used
result = if(df$Petal.Width > 0.3) {"Y"} else {"N"}
#> Warning in if (df$Petal.Width > 0.3) {: the condition has length > 1 and only
#> the first element will be used

## So the result is a single "N"
result
#> [1] "N"

length(result)
#> [1] 1

## R "recycles" inputs that are of insufficient length
## so we get a full column of "N"
df$Pretty = result
head(df)
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species Pretty
#> 1 5.1 3.5 1.4 0.2 setosa N
#> 2 4.9 3.0 1.4 0.2 setosa N
#> 3 4.7 3.2 1.3 0.2 setosa N
#> 4 4.6 3.1 1.5 0.2 setosa N
#> 5 5.0 3.6 1.4 0.2 setosa N
#> 6 5.4 3.9 1.7 0.4 setosa N

Created on 2020-11-08 by the reprex package (v0.3.0)

When would you use an array rather than a vector/string?

When writing code that should be used in other projects, particularly if you target special platforms (embedded systems, game consoles, etc.) where the STL might not be present.

Old projects, or projects with special requirements, might not want to introduce dependencies on the STL. An interface depending on arrays, char*, or whatever will be compatible with anything, since it's part of the language. The STL, however, is not guaranteed to be present in all build environments.


