Is ifelse ever appropriate in a non-vectorized situation and vice-versa?
First, ifelse does NOT always evaluate both expressions - only if there are both TRUE and FALSE elements in the test vector.
ifelse(TRUE, 'foo', stop('bar')) # "foo"
And in my opinion: ifelse should not be used in a non-vectorized situation. It is always slower and more error prone to use ifelse over if / else:
# This is fairly common if/else code
if (length(letters) > 0) letters else LETTERS
# But this "equivalent" code will yield a very different result - TRY IT!
ifelse(length(letters) > 0, letters, LETTERS)
In vectorized situations though, ifelse can be a good choice - but beware that the length and attributes of the result might not be what you expect (as above, and I consider ifelse broken in that respect).
Here's an example: tst is of length 5 and has a class. I'd expect the result to be of length 10 and have no class, but that isn't what happens - it gets an incompatible class and length 5!
# a logical vector of class 'mybool'
tst <- structure(1:5 %%2 > 0, class='mybool')
# produces a numeric vector of class 'mybool'!
ifelse(tst, 101:110, 201:210)
#[1] 101 202 103 204 105
#attr(,"class")
#[1] "mybool"
Why would I expect the length to be 10? Because most functions in R "cycle" the shorter vector to match the longer:
1:5 + 1:10 # returns a vector of length 10.
...But ifelse only cycles the yes/no arguments to match the length of the test argument.
Why would I expect the class (and other attributes) to not be copied from the test object? Because <, which returns a logical vector, does not copy class and attributes from its (typically numeric) arguments. It doesn't do that because it would typically be very wrong.
1:5 < structure(1:10, class='mynum') # returns a logical vector without class
Finally, can it be more efficient to "do it yourself"? Well, it seems that ifelse is not a primitive like if, and it needs some special code to handle NA. If you don't have NAs, it can be faster to do it yourself.
tst <- 1:1e7 %%2 == 0
a <- rep(1, 1e7)
b <- rep(2, 1e7)
system.time( r1 <- ifelse(tst, a, b) ) # 2.58 sec
# If we know that a and b are of the same length as tst, and that
# tst doesn't have NAs, then we can do like this:
system.time( { r2 <- b; r2[tst] <- a[tst]; r2 } ) # 0.46 secs
identical(r1, r2) # TRUE
ifelse not working properly
The key is in ?ifelse:
‘ifelse’ returns a value with the same shape as ‘test’
(emphasis added). Since is.null(rcontrol) is a 1-element logical vector, what you get back is a 1-element thing (in this case the first element of rcontrol).
You are looking for either:
if (is.null(rcontrol)) { rcontrol <- rpart.control(cp=0.001,minbucket=100,minsplit = 5) }
or
rcontrol <- if (is.null(rcontrol)) [...] else rcontrol
(in this case the first idiom seems more appropriate since you don't do anything to rcontrol if the test is false)
Create data.frame conditional on another df without for loop
(a == 0) * mean(b$v1) + t(t(a) * c(tapply(b$v1, b$v2, mean)))
Run in pieces to understand what's happening. Also, note that this assumes ordered names in a (and 0's and 1's as entries in it, as per OP).
An alternative to a bunch of t's as above is using mapply (this assumes a is a data.frame or data.table and not a matrix, while the above doesn't care):
(a == 0) * mean(b$v1) + mapply(`*`, a, tapply(b$v1, b$v2, mean))
Advantage of switch over if-else statement
Use switch.
In the worst case the compiler will generate the same code as an if-else chain, so you don't lose anything. If in doubt, put the most common cases first in the switch statement.
In the best case the optimizer may find a better way to generate the code. Common things a compiler does are building a binary decision tree (which saves compares and jumps in the average case) or simply building a jump table (which works without compares at all).
Break out of the inner loop
The problem is that the inner loop only breaks if the first row of TRS matches. To make your code work you'd have to do it like this:
a_2011<- data.frame(c("10N11W11", "10N11W11", "10N12W7", "10N13W22" , "10N14W1"))
TRS <- data.frame(c("10N12W7","10N13W22","10N14W1", "10N15W33"))
for (i in 1:nrow(a_2011))
{
  flag <- 0
  for (j in 1:nrow(TRS))
  {
    if ( as.character(a_2011[i,1]) == as.character(TRS[j,1]) )
    {
      flag <- 1
      break
    }
  }
  a_2011$City[i] <- flag
}
You can remove the need for the inner loop like this:
a_2011<- data.frame(c("10N11W11", "10N11W11", "10N12W7", "10N13W22" , "10N14W1"))
TRS <- data.frame(c("10N12W7","10N13W22","10N14W1", "10N15W33"))
for (i in 1:nrow(a_2011))
{
  flag <- any(as.character(a_2011[i,1]) == as.character(TRS[,1]))
  a_2011$City[i] <- as.numeric(flag)
}
...And then, to simplify it further, you can remove the outer loop too:
a_2011<- data.frame(c("10N11W11", "10N11W11", "10N12W7", "10N13W22" , "10N14W1"))
TRS <- data.frame(c("10N12W7","10N13W22","10N14W1", "10N15W33"))
a_2011$City <- as.numeric(a_2011[[1]] %in% TRS[[1]])
What are the advantages of the apply functions? When are they better to use than for loops, and when are they not?
There are several reasons why one might prefer an apply family function over a for loop, or vice-versa.
Firstly, for() and apply(), sapply() will generally be just as quick as each other if executed correctly. lapply() does more of its work in compiled code within the R internals than the others, so can be faster than those functions. It appears the speed advantage is greatest when the act of "looping" over the data is a significant part of the compute time; in many general day-to-day uses you are unlikely to gain much from the inherently quicker lapply(). In the end, these all will be calling R functions so they need to be interpreted and then run.
for() loops can often be easier to implement, especially if you come from a programming background where loops are prevalent. Working in a loop may be more natural than forcing the iterative computation into one of the apply family functions. However, to use for() loops properly, you need to do some extra work to set up storage and manage plugging the output of the loop back together again. The apply functions do this for you automagically. E.g.:
IN <- runif(10)
OUT <- logical(length = length(IN))
# loop over the indices of IN (not its values) and store one result per element
for (i in seq_along(IN)) {
  OUT[i] <- IN[i] > 0.5
}
That is a silly example as > is a vectorised operator, but I wanted something to make a point, namely that you have to manage the output. The main thing is that with for() loops, you always allocate sufficient storage to hold the outputs before you start the loop. If you don't know how much storage you will need, then allocate a reasonable chunk of storage, and then in the loop check if you have exhausted that storage, and bolt on another big chunk of storage.
The main reason, in my mind, for using one of the apply family of functions is for more elegant, readable code. Rather than managing the output storage and setting up the loop (as shown above) we can let R handle that and succinctly ask R to run a function on subsets of our data. Speed usually does not enter into the decision, for me at least. I use the function that suits the situation best and will result in simple, easy to understand code, because I'm far more likely to waste more time than I save by always choosing the fastest function if I can't remember what the code is doing a day or a week or more later!
The apply family lend themselves to scalar or vector operations. A for() loop will often lend itself to doing multiple iterated operations using the same index i. For example, I have written code that uses for() loops to do k-fold or bootstrap cross-validation on objects. I probably would never entertain doing that with one of the apply family as each CV iteration needs multiple operations, access to lots of objects in the current frame, and fills in several output objects that hold the output of the iterations.
As to the last point, about why lapply() can possibly be faster than for() or apply(), you need to realise that the "loop" can be performed in interpreted R code or in compiled code. Yes, both will still be calling R functions that need to be interpreted, but if you are doing the looping and calling directly from compiled C code (e.g. lapply()) then that is where the performance gain can come from over, say, apply(), which boils down to a for() loop in actual R code. See the source for apply() to see that it is a wrapper around a for() loop, and then look at the code for lapply(), which is:
> lapply
function (X, FUN, ...)
{
FUN <- match.fun(FUN)
if (!is.vector(X) || is.object(X))
X <- as.list(X)
.Internal(lapply(X, FUN))
}
<environment: namespace:base>
and you should see why there can be a difference in speed between lapply() and for() and the other apply family functions. The .Internal() is one of R's ways of calling compiled C code used by R itself. Apart from a manipulation of X, and a sanity check on FUN, the entire computation is done in C, calling the R function FUN. Compare that with the source for apply().
pandas loc vs. iloc vs. at vs. iat?
loc: works on index labels
iloc: works on position
at: gets scalar values. It's a very fast loc
iat: gets scalar values. It's a very fast iloc
Also, at and iat are meant to access a scalar, that is, a single element in the dataframe, while loc and iloc are meant to access several elements at the same time, potentially to perform vectorized operations.
http://pyciencia.blogspot.com/2015/05/obtener-y-filtrar-datos-de-un-dataframe.html
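To make the four accessors concrete, here is a minimal sketch. The frame df, its column names a and b, and its index labels x, y, z are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical frame with string labels, so labels and positions differ
df = pd.DataFrame(
    {"a": [10, 20, 30], "b": [1.5, 2.5, 3.5]},
    index=["x", "y", "z"],
)

# Single element: loc/at take labels, iloc/iat take integer positions
by_label = df.loc["y", "a"]
by_position = df.iloc[1, 0]
scalar_label = df.at["y", "a"]
scalar_position = df.iat[1, 0]

# Several elements: only loc and iloc accept slices
label_slice = df.loc["x":"y", "a"]   # label slices include both endpoints
position_slice = df.iloc[0:2, 0]     # position slices exclude the stop
```

All four single-element lookups return the same value here. The two slices also select the same two rows, even though the label slice names its last row and the position slice stops one short of its end.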
Detect sign changes in Pandas Dataframe
Let us try
import numpy as np
np.sign(data).diff().ne(0)
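One thing to watch with the one-liner above: diff() leaves NaN in the first row, and NaN is "not equal to 0", so the first row always comes out True. Here is a small sketch of one way around that, filling the NaN with 0 first. The frame data and its column name value are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical input: the sign flips at row 2 (+ to -) and at row 4 (- to +)
data = pd.DataFrame({"value": [1.0, 2.0, -3.0, -1.0, 4.0]})

signs = np.sign(data)                  # -1, 0 or 1 per element, still a DataFrame
raw = signs.diff().ne(0)               # first row is True because NaN != 0
changed = signs.diff().fillna(0).ne(0) # first row no longer counts as a change

change_rows = changed.index[changed["value"]].tolist()
```

Whether the first row should count as a change depends on what you do with the mask afterwards, so treat the fillna(0) step as optional.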
How do I write a function in R to do calculations on a record?
Vectorization is one of the most fundamental (and unusual) things you'll need to get used to in R. Many (most?) R operations are vectorized. But a few things aren't - and if(){}else{} is one of the non-vectorized things. It's used for control flow (whether or not to run a code block) not for vector operations. ifelse() is a separate function that is used for vectors, where the first argument is a "test", and the 2nd and 3rd arguments are the "if yes" and "if no" results. The test is a vector, and the returned value is the appropriate yes/no result for each item in test. The result will be the same length as the test.
So we would write your IsPretty function like this:
IsPretty <- function(PetalWidth){
  return(ifelse(PetalWidth > 0.3, "Y", "N"))
}
df <- iris
df$Pretty = IsPretty(df$Petal.Width)
Contrast to an if(){...}else{...} block where the test condition is of length one, and arbitrary code can be run in the ... - it may return a bigger result than the test, or a smaller result, or no result, or it might modify other objects... You can do anything inside if(){}else{}, but the test condition must have length 1.
You could use your IsPretty function one row at a time - it will work fine for any one row. So we could put it in a loop as below, checking one row at a time, giving if() one test at a time, assigning results one at a time. But R is optimized for vectorization, and this will be noticeably slower and is a bad habit.
IsPrettyIf <- function(PetalWidth){
  if (PetalWidth > 0.3) return("Y")
  return("N")
}
for(i in 1:nrow(df)) {
  df$PrettyLoop[i] = IsPrettyIf(df$Petal.Width[i])
}
The benchmark below shows that the vectorized version is roughly 50x faster by mean time (about 80x by median). This is such a simple case and such small data that it doesn't much matter, but on larger data, or with more complex operations, the difference between vectorized and non-vectorized code can be minutes vs days.
microbenchmark::microbenchmark(
  loop = {
    for(i in 1:nrow(df)) {
      df$PrettyLoop[i] = IsPrettyIf(df$Petal.Width[i])
    }
  },
  vectorized = {
    df$Pretty = IsPretty(df$Petal.Width)
  }
)
Unit: microseconds
       expr    min     lq     mean median      uq     max neval
       loop 3898.9 4365.6 5880.623 5442.3 7041.10 11344.6   100
 vectorized   47.7   59.6  112.288   67.4   83.85  1819.4   100
This is a common bump for R learners - you can find many questions on Stack Overflow where people are using if(){}else{} when they need ifelse() or vice versa. Why can't ifelse return vectors? is a FAQ coming from the opposite side of the problem.
What goes on in your attempt?
df <- iris
## The condition has length equal to the number of rows in the data frame
df$Petal.Width > 0.3
#> [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
#> [13] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## ... truncated
## R warns us that only the first value (which happens to be FALSE) is used
## (since R 4.2.0 a condition of length > 1 is an error instead of a warning)
result = if(df$Petal.Width > 0.3) {"Y"} else {"N"}
#> Warning in if (df$Petal.Width > 0.3) {: the condition has length > 1 and only
#> the first element will be used
## So the result is a single "N"
result
#> [1] "N"
length(result)
#> [1] 1
## R "recycles" inputs that are of insufficient length
## so we get a full column of "N"
df$Pretty = result
head(df)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Pretty
#> 1          5.1         3.5          1.4         0.2  setosa      N
#> 2          4.9         3.0          1.4         0.2  setosa      N
#> 3          4.7         3.2          1.3         0.2  setosa      N
#> 4          4.6         3.1          1.5         0.2  setosa      N
#> 5          5.0         3.6          1.4         0.2  setosa      N
#> 6          5.4         3.9          1.7         0.4  setosa      N
Created on 2020-11-08 by the reprex package (v0.3.0)
When would you use an array rather than a vector/string?
When writing code that should be used in other projects, in particular if you target special platforms (embedded, game consoles, etc.) where STL might not be present.
Old projects or projects with special requirements might not want to introduce dependencies on STL libraries. An interface depending on arrays, char* or whatever will be compatible with anything since it's part of the language. STL however is not guaranteed to be present in all build environments.