Learning to Write Functions in R

Learning to write functions in R

At a glance, the biggest thing that you can do is to not use non-standard-evaluation shortcuts inside your functions: things like $, subset() and with(). These are functions intended for convenient interactive use, not extensible programmatic use. (See, e.g., the Warning in ?subset which should probably be added to ?with, fortunes::fortune(312), fortunes::fortune(343).)

fortunes::fortune(312)

The problem here is that the $ notation is a magical shortcut and like
any other magic if used incorrectly is likely to do the programmatic
equivalent of turning yourself into a toad. -- Greg Snow (in
response to a user that wanted to access a column whose name is stored
in y via x$y rather than x[[y]])
R-help (February 2012)

fortunes::fortune(343)

Sooner or later most R beginners are bitten by this all too convenient shortcut. As an R
newbie, think of R as your bank account: overuse of $-extraction can lead to undesirable
consequences. It's best to acquire the [[ and [ habit early.
-- Peter Ehlers (about the use of $-extraction)
R-help (March 2013)

When you start writing functions that work on data frames, if you need to reference column names you should pass them in as strings, and then use [ or [[ to get the column based on the string stored in the variable. This is the simplest way to make functions flexible with user-specified column names. For example, here's a simple, stupid function that tests whether a data frame has a column of the given name:

does_col_exist_1 <- function(df, col) {
  return(!is.null(df$col))
}

does_col_exist_2 <- function(df, col) {
  return(!is.null(df[[col]]))  # equivalent to df[, col]
}

These yield:

does_col_exist_1(mtcars, col = "jhfa")
# [1] FALSE
does_col_exist_1(mtcars, col = "mpg")
# [1] FALSE

does_col_exist_2(mtcars, col = "jhfa")
# [1] FALSE
does_col_exist_2(mtcars, col = "mpg")
# [1] TRUE

The first function is wrong because $ doesn't evaluate what comes after it: no matter what value I set col to when I call the function, df$col will look for a column literally named "col". The brackets, however, do evaluate col and see "oh hey, col is set to "mpg"; let's look for a column of that name."
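The same behavior can be seen directly at the console, without wrapping anything in a function (mtcars has no column whose name starts with "col", so even $'s partial matching finds nothing):

```r
col <- "mpg"
is.null(mtcars$col)     # TRUE  -- $ looks for a column literally named "col"
is.null(mtcars[[col]])  # FALSE -- [[ evaluates col first, then looks up "mpg"
```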

If you want lots more understanding of this issue, I'd recommend the Non-Standard Evaluation Section of Hadley Wickham's Advanced R book.

I'm not going to re-write and debug your functions, but if I wanted to my first step would be to remove all $, with(), and subset(), replacing with [. There's a pretty good chance that's all you need to do.
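As a sketch of what that replacement looks like, here is a hypothetical helper (the name and arguments are made up for illustration): an NSE version like with(subset(df, cyl == 4), mean(mpg)) hard-codes the column names, whereas passing names as strings and using [ and [[ keeps the function generic.

```r
# Columns are passed as strings; [ and [[ evaluate them normally.
mean_for_group <- function(df, group_col, group_val, value_col) {
  rows <- df[[group_col]] == group_val
  mean(df[rows, value_col])
}

mean_for_group(mtcars, "cyl", 4, "mpg")  # mean mpg of the 4-cylinder cars
```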

writing functions vs. line-by-line interpretation in an R workflow

I don't think there is a single answer. The best thing to do is keep the relative merits in mind and then pick an approach suited to the situation at hand.

1) Functions. The advantage of not using functions is that all your variables are left in the workspace and you can examine them at the end, which may help you figure out what is going on if you have problems.

On the other hand, the advantage of well-designed functions is that you can unit test them: that is, you can test them apart from the rest of the code, which makes problems easier to isolate. Also, modulo certain lower-level constructs, the results of one function won't affect the others unless they are explicitly passed out, which limits the damage that one function's erroneous processing can do to another's. Finally, you can use R's debug facility on your functions, and being able to single-step through them is an advantage.
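A minimal sketch of what "testing apart from the rest of the code" means, using only base R's stopifnot() (dedicated packages such as testthat offer richer tooling; the function here is a stand-in):

```r
# A pure function can be checked in isolation, with no workspace state involved.
z_score <- function(x) (x - mean(x)) / sd(x)

stopifnot(
  abs(mean(z_score(1:10))) < 1e-12,   # standardized values have mean ~ 0
  abs(sd(z_score(1:10)) - 1) < 1e-12  # ... and sd ~ 1
)
```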

2) LCFD. Whether you should use a load/clean/func/do decomposition at all, regardless of whether it's done via source or functions, is a second question. The problem with this decomposition, either way, is that you need to run one step just to be able to test out the next, so you can't really test them independently. From that viewpoint it's not the ideal structure.

On the other hand, it does have the advantage that you may be able to replace the load step independently of the other steps if you want to try it on different data and can replace the other steps independently of the load and clean steps if you want to try different processing.
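For concreteness, here is a hedged sketch of the load/clean/func/do split expressed as functions rather than source()d scripts; the file layout, column names, and step contents are all made up:

```r
# Each stage is a function, so any stage can be swapped out or fed
# synthetic data without touching the others.
load_data  <- function(path) read.csv(path)
clean_data <- function(raw) raw[!is.na(raw$value), ]
analyze    <- function(cleaned) aggregate(value ~ group, data = cleaned, FUN = mean)
report     <- function(result) print(result)

# Testing the downstream stages on fake data, with no file load required:
fake <- data.frame(group = c("a", "a", "b"), value = c(1, NA, 3))
report(analyze(clean_data(fake)))
```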

3) Number of files. There may be a third question implicit in what you are asking: whether everything should be in one source file or several. The advantage of putting things in different source files is that you don't have to look at irrelevant items. In particular, routines that are unused or not relevant to the function you are currently looking at won't interrupt the flow, since you can arrange for them to live in other files.

On the other hand, there may be advantages to putting everything in one file:

(a) deployment: you can just send someone that single file;

(b) editing convenience: you can put the entire program in a single editor session, which facilitates searching (you can search the whole program with the editor's functions without first determining which file a routine is in); successive undo commands will move you backward across all units of your program; and a single save records the current state of all modules, since there is only one;

(c) speed: if you are working over a slow network, it may be faster to keep a single file on your local machine and just write it out occasionally, rather than having to go back and forth to the slow remote.

Note: One other thing to think about is that using packages may be superior for your needs relative to sourcing files in the first place.

Writing functions in R

You cannot use the $ sign with a variable. Try instead:

data[,var]

where var must be a character, e.g. "speed"

dscore <- function(data, var) {
  ave <- mean(data[, var])
  std <- sd(data[, var])  # renamed from sd so we don't shadow stats::sd()

  data[, paste0(var, "dscore")] <- (data[, var] - ave) / std

  return(data)
}

cars<-dscore(cars,var="speed")

Write functions for methods in Rstudio

Since you are already using some S3 method dispatch with class(obj) <- "class", you should be able to do:

`[.rolls` <- function(obj, ind) obj$rolls[ind]
`[<-.rolls` <- function(obj, ind, value) { obj[["rolls"]][ind] <- value; obj; }

Some fake data:

foo <- list(rolls=10L+1:5, sides=6, probs = 1/6)
class(foo) <- "rolls"
foo[3]
# [1] 13
foo[3] <- 99L
foo
# $rolls
# [1] 11 12 99 14 15
# $sides
# [1] 6
# $probs
# [1] 0.1666667
# attr(,"class")
# [1] "rolls"

You can go so far as to pretty-print the object, though this is only useful interactively:

print.rolls <- function(x, ...) {
  cat("<Rolls>\n")
  cat("  len: ", length(x[["rolls"]]), "\n")
  cat("  other properties: ", paste(sort(setdiff(names(x), "rolls")), collapse = ", "), "\n")
  invisible(x)  # print methods conventionally return their argument invisibly
}
foo
# <Rolls>
# len: 5
# other properties: probs, sides

Writing an R function that creates columns

You need to return the new object:

createColumns <- function(df) {
  df$Country1 <- ifelse(as.integer(df$CountryID == 1), 1, 0)
  df$Country2 <- ifelse(as.integer(df$CountryID == 2), 1, 0)
  df$Country3 <- ifelse(as.integer(df$CountryID == 3), 1, 0)
  # ... 60 more lines
  df
}

And use it thus:

df <- createColumns(df)

You can’t (easily) modify non-local objects in R, and this is on purpose: you shouldn’t do that. Instead, make assignment explicit, as done above.


There are some other things to note; for example, repeating what you wrote 60 times should be a huge red flag. Rethink the problem: you probably don’t want 60 columns for this in the first place (instead, research and use the concept of tidy data) but if you really do, you can probably use pivot functions to replace those 60 lines of code with a single line of code.
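As a sketch of what collapsing those 60 lines might look like in base R (the example data is made up; model.matrix() is another common route to indicator columns):

```r
# One loop replaces the 60 near-identical assignments.
df <- data.frame(CountryID = c(1, 2, 3, 2, 1))  # hypothetical example data
for (id in sort(unique(df$CountryID))) {
  df[[paste0("Country", id)]] <- df$CountryID == id
}

# Or, in one expression, an indicator matrix with a column per level:
# model.matrix(~ factor(CountryID) - 1, df)
```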

Furthermore, R has a logical data type. This means that, in virtually all use-cases, you’d assign TRUE and FALSE instead of 1 and 0. So, instead of writing

ifelse(as.integer(df$CountryID == 1), 1, 0)

You’d simplify this to

df$CountryID == 1

In addition, the as.integer call is entirely redundant: ifelse requires a logical argument, so your current code takes a logical vector, explicitly converts it to integer, and the result is converted back to a logical vector by ifelse.

Writing functions in R, keeping scoping in mind

If I know that I'm going to need a function parametrized by some values and called repeatedly, I avoid globals by using a closure:

make.fn2 <- function(a, b) {
fn2 <- function(x) {
return( x + a + b )
}
return( fn2 )
}

a <- 2; b <- 3
fn2.1 <- make.fn2(a, b)
fn2.1(3) # 8
fn2.1(4) # 9

a <- 4
fn2.2 <- make.fn2(a, b)
fn2.2(3) # 10
fn2.1(3) # 8

This neatly avoids referencing global variables, instead using the enclosing environment of the function for a and b. Modification of globals a and b doesn't lead to unintended side effects when fn2 instances are called.

How to organize big R functions?

Option 1

One option is to use switch instead of multiple if statements:

myfun <- function(y, type = c("aa", "bb", "cc", "dd" ... "zz")) {
  switch(type,
    "aa" = sub_fun_aa(y),
    "bb" = sub_fun_bb(y),
    "cc" = sub_fun_cc(y),
    "dd" = sub_fun_dd(y)
  )
}

Option 2

In your edited question you gave far more specific information. Here is a general design pattern that you might want to consider. The key element in this pattern is that there is not a single if in sight; I replace it with match.fun, where the key idea is that the type in your function is itself a function (yes, since R supports functional programming, this is allowed):

sharpening <- function(x){
paste(x, "General sharpening", sep=" - ")
}

unsharpMask <- function(x){
y <- sharpening(x)
#... Some specific stuff here...
paste(y, "Unsharp mask", sep=" - ")
}

hiPass <- function(x) {
y <- sharpening(x)
#... Some specific stuff here...
paste(y, "Hipass filter", sep=" - ")
}

generalMethod <- function(x, type = c("hiPass", "unsharpMask", ...)) {
  match.fun(type)(x)
}

And call it like this:

> generalMethod("stuff", "unsharpMask")
[1] "stuff - General sharpening - Unsharp mask"
> hiPass("mystuff")
[1] "mystuff - General sharpening - Hipass filter"

