Coding Practice in R: What Are the Advantages and Disadvantages of Different Styles

Coding principles in R - Looking for a book/web tutorial for writing complex programs in R

UPDATE:

There are two more recent books that you definitely need to check out when writing packages:

Advanced R from Hadley Wickham, explaining about environments and other advanced topics.

R Packages from Hadley Wickham, giving a great guide for package writing


There isn't one book or style guide for writing R packages; there are numerous books about R that include package writing etc, and the R internals give you a style guide as well.

R coding standards from R internals

The books that contain the most advanced information about R as a programming language are in my view the following two:

R programming for bioinformatics from Robert Gentleman

Software for data analysis: Programming with R from John Chambers

Both books give a lot of insight into R itself and contain useful style tips. Gentleman focuses on object-oriented programming (as Bioconductor is largely S4 based), and Chambers is difficult to read but a rich mine of information.

Next to that, you have a lot of information on stackoverflow to get ideas:

Coding practice in R : what are the advantages and disadvantages of different styles?

Function commenting conventions in R

any R style guide / checker?

What is your preferred style for naming variables in R?

Common R idioms

But basically you'll have to sit down with your team and agree on a standard. There's no 'best' way, so you all just have to agree on a good way you all use in order to keep the code consistent.

any R style guide / checker?

I think if you want such a tool, you may have to write it yourself. The reason is that R does not have an equivalent to Python's PEP8; that is, an "official style guide" that has been handed down from on high and is followed by the vast majority of R programmers.

In addition there are a lot of stylistic inconsistencies in the R core itself; this is a consequence of the way in which R evolved as a language. For example, many functions in R core follow the form of foo.bar and were written before the S3 object system came along and used that notation for method dispatch. In hindsight, the naming of these functions should probably be changed in the interests of consistency and clarity, but it is too late to consider that now.

In summary, there is no official "style lint" tool for R because the R core itself contains so much style lint, about which nothing can be done, that writing one would be very difficult. For every rule ("don't do this") there would have to be a long list of exceptions ("except in this case, and this case, and this one, and ..., where it was done for historical purposes").
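The clash between dotted names and S3 dispatch described above can be sketched as follows. This is a minimal illustration, not code from R core: describe, describe.frame, and the class name "frame" are all made up for the example.

```r
# A generic with a default method.
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "default method"

# Intended as an ordinary helper with a dot in its name, NOT as an S3 method.
describe.frame <- function(x, ...) "helper for frames"

# An unrelated object that happens to carry the (hypothetical) class "frame".
picture <- structure(list(width = 4, height = 3), class = "frame")

describe(1:3)      # no matching class, so describe.default is used
describe(picture)  # dispatches to describe.frame, although it was meant as a helper
```

Because dispatch works purely on the generic.class naming pattern, a lint rule cannot tell from the name alone whether a dotted function is a method or a plain function.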

Code Building Process and Embedded Functions

IMHO, speed should be the last of your concerns when writing code, especially if you are a beginner. Instead, your primary focus should be on simplicity, readability, and modularity. Don't get me wrong, efficiency is a great thing, and you'll find many ways to make your code faster when needed, but it should not be a priority by itself.

So I'll be giving tips about style mostly. To illustrate, here is what my version of your code would look like. Please bear in mind that I do not know what your code is computing, so I did my best to break it up using meaningful variable names.

IVCV <- function(stack) {

  ## This function computes [...] IVCV stands for [...]
  ## Inputs:
  ## - stack: a matrix where each column [...]
  ## Output: a matrix [...]

  n <- nrow(stack) # stack size
  stack.ratios <- stack[-n, ] / stack[-1, ]
  log.ratios <- log(stack.ratios)
  ivcv <- solve(var(log.ratios))

  return(ivcv)
}

ExcessReturn <- function(stack) {

  ## This function computes [...] IVCV stands for [...]
  ## Inputs:
  ## - stack: a matrix where each column [...]
  ## Output: a matrix [...]

  n <- nrow(stack) # stack size
  total.ratio <- stack[1, ] / stack[n, ]
  excess.return <- (1 + log(total.ratio)) ^ (1 / n) - 1

  return(excess.return)
}

ExcessReturnTimesIVCV <- function(stack) {

  ## This function computes [...] IVCV stands for [...]
  ## Inputs:
  ## - stack: a matrix where each column [...]
  ## Output: a vector [...]

  return(IVCV(stack) %*% ExcessReturn(stack))
}

1) yes, break your code into small functions. It is better for readability, flexibility, and maintenance. It also makes unit testing easier, where you can design tests for each elementary piece of code.

2) document a function by including comments about its description/inputs/output inside the body of the function. This way, after the function is created, the user can see its description as part of the function's printout (e.g., just type ExcessReturnTimesIVCV in the GUI).

3) break out complexity into multiple statements. Right now, all three of your suggestions are hard to understand, with too many things going on in each line. A statement should do one simple thing so it reads easily. Creating more objects is unlikely to slow down your process, and it will make debugging much easier.

4) your object names are key to making your code clear. Choose them well and use a consistent syntax. I use UpperCamelCase for my own functions' names, and lowercase words separated with dots for most other objects.

5) put comments, especially where 3) and 4) are not enough to make the code clear. In my example, I chose to use a variable n. I went against the recommendation that variable names should be descriptive, but it was to make the code a little lighter and give expressions like stack[-n, ] / stack[-1, ] some nice symmetry. Since n is a bad name, I put a comment explaining its meaning. I might also have put more comments in the code if I knew what the functions were really doing.

6) Use consistent syntax rules, mostly to improve readability. You'll hear different opinions about what should be used here. In general, there is not one best approach. The most important thing is to make a choice and stick with it. So here are my suggestions:

a) one statement per line, no semicolons.

b) consistent spacing and indentation (no tabs). I put spaces after commas and around binary operators. I also use extra spacing to line things up if it helps readability.

c) consistent bracing: be careful about the way you use curly brackets to define blocks, otherwise you are likely to get problems in script mode. See Section 8.1.43 of the R Inferno (a great reference).
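To illustrate point 1), here is a minimal sketch of how a small function can be unit tested on its own. LogRatios is a hypothetical helper written in the same spirit as the functions above (it is not part of your code), and the prices matrix is made-up data chosen so the expected values are easy to work out by hand.

```r
# Hypothetical helper: log of the ratio between consecutive rows.
LogRatios <- function(stack) {
  n <- nrow(stack)  # stack size
  log(stack[-n, , drop = FALSE] / stack[-1, , drop = FALSE])
}

# Made-up input: three rows, two columns (filled column by column).
prices <- matrix(c(4, 2, 1,
                   9, 3, 1), ncol = 2)

result <- LogRatios(prices)

# Each check exercises one elementary property of the helper.
stopifnot(
  identical(dim(result), c(2L, 2L)),
  isTRUE(all.equal(result[, 1], log(c(2, 2)))),
  isTRUE(all.equal(result[, 2], log(c(3, 3))))
)
```

Because the helper does one thing, the test can state the expected result directly; the same check on a single long expression would be much harder to write.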

Good luck!

Understanding different styles of #defines in c

Note that you should not, in general, create function, variable, tag or macro names that start with an underscore. Part of C11 §7.1.3 Reserved identifiers says:

  • All identifiers that begin with an underscore and either an uppercase letter or another underscore are always reserved for any use.
  • All identifiers that begin with an underscore are always reserved for use as identifiers with file scope in both the ordinary and tag name spaces.

See also What does double underscore (__const) mean in C?

That means the last name (__MY_HEADER_H__) can be used by 'system' headers (and the others can't be used by system headers). Note that a common problem is that new programmers look to see what the system headers do and copy them, not realizing that the headers provided by 'the implementation' (what I called system headers) are subject to different rules from headers written by users. Consequently, people inadvertently trample on the system namespace, thinking it is a good idea because that's what the system headers do, not realizing that they must not do it so that the system headers can be written safely.

Technically, you can use any of the other three names yourself. I don't like the trailing underscores so I don't use them in the absence of a compelling reason. Are these header guards to prevent multiple inclusions?

#ifndef MY_HEADER_H
#define MY_HEADER_H

#endif /* MY_HEADER_H */

If the names are for header guards, using a single or double trailing underscore means they're less likely to collide with other names. You're not likely to refer to these macros elsewhere. You should resist the temptation to try writing in some other source file:

#ifndef MY_HEADER_H__
#include "my_header.h"
#endif

The name in the header could change. It is crucial that the header contains a set of header guards (the exceptions are rare). But code outside the header itself should not usually be aware of that name.

I tend to use either HEADER_H or HEADER_H_INCLUDED for file header.h (and I seldom if ever use 'my' as a prefix to anything), but the name doesn't matter as long as it is unique (an MD5 checksum for the file is likely to be fine — this isn't a security application).

Is R's apply family more than syntactic sugar?

The apply functions in R don't provide improved performance over other looping functions (e.g. for). One exception to this is lapply which can be a little faster because it does more work in C code than in R (see this question for an example of this).

But in general, the rule is that you should use an apply function for clarity, not for performance.
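As a small sketch of the clarity argument (the variable names are made up for illustration), here is the same computation written as a for loop and with vapply. Both produce the same result; the second states the intent in one line and declares the expected type of each element.

```r
# for loop: allocate the result first, then fill it element by element.
squares.loop <- numeric(5)
for (i in 1:5) squares.loop[i] <- i^2

# vapply: one expression, with the element type stated up front.
squares.apply <- vapply(1:5, function(i) i^2, numeric(1))

identical(squares.loop, squares.apply)
```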

I would add to this that apply functions have no side effects, which is an important distinction when it comes to functional programming with R. This can be overridden by using assign or <<-, but that can be very dangerous. Side effects also make a program harder to understand since a variable's state depends on the history.

Edit:

Just to emphasize this with a trivial example that recursively calculates the Fibonacci sequence; this could be run multiple times to get an accurate measure, but the point is that none of the methods have significantly different performance:

> fibo <- function(n) {
+   if (n < 2) n
+   else fibo(n - 1) + fibo(n - 2)
+ }
> system.time(for(i in 0:26) fibo(i))
   user  system elapsed
   7.48    0.00    7.52
> system.time(sapply(0:26, fibo))
   user  system elapsed
   7.50    0.00    7.54
> system.time(lapply(0:26, fibo))
   user  system elapsed
   7.48    0.04    7.54
> library(plyr)
> system.time(ldply(0:26, fibo))
   user  system elapsed
   7.52    0.00    7.58

Edit 2:

Regarding the usage of parallel packages for R (e.g. rpvm, rmpi, snow), these do generally provide apply family functions (even the foreach package is essentially equivalent, despite the name). Here's a simple example of the sapply function in snow:

library(snow)
cl <- makeSOCKcluster(c("localhost","localhost"))
parSapply(cl, 1:20, get("+"), 3)

This example uses a socket cluster, for which no additional software needs to be installed; otherwise you will need something like PVM or MPI (see Tierney's clustering page). snow has the following apply functions:

parLapply(cl, x, fun, ...)
parSapply(cl, X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
parApply(cl, X, MARGIN, FUN, ...)
parRapply(cl, x, fun, ...)
parCapply(cl, x, fun, ...)

It makes sense that apply functions should be used for parallel execution since they have no side effects. When you change a variable value within a for loop, it is globally set. On the other hand, all apply functions can safely be used in parallel because changes are local to the function call (unless you try to use assign or <<-, in which case you can introduce side effects). Needless to say, it's critical to be careful about local vs. global variables, especially when dealing with parallel execution.

Edit:

Here's a trivial example to demonstrate the difference between for and *apply so far as side effects are concerned:

> df <- 1:10
> # *apply example
> lapply(2:3, function(i) df <- df * i)
> df
[1] 1 2 3 4 5 6 7 8 9 10
> # for loop example
> for(i in 2:3) df <- df * i
> df
[1] 6 12 18 24 30 36 42 48 54 60

Note how the df in the parent environment is altered by for but not *apply.
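For completeness, here is a minimal sketch of the caveat mentioned earlier: using <<- inside the function passed to lapply reintroduces the side effect. The variable name values is made up for the example.

```r
values <- 1:10

# <<- assigns in the enclosing environment, so the outer `values` IS changed.
invisible(lapply(2:3, function(i) values <<- values * i))

values  # no longer 1:10; every element has been multiplied by 2 and then by 3
```

This is exactly the kind of hidden state change that makes such code risky to run in parallel.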

What is better android.R or custom R?

There is no big advantage or disadvantage of the framework id vs the custom id in layouts.

Advantages of using framework identifiers:

  • Avoid the creation of more identifiers. Saves one field in the apk (and there is an apk size limit).
  • Must be used in some situations, like ListActivity

Drawbacks of using framework identifiers:

  • Don't provide a descriptive name

In both practices

  • The code will work in the future
  • Missing references are discovered at runtime, and hidden at compile time

I thought the samples in the SDK would help me make a decision, and (guess what?) they don't. The Notepad and LunarLander applications use the android.R.id for the view identifiers, whereas the ApiDemos project uses custom identifiers.

Best practice for GestureBuilder which mixes both approaches? (@+id/addButton and @android:id/empty)

IMHO, the worse practice is in HelloActivity and JetBoy, which define @+id/text, @+id/Button01... These are not descriptive and could have been replaced by @android:id/button1 or by @+id/startButton.

Make Emacs ESS follow R style guide

The good point of Hadley's guide is spacing around operators (except maybe around /).

There is a smart-operator package which implements it for almost every operator.

This is my setup (uncomment the operators you want to use):

;; Keymap used by `smart-operator-mode'.
(setq smart-operator-mode-map
      (let ((keymap (make-sparse-keymap)))
        (define-key keymap "=" 'smart-operator-self-insert-command)
        ;; (define-key keymap "<" 'smart-operator-<)
        ;; (define-key keymap ">" 'smart-operator->)
        ;; (define-key keymap "%" 'smart-operator-%)
        (define-key keymap "+" 'smart-operator-+)
        ;; (define-key keymap "-" 'smart-operator--)
        ;; (define-key keymap "*" 'smart-operator-*)
        ;; (define-key keymap "/" 'smart-operator-self-insert-command)
        (define-key keymap "&" 'smart-operator-&)
        (define-key keymap "|" 'smart-operator-self-insert-command)
        ;; (define-key keymap "!" 'smart-operator-self-insert-command)
        ;; (define-key keymap ":" 'smart-operator-:)
        ;; (define-key keymap "?" 'smart-operator-?)
        (define-key keymap "," 'smart-operator-,)
        ;; (define-key keymap "." 'smart-operator-.)
        keymap))

See also a nice discussion on R styles here.

[edit] I am also using the de facto camelCase style of R code for globals and underscore-separated names for local variables, which makes the two easy to tell apart.

There is a special subword mode in Emacs which redefines all editing and navigation commands to operate on capitalized sub-words:

(global-subword-mode)

Is there any advantage in organizing methods and functions in objects besides organization?

The most obvious advantage would be the prototype inheritance model. An object can inherit methods from its prototype or it can override its prototype properties. This allows for code reuse.

The most important feature of objects, in my opinion, is the fact that each object basically acts as a namespace for a set of properties. This way you can have multiple properties with the same name across different objects.

Which of these two coding-styles is better? Both are compliant with PEP8

The first version should be preferred. This is true not just for Python but also for other languages such as C++ and Java, because chaining calls is error-prone. Per the Zen of Python, "Errors should never pass silently", which is what would happen in your second version.

The Reason:

Suppose the instantiation of Client fails and returns None. Chaining on the same object without checking for success or handling the error properly would produce an error unrelated to the actual problem: AttributeError: 'NoneType' object has no attribute 'users'.

So it is always better to avoid chaining objects, as that hides the real cause of the failure.

Example

Consider the following modification to your first version

class EmailConfirmation():
    @receiver(email_confirmed)
    def confirmed(sender, **kwargs):
        email = kwargs['email_address'].email
        keystone_id = User.objects.get_by_natural_key(email).keystone_id
        try:
            client = Client(token=settings.KEYSTONE_TOKEN,
                            endpoint=settings.KEYSTONE_URL)
        except CustomClientError as e:
            # Your Error Handling Code goes here
            raise CustomClientError("Your Error Message")
        else:
            if client.users:
                client.users.update(keystone_id, enabled=True)
            else:
                raise CustomEmailError("Your Error Message")
        finally:
            # Cleanup code goes here
            pass

