Writing Functions vs. Line-by-Line Interpretation in an R Workflow

Writing functions vs. line-by-line interpretation in an R workflow

I don't think there is a single answer. The best thing to do is to keep the relative merits of each approach in mind and then pick the one that suits the situation.

1) Functions. The advantage of not using functions is that all your variables are left in the workspace, so you can examine them afterwards. That can help you figure out what is going on if you have problems.

On the other hand, the advantage of well-designed functions is that you can unit test them, i.e. exercise them in isolation from the rest of the code. Also, when you use a function, modulo certain lower-level constructs, you know that its results won't affect other functions unless they are explicitly passed out, which limits the damage that one function's erroneous processing can do to another's. You can use R's debug() facility on your functions, and being able to single-step through them is an advantage.
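
For example (just an illustrative sketch, with a made-up helper called clean_prices() kept in its own file), a function can be exercised with testthat and stepped through with debug():

# func.R -- a hypothetical helper
clean_prices <- function(x) {
  x[!is.na(x) & x > 0]    # drop missing and non-positive values
}

# unit test, run independently of the rest of the pipeline
testthat::test_that("clean_prices drops NA and non-positive values", {
  testthat::expect_equal(clean_prices(c(1, NA, -2, 3)), c(1, 3))
})

# interactive debugging: single-step through the function
debug(clean_prices)
clean_prices(c(5, NA, 10))
undebug(clean_prices)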

2) LCFD. Whether you should use a load/clean/func/do decomposition at all, regardless of whether it's done via source() or via functions, is a second question. The problem with this decomposition is that you need to run one step just to be able to test the next, so you can't really test the steps independently. From that viewpoint it's not the ideal structure.

On the other hand, it does have the advantage that you can replace the load step independently of the other steps if you want to try the analysis on different data, and you can replace the later steps independently of the load and clean steps if you want to try different processing.

3) Number of files. There may be a third question implicit in what you are asking: whether everything should be in one source file or several. The advantage of splitting things across source files is that you don't have to look at irrelevant items. In particular, routines that are not being used, or are not relevant to the function you are currently looking at, won't interrupt the flow, since you can arrange for them to live in other files.

On the other hand, there may be an advantage in putting everything in one file from the viewpoint of (a) deployment, i.e. you can just send someone that single file; (b) editing convenience, since the entire program fits in a single editor session, which facilitates searching (you can search the whole program with the editor's own functions and never have to work out which file a routine is in), lets successive undo commands move you backward across all units of the program, and means a single save captures the current state of everything because there is only one file; and (c) speed, i.e. if you are working over a slow network it may be faster to keep a single file on your local machine and just write it out occasionally, rather than going back and forth to the slow remote.

Note: one other thing to think about is that using packages, rather than sourcing files at all, may better suit your needs.
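
If you go that route, a minimal sketch looks something like the following (assuming the usethis and devtools packages are installed; "myanalysis" is a made-up package name):

usethis::create_package("myanalysis")   # create the package skeleton
# ...move your function definitions into myanalysis/R/...
devtools::load_all("myanalysis")        # load every function for interactive use
devtools::test("myanalysis")            # run the package's testthat tests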

Workflow for statistical analysis and report writing

I generally break my projects into 4 pieces:

  1. load.R
  2. clean.R
  3. func.R
  4. do.R

load.R: Takes care of loading in all the data required. Typically this is a short file, reading in data from files, URLs and/or ODBC. Depending on the project, at this point I'll either write out the workspace using save() or just keep things in memory for the next step.

clean.R: This is where all the ugly stuff lives - taking care of missing values, merging data frames, handling outliers.

func.R: Contains all of the functions needed to perform the actual analysis. source()'ing this file should have no side effects other than loading up the function definitions. This means that you can modify this file and reload it without having to go back and repeat steps 1 & 2, which can take a long time to run for large data sets.

do.R: Calls the functions defined in func.R to perform the analysis and produce charts and tables.
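
One way the four files might fit together is sketched below; the column names (value, group), cache paths and the summarise_by_group() function are just placeholders for illustration.

## load.R -- read the raw data
raw <- read.csv("data/raw.csv")
save(raw, file = "cache/raw.RData")

## clean.R -- tidy the raw data
load("cache/raw.RData")
dat <- raw[!is.na(raw$value), ]
save(dat, file = "cache/clean.RData")

## func.R -- function definitions only, no side effects
summarise_by_group <- function(d) {
  aggregate(value ~ group, data = d, FUN = mean)
}

## do.R -- run the analysis and produce output
load("cache/clean.RData")
source("func.R")
results <- summarise_by_group(dat)
print(results)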

The main motivation for this setup is working with large data, where you don't want to reload the data each time you change a subsequent step. Also, keeping my code compartmentalized like this means I can come back to a long-forgotten project, quickly read load.R to work out what data I need to update, and then look at do.R to work out what analysis was performed.

Best practices for unit tests on custom functions for a drake workflow

The best practices for unit tests do not change much when drake enters the picture. Here are the main considerations.

  1. If you are using drake, you are probably dealing with annoyingly long runtimes in your full pipeline. So one challenge is to construct tests that do not take forever. I recommend invoking your functions on a small dataset, a small number of iterations, or whatever will get the test done in a reasonable amount of time. You can run a lot of basic checks that way. To more thoroughly validate the answers that come from your functions, you can run an additional set of checks on the results of the drake pipeline.
  2. If you are using testthat, you probably have your functions arranged in a package-like structure, or even a fully fledged package, and you may be loading your functions with devtools::load_all() or library(yourPackage). If you load your functions this way instead of individually sourcing your function scripts, be sure to call expose_imports() before make() so drake can analyze the functions for dependencies (see the sketch after this list).
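
Here is a rough sketch of both points, assuming a hypothetical package yourPackage that provides a fit_model() function wrapping lm(), and a one-target drake plan; the names are placeholders.

library(drake)
library(testthat)

# (1) a fast unit test on a deliberately small dataset
test_that("fit_model() runs on a tiny subset", {
  small <- head(mtcars, 5)                # tiny stand-in dataset
  fit <- yourPackage::fit_model(small)    # fit_model() is a placeholder name
  expect_s3_class(fit, "lm")              # assumes fit_model() returns an lm object
})

# (2) expose the package's functions to drake before make()
expose_imports(yourPackage)
plan <- drake_plan(
  full_fit = yourPackage::fit_model(mtcars)
)
make(plan)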

While developing package, how to explore my function interactively when there's dependency

When writing functions for a package, or for any external use, I highly recommend using the :: operator, which refers directly to a package's namespace when calling a function. This is useful for avoiding confusion with identically or similarly named functions in other packages.

In your case, :: has another advantage: the relevant package is loaded automatically whenever the function is called. This is useful for checking your code because you don't have to attach the package in advance, and the function will run "as is" (provided the package is installed, which should be the case for imported packages).

Find more info on that topic here:
http://r-pkgs.had.co.nz/namespace.html

In your case, you might alter your code like this:

DDGet <- function(url = "http://uofi.box.com/file.dta") {
  tmpfile <- tempfile()
  download.file(url, tmpfile, method = "wget")
  # foreign:: resolves the namespace directly, so library(foreign) is not needed;
  # read.dta() already returns a data frame (to.data.frame is a read.spss argument)
  DDData <- foreign::read.dta(tmpfile)
  DDData
}
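
Assuming the foreign package is installed and the URL actually points at a Stata file, the function can then be called directly, with no library() call beforehand:

dd <- DDGet()
head(dd)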

Let me know if this was helpful for your problem.

Specify in source what functions to import

Break functions.R up into multiple files, each containing some of the functions. Then replace functions.R with a file that simply sources each of those files. If you want all of the functions, source functions.R as you do now; if you want only some of them, source the appropriate individual file.
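
For instance (the file names here are made up), functions.R could become nothing more than a list of source() calls:

## functions.R -- now just sources the individual files
source("functions_io.R")      # reading and writing helpers
source("functions_clean.R")   # data-cleaning helpers
source("functions_plot.R")    # plotting helpers

source("functions.R") still loads everything, while source("functions_plot.R") loads only the plotting helpers.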

Another approach is the klmr modules package on GitHub (google it), which provides a module system you could consider.

Strategies for repeating large chunk of analysis

Making code reusable takes time and effort, and brings a few extra challenges, as you mention yourself.

The question of whether to invest is probably the key issue in informatics (if not in a lot of other fields): do I write a script to rename 50 files in a similar fashion, or do I go ahead and rename them manually?

The answer, I believe, is highly personal, and even then differs case by case. If programming comes easily to you, you may decide to go the reuse route sooner, as the effort will be relatively low for you (and even then, programmers typically like to learn new tricks, so that's a hidden, often counterproductive, motivation).

That said, in your particular case I'd go with the sourcing option: since you plan to reuse the code only two more times, a greater effort would probably be wasted (you indicate the analysis is rather extensive). So what if it's not an elegant solution? Nobody is ever going to see you do it, and everybody will be happy with the swift results.

If it turns out in a year or so that the reuse is higher than expected, you can still invest then. And by that time, you will also have (at least) three cases against which you can compare the results of the rewritten, reusable version of your code with your current results.

If/when I do know up front that I'm going to reuse code, I try to keep that in mind while developing it. Either way, I hardly ever write code that is not in a function (well, barring two-liners for SO and other out-of-the-box analyses): I find this makes it easier to structure my thoughts.


