How to Organize Large R Programs

How to organize large R programs?

The standard answer is to use packages -- see the Writing R Extensions manual as well as different tutorials on the web.

It gives you

a quasi-automatic way to organize your code by topic
strongly encourages you to write a help file, making you think about the interface
a lot of sanity checks via R CMD check
a chance to add regression tests
as well as a means for namespaces.

Just running source() over code works for really short snippets. Everything else should be in a package -- even if you do not plan to publish it as you can write internal packages for internal repositories.

As for the 'how to edit' part, the R Internals manual has excellent R coding standards in Section 6. Otherwise, I tend to use defaults in Emacs' ESS mode.

Update 2008-Aug-13: David Smith just blogged about the Google R Style Guide.

Organizing R Source Code

This question is very closely related to: "How to organize large R programs?"

You should consider creating an R package. You can use the package.skeleton function to start with given a set of R files. I also strongly recommend using roxygen to document the package at the beginning, because it's much more difficult to do it after the fact.

Read "Writing R Extensions". The online book "Statistics with R" has a section on this subject. Also take a look at Creating R Packages: A Tutorial by Friedrich Leisch. Lastly, if you're in NY, come to the upcoming NY use-R group meeting on "Authoring R Packages: a gentle introduction with examples".

Just to rehash some suggestions about good practices:

A package allows you to use R CMD check which is very helpful at catching bugs; separately you can look at using the codetools package.
A package also forces you to do a minimal amount of documentation, which leads to better practices in the long run.
You should also consider doing unit testing (e.g. with RUnit) if you want your code to be robust/maintainable.
You should consider using a style guide (e.g. Google Style Guide).
Use a version control system from the beginning, and if you're going to make your code open source, then consider using github or r-forge.

Edit:

Regarding how do make incremental changes without rebuilding and installing the full package: I find the easiest thing to do is to make changes in your relevant R file and then use the source command to load those changes. Once you load your library into an R session, it will always be lower in the environment (and lower in priority) than the .GlobalEnv, so any changes that you source or load in directly will be used first (use the search command to see this). That way you can have your package underlying and you are overwriting changes as you're testing them in the environment.

Alternatively, you can use an IDE like StatET or ESS. They make loading individual lines or functions out of an R package very easy. StatET is particularly well designed to handle managing packages in a directory-like structure.

How to organize large Shiny apps?

After addition of modules to R shiny. Managing of complex structures in shiny applications has become a lot easier.

Detailed description of shiny modules:Here

Advantages of using modules:

Once created, they are easily reused

ID collisions is easier to avoid

Code organization based on inputs and output of modules

In tab based shiny app, one tab can be considered as one module which has inputs and outputs. Outputs of tabs can be then passed to other tabs as inputs.

Single-file app for tab-based structure which exploits modular thinking. App can be tested by using cars dataset. Parts of the code where copied from the Joe Cheng(first link). All comments are welcome.

# Tab module
# This module creates new tab which renders dataTable

dataTabUI <- function(id, input, output) {
  # Create a namespace function using the provided id
  ns <- NS(id)

  tagList(sidebarLayout(sidebarPanel(input),

                        mainPanel(dataTableOutput(output))))

}

# Tab module
# This module creates new tab which renders plot
plotTabUI <- function(id, input, output) {
  # Create a namespace function using the provided id
  ns <- NS(id)

  tagList(sidebarLayout(sidebarPanel(input),

                        mainPanel(plotOutput(output))))

}

dataTab <- function(input, output, session) {
  # do nothing...
  # Should there be some logic?


}

# File input module
# This module takes as input csv file and outputs dataframe
# Module UI function
csvFileInput <- function(id, label = "CSV file") {
  # Create a namespace function using the provided id
  ns <- NS(id)

  tagList(
    fileInput(ns("file"), label),
    checkboxInput(ns("heading"), "Has heading"),
    selectInput(
      ns("quote"),
      "Quote",
      c(
        "None" = "",
        "Double quote" = "\"",
        "Single quote" = "'"
      )
    )
  )
}

# Module server function
csvFile <- function(input, output, session, stringsAsFactors) {
  # The selected file, if any
  userFile <- reactive({
    # If no file is selected, don't do anything
    validate(need(input$file, message = FALSE))
    input$file
  })

  # The user's data, parsed into a data frame
  dataframe <- reactive({
    read.csv(
      userFile()$datapath,
      header = input$heading,
      quote = input$quote,
      stringsAsFactors = stringsAsFactors
    )
  })

  # We can run observers in here if we want to
  observe({
    msg <- sprintf("File %s was uploaded", userFile()$name)
    cat(msg, "\n")
  })

  # Return the reactive that yields the data frame
  return(dataframe)
}
basicPlotUI <- function(id) {
  ns <- NS(id)
  uiOutput(ns("controls"))

}
# Functionality for dataselection for plot
# SelectInput is rendered dynamically based on data

basicPlot <- function(input, output, session, data) {
  output$controls <- renderUI({
    ns <- session$ns
    selectInput(ns("col"), "Columns", names(data), multiple = TRUE)
  })
  return(reactive({
    validate(need(input$col, FALSE))
    data[, input$col]
  }))
}

##################################################################################
# Here starts main program. Lines above can be sourced: source("path-to-module.R")
##################################################################################

library(shiny)


ui <- shinyUI(navbarPage(
  "My Application",
  tabPanel("File upload", dataTabUI(
    "tab1",
    csvFileInput("datafile", "User data (.csv format)"),
    "table"
  )),
  tabPanel("Plot", plotTabUI(
    "tab2", basicPlotUI("plot1"), "plotOutput"
  ))

))


server <- function(input, output, session) {
  datafile <- callModule(csvFile, "datafile",
                         stringsAsFactors = FALSE)

  output$table <- renderDataTable({
    datafile()
  })

  plotData <- callModule(basicPlot, "plot1", datafile())

  output$plotOutput <- renderPlot({
    plot(plotData())
  })
}


shinyApp(ui, server)

How to organize big R functions?

Option 1

One option is to use switch instead of multiple if statements:

myfun <- function(y, type=c("aa", "bb", "cc", "dd" ... "zz")){
  switch(type, 
    "aa" = sub_fun_aa(y),
    "bb" = sub_fun_bb(y),
    "bb" = sub_fun_cc(y),
    "dd" = sub_fun_dd(y)
  )
}

Option 2

In your edited question you gave far more specific information. Here is a general design pattern that you might want to consider. The key element in this pattern is that there is not a single if in sight. I replace it with match.function, where the key idea is that the type in your function is itself a function (yes, since R supports functional programming, this is allowed).:

sharpening <- function(x){
  paste(x, "General sharpening", sep=" - ")
}

unsharpMask <- function(x){
  y <- sharpening(x)
  #... Some specific stuff here...
  paste(y, "Unsharp mask", sep=" - ")
}

hiPass <- function(x) {
  y <- sharpening(x)
  #... Some specific stuff here...
  paste(y, "Hipass filter", sep=" - ")
}

generalMethod <- function(x, type=c(hiPass, unsharpMask, ...)){
  match.fun(type)(x)
}

And call it like this:

> generalMethod("stuff", "unsharpMask")
[1] "stuff - General sharpening - Unsharp mask"
> hiPass("mystuff")
[1] "mystuff - General sharpening - Hipass filter"

Workflow for statistical analysis and report writing

I generally break my projects into 4 pieces:

load.R
clean.R
func.R
do.R

load.R: Takes care of loading in all the data required. Typically this is a short file, reading in data from files, URLs and/or ODBC. Depending on the project at this point I'll either write out the workspace using save() or just keep things in memory for the next step.

clean.R: This is where all the ugly stuff lives - taking care of missing values, merging data frames, handling outliers.

func.R: Contains all of the functions needed to perform the actual analysis. source()'ing this file should have no side effects other than loading up the function definitions. This means that you can modify this file and reload it without having to go back an repeat steps 1 & 2 which can take a long time to run for large data sets.

do.R: Calls the functions defined in func.R to perform the analysis and produce charts and tables.

The main motivation for this set up is for working with large data whereby you don't want to have to reload the data each time you make a change to a subsequent step. Also, keeping my code compartmentalized like this means I can come back to a long forgotten project and quickly read load.R and work out what data I need to update, and then look at do.R to work out what analysis was performed.

How do you organize classes on a large project?

Would it be bad design to just put
everything together so everything is
separated by function rather than
class type?

There's no general rule, it depends on the situation, your personal preferences and change patterns in your code.
However I would suggest doing the following:

Keep as much of the common code in the superclasses to eliminate any redundancy. Maybe then some of the subclasses won't even be needed.
I'm not sure what you mean by "modules", but I would recommend separating the stuff using namespaces+subdirectories in the same project and not by using separate assemblies. Use assemblies only for deployment separation purposes and similar stuff.
Those "helper" classes you mention sound a bit code-smelly. They could be a violation of OO principles. Check out this link for more info: http://blogs.msdn.com/nickmalik/archive/2005/09/06/461404.aspx. Maybe you can reorganize the code, reduce the need for such classes AND get a benefit of a cleaner OO design at the same time?

How to Organize Large R Programs