Writing Functions in R, Keeping Scoping in Mind

If I know that I'm going to need a function parametrized by some values and called repeatedly, I avoid globals by using a closure:

make.fn2 <- function(a, b) {
    fn2 <- function(x) {
        return( x + a + b )
    }
    return( fn2 )
}

a <- 2; b <- 3
fn2.1 <- make.fn2(a, b)
fn2.1(3)    # 8
fn2.1(4)    # 9

a <- 4
fn2.2 <- make.fn2(a, b)
fn2.2(3)    # 10
fn2.1(3)    # 8

This neatly avoids referencing global variables, instead using the enclosing environment of the function for a and b. Modification of globals a and b doesn't lead to unintended side effects when fn2 instances are called.
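One caveat is worth sketching. R evaluates function arguments lazily, so a and b are only captured when the returned function first uses them. In the example above fn2.1 was called before a changed, so nothing went wrong. The sketch below shows the failure mode and the usual fix with force(); make.fn2.forced, lazy.fn and forced.fn are names made up here for illustration.

```r
# Without force(): the arguments are promises, evaluated on first use
make.fn2 <- function(a, b) {
    function(x) x + a + b
}

a <- 2; b <- 3
lazy.fn <- make.fn2(a, b)
a <- 4                          # changed before lazy.fn is ever called
lazy.result <- lazy.fn(3)       # 10: a is evaluated only now and is already 4

# With force(): a and b are evaluated when the closure is created
make.fn2.forced <- function(a, b) {
    force(a)
    force(b)
    function(x) x + a + b
}

a <- 2
forced.fn <- make.fn2.forced(a, b)
a <- 4
forced.result <- forced.fn(3)   # 8: a was fixed at 2 when forced.fn was built
```

If the closure may be created long before it is first called, forcing the arguments in the factory is the safer habit.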

Passing arguments to functions, and variable scopes in R

You would generally read your data outside of any function, like so:

outcome.data <- read.csv("outcome-of-care-measures.csv", colClasses = "character")

Otherwise, since a function has its own environment, every variable defined inside it vanishes when the function returns, unless it is returned by the function with return(...). Several objects can be returned by putting them in a list: return(list(item1=var1, item2=var2)).
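As a small illustration of returning several objects at once, here is a sketch with a hypothetical helper, summarise.values, that is not part of the original question:

```r
# summarise.values is a hypothetical helper used only for illustration
summarise.values <- function(values) {
    total <- sum(values)       # local variable, gone once the function returns
    average <- mean(values)    # local variable as well
    return(list(total = total, average = average))
}

result <- summarise.values(c(1, 2, 3, 6))
result$total      # 12
result$average    # 3
```

The caller picks the pieces it needs out of the returned list by name.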

Some functions, such as assign, take an envir argument that can be set to .GlobalEnv to change this behavior. An object outside the function can also be modified from inside it by using the <<- operator instead of <-, although this practice is generally discouraged.
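To make the difference concrete, here is a minimal sketch comparing <-, <<- and assign with envir = .GlobalEnv. The names counter, flag, bump.local, bump.global and set.flag are invented for this example.

```r
counter <- 0

bump.local <- function() {
    counter <- counter + 1     # creates a local counter; the outer one is untouched
    counter
}

bump.global <- function() {
    counter <<- counter + 1    # rebinds the counter found in an enclosing environment
    invisible(counter)
}

local.value <- bump.local()    # 1
after.local <- counter         # 0
bump.global()
after.global <- counter        # 1

set.flag <- function() {
    assign("flag", TRUE, envir = .GlobalEnv)   # writes directly into the global environment
}
set.flag()
flag                           # TRUE
```

Both techniques hide a side effect inside the function, which is the main reason returning a value is usually preferred.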

As a side note, when using a function, you need to define clearly:

What are its inputs
What does it do
What does it return

It's not useful, for instance, to use outcome as a function parameter and then read the content of a csv file into a variable named outcome. Your argument is then useless, as it will be overwritten. That's why you had to comment out the line defining your state variable inside the function to actually be able to use state as it was received by the function.
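A sketch of a function with clearly defined inputs, behavior and return value might look like the following. The function name count.by.state, the toy data frame and the column name State are assumptions made for illustration, not taken from your file.

```r
# count.by.state is hypothetical; the State column name is an assumption
count.by.state <- function(data, state) {
    # Inputs: a data frame and a state code, both used exactly as received
    rows <- data[data$State == state, , drop = FALSE]
    # Returns: the number of rows for that state
    return(nrow(rows))
}

toy.data <- data.frame(State = c("TX", "TX", "MD"),
                       Hospital = c("A", "B", "C"),
                       stringsAsFactors = FALSE)

count.by.state(toy.data, "TX")   # 2
```

Neither argument is reassigned inside the body, so what the caller passes in is what the function works on.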

This surely won't answer all your questions, but hopefully it can help you clarify certain things. For the rest there are plenty of good tutorials to learn further on how to program in R and how/when to use functions. Best of luck and happy learning!

Making a Variable Constant in a Function in R

A possible solution is to define your function within another function:

g <- function( index ){
  function( x ) x + index
}
index <- 3
f <- g( index )
f(4)    # 7
index <- 20
f(4)    # still 7

Now the output of g( index ) is a function that is defined within the (execution) environment of g. This function (f) looks up the value of index in that environment, where it is fixed to 3. That's why it works, but maybe there is a simpler solution.
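To see where the value lives, you can inspect the closure's environment with environment(f). The sketch below also shows one possible alternative using local(), which avoids the outer function. The names first, second, third, captured and f2 are introduced here for illustration.

```r
g <- function(index) {
    function(x) x + index
}

index <- 3
f <- g(index)
first <- f(4)                       # 7
index <- 20
captured <- environment(f)$index    # 3: the value stored in the closure's environment
second <- f(4)                      # 7, unaffected by the global index

# A possible alternative with local(), which needs no outer function
f2 <- local({
    index <- 3
    function(x) x + index
})
index <- 20
third <- f2(4)                      # 7
```

With local() the constant is written directly in the block, so it suits a fixed value better than one that must be passed in.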

Is it possible to make functions recognize variables in scopes above them?

Maybe something like this:

printx <- function() {
  x <- 1
  printy()
  return(x)
}

printy <- function() {
  print(get('x',envir=parent.frame()))
}

> x<-0
> printy()
[1] 0
> printx()
[1] 1
[1] 1

Here printy prints whichever x is bound in the environment it was called from (its parent frame), not the environment in which it was defined.

Another possibility would be to create a new environment:

e1<-new.env(parent = baseenv())

> assign('x',12,envir=e1)
> x
[1] 0
> get('x',e1)
[1] 12
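As a rough sketch of how such an environment can be used, the code below evaluates an expression inside e1 and checks that x is bound there directly. The names in.env, in.global and local.only are made up for this example.

```r
x <- 0
e1 <- new.env(parent = baseenv())
assign("x", 12, envir = e1)

# x is looked up in e1, while `*` is found in its parent, the base environment
in.env <- eval(quote(x * 2), envir = e1)                  # 24
in.global <- x * 2                                        # 0

# inherits = FALSE restricts the lookup to e1 itself
local.only <- exists("x", envir = e1, inherits = FALSE)   # TRUE
```

Because the parent of e1 is baseenv(), lookups inside e1 never fall back to the global x.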

Writing functions vs. line-by-line interpretation in an R workflow

I don't think there is a single answer. The best thing to do is keep the relative merits in mind and then pick an approach for that situation.

1) Functions. The advantage of not using functions is that all your variables are left in the workspace and you can examine them at the end. That may help you figure out what is going on if you have problems.

On the other hand, the advantage of well-designed functions is that you can unit test them, that is, test them apart from the rest of the code. Also, when you use a function, modulo certain lower-level constructs, you know that its results won't affect other functions unless they are passed out, which may limit the damage that one function's erroneous processing can do to another's. You can also use the debug facility in R to single-step through your functions.
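A minimal sketch of testing a function in isolation, using only base R's stopifnot. The function rescale.values is hypothetical and stands in for whatever your code actually does.

```r
# rescale.values is a hypothetical function used only for illustration
rescale.values <- function(values) {
    rng <- range(values)                     # local; never reaches the workspace
    (values - rng[1]) / (rng[2] - rng[1])
}

# Checks that run without the rest of the script
stopifnot(isTRUE(all.equal(rescale.values(c(2, 4, 6)), c(0, 0.5, 1))))
stopifnot(isTRUE(all.equal(rescale.values(c(-1, 1)), c(0, 1))))

scaled <- rescale.values(c(10, 20, 30))      # 0.0 0.5 1.0

# In an interactive session, debug(rescale.values) makes the next call
# single-step through the body; undebug(rescale.values) turns that off.
```

Dedicated testing packages exist as well, but a handful of stopifnot checks already gives you a repeatable test.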

2) LCFD. Whether you should use a load/clean/func/do decomposition, regardless of whether it's done via source or functions, is a second question. The problem with this decomposition is that you need to run one step just to be able to test the next, so you can't really test them independently. From that viewpoint it's not the ideal structure.

On the other hand, it does have the advantage that you may be able to replace the load step independently of the other steps if you want to try it on different data and can replace the other steps independently of the load and clean steps if you want to try different processing.
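As a sketch of that replaceability, the steps can be written as functions with the load step passed in as an argument. All function names and the toy data below are hypothetical.

```r
# Two interchangeable load steps (toy data instead of real files)
load.toy <- function() {
    data.frame(id = 1:4, value = c(10, NA, 30, 50))
}
load.other <- function() {
    data.frame(id = 1:2, value = c(1, 3))
}

# Clean step: drop rows with a missing value
clean.data <- function(data) {
    data[!is.na(data$value), , drop = FALSE]
}

# Func step: the actual computation
average.value <- function(data) {
    mean(data$value)
}

# Do step: wire the pieces together, taking the load step as a parameter
run.analysis <- function(load.step) {
    average.value(clean.data(load.step()))
}

run.analysis(load.toy)     # 30
run.analysis(load.other)   # 2
```

Swapping the data source then means passing a different load function, with the clean and processing steps left unchanged.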

3) No. of files. There may be a third question implicit in what you are asking: whether everything should be in one or multiple source files. The advantage of putting things in different source files is that you don't have to look at irrelevant items. In particular, routines that are unused or not relevant to the function you are looking at won't interrupt the flow, since you can arrange for them to be in other files.

On the other hand, there may be an advantage in putting everything in one file from the viewpoint of (a) deployment, i.e. you can just send someone that single file; (b) editing convenience, as you can put the entire program in a single editor session, which facilitates searching since you can search the entire program with the editor's functions without having to determine which file a routine is in (also, successive undo commands let you move backward across all units of your program, and a single save stores the current state of all modules since there is only one file); and (c) speed, i.e. if you are working over a slow network it may be faster to keep a single file on your local machine and write it out occasionally rather than going back and forth to the slow remote.

Note: One other thing to think about is that using packages may be superior for your needs relative to sourcing files in the first place.

Scoping: Local vs Var

They are very similar, but not exactly the same. Both only exist inside of a function but they work slightly differently.

The var version works its way through all the default variable scopes. See http://help.adobe.com/en_US/ColdFusion/9.0/Developing/WSc3ff6d0ea77859461172e0811cbec09af4-7fdf.html

Local will match only a variable in the local scope. Consider the following:

<cffunction name="himom">
    <cfoutput>
        <p><b>try 0:</b> #request_method#</p>
        <!--- you might think that the variable does not exist, 
              but it does because it came from cgi scope --->
    </cfoutput> 

    <cfquery name="myData" datasource="Scorecard3">
        SELECT 'This is via query' AS request_method
    </cfquery>
    <!--- Data is now being loaded into a query --->

    <cfoutput query="myData">
        <p><b>try 1:</b> #request_method#</p>
    </cfoutput>
    <!--- This one now came from the query --->        

    <cfset var request_method = "This is Var">
    <!--- We now declare via var ---> 
 
    <cfoutput query="myData">
        <p><b>try 2:</b> #request_method#</p>
    </cfoutput> 
    <!--- the query version disappears and now 
          the var version takes precedence --->

    <cfset local.request_method = "This is local">
    <!--- now we declare via local --->

    <cfoutput query="myData">
        <p><b>try 3:</b> #request_method#</p>
    </cfoutput>
    <!--- The local method takes precedence --->

    <cfoutput>
        <p><b>try 4:</b> #request_method#</p>
        <!--- in fact it even takes precedence over the var --->

        <p><b>try 5:</b> #local.request_method#</p>
        <!--- there is no question where this comes from --->   
   </cfoutput>
</cffunction>

<cfset himom()>

Results of the above

try 0: GET

try 1: This is via query

try 2: This is Var

try 3: This is local

try 4: This is local

try 5: This is local

In summary

When developing, you could use either to make sure that variables exist only inside a function, but always prefixing your variables with local goes a long way toward making sure that your code is clearly understood.

How does local function myFunc() work in Lua?

In Lua, when you write:

local function myFunc()
    --...
end

It is essentially the same thing as:

local myFunc = function()
    --...
end

In the same manner, the following:

function myFunc()
    --...
end

Is the same as:

myFunc = function()
    --...
end

It's simply a shortcut for variable declaration. That's because functions are first-class objects in Lua: there is no special place where declared functions are stored, and they are held in variables the same as any other data type.

Caveat

It's worth noting that there is a very small difference in behavior when using local function myFunc() instead of local myFunc = function().

When you declare the function using the former syntax, code inside the function has access to the local variable myFunc, so the function can refer to itself. With the latter syntax, the local myFunc is not yet in scope inside the function body, so myFunc there refers to a global of that name, which will be nil unless one has been defined.

So that means the following code:

local function myFunc()
    --...
end

Is actually more accurately represented as:

local myFunc
myFunc = function()
    --..
end

This is a small difference, but may be worth keeping in mind e.g. if you need to write a recursive function.
