Understanding Lm and Environment

(This has nothing to do with the real problem you have, [@DWin has addressed that, as have commentators on your Q] but is by way of explanation of the part of the documentation you quote)

The quoted help information means that the same process is used to find the variables/objects referenced in a model formula as is used to find variables/objects supplied to the arguments weights, subset, etc.

R looks for the objects referenced in the formula and by the arguments weights, subset, and offset, first in the data object and then in the environment of the formula (which is usually the global environment during interactive use).

The docs mention this explicitly because lm(), like many R functions that employ model-formula interfaces, uses the so-called standard non-standard evaluation. The upshot is that if, say, one supplies weights = foo, R won't simply evaluate foo in the calling environment as it would an ordinary argument. Instead, it will look for an object named foo in the object supplied to the data argument and, if it doesn't find it there, in the environment attached to the model formula, which, as mentioned, doesn't always have to be the global environment.
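To make the lookup order concrete, here is a minimal sketch using the built-in mtcars data. The object names (dat, w, w_env and the fit_* objects) are made up for illustration and are not from the question.

```r
# Lookup order for `weights`: first the data argument, then the
# environment of the formula. All object names below are illustrative.
dat <- mtcars
dat$w <- rep(c(1, 2), length.out = nrow(dat))   # weights stored in the data

# 1. `w` is found as a column of `dat`
fit_data <- lm(mpg ~ wt, data = dat, weights = w)

# 2. `w_env` is not a column of `dat`, so it is looked up in the
#    environment of the formula (wherever this script is run)
w_env <- rep(1, nrow(dat))
fit_env <- lm(mpg ~ wt, data = dat, weights = w_env)

# 3. When a column and a free-standing object share a name, the column wins
dat$w_env <- dat$w
fit_both <- lm(mpg ~ wt, data = dat, weights = w_env)

# TRUE if the column in `dat` (not the free-standing `w_env`) was used
isTRUE(all.equal(coef(fit_both), coef(fit_data)))
```

The third fit is the one to watch: the free-standing w_env still exists, but the column of the same name in dat takes precedence.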

R - how to pass a formula to a with(data, lm(y ~ x)) construction

An important "hidden" aspect of formulas is their associated environment.

When form_obj is created, its environment is set to where form_obj was created:

environment(form_obj)
# <environment: R_GlobalEnv>

For every other version, the formula's environment is created from within with(), and is set to that temporary environment. It's easiest to see this with the as.formula approach by splitting it into a few steps:

with(mtcars, {
  f = as.formula(text_obj)
  print(environment(f))
  lm(f)
})
# <environment: 0x7fbb68b08588>

We can make the form_obj approach work by editing its environment before calling lm:

with(mtcars, {
  # set form_obj's environment to the current one
  environment(form_obj) = environment()
  lm(form_obj)
})

The help page for ?formula is a bit long, but there's a section on environments:

Environments
A formula object has an associated environment, and this environment (rather than the parent environment) is used by model.frame to evaluate variables that are not found in the supplied data argument.
Formulas created with the ~ operator use the environment in which they were created. Formulas created with as.formula will use the env argument for their environment.

The upshot is, making a formula with ~ puts the environment part "under the rug" -- in more general settings, it's safer to use as.formula which gives you fuller control over the environment to which the formula applies.
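As a rough sketch of what "fuller control" can look like, the env argument of as.formula lets you pin the formula to an environment of your choosing. The names lookup_env and w_custom are hypothetical, and the weights are arbitrary.

```r
# Sketch: pin a formula's environment explicitly via as.formula(env = ...).
# `lookup_env` and `w_custom` are hypothetical names for illustration.
lookup_env <- new.env()
assign("w_custom", rep(c(1, 3), length.out = nrow(mtcars)), envir = lookup_env)

f <- as.formula("mpg ~ wt", env = lookup_env)
identical(environment(f), lookup_env)

# `w_custom` is not a column of mtcars and does not exist in the calling
# environment; it is found in the formula's environment.
fit <- lm(f, data = mtcars, weights = w_custom)
```

Note that the env argument only takes effect when the input is not already a formula (here it is a character string); an existing formula keeps the environment it already has.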

You might also check Hadley's chapter on environments:

http://adv-r.had.co.nz/Environments.html

Call to weight in lm() within function doesn't evaluate properly

Formulas are special in R in that they not only keep track of symbol/variable names, they also keep track of the environment where they were created. Check out

ff <- mpg ~ cyl
environment(ff)
# <environment: R_GlobalEnv>
foo <- function() {
  ff <- mpg ~ cyl
  environment(ff)
}
foo()
# <environment: 0x0000026172e505d8> private function environment (different each time)

The problem is that lm will try to use the environment where the formula was created to look up variables, rather than the parent frame. Since you create the formula in the call to wt_reg, the formula holds on to the global scope. But wts only exists in the function scope. You can alter your function to change the environment of the formula to the local function environment, and then everything should work:

wt_reg <- function(form, data, wts) {
  ff <- as.formula(form)
  environment(ff) <- environment()
  lm(formula = ff, data = data,
     weights = wts)
}

wt_reg(mpg ~ cyl, data = mtcars, wts = 1:nrow(mtcars))

The eval(mf, parent.frame()) you are referring to in lm() is calling model.frame() with your formula. And from the description on the ?model.frame help page: "All the variables in formula, subset and in ... are looked for first in data and then in the environment of formula (see the help for formula() for further details) and collected into a data frame". So, again, it is looking in the environment of the formula, not in the calling frame.
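For contrast, here is a small sketch of a wrapper that does not reset the formula's environment. The names wt_reg_naive and wts_arg are hypothetical, and the sketch assumes no object called wts_arg exists where the script is run.

```r
# Sketch: a wrapper that leaves the formula's environment untouched.
# `wt_reg_naive` and `wts_arg` are hypothetical names.
wt_reg_naive <- function(form, data, wts_arg) {
  lm(formula = form, data = data, weights = wts_arg)
}

# The formula is created at the call site, so `wts_arg` is searched for in
# mtcars and then in the caller's environment, never in the function's frame.
res <- tryCatch(
  wt_reg_naive(mpg ~ cyl, data = mtcars, wts_arg = 1:nrow(mtcars)),
  error = function(e) e
)
inherits(res, "error")   # TRUE when the lookup failed
```

The weights are passed in correctly; the call fails only because model.frame() never looks inside the wrapper's own frame.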

How are environments, (en)closures, and frames related?

UPDATE R-lang defines an environment as having a frame. I tend to think about frames as stack frames, not as mapping from name to value - but then there is of course the data.frame which maps column names to vectors (and then some...). I think most of the confusion comes from the fact that the original S-language (and still S-Plus) did not have environment objects, so all "frames" were essentially what environment objects are now, except that they could only exist as part of the call stack.

For instance, in S-Plus the doc for sys.nframe says "sys.nframe returns the numerical index of the current frame in the list of all frames." ...that sounds an awful lot like stack frames to me... You can read more about stack frames here: http://en.wikipedia.org/wiki/Call_stack#Structure

I expanded some of the explanations below and use the term "stack frame" consistently (I hope).

END UPDATE

I'd explain them like this:

An environment is an object that maps variable names to values. Each mapping is called a binding. The value can be either a real value or a promise. An environment has a parent environment (except for the empty environment). When you look up a symbol in an environment and it isn't found, the parent environments are also searched.
A promise is an unevaluated expression and an environment in which to evaluate the expression. When the promise is evaluated it is replaced with the generated value.
A closure is a function and the environment that the function was defined in. A function like lm would have the stats namespace environment and a user defined function would have the global environment - but a function f defined within another function g would have the local environment for g as its environment.
A stack frame (or activation record) is what represents the entries on the call stack. Each stack frame has the local environment that the function is executed in, and the function call's expression (so that sys.call works).
When a function call is executed, a local environment is created with its parent set to the closure's environment, the arguments are matched against the function's formal arguments and those bindings are added to the local environment (as promises). The unmatched formal arguments are assigned the default values (promises) of the function (if any) and marked as missing. A stack frame is then created with this local environment and the call expression. The stack frame is pushed on the call stack and then the body of the function is evaluated in this local environment.

...so all symbols in the body will be looked up in the local environment (formal arguments and local variables), and if not found there, in the parent environment (which is the closure environment), then in the parent's parent environment, and so on until found.

Note that the parent stack frame's environment is NOT searched in this case. The parent.frame and sys.frame functions get the environments on the call stack - that is, the caller's environment, the caller's caller's environment, etc...

# Here match.fun needs to look in the caller's caller's environment to find what "x" is...
f <- function(FUN) match.fun(FUN)(1:10)
g <- function() { x=sin; y="x"; f(y) }
g() # same as sin(1:10)

# Here we see that the stack frames must also contain the actual call expression
f <- function(...) sys.call()
g <- function(...) f(..., x=42)
g(a=2) # f(..., x = 42)
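The difference between the two kinds of lookup described above (closure environment versus call stack) can be sketched in a few lines. All names here are made up for the illustration.

```r
# Sketch contrasting lexical lookup with a call-stack lookup.
# All names here are hypothetical.
x_demo <- "global"

# Ordinary symbol lookup: local environment, then the closure's environment
show_lexical <- function() x_demo

# Explicit call-stack lookup: start the search in the caller's environment
show_caller <- function() get("x_demo", envir = parent.frame())

caller <- function() {
  x_demo <- "local to caller"
  c(lexical = show_lexical(), caller = show_caller())
}
caller()
```

show_lexical() never sees the x_demo defined inside caller(), because caller()'s environment is on the call stack but is not a parent of show_lexical()'s environment. show_caller() sees it only because it asks for parent.frame() explicitly.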

Model fitting functions and environments

In this case, you are right, the call to model.frame (actually, model.frame.default) is looking for mysubset in the .GlobalEnv. However, a better generalization would be to say that it is trying to evaluate various objects either in the object passed to data or, if they are not there, in the environment of the formula that you pass to it. And that environment is the .GlobalEnv.

So model.frame.default calls

eval(substitute(subset), data, env)

That translates to "evaluate the object mysubset in data or, if not there, in env" (which is environment(formula)).
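The two-stage lookup that eval() performs when given a data frame and an enclosing environment can be tried in isolation. This is only a sketch; env_demo and threshold are hypothetical names, and the cutoff value is arbitrary.

```r
# Sketch of eval(expr, data, enclos): names are resolved in `data` first,
# then in `enclos`. `env_demo` and `threshold` are hypothetical.
env_demo <- new.env()
assign("threshold", 30, envir = env_demo)

# `mpg` is a column of mtcars; `threshold` is not, so it comes from env_demo
keep <- eval(quote(mpg > threshold), mtcars, env_demo)
sum(keep)   # number of rows selected
```

In model.frame.default the expression comes from substitute(subset) rather than quote(), but the lookup works the same way.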

One way to get around this is to recreate your formula inside your function, where it will assume the environment created when your function is called, where mysubset exists:

mylm <- function(formula, data, subset = NULL) {
  mysubset <- subset # some other clever manipulation
  # re-create the formula here so that its environment is this function's frame
  lm(formula(deparse(formula)), data, subset = mysubset)
}

In that way, model.frame.default should be able to find mysubset.
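A quick way to check this is to compare the wrapper against a direct call. The sketch below repeats the wrapper under a hypothetical name (mylm_demo) so that it runs on its own; the mtcars data and the am == 1 filter are only illustrative, and the paste(..., collapse = " ") is my addition to guard against deparse() splitting a long formula over several strings.

```r
# Same idea as the wrapper above, repeated under a hypothetical name so the
# block runs on its own.
mylm_demo <- function(formula, data, subset = NULL) {
  mysubset <- subset
  # Re-parse the formula here so its environment is this function's frame;
  # paste(..., collapse = " ") guards against deparse() returning several strings
  lm(formula(paste(deparse(formula), collapse = " ")), data, subset = mysubset)
}

fit_wrapped <- mylm_demo(mpg ~ wt, mtcars, subset = mtcars$am == 1)
fit_direct  <- lm(mpg ~ wt, mtcars, subset = am == 1)

# TRUE if both fits used the same rows
isTRUE(all.equal(coef(fit_wrapped), coef(fit_direct)))
```

One trade-off to keep in mind: once the formula is re-created inside the function, anything it referenced in its original environment (other than columns of data) is no longer reachable unless it is also visible from the function.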

using lm(my_formula) inside [.data.table's j

The way lm works, it looks for the variables used in the environment of the formula supplied. Since you create your formula in the global environment, it's not going to look in the j-expression environment, so the only way to make the exact expression lm(frm) work would be to add the appropriate variables to the correct environment:

DT[, {assign('x', x, environment(frm));
      assign('y', y, environment(frm));
      lm(frm)}]

Now obviously this is not a very good solution, and both Arun's and Josh's suggestions are much better and I'm just putting it here for the understanding of the problem at hand.

edit Another (possibly more perverted, and quite fragile) way would be to change the environment of the formula at hand (I do it permanently here, but you could revert it back, or copy it and then do it):

DT[, {setattr(frm, '.Environment', get('SDenv', parent.frame(2))); lm(frm)}]

Btw a funny thing is happening here - whenever you use get in j-expression, all of the variables get constructed (so don't use it if you can avoid it), and this is why I don't need to also use x and y in some way for data.table to know that those variables are needed.
