How Does the Subset Argument Work in the lm() Function

How to subset a range of values in lm()

The subset parameter in lm() and other model-fitting functions takes an expression that is evaluated in the environment of the data frame, typically yielding a logical vector with one element per row. So, if I understand you correctly, I would use the following:

fit <- lm(SP.RICH~SIZE, data=dat, subset=(SIZE>0.8 & SIZE<7))
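As a quick check of that reading, here is a minimal sketch using the built-in mtcars data. The variables mpg and wt and the cutoffs are stand-ins chosen for illustration, not the asker's SP.RICH and SIZE. For a plain formula like this one, the subset argument should give the same fit as filtering the rows beforehand.

```r
# Fit with the subset argument; wt is looked up inside mtcars
fit_arg <- lm(mpg ~ wt, data = mtcars, subset = (wt > 2.5 & wt < 4))

# Fit on rows filtered in advance
keep <- mtcars$wt > 2.5 & mtcars$wt < 4
fit_pre <- lm(mpg ~ wt, data = mtcars[keep, ])

all.equal(coef(fit_arg), coef(fit_pre))   # compare the two sets of coefficients
nobs(fit_arg) == sum(keep)                # compare the number of rows used
```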

How does the subset argument work in the lm() function?

As a general principle, vectors used in subsetting can be either logical (e.g. a TRUE or FALSE for every element) or numeric (e.g. a vector of row numbers). As a feature to help with sampling, when the vector is numeric R includes an element once for every time its index appears in that vector.

Let's take a look at cyl:

> mtcars$cyl
 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

So if you pass cyl as the subset, you're getting a data.frame with the same number of rows, but it's made up of row 6, row 6, row 4, row 6, etc.

You can see this if you do the subsetting yourself:

> head(mtcars[mtcars$cyl,])
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Valiant.1      18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Valiant.2      18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Valiant.3      18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1

Did you mean to do something like this?

summary(lm(mpg ~ wt, data=mtcars, subset=cyl==6))
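To make the difference concrete, the sketch below fits the same model twice on mtcars, once with the numeric vector cyl as the subset and once with the logical condition cyl == 6. The object names fit_num and fit_lgl are only for illustration.

```r
# Numeric subset: the values of cyl are treated as row numbers, so rows repeat
fit_num <- lm(mpg ~ wt, data = mtcars, subset = cyl)
nobs(fit_num)   # one observation per element of cyl, drawn from rows 4, 6 and 8

# Logical subset: keeps only the six-cylinder cars
fit_lgl <- lm(mpg ~ wt, data = mtcars, subset = cyl == 6)
nobs(fit_lgl) == sum(mtcars$cyl == 6)
```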

Why does lm() with the subset argument give a different answer than subsetting in advance?

tl;dr As suggested in other comments and answers, the characteristics of the orthogonal polynomial basis are computed before the subsetting is taken into account.

To add more technical detail to @JonManes's answer, let's look at lines 545-553 of the R code where 'model.frame' is defined.

First we have (lines 545-549)

 if(is.null(attr(formula, "predvars"))) {
        for (i in seq_along(varnames))
            predvars[[i+1L]] <- makepredictcall(variables[[i]], vars[[i+1L]])
        attr(formula, "predvars") <- predvars
    }

At this point in the code, formula will not be an actual formula (that would be too easy!), but rather a terms object that contains various useful-to-developers info about model structures ...
predvars is the attribute that defines the information needed to properly reconstruct data-dependent bases like orthogonal polynomials and splines (see ?makepredictcall for a little bit more information, or here, although in general this stuff is really poorly documented; I'd expect it to be documented here but it isn't ...). For example,

attr(terms(model.frame(mpg ~ poly(horsepower, 2), data = auto_train)),  "predvars")

gives

list(mpg, poly(horsepower, 2, coefs = list(alpha = c(102.612244897959, 
142.498828460405), norm2 = c(1, 196, 277254.530612245, 625100662.205702
))))

These are the coefficients for the polynomial, which depend on the distribution of the input data.

Only after this information has been established, on line 553, do we get

subset <- eval(substitute(subset), data, env)

In other words, the subsetting argument doesn't even get evaluated until after the polynomial characteristics are determined (all of this information is then passed to the internal C_modelframe function, which you really don't want to look at ...)

Note that this issue does not result in an information leak between training and testing sets in a statistical learning context: the parameterization of the polynomial doesn't affect the predictions of the model at all (in theory, although as usual with floating point the results are unlikely to be exactly identical). At worst (if the training and full sets were very different) it could reduce numerical stability a bit.
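A small sketch of this behaviour on the built-in mtcars data (not the auto_train data from the question; the variable choices are assumptions for illustration). The coefficients are expected to differ between the two fits because the polynomial basis is built from different rows, while the fitted values should agree up to floating-point error.

```r
keep <- mtcars$cyl != 8

# Basis for poly() is computed from all rows, then the subset is applied
fit_arg <- lm(mpg ~ poly(hp, 2), data = mtcars, subset = cyl != 8)

# Basis for poly() is computed from the pre-filtered rows only
fit_pre <- lm(mpg ~ poly(hp, 2), data = mtcars[keep, ])

# Compare the parameterizations, then the predictions on the training rows
coefs_match  <- isTRUE(all.equal(unname(coef(fit_arg)), unname(coef(fit_pre))))
fitted_match <- isTRUE(all.equal(fitted(fit_arg), fitted(fit_pre)))
c(coefs_match = coefs_match, fitted_match = fitted_match)
```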

FWIW this is all surprising (to me) and seems worth raising on the r-devel@r-project.org mailing list (at least a note in the documentation seems in order).

Run lm() function over subsets of 2 different variables in data frame

One option using dplyr:

library(dplyr)

# Fit one model per District/Crop combination
df_lm <- df %>%
  group_by(District, Crop) %>%
  do(mod = lm(Yield ~ Year, data = .))

df_coef <- df_lm %>%
  do(data.frame(
    District = .$District,
    Crop = .$Crop,
    var = names(coef(.$mod)),
    coef(summary(.$mod)))
    )

> df_coef
Source: local data frame [32 x 7]
Groups: <by row>

# A tibble: 32 × 7
   District   Crop         var      Estimate   Std..Error    t.value   Pr...t..
*    <fctr> <fctr>      <fctr>         <dbl>        <dbl>      <dbl>      <dbl>
1         A Barley (Intercept) -407.66953514 378.49788671 -1.0770722 0.36034462
2         A Barley        Year    0.20771336   0.19183872  1.0827499 0.35818046
3         A  Maize (Intercept)  159.81133118 212.90233600  0.7506321 0.50738515
4         A  Maize        Year   -0.08002266   0.10790790 -0.7415830 0.51211787
5         A   Rice (Intercept)  -68.01125454 117.60578244 -0.5782986 0.60361684
6         A   Rice        Year    0.03552764   0.05960758  0.5960255 0.59313364
7         A  Wheat (Intercept)  -59.61828825 134.67806297 -0.4426726 0.66972726
8         A  Wheat        Year    0.03125866   0.06826053  0.4579317 0.65918309
9         B Barley (Intercept) -319.99755207  57.14553545 -5.5996947 0.01125215
10        B Barley        Year    0.16332436   0.02896377  5.6389189 0.01103509
# ... with 22 more rows

Another thing to look at is the lmList function in nlme.

How to run lm for each subset of the data frame, and then aggregate the result?

Does this work for you?

    set.seed(1)
    df <- data.frame(income = rnorm(100, 100, 20),
                     age = rnorm(100, 40, 10),
                     country = factor(sample(1:3, 100, replace = TRUE),
                                      levels = 1:3, labels = c("us", "gb", "france")))

    # Fit income ~ 0 + age within each country and keep the age coefficient
    out <- lapply(levels(df$country), function(z) {
        data.frame(country = z, age = coef(lm(income ~ 0 + age, data = df[df$country == z, ])), row.names = NULL)
    })
    do.call(rbind, out)

R lm using subset of my data frame with c(index)

You can use reformulate, with index_y the index of your y variable of interest in your data frame df:

model=lm(reformulate(colnames(df)[index],response=colnames(df)[index_y]),df)
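For illustration, here is the same idea on mtcars. The values of index and index_y below are hypothetical choices, picked so that the predictors are wt and hp and the response is mpg.

```r
index   <- c(6, 4)   # hypothetical predictor columns: wt and hp
index_y <- 1         # hypothetical response column: mpg

# Build the formula from column names, then fit
f <- reformulate(colnames(mtcars)[index], response = colnames(mtcars)[index_y])
f
model <- lm(f, data = mtcars)
names(coef(model))
```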

Apply lm to subset of data frame defined by a third column of the frame

How about

library(nlme) ## OR library(lme4)
lmList(x~y|ID,data=d)
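A runnable sketch of the same call, assuming the nlme version of lmList and using mtcars with cyl as the grouping column (the data frame d and the variable choices here are stand-ins, not the asker's data). It also checks that one group's slope matches a plain lm() restricted to that group.

```r
library(nlme)

d <- transform(mtcars, cyl = factor(cyl))   # grouping column as a factor
fits <- lmList(mpg ~ wt | cyl, data = d)
coef(fits)                                  # one row of coefficients per cyl level

# The slope for one group, compared with a single lm() on that group
fit6 <- lm(mpg ~ wt, data = d, subset = cyl == "6")
all.equal(coef(fits)["6", "wt"], unname(coef(fit6)["wt"]))
```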

Can the subset() function within the lm() R function be used to remove observations only of certain variables?

Assuming your data is in a data frame, the answer is "no." You cannot use subset on only part of a data.frame. That's because subset on a data frame returns another data frame, and in a data frame all of the variables must be the same length.

There are plenty of ways to work around this restriction, but they won't work with lm. Think about how regression works: every observation must be fully observed. If you have missing data, you have three options:

Delete the observations with missing data. This is called listwise deletion and it is the default in lm (by way of the na.omit function, applied inside the model.frame function, which is called inside lm)
Impute the missing data. This is a massive field and an area of active research
Use some kind of other method, like a Bayesian model that can integrate over the missing data

You should be able to get help in this area from Cross Validated. But the fact remains, there is simply no way to use lm on variables of unequal length, and there is no way to get subset to return a data frame containing variables of unequal length because all variables in a data frame must be the same length.
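As a rough illustration of listwise deletion, the sketch below injects hypothetical missing values into a copy of mtcars and counts the rows each fit actually uses. na.action = na.omit is passed explicitly so the result does not depend on the session's options.

```r
d <- mtcars
d$wt[1:3] <- NA      # hypothetical missing values in one predictor
d$hp[4:5] <- NA      # and in another

# A row is dropped only if it is incomplete in a variable named in the formula
fit_wt   <- lm(mpg ~ wt,      data = d, na.action = na.omit)
fit_both <- lm(mpg ~ wt + hp, data = d, na.action = na.omit)

nobs(fit_wt)     # rows with observed mpg and wt
nobs(fit_both)   # rows with observed mpg, wt and hp
```

So the rows removed depend on which variables appear in the model, but within a given fit every remaining row is complete for all of them.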