How to subset a range of values in lm()
The subset parameter in lm() and other model-fitting functions takes as its argument a logical vector the length of the data frame, evaluated in the environment of the data frame. So, if I understand you correctly, I would use the following:
fit <- lm(SP.RICH~SIZE, data=dat, subset=(SIZE>0.8 & SIZE<7))
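As a quick sanity check (a sketch with simulated data, since the questioner's dat isn't shown; the column names SP.RICH and SIZE are taken from the question), the subset= call should give the same fit as subsetting the data frame first:

```r
# Hypothetical stand-in for the questioner's data
set.seed(42)
dat <- data.frame(SIZE = runif(50, 0, 10))
dat$SP.RICH <- 2 + 0.5 * dat$SIZE + rnorm(50)

# subset= inside lm() vs. subsetting the data frame beforehand
fit1 <- lm(SP.RICH ~ SIZE, data = dat, subset = (SIZE > 0.8 & SIZE < 7))
fit2 <- lm(SP.RICH ~ SIZE, data = dat[dat$SIZE > 0.8 & dat$SIZE < 7, ])
all.equal(coef(fit1), coef(fit2))  # TRUE
```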
How does the subset argument work in the lm() function?
As a general principle, vectors used in subsetting can be either logical (a TRUE or FALSE for every element) or numeric (a vector of row indices). As a feature that helps with sampling, if the vector is numeric, R will include the same row multiple times if its index appears multiple times in the vector.
Let's take a look at cyl:
> mtcars$cyl
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
So you're getting a data.frame with one row per element of cyl: row 6, then row 6 again, then row 4, then row 6, and so on.
You can see this if you do the subsetting yourself:
> head(mtcars[mtcars$cyl,])
mpg cyl disp hp drat wt qsec vs am gear carb
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Valiant.1 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Valiant.2 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Valiant.3 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Did you mean to do something like this?
summary(lm(mpg ~ wt, data = mtcars, subset = cyl == 6))
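To see the difference concretely (a small sketch using mtcars): a bare numeric vector selects rows by index, repeats included, while a logical comparison selects the rows you probably intended:

```r
# Numeric subset: cyl is treated as a vector of row indices, so rows repeat
fit_num <- lm(mpg ~ wt, data = mtcars, subset = cyl)
fit_dup <- lm(mpg ~ wt, data = mtcars[mtcars$cyl, ])
all.equal(coef(fit_num), coef(fit_dup))  # TRUE: same (repeated) rows

# Logical subset: only the 6-cylinder cars
fit_log <- lm(mpg ~ wt, data = mtcars, subset = cyl == 6)
nobs(fit_log)  # 7
```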
Why does lm() with the subset argument give a different answer than subsetting in advance?
tl;dr As suggested in other comments and answers, the characteristics of the orthogonal polynomial basis are computed before the subsetting is taken into account.
To add more technical detail to @JonManes's answer, let's look at lines 545-553 of the R code where 'model.frame' is defined.
First we have (lines 545-549)
if(is.null(attr(formula, "predvars"))) {
    for (i in seq_along(varnames))
        predvars[[i+1L]] <- makepredictcall(variables[[i]], vars[[i+1L]])
    attr(formula, "predvars") <- predvars
}
- At this point in the code, formula will not be an actual formula (that would be too easy!), but rather a terms object that contains various useful-to-developers info about model structures ... predvars is the attribute that defines the information needed to properly reconstruct data-dependent bases like orthogonal polynomials and splines (see ?makepredictcall for a little bit more information, or here, although in general this stuff is really poorly documented; I'd expect it to be documented here but it isn't ...). For example,
attr(terms(model.frame(mpg ~ poly(horsepower, 2), data = auto_train)), "predvars")
gives
list(mpg, poly(horsepower, 2, coefs = list(alpha = c(102.612244897959,
142.498828460405), norm2 = c(1, 196, 277254.530612245, 625100662.205702
))))
These are the coefficients for the polynomial, which depend on the distribution of the input data.
Only after this information has been established, on line 553, do we get
subset <- eval(substitute(subset), data, env)
In other words, the subset argument doesn't even get evaluated until after the polynomial characteristics are determined (all of this information is then passed to the internal C_modelframe function, which you really don't want to look at ...).
Note that this issue does not result in an information leak between training and testing sets in a statistical learning context: the parameterization of the polynomial doesn't affect the predictions of the model at all (in theory, although as usual with floating point the results are unlikely to be exactly identical). At worst (if the training and full sets were very different) it could reduce numerical stability a bit.
FWIW this is all surprising (to me) and seems worth raising on the r-devel@r-project.org mailing list (at least a note in the documentation seems in order).
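A small demonstration of the effect described above (simulated data, not from the original question): with poly(), the two routes give different coefficients because the orthogonal basis is built from the full data before subsetting, but the fitted values agree since both fits span the same quadratic model space.

```r
set.seed(1)
d <- data.frame(x = runif(100, 0, 10))
d$y <- 1 + d$x - 0.1 * d$x^2 + rnorm(100)
keep <- d$x < 5

fit_subset <- lm(y ~ poly(x, 2), data = d, subset = keep)  # basis from all of d$x
fit_pre    <- lm(y ~ poly(x, 2), data = d[keep, ])         # basis from d$x[keep]

isTRUE(all.equal(coef(fit_subset), coef(fit_pre)))      # FALSE: different bases
isTRUE(all.equal(fitted(fit_subset), fitted(fit_pre)))  # TRUE: same predictions
```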
Run lm() function over subsets of 2 different variables in data frame
One option using dplyr:
df_lm <- df %>%
  group_by(District, Crop) %>%
  do(mod = lm(Yield ~ Year, data = .))

df_coef <- df_lm %>%
  do(data.frame(
    District = .$District,
    Crop = .$Crop,
    var = names(coef(.$mod)),
    coef(summary(.$mod))))
> df_coef
Source: local data frame [32 x 7]
Groups: <by row>
# A tibble: 32 × 7
District Crop var Estimate Std..Error t.value Pr...t..
* <fctr> <fctr> <fctr> <dbl> <dbl> <dbl> <dbl>
1 A Barley (Intercept) -407.66953514 378.49788671 -1.0770722 0.36034462
2 A Barley Year 0.20771336 0.19183872 1.0827499 0.35818046
3 A Maize (Intercept) 159.81133118 212.90233600 0.7506321 0.50738515
4 A Maize Year -0.08002266 0.10790790 -0.7415830 0.51211787
5 A Rice (Intercept) -68.01125454 117.60578244 -0.5782986 0.60361684
6 A Rice Year 0.03552764 0.05960758 0.5960255 0.59313364
7 A Wheat (Intercept) -59.61828825 134.67806297 -0.4426726 0.66972726
8 A Wheat Year 0.03125866 0.06826053 0.4579317 0.65918309
9 B Barley (Intercept) -319.99755207 57.14553545 -5.5996947 0.01125215
10 B Barley Year 0.16332436 0.02896377 5.6389189 0.01103509
# ... with 22 more rows
Another thing to look at is the lmList function in nlme.
How to run lm for each subset of the data frame, and then aggregate the result?
Does this work for you?
set.seed(1)
df <- data.frame(income = rnorm(100, 100, 20),
                 age = rnorm(100, 40, 10),
                 country = factor(sample(1:3, 100, replace = TRUE),
                                  levels = 1:3,
                                  labels = c("us", "gb", "france")))
out <- lapply(levels(df$country), function(z) {
  data.frame(country = z,
             age = coef(lm(income ~ 0 + age, data = df[df$country == z, ])),
             row.names = NULL)
})
do.call(rbind ,out)
R lm using subset of my data frame with c(index)
You can use reformulate, with index_y the index of your response variable of interest in your data frame df:
model <- lm(reformulate(colnames(df)[index], response = colnames(df)[index_y]), data = df)
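For example (a sketch with made-up column names; index and index_y stand for whatever column positions apply to your data):

```r
df <- data.frame(y = rnorm(20), a = rnorm(20), b = rnorm(20))
index_y <- 1       # position of the response column
index <- c(2, 3)   # positions of the predictor columns

# Build the formula from the column names, then fit
f <- reformulate(colnames(df)[index], response = colnames(df)[index_y])
f  # y ~ a + b
model <- lm(f, data = df)
```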
Apply lm to subset of data frame defined by a third column of the frame
How about
library(nlme) ## OR library(lme4)
lmList(x~y|ID,data=d)
?
Can the subset() function within the lm() R function be used to remove observations only of certain variables?
Assuming your data is in a data frame, the answer is "no." You cannot use subset on only part of a data.frame. That's because subset on a data frame returns another data frame, and in a data frame all of the variables must be the same length.
There are plenty of ways to work around this restriction, but they won't work with lm. Think about how regression works: every observation must be fully observed. If you have missing data, you have three options:
- Delete the observations with missing data. This is called listwise deletion and it is the default in lm (by way of the na.omit function, buried inside the model.matrix function, which is inside lm).
- Impute the missing data. This is a massive field and an area of active research.
- Use some other kind of method, like a Bayesian model that can integrate over the missing data.
You should be able to get help in this area from Cross Validated. But the fact remains: there is simply no way to use lm on variables of unequal length, and there is no way to get subset to return a data frame containing variables of unequal length, because all variables in a data frame must be the same length.
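For completeness, a sketch of the first option (listwise deletion), which lm performs automatically:

```r
# Rows with NA in either variable are silently dropped (na.action = na.omit,
# the default); only fully observed rows enter the fit.
d <- data.frame(y = c(1, 2, NA, 4, 5),
                x = c(1, NA, 3, 4, 5))
fit <- lm(y ~ x, data = d)
nobs(fit)  # 3: only rows 1, 4, and 5 are fully observed
```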