What does the capital letter I in R linear regression formula mean?
I() isolates or insulates the contents of I( ... ) from the gaze of R's formula parsing code. It allows the standard R operators to work as they would if you used them outside of a formula, rather than being treated as special formula operators.
For example:
y ~ x + x^2
would, to R, mean "give me x (the main effect of x) and x^2 (the main effect and the second-order interaction of x)", not the intended x plus x-squared:
> model.frame( y ~ x + x^2, data = data.frame(x = rnorm(5), y = rnorm(5)))
y x
1 -1.4355144 -1.85374045
2 0.3620872 -0.07794607
3 -1.7590868 0.96856634
4 -0.3245440 0.18492596
5 -0.6515630 -1.37994358
This is because ^ is a special operator in a formula, as described in ?formula. You end up only including x in the model frame because the main effect of x is already included from the x term in the formula, and there is nothing to cross x with to get the second-order interactions in the x^2 term.
To get the usual operator, you need to use I() to isolate the call from the formula code:
> model.frame( y ~ x + I(x^2), data = data.frame(x = rnorm(5), y = rnorm(5)))
y x I(x^2)
1 -0.02881534 1.0865514 1.180593....
2 0.23252515 -0.7625449 0.581474....
3 -0.30120868 -0.8286625 0.686681....
4 -0.67761458 0.8344739 0.696346....
5 0.65522764 -0.9676520 0.936350....
(That last column is correct; it just looks odd because it is of class AsIs.)
In your example, - when used in a formula would indicate removal of a term from the model, where you wanted - to have its usual binary operator meaning of subtraction:
> model.frame( y ~ x - mean(x), data = data.frame(x = rnorm(5), y = rnorm(5)))
Error in model.frame.default(y ~ x - mean(x), data = data.frame(x = rnorm(5), :
variable lengths differ (found for 'mean(x)')
This fails because mean(x) is a length-1 vector, and model.frame() quite rightly tells you this doesn't match the length of the other variables. A way round this is I():
> model.frame( y ~ I(x - mean(x)), data = data.frame(x = rnorm(5), y = rnorm(5)))
y I(x - mean(x))
1 1.1727063 1.142200....
2 -1.4798270 -0.66914....
3 -0.4303878 -0.28716....
4 -1.0516386 0.542774....
5 1.5225863 -0.72865....
Hence, where you want to use an operator that has special meaning in a formula, but you need its non-formula meaning, you need to wrap the elements of the operation in I( ).
Read ?formula for more on the special operators, and ?I for more details on the function itself and its other main use-case within data frames (which is where the AsIs bit originates from, if you are interested).
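The difference can also be checked without printing a model frame, by asking the formula machinery which terms it found. The following is a minimal sketch using base R only; the data frame d and the seed are made up for illustration.

```r
# Compare how the formula parser expands x^2 with and without I().
set.seed(1)  # arbitrary seed, only so the example is reproducible
d <- data.frame(x = rnorm(5), y = rnorm(5))

# Term labels as seen by the formula parser
labels_plain <- attr(terms(y ~ x + x^2), "term.labels")
labels_asis  <- attr(terms(y ~ x + I(x^2)), "term.labels")
print(labels_plain)  # x^2 collapses into x, so only one term remains
print(labels_asis)   # the I(x^2) term survives as its own predictor

# The design matrix has a squared column only when I() is used
mm <- model.matrix(y ~ x + I(x^2), data = d)
print(colnames(mm))
```

Here terms() only parses the formula and does not touch the data, so it is a cheap way to see what a formula will expand to before fitting anything.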
Significance of 'I' keyword in lm model in R
I() prevents the formula interface from interpreting the argument, so it gets passed along instead to the expression-parsing part.
In the formula interface, -x means 'remove x from the predictors'. So I can do y~.-x to mean 'fit y against everything but x'.
You don't want it to do that - you actually want to make a variable that is the difference of two variables and regress on that, so you don't want the formula interface to parse that expression.
I() achieves that for you.
Terms with squaring in them (x^2) also need the same treatment. The formula interface does something special with powers, and if you actually want a variable squared you have to I() it.
I() has some other uses in other contexts as well. See ?I.
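As a rough illustration of the two meanings of the minus sign, the sketch below uses a small made-up data frame (the column names y, a, b and x are hypothetical) and base R only.

```r
# Hypothetical data: y is the response; a, b and x are candidate predictors.
d <- data.frame(y = c(1.0, 2.1, 2.9, 4.2, 5.1),
                a = c(2, 4, 5, 7, 9),
                b = c(1, 2, 2, 3, 5),
                x = c(0, 1, 0, 1, 0))

# "." stands for every column except the response; "- x" removes x from that set
labels_drop <- attr(terms(y ~ . - x, data = d), "term.labels")
print(labels_drop)

# I(a - b) is a single predictor: the element-wise difference of a and b
fit <- lm(y ~ I(a - b), data = d)
print(names(coef(fit)))
```

The second model has one slope coefficient, for the difference a - b, rather than a model containing a with b removed.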
In R formulas, why do I have to use the I() function on power terms, like y ~ I(x^3)
The tilde operator is actually a function that returns an unevaluated expression, a type of language object. The expression then gets interpreted by modeling functions in a manner that is different than the interpretation of operators operating on numeric objects.
The issue here is how formulas, and specifically the "+", ":", and "^" operators in them, are interpreted. (A side note: the correct statistical procedure would be to use the function poly when attempting to make higher order terms in a regression formula.) Within R formulas the infix operators "+", "*", ":" and "^" have entirely different meanings than when used in calculations with numeric vectors. In a formula the tilde (~) separates the left hand side from the right hand side. The ^ and : operators are used to construct interactions, so x = x^2 = x^3 rather than becoming the perhaps expected mathematical powers. (A variable interacting with itself is just the same variable.) If you had typed (x+y)^2, the R interpreter would have produced (for its own good internal use) not a mathematical x^2 + 2xy + y^2, but rather a symbolic x + y + x:y, where x:y is an interaction term without its main effects. (The ^ gives you both main effects and interactions.)
See ?formula.
The I() function acts to convert the argument to "as.is", i.e. what you expect. So I(x^2) would return a vector of values raised to the second power.
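The symbolic expansion described above can be inspected directly. This is a small sketch in base R; term_labels is a hypothetical helper defined here, and the vector x is made up. It also shows poly() with raw = TRUE, which is one way to get literal powers (without raw = TRUE, poly() returns orthogonal polynomials instead).

```r
# Hypothetical helper: list the terms the formula parser finds.
term_labels <- function(f) attr(terms(f), "term.labels")

sq  <- term_labels(~ (x + y)^2)  # main effects plus the two-way interaction
cub <- term_labels(~ x^3)        # a variable crossed with itself is itself
print(sq)
print(cub)

# Literal powers via poly(): with raw = TRUE the columns are x and x^2
x <- c(1, 2, 3, 4)
p <- poly(x, degree = 2, raw = TRUE)
print(as.numeric(p[, 2]))
```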
The ~ should be thought of as saying "is distributed as" or "is dependent on" when seen in regression functions. The ~ is an infix function in its own right. You can see that LHS ~ RHS is almost shorthand for formula(LHS, RHS) by typing this at the console:
`~`(LHS,RHS)
#LHS ~ RHS
class( `~`(LHS,RHS) )
#[1] "formula"
identical( `~`(LHS,RHS), as.formula("LHS~RHS") )
#[1] TRUE # cannot use `formula` since it interprets its first argument
In regression functions, the error term in model descriptions will be in whatever form that regression function presumes or is specifically called for in the parameters for family. The mean for the base level will generally be labelled (Intercept). The function context and arguments may also further determine a link function such as log() or logit() from the family value, and it is also possible to have a non-canonical family/link combination.
The "+" symbol in a formula is not really adding two variables but is usually an implicit request to calculate one or more regression coefficients for that variable in the context of the rest of the variables that are on the RHS of a formula. The regression functions use model.matrix, and that function will recognize the presence of factors or character vectors in the formula and build a matrix that expands the levels of the discrete components of the formula.
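To make the factor expansion concrete, here is a minimal sketch with made-up data (the columns y, g and x are hypothetical). It assumes R's default treatment contrasts, under which the first factor level is absorbed into the intercept.

```r
# Hypothetical data: g is a three-level factor, x is numeric.
d <- data.frame(y = c(1, 2, 3, 4, 5, 6),
                g = factor(c("a", "b", "c", "a", "b", "c")),
                x = c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5))

# "+" asks for a coefficient per term; the factor g becomes indicator columns
mm <- model.matrix(y ~ g + x, data = d)
print(colnames(mm))
print(mm)
```

With the default contrasts the factor contributes one indicator column per non-baseline level, so a single term on the right hand side can correspond to several coefficients.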
In plot()-ting functions, the formula basically reverses the usual ( x, y ) order of arguments that the plot function usually takes. There was a plot.formula method written so that formulas could be used as a more "mathematical" mode of communicating with R. In the graphics::plot.formula, curve, and 'lattice' and 'ggplot' functions, it governs how multiple factors or numeric vectors are displayed and "facetted".
The "+" operator is also overloaded in the plotting packages ggplot2 and gridExtra, where it separates functions that deliver object results. There it acts as a pass-through and layering operator. Some aggregation functions have a formula method which uses "+" as an "arrangement" and grouping operator.
I() equivalent (used in R), what is the Python equivalent?
I found the answer, seems to be as simple as:
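import statsmodels.formula.api as smf  # formula interface providing smf.ols
# `data` is assumed to be a pandas DataFrame with columns medv and lstat
f = 'medv ~ lstat + I(lstat**2)'
fit3 = smf.ols(f, data=data).fit()
print(fit3.summary())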
Time Series and Linear Regression
The reason you can't use $ is that the z object shown in the question is not a zoo object. It is a ts object. You can use class(z), str(z) and dput(z) to determine what you have. Also, $ works on zoo objects but not on ts objects. Convert it to zoo and then $ will work.
library(zoo)
zz <- zoo(z, as.yearqtr(time(z)))
zz$GDPGrowth
## 2482 Q1 2482 Q2 2482 Q3 2482 Q4 2483 Q1 2483 Q2
## 1.20000000 -0.20000000 -0.15000000 -0.10000000 0.40000000 0.33333333
## 2483 Q3 2483 Q4 2484 Q1 2484 Q2 2484 Q3 2484 Q4
## 0.26666667 0.20000000 0.50000000 0.80000000 1.10000000 1.40000000
## # ... snip ...
The times in your object are way into the future, but unless we know how you created them we cannot know how that happened. You possibly were playing with Date objects and made some error in converting them to ts.
You have quarterly data, and the 0, 0.25, 0.5 and 0.75 are how ts objects represent the 4 quarters internally.
If this refers to not wanting to apply na.approx to certain columns, then if ix is a vector of column names or numbers to convert, zz[, ix] <- na.approx(zz[, ix]) applies na.approx only to those columns.
ts and zoo represent the index via tsp and index attributes respectively, so they are still there. time(z) and time(zz) will retrieve the index.
If you want to do statistical tests, compute confidence intervals, etc., then you need to take the correlations into account; however, if you just want to get point estimates you don't need to concern yourself with that. The dyn package (also the dynlm package) can be used to facilitate running lm with zoo objects.
library(dyn)
fm <- dyn$lm(GDPGrowth ~ ApprovalGOV, zz)
fm
## Call:
## lm(formula = dyn(GDPGrowth ~ ApprovalGOV), data = zz)
##
## Coefficients:
## (Intercept) ApprovalGOV
## -1.9717 0.3575
Either of these also works; they make use of with.zoo and fortify.zoo respectively.
with(zz, lm(GDPGrowth ~ ApprovalGOV))
lm(GDPGrowth ~ ApprovalGOV, fortify.zoo(zz))
To plot the points and draw in a regression line:
plot(formula(fm), zz)
abline(fm)
Other points are:
R is case sensitive, so GDPGrowth is not the same as GDPGROWTH.
Do not use random code snippets that you have found on the net without first reading the help files for each function used so that you know whether it makes sense for your problem. Also read all the vignettes (pdf or html documents) for each package that you are using. In particular, the zoo package has 5 vignettes and a reference manual.
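The points about ts objects can be reproduced with base R alone. The sketch below builds a small quarterly ts object from made-up numbers (the values and the start year 2000 are assumptions, not the data from the question) and shows the fractional time index and the failure of $.

```r
# Made-up quarterly data with the two column names used in the question.
vals <- matrix(c(1.2, -0.2, -0.15, -0.1, 50, 51, 49, 48), ncol = 2,
               dimnames = list(NULL, c("GDPGrowth", "ApprovalGOV")))
z <- ts(vals, start = c(2000, 1), frequency = 4)

print(class(z))        # a multivariate ts object, not a zoo object
idx <- as.numeric(time(z))
print(idx)             # quarters appear as fractions of a year

# Columns of a multivariate ts are extracted with matrix indexing ...
gdp <- z[, "GDPGrowth"]
print(gdp)

# ... whereas `$` fails, because the object is an atomic matrix underneath
res <- try(z$GDPGrowth, silent = TRUE)
print(inherits(res, "try-error"))
```

Converting with zoo(z, as.yearqtr(time(z))), as in the answer, requires the zoo package and is what makes $ usable.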