poly() in lm(): difference between raw vs. orthogonal
By default, with raw = FALSE
, poly()
computes an orthogonal polynomial. It internally sets up the model matrix with the raw coding x, x^2, x^3, ... first and then scales the columns so that each column is orthogonal to the previous ones. This does not change the fitted values but has the advantage that you can see whether a certain order in the polynomial significantly improves the regression over the lower orders.
Consider the simple cars
data with response stopping dist
ance and driving speed
. Physically, this should have a quadratic relationship but in this (old) dataset the quadratic term is not significant:
m1 <- lm(dist ~ poly(speed, 2), data = cars)
m2 <- lm(dist ~ poly(speed, 2, raw = TRUE), data = cars)
In the orthogonal coding you get the following coefficients in summary(m1)
:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.980 2.146 20.026 < 2e-16 ***
poly(speed, 2)1 145.552 15.176 9.591 1.21e-12 ***
poly(speed, 2)2 22.996 15.176 1.515 0.136
This shows that there is a highly significant linear effect while the second order is not significant. The latter p-value (i.e., the one of the highest order in the polynomial) is the same as in the raw coding:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.47014 14.81716 0.167 0.868
poly(speed, 2, raw = TRUE)1 0.91329 2.03422 0.449 0.656
poly(speed, 2, raw = TRUE)2 0.09996 0.06597 1.515 0.136
but the lower order p-values change dramatically. The reason is that in model m1
the regressors are orthogonal while they are highly correlated in m2
:
cor(model.matrix(m1)[, 2], model.matrix(m1)[, 3])
## [1] 4.686464e-17
cor(model.matrix(m2)[, 2], model.matrix(m2)[, 3])
## [1] 0.9794765
Thus, in the raw coding you can only interpret the p-value of speed
if speed^2
remains in the model. And as both regressors are highly correlated one of them can be dropped. However, in the orthogonal coding speed^2
only captures the quadratic part that has not been captured by the linear term. And then it becomes clear that the linear part is significant while the quadratic part has no additional significance.
Regression in R using poly() function
Because they are not the same model. Your second one has one unique covariate, while the first has two.
> model_2
Call:
lm(formula = v ~ 1 + q + q^2)
Coefficients:
(Intercept) q
-15.251 7.196
You should use the I()
function to modify one parameter inside your formula in order the regression to consider it as a covariate:
model_2=lm(v~1+q+I(q^2))
> model_2
Call:
lm(formula = v ~ 1 + q + I(q^2))
Coefficients:
(Intercept) q I(q^2)
7.5612 -3.3323 0.8774
will give the same prediction
> predict(model_1)
1 2 3 4 5 6 7 8 9 10 11
5.106294 4.406154 5.460793 8.270210 12.834406 19.153380 27.227133 37.055664 48.638974 61.977063 77.069930
> predict(model_2)
1 2 3 4 5 6 7 8 9 10 11
5.106294 4.406154 5.460793 8.270210 12.834406 19.153380 27.227133 37.055664 48.638974 61.977063 77.069930
How do you make R poly() evaluate (or predict ) multivariate new data (orthogonal or raw)?
For the record, it seems that this has been fixed
> x1 = seq(1, 10, by=0.2)
> x2 = seq(1.1,10.1,by=0.2)
> t = poly(cbind(x1,x2),degree=2,raw=T)
>
> class(t) # has a class now
[1] "poly" "matrix"
>
> # does not throw error
> predict(t, newdata = cbind(x1,x2)[1:2, ])
1.0 2.0 0.1 1.1 0.2
[1,] 1.0 1.00 1.1 1.10 1.21
[2,] 1.2 1.44 1.3 1.56 1.69
attr(,"degree")
[1] 1 2 1 2 2
attr(,"class")
[1] "poly" "matrix"
>
> # and gives the same
> t[1:2, ]
1.0 2.0 0.1 1.1 0.2
[1,] 1.0 1.00 1.1 1.10 1.21
[2,] 1.2 1.44 1.3 1.56 1.69
>
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Is there an inverse function for poly?
cars$speed
must be of the form a + b * pp[, 1] for some scalars a and b and knowing that the coefs
attribute of poly objects contains values which can be used for reconstruction we find the following reconstruction of cars$speed
as speed
.
pp <- poly(cars$speed, 2)
speed <- with(attr(pp, "coefs"), alpha[1] + sqrt(norm2)[3] * pp[, 1])
all.equal(speed, cars$speed)
## [1] TRUE
Related Topics
Specifying Column Names in a Data.Frame Changes Spaces to "."
How to Get Ranks with No Gaps When There Are Ties Among Values
Cumulative Sum Until Maximum Reached, Then Repeat from Zero in the Next Row
R Package Lattice Won't Plot If Run Using Source()
Dplyr/R Cumulative Sum with Reset
Python's Xrange Alternative for R or How to Loop Over Large Dataset Lazilly
How to Create a "Macro" for Regressors in R
How to Access the Help/Documentation .Rd Source Files in R
Merge Rows in a Dataframe Where the Rows Are Disjoint and Contain Nas
How to Force Specific Order of the Variables on the X Axis
How to Avoid: Read.Table Truncates Numeric Values Beginning with 0
Convert String to Date, Format: "Dd.Mm.Yyyy"
Deleting Columns from a Data.Frame Where Na Is More Than 15% of the Column Length
Ggplot2 Shade Area Under Density Curve by Group