linear regression NA estimate just for last coefficient
You have to think a bit more about how your model is defined.
Here's your approach (edited for readability):
> set.seed(101)
> dat<-data.frame(one=c(sample(1000:1239)),
two=c(sample(200:439)),
three=c(sample(600:839)),
Jan=c(rep(1,20),rep(0,220)),
Feb=c(rep(0,20),rep(1,20),rep(0,200)),
Mar=c(rep(0,40),rep(1,20),rep(0,180)),
Apr=c(rep(0,60),rep(1,20),rep(0,160)),
May=c(rep(0,80),rep(1,20),rep(0,140)),
Jun=c(rep(0,100),rep(1,20),rep(0,120)),
Jul=c(rep(0,120),rep(1,20),rep(0,100)),
Aug=c(rep(0,140),rep(1,20),rep(0,80)),
Sep=c(rep(0,160),rep(1,20),rep(0,60)),
Oct=c(rep(0,180),rep(1,20),rep(0,40)),
Nov=c(rep(0,200),rep(1,20),rep(0,20)),
Dec=c(rep(0,220),rep(1,20)))
> summary(lm(one ~ two + three + Jan + Feb + Mar + Apr +
May + Jun + Jul + Aug + Sep + Oct + Nov + Dec,
data=dat))
And the answers:
[snip]
Coefficients: (1 not defined because of singularities)
note this line, it indicates that R (and any other statistical package you choose to use) can't estimate all the parameters because the predictor variables are not all linearly independent.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1149.55556 53.52499 21.477 <2e-16 ***
The intercept here represents the predicted value when all predictor variables are zero. In any particular case the interpretation of the intercept depends on how you have parameterized your model. The dummy variables you have defined for month are not all linearly independent; lm
is smart enough to detect this and drop some of the unidentifiable (linearly dependent) predictor variables. The details of which particular predictor(s) are discarded in this case are obscure and technical (you would probably have to look inside the lm.fit
function, but you probably don't want to do this). In this case, R decides to throw away the December
predictor. Therefore, if we set all the predictors (two
, three
, and all month dummies Jan-Nov) to zero, we end up with the expected value when two
=0 and three
=0 and when the month is not equal to any of Jan-Nov -- i.e., the expected value for December.
two -0.09670 0.06621 -1.460 0.1455
three 0.02446 0.06666 0.367 0.7141
Jan -19.49744 22.17404 -0.879 0.3802
Feb -28.22652 22.27438 -1.267 0.2064
Mar -6.05246 22.25468 -0.272 0.7859
Apr -5.60192 22.41204 -0.250 0.8029
May -13.19127 22.34289 -0.590 0.5555
Jun -19.69547 22.14274 -0.889 0.3747
Jul -44.45511 22.20837 -2.002 0.0465 *
Aug -2.08404 22.26202 -0.094 0.9255
Sep -10.13351 22.10252 -0.458 0.6470
Oct -31.80482 22.33335 -1.424 0.1558
Nov -20.35348 22.09953 -0.921 0.3580
Dec NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 69.81 on 226 degrees of freedom
Multiple R-squared: 0.04381, Adjusted R-squared: -0.01119
F-statistic: 0.7966 on 13 and 226 DF, p-value: 0.6635
Now do it again, this time setting up a model formula that uses -1
to discard the intercept term (we reset the random seed for reproducibility):
> set.seed(101)
> dat1 <- data.frame(one=c(sample(1000:1239)),two=c(sample(200:439)),
three=c(sample(600:839)),
month=factor(rep(month.abb,each=20),levels=month.abb))
> summary(lm(one ~ two + three + month-1, data=dat1))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
two -0.09670 0.06621 -1.460 0.146
three 0.02446 0.06666 0.367 0.714
The estimates for two
and three
are the same as before.
monthJan 1130.05812 52.79625 21.404 <2e-16 ***
monthFeb 1121.32904 55.18864 20.318 <2e-16 ***
monthMar 1143.50310 53.59603 21.336 <2e-16 ***
monthApr 1143.95365 54.99724 20.800 <2e-16 ***
monthMay 1136.36429 53.38218 21.287 <2e-16 ***
monthJun 1129.86010 53.85865 20.978 <2e-16 ***
monthJul 1105.10045 54.94940 20.111 <2e-16 ***
monthAug 1147.47152 54.57201 21.027 <2e-16 ***
monthSep 1139.42205 53.58611 21.263 <2e-16 ***
monthOct 1117.75075 55.35703 20.192 <2e-16 ***
monthNov 1129.20208 53.54934 21.087 <2e-16 ***
monthDec 1149.55556 53.52499 21.477 <2e-16 ***
The estimate for December is the same as the intercept estimate above. The other months' parameter estimates are equal to (intercept+previous value). The p values are different, because their meaning has changed. Previously, they were a test of differences of each month from December; now they are a test of the differences of each month from a baseline value of zero.
Unsure why getting NA for coefficient in linear regression
Your variables method
and lot
are categorical, and you can check how they occur or co-occur in your data:
table(data$method,data$lot)
2 3 4
1 20 0 0
2 0 16 10
If you look at the table above, you see that all your observations have that 1 as method, have only 2 as lot. And the same goes for method 3 and 4 only occurring in method 2.
This means that we cannot distinguish the effects of method 2 and lot 1, as they always go hand in hand in your data. Likewise, we cannot calculate the coefficient of all lot3, lot4 and method 2. We can try one simple model without the interactions:
coefficients(lm(conc ~method+lot,data=data))
(Intercept) method2 lot3 lot4
0.58356132 0.16418556 0.08506782 NA
If you put method first, method 1 is the reference and after estimating method 2 coefficient, there no observations to tell you what lot3 and lot4 are.
If you take lot first, lot 2 is set as reference and the model estimates 3 and 4, and you cannot estimate method 2.
coefficients(lm(conc ~lot+method,data=data))
(Intercept) lot3 lot4 method2
0.5835613 0.2492534 0.1641856 NA
When it's like this, you definitely cannot calculate the coefficients for interaction effects.
Does R always return NA as a coefficient as a result of linear regression with unnecessary variables?
There will be
NA
?
Yes. Adding these columns does not enlarge column space. The resulting matrix is rank-deficient.
How many
NA
?
It depends on the numerical rank.
number of NA = number of coefficients - rank of model matrix
In your example, after introducing ec
, there will be one NA
. Changing the specification order for covariates in the model formula is essentially doing column shuffling for the model matrix. This does not change the matrix rank, so you will always get only one NA
regardless of your specification order.
OK, but which one is
NA
?
lm
does LINPACK QR factorization with restricted column pivoting. The order of covariates affects which one is NA
. Generally, a "first comes, first serves" principle holds, and the position of NA
is quite predictable. Take your examples for illustration. In the first specification, these co-linear terms show up in Examination
, Catholic
, ec
order, so the third one ec
has NA
coefficient. In your second specification, these terms show up in ec
, Examination
, Catholic
order, and the third one Catholic
has NA
coefficient. Note that coefficients estimation is not invariant to specification order, although fitted values are invariant.
If LAPACK QR factorization with complete column pivoting is taken, coefficients estimation would be invariant to specification order. However, the position of NA
is not as predictable as in LINPACK case, and is purely decided numerically.
Numerical Examples
LAPACK based QR factorization is implemented in mgcv
package. Numerical rank is detected when REML estimation is used, and unidentifiable coefficients are reported as 0 (not NA
). So we can make a comparison between lm
and gam
/ bam
in linear model estimation. Let's first construct a toy dataset.
set.seed(0)
# an initial full rank matrix
X <- matrix(runif(500 * 10), 500)
# make the last column as a random linear combination of previous 9 columns
X[, 10] <- X[, -10] %*% runif(9)
# a random response
Y <- rnorm(500)
Now we shuffle columns of X
to see whether NA
changes its position under lm
estimation, or whether 0 changes its position under gam
and bam
estimation.
test <- function (fun = lm, seed = 0, ...) {
shuffleFit <- function (fun) {
shuffle <- sample.int(ncol(X))
Xs <- X[, shuffle]
b <- unname(coef(fun(Y ~ Xs, ...)))
back <- order(shuffle)
c(b[1], b[-1][back])
}
set.seed(seed)
oo <- t(replicate(10, shuffleFit(fun)))
colnames(oo) <- c("intercept", paste0("X", 1:ncol(X)))
oo
}
First we check with lm
:
test(fun = lm)
We see that NA
changes its position with column shuffling of X
. Estimated coefficients vary, too.
Now we check with gam
library(mgcv)
test(fun = gam, method = "REML")
We see that estimation is invariant to column shuffling of X
, and coefficient for X5
is always 0.
Finally we check bam
(bam
is slow for small dataset like here. It is designed for large or super large dataset. So the following is noticeably slower).
test(fun = bam, gc.level = -1)
The result is as same as what we see for gam
.
How do I interpret `NA` coefficients from a GLM fit with the quasipoisson family?
My guess is that the author (who says "the NAs imply that the found coefficients are 0, but the NA-coefficient variables are still acting as controls over the model") is wrong (although it's hard to be 100% sure without having the full context).
The problem is almost certainly that you have some multicollinear predictors. The reason that different variables get dropped/have NA
coefficients returned is that R partly uses the order to determine which ones to drop (as far as the fitted model result goes, it doesn't matter - all of the top-level results (predictions, goodness of fit, etc.) are identical).
In comments the OP says:
The relationship between
log_a
andlog_gm_a
is that this is a multiplicative fixed-effects model. Solog_a
is the log of predictora
.log_gm_a
is the log of the geometric mean ofa
. So each of thelog_gm
terms is constant across all observations.
This is the key information needed to diagnose the problem. Because the intercept is excluded from this model (the formula contains 0+
, having one constant column in the model matrix is OK, but multiple constant columns is trouble; all but the first (in whatever order is specified by the formula) will be discarded. To go slightly deeper: the model requested is
Y = b1*C1 + b2*C2 + b3*C3 + [additional terms]
where C1
, C2
, C3
are constants. At the point in "data space" where the additional terms are 0 (i.e. for cases where log_a = log_b = log_c = ... = 0
), we're left with predicting a constant value from three separate constant terms. Suppose that the intercept in a regular model (~ 1 + log_a + log_b + log_c
) would have been m
. Then any combination of (b1, b2, b3)
that makes the sum equal to zero (and there are infinitely many) will fit the data equally well.
I still don't know much about the context, but it might be worth considering adding the constant terms as offsets in the model. Or scale the predictors by their geometric means/subtract the log-geom-means from the predictors?
In other cases, multicollinearity arises from unidentifiable interaction terms; nested variables; attempts to include all the levels of multiple categorical variables; or including the proportions of all levels of some compositional variable (e.g. proportions of habitat types, where the proportions add up to 1) in the model, e.g.
- Why do I get NA coefficients and how does `lm` drop reference level for interaction
- linear regression "NA" estimate just for last coefficient
Linear regression on 415 files, output just filename, regression coefficient, significance
First I simulate like 5 csv files, with columns that look like yours:
for(i in 1:5){
tab=data.frame(
YEAR=1950:2014,
FFD= rpois(65,100),
LFD= rnorm(65,100,10),
RAN= rnbinom(65,mu=100,size=1),
MEAN = runif(65,min=50,max=150)
)
write.csv(tab,paste0("data",i,".csv"))
}
Now, we need a vector of all the files in your directory, this will be different for yours, but try to create this somehow using the pattern argument:
csvfiles = dir(pattern="data[0-9]*.csv$")
So we use three libraries from tidyverse, and I guess each csv file is not so huge, so the code below reads in all the files, group them by the source and performs the regression, note you can use call the columns from the data frame, not having to rename them:
library(dplyr)
library(purrr)
library(broom)
csvfiles %>%
map_df(function(i){df = read.csv(i);df$data = i;df}) %>%
group_by(data) %>%
do(tidy(lm(FFD ~ YEAR,data=.))) %>%
filter(term!="(Intercept)")
# A tibble: 5 x 6
# Groups: data [5]
data term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 data1.csv YEAR -0.0228 0.0731 -0.311 0.756
2 data2.csv YEAR -0.139 0.0573 -2.42 0.0182
3 data3.csv YEAR -0.175 0.0650 -2.70 0.00901
4 data4.csv YEAR -0.0478 0.0628 -0.762 0.449
5 data5.csv YEAR 0.0204 0.0648 0.315 0.754
You can just change the formula inside lm(FFD ~ YEAR,data=.)
to get the other regressions
Within a simple linear regression in R, how do I rescale age to estimate it's beta-coefficient per per year/5 years/10 years?
The beta coefficient shows the amount that the dependent variable (DV, in this case continuous_outcome
) will increase for every one unit increase in your independent variable (IV, in this case age
in years).
If you want to show the relationship per 1/10th of a year, multiply your age
column before fitting the model, or divide the beta coefficient by 10.
For your specific requests, since the beta coefficient is 86.06, you can multiply this by the number of years to get the increase of the continuous variable. So:
- 1 year increase = +86.06
- 5 year increase = +430.3 (86.06 * 5)
- 10 year increase = +860.6 (86.06 * 10)
To answer the last question (The estimate for the effect of age per 5 years), that would be 430.3, which is 86.06 * 5. So for every 5 years that a persons age
increases, the continuous_outcome
increases by 430.3 on average.
Related Topics
How to Flatten a List of Lists
Align Ggplot2 Plots Vertically
Reverse Order of Discrete Y Axis in Ggplot2
How to Read Only Lines That Fulfil a Condition from a CSV into R
How to Obtain an 'Unbalanced' Grid of Ggplots
How to Make R Beep/Play a Sound at the End of a Script
Automatically Delete Files/Folders
Code to Import Data from a Stack Overflow Query into R
Non-Equi Join Using Data.Table: Column Missing from the Output
How to Add Legend to Ggplot Manually? - R
Min for Each Row in a Data Frame
Count Number of Zeros Per Row, and Remove Rows with More Than N Zeros
Error in Grid.Call(L_Textbounds, As.Graphicsannot(X$Label), X$X, X$Y,:Polygon Edge Not Found
How to Install Development Version of R Packages Github Repository