Linear Regression "NA" Estimate Just for Last Coefficient

linear regression NA estimate just for last coefficient

You have to think a bit more about how your model is defined.

Here's your approach (edited for readability):

> set.seed(101)
> dat <- data.frame(one=c(sample(1000:1239)),
                    two=c(sample(200:439)),
                    three=c(sample(600:839)),
                    Jan=c(rep(1,20),rep(0,220)),
                    Feb=c(rep(0,20),rep(1,20),rep(0,200)),
                    Mar=c(rep(0,40),rep(1,20),rep(0,180)),
                    Apr=c(rep(0,60),rep(1,20),rep(0,160)),
                    May=c(rep(0,80),rep(1,20),rep(0,140)),
                    Jun=c(rep(0,100),rep(1,20),rep(0,120)),
                    Jul=c(rep(0,120),rep(1,20),rep(0,100)),
                    Aug=c(rep(0,140),rep(1,20),rep(0,80)),
                    Sep=c(rep(0,160),rep(1,20),rep(0,60)),
                    Oct=c(rep(0,180),rep(1,20),rep(0,40)),
                    Nov=c(rep(0,200),rep(1,20),rep(0,20)),
                    Dec=c(rep(0,220),rep(1,20)))
> summary(lm(one ~ two + three + Jan + Feb + Mar + Apr +
               May + Jun + Jul + Aug + Sep + Oct + Nov + Dec,
             data=dat))

And the answers:

[snip]
Coefficients: (1 not defined because of singularities)

Note this line: it indicates that R (and any other statistical package you might use) cannot estimate all the parameters, because the predictor variables are not all linearly independent.

              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1149.55556   53.52499  21.477   <2e-16 ***

The intercept here represents the predicted value when all predictor variables are zero. In any particular case the interpretation of the intercept depends on how you have parameterized your model. The dummy variables you have defined for month are not all linearly independent; lm is smart enough to detect this and drop some of the unidentifiable (linearly dependent) predictor variables. The details of which particular predictor(s) are discarded are obscure and technical (you would probably have to look inside the lm.fit function, but you probably don't want to do this). In this case, R decides to throw away the December predictor. Therefore, if we set all the predictors (two, three, and all month dummies Jan-Nov) to zero, we end up with the expected value when two=0 and three=0 and when the month is not equal to any of Jan-Nov, i.e. the expected value for December.

two           -0.09670    0.06621  -1.460   0.1455    
three          0.02446    0.06666   0.367   0.7141    
Jan          -19.49744   22.17404  -0.879   0.3802    
Feb          -28.22652   22.27438  -1.267   0.2064    
Mar           -6.05246   22.25468  -0.272   0.7859    
Apr           -5.60192   22.41204  -0.250   0.8029    
May          -13.19127   22.34289  -0.590   0.5555    
Jun          -19.69547   22.14274  -0.889   0.3747    
Jul          -44.45511   22.20837  -2.002   0.0465 *  
Aug           -2.08404   22.26202  -0.094   0.9255    
Sep          -10.13351   22.10252  -0.458   0.6470    
Oct          -31.80482   22.33335  -1.424   0.1558    
Nov          -20.35348   22.09953  -0.921   0.3580    
Dec                 NA         NA      NA       NA    
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 69.81 on 226 degrees of freedom
Multiple R-squared: 0.04381, Adjusted R-squared: -0.01119
F-statistic: 0.7966 on 13 and 226 DF, p-value: 0.6635
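
As an aside, a quick check (using the dat constructed above) confirms the linear dependence: the twelve month dummies sum to 1 in every row, so Dec is exactly 1 minus the sum of Jan-Nov and adds nothing once the intercept and the other eleven dummies are in the model.

mcols <- c("Jan","Feb","Mar","Apr","May","Jun",
           "Jul","Aug","Sep","Oct","Nov","Dec")
all(rowSums(dat[, mcols]) == 1)                  # TRUE: exactly one month dummy per row
all(dat$Dec == 1 - rowSums(dat[, mcols[-12]]))   # TRUE: Dec is redundant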

Now do it again, this time coding month as a single factor and using -1 in the model formula to discard the intercept term (we reset the random seed so the simulated data are the same):

> set.seed(101)
> dat1 <- data.frame(one=c(sample(1000:1239)),
                     two=c(sample(200:439)),
                     three=c(sample(600:839)),
                     month=factor(rep(month.abb,each=20),levels=month.abb))
> summary(lm(one ~ two + three + month-1, data=dat1))

Coefficients:
         Estimate Std. Error t value Pr(>|t|)    
two      -0.09670    0.06621  -1.460    0.146    
three     0.02446    0.06666   0.367    0.714    

The estimates for two and three are the same as before.

monthJan 1130.05812   52.79625  21.404   <2e-16 ***
monthFeb 1121.32904   55.18864  20.318   <2e-16 ***
monthMar 1143.50310   53.59603  21.336   <2e-16 ***
monthApr 1143.95365   54.99724  20.800   <2e-16 ***
monthMay 1136.36429   53.38218  21.287   <2e-16 ***
monthJun 1129.86010   53.85865  20.978   <2e-16 ***
monthJul 1105.10045   54.94940  20.111   <2e-16 ***
monthAug 1147.47152   54.57201  21.027   <2e-16 ***
monthSep 1139.42205   53.58611  21.263   <2e-16 ***
monthOct 1117.75075   55.35703  20.192   <2e-16 ***
monthNov 1129.20208   53.54934  21.087   <2e-16 ***
monthDec 1149.55556   53.52499  21.477   <2e-16 ***

The estimate for December is the same as the intercept estimate above. The other months' parameter estimates are equal to (intercept+previous value). The p values are different, because their meaning has changed. Previously, they were a test of differences of each month from December; now they are a test of the differences of each month from a baseline value of zero.
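
If you want to convince yourself of the correspondence, here is a small check (assuming both data frames above are still in the workspace; the names fit1 and fit2 are just for illustration):

fit1 <- lm(one ~ two + three + Jan + Feb + Mar + Apr + May + Jun +
             Jul + Aug + Sep + Oct + Nov + Dec, data = dat)
fit2 <- lm(one ~ two + three + month - 1, data = dat1)

# the intercept of fit1 is the December baseline; adding a month's dummy
# coefficient reproduces that month's coefficient in fit2
unname(coef(fit1)["(Intercept)"])                      # equals coef(fit2)["monthDec"]
unname(coef(fit1)["(Intercept)"] + coef(fit1)["Jan"])  # equals coef(fit2)["monthJan"]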

Unsure why getting NA for coefficient in linear regression

Your variables method and lot are categorical, and you can check how they occur or co-occur in your data:

table(data$method,data$lot)

     2  3  4
  1 20  0  0
  2  0 16 10

If you look at the table above, you see that all observations with method 1 have lot 2, and only lot 2. Likewise, lots 3 and 4 occur only with method 2.

This means that we cannot distinguish the effect of method from the effect of lot, as they always go hand in hand in your data. In particular, we cannot estimate coefficients for lot3, lot4 and method2 all at once. We can try one simple model without the interactions:

coefficients(lm(conc ~ method + lot, data = data))
(Intercept)     method2        lot3        lot4 
 0.58356132  0.16418556  0.08506782          NA 

If you put method first, method 1 is the reference; once the method2 coefficient is estimated, the only information left in the data is the contrast between lots 3 and 4 within method 2, so lot3 gets an estimate and lot4 comes back NA.

If you take lot first, lot 2 is set as the reference and the model estimates lot3 and lot4; then there is no information left with which to estimate method2, so it is NA:

coefficients(lm(conc ~ lot + method, data = data))
(Intercept)        lot3        lot4     method2 
  0.5835613   0.2492534   0.1641856          NA 

When the main effects are already confounded like this, you certainly cannot estimate coefficients for interaction effects either.
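
As a sketch of that point (using the same data frame as above), an interaction model simply produces more NA coefficients, because the interaction columns duplicate columns that are already in the model matrix:

coefficients(lm(conc ~ method * lot, data = data))
# with method listed first, lot4 and both method2:lot interaction terms come
# back NA: the data contain only three distinct method/lot combinations, so
# at most three coefficients (here intercept, method2 and lot3) are identifiable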

Does R always return NA as a coefficient as a result of linear regression with unnecessary variables?

Will there be NA?

Yes. Adding these columns does not enlarge the column space of the model matrix; the resulting matrix is rank-deficient.

How many NA?

It depends on the numerical rank.

number of NA = number of coefficients - rank of model matrix

In your example, after introducing ec, there will be one NA. Changing the specification order for covariates in the model formula is essentially doing column shuffling for the model matrix. This does not change the matrix rank, so you will always get only one NA regardless of your specification order.
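
As a concrete sketch of this counting rule (assuming, as the variable name suggests, that ec was constructed as Examination + Catholic in the built-in swiss data):

sw  <- transform(swiss, ec = Examination + Catholic)
fit <- lm(Fertility ~ Agriculture + Examination + Education + Catholic +
            Infant.Mortality + ec, data = sw)
length(coef(fit))              # 7 coefficients (including the intercept)
fit$rank                       # rank of the model matrix: 6
length(coef(fit)) - fit$rank   # 1, so exactly one NA
sum(is.na(coef(fit)))          # 1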

OK, but which one is NA?

lm does LINPACK QR factorization with restricted column pivoting. The order of covariates affects which one is NA. Generally, a "first come, first served" principle holds, and the position of the NA is quite predictable. Take your examples for illustration. In the first specification, the collinear terms show up in Examination, Catholic, ec order, so the third one, ec, has an NA coefficient. In your second specification, these terms show up in ec, Examination, Catholic order, so the third one, Catholic, has an NA coefficient. Note that coefficient estimation is not invariant to specification order, although fitted values are invariant.
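
Continuing the swiss-based sketch from above, the NA lands on whichever collinear term comes last in the formula:

coef(lm(Fertility ~ Examination + Catholic + ec, data = sw))  # ec is NA
coef(lm(Fertility ~ ec + Examination + Catholic, data = sw))  # Catholic is NA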

If LAPACK QR factorization with complete column pivoting is used instead, coefficient estimation is invariant to specification order. However, the position of the NA is not as predictable as in the LINPACK case; it is decided purely numerically.



Numerical Examples

LAPACK-based QR factorization is implemented in the mgcv package. The numerical rank is detected when REML estimation is used, and unidentifiable coefficients are reported as 0 (not NA). So we can compare lm with gam / bam for linear model estimation. Let's first construct a toy dataset.

set.seed(0)

# an initial full rank matrix
X <- matrix(runif(500 * 10), 500)
# make the last column a random linear combination of the previous 9 columns
X[, 10] <- X[, -10] %*% runif(9)

# a random response
Y <- rnorm(500)

Now we shuffle columns of X to see whether NA changes its position under lm estimation, or whether 0 changes its position under gam and bam estimation.

test <- function (fun = lm, seed = 0, ...) {
  ## fit Y ~ X with the columns of X randomly shuffled, then put the
  ## coefficients back in the original column order
  shuffleFit <- function (fun) {
    shuffle <- sample.int(ncol(X))
    Xs <- X[, shuffle]
    b <- unname(coef(fun(Y ~ Xs, ...)))
    back <- order(shuffle)
    c(b[1], b[-1][back])
  }
  set.seed(seed)
  ## repeat the shuffled fit 10 times; one row of output per shuffle
  oo <- t(replicate(10, shuffleFit(fun)))
  colnames(oo) <- c("intercept", paste0("X", 1:ncol(X)))
  oo
}

First we check with lm:

test(fun = lm)

We see that NA changes its position with column shuffling of X. Estimated coefficients vary, too.


Now we check with gam:

library(mgcv)
test(fun = gam, method = "REML")

We see that estimation is invariant to column shuffling of X, and the coefficient for X5 is always 0.


Finally we check bam. (bam is slow for a small dataset like this; it is designed for large or very large datasets, so the following is noticeably slower.)

test(fun = bam, gc.level = -1)

The result is the same as what we see for gam.

How do I interpret `NA` coefficients from a GLM fit with the quasipoisson family?

My guess is that the author (who says "the NAs imply that the found coefficients are 0, but the NA-coefficient variables are still acting as controls over the model") is wrong (although it's hard to be 100% sure without having the full context).

The problem is almost certainly that you have some multicollinear predictors. The reason that different variables get dropped/have NA coefficients returned is that R partly uses the order to determine which ones to drop (as far as the fitted model result goes, it doesn't matter - all of the top-level results (predictions, goodness of fit, etc.) are identical).

In comments the OP says:

The relationship between log_a and log_gm_a is that this is a multiplicative fixed-effects model. So log_a is the log of predictor a. log_gm_a is the log of the geometric mean of a. So each of the log_gm terms is constant across all observations.

This is the key information needed to diagnose the problem. Because the intercept is excluded from this model (the formula contains 0 +), having one constant column in the model matrix is OK, but multiple constant columns are trouble; all but the first (in whatever order is specified by the formula) will be discarded. To go slightly deeper: the model requested is

Y = b1*C1 + b2*C2 + b3*C3 + [additional terms]

where C1, C2, C3 are constants. At the point in "data space" where the additional terms are 0 (i.e. for cases where log_a = log_b = log_c = ... = 0), we're left with predicting a constant value from three separate constant terms. Suppose that the intercept in a regular model (~ 1 + log_a + log_b + log_c) would have been m. Then any combination of (b1, b2, b3) that makes b1*C1 + b2*C2 + b3*C3 equal to m (and there are infinitely many such combinations) will fit the data equally well.
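
Here is a minimal reproduction of the constant-column problem, using made-up data rather than the OP's (all names are placeholders):

set.seed(1)
dd <- data.frame(y = rpois(100, 5), log_a = rnorm(100))
dd$log_gm_a <- mean(dd$log_a)   # a column that is constant across observations
dd$log_gm_b <- 1.7              # a second constant column
coef(glm(y ~ 0 + log_gm_a + log_gm_b + log_a,
         family = quasipoisson, data = dd))
# log_gm_b comes back NA: with the intercept removed, only the first constant
# column gets a coefficient; any further constant columns are redundant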

I still don't know much about the context, but it might be worth considering adding the constant terms as offsets in the model, or scaling the predictors by their geometric means (i.e. subtracting the log geometric means from the log predictors).
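
A purely hypothetical sketch of the offset idea (the response y and data frame dd are placeholders; only the log_a / log_gm_a naming comes from the OP's description):

fit <- glm(y ~ 0 + log_a + log_b + log_c +
             offset(log_gm_a + log_gm_b + log_gm_c),
           family = quasipoisson, data = dd)
# the constant log-geometric-mean terms now enter with coefficients fixed at 1
# instead of being (unidentifiably) estimated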


In other cases, multicollinearity arises from unidentifiable interaction terms; nested variables; attempts to include all the levels of multiple categorical variables; or including the proportions of all levels of some compositional variable (e.g. proportions of habitat types, where the proportions add up to 1) in the model. See for example:

  • Why do I get NA coefficients and how does `lm` drop reference level for interaction
  • linear regression "NA" estimate just for last coefficient

Linear regression on 415 files, output just filename, regression coefficient, significance

First I simulate 5 CSV files with columns that look like yours:

for(i in 1:5){
  tab = data.frame(
    YEAR = 1950:2014,
    FFD  = rpois(65, 100),
    LFD  = rnorm(65, 100, 10),
    RAN  = rnbinom(65, mu = 100, size = 1),
    MEAN = runif(65, min = 50, max = 150)
  )
  write.csv(tab, paste0("data", i, ".csv"))
}

Now we need a vector of all the files in your directory. This will look different for your data, but try to create it using the pattern argument:

csvfiles = dir(pattern="data[0-9]*.csv$")

We use three libraries from the tidyverse. Assuming each CSV file is not too huge, the code below reads in all the files, groups them by their source file and performs the regression. Note that you can refer to the columns of the data frame directly, without having to rename them:

library(dplyr)
library(purrr)
library(broom)

csvfiles %>%
  map_df(function(i){ df = read.csv(i); df$data = i; df }) %>%
  group_by(data) %>%
  do(tidy(lm(FFD ~ YEAR, data = .))) %>%
  filter(term != "(Intercept)")

# A tibble: 5 x 6
# Groups:   data [5]
  data      term  estimate std.error statistic p.value
  <chr>     <chr>    <dbl>     <dbl>     <dbl>   <dbl>
1 data1.csv YEAR   -0.0228    0.0731    -0.311 0.756  
2 data2.csv YEAR   -0.139     0.0573    -2.42  0.0182 
3 data3.csv YEAR   -0.175     0.0650    -2.70  0.00901
4 data4.csv YEAR   -0.0478    0.0628    -0.762 0.449  
5 data5.csv YEAR    0.0204    0.0648     0.315 0.754  

You can just change the formula inside lm(FFD ~ YEAR, data = .) to get the other regressions.
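
For example, one way to loop the same pipeline over several response columns (a sketch assuming the csvfiles vector from above; reformulate() just builds the formula from a column name):

library(dplyr)
library(purrr)
library(broom)

responses <- c("FFD", "LFD", "RAN", "MEAN")

map_df(responses, function(r) {
  csvfiles %>%
    map_df(function(i){ df = read.csv(i); df$data = i; df }) %>%
    group_by(data) %>%
    do(tidy(lm(reformulate("YEAR", response = r), data = .))) %>%
    filter(term != "(Intercept)") %>%
    mutate(response = r)
})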

Within a simple linear regression in R, how do I rescale age to estimate its beta-coefficient per year/5 years/10 years?

The beta coefficient shows the amount that the dependent variable (DV, in this case continuous_outcome) will increase for every one unit increase in your independent variable (IV, in this case age in years).

If you want to show the relationship per 1/10th of a year, multiply your age column by 10 before fitting the model, or equivalently divide the beta coefficient by 10.

For your specific requests, since the beta coefficient is 86.06, you can multiply this by the number of years to get the increase of the continuous variable. So:

  • 1 year increase = +86.06
  • 5 year increase = +430.3 (86.06 * 5)
  • 10 year increase = +860.6 (86.06 * 10)

To answer the last question (the estimate for the effect of age per 5 years): that would be 430.3, which is 86.06 * 5. So for every 5 years that a person's age increases, the continuous_outcome increases by 430.3 on average.
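
A small sketch of the rescaling alternative (assuming a data frame df with columns continuous_outcome and age; df is a placeholder name):

fit_1yr <- lm(continuous_outcome ~ age, data = df)         # effect per 1 year
fit_5yr <- lm(continuous_outcome ~ I(age / 5), data = df)  # effect per 5 years
coef(fit_5yr)[2]   # equals 5 * coef(fit_1yr)[2], i.e. 5 * 86.06 = 430.3 here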


