Screening (Multi)Collinearity in a Regression Model

The kappa() function, which computes the condition number of the model matrix, can help. Here is a simulated example:

> set.seed(42)
> x1 <- rnorm(100)
> x2 <- rnorm(100)
> x3 <- x1 + 2*x2 + rnorm(100)*0.0001 # so x3 is approx. a linear combination of x1 and x2
> mm12 <- model.matrix(~ x1 + x2) # normal model, two indep. regressors
> mm123 <- model.matrix(~ x1 + x2 + x3) # bad model with near collinearity
> kappa(mm12) # a 'low' kappa is good
[1] 1.166029
> kappa(mm123) # a 'high' kappa indicates trouble
[1] 121530.7

We can go further by making the third regressor more and more nearly collinear:

> x4 <- x1 + 2*x2 + rnorm(100)*0.000001  # even more collinear
> mm124 <- model.matrix(~ x1 + x2 + x4)
> kappa(mm124)
[1] 13955982
> x5 <- x1 + 2*x2 # now x5 is linear comb of x1,x2
> mm125 <- model.matrix(~ x1 + x2 + x5)
> kappa(mm125)
[1] 1.067568e+16

Note that kappa() computes a fast approximation by default; see help(kappa) for details.
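
If you want the exact condition number rather than the default estimate, kappa() also accepts an exact argument; a quick check on the matrices above:

kappa(mm12,  exact = TRUE)   # exact 2-norm condition number (computed via svd)
kappa(mm123, exact = TRUE)   # still extremely large for the near-collinear design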

Capturing high multi-collinearity in statsmodels

You can detect high multicollinearity by inspecting the eigenvalues of the correlation matrix. A very small eigenvalue indicates that the data are (nearly) collinear, and the corresponding eigenvector shows which variables are involved.

If there is no collinearity in the data, you would expect none of the eigenvalues to be close to zero:

>>> xs = np.random.randn(100, 5)      # independent variables
>>> corr = np.corrcoef(xs, rowvar=0) # correlation matrix
>>> w, v = np.linalg.eig(corr) # eigen values & eigen vectors
>>> w
array([ 1.256 , 1.1937, 0.7273, 0.9516, 0.8714])

However, if, say, x[4] - 2 * x[0] - 3 * x[2] is approximately zero, then

>>> noise = np.random.randn(100)                      # white noise
>>> xs[:,4] = 2 * xs[:,0] + 3 * xs[:,2] + .5 * noise # collinearity
>>> corr = np.corrcoef(xs, rowvar=0)
>>> w, v = np.linalg.eig(corr)
>>> w
array([ 0.0083, 1.9569, 1.1687, 0.8681, 0.9981])

one of the eigenvalues (here the very first one) is close to zero. The corresponding eigenvector is:

>>> v[:,0]
array([-0.4077, 0.0059, -0.5886, 0.0018, 0.6981])

Ignoring the near-zero coefficients, this basically says that x[0], x[2] and x[4] are collinear (as expected). If you standardize the values in xs and multiply by this eigenvector, the result hovers around zero with small variance:

>>> std_xs = (xs - xs.mean(axis=0)) / xs.std(axis=0)  # standardized values
>>> ys = std_xs.dot(v[:,0])
>>> ys.mean(), ys.var()
(0, 0.0083)

Note that ys.var() is essentially the eigenvalue that was close to zero.

So, to capture high multicollinearity, look at the eigenvalues of the correlation matrix.

R - Testing for homo/heteroscedasticity and collinearity in a multivariate regression model

Let me try to offer some initial help.

The answer to the first question: yes, you can use the Breusch-Pagan test and the Durbin-Watson test for multivariate models. (However, I have always used dwtest() rather than durbinWatsonTest().)

Also note that dwtest() checks only for first-order autocorrelation. Unfortunately, I do not know how to find out which variable is causing the heteroscedasticity or autocorrelation. If you run into these problems, one possible remedy is to use a robust covariance estimator for inference: Newey-West standard errors for autocorrelation (coeftest(model, vcov = NeweyWest)) or heteroscedasticity-consistent standard errors (coeftest(model, vcov = vcovHC)), both available once you load the AER package.
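
A minimal sketch of that workflow, assuming the AER package is installed; the data and model below are simulated purely for illustration:

library(AER)   # loads lmtest (bptest, dwtest, coeftest) and sandwich (NeweyWest, vcovHC)

set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(100)
fit <- lm(y ~ x1 + x2, data = dat)

bptest(fit)                       # Breusch-Pagan test for heteroscedasticity
dwtest(fit)                       # Durbin-Watson test (first-order autocorrelation only)

coeftest(fit, vcov = NeweyWest)   # inference with Newey-West (HAC) standard errors
coeftest(fit, vcov = vcovHC)      # inference with heteroscedasticity-consistent SEs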

Multicollinearity test with car::vif

This is telling you that at least one set of predictors is perfectly (multi)collinear; if you look at coef(reg1) you will see at least one NA value, and if you run summary(reg1) you will see the message

([n] not defined because of singularities)

(for some n>=1). Examining the pairwise correlations of the predictor variables is not enough, because if you have (e.g.) predictors A, B, C where (the absolute values of) none of the pairwise correlations are exactly 1, they can still be multicollinear. (Probably the most common case is where A, B, C are dummy variables that describe a mutually exclusive and complete set of possibilities [i.e. for each observation exactly one of A, B, C is 1 and the other two are 0]. I strongly suspect that this is what's going on with your last 16 or so variables, which seem to be boroughs of Oslo ...)
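
A small simulated illustration of that dummy-variable situation (the variables g, d and y below are made up for this example, not taken from the question):

# Three mutually exclusive 0/1 dummies: no pairwise correlation is exactly
# +/- 1, yet together with the intercept they are perfectly collinear.
set.seed(1)
g <- factor(sample(c("A", "B", "C"), 100, replace = TRUE))
d <- data.frame(model.matrix(~ g - 1))   # one dummy per level: gA, gB, gC
round(cor(d), 2)                         # pairwise correlations well below 1 in absolute value
y <- rnorm(100)
coef(lm(y ~ gA + gB + gC, data = d))     # one coefficient comes back NA (singularity)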

Checking which regression coefficients are NA (as suggested by @Axeman) can suggest where the problem is; this answer explains how you can use model.matrix() and caret::findLinearCombos() to figure out exactly which sets of predictors are causing the problem. (If all of your predictors are simple numeric variables you can skip model.matrix().)
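
A minimal sketch of that approach, assuming the caret package is installed (again with made-up data):

library(caret)

set.seed(1)
dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
dat$x3 <- dat$x1 + 2 * dat$x2             # exact linear combination of x1 and x2
dat$y  <- rnorm(50)

X <- model.matrix(y ~ ., data = dat)      # expands any factors into dummy columns
combos <- findLinearCombos(X)
combos$linearCombos                       # column indices of each collinear set
colnames(X)[combos$remove]                # columns suggested for removal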

If your problem is indeed caused by including a dummy variable for every possible geographic region, the simplest/best solution is to include geographic region (borough) in the model as a factor: R will then generate the set of dummies/contrasts for you and automatically leave one level out, avoiding exactly this kind of problem. If you later want predicted values for every borough, you can use tools from the emmeans or effects packages.

Test for Multicollinearity in Panel Data R

This question has been asked with reference to other statistical packages such as SAS (https://communities.sas.com/thread/47675) and Stata (http://www.stata.com/statalist/archive/2005-08/msg00018.html), and the common answer has been to fit a pooled model and compute the VIFs from it. The logic is that since multicollinearity concerns only the independent variables, there is no need to control for individual effects using panel methods.

Here's some code extracted from another site:

library(plm)
library(car)   # for vif()

mydata <- read.csv("US Panel Data.csv")
pdata  <- pdata.frame(mydata, index = c("id", "t"))   # plm.data() is deprecated

model <- plm(Return ~ ESG + Beta + Market.Cap + PTBV + Momentum +
               Dummy1 + Dummy2 + Dummy3 + Dummy4 + Dummy5 +
               Dummy6 + Dummy7 + Dummy8 + Dummy9,
             data = pdata, model = "pooling")
vif(model)   # VIFs computed from the pooled model

