Using R's lm on a Dataframe with a List of Predictors

Using the formula notation y ~ . specifies that you want to regress y on all of the other variables in the dataset.

df <- data.frame(y = 1:10, x1 = runif(10), x2 = rnorm(10))
# Fit a model using both x1 and x2
fit <- lm(y ~ ., data = df)
# Drop column 2 (x1), so y is regressed on x2 only
fit <- lm(y ~ ., data = df[, -2])
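
If the predictors you want are held as a character vector of column names, the same idea works by subsetting on names rather than positions. This is a minimal sketch: the extra column x3 and the vector `predictors` are hypothetical names added for illustration, and `set.seed` is there only to make the run reproducible.

```r
set.seed(1)  # only for reproducibility of this sketch
df <- data.frame(y = 1:10, x1 = runif(10), x2 = rnorm(10), x3 = rnorm(10))

# Hypothetical list of predictors to keep, given as column names
predictors <- c("x1", "x3")

# Keep the response plus the chosen predictors; `.` then expands to just those
fit <- lm(y ~ ., data = df[, c("y", predictors)])
names(coef(fit))
```

Selecting by name tends to be less fragile than a negative index such as -2, which silently points at a different column if the data frame is reordered.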

Incorporate all columns of a dataframe into one regression

Instead of using attach, pass the data frame through the data argument of lm. Here . signifies all the other columns:

lm(y ~ ., data = data)

A reproducible example with mtcars:

lm(mpg ~ ., data = mtcars)

Another option is to construct the formula with reformulate:

lm(reformulate('.', response = 'mpg'), data = mtcars)
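
reformulate also accepts a character vector of term labels, which is probably the most direct way to fit lm from a list of predictors. A short sketch follows; the particular choice of wt, hp and qsec is arbitrary and only for illustration.

```r
# Arbitrary illustrative choice of predictors from mtcars
predictors <- c("wt", "hp", "qsec")

# Build the formula from the vector of names
f <- reformulate(predictors, response = "mpg")
f  # mpg ~ wt + hp + qsec

fit <- lm(f, data = mtcars)
names(coef(fit))
```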

prediction using linear model and the importance of data.frame

When you call predict on an lm object, the function called is predict.lm. When you run it like:

predict(model_1, Sepal.Width=c(1,3,4,5))

What you are doing is passing c(1,3,4,5) as an argument named Sepal.Width, which predict.lm ignores because it has no such parameter.

When there is no new input data, you are running predict.lm(model_1), and getting back the fitted values:

table(predict(model_1) == predict(model_1, Sepal.Width=c(1,3,4,5)))

TRUE
150

Because you fitted the model with a formula, predict.lm needs a data frame of new values to reconstruct the matrix of independent (exogenous) variables, multiply it by the coefficients, and return the predicted values.

This is briefly what predict.lm is doing:

newdata = data.frame(Sepal.Width=c(1,3,4,5))
Terms = delete.response(terms(model_1))
X = model.matrix(Terms,newdata)

X
  (Intercept) Sepal.Width
1           1           1
2           1           3
3           1           4
4           1           5

X %*% coefficients(model_1)
      [,1]
1 6.302861
2 5.856139
3 5.632778
4 5.409417

predict(model_1,newdata)

       1        2        3        4
6.302861 5.856139 5.632778 5.409417
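
The whole comparison can be run end to end as below. The definition of model_1 is an assumption, since the original question's code is not shown here; regressing Sepal.Length on Sepal.Width in iris reproduces the numbers above, so that is what this sketch uses.

```r
# Assumption: model_1 was fitted like this (it reproduces the values shown above)
model_1 <- lm(Sepal.Length ~ Sepal.Width, data = iris)

# Misnamed argument: predict.lm ignores it and returns the 150 fitted values
ignored <- predict(model_1, Sepal.Width = c(1, 3, 4, 5))
length(ignored)

# Correct call: new values go in a data frame passed as newdata
newdata <- data.frame(Sepal.Width = c(1, 3, 4, 5))
pred <- predict(model_1, newdata = newdata)
length(pred)

# Manual reconstruction: design matrix times coefficients
X <- model.matrix(delete.response(terms(model_1)), newdata)
manual <- drop(X %*% coef(model_1))
all.equal(unname(manual), unname(pred))
```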

How to succinctly write a formula with many variables from a data frame?

There is a special identifier, ., that you can use in a formula to mean all the other variables in the data frame.

y <- c(1,4,6)
d <- data.frame(y = y, x1 = c(4,-1,3), x2 = c(3,9,8), x3 = c(4,-4,-2))
mod <- lm(y ~ ., data = d)

You can also do things like this, to use all variables but one (in this case x3 is excluded):

mod <- lm(y ~ . - x3, data = d)

Technically, . means all variables not already mentioned in the formula. For example:

lm(y ~ x1 * x2 + ., data = d)

where . would only reference x3 as x1 and x2 are already in the formula.
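
One way to check what . expands to, without fitting anything, is to ask terms for the term labels while supplying the data frame. This is a sketch only; `labels_for` is a hypothetical helper name introduced here, not part of R.

```r
d <- data.frame(y = c(1, 4, 6), x1 = c(4, -1, 3), x2 = c(3, 9, 8), x3 = c(4, -4, -2))

# Hypothetical helper: expand `.` against d and return the resulting term labels
labels_for <- function(f) attr(terms(f, data = d), "term.labels")

labels_for(y ~ .)            # all three predictors
labels_for(y ~ . - x3)       # x3 removed
labels_for(y ~ x1 * x2 + .)  # main effects, x3, and the x1:x2 interaction
```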

how to run lm regression for every column in R

Your code looks fine, except that when you use i inside lm, R reads i as a character string (the column name), which cannot serve as the response. Using get retrieves the column that i names.

df <- data.frame(x = rnorm(100), y1 = rnorm(100), y2 = rnorm(100), y3 = rnorm(100))

storage <- list()
for (i in names(df)[-1]) {
  storage[[i]] <- lm(get(i) ~ x, data = df)
}
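
An alternative to get, sketched here under the same setup, is to build each formula with reformulate so that the fitted model records the real response name rather than get(i). The names `responses` and `fits` are illustrative, and `set.seed` is only for reproducibility.

```r
set.seed(42)  # only for reproducibility of this sketch
df <- data.frame(x = rnorm(100), y1 = rnorm(100), y2 = rnorm(100), y3 = rnorm(100))

# Every column except x is treated as a response
responses <- setdiff(names(df), "x")

# One model per response, with the formula built from the column name
fits <- lapply(responses, function(r) {
  lm(reformulate("x", response = r), data = df)
})
names(fits) <- responses

# The stored formula names the response directly
all.vars(formula(fits$y2))
```

With this approach, summary output and the model call show y2 ~ x style formulas, which may be easier to read than get(i) ~ x.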

I create an empty list storage, which I'm going to fill up with each iteration of the loop. It's just a personal preference but I'd also advise against how you've written your current loop:

for (i in names(df[, -1])) {
  model <- lm(i ~ x, data = df)
}

Each iteration overwrites model, so only the result of the last iteration is kept. I suggest you change it to a list, or a matrix, in which you can store the results iteratively.
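
If only the coefficients are needed, the matrix option could look roughly like this. It is a sketch under the same simulated setup; `coef_mat` and `responses` are illustrative names, and the formula is built with reformulate rather than get.

```r
set.seed(42)  # only for reproducibility of this sketch
df <- data.frame(x = rnorm(100), y1 = rnorm(100), y2 = rnorm(100), y3 = rnorm(100))

responses <- setdiff(names(df), "x")

# Preallocate one column per response, one row per coefficient
coef_mat <- matrix(NA_real_, nrow = 2, ncol = length(responses),
                   dimnames = list(c("(Intercept)", "x"), responses))

# Fill the matrix one column at a time instead of overwriting a single object
for (r in responses) {
  coef_mat[, r] <- coef(lm(reformulate("x", response = r), data = df))
}

coef_mat
```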

Hope that helps


