Specifying Formula in R with Glm Without Explicit Declaration of Each Covariate

How to succinctly write a formula with many variables from a data frame?

There is a special identifier that you can use in a formula to mean all the variables: the . identifier.

y <- c(1,4,6)
d <- data.frame(y = y, x1 = c(4,-1,3), x2 = c(3,9,8), x3 = c(4,-4,-2))
mod <- lm(y ~ ., data = d)

You can also do things like this, to use all variables but one (in this case x3 is excluded):

mod <- lm(y ~ . - x3, data = d)

Technically, . means all variables not already mentioned in the formula. For example:

lm(y ~ x1 * x2 + ., data = d)

where . would only reference x3 as x1 and x2 are already in the formula.
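To see what . stands for without fitting anything, you can ask terms() for the term labels once the data frame is supplied. This is only a sketch: expand_dot is a hypothetical helper name, and d is the same toy data frame as above.

```r
# Same toy data frame as in the answer above
d <- data.frame(y = c(1, 4, 6), x1 = c(4, -1, 3), x2 = c(3, 9, 8), x3 = c(4, -4, -2))

# Hypothetical helper: list the terms of a formula after . is resolved against d
expand_dot <- function(f) attr(terms(f, data = d), "term.labels")

all.terms <- expand_dot(y ~ .)            # every column except the response
no.x3     <- expand_dot(y ~ . - x3)       # x3 removed again
with.int  <- expand_dot(y ~ x1 * x2 + .)  # main effects plus the x1:x2 interaction

print(all.terms)
print(no.x3)
print(with.int)
```

In the last case x1 and x2 each appear only once in the result, so the only new thing that . contributes is x3.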

Short formula call for many variables when building a model

You can use . as described in the help page for formula. The . stands for "all columns not otherwise in the formula".

lm(output ~ ., data = myData)

Alternatively, construct the formula manually with paste. This example is from the as.formula() help page:

xnam <- paste("x", 1:25, sep="")
(fmla <- as.formula(paste("y ~ ", paste(xnam, collapse= "+"))))

You can then pass this object to a regression function: lm(fmla, data = myData).
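As a possible alternative to pasting strings, the stats package also has reformulate(), which builds a formula from a character vector of term labels. The sketch below assumes the same xnam names and a response column called y, as in the help-page example.

```r
# Variable names x1 ... x25, as in the help-page example
xnam <- paste0("x", 1:25)

# reformulate() joins the term labels with + and attaches the response
fmla <- reformulate(xnam, response = "y")
print(fmla)
```

The result is an ordinary formula object, so it can be passed to lm() or glm() the same way.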

Linear Regression in R with variable number of explanatory variables

Three ways, in increasing level of flexibility.

Method 1

Run your regression using the formula notation:

fit <- lm( Y ~ . , data=dat )

Method 2

Put all your data in one data.frame, not two:

dat <- cbind(data.frame(Y=Y),as.data.frame(X))

Then run your regression using the formula notation:

fit <- lm( Y~. , data=dat )

Method 3

Another way is to build the formula yourself:

model1.form.text <- paste("Y ~", paste(xvars, collapse = " + "))
model1.form <- as.formula(model1.form.text)
model1 <- lm(model1.form, data = dat)

In this example, xvars is a character vector containing the names of the variables you want to use.
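For a self-contained illustration of the formula-building approach, here is a sketch on simulated data. The data frame, its column names, and the choice to drop x3 are all made up for the example.

```r
set.seed(42)

# Simulated data: one response and three candidate predictors (hypothetical names)
dat <- data.frame(Y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))

# Choose the predictors programmatically: everything except Y and x3
xvars <- setdiff(names(dat), c("Y", "x3"))

# Build the formula text, convert it, and fit
model1.form <- as.formula(paste("Y ~", paste(xvars, collapse = " + ")))
model1 <- lm(model1.form, data = dat)
print(coef(model1))
```

Because xvars is just a character vector, it can come from anywhere: a column-name pattern, a config file, or the output of an earlier screening step.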

Variable declaration / option explicit in R

A very "dirty" solution would be to systematically check that the function's local environment is not changed after the "preamble".

fun <- function(x) {
  a <- x
  ls1 <- ls()
  if (a<0) {
    if (a > -50) x <- -1 else x <- -2
  }
  ls2 <- ls()
  print(list(c(ls1,"ls1"),ls2))
  if (!setequal(c(ls1,"ls1"), ls2)) stop("Something went terribly wrong!")
  return(x)
}

fun.typo <- function(x) {
  a <- x
  ls1 <- ls()
  if (a<0) {
    if (a > -50) x <- -1 else X <- -2  # typo: X instead of x creates a new local variable
  }
  ls2 <- ls()
  print(list(c(ls1,"ls1"),ls2))
  if (!setequal(c(ls1,"ls1"), ls2)) stop("Something went terribly wrong!")
  return(x)
}

With this "solution", fun.typo(-60) stops with an error instead of silently returning a wrong answer (the unchanged x, i.e. -60).
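A quick way to try the check is to wrap the call in tryCatch(). The sketch below repeats the typo version (without the debugging print) so that it runs on its own. The names ok and caught are just for illustration.

```r
fun.typo <- function(x) {
  a <- x
  ls1 <- ls()                            # names defined after the "preamble"
  if (a < 0) {
    if (a > -50) x <- -1 else X <- -2    # typo: X instead of x
  }
  ls2 <- ls()                            # names defined at the end
  if (!setequal(c(ls1, "ls1"), ls2)) stop("Something went terribly wrong!")
  return(x)
}

# This branch does not hit the typo, so the check passes
ok <- fun.typo(-10)

# This branch creates X, so the check fires; capture the message instead of aborting
caught <- tryCatch(fun.typo(-60), error = function(e) conditionMessage(e))

print(ok)
print(caught)
```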

Inputting a whole data frame as independent variables in a logistic regression

You want the . special symbol in the formula notation. Also, it is probably better to have the response and predictors in a single data frame.

Try:

MFDU <- cbind(MFDUdep, MFDUind)
ft <- glm(y ~ ., data = MFDU, family = binomial)
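Since MFDUdep and MFDUind are not shown, here is a sketch of the same pattern on simulated data. The names resp, preds, and both are hypothetical stand-ins, and the coefficients used to generate y are arbitrary.

```r
set.seed(1)
n <- 200

# Hypothetical predictors and a binary response, held in two separate data frames
preds <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
resp  <- data.frame(y = rbinom(n, 1, plogis(0.5 * preds$x1 - preds$x2)))

# Combine them column-wise, then let . pick up every predictor column
both <- cbind(resp, preds)
ft <- glm(y ~ ., data = both, family = binomial)
print(coef(ft))
```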

Now that I have given you the rope, I am obliged to at least warn you about the potential for hanging...

The approach you are taking is usually not the recommended one, unless perhaps prediction is the purpose of the model. Regression coefficients for selected variables may be strongly biased, so if you are using this for enlightenment, rethink your approach.

You will also need a lot of observations to allow 100+ terms in a model.

Better alternatives exist; e.g. see the glmnet package for one such approach, which allows ridge, lasso, or both (elastic net) constraints on the set of coefficients. This lets you minimise model error at the expense of a small amount of additional bias.
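A minimal sketch of the elastic-net route, assuming the glmnet package is installed (the code skips the fit otherwise). The data are simulated, and alpha = 0.5 is an arbitrary choice between ridge (alpha = 0) and lasso (alpha = 1).

```r
set.seed(1)
n <- 200
p <- 20

# Simulated predictor matrix; only the first two columns drive the response
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))

if (requireNamespace("glmnet", quietly = TRUE)) {
  # Cross-validated elastic net for a binary response
  cvfit <- glmnet::cv.glmnet(x, y, family = "binomial", alpha = 0.5)
  # Coefficients at the penalty with the lowest cross-validated error
  enet.coef <- coef(cvfit, s = "lambda.min")
  print(enet.coef)
}
```

Note that glmnet takes a numeric matrix and a response vector rather than a formula and a data frame.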

glm formula : operator? What does it do?

If first is sex and second is eye colour, Y~first:second means your analysis is divided into sex/eye colour categories, so your output parameters relate to blue-eyed males, green-eyed females, etc.

With Y~first*second you get one overall parameter (or set of parameters) for eye colour, another for sex, and parameters for the paired factors.

If you do Y~first + second you get separate parameters for each of the factors.
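The difference between the three operators can be checked by listing the terms each formula produces. This is a sketch with made-up factor levels, and term_labels is a hypothetical helper name.

```r
# Hypothetical design: every combination of two sexes and three eye colours
g <- expand.grid(first  = factor(c("male", "female")),
                 second = factor(c("blue", "green", "brown")))

# Hypothetical helper: the terms a formula expands to
term_labels <- function(f) attr(terms(f), "term.labels")

colon.terms <- term_labels(~ first:second)  # interaction only
star.terms  <- term_labels(~ first * second) # main effects plus interaction
plus.terms  <- term_labels(~ first + second) # main effects only

# Number of design-matrix columns (intercept included)
n.star <- ncol(model.matrix(~ first * second, data = g))
n.plus <- ncol(model.matrix(~ first + second, data = g))

print(colon.terms)
print(star.terms)
print(plus.terms)
```

So first*second is shorthand for first + second + first:second.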

Actually this is probably a stats.stackexchange.com question...