Is There a Better Alternative Than String Manipulation to Programmatically Build Formulas

Is there a better alternative than string manipulation to programmatically build formulas?

reformulate will do what you want.

reformulate(termlabels = c('x','z'), response = 'y')
## y ~ x + z

Or without an intercept

reformulate(termlabels = c('x','z'), response = 'y', intercept = FALSE)
## y ~ x + z - 1

Note that you cannot construct formulae with multiple responses, such as x + y ~ z + b, this way. In the output below only the first response is used:

reformulate(termlabels = c('x','y'), response = c('z','b'))
## z ~ x + y
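If what is actually wanted is a multivariate (matrix) response, one possible workaround is to pass the response as an unevaluated call instead of a character vector. This is only a sketch, assuming a cbind() response is acceptable to whatever modelling function consumes the formula; the variable names are the placeholders from the example above.

```r
# The response may be a language object (here a cbind() call)
# rather than a character string
f <- reformulate(termlabels = c("x", "y"), response = quote(cbind(z, b)))
f
## cbind(z, b) ~ x + y
```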

To extract the terms from an existing formula (given your example)

attr(terms(RHS), 'term.labels')
## [1] "a" "b"

Getting the response is slightly different; a simple approach (for a single-variable response) is:

as.character(LHS)[2]
## [1] "y"
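The LHS, RHS and RHS2 objects come from the question and are not reproduced here. As a sketch, assuming definitions that are consistent with the outputs shown, the two extraction steps look like this:

```r
# Assumed stand-ins for the objects in the question (not shown in this text)
LHS  <- y ~ 1
RHS  <- ~ a + b
RHS2 <- ~ c

rhs_labels <- attr(terms(RHS), "term.labels")
rhs_labels
## [1] "a" "b"

# as.character() on a two-sided formula returns c("~", lhs, rhs)
lhs_parts <- as.character(LHS)
lhs_parts
## [1] "~" "y" "1"

# Indexing the formula as a call gives the response as a language object
resp <- LHS[[2]]
resp
## y
```

Working with LHS[[2]] keeps the response as a symbol or call, which may be preferable when the response is an expression such as log(y).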


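combine_formula <- function(LHS, RHS){
  # RHS is a list of one-sided formulas; collect their term labels
  .terms <- lapply(RHS, terms)
  new_terms <- unique(unlist(lapply(.terms, attr, which = 'term.labels')))
  # as.character() on a two-sided formula gives c("~", lhs, rhs)
  response <- as.character(LHS)[2]

  reformulate(new_terms, response)
}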


combine_formula(LHS, list(RHS, RHS2))

## y ~ a + b + c
## <environment: 0x577fb908>

I think it would be more sensible to specify the response as a character vector, something like

combine_formula2 <- function(response, RHS, intercept = TRUE){
  .terms <- lapply(RHS, terms)
  new_terms <- unique(unlist(lapply(.terms, attr, which = 'term.labels')))

  # response is already a character string, so it is passed straight through
  reformulate(new_terms, response, intercept)
}
combine_formula2('y', list(RHS, RHS2))

You could also define a + operator for formulae by adding a new method for formula objects:

`+.formula` <- function(e1,e2){
  .terms <- lapply(c(e1,e2), terms)
  reformulate(unique(unlist(lapply(.terms, attr, which = 'term.labels'))))
}

RHS + RHS2
## ~a + b + c

You can also use update.formula, using . judiciously:

update(~a+b, y ~ .)
## y ~ a + b
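A few more update() patterns may make the role of . clearer. This is a small sketch; base_form and the term names are placeholders chosen for illustration.

```r
base_form <- y ~ a + b

# Add a term: "." stands for the corresponding side of the old formula
added <- update(base_form, . ~ . + c)
added
## y ~ a + b + c

# Drop a term
dropped <- update(base_form, . ~ . - b)
dropped
## y ~ a

# Swap the response while keeping the right-hand side
relabelled <- update(base_form, log(y) ~ .)
relabelled
## log(y) ~ a + b

# The "new" formula can itself be built programmatically
extra <- "c"
programmatic <- update(base_form, reformulate(c(".", extra), response = "."))
programmatic
## y ~ a + b + c
```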

Any pitfalls to using programmatically constructed formulas?

I'm always hesitant to claim there are no situations in which something involving R environments and scoping might bite, but ... after some more exploration, my first usage above does look safe.

It turns out that the printed call is a bit of a red herring.

The formula that actually gets used by other functions (and the one extracted by formula() and as.formula()) is the one stored in the terms element of the fit object, and it gets the actual formula right. (The terms element contains an object of class "terms", which is just a "formula" with a bunch of attached attributes.)

To see that all of the proposals in my question and the associated comments store the same "formula" object (up to the associated environment), run the following.

## First the three approaches in my post
formula(fun(XX=c("cyl", "disp")))
# mpg ~ cyl + disp
# <environment: 0x026d2b7c>

formula(lm(mpg ~ cyl + disp, data=mtcars))
# mpg ~ cyl + disp

formula(fun2(XX=c("cyl", "disp"))$call)
# mpg ~ cyl + disp
# <environment: 0x02c4ce2c>

## Then Gabor Grothendieck's idea
XX = c("cyl", "disp")
ff <- reformulate(response="mpg", termlabels=XX)
formula(do.call("lm", list(ff, quote(mtcars))))  
## mpg ~ cyl + disp

To confirm that formula() really is deriving its output from the terms element of the fit object, have a look at stats:::formula.lm and stats:::formula.terms.
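The fun() and fun2() helpers from the question are not reproduced here. As a sketch of the same point, the hypothetical wrapper below (fit_with is an invented name, not the original function) shows that the stored call only records the symbol that was passed in, while formula() still returns the real formula.

```r
# fit_with() is a hypothetical stand-in, not the fun()/fun2() from the question
fit_with <- function(XX) {
  ff <- reformulate(XX, response = "mpg")
  lm(ff, data = mtcars)
}

fit <- fit_with(c("cyl", "disp"))

# The stored call keeps the unevaluated argument, i.e. the symbol ff
stored <- fit$call$formula
stored
## ff

# formula() rebuilds the formula from the terms component of the fit
recovered <- formula(fit)
deparse(recovered)
## [1] "mpg ~ cyl + disp"
```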

Is there a better reference for r formulas than ?formula?

R comes with several manuals, which are accessible from vanilla R's "Help" menu at the top right when running R and are also in several places on-line.

Chapter 11 of "An Introduction to R" has a couple of pages on formulas, for example.

I don't know that it constitutes a "comprehensive" resource but it covers much* of what you need to know about how formulas work.

* Indeed, pretty much all of what perhaps 95% of users will ever use

The canonical reference to formulas in the S language might be

Chambers J.M., and Hastie T.J., eds. (1992),
Statistical Models in S. Chapman & Hall, London.

though the origin of the approach comes from

Wilkinson G.N., and Rogers C.E. (1973). "Symbolic Description of Factorial Models for Analysis of Variance." Applied Statistics, 22, 392–399

A number of recent books related to R discuss formulas but I don't know that I'd call any of them comprehensive.

There are also numerous on-line resources, often with a good deal of very useful information.

That said, once you get comfortable with using formulas in R and so have a context into which more knowledge can be placed, the help page contains a surprising amount of information (along with other pages it links to). It is a bit terse and cryptic, but once you have the broader base of knowledge of R's particular way of working, it can be quite useful.

Specific questions relating to R formulas (depending on their content) are likely to be on topic either at StackOverflow or at CrossValidated - indeed there are some quite advanced questions relating to formulas to be found already (use of searches like [r] formula might be fruitful), and it would be handy to have more such questions to help users struggling with these issues; if you have specific questions I'd encourage you to ask.

As for 'redundant' and 'conflicting', I suppose you mean things like the fact that there is more than one way to specify a no-intercept model: y ~ . -1 and y ~ . +0 both work, for example, and each makes sense in a slightly different context.

In addition, there's the common bugbear of having to isolate quadratic and higher order terms from the formula interface (to use I(x^2) as a predictor so it's passed through the formula interface unharmed and survives far enough to be interpreted as an algebraic expression). Again, once you get a picture of what's going on 'behind the scenes' that seems much less of a nuisance.

Specific examples of the things I just mentioned:

lm(dist ~ . -1, data=cars) # "remove-intercept-term" form of no-intercept
lm(dist ~ . +0, data=cars) # "make-intercept-zero" form of no-intercept
lm(dist ~ speed + speed^2, data=cars) # doesn't do what we want here
lm(dist ~ speed + I(speed^2), data=cars) # gets us a quadratic term
lm(dist ~ poly(speed,2), data=cars) # avoid potential multicollinearity
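One way to see what the formula interface does with these variants, without fitting anything, is to inspect the terms object. This is a small sketch using only terms() on the same formulas:

```r
# Inside a formula, ^ means crossing, so speed^2 collapses to speed
plain <- attr(terms(dist ~ speed + speed^2), "term.labels")
plain
## [1] "speed"

# I() protects the arithmetic meaning of ^
protected <- attr(terms(dist ~ speed + I(speed^2)), "term.labels")
protected
## [1] "speed"      "I(speed^2)"

# Both no-intercept spellings are recorded the same way
minus_one <- attr(terms(dist ~ speed - 1), "intercept")
plus_zero <- attr(terms(dist ~ speed + 0), "intercept")
c(minus_one, plus_zero)
## [1] 0 0
```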

I agree that the formula interface could at least use a little further guidance and better examples in the ?formula help.

How do I make the number of two-sided formulas in case_when depend on the number of arguments?

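library(dplyr)  # recode() comes from dplyr

likert_score <- function(x, a){
   # map each label in a to its position: a[1] -> "1", a[2] -> "2", ...
   recode(x, !!!setNames(as.character(seq_along(a)), a))
}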
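If a dplyr dependency is not wanted, the same position-based scoring can be sketched in base R with match(). The function name likert_score_base and the example labels are assumptions made for illustration; note that with match() any value of x not found in a becomes NA.

```r
# Base-R sketch; likert_score_base is a hypothetical name
likert_score_base <- function(x, a) {
  # match() returns the position of each element of x within a
  as.character(match(x, a))
}

scale_labels <- c("disagree", "neutral", "agree")  # example labels (assumed)
scores <- likert_score_base(c("agree", "disagree", "neutral"), scale_labels)
scores
## [1] "3" "1" "2"
```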

Select everything right of `~`

Perhaps most obvious (which didn't occur to me until just now...how sad)

form <- y~x+tx+x*tx
update(form, new_y ~ .)

There are a few other ways to approach this; this one might be my preferred (at least for now).

form <- y~x+tx+x*tx
rhs <- sub(".+~", "", deparse(form))
as.formula(paste0("new_y ~", rhs))

You can also get the right hand side with

tail(as.character(form), 1)

But that assumes that there is a right-hand side of the formula.
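A string-free variant is also possible, because a two-sided formula is a call of length three that can be indexed directly. This is a sketch under the assumption that the formula is two-sided; it also sidesteps the fact that deparse() can return several strings for a long formula.

```r
form <- y ~ x + tx + x * tx

# A two-sided formula is a call of length 3: `~`, lhs, rhs
rhs <- form[[3]]
rhs
## x + tx + x * tx

# Replace only the response, leaving the right-hand side untouched
new_form <- form
new_form[[2]] <- quote(new_y)
new_form
## new_y ~ x + tx + x * tx
```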

How to call lm with variables?

You may use the formula function. The following should work:

f <- function (x, y, data) {
    linm <- lm(formula(paste(y,"~",x)), data)
    summary(linm)$r.squared
}
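The same wrapper can be sketched with reformulate instead of paste, which also lets x hold several predictor names. The name r_squared is an assumption made for this example, and the built-in cars and mtcars data sets are used only to have something to run.

```r
# r_squared() is a hypothetical variant of f() above
r_squared <- function(x, y, data) {
  # x may be one or more predictor names; y is a single response name
  fit <- lm(reformulate(x, response = y), data = data)
  summary(fit)$r.squared
}

one_predictor  <- r_squared("speed", "dist", cars)
two_predictors <- r_squared(c("cyl", "disp"), "mpg", mtcars)
```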

How to wrap RHS terms of a formula with a function

If I borrow some functions I originally wrote here, you could do something like this. First, the helper functions...

extract_rhs_symbols <- function(x) {
    as.list(attr(delete.response(terms(x)), "variables"))[-1]
}
symbols_to_formula <- function(x) {
    as.call(list(quote(`~`), x))
}
sum_symbols <- function(...) {
    Reduce(function(a, b) bquote(.(a) + .(b)), do.call(`c`, list(...), quote = TRUE))
}
transform_terms <- function(x, f) {
    # substitute each RHS variable for the placeholder x in the template f
    wrapped <- sapply(extract_rhs_symbols(x), function(x) do.call("substitute", list(f, list(x = x))))
    symbols_to_formula(sum_symbols(wrapped))
}

And then you can use

update(form1, transform_terms(form1, quote(poly(x, 2))))
# Y ~ poly(A, 2) + poly(B, 2)

update(form1, transform_terms(form1, quote(pspline(x, 4))))
# Y ~ pspline(A, 4) + pspline(B, 4)
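The definition of form1 is not shown, but the output above is consistent with Y ~ A + B. Under that assumption, a shorter sketch that handles only simple main-effect terms is to wrap the term labels and hand them to reformulate. It does use sprintf on the labels, so it is a convenience rather than a replacement for the language-object approach; wrap_terms is a hypothetical helper name.

```r
form1 <- Y ~ A + B   # assumed definition, inferred from the output above

wrap_terms <- function(form, template) {
  # template is an sprintf() pattern with a single %s placeholder
  labels <- attr(terms(form), "term.labels")
  out <- reformulate(sprintf(template, labels), response = form[[2]])
  environment(out) <- environment(form)  # keep the original formula's environment
  out
}

wrapped <- wrap_terms(form1, "poly(%s, 2)")
deparse(wrapped)
## [1] "Y ~ poly(A, 2) + poly(B, 2)"
```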

How to select variables to use them in a formula with R

Create the formula with reformulate:

form <- reformulate(termlabels = variables$model1, response = "wage", intercept = TRUE)
rpart(form, ...)

Note the intercept term that you have ignored so far: it is an additional modelling choice.
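The variables object from the question is not shown. As a sketch, assuming it is a named list of character vectors (the variable names below are invented for illustration), one formula per model can be built in a single pass:

```r
# Hypothetical variable sets; the real `variables` object is not shown
variables <- list(
  model1 = c("educ", "exper"),
  model2 = c("educ", "exper", "tenure")
)

# One formula per entry, all sharing the same response
forms <- lapply(variables, function(v) reformulate(v, response = "wage"))

# Text form of each formula, handy for checking what was built
form_text <- vapply(forms, deparse, character(1))
form_text[["model1"]]
## [1] "wage ~ educ + exper"
```

Each element of forms can then be passed to rpart() or another modelling function in the same way as form above.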

How to pass the right-hand side of a formula to another formula?

It seems that you should use the built-in functionality of R, namely update.formula; there is no need to write a new function:

> form <- ~s(x0)+s(x1)+s(x2)+s(x3)
> form
~s(x0) + s(x1) + s(x2) + s(x3)
> update.formula(form, z ~ .)
z ~ s(x0) + s(x1) + s(x2) + s(x3)
