Is there a better alternative than string manipulation to programmatically build formulas?
reformulate will do what you want.
reformulate(termlabels = c('x','z'), response = 'y')
## y ~ x + z
Or without an intercept
reformulate(termlabels = c('x','z'), response = 'y', intercept = FALSE)
## y ~ x + z - 1
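Note that you cannot construct formulae with multiple responses, such as x+y ~ z+b, by passing a character vector of responses. In the output below only the first element is used (newer versions of R may signal an error instead):
reformulate(termlabels = c('x','y'), response = c('z','b'))
## z ~ x + y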
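If a compound left-hand side is really needed, one possibility is to pass the response as a language object rather than a character vector; reformulate() then uses that object as the left-hand side as given. This is only a sketch, and the variable names are placeholders.

```r
# Pass the response as an unevaluated call instead of a character vector
f <- reformulate(termlabels = c("z", "b"), response = quote(x + y))
f

# The left-hand side is stored as a single call
f[[2]]
```

How such a left-hand side is interpreted is up to the modelling function; lm(), for instance, would evaluate x + y arithmetically.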
To extract the terms from an existing formula (given your example)
attr(terms(RHS), 'term.labels')
## [1] "a" "b"
Getting the response is slightly different; a simple approach (for a single-variable response) is
as.character(LHS)[2]
## [1] "y"
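For a response that is more than a bare variable name, indexing the formula directly can be easier than as.character(). A rough sketch, assuming a two-sided formula (the formula itself is made up for illustration):

```r
f <- log(y) ~ a + b            # illustrative formula, not from the question

length(f)                      # 3 for a two-sided formula: `~`, LHS, RHS
lhs <- f[[2]]                  # the response as a language object
deparse(lhs)                   # the response as text
all.vars(lhs)                  # variable names used in the response
attr(terms(f), "term.labels")  # right-hand-side term labels, as before
```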
combine_formula <- function(LHS, RHS){
.terms <- lapply(RHS, terms)
new_terms <- unique(unlist(lapply(.terms, attr, which = 'term.labels')))
response <- as.character(LHS)[2]
reformulate(new_terms, response)
}
combine_formula(LHS, list(RHS, RHS2))
## y ~ a + b + c
## <environment: 0x577fb908>
I think it would be more sensible to specify the response as a character vector, something like
combine_formula2 <- function(response, RHS, intercept = TRUE){
  # response is a character string; RHS is a list of one-sided formulas
  .terms <- lapply(RHS, terms)
  new_terms <- unique(unlist(lapply(.terms, attr, which = 'term.labels')))
  reformulate(new_terms, response, intercept)
}
combine_formula2('y', list(RHS, RHS2))
You could also define a + operator that works on formulae, by setting a new method for formula objects:
`+.formula` <- function(e1,e2){
.terms <- lapply(c(e1,e2), terms)
reformulate(unique(unlist(lapply(.terms, attr, which = 'term.labels'))))
}
RHS + RHS2
## ~a + b + c
You can also use update.formula, using . judiciously:
update(~a+b, y ~ .)
## y ~ a + b
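The dot can appear on either side of the update template, standing for the corresponding side of the old formula. A small sketch with made-up variable names:

```r
old <- y ~ a + b               # illustrative starting formula

f1 <- update(old, . ~ . + c)   # keep the response, add a term
f2 <- update(old, . ~ . - b)   # keep the response, drop a term
f3 <- update(old, log(.) ~ .)  # transform the response, keep the terms
```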
Any pitfalls to using programmatically constructed formulas?
I'm always hesitant to claim there are no situations in which something involving R environments and scoping might bite, but ... after some more exploration, my first usage above does look safe.
It turns out that the printed call is a bit of a red herring.
The formula that actually gets used by other functions (and the one extracted by formula() and as.formula()) is the one stored in the terms element of the fit object, and it gets the actual formula right. (The terms element contains an object of class "terms", which is just a "formula" with a bunch of attached attributes.)
To see that all of the proposals in my question and the associated comments store the same "formula" object (up to the associated environment), run the following.
## First the three approaches in my post
formula(fun(XX=c("cyl", "disp")))
# mpg ~ cyl + disp
# <environment: 0x026d2b7c>
formula(lm(mpg ~ cyl + disp, data=mtcars))
# mpg ~ cyl + disp
formula(fun2(XX=c("cyl", "disp"))$call)
# mpg ~ cyl + disp
# <environment: 0x02c4ce2c>
## Then Gabor Grothendieck's idea
XX = c("cyl", "disp")
ff <- reformulate(response="mpg", termlabels=XX)
formula(do.call("lm", list(ff, quote(mtcars))))
## mpg ~ cyl + disp
To confirm that formula() really is deriving its output from the terms element of the fit object, have a look at stats:::formula.lm and stats:::formula.terms.
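The gap between the printed call and the stored formula can be reproduced in a few lines. This is a sketch only: fit_with() is a hypothetical helper, not the fun() from the question, and it assumes the built-in mtcars data.

```r
# Hypothetical helper: builds the formula inside a function, then fits
fit_with <- function(vars) {
  ff <- reformulate(vars, response = "mpg")
  lm(ff, data = mtcars)
}

fit <- fit_with(c("cyl", "disp"))

fit$call$formula     # the symbol `ff`: this is what the printed call shows
formula(fit)         # rebuilt from the terms element: mpg ~ cyl + disp
formula(terms(fit))  # same formula, taken from the terms object explicitly
```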
Is there a better reference for r formulas than ?formula?
R comes with several manuals, which are accessible from vanilla R's "Help" menu at the top right when running R and are also in several places on-line.
Chapter 11 of "An Introduction to R" has a couple of pages on formulas, for example.
I don't know that it constitutes a "comprehensive" resource but it covers much* of what you need to know about how formulas work.
* Indeed, pretty much all of what perhaps 95% of users will ever use
The canonical reference to formulas in the S language might be
Chambers J.M., and Hastie T.J., eds. (1992),
Statistical Models in S. Chapman & Hall, London.
though the origin of the approach comes from
Wilkinson G.N., and Rogers C.E. (1973). "Symbolic Description of Factorial Models for Analysis of Variance." Applied Statistics, 22, 392–399
A number of recent books related to R discuss formulas but I don't know that I'd call any of them comprehensive.
There are also numerous on-line resources, often with a good deal of very useful information.
That said, once you get comfortable with using formulas in R and so have a context into which more knowledge can be placed, the help page contains a surprising amount of information (along with other pages it links to). It is a bit terse and cryptic, but once you have the broader base of knowledge of R's particular way of working, it can be quite useful.
Specific questions relating to R formulas (depending on their content) are likely to be on topic either at StackOverflow or at CrossValidated - indeed there are some quite advanced questions relating to formulas to be found already (use of searches like [r] formula might be fruitful), and it would be handy to have more such questions to help users struggling with these issues; if you have specific questions I'd encourage you to ask.
As for 'redundant' and 'conflicting', I suppose you mean things like the fact that there is more than one way to specify a no-intercept model: y ~ . -1 and y ~ . +0 both work, for example, but in slightly different contexts each makes sense.
In addition, there's the common bugbear of having to isolate quadratic and higher order terms from the formula interface (to use I(x^2) as a predictor so it's passed through the formula interface unharmed and survives far enough to be interpreted as an algebraic expression). Again, once you get a picture of what's going on 'behind the scenes' that seems much less of a nuisance.
Specific examples of the things I just mentioned:
lm(dist ~ . -1, data=cars) # "remove-intercept-term" form of no-intercept
lm(dist ~ . +0, data=cars) # "make-intercept-zero" form of no-intercept
lm(dist ~ speed + speed^2, data=cars) # doesn't do what we want here
lm(dist ~ speed + I(speed^2), data=cars) # gets us a quadratic term
lm(dist ~ poly(speed,2), data=cars) # avoid potential multicollinearity
I agree that the formula interface could at least use a little further guidance and better examples in the ?formula help.
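One way to see what is going on behind the scenes is to ask terms() what the formula parser keeps from each specification. A brief sketch using the built-in cars data:

```r
# terms() shows which terms survive parsing
attr(terms(dist ~ speed + speed^2), "term.labels")     # only "speed" survives
attr(terms(dist ~ speed + I(speed^2)), "term.labels")  # "speed" and "I(speed^2)"

# Both no-intercept spellings switch the intercept flag off
attr(terms(dist ~ speed - 1), "intercept")
attr(terms(dist ~ speed + 0), "intercept")

# The fitted models agree: two coefficients versus three
length(coef(lm(dist ~ speed + speed^2, data = cars)))
length(coef(lm(dist ~ speed + I(speed^2), data = cars)))
```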
How do I make the number of two-sided formulas in case_when depend on the number of arguments?
library(dplyr)  # provides recode()
likert_score <- function(x, a){
  # map each label in a to its position, as a character string
  recode(x, !!!setNames(as.character(seq_along(a)), a))
}
Select everything right of `~`
Perhaps most obvious (which didn't occur to me until just now...how sad)
form <- y~x+tx+x*tx
update(form, new_y ~ .)
There are a few ways to approach this, but this might be my preferred (at least for now).
form <- y~x+tx+x*tx
rhs <- sub(".+~", "", deparse(form))
as.formula(paste0("new_y ~", rhs))
You can also get the right hand side with
tail(as.character(form), 1)
But that assumes there is a right-hand side in the formula.
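Both string-based versions depend on how the formula deparses; deparse() can return several strings for a long formula, which the sub()/paste0() route does not account for. A sketch of an alternative that stays with language objects, assuming only that form is a formula:

```r
form <- y ~ x + tx + x * tx

# The last element of a formula is its right-hand side,
# for one-sided (length 2) and two-sided (length 3) formulas alike
rhs <- form[[length(form)]]

# Splice the right-hand side into a new formula and evaluate the `~` call
new_form <- eval(bquote(new_y ~ .(rhs)))
new_form
```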
How to call lm with variables?
You may use the formula function. The following should work:
f <- function (x, y, data) {
linm <- lm(formula(paste(y,"~",x)), data)
summary(linm)$r.squared
}
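The same idea can be written with reformulate() instead of paste(), which also accepts more than one predictor name. A minimal sketch; r_squared() is a hypothetical name, and the calls use the built-in cars and mtcars data:

```r
# Hypothetical variant of f() using reformulate() instead of paste()
r_squared <- function(x, y, data) {
  fit <- lm(reformulate(x, response = y), data = data)
  summary(fit)$r.squared
}

r_squared("speed", "dist", cars)            # one predictor
r_squared(c("cyl", "disp"), "mpg", mtcars)  # several predictors
```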
How to wrap RHS terms of a formula with a function
If I borrow some functions I originally wrote here, you could do something like this. First, the helper functions...
extract_rhs_symbols <- function(x) {
as.list(attr(delete.response(terms(x)), "variables"))[-1]
}
symbols_to_formula <- function(x) {
as.call(list(quote(`~`), x))
}
sum_symbols <- function(...) {
  Reduce(function(a, b) bquote(.(a) + .(b)), do.call(`c`, list(...), quote = TRUE))
}
transform_terms <- function(x, f) {
  # substitute each right-hand-side symbol for `x` in the template call f
  wrapped <- sapply(extract_rhs_symbols(x),
                    function(x) do.call("substitute", list(f, list(x = x))))
  symbols_to_formula(sum_symbols(wrapped))
}
And then you can use
update(form1, transform_terms(form1, quote(poly(x, 2))))
# Y ~ poly(A, 2) + poly(B, 2)
update(form1, transform_terms(form1, quote(pspline(x, 4))))
# Y ~ pspline(A, 4) + pspline(B, 4)
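For simple additive formulas, a shorter string-based route is also possible. This is a different technique from the helpers above, not a replacement for them: it works on term labels rather than variables, so an interaction such as A:B would be wrapped as a whole. wrap_terms() is a hypothetical name and form1 is assumed to look like the formula in the question.

```r
form1 <- Y ~ A + B   # assumed shape of the question's formula

# Hypothetical helper: `template` is an sprintf() pattern with one %s slot
wrap_terms <- function(f, template) {
  labels <- attr(terms(f), "term.labels")
  reformulate(sprintf(template, labels), response = f[[2]])
}

wrap_terms(form1, "poly(%s, 2)")
wrap_terms(form1, "log(%s)")
```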
how to select variables to use them in a formula with R
Create the formula with reformulate:
form <- reformulate(termlabels = variables$model1, response = "wage", intercept = TRUE)
rpart(form, ...)
Note the intercept term that you have ignored so far: it is an additional modelling choice.
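To make the pattern concrete, here is a small sketch that builds one formula per predictor set. The contents of variables are hypothetical placeholders, since the question's actual list is not shown here.

```r
# Hypothetical predictor sets; the column names are placeholders
variables <- list(
  model1 = c("educ", "exper"),
  model2 = c("educ", "exper", "tenure")
)

# One formula per predictor set, all with the same response
forms <- lapply(variables, reformulate, response = "wage")
forms$model1
forms$model2

# Dropping the intercept is a separate, explicit choice
no_int <- reformulate(variables$model1, response = "wage", intercept = FALSE)
no_int
```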
How to pass the right-hand side of a formula to another formula?
It seems that you should use the built-in functionality of R, namely update.formula; there is no need to write a new function:
> form <- ~s(x0)+s(x1)+s(x2)+s(x3)
> form
~s(x0)+s(x1)+s(x2)+s(x3)
> update.formula(form, z ~ .)
z ~ s(x0) + s(x1) + s(x2) + s(x3)