How to Create a "Macro" for Regressors in R

How do I create a macro for regressors in R?

Here are some alternatives. No packages are used in the first 3.

1) reformulate

fo <- reformulate(regressors, response = "income")
lm(fo, Duncan)

Alternatively, you may wish to write the last line as follows, so that the formula shown in the output looks nicer:

do.call("lm", list(fo, quote(Duncan)))

in which case the Call: line of the output appears as expected, namely:

Call:
lm(formula = income ~ education + prestige, data = Duncan)
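For readers who want to run approach (1) without installing anything, here is a minimal sketch. It assumes the built-in mtcars data as a stand-in for Duncan, and the column names mpg, wt and hp are illustrative choices, not part of the original question.

```r
# Assumption: mtcars (ships with R) stands in for the Duncan data set
regressors <- c("wt", "hp")   # the "macro": a character vector of column names

# Build mpg ~ wt + hp from the character vector
fo <- reformulate(regressors, response = "mpg")

# do.call with quote() keeps the data argument as a name in the stored call
fit <- do.call("lm", list(fo, quote(mtcars)))

print(fo)
print(fit$call)
```

Changing the contents of regressors is then enough to refit the model with a different set of predictors.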

2) lm(dataframe)

lm( Duncan[c("income", regressors)] )

The Call: line of the output looks like this:

Call:
lm(formula = Duncan[c("income", regressors)])

but we can make it look exactly as in the do.call solution in (1) with this code:

fo <- formula(model.frame(income ~., Duncan[c("income", regressors)]))
do.call("lm", list(fo, quote(Duncan)))
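As a quick check that the data-frame form and the formula form fit the same model, the sketch below compares their coefficients. It again assumes mtcars in place of Duncan, with mpg as the response; those names are illustrative only.

```r
# Assumption: mtcars stands in for Duncan; mpg plays the role of income
regressors <- c("wt", "hp")

# lm() on a data frame treats the first column as the response
fit_df <- lm(mtcars[c("mpg", regressors)])

# Recover an explicit formula so the stored call can be made readable
fo <- formula(model.frame(mpg ~ ., mtcars[c("mpg", regressors)]))
fit_fo <- do.call("lm", list(fo, quote(mtcars)))

# Both fits should give the same coefficients
print(all.equal(coef(fit_df), coef(fit_fo)))
```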

3) dot

An alternative similar to that suggested by @jenesaisquoi in the comments is:

lm(income ~., Duncan[c("income", regressors)])

The approach discussed in (2) to the Call: output also works here.

4) fn$

Prefacing a function with fn$ enables string interpolation in its arguments. This solution is nearly identical to the desired syntax shown in the question, using $ in place of @ to perform the substitution, and the flexible substitution could readily extend to more complex scenarios. The quote(Duncan) in the code could be written as just Duncan and it would still run, but the Call: line shown in the lm output looks better if you use quote(Duncan).

library(gsubfn)

rhs <- paste(regressors, collapse = "+")
fn$lm("income ~ $rhs", quote(Duncan))

The Call: line looks almost identical to the do.call solutions above -- only spacing and quotes differ:

Call:
lm(formula = "income ~ education+prestige", data = Duncan)

If you want it to be exactly the same, then:

fo <- fn$formula("income ~ $rhs")
do.call("lm", list(fo, quote(Duncan)))
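If gsubfn is not available, a similar string-template effect can be approximated in base R with sprintf(). This is a swapped-in technique, not the fn$ mechanism itself, and the sketch assumes mtcars and its columns as stand-ins for Duncan.

```r
# Base R approximation of string interpolation (not gsubfn's fn$)
# Assumption: mtcars stands in for Duncan
regressors <- c("wt", "hp")
rhs <- paste(regressors, collapse = "+")

# sprintf() fills the %s placeholder with the right-hand side
fo <- as.formula(sprintf("mpg ~ %s", rhs))

fit <- do.call("lm", list(fo, quote(mtcars)))
print(fit$call)
```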

How to write a function that will run multiple regression models of the same type with different dependent variables and then store them as lists?

Consider reformulate to dynamically change model formulas using character values for lm calls:

# VECTOR OF COLUMN NAMES (NOT VALUES)
dep.vars <- c("dep.var1", "dep.var2")

# USER-DEFINED METHOD TO PROCESS DIFFERENT DEP VAR
run_model <- function(dep.var) {
  fml <- reformulate(c("x1", "x2"), dep.var)
  lm(fml, data = data)
}

# NAMED LIST OF MODELS
all_models <- sapply(dep.vars, run_model, simplify = FALSE)

# OUTPUT RESULTS
all_models$dep.var1
all_models$dep.var2
...

From there, you can run further extractions or processes across model objects:

# NAMED LIST OF MODEL SUMMARIES
all_summaries <- lapply(all_models, summary)

all_summaries$dep.var1
all_summaries$dep.var2
...

# NAMED LIST OF MODEL COEFFICIENTS
all_coefficients <- lapply(all_models, `[`, "coefficients")

all_coefficients$dep.var1
all_coefficients$dep.var2
...
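The pattern above can be tried end to end with a built-in data set. The sketch below assumes mtcars, with mpg and qsec as illustrative dependent variables and wt and hp as regressors. It also passes the data frame as an explicit argument rather than relying on a global object called data, which is a design choice of this sketch.

```r
# Assumption: mtcars and these column names are illustrative stand-ins
dep.vars <- c("mpg", "qsec")

run_model <- function(dep.var, data) {
  fml <- reformulate(c("wt", "hp"), response = dep.var)
  lm(fml, data = data)
}

# simplify = FALSE returns a list named after the dependent variables
all_models <- sapply(dep.vars, run_model, data = mtcars, simplify = FALSE)

# coef() extracts each coefficient vector directly
all_coefficients <- lapply(all_models, coef)
print(all_coefficients$mpg)
```

Using coef() gives the bare coefficient vector for each model, whereas `[` with "coefficients" wraps it in a one-element list.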

Add string of control variables in formula expressions in R

The easiest workaround might be to avoid strings and just keep everything as a formula. Then you can use update() to change the formula as needed:

control_set_1 = ~. + education + income + sex + birth + race + trust_daily
control_set_2 = ~. + sex + birth + race + trust_daily

fit_controls <- lm(data = data, update(dv ~ politics*treatment, control_set_1))
fit_controls_2 <- lm(data = data, update(dv ~ politics*treatment, control_set_2))

The . in the control_set formulas keeps all existing predictors and just adds the new variables.
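To see what update() produces, here is a small runnable sketch. It assumes mtcars, and the variables mpg, wt, am, hp and qsec are illustrative substitutes for dv, politics, treatment and the controls.

```r
# Assumption: mtcars columns stand in for the variables in the answer
base_fml <- mpg ~ wt * am

control_set_1 <- ~ . + hp + qsec
control_set_2 <- ~ . + hp

# The dot on the right-hand side is replaced by the existing predictors
fml_1 <- update(base_fml, control_set_1)
fml_2 <- update(base_fml, control_set_2)

print(fml_1)
print(fml_2)

fit_controls <- lm(fml_1, data = mtcars)
fit_controls_2 <- lm(fml_2, data = mtcars)
```

Note that update() may reorder the terms when it simplifies the formula, so the printed formula can list the interaction after the added controls.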

How do I use a macro variable in R? (Similar to %LET in SAS)

How about this:

reg <- lm(formula(paste(depvar, "~ var1 + var2")), data = mydata)
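A runnable version of this idea, assuming mtcars as mydata and its columns in place of var1 and var2, might look like the following. Reassigning depvar plays roughly the role of a new %LET statement.

```r
# Assumption: mtcars stands in for mydata; wt and hp stand in for var1 and var2
depvar <- "mpg"
reg <- lm(formula(paste(depvar, "~ wt + hp")), data = mtcars)
print(coef(reg))

# Changing the "macro variable" and rerunning the same line fits a new model
depvar <- "qsec"
reg2 <- lm(formula(paste(depvar, "~ wt + hp")), data = mtcars)
print(coef(reg2))
```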

Regression, list of all variables in dataset

Below is an inelegant but effective method. First, since your data seem a little messier than is necessary to get the point of this method across, I've created a toy example dataset:

# Set the seed for reproducibility
set.seed(123)
# Create our variables
i <- rnorm(30) # explanatory variable of interest
A <- 5 * i + rnorm(30) # DV 1 -- true coefficient of 5
B <- -2 * i + rnorm(30) # DV 2 -- true coefficient of -2
C <- rnorm(30) # DV 3 -- true coefficient of 0 (independent of i)
# Make a dataframe out of them
dataset <- data.frame(A, B, C, i)

To run these regressions, we'll get the names of each column we want to use as a DV, then use a combination of as.formula() and paste0() to create the appropriate formulas inside lm():

# And do the regressions
DVnames <- setdiff(colnames(dataset), "i")
models <- lapply(DVnames, function(j) {
  # For every column name j of your dataframe *except* i,
  # Run a linear regression with j as the DV and i as the IV
  lm(as.formula(paste0(j, " ~ i")), data = dataset)
})

Now we can take a look at the results and do the predictions you're looking for:

# Check the results
for (model in models) {
  print(summary(model))
}
#>
#> Call:
#> lm(formula = as.formula(paste0(j, " ~ i")), data = dataset)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.6085 -0.5056 -0.2152 0.6932 2.0118
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.1720 0.1534 1.121 0.272
#> i 4.8660 0.1589 30.629 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.8393 on 28 degrees of freedom
#> Multiple R-squared: 0.971, Adjusted R-squared: 0.97
#> F-statistic: 938.1 on 1 and 28 DF, p-value: < 2.2e-16
#>
#> Call:
#> lm(formula = as.formula(paste0(j, " ~ i")), data = dataset)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.39301 -0.56909 0.03468 0.51764 2.08387
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.0313 0.1596 0.196 0.846
#> i -1.8540 0.1653 -11.218 7.18e-12 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.8731 on 28 degrees of freedom
#> Multiple R-squared: 0.818, Adjusted R-squared: 0.8115
#> F-statistic: 125.8 on 1 and 28 DF, p-value: 7.177e-12
#>
#> Call:
#> lm(formula = as.formula(paste0(j, " ~ i")), data = dataset)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.4989 -0.6435 -0.1436 0.5917 2.2613
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.09205 0.16855 -0.546 0.589
#> i 0.03914 0.17454 0.224 0.824
#>
#> Residual standard error: 0.9221 on 28 degrees of freedom
#> Multiple R-squared: 0.001793, Adjusted R-squared: -0.03386
#> F-statistic: 0.05029 on 1 and 28 DF, p-value: 0.8242
# And do your predictions
lapply(models, function(model) {
  predict(model, newdata = data.frame(i = 42))
})
#> [[1]]
#> 1
#> 204.5426
#>
#> [[2]]
#> 1
#> -77.8355
#>
#> [[3]]
#> 1
#> 1.551869

Created on 2019-02-08 by the reprex package (v0.2.1)

Note that this approach is fairly similar to this answer to a related question (how to mimic a Stata macro for explanatory variables). The main difference, and the reason I felt your question deserved a separate answer, is that since you have multiple DVs, you need to run lm() multiple times, which changes the approach somewhat. In that question, using as.formula() and paste() in one lm() call was sufficient (i.e., lm(as.formula(paste("income~", paste(regressors, collapse="+"))), data = Duncan)), whereas here we need to use that approach in a loop or *apply() call (i.e., lapply(DVnames, function(j) { lm(as.formula(paste0(j, " ~ i")), data = dataset) })). We then also need an *apply() function to do the predictions you need.
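One possible variation, sketched here under the same toy-data assumptions, names the list by DV and uses reformulate() with do.call() so that each stored call records the actual formula rather than as.formula(paste0(j, " ~ i")). The two-column dataset below is a shortened, hypothetical version of the toy data.

```r
# Assumption: a shortened version of the toy dataset, for illustration only
set.seed(123)
i <- rnorm(30)
dataset <- data.frame(A = 5 * i + rnorm(30), B = -2 * i + rnorm(30), i = i)

DVnames <- setdiff(colnames(dataset), "i")

# setNames() makes lapply() return a list named after the DVs
models <- lapply(setNames(DVnames, DVnames), function(j) {
  do.call("lm", list(reformulate("i", response = j), quote(dataset)))
})

print(models$A$call)

# One prediction per model, collected into a named vector
preds <- sapply(models, predict, newdata = data.frame(i = 42))
print(preds)
```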

Bring R list into Stata as macro?

Below I will try to consolidate the comments into a hopefully useful answer.

Unfortunately, rcall does not appear to play nicely with large matrices like the one you need. I think it would be best to call R to run your script using the shell command and save the string(s) as variables in a dta file. This requires a bit more work but it is certainly programmable.

Then you could read these variables into Stata and manipulate them easily using built-in functions. For example, you could save the strings in separate variables or in one and use levelsof as @Dimitriy recommended.
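On the R side, one way to save strings as a variable in a dta file is write.dta() from the foreign package. The sketch below is only an illustration: the strings, the variable name string and the temporary output path are all hypothetical, and a real script would write whatever list it actually computes.

```r
library(foreign)  # recommended package, included with standard R installations

# Hypothetical strings standing in for the output of the R script
strings <- c("this is a string", "A longer string is this")

# Hypothetical output path; a temporary file is used for illustration
out_file <- file.path(tempdir(), "strings.dta")

# Write the strings as a single string variable called "string"
write.dta(data.frame(string = strings, stringsAsFactors = FALSE), out_file)

# Read the file back in R to confirm the strings survived the round trip
check <- read.dta(out_file)
print(check$string)
```

In Stata, the resulting file could then be opened with the use command before applying levelsof as described.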

Consider the following toy example:

clear
set obs 5

input str50 string
"this is a string"
"A longer string is this"
"A string that is even longer is this one"
"How many strings do you have?"
end

levelsof string, local(newstr)
`"A longer string is this"' `"A string that is even longer is this one"' `"How many strings do you have?"' `"this is a string"'

tokenize `"`newstr'"'

forvalues i = 1 / `: word count `newstr'' {
display "``i''"
}

A longer string is this
A string that is even longer is this one
How many strings do you have?
this is a string

In my experience, programs like rcall and rsource are useful for simple tasks. However, they can become a real hassle for more complicated work, in which case I personally just resort to the real thing, that is, using the other software directly.

As @Dimitriy also indicated, there are now some community-contributed commands available for lasso, which may cover your needs so that you do not have to fiddle with R:

search lasso

5 packages found (Stata Journal and STB listed first)
-----------------------------------------------------

elasticregress from http://fmwww.bc.edu/RePEc/bocode/e
'ELASTICREGRESS': module to perform elastic net regression, lasso
regression, ridge regression / elasticregress calculates an elastic
net-regularized / regression: an estimator of a linear model in which
larger / parameters are discouraged. This estimator nests the LASSO / and

lars from http://fmwww.bc.edu/RePEc/bocode/l
'LARS': module to perform least angle regression / Least Angle Regression
is a model-building algorithm that / considers parsimony as well as
prediction accuracy. This / method is covered in detail by the paper
Efron, Hastie, Johnstone / and Tibshirani (2004), published in The Annals

lassopack from http://fmwww.bc.edu/RePEc/bocode/l
'LASSOPACK': module for lasso, square-root lasso, elastic net, ridge,
adaptive lasso estimation and cross-validation / lassopack is a suite of
programs for penalized regression / methods suitable for the
high-dimensional setting where the / number of predictors p may be large

pdslasso from http://fmwww.bc.edu/RePEc/bocode/p
'PDSLASSO': module for post-selection and post-regularization OLS or IV
estimation and inference / pdslasso and ivlasso are routines for
estimating structural / parameters in linear models with many controls
and/or / instruments. The routines use methods for estimating sparse /

sivreg from http://fmwww.bc.edu/RePEc/bocode/s
'SIVREG': module to perform adaptive Lasso with some invalid instruments /
sivreg estimates a linear instrumental variables regression / where some
of the instruments fail the exclusion restriction / and are thus invalid.
The LARS algorithm (Efron et al., 2004) is / applied as long as the Hansen

