Modify Glm Function to Adopt User-Specified Link Function in R

I'm basically following the form of the example in ?family which shows a user-specified link of the form qlogis(mu^(1/days)).

We want a link of the form eta = log(exp(y)-1), so the inverse link is y = log(exp(eta)+1), and mu.eta = dy/d(eta) = exp(eta)/(exp(eta)+1) = 1/(1+exp(-eta)).

vlog <- function() {
    ## link
    linkfun <- function(y) log(exp(y)-1)
    ## inverse link
    linkinv <- function(eta)  log(exp(eta)+1)
    ## derivative of invlink wrt eta
    mu.eta <- function(eta) { 1/(exp(-eta) + 1) }
    valideta <- function(eta) TRUE
    link <- "log(exp(y)-1)"
    structure(list(linkfun = linkfun, linkinv = linkinv,
                   mu.eta = mu.eta, valideta = valideta, 
                   name = link),
              class = "link-glm")
}

Basic checks:

vv <- vlog()
vv$linkfun(vv$linkinv(27))  ## check invertibility
library("numDeriv")
all.equal(grad(vv$linkinv,2),vv$mu.eta(2))  ## check derivative
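
The link above evaluates exp() directly, which overflows to Inf once eta exceeds roughly 709. The following is a minimal sketch of a numerically safer variant, not part of the original answer: the name vlog_stable is hypothetical, and it relies only on the identities log(exp(mu)-1) = mu + log(1-exp(-mu)) and log(1+exp(eta)) = max(eta,0) + log(1+exp(-|eta|)).

```r
## Hypothetical, numerically safer variant of vlog(); same link, rewritten algebra
vlog_stable <- function() {
    ## link: log(exp(mu) - 1), written as mu + log(1 - exp(-mu))
    linkfun <- function(mu) mu + log(-expm1(-mu))
    ## inverse link: log(1 + exp(eta)), written so that exp() never overflows
    linkinv <- function(eta) pmax(eta, 0) + log1p(exp(-abs(eta)))
    ## derivative of the inverse link wrt eta is the logistic function
    mu.eta <- function(eta) plogis(eta)
    valideta <- function(eta) TRUE
    structure(list(linkfun = linkfun, linkinv = linkinv,
                   mu.eta = mu.eta, valideta = valideta,
                   name = "log(exp(y)-1)"),
              class = "link-glm")
}

vs <- vlog_stable()
all.equal(vs$linkinv(2), log(exp(2) + 1))  ## agrees with the direct form
log(exp(1000) + 1)                         ## direct form overflows to Inf
vs$linkinv(1000)                           ## stays finite
vs$linkfun(vs$linkinv(1000))               ## round trip still works
```

For the moderate values of eta in the example that follows, both versions should behave the same; the rewrite only matters for extreme linear predictors.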

Example:

set.seed(101)
n <- 1000                       
x <- runif(n)
sh <- 2                        
y <- rgamma(n,scale=vv$linkinv(2+3*x)/sh,shape=sh)
glm(y~x,family=Gamma(link=vv))                       
## 
## Call:  glm(formula = y ~ x, family = Gamma(link = vv))
## 
## Coefficients:
## (Intercept)            x  
##       1.956        3.083  
## 
## Degrees of Freedom: 999 Total (i.e. Null);  998 Residual
## Null Deviance:       642.2 
## Residual Deviance: 581.8     AIC: 4268 
##

calling the glm() function within a user-defined function

You could use any of the following:

Using substitute:

to_analyze <- function(dep, indep, data){
  glm(substitute(dep ~ factor(indep)), data=data)
}

to_analyze(dep=age, indep=sex, data=dsn)

Advantage: you can pass an expression, not just a single name, as the independent variable:

 to_analyze(Petal.Width, Sepal.Length + Sepal.Width, data = iris)

Using reformulate as stated by @NelsonGon

to_analyze <- function(dep, indep, data){ 
  glm(reformulate(sprintf("factor(%s)",indep), dep),  data = data) 
}

Note that to call this function, the variables ought to be passed as character strings:

 to_analyze(dep= "age", indep="sex", data=dsn)

Recall glm can also take a string that can be parsed to a formula:

to_analyze <- function(dep, indep, data){ 
  glm(sprintf("%s~factor(%s)", dep, indep),  data = data) 
}

to_analyze("age", "sex", data=dsn)

or even:

to_analyze <- function(dep, indep, data){ 
  glm(paste(dep,"~ factor(",indep,")"),  data = data) 
}

to_analyze("age", "sex", data=dsn)

Lastly, you can combine substitute and paste:

to_analyze <- function(dep, indep, data){ 
  glm(paste(substitute(dep),"~ factor(",substitute(indep),")"),  data = data) 
}

This works for both symbols and character strings, e.g.:

to_analyze(age, sex, data=dsn)
to_analyze("age", "sex", data=dsn)
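
One side effect of all the variants above is that the call stored in the fitted object shows the formula-building expression rather than the formula itself. A minimal sketch of one way around this, assuming character inputs and using the built-in mtcars data in place of dsn (which is not shown in the question); the extra ... argument is an addition for illustration:

```r
## Hypothetical variant of to_analyze(); mtcars stands in for the poster's dsn
to_analyze <- function(dep, indep, data, ...) {
  ## build the formula once from character names
  f <- reformulate(sprintf("factor(%s)", indep), response = dep)
  fit <- glm(f, data = data, ...)
  ## store the real formula in the call so the printed call is readable
  fit$call$formula <- f
  fit
}

fit <- to_analyze("mpg", "cyl", data = mtcars)
formula(fit)
fit$call
```

The data argument in the stored call will still read data = data, so this only tidies the formula part.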

Shall I request author permission to modify an R function

As far as I know, if the package is on CRAN and its licence is GPL (>= 2), you are allowed to copy and modify the content as long as the modified content remains under the GPL and you state that you modified it. So you don't need to ask the package creator for permission.

A good practice would be to create your own package, calling it 'pkgextra' (where pkg is the name of the original package), and stating in the DESCRIPTION that it is built on top of that package, e.g. tidystringdist, which is built on top of stringdist, or ggExtra, which is built on top of ggplot2. Also, because the DESCRIPTION file lists dependencies (the Depends and Imports fields), you are clearly stating there that your package depends on other packages.

To wrap up: no, you don't need permission from the package author, as long as you distribute the derived work under the same licence and state that it depends on the original package.

R GLM function omitting data

You are correct in that na.omit will omit the missing values and run your model. In fact, you should see identical outputs when you run summary(model_1) and summary(model_2).

However, the nagelkerke function that you are using runs into issues when there are NA values in one variable from the original dataset. From its documentation:

The fitted model and the null model should be properly nested. That is, the terms of one need to be a subset of the other, and they should have the same set of observations. One issue arises when there are NA values in one variable but not another, and observations with NA are removed in the model fitting. The result may be fitted and null models with different sets of observations. Setting restrictNobs to TRUE ensures that only observations in the fit model are used in the null model. This appears to work for lm and some glm models, but causes the function to fail for other model object types.

If you set restrictNobs to TRUE, you should see the same output for both models.
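
The mismatch the documentation describes can be reproduced without the package. This is a rough sketch using the built-in airquality data (an assumption; the poster's data and models are not shown): the naive null model keeps rows that the full model had to drop, while refitting on the full model's model frame keeps the two sets of observations identical.

```r
## airquality (built in) has NAs in Ozone and Solar.R
fit <- glm(Ozone ~ Solar.R + Wind, data = airquality)

## naive null model: only rows with missing Ozone are dropped
null_naive <- glm(Ozone ~ 1, data = airquality)

## matched null model: refit on the rows the full model actually used
null_matched <- glm(Ozone ~ 1, data = model.frame(fit))

c(full = nobs(fit), naive = nobs(null_naive), matched = nobs(null_matched))

## With the package that provides nagelkerke installed, the documented switch
## would be used as:
## nagelkerke(fit, restrictNobs = TRUE)
```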

How to dynamically name variables in formula in lm() function?

The core issue to understand here is that lm() takes an object of class formula as its first argument, and that formula specifies the regression.

You've created a vector of strings (characters), but R won't dynamically generate the formula for you in the function call. Being able to type variable names directly into a formula is a convenience, but it is not practical when you are trying to be dynamic.

To simplify your example, start with:

y1 <- (rnorm(n = 10, mean = 0, sd = 1))
x1 <- (rnorm(n = 10, mean = 0, sd = 1))
x2 <- (rnorm(n = 10, mean = 0, sd = 1))
x3 <- (rnorm(n = 10, mean = 0, sd = 1))

df <- as.data.frame(cbind(y1,x1,x2,x3))

predictors = c("x1", "x2", "x3")

Now you can dynamically create a formula as a concatenated string (paste0) and convert it to a formula. Then pass this formula to your lm() call:

form1 = as.formula(paste0("y1~", predictors[1]))

lm(form1, data = df)

As akrun pointed out, you can then use loops to generate these formulas dynamically.

You can also do things like:

my_formula = as.formula(paste0("y1~", paste0(predictors, collapse="+")))

## generates y1 ~ x1 + x2 + x3
lm(my_formula, data = df)

See also: Formula with dynamic number of variables

One of the answers on that page also mentions akrun's alternative way of doing this, using the function reformulate. From ?reformulate:

reformulate creates a formula from a character vector. If length(termlabels) > 1, its elements are concatenated with +. Non-syntactic names (e.g. containing spaces or special characters; see make.names) must be protected with backticks (see examples). A non-parseable response still works for now, back compatibly, with a deprecation warning.
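
As a short sketch of the reformulate route (the data frame name dat and the seed are assumptions made so the example is self-contained), the same formulas can be built without pasting strings together, either for all predictors at once or one model per predictor:

```r
set.seed(1)
dat <- data.frame(y1 = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))
predictors <- c("x1", "x2", "x3")

## all predictors at once: y1 ~ x1 + x2 + x3
full_formula <- reformulate(predictors, response = "y1")
full_fit <- lm(full_formula, data = dat)

## one model per predictor, collected in a named list
single_fits <- lapply(setNames(predictors, predictors), function(p) {
  lm(reformulate(p, response = "y1"), data = dat)
})

full_formula
sapply(single_fits, function(m) names(coef(m))[2])
```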

meaning of family quasi and the link function inverse in glm

The "inverse" link function is just f(x) = 1/x (not to be confused with the inverse of a link function, which glm stores as linkinv). If you create a family object with the command

fam <- quasi(link = "inverse")

the link function is set to the inverse function:

fam$linkfun
# function (mu) 
# 1/mu
# <environment: namespace:stats>

By default, the link function for quasi is "identity", i.e., f(x) = x.

The details of quasi are stored in the family object it returns. Have a look at its structure with

str(quasi())
# List of 12
#  $ family    : chr "quasi"
#  $ link      : chr "identity"
#  $ linkfun   :function (mu)  
#  $ linkinv   :function (eta)  
#  $ variance  :function (mu)  
#  $ dev.resids:function (y, mu, wt)  
#  $ aic       :function (y, n, mu, wt, dev)  
#  $ mu.eta    :function (eta)  
#  $ initialize:  expression({     n <- rep.int(1, nobs)     mustart <- y })
#  $ validmu   :function (mu)  
#  $ valideta  :function (eta)  
#  $ varfun    : chr "constant"
#  - attr(*, "class")= chr "family"

You can access the elements with $, for example

quasi()$variance
# function (mu) 
# rep.int(1, length(mu))
# <bytecode: 0x100f30060>
# <environment: 0x101be4940>

to find details of quasi. By default, quasi assumes constant variance.
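
The variance assumption can be changed through the variance argument of quasi. As an illustrative sketch on simulated data (the data and seed are assumptions, not from the question), a quasi family with the inverse link and variance "mu^2" has the same mean-variance relationship as Gamma, so the coefficient estimates from the two fits should agree:

```r
## quasi family with inverse link and variance proportional to mu^2
fam <- quasi(link = "inverse", variance = "mu^2")
fam$linkfun(2)     ## 1/2
fam$linkinv(0.5)   ## back to 2
fam$variance(3)    ## 3^2 = 9
fam$varfun         ## "mu^2"

## simulated data (assumption, for illustration only)
set.seed(42)
x <- runif(500)
mu <- 1 / (1 + 2 * x)
y <- rgamma(500, shape = 5, scale = mu / 5)

fit_quasi <- glm(y ~ x, family = fam)
fit_gamma <- glm(y ~ x, family = Gamma(link = "inverse"))
all.equal(coef(fit_quasi), coef(fit_gamma))
```

The two fits differ in what they report around the coefficients: quasi has no likelihood, so its AIC is NA.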
