mgcv: How to set number and/or locations of knots for splines

Setting k is the correct way to control the size of the basis. fx = TRUE is definitely not right: it forces a pure, unpenalized regression spline.
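A minimal sketch of the difference, using simulated data that is not part of the original question (the variable names are illustrative). It compares the total effective degrees of freedom of a penalized fit with those of an fx = TRUE fit on the same basis:

```r
library(mgcv)

## toy data, simulated purely for illustration
set.seed(42)
n <- 300
x <- runif(n, -3, 3)
y <- sin(x) + rnorm(n, sd = 0.3)

## penalized regression spline: k is only an upper bound on complexity
pen_fit <- gam(y ~ s(x, bs = "cr", k = 20), method = "REML")

## fx = TRUE: unpenalized regression spline, all k - 1 coefficients are free
unpen_fit <- gam(y ~ s(x, bs = "cr", k = 20, fx = TRUE))

## total effective degrees of freedom (intercept + smooth)
pen_edf   <- sum(pen_fit$edf)
unpen_edf <- sum(unpen_fit$edf)
c(penalized = pen_edf, unpenalized = unpen_edf)
```

With fx = TRUE the smooth uses all of its degrees of freedom, whereas the penalized fit should use noticeably fewer for a function this simple.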

locations of knots

For a penalized regression spline, the exact knot locations are not important, as long as:

k is adequately large;
the knots are spread so that they cover the range of the data reasonably well.
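One way to judge whether k is "adequately large" is mgcv's k.check() (the same diagnostic that gam.check() prints). The sketch below assumes simulated data and is only a rough guide: an edf close to k' together with a low p-value would suggest that k may be too small.

```r
library(mgcv)

## toy data, simulated purely for illustration
set.seed(1)
x <- runif(400, -3, 3)
y <- sin(2 * x) + rnorm(400, sd = 0.3)

fit <- gam(y ~ s(x, bs = "cr", k = 20), method = "REML")

## one row per smooth: maximum possible edf (k'), actual edf,
## k-index and an approximate p-value
kc <- k.check(fit)
kc
```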

By default:

the natural cubic regression spline (bs = 'cr') places knots at quantiles of the covariate;
the B-spline family (bs = 'bs', bs = 'ps', bs = 'ad') places knots evenly.

Compare the following:

library(mgcv)

## toy data
set.seed(0); x <- sort(rnorm(400, 0, pi))  ## note, my x are not uniformly sampled
set.seed(1); e <- rnorm(400, 0, 0.4)
y0 <- sin(x) + 0.2 * x + cos(abs(x))
y <- y0 + e

## fitting natural cubic spline
cr_fit <- gam(y ~ s(x, bs = 'cr', k = 20))
cr_knots <- cr_fit$smooth[[1]]$xp  ## extract knots locations

## fitting B-spline
bs_fit <- gam(y ~ s(x, bs = 'bs', k = 20))
bs_knots <- bs_fit$smooth[[1]]$knots  ## extract knots locations

## summary plot
par(mfrow = c(1, 2))
plot(x, y, col = "grey", main = "natural cubic spline")
lines(x, cr_fit$linear.predictors, col = 2, lwd = 2)
abline(v = cr_knots, lty = 2)
plot(x, y, col = "grey", main = "B-spline")
lines(x, bs_fit$linear.predictors, col = 2, lwd = 2)
abline(v = bs_knots, lty = 2)

The dashed vertical lines in the two panels show the difference in knot placement.

Setting your own knot locations:

You can also supply your own knot locations via the knots argument of gam() (yes, knots are passed to gam(), not to s()). For example, you can use evenly spaced knots for cr:

xlim <- range(x)  ## get range of x
myfit <- gam(y ~ s(x, bs = 'cr', k = 20),
             knots = list(x = seq(xlim[1], xlim[2], length.out = 20)))

Plotting the fit confirms that the knots are now evenly spaced:

my_knots <- myfit$smooth[[1]]$xp
plot(x, y, col = "grey", main = "my knots")
lines(x, myfit$linear.predictors, col = 2, lwd = 2)
abline(v = my_knots, lty = 2)

However, there is usually no need to set knots yourself. If you do want to, you must be clear about what you are doing. In particular, the number of knots you provide must not conflict with the k in s(); for bs = 'cr', the number of supplied knots must equal k.
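A small sketch of such a conflict for bs = 'cr', on simulated data (not from the original answer). Supplying 15 knots while asking for k = 20 is expected to make the smooth constructor stop with an error, which tryCatch() captures here:

```r
library(mgcv)

## toy data, simulated purely for illustration
set.seed(2)
x <- runif(200, -3, 3)
y <- sin(x) + rnorm(200, sd = 0.3)
xlim <- range(x)

## 20 knots with k = 20: accepted
ok_fit <- gam(y ~ s(x, bs = "cr", k = 20),
              knots = list(x = seq(xlim[1], xlim[2], length.out = 20)))

## 15 knots with k = 20: expected to fail, so capture the error
bad_fit <- tryCatch(
  gam(y ~ s(x, bs = "cr", k = 20),
      knots = list(x = seq(xlim[1], xlim[2], length.out = 15))),
  error = function(e) e
)
inherits(bad_fit, "error")
```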

Comment: This is a very rich answer. The length of bs_knots is 24, but the "dimension" of the spline basis, stored in bs_fit$smooth[[1]]$bs.dim, is 20.

Yes, for the B-spline family, the number of B-splines does not equal the number of knots. Knot placement for B-splines is fiddly and error-prone. See https://stackoverflow.com/a/72723391/4891738 for a demonstration with B-splines.
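As a rough check of the 24-versus-20 count, the sketch below fits a bs = 'bs' smooth on simulated data and compares the number of stored knots with k + m[1] + 1, which is the count the mgcv documentation for the "bs" basis asks for when knots are supplied. Here m = c(3, 2) is passed explicitly (it is also the documented default: cubic B-splines with a second-derivative penalty).

```r
library(mgcv)

## toy data, simulated purely for illustration
set.seed(3)
x <- runif(300, -3, 3)
y <- sin(x) + rnorm(300, sd = 0.3)

k <- 20
m <- c(3, 2)  ## m[1]: B-spline order (cubic), m[2]: penalty order

bs_fit <- gam(y ~ s(x, bs = "bs", k = k, m = m))
sm <- bs_fit$smooth[[1]]

## basis dimension vs number of knots
n_knots <- length(sm$knots)
c(bs.dim = sm$bs.dim, n_knots = n_knots, expected = k + m[1] + 1)
```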

mgcv: Extract Knot Locations for `tp` smooth from a GAM model

Comments:

You should have tagged your question with R and mgcv when asking;
At first I wanted to flag your question as a duplicate of "mgcv: how to extract knots, basis, coefficients and predictions for P-splines in adaptive smooth?", raised yesterday, since my answer there should be pretty useful. But then I realized that there is actually a difference, so I will give a brief explanation here.

Answer:

In your gam call:

mod <- gam(Used ~ s(Open), binomial, data = data)

you did not specify the bs argument in s(), so the default basis, bs = 'tp', is used.

'tp', short for thin-plate regression spline, is not a smooth class that has conventional knots. A full thin-plate spline does have knots: it places one at every unique data point. For example, if you have n unique Open values, it has n knots. In the univariate case, this is just a smoothing spline.

However, a thin-plate regression spline is a low-rank approximation to the full thin-plate spline, based on a truncated eigendecomposition. The idea is similar to principal components analysis (PCA): instead of using the original n thin-plate spline basis functions, it uses the first k principal components. This reduces the computational complexity from O(n^3) down to O(nk^2), while ensuring an optimal rank-k approximation.

As a result, there are really no knots you can extract from a fitted thin-plate regression spline.

Since you are working with a univariate spline, there is really no need for 'tp'. Just use bs = 'cr', the cubic regression spline. This used to be the default in mgcv before 2003, when tp became available. cr has knots, and you can extract them as I showed in my answer to the linked question. Don't be confused by the bs = 'ad' in that question: P-splines, B-splines and natural cubic splines are all knot-based splines.
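To illustrate the point, here is a sketch on simulated data (not the asker's Used/Open data) that looks for the xp component in a default 'tp' smooth and in a 'cr' smooth. Exact-match [[ ]] indexing is used so that partial matching on the list cannot return some other component.

```r
library(mgcv)

## toy data, simulated purely for illustration
set.seed(4)
x <- runif(300)
y <- sin(2 * pi * x) + rnorm(300, sd = 0.3)

tp_fit <- gam(y ~ s(x))             ## default basis, bs = "tp"
cr_fit <- gam(y ~ s(x, bs = "cr"))  ## cubic regression spline, default k = 10

tp_sm <- tp_fit$smooth[[1]]
cr_sm <- cr_fit$smooth[[1]]
class(tp_sm)

tp_knots <- tp_sm[["xp"]]  ## no knot vector is stored for 'tp'
cr_knots <- cr_sm[["xp"]]  ## knot locations of the 'cr' basis
is.null(tp_knots)
cr_knots
```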

gam() in R: Is it a spline model with automated knots selection?

The term GAM covers a broad church of models and approaches to solve the smoothness selection problem.

mgcv uses penalized regression spline bases, with a wiggliness penalty to choose the complexity of the fitted smooth(s). As such, it doesn't choose the number of knots as part of the smoothness selection.

Basically, you as the user choose how large a basis to use for each smooth function (by setting the argument k in the s(), te(), etc. functions used in the model formula). The value(s) of k set the upper limit on the wiggliness of the smooth function(s). The penalty measures the wiggliness of the function (it is typically the squared second derivative of the smooth, integrated over the range of the covariate). The model then estimates the coefficients of the basis functions representing each smooth by maximizing the penalized log likelihood, which is the log likelihood minus a wiggliness penalty for each smooth, weighted by that smooth's smoothness parameter. The smoothness parameter(s) themselves are chosen by a criterion such as GCV or REML.

Basically, you set the upper limit of expected complexity (wiggliness) for each smooth and when the model is fitted, the penalty(ies) shrink the coefficients behind each smooth so that excess wiggliness is removed from the fit.

In this way, the smoothness parameters control how much shrinkage happens and hence how complex (wiggly) each fitted smooth is.
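A sketch of this shrinkage on simulated data: the basis (k = 20) is held fixed and only the smoothing parameter changes. The values passed to gam()'s sp argument are arbitrary extremes chosen for illustration.

```r
library(mgcv)

## toy data, simulated purely for illustration
set.seed(5)
x <- runif(300, -3, 3)
y <- sin(2 * x) + rnorm(300, sd = 0.3)

## same basis each time; smoothing parameter fixed by hand via `sp`
wiggly <- gam(y ~ s(x, bs = "cr", k = 20), sp = 1e-6)  ## almost no penalty
stiff  <- gam(y ~ s(x, bs = "cr", k = 20), sp = 1e6)   ## very heavy penalty

## smoothing parameter estimated from the data
auto <- gam(y ~ s(x, bs = "cr", k = 20), method = "REML")

## total effective degrees of freedom of each fit
edf <- c(wiggly = sum(wiggly$edf),
         auto   = sum(auto$edf),
         stiff  = sum(stiff$edf))
edf
```

The heavily penalized fit should collapse towards a straight line, the nearly unpenalized fit should use almost the whole basis, and the estimated smoothing parameter should land somewhere in between.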

This approach avoids the problems of choosing where to put the knots.

This doesn't mean the bases used to represent the smooths don't have knots. In the cubic regression spline basis you mention, the value you give to k sets the dimensionality of the basis, which implies a certain number of knots. These knots are placed at the boundaries of the covariate involved in the smooth and then spread over its range (by default at evenly spaced quantiles of the observed values), unless the user supplies a different set of knot locations. However, once the number of knots and their locations are set, thus forming the basis, they are fixed, with the wiggliness of the smooth being controlled by the wiggliness penalty, not by varying the number of knots.

You also have to be very careful with R, as there are two packages providing a gam() function. The original gam package provides an R version of the software and approach described in the original GAM book by Hastie and Tibshirani. This package doesn't fit GAMs using penalized regression splines as I describe above.

R ships with the mgcv package, which fits GAMs using penalized regression splines as I outline above. You control the size (dimensionality) of the basis for each smooth using the argument k. There is no argument df.

Like I said, GAMs are a broad church and there are many ways to fit them. It is important to know what software you are using and what methods that software employs to estimate the GAM. Once you have that information in hand, you can home in on specific material for that particular approach to estimating GAMs. In this case, you should look at Simon Wood's book Generalized Additive Models: An Introduction with R, as it describes the mgcv package and is written by the package's author.

How to extract fitted splines from a GAM (`mgcv::gam`)

In mgcv::gam there is a way to do this (your Q2), via the predict.gam method and type = "lpmatrix".

?predict.gam even has an example, which I reproduce below:

 library(mgcv)
 n <- 200
 sig <- 2
 dat <- gamSim(1,n=n,scale=sig)

 b <- gam(y ~ s(x0) + s(I(x1^2)) + s(x2) + offset(x3), data = dat)

 newd <- data.frame(x0=(0:30)/30, x1=(0:30)/30, x2=(0:30)/30, x3=(0:30)/30)

 Xp <- predict(b, newd, type="lpmatrix")

 ##################################################################
 ## The following shows how to use an "lpmatrix" as a lookup 
 ## table for approximate prediction. The idea is to create 
 ## approximate prediction matrix rows by appropriate linear 
 ## interpolation of an existing prediction matrix. The additivity 
 ## of a GAM makes this possible. 
 ## There is no reason to ever do this in R, but the following 
 ## code provides a useful template for predicting from a fitted 
 ## gam *outside* R: all that is needed is the coefficient vector 
 ## and the prediction matrix. Use larger `Xp'/ smaller `dx' and/or 
 ## higher order interpolation for higher accuracy.  
 ###################################################################

 xn <- c(.341,.122,.476,.981) ## want prediction at these values
 x0 <- 1         ## intercept column
 dx <- 1/30      ## covariate spacing in `newd'
 for (j in 0:2) { ## loop through smooth terms
   cols <- 1+j*9 +1:9      ## relevant cols of Xp
   i <- floor(xn[j+1]*30)  ## find relevant rows of Xp
   w1 <- (xn[j+1]-i*dx)/dx ## interpolation weights
   ## find approx. predict matrix row portion, by interpolation
   x0 <- c(x0,Xp[i+2,cols]*w1 + Xp[i+1,cols]*(1-w1))
 }
 dim(x0)<-c(1,28) 
 fv <- x0%*%coef(b) + xn[4];fv    ## evaluate and add offset
 se <- sqrt(x0%*%b$Vp%*%t(x0));se ## get standard error
 ## compare to normal prediction
 predict(b,newdata=data.frame(x0=xn[1],x1=xn[2],
         x2=xn[3],x3=xn[4]),se=TRUE)

That goes through the entire process, including the prediction step that would be done outside R. You will have to modify the example a bit to do what you want: the example evaluates all terms in the model, and you have two other terms besides the spline. Essentially you do the same thing, but only for the spline term, which involves finding the relevant columns and rows of the Xp matrix for that spline. Note also that the spline is centred, so you may or may not want to undo that too.
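A minimal sketch of restricting the lpmatrix to a single spline term, assuming a hypothetical model with one parametric term z and one smooth of x (simulated data, not the asker's model). The columns belonging to the smooth are taken from the first.para and last.para components of the smooth object:

```r
library(mgcv)

## toy data, simulated purely for illustration
set.seed(6)
n <- 300
dat <- data.frame(x = runif(n), z = rnorm(n))
dat$y <- 1 + 0.5 * dat$z + sin(2 * pi * dat$x) + rnorm(n, sd = 0.3)

mod <- gam(y ~ z + s(x, bs = "cr", k = 10), data = dat, method = "REML")

## grid over x; z is held at its mean
newdat <- data.frame(x = seq(0, 1, length.out = 50), z = mean(dat$z))
Xp <- predict(mod, newdata = newdat, type = "lpmatrix")

## columns of Xp (and entries of coef(mod)) that belong to s(x)
sm <- mod$smooth[[1]]
cols <- sm$first.para:sm$last.para

## centred spline evaluated on the grid
spline_hat <- drop(Xp[, cols] %*% coef(mod)[cols])

## the same quantity via type = "terms", for comparison
terms_hat <- predict(mod, newdata = newdat, type = "terms")[, "s(x)"]
```

Outside R, only Xp[, cols] and coef(mod)[cols] would be needed to reproduce spline_hat.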

For your Q1, choose appropriate values for the xn vector/matrix in the example. These correspond to values for the nth term in the model. So set the ones you want fixed to some mean value and then vary the one associated with the spline.

If you are doing all of this in R, it would be easier to just evaluate the spline at the values of the spline covariate in the data that is going into the other model. You do that by creating a data frame of values at which to predict, then use

predict(mod, newdata = newdat, type = "terms")

where mod is the fitted GAM (via mgcv::gam) and newdat is the data frame containing a column for each variable in the model (including the parametric terms; set the terms you don't want to vary to some constant value [say the average of the variable in the data set], or to a fixed level if a factor). With type = "terms", predict() returns a matrix with one row for each row of newdat and one column for each term in the model, holding that term's "contribution" to the fitted value, including the spline term. Just take the column of this matrix that corresponds to the spline; again, it is centred.

Perhaps I misunderstood your Q1. If you want to control the knots, see the knots argument to mgcv::gam. By default, for a knot-based basis such as the cubic regression spline, mgcv::gam places a knot at each extreme of the data and spreads the remaining knots over the interval (at quantiles of the covariate for bs = 'cr'). mgcv::gam doesn't search for the knots: it places them for you, and you can control where it places them via the knots argument.
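A short sketch of this workflow on a hypothetical model with one parametric term z and one smooth of x (simulated data; the names are illustrative). It also checks what "centred" means in practice: over the data used for fitting, the spline's contribution averages to roughly zero.

```r
library(mgcv)

## toy data, simulated purely for illustration
set.seed(8)
n <- 300
dat <- data.frame(x = runif(n), z = rnorm(n))
dat$y <- 1 + 0.5 * dat$z + sin(2 * pi * dat$x) + rnorm(n, sd = 0.3)

mod <- gam(y ~ z + s(x, bs = "cr", k = 10), data = dat, method = "REML")

## vary x, hold the parametric term at its mean
newdat <- data.frame(x = seq(0.05, 0.95, length.out = 40), z = mean(dat$z))

## one row per row of newdat, one column per model term
tm <- predict(mod, newdata = newdat, type = "terms")
colnames(tm)

## contribution of the spline term only
spline_contrib <- tm[, "s(x)"]

## centring: mean contribution over the fitting data is (numerically) zero
centre_check <- mean(predict(mod, type = "terms")[, "s(x)"])
centre_check
```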

mgcv: How to identify exact knot values in a gam and gamm model?

To get the knots, you can extract the xp component of the marginal smooth terms (note it is lowercase xp as there is an XP at the top level of the smooth which is something else).

Here's an example

library('mgcv')
## simulate some data
set.seed(1729)
df <- gamSim(2) # this is a bivariate example
## fit the model
mod <- gam(y ~ ti(x, bs = 'cr', k = 5) + 
               ti(z, bs = 'cr', k = 5) +
               ti(x, z, bs = rep('cr', 2), k = 5),
           data = df$data, method = 'REML')
## extract the 3rd smooth
sm <- mod[['smooth']][[3]]

The marginal bases are in sm$margin, which is simply a list of two smooth objects:

r$> str(sm$margin, max = 1)                          
List of 2
 $ :List of 21
  ..- attr(*, "class")= chr [1:2] "cr.smooth" "mgcv.smooth"
  ..- attr(*, "qrc")=List of 4
  .. ..- attr(*, "class")= chr "qr"
  ..- attr(*, "nCons")= int 1
 $ :List of 21
  ..- attr(*, "class")= chr [1:2] "cr.smooth" "mgcv.smooth"
  ..- attr(*, "qrc")=List of 4
  .. ..- attr(*, "class")= chr "qr"
  ..- attr(*, "nCons")= int 1

Each of these has an xp component:

sm_x <- sm$margin[[1]]
sm_z <- sm$margin[[2]]

Hence the knots for the marginal CRS of x are:

r$> sm_x$xp
          0%          25%          50%          75%         100%
0.0005697084 0.2477067126 0.4704501621 0.7121602102 0.9960833385

and for z are

r$> sm_z$xp
         0%         25%         50%         75%        100% 
0.007381999 0.244705125 0.488819070 0.717802322 0.991505836

Why these values? They are at evenly spaced quantiles of the observed covariate values (with k = 5: the minimum, the three quartiles and the maximum):

r$> with(df$data, quantile(x, probs = seq(0, 1, length = 5)))
          0%          25%          50%          75%         100%
0.0005697084 0.2477067126 0.4704501621 0.7121602102 0.9960833385
r$> with(df$data, quantile(z, probs = seq(0, 1, length = 5)))
         0%         25%         50%         75%        100% 
0.007381999 0.244705125 0.488819070 0.717802322 0.991505836

This is how mgcv places knots for the CRS basis. The exact locations can be recovered using place.knots():

r$> with(df$data, place.knots(x, 5))
[1] 0.0005697084 0.2477067126 0.4704501621 0.7121602102 0.9960833385
r$> with(df$data, place.knots(z, 5))
[1] 0.007381999 0.244705125 0.488819070 0.717802322 0.991505836

but it is safer to pull the knots from the marginal smooth objects as a user could always specify knots via the knots argument to gam().
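The sketch below illustrates why, using a simple univariate s() smooth on simulated data rather than the ti() model above: with default placement the stored xp agrees with place.knots(), but once knots are supplied only the smooth object reflects what was actually used.

```r
library(mgcv)

## toy data, simulated purely for illustration
set.seed(7)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)

## default knot placement
default_fit <- gam(y ~ s(x, bs = "cr", k = 5))

## user-supplied, evenly spaced knots
my_knots <- seq(min(x), max(x), length.out = 5)
custom_fit <- gam(y ~ s(x, bs = "cr", k = 5), knots = list(x = my_knots))

## knots actually used, taken from the smooth objects
default_xp <- as.numeric(default_fit$smooth[[1]]$xp)
custom_xp  <- as.numeric(custom_fit$smooth[[1]]$xp)

## place.knots() matches only the default fit
rbind(place.knots = place.knots(x, 5),
      default_xp  = default_xp,
      custom_xp   = custom_xp)
```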