mgcv gam() error: model has more coefficients than data

This dataset has 32 observations.

Actually, only 30 as two rows have NA.

From this error message, I infer that I have more predictor variables than observations.

Yes. By default, s() sets the basis dimension (or rank) to 10 for a 1D smoother, giving 10 raw parameters. The centering constraint (see ?identifiability) removes one, leaving 9 parameters per smooth. You have 5 smooths, so that is 45 parameters for the smooth terms, plus an intercept: 46 in total, which is more than your 30 observations.
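Written out, the count above looks as follows. The symbols are introduced here only for illustration: m is the number of smooths, k the basis dimension of each, p the total number of coefficients, and n the number of complete observations.

```latex
p = m\,(k - 1) + 1 = 5 \times (10 - 1) + 1 = 46 \;>\; n = 30
```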

I guess this error is generated during cross-validation procedures.

No. The error is raised as soon as the GAM formula has been interpreted and the model frame has been constructed. Even before the real basis construction, we already know n (the number of data) and p (the number of parameters).

Is there any way to handle this error?

Reduce k manually rather than relying on the default. Note, however, that for a cubic spline the minimum k is 3. For example, use s(temperature, bs = 'cr', k = 3). I have also set bs = 'cr' to use a natural cubic spline instead of the default bs = 'tp' (thin-plate regression spline). You can keep 'tp' of course, but for a 1D smooth 'cr' is a natural choice.
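A minimal sketch of this fix on simulated data, since the original dataset is not shown. The data frame dat and the covariate names x1 to x5 are hypothetical stand-ins; only the row count (30) and the number of smooths (5) follow the question.

```r
library(mgcv)

set.seed(42)
n <- 30  # same number of complete rows as in the question

# Simulated stand-in data: five hypothetical covariates x1..x5 and a response y
dat <- as.data.frame(matrix(runif(n * 5), ncol = 5))
names(dat) <- paste0("x", 1:5)
dat$y <- sin(2 * pi * dat$x1) + dat$x2^2 + rnorm(n, sd = 0.2)

# Default k = 10: 5 * 9 + 1 = 46 coefficients for 30 rows, so gam() stops.
# tryCatch() keeps the error message instead of aborting the script.
default_fit <- tryCatch(
  gam(y ~ s(x1) + s(x2) + s(x3) + s(x4) + s(x5), data = dat),
  error = function(e) conditionMessage(e)
)
print(default_fit)

# k = 3 with bs = "cr": 5 * 2 + 1 = 11 coefficients, which 30 rows can support
small_fit <- gam(
  y ~ s(x1, bs = "cr", k = 3) + s(x2, bs = "cr", k = 3) +
    s(x3, bs = "cr", k = 3) + s(x4, bs = "cr", k = 3) +
    s(x5, bs = "cr", k = 3),
  data = dat, method = "REML"
)
length(coef(small_fit))
```

With so few coefficients per smooth, each fitted curve can only be very simple, so it may be worth dropping a term or two and allowing a larger k on the covariates that matter most.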

mgcv: Error Model has more coefficients than data, related to the argument by in the gam()

When you pass a continuous variable to by, you get a varying coefficient model, in which the effect of x1 varies as a smooth function of x0.

In the first case, identifiability constraints are applied to the basis expansion for x0, so although you requested num_knots basis functions you actually got num_knots - 1. Adding the intercept gives num_knots coefficients, which is OK to fit because this is a penalised spline (though you probably want method = 'REML'). The identifiability constraint is applied because one basis function (or a combination of them) is confounded with the model intercept, and you can't fit two constant terms in the model and have them be uniquely identified.

In the second case, the varying coefficient model, there isn't the same issue, so no constraint is applied. When you ask for num_knots basis functions plus an intercept, you are therefore trying to fit a model with 401 coefficients to 400 observations, which isn't going to work.
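The difference in coefficient counts can be checked directly. This is a small sketch on simulated data; dat, x0, x1 and y are illustrative names, and k = 10 is used instead of the question's num_knots so that both models fit quickly.

```r
library(mgcv)

set.seed(1)
n <- 200

# Simulated stand-in data; x0, x1 and y are illustrative names
dat <- data.frame(x0 = runif(n), x1 = runif(n))
dat$y <- sin(2 * pi * dat$x0) * dat$x1 + rnorm(n, sd = 0.3)

k <- 10

# Ordinary smooth: the centring constraint removes one basis function,
# so the model has (k - 1) smooth coefficients plus the intercept
m_smooth <- gam(y ~ s(x0, k = k), data = dat, method = "REML")

# Varying coefficient term with a numeric `by`: no centring constraint,
# so the model has k smooth coefficients plus the intercept
m_vc <- gam(y ~ s(x0, by = x1, k = k), data = dat, method = "REML")

n_coef <- c(smooth = length(coef(m_smooth)), varying = length(coef(m_vc)))
print(n_coef)
```

The varying coefficient model carries one more coefficient than the ordinary smooth for the same k, which is exactly the extra coefficient that pushes the model in the question from 400 to 401.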

mgcv GAM: more than one variable in `by` argument (smooth varying by more than 1 factor)

One of the issues created by interaction() is that it changes the model matrix, meaning that some of the variables contained in the model's data are changed:

m <- mgcv::gam(body_mass_g ~ s(flipper_length_mm, by = interaction(species, sex)), data = palmerpenguins::penguins)
head(insight::get_data(m))
#>   body_mass_g flipper_length_mm       species    sex
#> 1        3750               181   Adelie.male   male
#> 2        3800               186 Adelie.female female
#> 3        3250               195 Adelie.female female
#> 5        3450               193 Adelie.female female
#> 6        3650               190   Adelie.male   male
#> 7        3625               181 Adelie.female female

Created on 2021-08-06 by the reprex package (v2.0.1)

This can lead to some issues when using postprocessing functions, for instance for visualisation.

However, following Gavin's and IRTFM's answers, this can be easily addressed by adding the variables as fixed effects in the model.

Here is a demonstration, also illustrating the differences between two separate smooths and the interaction:

library(ggplot2)

set.seed(1)

# Create data
data <- data.frame(x = rep(seq(-10, 10, length.out = 500), 2),
                   fac1 = as.factor(rep(c("A", "B", "C"), length.out = 1000)),
                   fac2 = as.factor(rep(c("X", "Y"), each = 500)))
data$y <- data$x^2 + rnorm(nrow(data), sd = 5)
data$y[data$fac1 == "A"] <- sign(data$x[data$fac1 == "A"]) * data$y[data$fac1 == "A"] + 50
data$y[data$fac1 == "B"] <- datawizard::change_scale(data$y[data$fac1 == "B"]^3, c(-50, 100))
data$y[data$fac2 == "X" & data$fac1 == "C"] <- data$y[data$fac2 == "X" & data$fac1 == "C"] - 100
data$y[data$fac2 == "X" & data$fac1 == "B"] <- datawizard::change_scale(data$y[data$fac2 == "X" & data$fac1 == "B"] ^ 2, c(-50, 100))
data$y[data$fac2 == "X" & data$fac1 == "A"] <- datawizard::change_scale(data$y[data$fac2 == "X" & data$fac1 == "A"] * -3, c(0, 100))

# Real trends
ggplot(data, aes(x = x, y = y, color = fac1, shape = fac2)) + 
  geom_point()

[Figure: scatter plot of y against x, coloured by fac1 and shaped by fac2]

# Two smooths
m <- mgcv::gam(y ~ fac1 * fac2 + s(x, by = fac1) + s(x, by = fac2), data = data)
plot(modelbased::estimate_relation(m, length = 100, preserve_range = FALSE))

[Figure: estimated relations from the model with two separate smooths]

# Interaction
m <- mgcv::gam(y ~ fac1 * fac2 + s(x, by = interaction(fac1, fac2)), data = data)
plot(modelbased::estimate_relation(m, length = 100, preserve_range = FALSE))

[Figure: estimated relations from the model with the interaction smooth]

The last model manages to recover the trends for each combination of the two factors.

GAM model error

From your comments it became clear that you passed a character variable to by in the smoother. You must pass a factor variable there. This has been a frequent gotcha for me too and I consider it a design flaw (because base R regression functions deal with character variables just fine).
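A minimal sketch of the conversion, on simulated data because the asker's data is not shown. The names dat, grp, x and y are hypothetical; the point is only that grp is turned into a factor before it reaches by.

```r
library(mgcv)

set.seed(2)
n <- 300

# Simulated stand-in data; `grp` starts out as a character column
dat <- data.frame(
  x = runif(n),
  grp = sample(c("a", "b", "c"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
dat$y <- ifelse(dat$grp == "a", sin(2 * pi * dat$x), dat$x^2) +
  rnorm(n, sd = 0.2)

# Convert before fitting so that `by` receives a factor
dat$grp <- factor(dat$grp)

# One smooth of x per level of grp, plus grp as a parametric term so that
# each level gets its own mean
m <- gam(y ~ grp + s(x, by = grp), data = dat, method = "REML")
smooth_table <- summary(m)$s.table
print(smooth_table)
```

Doing the conversion in the data frame, rather than inside the formula, keeps the column names in the fitted model the same as in the data, which tends to make later prediction and plotting simpler.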
