mgcv gam() error: model has more coefficients than data
This dataset has 32 observations.
Actually, only 30 as two rows have NA
.
From this error message, I infer that I have more predictor variables as compared to the number of observations.
Yes. By default, the s()
choose basis dimension (or rank) to be 10 for 1D smoother, giving 10 raw parameters. After centering constraint (see ?identifiability
) you get one fewer parameter, but you still have 9 parameters for each smooth. Note that you have 5 smooths! So you have 45 parameters for smooth terms, plus an intercept. This is greater than your 30 data.
I guess this error is generated during cross-validation procedures.
No. This error is detected as soon as GAM formula has been interpreted and model frame been constructed. Even before real basis construction we can already know what is n
(number of data) and what is p
(number of parameters).
Is there any way to handle this error?
Reduce k
manually rather than using default. However for cubic spline the minimum k
is 3. For example, use s(temperature, bs = 'cr', k = 3)
. Note I have also set bs = 'cr'
to use natural cubic spline, not the default bs = 'tp'
for thin-plate regression spline. You can use it of course, but for 1D smooth "cr"
is a natural choice.
mgcv: Error Model has more coefficients than data, related to the argument by in the gam()
When you pass a continuous variable to by
, what you are getting is varying coefficient model where the effect of x1
varies as a smooth function of x0
.
What is happening in the first case is that because of identifiability constraints being applied to the basis expansion for x0
, you requested num_knots
basis functions but actually got num_knots - 1
basis functions. When you add the intercept you get num_knots
coefficients which is OK to fit with this model as it is a penalised spline (though you probably want method = 'REML'
). The identifiability constraint is applied because there is a basis function (or combination) that is confounded with the model intercept and you can't fit two constant terms in the model and have them be uniquely identified.
In the second case, the varying coefficient model, there isn't the same issue, so when you ask for num_knots
basis functions plus an intercept you are trying to fit a model with 401 coefficients with 400 observations which isn't going to work.
mgcv GAM: more than one variable in `by` argument (smooth varying by more than 1 factor)
One of the issue created by interaction()
is that it changes the model's matrix, meaning that some variables contained in the model's data are changed:
m <- mgcv::gam(body_mass_g ~ s(flipper_length_mm, by = interaction(species, sex)), data = palmerpenguins::penguins)
head(insight::get_data(m))
#> body_mass_g flipper_length_mm species sex
#> 1 3750 181 Adelie.male male
#> 2 3800 186 Adelie.female female
#> 3 3250 195 Adelie.female female
#> 5 3450 193 Adelie.female female
#> 6 3650 190 Adelie.male male
#> 7 3625 181 Adelie.female female
Created on 2021-08-06 by the reprex package (v2.0.1)
This can lead to some issues when using postprocessing functions, for instance for visualisation.
However, following Gavin's and IRTFM's answers, this can be easily addressed by adding the variables as fixed effects in the model.
Here is a demonstration, also illustrating the differences between two separate smooths and the interaction:
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.0.5
set.seed(1)
# Create data
data <- data.frame(x = rep(seq(-10, 10, length.out = 500), 2),
fac1 = as.factor(rep(c("A", "B", "C"), length.out = 1000)),
fac2 = as.factor(rep(c("X", "Y"), each = 500)))
data$y <- data$x^2 + rnorm(nrow(data), sd = 5)
data$y[data$fac1 == "A"] <- sign(data$x[data$fac1 == "A"]) * data$y[data$fac1 == "A"] + 50
data$y[data$fac1 == "B"] <- datawizard::change_scale(data$y[data$fac1 == "B"]^3, c(-50, 100))
data$y[data$fac2 == "X" & data$fac1 == "C"] <- data$y[data$fac2 == "X" & data$fac1 == "C"] - 100
data$y[data$fac2 == "X" & data$fac1 == "B"] <- datawizard::change_scale(data$y[data$fac2 == "X" & data$fac1 == "B"] ^ 2, c(-50, 100))
data$y[data$fac2 == "X" & data$fac1 == "A"] <- datawizard::change_scale(data$y[data$fac2 == "X" & data$fac1 == "A"] * -3, c(0, 100))
# Real trends
ggplot(data, aes(x = x, y = y, color = fac1, shape = fac2)) +
geom_point()
# Two smooths
m <- mgcv::gam(y ~ fac1 * fac2 + s(x, by = fac1) + s(x, by = fac2), data = data)
plot(modelbased::estimate_relation(m, length = 100, preserve_range = F))
# Interaction
m <- mgcv::gam(y ~ fac1 * fac2 + s(x, by = interaction(fac1, fac2)), data = data)
plot(modelbased::estimate_relation(m, length = 100, preserve_range = F))
Created on 2021-08-06 by the reprex package (v2.0.1)
The last model manages to recover the trends for each of the factors' combination.
GAM model error
From your comments it became clear that you passed a character
variable to by
in the smoother. You must pass a factor
variable there. This has been a frequent gotcha for me too and I consider it a design flaw (because base R regression functions deal with character variables just fine).
Related Topics
R Create Function to Add Water Year Column
Pivot_Longer into Multiple Columns
Understanding Lm and Environment
Shiny Splitlayout and Selectinput Issue
R Shiny: How to Write Loop for Observeevent
Subtract Values in One Dataframe from Another
Data.Table Join and J-Expression Unexpected Behavior
How to Set Factor Levels to the Order They Appear in a Data Frame
How to Plot Histogram/ Frequency-Count of a Vector with Ggplot
How to Represent Polynomials with Numeric Vectors in R
How to Write a Data-Frame with One Column a List to a File
R Cmd Check Latex Error: Fatal PDFlatex - Gui Framework Cannot Be Initialized
How to Print the Name of Current Row When Using Apply in R
No Dimensions of Non-Empty Numeric Vector in R
R Equivalent of Stata Local or Global MACros
Manually Colouring Plots with 'Scale_Fill_Manual' in Ggplot2 Not Working