R - Calculate Test MSE given a trained model from a training set and a test set

In this case, it is more precise to call it MSPE (mean squared prediction error):

mean((test_set$y - predict(model, newdata = test_set))^2)

This is a more useful measure when the goal of the model is prediction: we want the model with the smallest MSPE.

In practice, if we have a spare test data set, we can compute MSPE directly as above. Very often, however, we don't have spare data; in that case, leave-one-out cross-validation gives an estimate of MSPE from the training data alone.
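
For an lm fit, the leave-one-out estimate does not require refitting the model; a minimal sketch using the standard hat-value shortcut, assuming model is the fitted lm object used above:

# LOOCV estimate of MSPE (PRESS / n) for a fitted lm
loocv.mspe <- mean((residuals(model) / (1 - hatvalues(model)))^2)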

There are also several other statistics for assessing prediction error, such as Mallows's Cp and AIC.

Compute the MSE of this set of points (X, Y) with respect to the given regression model

That should be simple!
MSE means mean squared error.
Assume your regression function is f(x), where x is the feature vector of dimension d;
the output of f(x) is a scalar.
The squared error for one data sample (say x1, y1, where x1 is a vector in d-dimensional space and y1 is a scalar) is (f(x1) - y1)^2.

To calculate the MSE, compute the squared error for each data point, add all the squared errors together, and divide the sum by the number of data samples.

In your case, the dimension of the feature vector x is 1,
and f(x) = 7.93 + 1.12*x.

----CODE----

X = (23, 34, 45, 56, 67, 78)
Y = (41, 45, 49, 67, 84, 100)

SE = 0.0
for x, y in zip(X, Y):                      # pair each x with its y
    SE = SE + (7.93 + 1.12 * x - y) ** 2    # squared error for this sample

MSE = SE / len(X)
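
Since the rest of this page is R, the same calculation is a one-liner there (the coefficients 7.93 and 1.12 are taken from the answer above):

X <- c(23, 34, 45, 56, 67, 78)
Y <- c(41, 45, 49, 67, 84, 100)
MSE <- mean((7.93 + 1.12 * X - Y)^2)   # mean of the squared errors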

Calculate Mean Squared Error from a data frame of residuals with grouping in dplyr

library(tidyverse)

df_example %>%
  group_by(ID) %>%
  summarize(across(everything(), ~ sum(.x^2) / n()))

which gives:

# A tibble: 3 x 5
     ID       A       B      C      D
  <int>   <dbl>   <dbl>  <dbl>  <dbl>
1     1   0.065   3190.   7.76  3090.
2     2    587.    547.  1596.  2927.
3     3    543.    1.69    529   11.8

Note that this gives different results compared to @Bruno's solution. It does give the same results, though, as Neeraj's solution.

I understand the OP to mean that the input values already are the residuals, in which case I only need to square each of them, build the sum per ID (and for each column), and divide by the number of observations per ID.

One example for column "A" and ID 2:

  • Residuals are 2.3 and 34.2
  • Squared residuals are 5.29 and 1169.64
  • Sum of squared residuals is 1174.93
  • MSE is sum of squared residuals divided by 2 = 587.465

Is that correct?
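
A quick check of that arithmetic in R, using the two column-A residuals for ID 2 quoted above:

sum(c(2.3, 34.2)^2) / 2
# [1] 587.465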

Cross validation for a multiple linear regression in R

You can calculate the mean squared error and the root mean squared error to see how well your model did.

1) Take your coefficients and multiply them by your matrix of covariates in your training data. yhat = (X*b)

2) Take your training-set y's and compute the difference between them and the yhat above.

3) Square each of these errors and take their mean = Mean Squared Error.

4) Take the square root of that = Root Mean Squared Error.

Lower values mean a better fit overall.
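
A minimal sketch of those steps in R, assuming a fitted lm object fit and a data frame dat (the training data as described above, or a held-out fold) with response column y; the names are placeholders:

yhat <- predict(fit, newdata = dat)   # step 1: X %*% b, done by predict()
err  <- dat$y - yhat                  # step 2: observed minus predicted
mse  <- mean(err^2)                   # step 3: mean of the squared errors
rmse <- sqrt(mse)                     # step 4: root mean squared error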

R-squared on test data

There are a couple of problems here. First, this is not a good way to use lm(...). lm(...) is meant to be used with a data frame, with the formula expressions referencing columns in the df. So, assuming your data is in two vectors x and y,

set.seed(1)    # for reproducible example
x <- 1:11000
y <- 3+0.1*x + rnorm(11000,sd=1000)

df <- data.frame(x,y)
# training set
train <- sample(1:nrow(df),0.75*nrow(df)) # random sample of 75% of data

fit <- lm(y~x,data=df[train,])

Now fit has the model based on the training set. Using lm(...) this way allows you, for example, to generate predictions without all the matrix multiplication.
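
For instance, predictions for a few new x values (the values here are arbitrary):

predict(fit, newdata = data.frame(x = c(100, 5000, 10000)))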

The second problem is the definition of R-squared. The conventional definition is:

1 - SS.residual/SS.total

For the training set, and the training set ONLY,

SS.total = SS.regression + SS.residual

so

SS.regression = SS.total - SS.residual,

and therefore

R.sq = SS.regression/SS.total

so R.sq is the fraction of variability in the dataset that is explained by the model, and will always be between 0 and 1.

You can see this below.

SS.total      <- with(df[train,], sum((y - mean(y))^2))
SS.residual   <- sum(residuals(fit)^2)
SS.regression <- sum((fitted(fit) - mean(df[train,]$y))^2)
SS.total - (SS.regression+SS.residual)
# [1] 1.907349e-06
SS.regression/SS.total # fraction of variation explained by the model
# [1] 0.08965502
1-SS.residual/SS.total # same thing, for model frame ONLY!!!
# [1] 0.08965502
summary(fit)$r.squared # both are = R.squared
# [1] 0.08965502

But this does not work with the test set (e.g., when you make predictions from a model).

test <- -train
test.pred <- predict(fit,newdata=df[test,])
test.y <- df[test,]$y

SS.total <- sum((test.y - mean(test.y))^2)
SS.residual <- sum((test.y - test.pred)^2)
SS.regression <- sum((test.pred - mean(test.y))^2)
SS.total - (SS.regression+SS.residual)
# [1] 8958890

# NOT the fraction of variability explained by the model
test.rsq <- 1 - SS.residual/SS.total
test.rsq
# [1] 0.0924713

# fraction of variability explained by the model
SS.regression/SS.total
# [1] 0.08956405

In this contrived example there is not much difference, but it is very possible to have an R-sq. value less than 0 (when defined this way).

If, for example, the model is a very poor predictor on the test set, then the residual sum of squares can actually be larger than the total variation in the test set. This is equivalent to saying that the test set is modeled better by its mean than by the model derived from the training set.

I noticed that you use the first three quarters of your data as the training set, rather than taking a random sample (as in this example). If the dependence of y on x is non-linear and the x's are in order, then you could get a negative R-sq with the test set.

Regarding OP's comment below, one way to assess the model with a test set is by comparing in-model to out-of-model mean squared error (MSE).

mse.train <- summary(fit)$sigma^2
mse.test <- sum((test.pred - test.y)^2)/(nrow(df)-length(train)-2)

If we assume that the training and test sets are both normally distributed with the same variance and have means which follow the same model formula, then the ratio mse.train/mse.test should have an F-distribution with (n.train - 2) and (n.test - 2) degrees of freedom. If the MSEs are significantly different based on an F-test, then the model does not fit the test data well.
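
A sketch of that comparison, using the objects defined above (the two-sided p-value convention here is an assumption):

n.train <- length(train)
n.test  <- nrow(df) - n.train
F.ratio <- mse.train / mse.test
# two-sided p-value for H0: the in-model and out-of-model MSEs are equal
p.value <- 2 * min(pf(F.ratio, n.train - 2, n.test - 2),
                   pf(F.ratio, n.train - 2, n.test - 2, lower.tail = FALSE))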

Have you plotted your test.y and test.pred vs. x? That alone will tell you a lot.

Changing the method of calculating the line of best fit

I'm going to suggest an alternative approach, robust linear models; these don't minimize the mean (or sum) of absolute deviations, but rather downweight the effect of outliers. MASS::rlm has essentially the same syntax as lm; here I'm illustrating it in a ggplot context.

You could also use robustbase::lmrob() for a different implementation of the same approach, or (as suggested by G. Grothendieck) quantreg::rq() to fit a straight-line model for the median (which basically corresponds to what you asked for in the first place, a MAD regression).

library(MASS)
set.seed(101)
## generate correlated data (positive slope)
dd <- as.data.frame(MASS::mvrnorm(20, mu = c(0, 0),
                                  Sigma = matrix(c(1, 0.95, 0.95, 1), 2)))
dd <- rbind(dd, c(5, -5))  ## add an outlier
library(ggplot2); theme_set(theme_classic())
ggplot(dd, aes(V1, V2)) +
  geom_point() + geom_smooth(method = "lm") +
  geom_smooth(method = "rlm", colour = "red")

[Figure: scatterplot of dd with the lm fit (blue) and the rlm fit (red)]
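
If you want the straight-line fit for the median (the least-absolute-deviations regression mentioned above), a minimal quantreg sketch on the same dd data:

library(quantreg)
fit.rq <- rq(V2 ~ V1, data = dd, tau = 0.5)  # tau = 0.5 fits the conditional median
coef(fit.rq)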

Machine learning: training model from test data

The best way to evaluate how well a model will perform in the 'wild' is to evaluate its performance on a data set it has not seen (i.e., been trained on) -- assuming you have the labels in a supervised learning problem.

People split their data into train/test/eval and use the training data to estimate/learn the model parameters and the test set to tune the model (e.g., by trying different hyperparameter combinations). A model is usually selected based on the hyperparameter combination that optimizes a test metric (regression - MSE, R^2, etc.; classification - AUC, accuracy, etc.). Then the selected model is usually retrained on the combined train + test data set. After retraining, the model is evaluated based on its performance on the eval data set (assuming you have some ground truth labels to evaluate your predictions). The eval metric is what you report as the generalization metric -- that is, how well your model performs on novel data.
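
A minimal sketch of such a split in R, assuming the full labeled data sits in a data frame df (the 60/20/20 proportions are arbitrary):

set.seed(1)
idx <- sample(c("train", "test", "eval"), nrow(df), replace = TRUE,
              prob = c(0.6, 0.2, 0.2))
train_df <- df[idx == "train", ]
test_df  <- df[idx == "test", ]
eval_df  <- df[idx == "eval", ]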

Does this help?


