Scale back linear regression coefficients in R from scaled and centered data
If I understand your description (that is unfortunately at the moment code-free), you are getting standardized regression coefficients for Y ~ As + Bs*Xs where all those "s" items are scaled variables. The coefficients then are the predicted change on a std deviation scale of Y associated with a change in X of one standard deviation of X. The scale
function would have recorded the means and standard deviations in attributes for hte scaled object. If not, then you will have those estimates somewhere in your console log. The estimated change in dY for a change dX in X should be: dY*(1/sdY) = Bs*dX*(1/sdX). Predictions should be something along these lines:
Yest = As*(sdX) + Xmn + Bs*(Xs)*(sdX)
You probably should not have needed to standardize the Y values, and I'm hoping that you didn't because it makes dealing with the adjustment for the means of the X's easier. Put some code and example data in if you want implemented and checked answers. I think @DanielGerlance is correct in saying to multiply rather than divide by the SD's.
Back-transform coefficients from glmer with scaled independent variables for prediction
The problem with your approach is that it only "unscales" based on the wt
variable, whereas you scaled all of the variables in your regression model. One approach that works is to adjust all of the variables in your new (prediction) data frame using the centering/scaling values that were used on the original data frame:
## scale variable x using center/scale attributes
## of variable y
scfun <- function(x,y) {
scale(x,
center=attr(y,"scaled:center"),
scale=attr(y,"scaled:scale"))
}
## scale prediction frame
xweight_sc <- transform(xweight,
wt = scfun(wt, database$wt),
am = scfun(am, database$am))
## predict
p_unsc <- predict(model.1,
newdata=xweight_sc,
type="response", re.form=NA)
Comparing this p_unsc
to your predictions from the unscaled model (b
in your code), i.e. all.equal(b,p_unsc)
, gives TRUE.
Another reasonable approach would be to
- unscale/uncenter all of your parameters using the "unscaling" approaches presented in one of the linked question (such as this one), generating a coefficient vector
beta_unsc
- construct the appropriate model matrix from your prediction frame:
X <- model.matrix(formula(model,fixed.only=TRUE),
newdata=pred_frame)
- compute the linear predictor and back-transform:
pred <- plogis(X %*% beta_unsc)
How to unscale the coefficients from an lmer()-model fitted with a scaled response
Updated: generalized to allow for scaling of the response as well as the predictors.
Here's a fairly crude implementation.
If our original (unscaled) regression is
Y = b0 + b1*x1 + b2*x2 ...
Then our scaled regression is
(Y0-mu0)/s0 = b0' + (b1'*(1/s1*(x1-mu1))) + b2'*(1/s2*(x2-mu2))+ ...
This is equivalent to
Y0 = mu0 + s0((b0'-b1'/s1*mu1-b2'/s2*mu2 + ...) + b1'/s1*x1 + b2'/s2*x2 + ...)
So bi = s0*bi'/si
for i>0
and
b0 = s0*b0'+mu0-sum(bi*mui)
Implement this:
rescale.coefs <- function(beta,mu,sigma) {
beta2 <- beta ## inherit names etc.
beta2[-1] <- sigma[1]*beta[-1]/sigma[-1]
beta2[1] <- sigma[1]*beta[1]+mu[1]-sum(beta2[-1]*mu[-1])
beta2
}
Try it out for a linear model:
m1 <- lm(Illiteracy~.,as.data.frame(state.x77))
b1 <- coef(m1)
Make a scaled version of the data:
ss <- scale(state.x77)
Scaled coefficients:
m1S <- update(m1,data=as.data.frame(ss))
b1S <- coef(m1S)
Now try out rescaling:
icol <- which(colnames(state.x77)=="Illiteracy")
p.order <- c(icol,(1:ncol(state.x77))[-icol])
m <- colMeans(state.x77)[p.order]
s <- apply(state.x77,2,sd)[p.order]
all.equal(b1,rescale.coefs(b1S,m,s)) ## TRUE
This assumes that both the response and the predictors are scaled.
- If you scale only the response and not the predictors, then you should submit (
c(mean(response),rep(0,...))
form
andc(sd(response),rep(1,...))
fors
(i.e.,m
ands
are the values by which the variables were shifted and scaled). - If you scale only the predictors and not the response, then submit
c(0,mean(predictors))
form
andc(1,sd(predictors))
fors
.
Unscale coefficient of scaled continuous variable in negative binomial regression
Set up data:
set.seed(1)
dep <- dnbinom(seq(1:150), size = 150, prob = 0.75)
ind.1 <- ifelse(sign(rnorm(150))==-1,0,1)
ind.2 <- rnorm(150, 10, 1.7)
df <- data.frame(dep, ind.1, ind.2)
sc <- sd(df$ind.2)
Fit unscaled and scaled models:
m_unsc <- MASS::glm.nb(dep ~ ind.1 + ind.2, data = df)
m_sc <- update(m_unsc, data = transform(df, ind.2 = drop(scale(df$ind.2))))
Compare coefficients:
cbind(coef(m_unsc), coef(m_sc))
[,1] [,2]
(Intercept) -5.50449624 -5.13543854
ind.1 0.24445805 0.24445805
ind.2 0.03662308 0.06366992
Check equivalence (we divide the scaled coefficient by the scaling factor (sc
=sd(ind.2)
) to get back the unscaled coefficient):
all.equal(coef(m_sc)["ind.2"]/sc, coef(m_unsc)["ind.2"])
The negative binomial model uses a log link, not a logit link, so if you want to back-transform the coefficient to get proportional or "fold" changes per unit of ind2
:
exp(coef(m_sc)["ind.2"]/sc)
this gives 1.0373, a 4% change in the response per unit change in ind.2
(you can confirm that it's the same as exponentiating the unscaled coefficient).
Note that 2/3 of the answers in the linked question, including the currently accepted answer, are wrong: you should be dividing the scaled coefficient by the scaling factor, not multiplying ...
unscale predictor coefficients lmer model fit with an unscaled response
The scale
function does a z-transform of the data, which means it takes the original values, subtracts the mean, and then divides by the standard deviation.
to_scale <- 1:10
using_scale <- scale(to_scale, center = TRUE, scale = TRUE)
by_hand <- (to_scale - mean(to_scale))/sd(to_scale)
identical(as.numeric(using_scale), by_hand)
[1] TRUE
Therefore, to reverse the model coefficients all you need to do is multiply the coefficient by the standard deviation of the covariate and add the mean. The scale function holds onto the mean and sd for you. So, if we assume that your covariate values are the using_scale
vector for the scale.t6
regression coefficient we can write a function to do the work for us.
get_real <- function(coef, scaled_covariate){
# collect mean and standard deviation from scaled covariate
mean_sd <- unlist(attributes(scaled_covariate)[-1])
# reverse the z-transformation
answer <- (coef * mean_sd[2]) + mean_sd[1]
# this value will have a name, remove it
names(answer) <- NULL
# return unscaled coef
return(answer)
}
get_real(-0.3440, using_scale)
[1] 4.458488
In other words, it is the same thing as unscaling your predictor variables because it is a monotonic transformation.
Standardizing regression coefficients changed significance
tldr; The p-values characterising the statistical significance of parameters in a linear model may change following scaling (standardising) variables.
As an example, I will work with the mtcars
dataset, and regress mpg
on disp
and drat
; or in R's formula
language mpg ~ disp + drat
.
1. Three linear models
We implement three different (OLS) linear models, the difference being different scaling strategies of the variables.
To start, we don't do any scaling.
m1 <- lm(mpg ~ disp + drat, data = mtcars)
Next, we scale values using
scale
which by default does two things: (1) It centers values at0
by subtracting the mean, and (2) it scales values to have unit variance by dividing the (centered) values by their standard deviation.m2 <- lm(mpg ~ disp + drat, data = as.data.frame(scale(mtcars)))
Note that we can apply
scale
to thedata.frame
directly, which will scale values by column.scale
returns amatrix
so we need to transform the resulting object back to adata.frame
.Finally, we scale values using
scale
without centering, but scaling values to have unit variancem3 <- lm(mpg ~ disp + drat, data = as.data.frame(scale(mtcars, center = F)))
2. Comparison of parameter estimates and statistical significance
Let's inspect the parameter estimates for m1
summary(m1)$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 21.84487993 6.747971087 3.237252 3.016655e-03
#disp -0.03569388 0.006652672 -5.365345 9.191388e-06
#drat 1.80202739 1.542091386 1.168561 2.520974e-01
We get the t values from the ratio of the parameter estimates and standard errors; the p-values then follow from the area under the curve of the pdf for df = nrow(mtcars) - 3
(as we have 3 parameters) where x > |t|
(corresponding to a two-sided t-test). So for example, for disp
we confirm the t value
summary(m1)$coef["disp", "Estimate"] / summary(m1)$coef["disp", "Std. Error"]
#[1] -5.365345
and the p-value
2 * pt(summary(m1)$coef["disp", "Estimate"] / summary(m1)$coef["disp", "Std. Error"], nrow(mtcars) - 3)
#[1] 9.191388e-06
Let's take a look at results from m2
:
summary(m2)$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -1.306994e-17 0.09479281 -1.378790e-16 1.000000e+00
#disp -7.340121e-01 0.13680614 -5.365345e+00 9.191388e-06
#drat 1.598663e-01 0.13680614 1.168561e+00 2.520974e-01
Notice how the t values (i.e. the ratios of the estimates and standard errors) are different compared to those of m1
, due to the centering and scaling of data to have unit variance.
If however, we don't center values and only scale them to have unit variance
summary(m3)$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 1.0263872 0.31705513 3.237252 3.016655e-03
#disp -0.4446985 0.08288348 -5.365345 9.191388e-06
#drat 0.3126834 0.26757994 1.168561 2.520974e-01
we can see that while estimates and standard errors are different compared to (unscaled) results from m1
, their respective ratios (i.e. the t values) are identical. So (default) scale(...)
will change the statistical significance of parameter estimates while scale(..., center = FALSE)
will not.
It's easy to see why dividing values by their standard deviation does not change the ratio of OLS parameter estimates and standard errors when taking a look at the closed form for the OLS parameter estimate and standard error, see e.g. here.
Related Topics
How to Reverse the Order of a Dataframe in R
Dplyr 0.7 Equivalent for Deprecated Mutate_
How to Reference Column Names That Start with a Number, in Data.Table
R: Ifelse Function Returns Vector Position Instead of Value (String)
"Error: Continuous Value Supplied to Discrete Scale" in Default Data Set Example Mtcars and Ggplot2
Control Number Formatting in Shiny's Implementation of Datatable
Rotate X Axis Labels 45 Degrees on Grouped Bar Plot R
Rscript Detect If R Script Is Being Called/Sourced from Another Script
How to Set Axis Ranges in Ggplot2 When Using a Log Scale
How to Read a Subset of Large Dataset in R
Calculate Elapsed Time Since Last Event
Making Binned Scatter Plots for Two Variables in Ggplot2 in R
Split a Data Frame Column Containing a List into Multiple Columns Using Dplyr (Or Otherwise)
How to Log an R Session to a File
Get List of Available Data Frames
How to Define the Version of a Package in R Install.Packages